1 ChIP-seq Methods & Analysis
Gavin Schnitzler

2 ChIP-seq COURSE OUTLINE
Day 1: ChIP techniques, library production, UCSC browser tracks.
Day 2: QC on reads, mapping binding site peaks, examining read density maps.
Day 3: Analyzing peaks in relation to genomic features, etc.
Day 4: Analyzing peaks for transcription factor binding site consensus sequences.
Day 5: Variants & advanced approaches.

3 DAY 3 LECTURE OUTLINE (finishing up peak mapping, from Tuesday).
Exploring your peak data with Galaxy & Cistrome
Analyzing overlaps between peak sets, with Galaxy and in UNIX

4 DAY 2 LECTURE OUTLINE
FASTQC (quality control on reads)
Getting your raw data
- Exercise: Getting around UNIX, downloading & unpacking
Mapping reads to the genome & identifying binding site peaks
- Exercise: Running Bowtie & MACS
Visualizing your results
- Exercise: Custom UCSC browser tracks

5 Let’s try that again (w/ a streamlined proven command set)
Open PuTTY & log in to cluster.uit.tufts.edu, then type:
mkdir chip
cd chip
cp /cluster/tufts/cbicourse/ChIPseq/Sample_NGS_data/*.gz .
[Make sure to add the final space & period: the period tells UNIX to keep the same filename & put the copy in the current directory.]
Now repeat this for …/*.txt
Do “ls” to see what you got:
ip19.fastq.gz: the (compressed) raw data for ERalpha ChIP-seq from mouse liver on chromosome 19
input19.fastq.gz: the (compressed) raw data for the corresponding input sample
workflow1.txt: lists all of the commands you will use to process your raw sequence data, map reads to the genome & map peaks
Do “cat *.txt”. Now you have all the commands you will use & can copy & paste them (by selecting & then right-clicking in PuTTY).

6 All the commands needed to go from sequence to peaks
gunzip *.fastq.gz
bsub -oo ip19.bowtieinfo /cluster/shared/gschni01/bowtie*/bowtie -n 1 -m 1 --best --strata mm9 ip19.fastq ip19.map
bsub -oo input19.bowtieinfo /cluster/shared/gschni01/bowtie*/bowtie -n 1 -m 1 --best --strata mm9 input19.fastq input19.map
awk 'OFS="\t" {print $4, $5, $5+length($6),$1,".",$3}' input19.map > input19.bed
awk 'OFS="\t" {print $4, $5, $5+length($6),$1,".",$3}' ip19.map > ip19.bed
module add python/2.6.5
export PYTHONPATH=/cluster/shared/gschni01/lib/python2.6/site-packages:$PYTHONPATH
export PATH=/cluster/shared/gschni01/bin:$PATH
bsub -oo ipvinput19.macsinfo macs14 --format=BED --bw=210 --keep-dup=1 -B -S -c input19.bed -t ip19.bed --name ipvinput19
cd *aph/treat
gunzip *.gz
bsub perl /cluster/home/g/s/gschni01/perl_programs/Select_bdgs_for_beds.pl ip19_trim_norm.bdg all ipvinput19_treat_afterfiting_all.bdg
cd ../control
bsub perl /cluster/home/g/s/gschni01/perl_programs/Select_bdgs_for_beds.pl input19_trim_norm.bdg all ipvinput19_control_afterfiting_all.bdg

7 Mapping reads to a genome
Understanding the bowtie command (which you’ll have cut & pasted from your screen):
bsub -oo ip19.bowtieinfo /cluster/shared/gschni01/bowtie*/bowtie -n 1 -m 1 --best --strata mm9 ip19.fastq ip19.map
bsub -oo ip19.bowtieinfo submits the batch process and names an output/error file.
/cluster/shared/gschni01/bowtie*/bowtie gives the path to the bowtie program & tells UNIX to run it.
-n 1 tells Bowtie to accept no more than 1 mismatch between the seed of a sequence read (its first 28 bp by default) & its best homologue in the genome.
-m 1 tells Bowtie to reject any read that matches more than 1 sequence in the genome equally well (since we wouldn’t know which locus our read really came from).
-5 8, if added, tells Bowtie to trim the first 8 (lower quality) bases from the read before mapping.
--best & --strata tell Bowtie to try hard to find the best match.
[name].fastq is your input file & [name].map specifies the name of the output file.

8 How did Bowtie do? Check your .bowtieinfo bsub output files:
“head *.bowtieinfo”
The lines you’re interested in are the ones before the separator line (after which info on the bsub run itself is given):
==> LiE_ERaIP_chr19.bowtieinfo <==
# reads processed:
# reads with at least one reported alignment: (99.48%)
# reads that failed to align: 554 (0.15%)
# reads with alignments suppressed due to -m: 1368 (0.37%)
Note that almost all of the reads aligned to a sequence in the genome; very few failed to map & very few matched more than 1 genomic sequence (-m 1). This is great, but atypical: it only looks this good because I filtered the .fastq files for reads that mapped to chr19. The actual data for all chromosomes looks like:
# reads processed:
# reads with at least one reported alignment: (70.49%)
# reads that failed to align: (6.14%) [should be very low, unless you have contamination with non-mouse sequence]
# reads with alignments suppressed due to -m: (23.37%) [typical level, due to repeat sequences in a mammalian genome]
Reported alignments to 1 output stream(s)

10 Using awk to change from .map to .bed format
Understanding your awk command:
awk 'OFS="\t" {print $4, $5, $5+length($6),$1,".",$3}' ip19.map > ip19.bed
OFS="\t" tells awk to output tab-delimited data.
The print command says: print these data columns in order: #4: chromosome, #5: start_bp, #5: start_bp + length(#6: sequence) = end_bp, #1: identifier, “.” as a placeholder & #3: strand.
awk would normally print to the screen, but here we redirect the output to create a new .bed file (> can be used for any other UNIX command too!).
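
If it helps to see the column shuffling spelled out, here is a minimal Python sketch of what the awk one-liner does to a single line. The column positions are taken from the awk command above; the example line and the function name map_line_to_bed are made up for illustration and are not course data.

```python
def map_line_to_bed(line):
    """Convert one whitespace-split Bowtie .map line to a tab-delimited BED line.

    Column positions follow the awk command on the slide (1-based):
    $1 identifier, $3 strand, $4 chromosome, $5 start, $6 read sequence.
    """
    fields = line.split()
    identifier, strand, chrom = fields[0], fields[2], fields[3]
    start = int(fields[4])
    end = start + len(fields[5])  # end_bp = start_bp + read length
    return "\t".join([chrom, str(start), str(end), identifier, ".", strand])


# Hypothetical example line (not real course data); the second field stands
# for whatever the read header carries after its first space.
example = "read1 1:N:0 + chr19 3000100 ACGTACGTAC IIIIIIIIII 0"
bed_line = map_line_to_bed(example)
print(bed_line)
```

Like awk, str.split() with no argument splits on any run of whitespace, which is why the column numbers line up with the awk version.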

11 How do peak-finders map binding sites?
Fragments come in a range of sizes & contain the TF binding site at a (mostly) random position within them.
Reads are read (randomly) from the left or right edges (sense or antisense) of fragments.
Thus the peak for sense tags will be 1/2 the fragment length upstream of the binding site, and the peak for antisense tags 1/2 the fragment length downstream.
Binding site position = mid-way between sense tag peak & antisense tag peak.
To get the binding site peak, shift sense tags downstream by ½ fragsize & antisense tags upstream by ½ fragsize.
Adapted from slide set by: Stuart M. Brown, Ph.D., Center for Health Informatics & Bioinformatics, NYU School of Medicine & from Jothi, et al. Genome-wide identification of in vivo protein–DNA binding sites from ChIP-Seq data. NAR (2008), 36:
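
The shifting idea can be tried on toy numbers. The sketch below is only an illustration of the principle, not how any particular peak-finder is implemented; the tag positions, the 200-bp fragment size and the function name shift_tags are assumptions chosen to make the arithmetic easy to follow.

```python
from collections import Counter


def shift_tags(tags, fragment_size):
    """Shift each tag's 5' position by half the fragment size toward the site.

    tags: iterable of (position, strand) with strand "+" or "-".
    Sense (+) tags move downstream, antisense (-) tags move upstream.
    """
    half = fragment_size // 2
    return [pos + half if strand == "+" else pos - half for pos, strand in tags]


# Toy data: a binding site at 1000 with 200-bp fragments centred on it, so
# sense tags pile up near 900 and antisense tags near 1100.
toy_tags = [(900, "+"), (898, "+"), (900, "+"), (1100, "-"), (1102, "-"), (1100, "-")]
shifted = shift_tags(toy_tags, fragment_size=200)
summit, height = Counter(shifted).most_common(1)[0]
print(shifted, summit, height)
```

After shifting, the two strand-specific piles collapse onto one position, which is the estimated binding site.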

13 Mapping binding peaks w/ MACS
Understanding the commands used for MACS:
module load python/2.6.5
This tells the cluster to use an optional version of python.
export PYTHONPATH=/cluster/shared/gschni01/lib/python2.6/site-packages:$PYTHONPATH
export PATH=/cluster/shared/gschni01/bin:$PATH
These tell UNIX where to find the necessary libraries to run MACS.
Reference: Using MACS to identify peaks from ChIP-Seq data. Feng J, Liu T, Zhang Y. Curr Protoc Bioinformatics Jun;Chapter 2:Unit doi: / bi0214s34.

14 MACS parameters
Now, let’s run MACS using our input file as control (after -c) and our ip file as the ‘treatment’ or experimental file (after -t).
bsub -oo ipvinput19.macsinfo macs14 --format=BED --bw=210 --keep-dup=1 -B -S -c input19.bed -t ip19.bed --name ipvinput19
--format=BED tells MACS that the input file is in .bed format.
--bw=210 tells MACS the expected size of sequenced fragments (before addition of linkers, which add an additional ~90 bp), from which value it attempts to build a model from sense and antisense sequence reads.
--keep-dup=1 instructs MACS to consider only the first instance of a read starting at any given genomic base pair coordinate & pointing in the same direction, assuming that additional reads starting at the same base pair are due to amplified copies of the same ChIP fragment in the library. (By default MACS estimates the number of duplicates that are likely to arise by linear amplification of all fragments from a limited starting sample, and sets the threshold to cut out replicate reads with a much higher number, which are likely artifacts; keep-dup=1 is even cleaner.)
-B tells MACS to make a bedGraph file of read density at each base pair (which can be used to visualize the results on the UCSC browser) & -S tells MACS to make a single bedGraph file instead of one for each chromosome.
--name gives the prefix name for all output files.

15 Examine your MACS output
Start with your .macsinfo bsub -oo file:
vi LiE_ERaIPvINPUT_chr19.macsinfo
Use the arrow keys to go to the top, where you’ll see all of the parameters you put in to run MACS. After some runtime info (including possible warnings, which you can ignore if there are not millions of them), you’ll see:
INFO @ Sun, 10 Feb :27:51: #1 total tags in treatment:
INFO @ Sun, 10 Feb :27:51: #1 user defined the maximum tags...
INFO @ Sun, 10 Feb :27:51: #1 filter out redundant tags at the same location and the same strand by allowing at most 1 tag(s)
INFO @ Sun, 10 Feb :27:51: #1 tags after filtering in treatment:
INFO @ Sun, 10 Feb :27:51: #1 Redundant rate of treatment: 0.26
This is useful information. It tells you how many different reads you had (out of all of the reads which mapped to only one place in the mouse genome, from Bowtie). You want this number to be high and the “redundant rate” to be low. (You’ll need the “tags after filtering” numbers later, so jot them down somewhere.)
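
To make the “filter out redundant tags” step concrete, here is a small Python sketch under stated assumptions: tags are (chromosome, start, strand) tuples, and the redundant rate is taken to be the fraction of tags discarded. The function name filter_redundant and the toy tags are hypothetical, and this is not MACS's own code.

```python
def filter_redundant(tags, keep_dup=1):
    """Keep at most keep_dup tags per (chromosome, start, strand).

    Returns (kept_tags, redundant_rate), where redundant_rate is the
    fraction of input tags that were discarded.
    """
    seen = {}
    kept = []
    for tag in tags:
        key = (tag[0], tag[1], tag[2])
        seen[key] = seen.get(key, 0) + 1
        if seen[key] <= keep_dup:
            kept.append(tag)
    rate = 1 - len(kept) / len(tags) if tags else 0.0
    return kept, rate


# Toy tags as (chromosome, start, strand); not real course data.
toy = [
    ("chr19", 100, "+"),
    ("chr19", 100, "+"),   # duplicate of the first tag: discarded
    ("chr19", 100, "-"),   # same start, other strand: kept
    ("chr19", 250, "+"),
]
kept, redundant_rate = filter_redundant(toy)
print(len(kept), redundant_rate)
```

The count of kept tags plays the role of “tags after filtering”, the number the slide asks you to jot down.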

16 Using duplication levels to estimate your library size
Assuming you have 100 initial fragments in your library (before amplification) & which fragment gets read is random:
[Table rows: # seqs read; # diff reads; % duplicated (9%, 27%, 33%, 43%, 55%, 69%); x-more left in lib; x-more than prev.]
Thus, if you have a low % duplicates (e.g. 9%) in one lane, adding an additional run of the same number of reads will give you 1.6x more unique reads, or 2 additional runs will give you 2.2x more (1.6*1.4).
But if you have a high % duplicates (e.g. 43%), adding one more lane will only give you 1.37x more unique reads than you had initially. This indicates that your library has low complexity, probably because too few fragments from your ChIP survived to the library amplification step.
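
The trend in the table can be explored with a simple idealized model: if every read is an independent random draw (with replacement) from a library of N distinct fragments, the expected number of distinct fragments seen after n reads is N(1 - (1 - 1/N)^n). This model is an assumption made here for illustration; it may not be the one used to produce the slide's numbers, so don't expect the values to match exactly.

```python
def expected_unique(library_size, n_reads):
    """Expected number of distinct fragments seen after n_reads random draws
    (with replacement) from a library of library_size fragments."""
    return library_size * (1 - (1 - 1 / library_size) ** n_reads)


def duplicate_fraction(library_size, n_reads):
    """Expected fraction of reads that repeat an already-seen fragment."""
    return 1 - expected_unique(library_size, n_reads) / n_reads


def gain_from_doubling(library_size, n_reads):
    """Fold-increase in distinct fragments when sequencing depth is doubled."""
    return expected_unique(library_size, 2 * n_reads) / expected_unique(library_size, n_reads)


# Library of 100 fragments, as on the slide; the read depths are arbitrary.
for n in (20, 50, 100, 200):
    print(n, round(duplicate_fraction(100, n), 2), round(gain_from_doubling(100, n), 2))
```

The qualitative message is the same as the slide's: the higher the duplicate fraction already is, the less a further lane of sequencing buys you.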

17 MACs ‘shiftsize’ model
Keep scrolling down your .macsinfo file…
INFO @ Sun, 10 Feb :27:51: #2 Build Peak Model...
INFO @ Sun, 10 Feb :27:51: #2 number of paired peaks: 0
Sun, 10 Feb :27:51: Too few paired peaks (0) so I can not build the model! Broader your MFOLD range parameter may erase this error. If it still can't build the model, please use --nomodel and --shiftsize 100 instead.
Sun, 10 Feb :27:51: Process for pairing-model is terminated!
Sun, 10 Feb :27:51: #2 Skipped...
Sun, 10 Feb :27:51: #2 Use 100 as shiftsize, 200 as fragment length
Here MACS tried to estimate the “shift size” for moving sense & antisense reads to get a final peak position, by identifying sets of strong + & - strand peaks at a certain distance from each other. There wasn’t enough info on chromosome 19 to do this, so it made a guess that the fragment size was 200 & the shiftsize was 100. This is close enough to the actual fragment size of ~150 bp that we can go with this.

18 MACS model file
This is the result I got when I ran MACS with all chromosomes:
#2 Build Peak Model...
#2 number of paired peaks: 683
Fewer paired peaks (683) than 1000! Model may not be build well! Lower your MFOLD parameter may erase this warning. Now I will use 683 pairs to build model!
finished!
predicted fragment length is 125 bps
Generate R script for model : LiE_IP_v_INPUT_11_2012_dup1_model.r
Call peaks...
use control data to filter peak candidates...
Finally, 9504 peaks are called!
find negative peaks by swapping treat and control
Finally, 337 peaks are called!
d = estimated fragment size. The actual size is ~150 bp, so this is not perfect, suggesting a bit more tweaking could be useful.
To generate the model plot you will need to go into R and enter: source("MACS_output_file.r"), which will generate a .pdf.

19 Peaks & negative peaks
Keep scrolling down your .macsinfo file until you find…
INFO @ Sun, 10 Feb :36:47: #3 Finally, 364 peaks are called!
INFO @ Sun, 10 Feb :36:47: #3 find negative peaks by swapping treat and control
INFO @ Sun, 10 Feb :36:52: #3 Finally, 36 peaks are called!
INFO @ Sun, 10 Feb :36:52: #4 Write output…
This is the pay-off, where MACS identifies your ER alpha peak locations! 364 peaks on chromosome 19 (which is ~1/50th of the genome) suggests ~20,000 peaks for the whole genome, which is not bad!
Equally critical, MACS now swaps treat & control (pretending your INPUT data is your IP & your ChIP data is your input) and looks again for peaks. The number of “negative” peaks found in this way should be far lower than the number of positive peaks, and the 10:1 ratio here is fine.

20 WinSCP (SFTP/FTP software for Windows): http://winscp.net/eng/index

21 Looking at MACS data in Excel
Using WinSCP move the _peaks.xls file to the PC & open it.

22 Troubleshooting MACS
For details on how to troubleshoot many problems in MACS, see the file ChIPseq_analysis_methods_2013_02_11.doc on the cbi website. Briefly…
MACS can’t build a model:
- Adjust the mfold values (the fold-over-background range MACS considers for paired peaks)
- Tell MACS not to build a model, but instead to use the shiftsize you specify.
Peaks/negative peaks ratio is poor or too few peaks are detected:
- Adjust model settings to see if you can improve both. Otherwise, you may have to conclude that 1) your library was no good or 2) the factor just doesn’t bind to many places in the genome.

23 Trimming .bdg files
With the -B & -S options, MACS generated a bedGraph file that can be used to visualize your combined read density information (with + & - reads shifted by shiftsize) in the UCSC browser.
MACS gets too enthusiastic, however, and occasionally places the end of a read past what the UCSC browser thinks is the end of a chromosome (causing the UCSC browser to reject the whole file). To avoid this, you need to trim your .bdg files to remove anything past chromosome ends.
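
For a sense of what trimming involves, here is a minimal Python sketch. It is not the course's Perl script; the function name trim_bedgraph, the toy chromosome length and the toy intervals are all assumptions for illustration. Real use would need the actual chromosome sizes for the genome build in question.

```python
def trim_bedgraph(lines, chrom_sizes):
    """Clip bedGraph intervals to chromosome ends; drop those entirely past the end.

    lines: iterable of "chrom<TAB>start<TAB>end<TAB>value" strings
           (track/browser/comment header lines are passed through unchanged).
    chrom_sizes: dict mapping chromosome name to its length in bp.
    """
    trimmed = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith(("track", "browser", "#")):
            trimmed.append(line)  # keep header lines as they are
            continue
        chrom, start, end, value = line.split("\t")
        start, end = int(start), int(end)
        size = chrom_sizes.get(chrom)
        if size is None or start >= size:
            continue  # unknown chromosome, or interval fully past the end
        end = min(end, size)  # clip an interval that runs over the end
        trimmed.append(f"{chrom}\t{start}\t{end}\t{value}")
    return trimmed


# Toy chromosome length for illustration only (not the real mm9 value).
toy_sizes = {"chr19": 1000}
toy_bdg = [
    "track type=bedGraph name=toy",
    "chr19\t900\t950\t3",
    "chr19\t980\t1040\t2",
    "chr19\t1040\t1100\t1",
]
trimmed = trim_bedgraph(toy_bdg, toy_sizes)
print(trimmed)
```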

24 Normalizing .bdg files
If you sequenced 100 M reads (A) you may have a peak that is 200 reads at its apex. But if you only took a subsample of 10 M reads (B), that peak would be only ~20 reads at its apex. To compare (A) & (B), just divide by the # of million mapped reads… now both peaks have a max of 2.
The same is true when comparing across samples: normalizing to “reads per million mapped reads” (RPMR) lets you directly compare peak intensity across samples & conditions.
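
The arithmetic is simple enough to check in a few lines. In this sketch the apex values (200 and 20) and read totals come from the slide; the flanking values and the function name normalize_to_rpm are invented for illustration.

```python
def normalize_to_rpm(values, mapped_reads):
    """Scale read-density values to reads per million mapped reads."""
    millions = mapped_reads / 1_000_000
    return [v / millions for v in values]


# The slide's two cases: the same peak seen at 100 M and at 10 M reads.
# Only the apex values (200 and 20) come from the slide; the flanks are made up.
peak_a = normalize_to_rpm([50, 200, 80], mapped_reads=100_000_000)
peak_b = normalize_to_rpm([5, 20, 8], mapped_reads=10_000_000)
print(max(peak_a), max(peak_b))
```

For ChIP-seq tracks the natural denominator is the non-duplicated mapped read count, i.e. the “tags after filtering” figure from the .macsinfo file.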

25 Millions of non-duplicated mapped reads reported in .macsinfo
cp /cluster/tufts/cbicourse/ChIPseq/Sample_NGS_data/*.fastq.gz .
gunzip *.fastq.gz
bsub -oo ip19.bowtieinfo /cluster/shared/gschni01/bowtie*/bowtie -n 1 -m 1 --best --strata mm9 ip19.fastq ip19.map
bsub -oo input19.bowtieinfo /cluster/shared/gschni01/bowtie*/bowtie -n 1 -m 1 --best --strata mm9 input19.fastq input19.map
awk 'OFS="\t" {print $4, $5, $5+length($6),$1,".",$3}' input19.map > input19.bed
awk 'OFS="\t" {print $4, $5, $5+length($6),$1,".",$3}' ip19.map > ip19.bed
module add python/2.6.5
export PYTHONPATH=/cluster/shared/gschni01/lib/python2.6/site-packages:$PYTHONPATH
export PATH=/cluster/shared/gschni01/bin:$PATH
bsub -oo ipvinput19.macsinfo macs14 --format=BED --bw=210 --keep-dup=1 -B -S -c input19.bed -t ip19.bed --name ipvinput19
cd *aph/treat
gunzip *.gz
bsub perl /cluster/home/g/s/gschni01/perl_programs/Select_bdgs_for_beds.pl ip19_trim_norm.bdg all ipvinput19_treat_afterfiting_all.bdg
gzip *.bdg
cd ../control
bsub perl /cluster/home/g/s/gschni01/perl_programs/Select_bdgs_for_beds.pl input19_trim_norm.bdg all ipvinput19_control_afterfiting_all.bdg

26 Uploading to UCSC browser
Use WinSCP to move your .gz-compressed .bdg files & the …peaks.bed file that MACS generated to your PC.
Go to the UCSC browser, select the mouse mm9 genome & hit enter.
Click on “add custom tracks”.
Select each of these files & upload them.
Explore! Get a sense of what the data looks like.
Important sanity check: called peaks should be clearly evident in the .bdg data.

28 Galaxy & Cistrome
MAIN GALAXY SITE: https://main.g2.bx.psu.edu/
GALAXY/CISTROME (specialized for ChIP-seq data): http://cistrome.dfci.harvard.edu/ap/root

29 Useful Galaxy Tools (https://main.g2.bx.psu.edu/)
Get data -> Upload file: to get your data into Galaxy. For .fastq data files it’s best to give an ftp server URL (from right-click, or control-click on a Mac, on the link provided by your core); you will need to sign up for a free account for ftp file transfers.
Liftover: to convert coordinates between genomes or between different builds of the same genome (can also be done in the UCSC browser).
Text manipulation: add lines, rearrange columns, etc.; functional but limited and very unwieldy.
Convert formats: useful, but doesn’t cover everything.
Fetch sequences: get DNA sequence, just like the UCSC table browser.
Operate on Genomic Intervals: determine intersections between sets of regions, etc.

30 Useful Galaxy Tools (https://main.g2.bx.psu.edu/): NGS Tools
QC and manipulation -> run FASTQC
Mapping -> map reads to the genome with Bowtie or BWA
SAM & BAM: convert between & manipulate SAM and BAM format files, often required for certain programs.
Peak Calling: MACS & a few others
BED Tools: convert between BAM & BED and manipulate .bed files.
On the face of it, this looks powerful, but it is VERY slow. My quick benchmark: a 1.5 Gb .fastq.gz raw data file that took 13 secs to download to the cluster took >30 minutes to upload to Galaxy.

31 GALAXY/CISTROME http://cistrome.dfci.harvard.edu/ap/root
Galaxy tools specially designed for ChIP-seq analysis. Most of these things you can find elsewhere, but Cistrome allows easy access to many analyses that give you some quick insights into your data.
Sign up for the free account, so we can explore what Cistrome offers.
“Cistrome” is a term coined by Myles Brown’s lab at DFCI, which refers to the genome-wide distribution of a transcription factor on chromatin.

32 Cistrome-specific tools
CISTROME TOOLBOX
Data Preprocessing: run MACS & variants, as well as some tools designed for ChIP-chip.
Integrative analysis: CORRELATION -> Venn Diagram (overlaps of 2 to 3 peak coordinate datasets)
Use WinSCP to move the 3 .bed files from /cluster/tufts/cbicourse/ChIPseq/Sample_NGS_data/ERa_cistrome_beds to your PC & then upload them to Cistrome.
Select Venn Diagram & select these 3 files in the drop-down menus.

33 Cistrome-specific tools
CISTROME TOOLBOX: ASSOCIATION STUDY
CEAS (cis-regulatory element annotation system): provides quick info on the distribution of your TF peaks relative to chromosomes, gene start & end sites, exons, etc.
Select CEAS & then select one of your uploaded files (by number) in the drop-down menu.

34 Cistrome-specific tools
CISTROME TOOLBOX: ASSOCIATION STUDY
GCA (gene centered annotation): finds the nearest interval in the given interval set for every annotated coding gene (e.g. where’s the nearest ER binding site for each gene in the genome?).
peak2gene (peak center annotation): input a peak file, and it will search the UCSC gene table for the refGenes near each peak center (e.g. where’s the nearest gene to each ERa binding site in the genome?).

35 Cistrome-specific tools
CISTROME TOOLBOX: ASSOCIATION STUDY
Conservation Plot: calculates the PhastCons scores in several interval sets.
Functional transcription factor binding sites are expected to have consensus elements for that factor (and/or partnering factors that help recruit it or stabilize it on chromatin). This is less true, of course, for histone modifications, which may spread for some distance from initial recruiting factors.
Choose Conservation Plot & feed one or more of your .bed files to it. One way to tell if you have improved your peak calling (e.g. by tweaking MACS parameters) is if the conservation at the center of your peaks goes up.

37 Overlaps in Cistrome or Galaxy
The Venn Diagram function gave you some indication of the degree of overlap between your .bed file datasets, but this is only a top-level analysis.
Operate on Genomic Intervals -> Intersect: this lets you create a new .bed file which has only the regions that intersect between two datasets.
Overlapping Pieces of Intervals: saves only the regions shared between 1 & 2.
Overlapping Intervals: saves complete intervals from file 1 that overlap anything in file 2.
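
The difference between the two Intersect modes is easy to see on toy intervals. This Python sketch is not Galaxy's implementation; the function names and the intervals are made up, and the all-against-all comparison is only sensible for small files.

```python
def overlapping_pieces(set1, set2):
    """Return only the sub-regions shared between intervals in set1 and set2."""
    pieces = []
    for chrom1, start1, end1 in set1:
        for chrom2, start2, end2 in set2:
            if chrom1 == chrom2:
                start, end = max(start1, start2), min(end1, end2)
                if start < end:
                    pieces.append((chrom1, start, end))
    return pieces


def overlapping_intervals(set1, set2):
    """Return complete intervals from set1 that overlap anything in set2."""
    return [
        (c1, s1, e1)
        for c1, s1, e1 in set1
        if any(c1 == c2 and s1 < e2 and s2 < e1 for c2, s2, e2 in set2)
    ]


# Toy half-open (BED-style) intervals, not real peak calls.
file1 = [("chr19", 100, 200), ("chr19", 500, 600)]
file2 = [("chr19", 150, 250), ("chr19", 700, 800)]
pieces = overlapping_pieces(file1, file2)
whole = overlapping_intervals(file1, file2)
print(pieces, whole)
```

“Pieces” returns just the shared stretch (150 to 200 here), while “Intervals” returns the whole file-1 region that was hit (100 to 200).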

38 How can we tell whether overlaps are significantly greater than chance?
Go to the cluster & copy those same 3 files into your /cluster/shared/userID/chip folder:
cp /cluster/tufts/cbi*/Ch*/Sam*/ER*beds/*.bed .
Now, let’s assess whether the overlap of ERalpha binding sites between liver and aorta is greater than expected by chance using:
bsub perl /cluster/home/g/s/gschni01/perl*/overlap_1.3.pl AoE_all.bed LiE_all.bed –outfile AoE_v_LiE.overlap
This program identifies all cases where .bed regions in file1 overlap .bed regions in file2 & estimates the frequency expected by chance.

39 Assigning a p value
We have 3 bits of count frequency information: the number of overlaps, the number of regions compared, and the expected background frequency.
This type of data is like coin tosses & is ideally suited for a binomial test, which uses “number of matches”, “number of tests” and “expected background frequency” to calculate p values.
If you flip a coin, say, 10 times and it comes up heads 6 out of 10 (frequency 0.6 vs. expected 0.5), that would not seem unlikely, and a binomial test would tell you this (p = .75).
However, if you flip a coin 1000 times & get heads 600 out of 1000, that would seem a bit odd, and the binomial test would indicate this by saying that a result this extreme is very improbable under the null hypothesis that the frequency of heads is 0.5 (p ≈ 3e-10).
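
For anyone curious what a binomial test does under the hood, here is a hedged Python sketch of a two-sided exact test: it adds up the probabilities of every outcome that is no more likely than the one observed. That convention is an assumption chosen here; the function names are hypothetical, and in practice you would simply call a statistics package.

```python
import math


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials (computed via logs)."""
    if p <= 0.0:
        return 1.0 if k == 0 else 0.0
    if p >= 1.0:
        return 1.0 if k == n else 0.0
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1 - p)
    )
    return math.exp(log_pmf)


def binom_test_two_sided(hits, tests, bkg_freq):
    """Two-sided exact binomial p value: total probability of all outcomes
    that are no more likely than the observed one."""
    observed = binom_pmf(hits, tests, bkg_freq)
    cutoff = observed * (1 + 1e-7)  # tolerance for floating-point ties
    total = sum(
        prob
        for prob in (binom_pmf(k, tests, bkg_freq) for k in range(tests + 1))
        if prob <= cutoff
    )
    return min(1.0, total)


# The two coin-toss cases from the slide.
p_small = binom_test_two_sided(6, 10, 0.5)
p_large = binom_test_two_sided(600, 1000, 0.5)
print(p_small, p_large)
```

The same observed frequency (0.6) gives very different p values because the number of tests differs by 100-fold.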

40 A brief foray into R
Looking at the overlap program results, we know that there were 1653 overlaps between Aorta & Liver ER sites, out of 8260 Aorta regions tested (our number of tests), and the background frequency was 95.11/8260.
To run our binomial test we’ll want to start up the R statistical programming language, by typing:
module load R
If you just type R now, you get this message:
To run R please invoke the following command to run it via LSF's interactive queue: bsub -Ip -q int_public6 R
Do what it suggests & you’ll get R’s welcome information.

41 Binomial tests for overlaps
Now ask R to run your binomial test by typing:
binom.test(1653, 8260, 95.11/8260)
The p value is <2e-16. Very low. So, yes, ER binding sites in liver and aorta overlap more than expected by chance… but ERa is still binding to ~80% different places between these two tissues.
Now exit R by typing “q()” & saying “n” to the question about saving.
Binomial tests are useful for many different types of count data & they will also give you probabilities for ANTI-enrichment as well as enrichment.

42 Getting R (for your PC)
R: http://cran.r-project.org/
RStudio:
Install RStudio after you have installed R.
For more info on using R & Unix see: UNIX resources & R resources

43 Overlaps between peaks & genes
In that same folder there are also .txt files listing the transcription start sites (TSSes) of genes that were up- or down-regulated by estrogen in aorta or liver. Get them by typing:
cp /cluster/tufts/cbi*/Ch*/Sam*/ER*beds/*.txt .
Take a look at one of them using head [name].txt:
chr Dnahc6
chr C8g
chr Tnip3
The file format is (tab-delimited) chromosome, TSS, transcription direction (+ = sense) & geneID.
You can get all this info easily from the UCSC browser for individual genes (by hand), or you can get this information for all genes & extract what you want for your gene set of interest. Check out the RNA-seq module for info on making & handling .gtf files.

44 Overlaps between peaks & genes 2
The overlap program can recognize this type of file & will test for overlaps between ChIP-seq peaks and regions around the listed TSSes (default +/-1000 bp). You can also change this range by specifying a –range variable.
Find the overlaps between 10-kb regions around TSSes of genes up- or downregulated in each tissue & the corresponding ER binding site data using variations on:
bsub perl /cluster/home/g/s/gschni01/perl*/overlap_1.3.pl Ao_up_TSS.txt AoE_all.bed –outfile Ao_up_v_AoE.overlap
Note the number of overlaps, the number of genes (= number of tests) and the number of overlaps expected by chance, then start up R and use binomial tests to determine whether there is significant enrichment for each comparison. What conclusions can you draw?

45 An important note on Data Storage
.fastq files are huge (too big for CDs or, for more than a few, your PC hard drive). So are many of the analysis files (like your .map & .bed files).
You can request extra storage space on the cluster; for more info go to:
Even that fills up fast: I’d recommend buying an external >1 Terabyte hard drive (~$200 or less).

46 Broad IGV (“Integrative Genomics Viewer”), an alternative to UCSC browser
You will need to register, but they don’t send you spam.

47 ChIP-seq Methods & Analysis
Gavin Schnitzler Asst. Prof. Medicine TUSM, Investigator at MCRI, TMC

49 DAY 4 OUTLINE
Position weight matrices to find transcription factor binding sites (TFBSes)
TFBS enrichment in peaks using CentDist
TFBS enrichment using Storm in UNIX
Mining Storm results
Disambiguating similar matrices w/ STAMP

50 DAY 3 REMNANT
Analyzing overlaps between peaks & regulated genes in UNIX

51 How can we test the significance of binding site association w/ regulated genes?
If you haven’t already, go to the cluster & copy the .bed and .txt files to your /cluster/shared/userID/chip folder (mkdir chip & cd chip if you don’t have this folder yet):
cp /cluster/tufts/cbi*/Ch*/Sam*/ER*beds/*.* .
The .txt files list the transcription start sites (TSSes) of genes that were up- or down-regulated by estrogen in aorta or liver (by RNA-seq analysis).

53 Overlaps between peaks & genes 2
The overlap program can recognize this type of file & will test for overlaps between ChIP-seq peaks and regions around the listed TSSes (default +/-1000 bp). You can also change this range by specifying a –range variable.
Find the overlaps between 10-kb regions around TSSes of genes up- or downregulated in each tissue & the corresponding ER binding site data using variations on:
bsub perl /cluster/home/g/s/gschni01/perl*/overlap_1.3.pl Ao_up_TSS.txt AoE_all.bed –outfile Ao_up_v_AoE.overlap
(these commands are in /cluster/tufts/cbi*/Ch*/Sam*/Fin*/workflow2.txt)
Note that the number of overlaps (hits), the number of genes (tests) and the number of overlaps expected by chance divided by the number of genes (background frequency) provide all the information you need for binomial tests. Note these numbers down for each comparison.

54 Accessing the R statistical language
On the PCs in this room: Start -> Programs -> R
To get R for your PC (free):
To get RStudio (allows for easier management of R projects):
On the cluster type: module load R
Then: bsub -Ip -q int_public6 R
To exit use the R command q()
For more info on using R & Unix see: UNIX resources & R resources

55 Binomial tests in R
Use the R command binom.test(hits, tests, bkg_freq) to address the significance of the overlaps you see.
For Ao_down_TSS.txt vs. AoE.bed: binom.test(2, 118, 1.03/118)
Which comparisons show significant enrichment? Do any show anti-enrichment?

57 What is PWM?
Transcription factor binding sites (TFBSs) are usually slightly variable in their sequences.
A positional frequency matrix (PFM) records how often each base is seen at each index position of the motif, from which you get the probability of seeing a given base at that position. It is built from sequences known to bind the TF (e.g. 46 sequences for the PFM below).
[PFM table: rows A, C, G, T & N, plus a consensus (Con) row; columns are motif positions (Pos), 5’ to 3’.]
Adapted from presentation by Victor Jin, Department of Biomedical Informatics, The Ohio State University

58 PFM -> normalized PFM -> PWM
Binding site data:
acggcagggTGACCc
aGGGCAtcgTGACCc
cGGTCGccaGGACCt
tGGTCAggcTGGTCt
aGGTGGcccTGACCc
cTGTCCctcTGACCc
aGGCTAcgaTGACGt
…
cagggagtgTGACCc
gagcatgggTGACCa
aGGTCAtaacgattt
gGAACAgttTGACCc
cGGTGAcctTGACCc
gGGGCAaagTGACTg
Position frequency matrix (PFM) (also known as raw count matrix): given N sequence fragments of fixed length, one can assemble a position frequency matrix (number of times a particular nucleotide appears at a given position).
A normalized PFM, in which each column adds up to a total of one, is a matrix of probabilities for observing each nucleotide at each position (e.g. divide by 46).
Position weight matrix (PWM) (also known as position-specific scoring matrix): the normalized PFM is converted to log-scale for efficient computational analysis. To eliminate null values before log-conversion, and to correct for small samples of binding sites, a sampling correction, known as pseudocounts, is added to each cell of the PFM.
Adapted from presentation by Victor Jin, Department of Biomedical Informatics, The Ohio State University
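
Counting bases column by column is all it takes to get a PFM. The Python sketch below uses just the first four binding-site sequences from the slide so the counts are easy to verify by eye; the function names build_pfm and normalize_pfm are made up for illustration, and the code assumes the sequences contain only A, C, G and T.

```python
def build_pfm(sequences):
    """Count how often each base appears at each position of aligned sites."""
    length = len(sequences[0])
    pfm = {base: [0] * length for base in "ACGT"}
    for seq in sequences:
        if len(seq) != length:
            raise ValueError("all binding-site sequences must have the same length")
        for i, base in enumerate(seq.upper()):
            pfm[base][i] += 1
    return pfm


def normalize_pfm(pfm):
    """Divide each count by its column total so every column sums to one."""
    length = len(pfm["A"])
    totals = [sum(pfm[base][i] for base in "ACGT") for i in range(length)]
    return {base: [pfm[base][i] / totals[i] for i in range(length)] for base in "ACGT"}


# The first four binding-site sequences listed on the slide.
sites = ["acggcagggTGACCc", "aGGGCAtcgTGACCc", "cGGTCGccaGGACCt", "tGGTCAggcTGGTCt"]
pfm = build_pfm(sites)
freqs = normalize_pfm(pfm)
print(pfm["G"], freqs["G"][1])
```

With the full set of 46 sequences the same code would divide each column by 46, as the slide describes.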

59 Converting a PFM into a PWM
Position Weight Matrix for ERE
[Table: ERE position weight matrix, log-scale weights for rows A, C, G, T by position]
Converting a PFM into a PWM: for each matrix element, the weight is computed from:
raw count (PFM matrix element) of nucleotide b in column i
N: number of sequences used to create the PFM (= column sum)
pseudocounts (correction for small sample size)
p(b): background frequency of nucleotide b
Adapted from presentation by Victor Jin, Department of Biomedical Informatics, The Ohio State University
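One common way to combine these four quantities is sketched below. It is an assumption, not necessarily the formula on the original slide: the weight is taken as log2 of the pseudocount-corrected probability, (count + pseudocount) / (N + 4 x pseudocount), divided by the background frequency p(b). The pseudocount of 0.25 per base and the uniform background are illustrative defaults.

```python
from math import log2

BASES = "ACGT"


def pfm_to_pwm(pfm, pseudocount=0.25, background=None):
    """Convert raw counts to log2-odds weights (assumed formula, see lead-in)."""
    if background is None:
        background = {b: 0.25 for b in BASES}  # assumption: uniform background
    length = len(pfm["A"])
    pwm = {b: [] for b in BASES}
    for i in range(length):
        n = sum(pfm[b][i] for b in BASES)  # N: number of sequences (= column sum)
        for b in BASES:
            # The pseudocount keeps zero counts from producing log2(0).
            corrected = (pfm[b][i] + pseudocount) / (n + 4 * pseudocount)
            pwm[b].append(log2(corrected / background[b]))
    return pwm


# Hypothetical 2-column PFM: column 0 is always A, column 1 is uninformative.
toy_pfm = {"A": [4, 1], "C": [0, 1], "G": [0, 1], "T": [0, 1]}
toy_pwm = pfm_to_pwm(toy_pfm)
print(toy_pwm["A"])
print(toy_pwm["C"])
```

Positive weights mark bases seen more often than background, negative weights mark bases seen less often, and an uninformative column scores 0 for every base.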

60 Scoring putative EREs by scanning the promoter w/ PWM
Promoter sequence being scanned: G G G T C A G C A T G G C C A
[Table: ERE position weight matrix, weights for rows A, C, G, T by position]
Absolute score of the site = 11.57. This is also called “functional depth”.
Adapted from presentation by Victor Jin, Department of Biomedical Informatics, The Ohio State University
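Scanning amounts to sliding the matrix along the sequence and, at each offset, summing the weight of the base seen at each position. The sketch below shows this with a made-up 3-column PWM (consensus AGC), not the ERE matrix from the slide. It scores one strand only; real scanners such as storm typically also consider the reverse complement.

```python
def score_site(pwm, site):
    """Sum the PWM weight of the observed base at each position of the site."""
    return sum(pwm[base][i] for i, base in enumerate(site.upper()))


def scan(pwm, sequence):
    """Score every window of motif width; return (offset, score) pairs."""
    width = len(pwm["A"])
    return [
        (start, score_site(pwm, sequence[start:start + width]))
        for start in range(len(sequence) - width + 1)
    ]


def best_hit(pwm, sequence):
    """Return the (offset, score) pair with the highest score."""
    return max(scan(pwm, sequence), key=lambda hit: hit[1])


# Hypothetical toy PWM whose best-scoring site is AGC.
toy_pwm = {
    "A": [1.0, -1.0, -1.0],
    "C": [-1.0, -1.0, 1.0],
    "G": [-1.0, 1.0, -1.0],
    "T": [-1.0, -1.0, -1.0],
}
hits = scan(toy_pwm, "TTAGCA")
print(hits)
print(best_hit(toy_pwm, "TTAGCA"))  # (2, 3.0)
```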

61 Estimating p. values for a match to the matrix
Promoter sequence being scanned: G G G T C A G C A T G G C C A
[Table: ERE position weight matrix, weights for rows A, C, G, T by position]
This sequence had a functional depth (f) of 0.86. The summed probabilities of all sequences with f >= 0.86 give the p. value for this sequence, i.e. the chance that f >= 0.86 would be achieved by a randomized DNA sequence. Short matrices can reach f > 0.9 but still have high p. values, so f is the best measure for short seqs. Long matrices can have very low p. values but still have f < 0.9, so p. value is the best measure for long seqs.
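Both measures can be sketched for a short matrix by brute force. Two assumptions are made here and should be checked against the tool you use: functional depth is taken as the site score rescaled so that the worst possible site is 0 and the best is 1, and the p. value is computed by enumerating all 4^width sequences under a uniform background, which is only practical for short matrices. The toy PWM is hypothetical.

```python
from itertools import product

BASES = "ACGT"


def site_score(pwm, site):
    """Sum the PWM weights of the bases in the site."""
    return sum(pwm[b][i] for i, b in enumerate(site))


def functional_depth(pwm, site):
    """Score rescaled to 0..1 between the worst and best possible site (assumed definition)."""
    width = len(pwm["A"])
    lo = sum(min(pwm[b][i] for b in BASES) for i in range(width))
    hi = sum(max(pwm[b][i] for b in BASES) for i in range(width))
    return (site_score(pwm, site.upper()) - lo) / (hi - lo)


def p_value(pwm, site, background=None):
    """Summed background probability of all sequences scoring at least as well as site."""
    if background is None:
        background = {b: 0.25 for b in BASES}  # assumption: uniform background
    width = len(pwm["A"])
    observed = site_score(pwm, site.upper())
    total = 0.0
    for candidate in product(BASES, repeat=width):
        if site_score(pwm, candidate) >= observed - 1e-9:
            prob = 1.0
            for b in candidate:
                prob *= background[b]
            total += prob
    return total


# Hypothetical toy PWM whose best-scoring site is AGC.
toy_pwm = {
    "A": [1.0, -1.0, -1.0],
    "C": [-1.0, -1.0, 1.0],
    "G": [-1.0, 1.0, -1.0],
    "T": [-1.0, -1.0, -1.0],
}
print(functional_depth(toy_pwm, "AGC"), p_value(toy_pwm, "AGC"))
print(functional_depth(toy_pwm, "AGT"), p_value(toy_pwm, "AGT"))
```

With only three columns, even a perfect match (f = 1) has a p. value of 1/64, which illustrates why f is the more useful threshold for short matrices.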

62 DAY 4 OUTLINE Position weight matrices to find transcription factor binding sites (TFBSes) TFBS enrichment in peaks using CentDist TFBS enrichment using Storm in UNIX Mining Storm results Disambiguating similar matrices w/ STAMP

63 Preparing for PWM search
Launch WinSCP (Start->Programs->WinSCP). Navigate to: /cluster/tufts/cbicourse/ChIPseq/Sample_NGS_data/Final_output_files. Pull the “ipvinput19_peaks.xls” file over to the PC (this is the MACS output file that we generated yesterday). Open it in Excel.

64 Making .bed file w/ +/-200 bp around peak summit (where we expect TFBS enrichment will be greatest)
In Excel, add three new columns to each peak row: chr (= chr column of the same row), start (= start column + summit - 200) and end (= start column + summit + 200). Copy these 3 columns (without any header row). In WinSCP click on any file on the PC, then on Files->New->File & provide a name (“LiE_chr19_400bp.bed”) to edit a new simple text file. Paste, save & close.
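The same spreadsheet step can be sketched in Python for anyone who prefers to skip Excel. The sketch assumes the MACS peaks file is tab-delimited with columns chr, start, end, length, summit in that order, with summit given as an offset from start (as the start + summit formula implies), and with comment lines beginning with "#". The function names are made up for illustration, and clamping the start at 0 is an added safeguard rather than part of the slide's recipe.

```python
def summit_windows(xls_lines, flank=200):
    """Turn MACS peak rows into (chr, start, end) windows centred on each summit."""
    windows = []
    for line in xls_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip comments, blank lines and the header row (its start field is not a number).
        if line.startswith("#") or len(fields) < 5 or not fields[1].isdigit():
            continue
        chrom, start, summit = fields[0], int(fields[1]), int(fields[4])
        center = start + summit
        windows.append((chrom, max(0, center - flank), center + flank))
    return windows


def to_bed(windows):
    """Format windows as headerless, tab-delimited BED text."""
    return "".join(f"{c}\t{s}\t{e}\n" for c, s, e in windows)


# Hypothetical input lines standing in for ipvinput19_peaks.xls.
example_lines = [
    "# This file is generated by MACS\n",
    "\n",
    "chr\tstart\tend\tlength\tsummit\ttags\n",
    "chr19\t3200000\t3200800\t801\t350\t40\n",
]
windows = summit_windows(example_lines)
print(to_bed(windows), end="")
```

To use it on a real file, read the lines with open(path) and write the to_bed text to LiE_chr19_400bp.bed.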

65 Making a file of control .bed regions
For each peak-center row (chr, start, end), make a control-region row: chr = peaks:chr, start = peaks:start-5000, end = peaks:end-5000. 5000 bp is far enough away that the control is not part of an enhancer containing the ER binding site, but close enough that it is likely to be in the same general chromatin territory (e.g. accessible euchromatin vs. inaccessible heterochromatin). Copy these columns & make a “CTRL_chr19_400bp.bed” file with WinSCP.
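The offset can also be applied in a few lines of Python. This is a sketch with a made-up function name; dropping regions whose shifted start would fall below 0 is an added assumption, since the slide does not say what to do for peaks within 5000 bp of a chromosome start.

```python
def offset_controls(regions, offset=5000):
    """Shift each (chr, start, end) region upstream by offset bp to make a control."""
    controls = []
    for chrom, start, end in regions:
        if start - offset < 0:
            continue  # assumption: skip regions that would run off the chromosome start
        controls.append((chrom, start - offset, end - offset))
    return controls


# Hypothetical peak-center windows.
peaks = [("chr19", 3200150, 3200550), ("chr19", 1000, 1400)]
controls = offset_controls(peaks)
print(controls)  # [('chr19', 3195150, 3195550)]
```

Each control keeps the 400 bp width of its peak window, so IP and control sequences give storm the same number of positions to test per region.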

66 CentDist A TFBS enrichment program designed for ChIP-seq data
Assumes that TFBS-matrix hits will be most highly enriched at the centers of ChIP-seq peaks. Adds PWM score to “peakiness” score (i.e. how much more enriched the TF site is in the center of the peak) -> final p. val. [Figure: three hit-distribution plots: good enrichment, poor shape (higher p. val.); good enrichment, OK shape; good enrichment, good shape (best p.)] Go to: …or (more simply) just google centdist and click on the first link (should end in /centdist/)

67 Run CentDist Give centdist a name for your run
Upload your +/-200 bp .bed file (CentDist does not need a separate background file, instead using flanking sequences at a case-specific optimized distance as background). Check “Jaspar”, “vertebrate” & set max-co-motif distance to 3000. Then click Submit Job. On the new window that opens click “turn on autorefresh” so you will be notified when the job ends.

68 Jaspar vs. Transfac Jaspar is a freely available set of TFBS matrices that can be downloaded from jaspar.genereg.net. Transfac is a commercial product; you need to pay for the latest release. A version of Transfac (from ~2006) is available on the cluster (e.g. /cluster/home/g/s/gschni01/vertebrates.mat). Which to use? Both, ideally, but beware that programs like CentDist will give you results from Transfac matrices, and you won’t be able to find out details of what they are.

69 CentDist Results View by factors, put in max number & hit go.
P. values (based on a score composed of Z0 (enrichment) & Z1 (peakiness)). Distribution graph. WebLogo representation of the Jaspar matrix: shows information content at each position. A, G, C & T at 25% each -> 0 bits; only 1 base at 100% -> 2 bits. Bases most highly over-represented relative to chance are largest.
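The bit values quoted for the logo follow from the usual information-content calculation for DNA: 2 bits minus the Shannon entropy of the column. The sketch below reproduces the two extreme cases from the slide. It ignores the small-sample correction that logo tools may apply, and the function names are illustrative.

```python
from math import log2


def information_content(column):
    """Bits of information in one motif column, given its four base probabilities."""
    entropy = -sum(p * log2(p) for p in column if p > 0)
    return 2.0 - entropy


def letter_heights(column):
    """Height of each base in the logo: its probability times the column's bits."""
    bits = information_content(column)
    return [p * bits for p in column]


print(information_content([0.25, 0.25, 0.25, 0.25]))  # 0.0
print(information_content([1.0, 0.0, 0.0, 0.0]))      # 2.0
print(letter_heights([0.5, 0.5, 0.0, 0.0]))           # [0.5, 0.5, 0.0, 0.0]
```

A column split evenly between two bases carries 1 bit, so each of the two letters is drawn 0.5 bits tall.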

70 How many enriched TF sites are there really?
Matrix hits are enriched for many factors. Are all of them real? V$jaspar_HNF4A V$jaspar_NR2F1 V$jaspar_ESR1 Maybe not: notice how similar the top sites are to each other and to estrogen response elements (EREs) such as V$jaspar_ESR1.

71 Downloading CentDist Results
Click on download as text & save the file somewhere you remember. Open it in Excel. Basic summary statistics & a few other things. Many questions unanswered: -What is the fold enrichment over background? -What are the peaks with the greatest enrichment for each factor? -What are the TFBS hit locations in each peak? -Which are the real enriched TFBSes & which are just showing up by homology? -Do certain factors group together in the same peaks?

72 DAY 4 OUTLINE Position weight matrices to find transcription factor binding sites (TFBSes) TFBS enrichment in peaks using CentDist TFBS enrichment using Storm in UNIX Mining Storm results Disambiguating similar matrices w/ STAMP

73 Storm Storm is a straightforward PWM scanning program that runs in UNIX. Its greatest advantage is that it gives you all of the unprocessed output data, which allows you to do much more powerful analyses! It also allows us to specify thresholds for matches to the matrix – allowing us to use functional depth as well as p. value

74 Getting DNA for Storm To run storm, we first need to get the actual DNA sequence for centers of our peaks (where we expect the greatest enrichment for TFBSes to be). Go to the UCSC genome browser at: genome.ucsc.edu Under genome choose mouse mm9 Then choose add custom track & upload your +/-200 bp .bed file. Click on Tools->Table Browser Select your new track Select output format “sequence” Provide a file name “LiE_chr19_400bp.fa” & hit “get output” Hit ‘get output’ again on the next page Now do the same for your “CTRL_chr19_400bp.bed” file. .fa denotes a simple ‘fasta’ format sequence file.

75 Cleaning up our .fa files
Use WinSCP to move these .fa files and their corresponding .bed files to your …/chip directory. Each entry in the .fa file has a header with special characters in it that confuse storm. To fix this, go to your …/chip directory in Putty & do:
perl /cluster/home/g/s/gschni01/perl*/Lax_convert.pl LiE_chr19_400bp.fa > LiE_chr19_400bp_converted.fa
To see what has changed use: head *.fa
Do the same for your “CTRL_chr19_400bp.fa” file.
All of the commands for today are in the file /cluster/tufts/cbi*/Ch*/Sam*/Final*/workflow2.txt; cat this to your screen to copy & paste commands.

76 Running storm First set some path variables:
export CREAD=/cluster/home/g/s/gschni01/cread-0.84
export PATH=$PATH:$CREAD/bin
Then run storm for your IP .fa file:
bsub -oo LiE_chr19_400bp_p.storminfo storm -p -t s LiE_chr19_400bp_converted.fa -o LiE_chr19_400bp_p.storm /cluster/home/g/s/gschni01/Jaspar_non_redundant_vertebrate.mat
And for your control .fa file:
bsub -oo CTRL_chr19_400bp_p.storminfo storm -p -t s CTRL_chr19_400bp_converted.fa -o CTRL_chr19_400bp_p.storm /cluster/home/g/s/gschni01/Jaspar_non_redundant_vertebrate.mat
Use more to look at one of your .storm output files (space for next page, ctrl-c to exit).

77 DAY 4 OUTLINE Position weight matrices to find transcription factor binding sites (TFBSes) TFBS enrichment in peaks using CentDist TFBS enrichment using Storm in UNIX Mining Storm results Disambiguating similar matrices w/ STAMP

78 Interpreting Storm data
Run the dme_parse perl program to gather and tabulate your storm data:
bsub -oo LiE_chr19_400bp_p.dmeparseinfo perl /cluster/home/g/s/gschni01/perl*/dme_parse5.4.pl LiE_chr19_400bp_p.storm LiE_chr19_400bp.bed peaks
bsub -oo CTRL_chr19_400bp_p.dmeparseinfo perl /cluster/home/g/s/gschni01/perl*/dme_parse5.4.pl CTRL_chr19_400bp_p.storm CTRL_chr19_400bp.bed peaks

79 dme_parse outputs …storm.bed file:
Has UCSC browser tracks for each TFBS matrix with locations of all hits in bed format. …storm.map file: Lists all input matrices, each followed by the PFM derived from all of the hits to this matrix from our data. …storm.info file: Summarizes a lot of information about matrix hits. Move the .info files to your PC with WinSCP & open them in Excel. The file provides summary statistics for # of peaks with 0, 1, 2, etc. hits, total hits, and normalized hits per 50 bp vs. distance from peak center.

80 dme_parse outputs Using the .info file to plot relative density of TFBS hits in aorta IP, liver IP & offset controls:

81 dme_parse outputs Using the .info files to structure binomial tests
Hits = # of matches to each matrix in IP data. Tests = # of times storm tested for a match = (# of peaks) * (400 bp length of peaks - matrix length). Background freq = matches to offset control peak data / # tests (same as for IP). Using the .info files to determine fractional enrichment: hit frequency in IP data / hit frequency in offset control.
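The bookkeeping for one matrix can be sketched as follows. The function name and the example counts are hypothetical; the tests formula is the one given on the slide, and the sketch assumes the IP and control sets contain the same number of 400 bp regions, so both share the same number of tests.

```python
def enrichment_inputs(ip_hits, ctrl_hits, n_peaks, matrix_length, peak_length=400):
    """Return (tests, bkg_freq, fold_enrichment) for one matrix."""
    tests = n_peaks * (peak_length - matrix_length)
    ip_freq = ip_hits / tests
    bkg_freq = ctrl_hits / tests  # control hits over the same number of tests
    fold = ip_freq / bkg_freq if ctrl_hits else float("inf")
    return tests, bkg_freq, fold


# Hypothetical counts: 120 hits in 100 IP peaks, 40 hits in 100 offset controls,
# for a 15 bp matrix.
tests, bkg_freq, fold = enrichment_inputs(120, 40, 100, 15)
print(tests)  # 38500
print(fold)
print(f"binom.test(120, {tests}, {bkg_freq})")  # line to paste into R
```

The last line prints a call in the same binom.test(hits, tests, bkg_freq) form used earlier in the course.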

82 dme_parse outputs .freqs file: Number of hits to each matrix for each peak. The distribution of hits per peak in the offset background establishes the # of hits needed to be p. <= .05 enriched over background. This allows identification of sites at which a given TFBS may be functionally targeted (candidates for further testing). You can also look for significant overlaps between the peaks with enrichment for 2 different factors, to identify cooperative versus antagonistic interactions. Details on how to do these analyses are in ChIPseq_analysis_methods_2013_02_11 on the cbi website.
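One way to read "establishes the # of hits to be p. <= .05" is as an empirical cutoff: the smallest hit count that at most 5% of control regions reach. The sketch below implements that reading. It assumes the per-peak hit counts have already been pulled out of the .freqs files into Python lists and dicts (the file layout is not shown here), and the function names and numbers are hypothetical.

```python
def hits_threshold(control_hits_per_peak, alpha=0.05):
    """Smallest k such that at most alpha of control regions have >= k hits."""
    n = len(control_hits_per_peak)
    k = 0
    while sum(1 for h in control_hits_per_peak if h >= k) / n > alpha:
        k += 1
    return k


def enriched_peaks(ip_hits_per_peak, threshold):
    """Names of IP peaks whose hit count reaches the threshold."""
    return [name for name, h in ip_hits_per_peak.items() if h >= threshold]


# Hypothetical control distribution: 90 regions with 0 hits, 6 with 1, 3 with 2, 1 with 3.
control = [0] * 90 + [1] * 6 + [2] * 3 + [3] * 1
cutoff = hits_threshold(control)
print(cutoff)  # 2
print(enriched_peaks({"peak1": 0, "peak2": 2, "peak3": 5}, cutoff))  # ['peak2', 'peak3']
```

Peaks passing the cutoff for two different matrices can then be compared with the overlap and binomial tools from Day 3.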

83 DAY 4 OUTLINE Position weight matrices to find transcription factor binding sites (TFBSes) TFBS enrichment in peaks using CentDist TFBS enrichment using Storm in UNIX Mining Storm results Disambiguating similar matrices w/ STAMP

84 STAMP Go to www.benoslab.pitt.edu/stamp/index.php
STAMP lets you compare matrices for evolutionary similarities to each other. Go to your CentDist output. Create a new column in which you change the names of the factors to fit with the names in the Jaspar_non_redundant_vertebrate.mat file you used for Storm: =SUBSTITUTE(B2,"V$jaspar_","Jaspar$"), & propagate down. Select all matrix names w/ p. < .05 & paste them into a new file called “select_mats.txt” in your /chip folder on the cluster using WinSCP.

85 Getting STAMP to help classify our CentDist top hits
perl /cluster/home/g/s/gschni01/perl*/MatrixSelect.pl /cluster/home/g/s/gschni01/Jaspar_non_redundant_vertebrate.mat select_mats.txt select_mats.mat Now, open the select_mats.mat file with WinSCP, copy everything & paste it into STAMP. Keep all the STAMP defaults & hit submit.

86 STAMP Tree This indicates that enrichment of PPARG, RORA, NR4A2 could be just because of their similarity to EREs. Other enriched sites, such as SP1, FoxA2 & Myf, fall in separate homology classes. To further distinguish which one is real, you can use the enrichment ratios & p. values (the “real” TFBS should be best in both of these).

87 ChIP-seq Methods & Analysis
Gavin Schnitzler Asst. Prof. Medicine TUSM, Investigator at MCRI, TMC

88 ChIP-seq COURSE OUTLINE
Day 1: ChIP techniques, library production, UCSC browser tracks Day 2: QC on reads, Mapping binding site peaks, examining read density maps. Day 3: Analyzing peaks in relation to genomic features, etc. Day 4: Analyzing peaks for transcription factor binding site consensus sequences. Day 5: Variants & advanced approaches.

89 Day 5 Outline Introduction to variations on ChIP-seq methods
Extensions & variations on TFBS analysis Analyzing published data & across platforms Downloading & installing programs Writing your own programs

90 Next-Generation Sequencing Analysis
“ChIP-Seq is the best thing that happened to ChIP since the antibody.  It is 100x better than ChIP-Chip since it escapes most of the problems of microarray probe hybridization.  Plus it is cheaper, and genome wide.  But ChIP-Seq is only the tip of the iceberg - there are many inventive ways to use a sequencer.”  Quote from intro to Homer software at:

91 Extensions of ChIP-seq
ChIP-Seq: Isolation and sequencing of genomic DNA "bound" by a specific transcription factor, covalently modified histone, or other nuclear protein. This methodology provides genome-wide maps of factor binding. Most of HOMER's routines cater to the analysis of ChIP-Seq data.
DNase-Seq: Treatment of nuclei with a nuclease such as DNase I will result in cleavage of DNA at accessible regions. Isolation of these regions and their detection by sequencing allows the creation of DNase hypersensitivity maps, providing information about which regulatory elements are accessible in the genome. (A related technique is FAIRE-seq.)
MNase-Seq: Micrococcal nuclease (MNase) degrades genomic DNA not wrapped around histones. The remaining DNA represents nucleosomal DNA, and can be sequenced to reveal nucleosome positions along the genome. This method can also be combined with ChIP to map nucleosomes that contain specific histone modifications.
RNA-Seq: Extraction, fragmentation, and sequencing of RNA populations within a sample. The replacement for gene expression measurements by microarray. There are many variants on this, such as Ribo-Seq (isolation of ribosomes translating RNA), small RNA-Seq (to identify miRNAs), etc.
GRO-Seq: RNA-Seq of nascent RNA. Transcription is halted, nuclei are isolated, labeled nucleotides are added back, and transcription is briefly restarted, resulting in labeled RNA molecules. These newly created, nascent RNAs are isolated and sequenced to reveal "rates of transcription" as opposed to the total number of stable transcripts measured by normal RNA-seq.
Hi-C: Genomic interaction assay for understanding genome 3D structure. This assay is much more specialized. For more information about how to use HOMER to analyze Hi-C data, check out the Hi-C analysis section.

92 Examining long-range interactions by ChIP-seq
Two DNA fragments associated with the same IP’d protein are ligated together. Sequencing identifies both short-range and long range interactions. Nature Reviews Genetics :840

93 Fine scale information from DNase-seq
Sequencing the ends of DNase cuts identifies regions of bare DNA. Fine scale analysis of this data can identify individual TF binding sites. Nature Reviews Genetics :840

94 Capturing allele-specific information using SNPs in reads
CTCF binds better to the A variant

95 Mapping CpG DNA methylation patterns
Approaches: IP of DNA fragments using antibodies against meC or meCpG binding proteins. Selection of DNA fragments using methyl-sensitive restriction enzymes. Whole genome bisulfite sequencing. Bormann Chung CA, Boyd VL, McKernan KJ, Fu Y, et al. (2010) Whole Methylome Analysis by Ultra-Deep Sequencing Using Two-Base Encoding. PLoS ONE 5(2): e9320. doi: /journal.pone

96 Mapping nucleosome positions
Approaches: 1) Fragmentation to mononucleosome size by sonication or micrococcal nuclease (MNase) -> ChIP w/ antibody against histone modification (H3K4me1), which can map positions of nucleosomes with this mark -> whole genome sequencing. Nat Struct Mol Biol June; 18(6): 742–746.

97 Plotting ChIP-seq read density versus genomic features
Taking average normalized .bedgraph data relative to TSSes…

98 Using input chromatin read density to measure nucleosome densities
Hypothesis: Sonication mostly cuts in nucleosome free regions or inter-nucleosomal spacers. Thus, read positions give information about nucleosome positions. Initial support: Average normalized .bedgraph data from INPUT sample relative to TSSes recapitulates the low nucleosome occupancy seen genomewide over promoters.

99 Day 5 Outline Introduction to variations on ChIP-seq methods
Extensions & variations on TFBS analysis Analyzing published data & across platforms Downloading & installing programs Writing your own programs

100 Many approaches to TFBS analysis
Outline of the review. The overall goal is to identify transcription factor binding sites on a genome-wide scale. Starting with a few experimentally determined sites, a model of the binding site is constructed which is then used in a genome-wide scan to search for additional instances of the binding site. Besides enhanced motif models, additional, evolutionary, genomic, epigenomic, transcriptomic and proteomic data can be used in an integrative fashion to improve the accuracy of binding site search. Hannenhalli S Bioinformatics 2008;24: Also, Ladunga I. An overview of the computational analyses and discovery of transcription factor binding sites. Methods Mol Biol. 2010;674:1-22. doi: / _1. : Introduction to a set of about a dozen methods papers.

101 The Gibbs sampler approach The EM approach (in MEME etc.)
De Novo Search Algorithms
The Gibbs sampler approach. Objective: find a conserved segment of length k in n unrelated sequences. [Figure: sequences 1 to n, each with a candidate segment spanning positions 1 to k] The program will need to run once for each k: e.g. 6 bp, 7 bp, 8 bp sequences, etc. (either automatically, or by hand). From: Lawrence, C. et al. (1993) Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment. Science
The EM approach (in MEME etc.). Expectation Maximization algorithm, proceeds in iterations until E & M converge. For an explanation of the process see Nature Biotechnology 26, (2008). Adapted from:

102 Two de novo search methods
DME is part of the same CREAD package that storm is in (run in UNIX). SEME uses some of the same refinements as CentDist to do de novo searches:

103 Extensions to Basic Models
Composite Patterns: BioOptimizer: the Bayesian Scoring Function Approach to Motif Discovery. Bioinformatics
[Figure: motif states M1, M2, M3 between Start and Stop]
Regulatory Modules: De novo cis-regulatory module elicitation for eukaryotic genomes. Proc Nat’l Acad Sci USA, 102,
[Figure: regulatory modules near Gene A and Gene B]
Adapted from:

104 Combining Signals and other Data
[Figure: motifs upstream of coding regions]
Expression and Motif Regression: Integrating Motif Discovery and Expression Analysis. Proc. Natl. Acad. Sci.
1. Rank genes by E = log2(expression fold change)
2. Find “many” (hundreds) candidate motifs
3. For each motif pattern m, compute the vector Sm of matching scores for genes with the pattern
4. Regress E on Sm
ChIP-on-chip kb information on protein/DNA interaction: An Algorithm for Finding Protein-DNA Interaction Sites with Applications to Chromatin Immunoprecipitation Microarray Experiments. Nature Biotechnology, 20,
[Figure: protein binding in the neighborhood of coding regions]
Adapted from:

105 Assessment of evolutionary conservation
Modules shared across species are most highly rated. For use of evolutionary conservation information w/ individual motifs see: Das & Dai 2007 BMC Bioinformatics 8:S21. For regulatory modules see: Su J, Teichmann SA, Down TA (2010) Assessing Computational Methods of Cis-Regulatory Module Prediction. PLoS Comput Biol 6(12): e doi: /journal.pcbi Adapted from:

106 Integrating data from multiple sources w/ permutation of average ranks
Let’s say we want to combine data from several sources or metrics to decide which are the most relevant enriched TFs, e.g. 1) p.value in CentDist, 2) p.value in Storm & 3) p.value of homologous sequence in DME. Establish a ranking metric for each (e.g. 1 best to 10 worst). It doesn’t have to be the same for 1, 2 & 3, but you need to apply the same rank system across different biological conditions. For each TF compute the average rank. [Table: columns (1), (2), (3), (avg); surviving example ranks 4, 2, 3 and 9, 7, 8]

107 Permutation of average ranks
Now take the same columns of ranks for (1), (2) & (3) and randomize each one separately, then recompute the average ranks. Repeat this many times (until you have thousands of random average ranks) & plot frequency vs. avg. rank. [Figure: frequency of permuted average ranks; an average rank of 2.0 was observed 34/10,000 times in permuted averages, estimated FDR ~3.4e-3] The number of times a given value is observed divided by the total number of iterations gives an estimate of the false discovery rate.
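The permutation procedure can be sketched with the standard library alone. The rank columns below are hypothetical (each metric ranks the same 10 TFs from 1, best, to 10, worst), and the function names are made up. One choice in this sketch is an assumption: the estimate counts permuted averages that are as good as or better than the observed one (<=), which is a slightly stricter reading of "number of times a given value is observed".

```python
import random


def average_ranks(columns):
    """Average rank per TF, given one list of ranks per metric (same TF order)."""
    return [sum(row) / len(row) for row in zip(*columns)]


def permuted_averages(columns, iterations=1000, seed=0):
    """Shuffle each rank column independently and pool the resulting averages."""
    rng = random.Random(seed)
    pooled = []
    for _ in range(iterations):
        shuffled = [rng.sample(col, len(col)) for col in columns]
        pooled.extend(average_ranks(shuffled))
    return pooled


def estimated_fdr(observed_avg, pooled):
    """Fraction of permuted averages at least as good (low) as the observed one."""
    return sum(1 for v in pooled if v <= observed_avg) / len(pooled)


# Hypothetical ranks for 10 TFs under three metrics.
centdist = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
storm = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]
dme = [1, 3, 2, 5, 4, 7, 6, 9, 8, 10]
columns = [centdist, storm, dme]

observed = average_ranks(columns)
pooled = permuted_averages(columns)
print(observed[0], estimated_fdr(observed[0], pooled))
```

A TF whose observed average rank is rarely matched in the shuffled data is unlikely to rank well in all metrics by chance.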

108 Day 5 Outline Introduction to variations on ChIP-seq methods
Extensions & variations on TFBS analysis Analyzing published data & across platforms Downloading & installing programs Writing your own programs

109 What if you want to know something from a published dataset, but they’ve only provided the raw data on SRA? Getting data from SRA:
Go to:
Find an experiment by searching, e.g. “encode h1-hesc h3k4me3”.
Click on the name to the left of the smaller file (1.9M) & then on the downloads tab.
Right click on the ftp link for the run & copy the link location.
Open putty & login to your account at cluster.uit.tufts.edu
Go to your /cluster/shared/[userID]/chip directory & do: wget [pasted URL]

110 Decoding the .sra format
The SRR sra file you now have is in a special file format, but it does have all the original .fastq information in it. To get that info do: bsub /cluster/tufts/cbi*/Ch*/ESC*/sra*/bin/fastq-dump SRR sra [fastq-dump is part of a package of programs for handling .sra files that you can download, unpack & run immediately from your shared directory – at least as far as simple files like fastq-dump are concerned] This gives you the same .fastq format you’re familiar with. Use head to confirm the format, but then you might as well delete the file with rm so as not to clutter up the cluster. After this week you are now ready to do any analysis you want on this data, from mapping reads to the genome (w/ bowtie) to peak calling (w/ MACS), to TFBS analysis.

111 “Liftover” programs to convert between genomes & builds
Several useful tools for this in Cistrome/Galaxy, under Liftover/Others:
Convert between RefSeq, Gene Symbols to Entrez IDs using Bioconductor.
Liftover Wig Files: liftover wig files.
[Galaxy] Convert genome coordinates between assemblies and genomes.
Extract data from Wiggle: extract data for certain chromosome from a wiggle file.
Extract data from Bed: extract data for certain chromosome from a BED file.
In the UCSC genome browser: Tools->Liftover. Choose the starting genome/build & the one you want to convert to. Upload a .bed file w/ the ranges you want & hit go (only works for bed files… may work with bedGraph, although I haven’t confirmed this).

112 Day 5 Outline Introduction to variations on ChIP-seq methods
Extensions & variations on TFBS analysis Analyzing published data & across platforms Downloading & installing programs Writing your own programs

113 Don’t be intimidated! There’s nothing to prevent you from installing a program you want to run in your cluster account. Before you begin, though, type “module available” to see if it’s already installed as a module. Also go to /cluster/tufts/ngsp/ngsp/ to see if it’s installed there. Read the documentation from the creator’s lab, download, unzip &/or unpack the file, read the INSTALL or README files included, & give it a try. You may need to be running a specific version of perl or python, etc. If so, check “module available” to see if it’s installed on the cluster & use “module load [name]” to add it. You may also need to set system variables using “export VARIABLE=$VARIABLE:/new/path”. README files should tell you enough to know what to try. If you get stuck, the cluster support folks are friendly & helpful (and respond moderately fast). Contact them at:

114 A different integrated package of tools to run in UNIX
HOMER: Software for motif discovery and next-gen sequencing analysis
Mapping to the genome (NOT performed by HOMER, but important to understand)
Creation of tag directories, quality control, and normalization (makeTagDirectory)
UCSC visualization (makeUCSCfile, makeBigWig.pl)
Peak finding / Transcript detection / Feature identification (findPeaks)
Motif analysis (findMotifsGenome.pl)
Annotation of Peaks (annotatePeaks.pl)
Quantification of Data at Peaks/Regions in the Genome/Histograms and Heatmaps (annotatePeaks.pl)
Quantification of Transcripts (analyzeRNA.pl)
Additional analysis strategies:
General sequence manipulation tools (homerTools)
Miscellaneous Tools for Sharing Data between programs, etc. (tagDir2bed.pl, bed2pos.pl, pos2bed.pl ...)
Finding overlapping or differentially bound peaks (mergePeaks, getDifferentialPeaks)
ChIP-Seq analysis automation (analyzeChIP-Seq.pl)
Description of file formats
Could be very useful… & with (only a bit of) luck, you’ll be able to install & run them yourself.

115 Installing a program in R
Check out the Key R Commands link at This is not an introduction to programming in R! Instead it gives basic instructions for how to: 1) install & run R packages that may be needed for your research, 2) how to move data files into R 3) how to perform simple edits on this data that may be required by the package & 4) how to output your results. Note: I find that the documentation for R packages is generally quite good.

116 Day 5 Outline Introduction to variations on ChIP-seq methods
Extensions & variations on TFBS analysis Analyzing published data & across platforms Downloading & installing programs Writing your own programs

117 Mastering simple UNIX tools
find, awk, grep, sort, sed & more One line commands to let you search and manipulate large data files w/o writing a program or trying to use the kludgy and limited tools in Galaxy. Find out more at:

118 Programming: Get your feet wet
Perl Tutorials: learn.perl.org/tutorials/ Many tutorials are available if you are interested in learning Perl. These tutorials are introductions. Beginning Perl (free): this book is for those new to programming who want to learn with Perl. A ton of Perl programs for you to use/adapt/modify: For learning R: Check out Josh’s links at: Also check out my notes on using R (specifically geared to the minimum you need to install & use existing programs) & a brief reference sheet on Perl at

119 Look at examples, check the web…
If you’re looking for a command in UNIX, R, Perl, Python, etc. do a Google search (for R add “statistical” to your search to specify what you mean). If you’re wondering how to get a program to do something, look at other programs & see how they did it. You don’t need to memorize the language, beyond a few basics, just look at what you (or someone else) did before & copy it.

120 Questions. What would you like to explore
Questions? What would you like to explore? What’s the next bioinformatics challenge in your research?

121 Course evaluation forms…

