ChIP-seq Methods & Analysis

ChIP-seq Methods & Analysis Gavin Schnitzler

ChIP-seq COURSE OUTLINE
Day 1: ChIP techniques, library production, UCSC browser tracks.
Day 2: QC on reads, mapping binding site peaks, examining read density maps.
Day 3: Analyzing peaks in relation to genomic features, etc.
Day 4: Analyzing peaks for transcription factor binding site consensus sequences.
Day 5: Variants & advanced approaches.

DAY 3 LECTURE OUTLINE (finishing up peak mapping, from Tuesday). Exploring your peak data with Galaxy & Cistrome Analyzing overlaps between peak sets, with galaxy and in UNIX

DAY 2 LECTURE OUTLINE
FASTQC (quality control on reads)
Getting your raw data
  - Exercise: Getting around UNIX, downloading & unpacking
Mapping reads to the genome & identifying binding site peaks
  - Exercise: Running Bowtie & MACS
Visualizing your results
  - Exercise: Custom UCSC browser tracks

Let’s try that again (w/ a streamlined, proven command set)
Open PuTTY & log in to cluster.uit.tufts.edu, then type:
  mkdir chip
  cd chip
  cp /cluster/tufts/cbicourse/ChIPseq/Sample_NGS_data/*.gz .
[Make sure to add the final space & period; the period tells UNIX to keep the same filename & put the copy in the current directory.]
Now repeat this for …/*.txt
Do “ls” to see what you got:
  ip19.fastq --- the raw data for ERalpha ChIP-seq from mouse liver on chromosome 19
  input19.fastq --- the raw data for the corresponding input sample
  workflow1.txt --- lists all of the commands you will use to process your raw sequence data, map reads to the genome & map peaks
Do “cat *.txt”. Now you have all the commands you will use & can copy & paste them (by selecting & then right clicking in PuTTY).

All the commands needed to go from sequence to peaks
  gunzip *.fastq.gz
  bsub -oo ip19.bowtieinfo /cluster/shared/gschni01/bowtie*/bowtie -n 1 -m 1 -5 8 --best --strata mm9 ip19.fastq ip19.map
  bsub -oo input19.bowtieinfo /cluster/shared/gschni01/bowtie*/bowtie -n 1 -m 1 -5 8 --best --strata mm9 input19.fastq input19.map
  awk 'OFS="\t" {print $4, $5, $5+length($6),$1,".",$3}' input19.map > input19.bed
  awk 'OFS="\t" {print $4, $5, $5+length($6),$1,".",$3}' ip19.map > ip19.bed
  module add python/2.6.5
  export PYTHONPATH=/cluster/shared/gschni01/lib/python2.6/site-packages:$PYTHONPATH
  export PATH=/cluster/shared/gschni01/bin:$PATH
  bsub -oo ipvinput19.macsinfo macs14 --format=BED --bw=210 --keep-dup=1 -B -S -c input19.bed -t ip19.bed --name ipvinput19
  cd *aph/treat
  gunzip *.gz
  bsub perl /cluster/home/g/s/gschni01/perl_programs/Select_bdgs_for_beds.pl ip19_trim_norm.bdg all ipvinput19_treat_afterfiting_all.bdg 0.275955
  cd ../control
  bsub perl /cluster/home/g/s/gschni01/perl_programs/Select_bdgs_for_beds.pl input19_trim_norm.bdg all ipvinput19_control_afterfiting_all.bdg 0.364561

Mapping reads to a genome
Understanding the bowtie command (which you’ll have cut & pasted from your screen):
  bsub -oo ip19.bowtieinfo /cluster/shared/gschni01/bowtie*/bowtie -n 1 -m 1 -5 8 --best --strata mm9 ip19.fastq ip19.map
bsub -oo ip19.bowtieinfo submits the batch job and names an output/error file.
/cluster/shared/gschni01/bowtie*/bowtie gives the path to the bowtie program & tells UNIX to run it.
-n 1 tells Bowtie to accept no more than 1 mismatch between the first 25 bp of a sequence read & its best match in the genome.
-m 1 tells Bowtie to reject any read that matches more than 1 location in the genome (since we wouldn’t know which locus our read really came from).
-5 8 tells Bowtie to trim the first 8 (lower quality) bases from the 5' end of the read before mapping.
--best & --strata tell Bowtie to report only the alignment(s) with the fewest mismatches.
mm9 names the mouse genome index to map against.
[name].fastq is your input file & [name].map specifies the name of the output file.

How did Bowtie do?
Check your .bowtieinfo bsub output files with “head *.bowtieinfo”. The lines you’re interested in are the ones before the ---------- line (after which info on the bsub run itself is given):
  ==> LiE_ERaIP_chr19.bowtieinfo <==
  # reads processed: 372435
  # reads with at least one reported alignment: 370513 (99.48%)
  # reads that failed to align: 554 (0.15%)
  # reads with alignments suppressed due to -m: 1368 (0.37%)
Note that almost all of the reads aligned to a sequence in the genome; very few failed to map & very few matched more than 1 genomic sequence (-m 1). This is great, but atypical: it only looks this good because I filtered the .fastq files for things that mapped to chr19…
The actual data for all chromosomes looks like:
  # reads processed: 23090611
  # reads with at least one reported alignment: 16276870 (70.49%)
  # reads that failed to align: 1416679 (6.14%)   [should be very low, unless you have contamination with non-mouse sequence]
  # reads with alignments suppressed due to -m: 5397062 (23.37%)   [typical level, due to repeat sequences in mammalian genomes]
  Reported 16276870 alignments to 1 output stream(s)

Using awk to change from .map to .bed format
Understanding your awk command:
  awk 'OFS="\t" {print $4, $5, $5+length($6),$1,".",$3}' ip19.map > ip19.bed
OFS="\t" tells awk to output tab-delimited data. (Use double quotes around \t: the whole awk program is already wrapped in single quotes, so single quotes inside it would end the program early.)
The print command says: print these data columns in order: #4:chromosome, #5:start_bp, #5:start_bp+length(#6:sequence)=end_bp, #1:identifier, “.” as a placeholder & #3:strand.
awk would normally print to the screen, but here we redirect the output to create a new .bed file (> can be used with any other UNIX command too!).
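For readers more comfortable in Python, the same column shuffle can be sketched as below. This is only an illustration: it assumes, as the awk command does, that splitting each .map line on whitespace puts the identifier in field 1, the strand in field 3, the chromosome in field 4, the start in field 5 and the read sequence in field 6. The function name and the sample line are made up.

```python
def map_line_to_bed(line):
    """Convert one whitespace-delimited .map line into a tab-delimited BED line."""
    fields = line.split()
    read_id = fields[0]          # awk $1: identifier
    strand = fields[2]           # awk $3: strand
    chrom = fields[3]            # awk $4: chromosome
    start = int(fields[4])       # awk $5: start_bp
    end = start + len(fields[5]) # awk $5 + length($6): end_bp
    return "\t".join([chrom, str(start), str(end), read_id, ".", strand])


# Made-up example line laid out the way the awk command expects.
example = "read1 tag + chr19 3000100 ACGTACGTAC IIIIIIIIII 0"
print(map_line_to_bed(example))
```

The end coordinate is simply the start plus the read length, which is all the awk one-liner computes as well.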

How do peak-finders map binding sites? Fragments are of a range of sizes & contain the TF binding site at a (mostly) random position within them. Reads are read (randomly) from left or right edges (sense or antisense) of fragments. Thus peak for sense tags will be 1/2 the fragment length upstream… Binding site position = mid-way between sense tag peak & antisense tag peak. To get binding site peak, shift sense downstream by ½ fragsize & antisense upstream by ½ fragsize. Adapted from slide set by: Stuart M. Brown, Ph.D., Center for Health Informatics & Bioinformatics, NYU School of Medicine & from Jothi, et al. Genome-wide identification of in vivo protein–DNA binding sites from ChIP-Seq data. NAR (2008), 36: 5221-31
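The shifting step can be sketched in a few lines of Python. This is a toy illustration of the idea, not the code any peak-finder actually runs; the function name, the coordinates and the 200 bp fragment size are assumptions chosen for the example.

```python
def shift_tag(start, end, strand, fragsize):
    """Estimate the binding-site position implied by one mapped read.

    Sense (+) reads begin at the left edge of the fragment, so they are
    shifted downstream by half the fragment size; antisense (-) reads begin
    at the right edge, so they are shifted upstream by the same amount.
    """
    half = fragsize // 2
    if strand == "+":
        return start + half
    return end - half


# Toy fragment of 200 bp spanning 900-1100 with the binding site at 1000.
sense_read = (900, 936, "+")       # read from the left edge
antisense_read = (1064, 1100, "-") # read from the right edge
print(shift_tag(*sense_read, fragsize=200))
print(shift_tag(*antisense_read, fragsize=200))
```

Both reads land on the same coordinate after shifting, which is why the sense and antisense tag peaks merge into a single binding-site peak.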

cp /cluster/tufts/cbicourse/ChIPseq/Sample_NGS_data/*.fastq.gz . gunzip *.fastq.gz bsub -oo ip19.bowtieinfo /cluster/shared/gschni01/bowtie*/bowtie -n 1 -m 1 -5 8 --best --strata mm9 ip19.fastq ip19.map bsub -oo input19.bowtieinfo /cluster/shared/gschni01/bowtie*/bowtie -n 1 -m 1 -5 8 --best --strata mm9 input19.fastq input19.map awk 'OFS="\t" {print $4, $5, $5+length($6),$1,".",$3}' input19.map > input19.bed awk 'OFS="\t" {print $4, $5, $5+length($6),$1,".",$3}' ip19.map > ip19.bed module add python/2.6.5 export PYTHONPATH=/cluster/shared/gschni01/lib/python2.6/site-packages:$PYTHONPATH export PATH=/cluster/shared/gschni01/bin:$PATH bsub -oo ipvinput19.macsinfo macs14 --format=BED --bw=210 --keep-dup=1 -B -S -c input19.bed -t ip19.bed --name ipvinput19 cd *aph/treat gunzip *.gz bsub perl /cluster/home/g/s/gschni01/perl_programs/Select_bdgs_for_beds.pl ip19_trim_norm.bdg all ipvinput19_treat_afterfiting_all.bdg 0.275955 cd ../control bsub perl /cluster/home/g/s/gschni01/perl_programs/Select_bdgs_for_beds.pl input19_trim_norm.bdg all ipvinput19_control_afterfiting_all.bdg 0.364561

Mapping binding peaks w/ MACS
Understanding the commands used for MACS:
  module load python/2.6.5
This tells the cluster to use an optional version of python.
  export PYTHONPATH=/cluster/shared/gschni01/lib/python2.6/site-packages:$PYTHONPATH
  export PATH=/cluster/shared/gschni01/bin:$PATH
These tell UNIX where to find the libraries & programs needed to run MACS.
Reference: Using MACS to identify peaks from ChIP-Seq data. Feng J, Liu T, Zhang Y. Curr Protoc Bioinformatics. 2011 Jun;Chapter 2:Unit 2.14. doi: 10.1002/0471250953.bi0214s34.

MACS parameters
Now, let’s run MACS using our input file as control (after -c) and our ip file as the ‘treatment’ or experimental file (after -t).
  bsub -oo ipvinput19.macsinfo macs14 --format=BED --bw=210 --keep-dup=1 -B -S -c input19.bed -t ip19.bed --name ipvinput19
--format=BED tells MACS that the input file is in .bed format.
--bw=210 tells MACS the expected size of sequenced fragments (before addition of linkers, which add an additional ~90 bp); MACS uses this value when it attempts to build a model from sense and antisense sequence reads.
--keep-dup=1 instructs MACS to consider only the first instance of a read starting at any given genomic base pair coordinate & pointing in the same direction, assuming that additional reads starting at the same base pair are due to amplified copies of the same ChIP fragment in the library. (By default MACS estimates the number of duplicates that are likely to arise by linear amplification of all fragments from a limited starting sample, and sets the threshold to cut out replicate reads with a much higher number, which are likely artifacts; keep-dup=1 is even cleaner.)
-B tells MACS to make a bedGraph file of read density at each base pair (which can be used to visualize the results on the UCSC browser) & -S tells MACS to make a single bedGraph file instead of one for each chromosome.
--name gives the prefix name for all output files.

Examine your MACS output
Start with your .macsinfo bsub -oo file:
  vi LiE_ERaIPvINPUT_chr19.macsinfo
Use the arrow keys to go to the top, where you’ll see all of the parameters you put in to run MACS. After some runtime info (including possible warnings, which you can ignore if there are not millions of them), you’ll see:
  INFO @ Sun, 10 Feb 2013 21:27:51: #1 total tags in treatment: 370513
  INFO @ Sun, 10 Feb 2013 21:27:51: #1 user defined the maximum tags...
  INFO @ Sun, 10 Feb 2013 21:27:51: #1 filter out redundant tags at the same location and the same strand by allowing at most 1 tag(s)
  INFO @ Sun, 10 Feb 2013 21:27:51: #1 tags after filtering in treatment: 275955
  INFO @ Sun, 10 Feb 2013 21:27:51: #1 Redundant rate of treatment: 0.26
This is useful information. It tells you how many different reads you had (out of all of the reads which mapped to only one place in the mouse genome, from Bowtie). You want this number to be high and the “redundant rate” to be low. (You’ll need the “tags after filtering” numbers later, so jot them down somewhere.)
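If you would rather not jot the numbers down by hand, a short Python sketch like the one below can pull them out of the log text. It only relies on the two treatment lines quoted above; the variable and function names are made up, and it has not been checked against other MACS versions.

```python
import re

# The relevant lines, copied from the .macsinfo excerpt above.
LOG = """\
INFO @ Sun, 10 Feb 2013 21:27:51: #1 total tags in treatment: 370513
INFO @ Sun, 10 Feb 2013 21:27:51: #1 tags after filtering in treatment: 275955
INFO @ Sun, 10 Feb 2013 21:27:51: #1 Redundant rate of treatment: 0.26
"""


def treatment_tag_counts(log_text):
    """Return (total tags, tags after filtering) for the treatment sample."""
    total = int(re.search(r"total tags in treatment: (\d+)", log_text).group(1))
    kept = int(re.search(r"tags after filtering in treatment: (\d+)", log_text).group(1))
    return total, kept


total, kept = treatment_tag_counts(LOG)
redundant_rate = 1 - kept / total  # fraction of tags removed as duplicates
millions_kept = kept / 1e6         # millions of non-duplicated mapped reads
print(round(redundant_rate, 2), millions_kept)
```

The rounded redundant rate agrees with the 0.26 that MACS reports, and the "millions kept" value is the kind of number you will need later for normalization.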

Using duplication levels to estimate your library size
Assuming you have 100 initial fragments in your library (before amplification) & which fragment gets read is random:

  #seqs read:           25    50    75    100   150   200
  # diff reads:         23    37    52    63    78    87
  % duplicated:         9%    27%   33%   43%   55%   69%
  x-more left in lib:   4.3   2.7   1.9   1.6   1.3   1.15
  x-more than prev:     -     1.6   1.4   1.2   1.24  1.11

Thus, if you have low % duplicates (e.g. 9%) in one lane, adding an additional run of the same number of reads will give you 1.6x more, or 2 additional runs will give you 2.2x more (1.6*1.4).
…but if you have a high % duplicates (e.g. 43%), adding one more lane will only give you 1.37x more unique reads than you had initially. This indicates that your library has low complexity, probably because too few fragments from your ChIP survived to the library amplification step.
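The “# diff reads” row can be approximated with a closed-form expectation. The sketch below assumes every read is an independent, uniformly random draw (with replacement) from the library, which is a simplification; the function name is made up, and the results are expected values, so they will be close to the table entries rather than identical.

```python
def expected_distinct(library_size, reads):
    """Expected number of different fragments seen after `reads` random draws.

    Each fragment is missed by a single draw with probability 1 - 1/N, so it
    is missed by all draws with probability (1 - 1/N) ** reads.
    """
    return library_size * (1 - (1 - 1 / library_size) ** reads)


for n in (25, 50, 75, 100, 150, 200):
    distinct = expected_distinct(100, n)
    # The last column is "x-more left in lib": library size / distinct reads so far.
    print(n, round(distinct, 1), round(100 / distinct, 2))
```

Run backwards, the same relationship is what lets you guess the library size from the duplication level of a single lane.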

MACS ‘shiftsize’ model
Keep scrolling down your .macsinfo file…
  INFO @ Sun, 10 Feb 2013 21:27:51: #2 Build Peak Model...
  INFO @ Sun, 10 Feb 2013 21:27:51: #2 number of paired peaks: 0
  WARNING @ Sun, 10 Feb 2013 21:27:51: Too few paired peaks (0) so I can not build the model! Broader your MFOLD range parameter may erase this error. If it still can't build the model, please use --nomodel and --shiftsize 100 instead.
  WARNING @ Sun, 10 Feb 2013 21:27:51: Process for pairing-model is terminated!
  WARNING @ Sun, 10 Feb 2013 21:27:51: #2 Skipped...
  WARNING @ Sun, 10 Feb 2013 21:27:51: #2 Use 100 as shiftsize, 200 as fragment length
Here MACS tried to estimate the “shift size” for moving sense & antisense reads to get a final peak position, by identifying sets of strong + & - strand peaks at a certain distance from each other. There wasn’t enough info on chromosome 19 to do this, so it made a guess that the fragment size was 200 & shiftsize was 100. 200 is close enough to the actual fragment size of ~150 bp that we can go with this.

MACS model file
This is the result I got when I ran MACS with all chromosomes:
  #2 Build Peak Model...
  #2 number of paired peaks: 683
  Fewer paired peaks (683) than 1000! Model may not be build well! Lower your MFOLD parameter may erase this warning. Now I will use 683 pairs to build model!
  finished! predicted fragment length is 125 bps
  Generate R script for model : LiE_IP_v_INPUT_11_2012_dup1_model.r
  Call peaks...
  use control data to filter peak candidates...
  Finally, 9504 peaks are called!
  find negative peaks by swapping treat and control
  Finally, 337 peaks are called!
d = estimated fragment size. The actual size is ~150 bp, so this is not perfect, suggesting a bit more tweaking could be useful.
To generate the model plot you will need to go into R and enter source("MACS_output_file.r") (R is case-sensitive, so source must be lowercase), which will generate a .pdf.

Peaks & negative peaks
Keep scrolling down your .macsinfo file until you find…
  INFO @ Sun, 10 Feb 2013 21:36:47: #3 Finally, 364 peaks are called!
  INFO @ Sun, 10 Feb 2013 21:36:47: #3 find negative peaks by swapping treat and control
  INFO @ Sun, 10 Feb 2013 21:36:52: #3 Finally, 36 peaks are called!
  INFO @ Sun, 10 Feb 2013 21:36:52: #4 Write output…
This is the pay-off, where MACS identifies your ER alpha peak locations! 364 peaks on chromosome 19 (which is ~1/50th of the genome) suggests ~20,000 peaks for the whole genome, which is not bad!
Equally critical, MACS now swaps treat & control (pretending your INPUT data is your IP & your ChIP data is your input) and looks again for peaks. The number of “negative” peaks found in this way should be far fewer than the positive peaks, and the 10:1 ratio here is fine.

WinSCP (SFTP/FTP software for Windows): http://winscp.net/eng/index

Looking at MACS data in Excel Using WinSCP move the _peaks.xls file to the PC & open it.

Troubleshooting MACS
For details on how to troubleshoot many problems in MACS, see the file ChIPseq_analysis_methods_2013_02_11.doc on the cbi website. Briefly…
MACS can’t build a model:
  - Adjust the mfold values (the fold-over-background ranges MACS considers for paired peaks).
  - Tell MACS not to build a model, but instead to use the shiftsize you specify.
Peaks/Negative Peaks ratio is poor or too few peaks are detected:
  - Adjust model settings to see if you can improve both. Otherwise, you may have to conclude that 1) your library was no good or 2) the factor just doesn’t bind to many places in the genome.

Trimming .bdg files
With the -B & -S options, MACS generated a bedGraph file that can be used to visualize your combined read density information (with + & - reads shifted by shiftsize) in the UCSC browser.
MACS gets too enthusiastic, however, and occasionally places the end of a read past what the UCSC browser thinks is the end of a chromosome (causing the UCSC browser to reject the whole file). To avoid this, you need to trim your .bdg files to remove anything past chromosome ends.
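The trimming idea can be sketched in Python as below. This is not the course's Perl script, whose internals are not shown here; it is a minimal illustration that assumes four-column bedGraph lines and a dictionary of chromosome lengths that you supply yourself. The function name and the toy chromosome length of 1000 bp are made up.

```python
def trim_bedgraph(lines, chrom_sizes):
    """Clip bedGraph intervals so that nothing extends past a chromosome end.

    Lines starting with "track" are passed through unchanged. Intervals on
    chromosomes missing from chrom_sizes, or starting at or beyond the
    chromosome end, are dropped.
    """
    trimmed = []
    for line in lines:
        if not line.strip():
            continue
        if line.startswith("track"):
            trimmed.append(line.rstrip("\n"))
            continue
        chrom, start, end, value = line.split()[:4]
        size = chrom_sizes.get(chrom)
        if size is None or int(start) >= size:
            continue
        end = min(int(end), size)  # clip the interval at the chromosome end
        trimmed.append(f"{chrom}\t{start}\t{end}\t{value}")
    return trimmed


toy_sizes = {"chr19": 1000}  # toy length for illustration only
toy_lines = ["chr19\t10\t20\t2", "chr19\t900\t1010\t3", "chr19\t1010\t1050\t1"]
print(trim_bedgraph(toy_lines, toy_sizes))
```

For real data the chromosome lengths would have to come from the genome build you mapped against (mm9 here).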

Normalizing .bdg files
If you sequenced 100 M reads (A) you may have a peak that is 200 reads at its apex. But if you only took a subsample of 10 M reads (B), that peak would be only ~20 reads at its apex.
To compare (A) & (B), just divide by the # of million mapped reads… now both peaks have a max of 2.
The same is true when comparing across samples: normalizing to “reads per million mapped reads” (RPMR) lets you directly compare peak intensity across samples & conditions.
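The arithmetic is simple enough to show in a few lines of Python. This sketch assumes the bedGraph rows have already been read into (chromosome, start, end, value) tuples; the function name and the toy coordinates are made up, while the read counts are the ones from the example above.

```python
def normalize_bedgraph(rows, millions_mapped):
    """Scale the value column of (chrom, start, end, value) rows to reads per million."""
    return [(chrom, start, end, value / millions_mapped)
            for chrom, start, end, value in rows]


# Peak apex of 200 reads in a 100 M read sample vs. 20 reads in a 10 M subsample.
sample_a = normalize_bedgraph([("chr19", 1000, 1050, 200)], 100)
sample_b = normalize_bedgraph([("chr19", 1000, 1050, 20)], 10)
print(sample_a[0][3], sample_b[0][3])
```

After scaling, the two apex values are identical, so the tracks can be compared directly.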

Millions of non-duplicated mapped reads reported in .macsinfo
In the trimming & normalizing commands, the last number on each Select_bdgs_for_beds.pl line is the number of millions of non-duplicated mapped reads for that sample (the “tags after filtering” value from .macsinfo, divided by 1,000,000):
  cd *aph/treat
  gunzip *.gz
  bsub perl /cluster/home/g/s/gschni01/perl_programs/Select_bdgs_for_beds.pl ip19_trim_norm.bdg all ipvinput19_treat_afterfiting_all.bdg 0.275955
  gzip *.bdg
  cd ../control
  bsub perl /cluster/home/g/s/gschni01/perl_programs/Select_bdgs_for_beds.pl input19_trim_norm.bdg all ipvinput19_control_afterfiting_all.bdg 0.364561

Uploading to UCSC browser
Use WinSCP to move your .gz-compressed .bdg files & the …peaks.bed file that MACS generated to your PC.
Go to http://genome.ucsc.edu
Select the mouse mm9 genome & hit enter.
Click on add custom tracks.
Select each of these files & upload them.
Explore! Get a sense of what the data looks like.
Important sanity check: called peaks should be clearly evident in the .bdg data.

Galaxy & Cistrome MAIN GALAXY SITE: https://main.g2.bx.psu.edu/ GALAXY/CISTROME (specialized for ChIP-seq data): http://cistrome.dfci.harvard.edu/ap/root

Useful Galaxy Tools https://main.g2.bx.psu.edu/
Get data -> upload file: To get your data into Galaxy. For .fastq data files it’s best to give the ftp server URL (from right click, or control click on a Mac, on the link provided by your core); you will need to sign up for a free account for ftp file transfers.
Liftover: To convert coordinates between genomes or different builds of the same genome (can also do in the UCSC browser).
Text manipulation: Add lines, rearrange columns, etc. Functional, but limited and very unwieldy.
Convert formats: Useful, but doesn’t cover everything.
Fetch sequences: Get DNA sequence, just like the UCSC table browser.
Operate on genomic intervals -> determine intersections between sets of regions, etc.

Useful Galaxy Tools https://main.g2.bx.psu.edu/ NGS Tools: QC and manipulation -> run FASTQC Mapping -> map reads to genome with Bowtie or BWA SAM & BAM: Convert between & manipulate SAM and BAM format files often required for certain programs. Peak Calling: MACS & a few others BED Tools: Convert between BAM & BED and manipulate .bed files. On the face of it, this looks powerful, but it is VERY slow. My quick benchmark, downloading a 1.5 Gb .fastq.gz raw data file that took 13 secs to download to the cluster took >30 minutes to upload to Galaxy.

GALAXY/CISTROME http://cistrome.dfci.harvard.edu/ap/root
Galaxy tools specially designed for ChIP-seq analysis. Most things you can find elsewhere, but Cistrome allows easy access to many analyses that give you some quick insights into your data.
Sign up for the free account, so we can explore what Cistrome offers.
…“cistrome” is a term coined by Myles Brown’s lab at DFCI, which refers to the genome-wide distribution of a transcription factor on chromatin.

Cistrome-specific tools CISTROME TOOLBOX Data Preprocessing: Run MACS & variants, as well as some designed for ChIP-chip. Integrative analysis: CORRELATION-> Venn Diagram (overlaps of 2 to 3 peak coordinate datasets) Use WinSCP to move the 3 .bed files from: /cluster/tufts/cbicourse/ChIPseq/Sample_NGS_data/ERa_cistrome_beds …to your PC & then upload them to Cistrome. Select Venn Diagram & select these 3 files in the drop-down menus.

Cistrome-specific tools CISTROME TOOLBOX: ASSOCIATION STUDY CEAS (cis element annotation system) Provides quick info on the distribution of your TF peaks relative to chromosomes, gene start & end sites, exons, etc. Select CEAS & then select one of your uploaded files (by number) in the dropdown menu.

Cistrome-specific tools CISTROME TOOLBOX: ASSOCIATION STUDY
GCA: Gene centered annotation. Finds the nearest interval in the given interval set for every annotated coding gene (e.g. where’s the nearest ER binding site for each gene in the genome).
peak2gene: Peak Center Annotation. Input a peak file, and it will search each peak on UCSC GeneTable to get the refGenes near the peak center (e.g. where’s the nearest gene to each ERa binding site in the genome).

Cistrome-specific tools CISTROME TOOLBOX: ASSOCIATION STUDY
Conservation Plot: Calculates the PhastCons scores in several interval sets.
Functional transcription factor binding sites are expected to have consensus elements for that factor (and/or partnering factors that help recruit it or stabilize it on chromatin). This is less true, of course, for histone modifications, which may spread for some distance from initial recruiting factors.
Choose conservation plot & feed one or more of your bed files to it. One way to tell if you have improved your peak calling (e.g. by tweaking MACS parameters) is if the conservation at the center of your peaks goes up.

Overlaps in Cistrome or Galaxy The Venn Diagram function gave you some indication of the degree of overlap between your .bed file datasets – but this is only a top level analysis. Operate On Genomic Intervals-> Intersect This lets you create a new .bed file which has only the regions that intersect between two datasets. Overlapping Pieces of Intervals: (saves only the regions shared between 1 & 2) Overlapping Intervals: (saves complete intervals from file 1 that overlap anything in file 2)
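The difference between the two Intersect modes is easy to see in a small Python sketch. This is an illustration of the concept only, not Galaxy's implementation: it assumes all intervals are on the same chromosome, uses half-open (start, end) pairs as in .bed files, and compares every pair of intervals, which is fine for a toy example but slow for real peak sets. The function names and coordinates are made up.

```python
def overlapping_pieces(set1, set2):
    """Return only the sub-regions shared between intervals in set1 and set2."""
    pieces = []
    for start1, end1 in set1:
        for start2, end2 in set2:
            start, end = max(start1, start2), min(end1, end2)
            if start < end:  # non-empty intersection
                pieces.append((start, end))
    return pieces


def overlapping_intervals(set1, set2):
    """Return complete intervals from set1 that overlap anything in set2."""
    return [(start1, end1) for start1, end1 in set1
            if any(start1 < end2 and start2 < end1 for start2, end2 in set2)]


file1 = [(100, 200), (300, 400), (500, 600)]
file2 = [(150, 350), (700, 800)]
print(overlapping_pieces(file1, file2))
print(overlapping_intervals(file1, file2))
```

The first mode keeps just the shared stretches, while the second keeps the whole file 1 interval whenever any part of it is shared.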

How can we tell whether overlaps are significantly greater than chance?
Go to the cluster & copy those same 3 files into your /cluster/shared/userID/chip folder:
  cp /cluster/tufts/cbi*/Ch*/Sam*/ER*beds/*.bed .
Now, let’s assess whether the overlap of ERalpha binding sites between liver and aorta is greater than expected by chance using:
  bsub perl /cluster/home/g/s/gschni01/perl*/overlap_1.3.pl AoE_all.bed LiE_all.bed -outfile AoE_v_LiE.overlap
This program identifies every case where .bed regions in file1 overlap .bed regions in file2 & estimates the frequency expected by chance.

Assigning a p value
We have 3 bits of count frequency information: the number of overlaps, the number of regions compared, and the expected background frequency. This type of data is like coin tosses & is ideally suited for a binomial test, which uses “number of matches”, “number of tests” and “expected background frequency” to calculate p values.
If you flip a coin, say, 10 times and it comes up heads 6 out of 10 (frequency 0.6 vs. expected 0.5), that would not seem unlikely, and a binomial test would tell you this (p = .74).
However, if you flip a coin 1000 times & get heads 600 out of 1000, that would seem a bit odd, and the binomial test would indicate this by returning a very low probability of seeing a result this extreme if the null hypothesis (that the frequency of heads is 0.5) were true (p = 2e-10).
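To see what a binomial test is doing under the hood, here is a minimal Python sketch of an exact two-sided test. It sums the probabilities of all outcomes that are no more likely than the observed one, which is one common definition of the two-sided p value; the function names are made up, and the sketch assumes the background frequency is strictly between 0 and 1.

```python
from math import exp, lgamma, log, log1p


def binom_pmf(k, n, p):
    """Probability of exactly k matches in n tests at background frequency p."""
    log_coef = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coef + k * log(p) + (n - k) * log1p(-p))


def binom_two_sided(k, n, p):
    """Exact two-sided binomial p value.

    Adds up the probabilities of every outcome that is no more likely than
    the observed count (with a small tolerance for floating-point ties).
    """
    observed = binom_pmf(k, n, p)
    total = 0.0
    for i in range(n + 1):
        prob = binom_pmf(i, n, p)
        if prob <= observed * (1 + 1e-7):
            total += prob
    return min(1.0, total)


print(binom_two_sided(6, 10, 0.5))      # 6 heads out of 10
print(binom_two_sided(600, 1000, 0.5))  # 600 heads out of 1000
```

The first call gives a p value of about 0.75 and the second a value on the order of 1e-10, in line with the two coin-toss examples. Working with log probabilities keeps the calculation stable for large numbers of tests.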

A brief foray into R
Looking at the overlap program results, we know that there were 1653 overlaps between Aorta & Liver ER sites, out of 8260 Aorta regions tested (our number of tests), and the background frequency was 95.11/8260.
To run our binomial test we’ll want to start up the R statistical programming language, by typing:
  module load R
If you just type R now, you get this message:
  To run R please invoke the following command to run it via LSF's interactive queue: bsub -Ip -q int_public6 R
Do what it suggests & you’ll get welcome information in R.

Binomial tests for overlaps Now ask R to run your binomial test by typing: binom.test(1653, 8260, 95.11/8260) The p.value is <2e-16. Very low. So, yes, ER binding sites in liver and aorta overlap more than expected by chance… but ERa is still binding to ~80% different places between these two tissues. Now exit R by typing “q()” & saying “n” to the question about saving. Binomial tests are useful for many different types of count data & they will also give you probabilities for ANTI-enrichment as well as enrichment.

Getting R (for your PC) R: http://cran.r-project.org/ RStudio: http://www.rstudio.com/ Install RStudio after you have installed R. For more info on using R & Unix see: http://sites.tufts.edu/cbi/resources/rna-seq-course/ UNIX resources & R resources

Overlaps between peaks & genes
In that same folder there are also .txt files listing the transcription start sites (TSSes) of genes that were up- or down-regulated by estrogen in aorta or liver. Get them by typing:
  cp /cluster/tufts/cbi*/Ch*/Sam*/ER*beds/*.txt .
Take a look at one of them using head [name].txt:
  chr6 73171625 - Dnahc6
  chr2 25356026 - C8g
  chr6 65540391 + Tnip3
  …
The file format is (tab-delimited) chromosome, TSS, transcription direction (+=sense) & geneID.
You can get all this info easily from the UCSC browser, for individual genes (by hand)…
… or you can get this information for all genes & extract what you want for your gene set of interest. Check out the RNA-seq module for info on making & handling .gtf files.

Overlaps between peaks & genes 2
The overlap program can recognize this type of file & will test for overlaps between ChIP-seq peaks and regions around the listed TSSes (default +/-1000 bp). You can also change this range by specifying a -range value.
Find the overlaps between 10-kb regions around TSSes of genes up- or downregulated in each tissue & the corresponding ER binding site data using variations on:
  bsub perl /cluster/home/g/s/gschni01/perl*/overlap_1.3.pl Ao_up_TSS.txt AoE_all.bed -outfile Ao_up_v_AoE.overlap
Note the number of overlaps, the number of genes (= number of tests) and the number of overlaps expected by chance, then start up R and use binomial tests to determine whether there is significant enrichment for each comparison. What conclusions can you draw?

An important note on Data Storage .fastq files are huge (too big for CDs or, for more than a few, your PC hard drive). So are many of the analysis files (like your .map & .bed files). You can request extra storage space on the cluster – for more info go to: https://wikis.uit.tufts.edu/confluence/display/TuftsUITResearchComputing/Storage Even that fills up fast: I’d recommend buying an external >1 Terabyte hard drive (~$200 or less).

Broad IGV (“Integrative Genomics Viewer”), an alternative to UCSC browser http://www.broadinstitute.org/igv/ You will need to register, but they don’t send you spam.

ChIP-seq Methods & Analysis Gavin Schnitzler Asst. Prof. Medicine TUSM, Investigator at MCRI, TMC gschnitzler@tuftsmedicalcenter.org 617-636-0615

ChIP-seq COURSE OUTLINE
Day 1: ChIP techniques, library production, UCSC browser tracks.
Day 2: QC on reads, mapping binding site peaks, examining read density maps.
Day 3: Analyzing peaks in relation to genomic features, etc.
Day 4: Analyzing peaks for transcription factor binding site consensus sequences.
Day 5: Variants & advanced approaches.

DAY 4 OUTLINE Position weight matrices to find transcription factor binding sites (TFBSes) TFBS enrichment in peaks using CentDist TFBS enrichment using Storm in UNIX Mining Storm results Disambiguating similar matrices w/ STAMP

DAY 3 REMNANT Analyzing overlaps between peak & regulated genes in UNIX

How can we test the significance of binding site association w/ regulated genes? If you haven’t already, go to the cluster & move bed and txt files to your /cluster/shared/userID/chip folder (mkdir chip & cd chip if you don’t have this folder yet): cp /cluster/tufts/cbi*/Ch*/Sam*/ER*beds/*.* . The .txt files list the transcription start sites (TSSes) of genes that were up- or down-regulated by estrogen in aorta or liver (by RNA-seq analysis).

Overlaps between peaks & genes Take a look at one of them using head [name].txt:
chr6 73171625 - Dnahc6
chr2 25356026 - C8g
chr6 65540391 + Tnip3
…
The file format is (tab-delimited) chromosome, TSS, transcription direction (+ = sense) & gene ID. You can get all this info from the UCSC browser for individual genes (by hand), or you can get it for all genes & extract what you want for your gene set of interest. Check out the RNA-seq module for info on making & handling .gtf files.

Overlaps between peaks & genes 2 The overlap program can recognize this type of file & will test for overlaps between ChIP-seq peaks and regions around the listed TSSes (default +/-1000 bp). You can change this range by specifying a -range variable. Find the overlaps between 10-kb regions around TSSes of genes up- or downregulated in each tissue & the corresponding ER binding site data using variations on: bsub perl /cluster/home/g/s/gschni01/perl*/overlap_1.3.pl Ao_up_TSS.txt AoE_all.bed -outfile Ao_up_v_AoE.overlap (these commands are in /cluster/tufts/cbi*/Ch*/Sam*/Fin*/workflow2.txt) The number of overlaps (hits), the number of genes (tests) and the number of overlaps expected by chance divided by the number of genes (background frequency) provide all the information you need for binomial tests. Note these numbers down for each comparison.
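To make the overlap idea concrete, here is a minimal Python sketch of counting genes whose TSS window touches a peak. It is not the overlap_1.3.pl program, whose internals are not shown in these slides. The function names and the toy peak coordinates are made up for illustration, and the TSS rows are the ones shown in the head output.

```python
def parse_tss(lines):
    """Parse 'chrom TSS strand gene' rows into (chrom, position, gene) tuples."""
    genes = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 4:
            genes.append((fields[0], int(fields[1]), fields[3]))
    return genes


def genes_with_peak(genes, peaks, flank=1000):
    """Return the genes whose TSS +/- flank window overlaps at least one peak."""
    hits = []
    for chrom, tss, gene in genes:
        lo, hi = tss - flank, tss + flank
        for p_chrom, p_start, p_end in peaks:
            # Two intervals overlap when each starts before the other ends.
            if p_chrom == chrom and p_start <= hi and p_end >= lo:
                hits.append(gene)
                break
    return hits


# TSS rows from the slide; the peaks are hypothetical toy values.
tss_lines = [
    "chr6\t73171625\t-\tDnahc6",
    "chr2\t25356026\t-\tC8g",
    "chr6\t65540391\t+\tTnip3",
]
toy_peaks = [("chr6", 73172000, 73172300), ("chr2", 30000000, 30000400)]

genes = parse_tss(tss_lines)
overlapping = genes_with_peak(genes, toy_peaks, flank=1000)
print(len(overlapping), "of", len(genes), "genes have a peak near the TSS")
```

The number of overlapping genes is the "hits" value and the number of genes is the "tests" value. The background frequency still has to come from the overlap program's chance estimate.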

Accessing the R statistical language On the PCs in this room: Start->programs->R To get R for your PC (free): http://cran.r-project.org/ To get RStudio (allows for easier management of R projects): http://www.rstudio.com/ On the cluster type: module load R Then: bsub -Ip -q int_public6 R To exit use the R command q() For more info on using R & Unix see: http://sites.tufts.edu/cbi/resources/rna-seq-course/ UNIX resources & R resources

Binomial tests in R Use the R command binom.test(hits, tests, bkg_freq) to assess the significance of the overlaps you see. For Ao_down_TSS.txt vs. AoE.bed (2 overlaps among 118 genes, 1.03 expected by chance): binom.test(2, 118, 1.03/118) Which comparisons show significant enrichment? Do any show anti-enrichment?
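For readers without R at hand, the same calculation can be sketched in plain Python. This is an illustrative sketch, not a replacement for binom.test: it computes one-sided tail probabilities (comparable to binom.test with alternative="greater" or alternative="less"), whereas binom.test is two-sided by default. It assumes the slide's example means 2 hits in 118 tests with 1.03 overlaps expected by chance. The function names are my own.

```python
from math import exp, lgamma, log, log1p


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials (0 < p < 1), via log space."""
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log1p(-p))
    return exp(log_pmf)


def binom_upper_tail(hits, tests, bkg_freq):
    """P(X >= hits): small values suggest enrichment over background."""
    return sum(binom_pmf(k, tests, bkg_freq) for k in range(hits, tests + 1))


def binom_lower_tail(hits, tests, bkg_freq):
    """P(X <= hits): small values suggest anti-enrichment."""
    return sum(binom_pmf(k, tests, bkg_freq) for k in range(0, hits + 1))


# Example numbers as read from the slide: 2 hits, 118 tests, 1.03 expected.
p_enriched = binom_upper_tail(2, 118, 1.03 / 118)
p_depleted = binom_lower_tail(2, 118, 1.03 / 118)
print(f"P(X >= 2) = {p_enriched:.3f}, P(X <= 2) = {p_depleted:.3f}")
```

Working in log space keeps the arithmetic stable when the number of tests is in the thousands, where the binomial coefficients become too large for ordinary floating-point multiplication.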

DAY 4 OUTLINE Position weight matrices to find transcription factor binding sites (TFBSes) TFBS enrichment in peaks using CentDist TFBS enrichment using Storm in UNIX Mining Storm results Disambiguating similar matrices w/ STAMP

What is a PWM? Transcription factor binding sites (TFBSs) are usually slightly variable in their sequences. A position frequency matrix (PFM) records how often each base is seen at each position of the motif; dividing by the number of sequences gives the probability of seeing a given base at that position. It is built from sequences known to bind the TF (e.g. 46 sequences for the PFM on this slide). [Table: example PFM; rows A, C, G, T, columns = motif positions 5’ to 3’, plus a consensus row] Adapted from presentation by Victor Jin, Department of Biomedical Informatics, The Ohio State University

PFM -> normalized PFM -> PWM
Binding site data:
acggcagggTGACCc
aGGGCAtcgTGACCc
cGGTCGccaGGACCt
tGGTCAggcTGGTCt
aGGTGGcccTGACCc
cTGTCCctcTGACCc
aGGCTAcgaTGACGt
…
cagggagtgTGACCc
gagcatgggTGACCa
aGGTCAtaacgattt
gGAACAgttTGACCc
cGGTGAcctTGACCc
gGGGCAaagTGACTg
Position frequency matrix (PFM) (also known as raw count matrix): Given N sequence fragments of fixed length, one can assemble a position frequency matrix (number of times a particular nucleotide appears at a given position).
Normalized PFM: each column adds up to a total of one, giving a matrix of probabilities for observing each nucleotide at each position (e.g. divide by 46).
Position weight matrix (PWM) (also known as position-specific scoring matrix): The normalized PFM is converted to log-scale for efficient computational analysis. To eliminate null values before log-conversion, and to correct for small samples of binding sites, a sampling correction, known as pseudocounts, is added to each cell of the PFM.
Adapted from presentation by Victor Jin, Department of Biomedical Informatics, The Ohio State University

Converting a PFM into a PWM For each matrix element compute:
$$p(b,i) = \frac{f_{b,i} + s(b)}{N + \sum_{b' \in \{A,C,G,T\}} s(b')}, \qquad W_{b,i} = \log_2 \frac{p(b,i)}{p(b)}$$
where $f_{b,i}$ is the raw count (PFM matrix element) of nucleotide b in column i, $N$ is the number of sequences used to create the PFM (= column sum), $s(b)$ is the pseudocount (correction for small sample size) and $p(b)$ is the background frequency of nucleotide b.
[Table: Position Weight Matrix for ERE; rows A, C, G, T, one column per motif position. Partially extracted values: A 0.58 -0.44 -0.98 -1.21 -2.29 1.22 -0.60 -2.96 1.62 -0.72; C -1.49 -0.30 1.39 0.78 0.34 0.25 1.76 0.46; G 0.16 1.31 1.44 -0.17 -0.06 0.65 1.79 -0.64; T 0.96 -0.78 1.73 -1.84 0.23]
Adapted from presentation by Victor Jin, Department of Biomedical Informatics, The Ohio State University
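The conversion can be sketched in a few lines of Python using the four ingredients named on the slide: raw counts, the column sum N, pseudocounts and background frequencies. This is a minimal sketch under stated assumptions: the pseudocount of 0.25 per base and the uniform background of 0.25 are my choices, not values from the slides. The four toy sites are the last six bases of four of the binding-site sequences shown earlier.

```python
from math import log2

BASES = "ACGT"


def build_pfm(sites):
    """Count how often each base occurs at each position of equal-length sites."""
    length = len(sites[0])
    pfm = {b: [0] * length for b in BASES}
    for site in sites:
        for i, base in enumerate(site.upper()):
            pfm[base][i] += 1
    return pfm


def pfm_to_pwm(pfm, pseudocount=0.25, background=None):
    """Convert raw counts to log2 odds against the background frequencies."""
    background = background or {b: 0.25 for b in BASES}
    length = len(pfm["A"])
    pwm = {b: [0.0] * length for b in BASES}
    for i in range(length):
        n = sum(pfm[b][i] for b in BASES)            # column sum N
        for b in BASES:
            prob = (pfm[b][i] + pseudocount) / (n + 4 * pseudocount)
            pwm[b][i] = log2(prob / background[b])
    return pwm


toy_sites = ["TGACCC", "TGACCC", "GGACCT", "TGGTCT"]
toy_pfm = build_pfm(toy_sites)
toy_pwm = pfm_to_pwm(toy_pfm)
for base in BASES:
    print(base, [round(w, 2) for w in toy_pwm[base]])
```

Without the pseudocount, any base never seen at a position would give log2(0), which is why the correction is applied before the log conversion.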

Scoring putative EREs by scanning the promoter w/ PWM Slide the matrix along the sequence and, at each offset, sum the matrix entries for the base found at each position. Example site: G G G T C A G C A T G G C C A [Table: ERE position weight matrix, as on the previous slide] Absolute score of the site = 11.57. Rescaled to a 0 to 1 range (0.86 for this site), the score is also called “functional depth”. Adapted from presentation by Victor Jin, Department of Biomedical Informatics, The Ohio State University
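Scanning with a PWM amounts to sliding a window along the sequence and summing one matrix entry per position. The sketch below uses a made-up 3-column matrix rather than the ERE matrix, whose values are incomplete in this transcript. The relative score shown, (score - min) / (max - min), is one common way to rescale to a 0 to 1 range. Whether it matches Storm's exact definition of functional depth is an assumption.

```python
# Hypothetical 3-position PWM, for illustration only.
TOY_PWM = {
    "A": [1.0, -1.0, -2.0],
    "C": [-1.0, -2.0, 1.5],
    "G": [-2.0, 1.2, -1.0],
    "T": [0.5, -0.5, -1.5],
}


def score_site(pwm, site):
    """Absolute score: sum of the matrix entries for the base at each position."""
    return sum(pwm[base][i] for i, base in enumerate(site.upper()))


def relative_score(pwm, site):
    """Rescale the absolute score to 0..1 using the matrix's min and max scores."""
    width = len(pwm["A"])
    best = sum(max(pwm[b][i] for b in "ACGT") for i in range(width))
    worst = sum(min(pwm[b][i] for b in "ACGT") for i in range(width))
    return (score_site(pwm, site) - worst) / (best - worst)


def scan(pwm, sequence, threshold):
    """Return (offset, window, score) for every window scoring >= threshold."""
    width = len(pwm["A"])
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = score_site(pwm, window)
        if score >= threshold:
            hits.append((start, window, score))
    return hits


toy_hits = scan(TOY_PWM, "TTAGCAT", threshold=2.0)
for offset, window, score in toy_hits:
    print(offset, window, round(score, 2), round(relative_score(TOY_PWM, window), 2))
```

A real scan would also score the reverse complement of each window, since a TF can bind either strand. That step is left out here to keep the sketch short.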

Estimating p. values for a match to the matrix Example site: G G G T C A G C A T G G C C A [Table: ERE position weight matrix, as on the previous slides] This sequence had a functional depth (f) of 0.86. The summed probabilities of all sequences with f >= .86 give the p. value for this sequence, i.e. the chance that f >= .86 would be achieved by a randomized DNA sequence. Short matrices can reach f > .9 but still have high p. values, so f is the best measure for short matrices. Long matrices can have very low p. values but still have f < .9, so the p. value is the best measure for long matrices.
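The "summed probabilities of all sequences scoring at least as well" can be computed directly for a short matrix by enumerating every possible sequence. The sketch below does this for a made-up 3-column matrix with a uniform background; both are assumptions for illustration. Exhaustive enumeration grows as 4^width, so real tools use dynamic programming or sampling for long matrices.

```python
from itertools import product

# Hypothetical 3-position PWM, for illustration only.
TOY_PWM = {
    "A": [1.0, -1.0, -2.0],
    "C": [-1.0, -2.0, 1.5],
    "G": [-2.0, 1.2, -1.0],
    "T": [0.5, -0.5, -1.5],
}


def score_pvalue(pwm, observed_score, background=None):
    """Probability that a random sequence scores >= observed_score."""
    background = background or {b: 0.25 for b in "ACGT"}
    width = len(pwm["A"])
    p_value = 0.0
    for seq in product("ACGT", repeat=width):
        score = sum(pwm[base][i] for i, base in enumerate(seq))
        if score >= observed_score - 1e-9:      # tolerance for float rounding
            prob = 1.0
            for base in seq:
                prob *= background[base]        # chance of drawing this sequence
            p_value += prob
    return p_value


best_score = 1.0 + 1.2 + 1.5                    # score of the best site, AGC
print("p for the best site:", score_pvalue(TOY_PWM, best_score))
print("p for a weak score :", score_pvalue(TOY_PWM, 0.0))
```

Even the perfect match to this 3-column matrix has p = 1/64, which illustrates the point on the slide: a short matrix can reach a high functional depth while its p. value stays unimpressive.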

DAY 4 OUTLINE Position weight matrices to find transcription factor binding sites (TFBSes) TFBS enrichment in peaks using CentDist TFBS enrichment using Storm in UNIX Mining Storm results Disambiguating similar matrices w/ STAMP

Preparing for PWM search Launch WinSCP (Start->programs->WinSCP) Navigate to: /cluster/tufts/cbicourse/ChIPseq/Sample_NGS_data/Final_output_files Pull over the “ipvinput19_peaks.xls” file to the PC (this is the MACS output file that we generated yesterday). Open it in Excel.

Making .bed file w/ +/-200 bp around peak summit (where we expect TFBS enrichment will be greatest) In Excel, add three new columns: chr: =same row, chr column; start: =start col+summit-200; end: =start col+summit+200. Copy these 3 columns (without any header row). In WinSCP click on any file on the PC, then on files->new->file & provide a name (“LiE_chr19_400bp.bed”) to edit a new simple text file. Paste, save & close.
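The same spreadsheet step can be scripted. The sketch below assumes the MACS peaks file is tab-delimited with columns chr, start, end, length, summit (summit given relative to the peak start), preceded by comment lines starting with "#" and one header row. Check your own file before relying on that layout. The function names and the toy row are made up for illustration.

```python
def summit_windows(peak_lines, flank=200):
    """Return (chrom, start, end) windows of +/- flank bp around each peak summit."""
    windows = []
    for line in peak_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blank lines and comment lines
        fields = line.split("\t")
        if fields[0] == "chr":
            continue                      # skip the column-header row
        chrom, start, summit = fields[0], int(fields[1]), int(fields[4])
        center = start + summit           # summit assumed relative to peak start
        windows.append((chrom, max(0, center - flank), center + flank))
    return windows


def to_bed(windows):
    """Format windows as tab-delimited BED lines with no header."""
    return "\n".join(f"{c}\t{s}\t{e}" for c, s, e in windows)


# Toy input in the assumed layout (values are hypothetical).
toy_lines = [
    "# toy MACS-style output",
    "chr\tstart\tend\tlength\tsummit\ttags",
    "chr19\t3283000\t3283900\t901\t450\t55",
]
toy_windows = summit_windows(toy_lines)
print(to_bed(toy_windows))
```

The start is clamped at 0 so that a summit close to the beginning of a chromosome cannot produce a negative coordinate.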

Making a file of control .bed regions For each peak-center region (chr, start, end), make a control region: chr: =peaks:chr; start: =peaks:start-5000; end: =peaks:end-5000. 5000 bp is far enough away to not be part of an enhancer composed of the ER binding site, but is close enough to likely be in the same general chromatin territory (e.g. accessible euchromatin vs. inaccessible heterochromatin). Copy these columns & make a “CTRL_chr19_400bp.bed” file with WinSCP.
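As a scripted counterpart to the spreadsheet formulas, here is a small sketch that shifts each window 5000 bp upstream. The function name is my own. Skipping regions that would start before position 0 is a design choice of this sketch, not something the slides specify.

```python
def offset_controls(windows, offset=5000):
    """Shift each (chrom, start, end) window by -offset bp to make a control region."""
    controls = []
    for chrom, start, end in windows:
        if start - offset < 0:
            continue                      # would run off the chromosome start
        controls.append((chrom, start - offset, end - offset))
    return controls


# Hypothetical 400-bp peak-center windows.
toy_peak_windows = [("chr19", 3283250, 3283650), ("chr19", 1200, 1600)]
toy_controls = offset_controls(toy_peak_windows)
for chrom, start, end in toy_controls:
    print(f"{chrom}\t{start}\t{end}")
```

Because both coordinates move by the same amount, every control region keeps the 400 bp length of its peak window, which keeps the number of positions scanned comparable between IP and control.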

CentDist A TFBS enrichment program designed for ChIP-seq data. Assumes that TFBS-matrix hits will be most highly enriched at the centers of ChIP-seq peaks. Adds PWM score to “peakiness” score (i.e. how much more enriched the TF site is in the center of the peak) → final p. val. [Figure panels: good enrichment, poor shape (higher p. val.); good enrichment, OK shape; good enrichment, good shape (best p.)] Go to: http://biogpu.ddns.comp.nus.edu.sg/~chipseq/webseqtools2/TASKS/Motif_Enrichment/submit.php?email=guest …or (more simply) just google centdist and click on the first link (should end in /centdist/)

Run CentDist Give centdist a name for your run Upload your +/-200 bp .bed file (CentDist does not need a separate background file, instead using flanking sequences at a case-specific optimized distance as background) Check “Jaspar”, “vertebrate” & set max-co-motif distance to 3000 Then click Submit Job On the new window that opens click “turn on autorefresh” so you will be notified when the job ends

Jaspar vs. Transfac Jaspar is a freely-available set of TFBS matrices that can be downloaded from jaspar.genereg.net Transfac is a commercial product that you need to pay for the latest release of. A version of Transfac (from ~2006) is available on the cluster (e.g. /cluster/home/g/s/gschni01/vertebrates.mat) Which to use? Both, ideally, but beware that programs like CentDist will give you results from Transfac matrices – and you won’t be able to find out details of what they are.

CentDist Results View by factors, put in max number & hit go. The results page shows: P. values (based on a score composed of Z0 (enrichment) & Z1 (peakiness)); a distribution graph; and a Weblogo representation of the Jaspar matrix, which shows information content at each position: A, G, C & T at 25% each -> 0 bits, only 1 base at 100% -> 2 bits. Bases most highly over-represented relative to chance are largest.
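The bit values quoted for the logo follow from the standard information-content formula for a DNA position, 2 + Σ p·log2(p). A minimal sketch, assuming a uniform background and ignoring the small-sample correction that logo software may apply:

```python
from math import log2


def information_content(probs):
    """Information content in bits of one motif position (max 2 for DNA)."""
    return 2.0 + sum(p * log2(p) for p in probs if p > 0)


def letter_heights(probs, bases="ACGT"):
    """Height of each letter in the logo: its probability times the column's bits."""
    bits = information_content(probs)
    return {base: p * bits for base, p in zip(bases, probs)}


print(information_content([0.25, 0.25, 0.25, 0.25]))   # all four bases equally likely
print(information_content([1.0, 0.0, 0.0, 0.0]))       # only one base ever seen
print(letter_heights([0.5, 0.5, 0.0, 0.0]))
```

The first two calls reproduce the two extremes named on the slide, 0 bits and 2 bits.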

How many enriched TF sites are there really? Matrix hit enrichment for many factors, are all of them real? V$jaspar_HNF4A V$jaspar_NR2F1 V$jaspar_ESR1 Maybe not, notice how similar top sites are to each other and to estrogen response elements (EREs) such as V$jaspar_ESR1

Downloading CentDist Results Click on download as text & save the file somewhere you remember. Open it in Excel. It contains basic summary statistics & a few other things. Many questions remain unanswered: -What is the fold enrichment over background? -What are the peaks with the greatest enrichment for each factor? -What are the TFBS hit locations in each peak? -Which are the real enriched TFBSes & which are just showing up by homology? -Do certain factors group together into the same peaks?

DAY 4 OUTLINE Position weight matrices to find transcription factor binding sites (TFBSes) TFBS enrichment in peaks using CentDist TFBS enrichment using Storm in UNIX Mining Storm results Disambiguating similar matrices w/ STAMP

Storm Storm is a straightforward PWM scanning program that runs in UNIX. Its greatest advantage is that it gives you all of the unprocessed output data, which allows you to do much more powerful analyses! It also allows us to specify thresholds for matches to the matrix – allowing us to use functional depth as well as p. value

Getting DNA for Storm To run storm, we first need to get the actual DNA sequence for centers of our peaks (where we expect the greatest enrichment for TFBSes to be). Go to the UCSC genome browser at: genome.ucsc.edu Under genome choose mouse mm9 Then choose add custom track & upload your +/-200 bp .bed file. Click on Tools->Table Browser Select your new track Select output format “sequence” Provide a file name “LiE_chr19_400bp.fa” & hit “get output” Hit ‘get output’ again on the next page Now do the same for your “CTRL_chr19_400bp.bed” file. .fa denotes a simple ‘fasta’ format sequence file.

Cleaning up our .fa files Use WinSCP to move these .fa files and their corresponding .bed files to your …/chip directory. All of the commands below are in the file /cluster/tufts/cbi*/Ch*/Sam*/Final*/workflow2.txt; cat this to your screen to copy & paste commands. Each entry in the .fa file has a header with special characters in it that confuse storm. To fix this, go to your …/chip directory in PuTTY & do: perl /cluster/home/g/s/gschni01/perl*/Lax_convert.pl LiE_chr19_400bp.fa > LiE_chr19_400bp_converted.fa To see what has changed use: head *.fa Do the same for your “CTRL_chr19_400bp.fa” file.

Running storm First set some path variables: export CREAD=/cluster/home/g/s/gschni01/cread-0.84 export PATH=$PATH:$CREAD/bin Then run storm for your IP .fa file: bsub -oo LiE_chr19_400bp_p.storminfo storm -p -t 0.0005 -s LiE_chr19_400bp_converted.fa -o LiE_chr19_400bp_p.storm /cluster/home/g/s/gschni01/Jaspar_non_redundant_vertebrate.mat And for your control .fa file: bsub -oo CTRL_chr19_400bp_p.storminfo storm -p -t 0.0005 -s CTRL_chr19_400bp_converted.fa -o CTRL_chr19_400bp_p.storm /cluster/home/g/s/gschni01/Jaspar_non_redundant_vertebrate.mat Use more to look at one of your .storm output files (space for next page ctrl c to exit)

DAY 4 OUTLINE Position weight matrices to find transcription factor binding sites (TFBSes) TFBS enrichment in peaks using CentDist TFBS enrichment using Storm in UNIX Mining Storm results Disambiguating similar matrices w/ STAMP

Interpreting Storm data Run the dme_parse perl program to gather and tabulate your storm data: bsub -oo LiE_chr19_400bp_p.dmeparseinfo perl /cluster/home/g/s/gschni01/perl*/dme_parse5.4.pl LiE_chr19_400bp_p.storm LiE_chr19_400bp.bed peaks bsub -oo CTRL_chr19_400bp_p.dmeparseinfo perl /cluster/home/g/s/gschni01/perl*/dme_parse5.4.pl CTRL_chr19_400bp_p.storm CTRL_chr19_400bp.bed peaks

dme_parse outputs …storm.bed file: Has UCSC browser tracks for each TFBS matrix with locations of all hits in bed format. …storm.map file: Lists all input matrices followed by the PFM derived from all of the hits to this matrix from our data. …storm.info file: Summarizes a lot of information about matrix hits. Move the .info files to your PC with WinSCP & open them in Excel. The file provides summary statistics for # of peaks with 0, 1, 2, etc. hits, total hits, and normalized hits per 50 bp vs. distance from peak center.

dme_parse outputs Using the .info file to plot relative density of TFBS hits in aorta IP, liver IP & offset controls:

dme_parse outputs Using the .info files to structure binomial tests: Hits = # of matches to each matrix in IP data. Tests = # of times storm tested for a match = (# of peaks) * (400 bp length of peaks - matrix length). Background freq = matches to offset control peak data / # tests (same as for IP). Using the .info files to determine fractional enrichment: hit frequency in IP data / hit frequency in offset control.
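The arithmetic on this slide can be sketched as a small helper that turns hit counts into the three binom.test inputs plus the fold enrichment. The function name and the toy counts are made up for illustration. The formula for the number of tests is the one given on the slide.

```python
def storm_binomial_inputs(ip_hits, ctrl_hits, n_peaks, matrix_len, peak_len=400):
    """Return (hits, tests, background frequency, fold enrichment) for one matrix."""
    tests = n_peaks * (peak_len - matrix_len)       # positions storm tested
    bkg_freq = ctrl_hits / tests                    # hit frequency in offset controls
    ip_freq = ip_hits / tests                       # hit frequency in IP peaks
    fold = ip_freq / bkg_freq if ctrl_hits else float("inf")
    return ip_hits, tests, bkg_freq, fold


# Hypothetical counts: 120 IP hits and 40 control hits over 500 peaks, 15-bp matrix.
hits, tests, bkg_freq, fold = storm_binomial_inputs(120, 40, 500, 15)
print(f"binom.test({hits}, {tests}, {bkg_freq:.3g})   fold enrichment = {fold:.2f}")
```

The printed line can be pasted into R as the binom.test call. When the control has zero hits the background frequency is 0 and the fold enrichment is reported as infinite, so such matrices need a different treatment (for example a larger control set).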

dme_parse outputs .freqs file: Number of hits to each matrix for each peak. The distribution of hits per peak in the offset background establishes the # of hits needed to be p. <= .05 enriched over background. This allows identification of sites at which a given TFBS may be functionally targeted (candidates for further testing). You can also look for significant overlaps between the peaks with enrichment for 2 different factors, to identify cooperative versus antagonistic interactions. Details on how to do these analyses are in ChIPseq_analysis_methods_2013_02_11 on the cbi website.
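One way to read "the # of hits needed to be p. <= .05 enriched" is as the smallest hit count that no more than 5% of background regions reach. The sketch below implements that reading. The per-peak counts are toy values, the function names are my own, and the actual procedure in the course methods document may differ.

```python
def enrichment_cutoff(background_counts, alpha=0.05):
    """Smallest hits-per-peak value reached by at most alpha of background regions."""
    n = len(background_counts)
    for cutoff in range(0, max(background_counts) + 2):
        fraction = sum(1 for c in background_counts if c >= cutoff) / n
        if fraction <= alpha:
            return cutoff


def enriched_peaks(ip_counts, cutoff):
    """Names of IP peaks whose hit count meets or exceeds the cutoff."""
    return [name for name, count in ip_counts.items() if count >= cutoff]


# Toy background: 100 control regions with 0, 1, 2 or 3 hits to one matrix.
toy_background = [0] * 70 + [1] * 22 + [2] * 5 + [3] * 3
toy_ip = {"peak_1": 0, "peak_2": 3, "peak_3": 2, "peak_4": 5}

cutoff = enrichment_cutoff(toy_background)
candidates = enriched_peaks(toy_ip, cutoff)
print("cutoff:", cutoff, "candidate peaks:", candidates)
```

In the toy background 8% of regions have 2 or more hits and 3% have 3 or more, so 3 hits is the first count rare enough to flag a peak as a candidate.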

DAY 4 OUTLINE Position weight matrices to find transcription factor binding sites (TFBSes) TFBS enrichment in peaks using CentDist TFBS enrichment using Storm in UNIX Mining Storm results Disambiguating similar matrices w/ STAMP

STAMP Go to www.benoslab.pitt.edu/stamp/index.php STAMP lets you compare matrices for evolutionary similarities to each other. Go to your CentDist output. Create a new column in which you change the names of the factors to fit with the names in the Jaspar_non_redundant_vertebrate.mat file you used for Storm: =SUBSTITUTE(B2,"V$jaspar_","Jaspar$"), & propagate down. Select all matrix names w/ p. < .05 & paste them into a new file called “select_mats.txt” in your /chip folder on the cluster using WinSCP.

Getting STAMP to help classify our CentDist top hits perl /cluster/home/g/s/gschni01/perl*/MatrixSelect.pl /cluster/home/g/s/gschni01/Jaspar_non_redundant_vertebrate.mat select_mats.txt select_mats.mat Now, open the select_mats.mat file with WinSCP, copy everything & paste it into STAMP. Keep all the STAMP defaults & hit submit.

STAMP Tree This indicates that enrichment of PPARG, RORA, NR4A2 could be just because of their similarity to EREs. Other enriched sites, such as SP1, FoxA2 & Myf, fall in separate homology classes. To further distinguish which one is real, you can use the enrichment ratios & p. values (the “real” TFBS should be best in both of these).

Day 5 Outline Introduction to variations on ChIP-seq methods Extensions & variations on TFBS analysis Analyzing published data & across platforms Downloading & installing programs Writing your own programs

Next-Generation Sequencing Analysis “ChIP-Seq is the best thing that happened to ChIP since the antibody.  It is 100x better than ChIP-Chip since it escapes most of the problems of microarray probe hybridization.  Plus it is cheaper, and genome wide.  But ChIP-Seq is only the tip of the iceberg - there are many inventive ways to use a sequencer.”  Quote from intro to Homer software at: http://biowhat.ucsd.edu/homer/ngs/index.html

Extensions of ChIP-seq
ChIP-Seq: Isolation and sequencing of genomic DNA "bound" by a specific transcription factor, covalently modified histone, or other nuclear protein. This methodology provides genome-wide maps of factor binding. Most of HOMER's routines cater to the analysis of ChIP-Seq data.
DNase-Seq: Treatment of nuclei with a nuclease such as DNase I will result in cleavage of DNA at accessible regions. Isolation of these regions and their detection by sequencing allows the creation of DNase hypersensitivity maps, providing information about which regulatory elements are accessible in the genome. (A related technique is called FAIRE-seq.)
MNase-Seq: Micrococcal nuclease (MNase) degrades genomic DNA not wrapped around histones. The remaining DNA represents nucleosomal DNA, and can be sequenced to reveal nucleosome positions along the genome. This method can also be combined with ChIP to map nucleosomes that contain specific histone modifications.
RNA-Seq: Extraction, fragmentation, and sequencing of RNA populations within a sample. The replacement for gene expression measurements by microarray. There are many variants on this, such as Ribo-Seq (isolation of ribosomes translating RNA), small RNA-Seq (to identify miRNAs), etc.
GRO-Seq: RNA-Seq of nascent RNA. Transcription is halted, nuclei are isolated, labeled nucleotides are added back, and transcription is briefly restarted, resulting in labeled RNA molecules. These newly created, nascent RNAs are isolated and sequenced to reveal "rates of transcription" as opposed to the total number of stable transcripts measured by normal RNA-seq.
Hi-C: Genomic interaction assay for understanding genome 3D structure. This assay is much more specialized. For more information about how to use HOMER to analyze Hi-C data, check out the Hi-C analysis section of the HOMER site.

Examining long-range interactions by ChIP-seq Two DNA fragments associated with the same IP’d protein are ligated together. Sequencing identifies both short-range and long range interactions. Nature Reviews Genetics 2012 13:840

Fine scale information from DNase-seq Sequencing the ends of DNase cuts identifies regions of bare DNA. Fine scale analysis of this data can identify individual TF binding sites. Nature Reviews Genetics 2012 13:840

Capturing allele-specific information using SNPs in reads CTCF binds better to the A variant

Mapping CpG DNA methylation patterns Approaches: IP of DNA fragments using antibodies against meC or meCpG binding proteins. Selection of DNA fragments using methyl-sensitive restriction enzymes. Whole genome bisulfite sequencing. Bormann Chung CA, Boyd VL, McKernan KJ, Fu Y, et al. (2010) Whole Methylome Analysis by Ultra-Deep Sequencing Using Two-Base Encoding. PLoS ONE 5(2): e9320. doi:10.1371/journal.pone.0009320 http://www.plosone.org/article/info:doi/10.1371/journal.pone.0009320

Mapping nucleosome positions Approaches: 1) Fragmentation to mononucleosome size by sonication or micrococcal nuclease (MNase) → ChIP w/ antibody against histone modification (H3K4me1), which can map positions of nucleosomes with this mark → whole genome sequencing. Nat Struct Mol Biol. 2011 June; 18(6): 742–746.

Plotting ChIP-seq read density versus genomic features Taking average normalized .bedgraph data relative to TSSes…

Using input chromatin read density to measure nucleosome densities Hypothesis: Sonication mostly cuts in nucleosome free regions or inter-nucleosomal spacers. Thus, read positions give information about nucleosome positions. Initial support: Average normalized .bedgraph data from INPUT sample relative to TSSes recapitulates the low nucleosome occupancy seen genomewide over promoters.

Day 5 Outline Introduction to variations on ChIP-seq methods Extensions & variations on TFBS analysis Analyzing published data & across platforms Downloading & installing programs Writing your own programs

Many approaches to TFBS analysis Outline of the review. The overall goal is to identify transcription factor binding sites on a genome-wide scale. Starting with a few experimentally determined sites, a model of the binding site is constructed which is then used in a genome-wide scan to search for additional instances of the binding site. Besides enhanced motif models, additional, evolutionary, genomic, epigenomic, transcriptomic and proteomic data can be used in an integrative fashion to improve the accuracy of binding site search. Hannenhalli S Bioinformatics 2008;24:1325-1331 Also, Ladunga I. An overview of the computational analyses and discovery of transcription factor binding sites. Methods Mol Biol. 2010;674:1-22. doi: 10.1007/978-1-60761-854-6_1. : Introduction to a set of about a dozen methods papers.

De Novo Search Algorithms The Gibbs sampler approach Objective: Find conserved segment of length k in n unrelated sequences. [Figure: a window of positions 1 to k in each of sequences 1, 2, … n] The program will need to run once for each k: e.g. 6 bp, 7 bp, 8 bp sequences, etc. (either automatically, or by hand). From: Lawrence, C. et al. (1993) Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment. Science 262.208- The EM approach (in MEME etc.) Expectation Maximization algorithm, proceeds in iterations until E & M converge. For an explanation of the process see Nature Biotechnology 26, 897 - 899 (2008). Adapted from: www.stats.ox.ac.uk/~hein/Signals.ppt

Two de novo search methods DME is part of the same CREAD package that storm is in (run in UNIX). SEME uses some of the same refinements as CentDist to do de novo searches: http://biogpu.d1.comp.nus.edu.sg/~chipseq/webseqtools2/

Extensions to Basic Models Composite Patterns: BioOptimizer: the Bayesian Scoring Function Approach to Motif Discovery Bioinformatics M1 M2 M3 Stop Start Regulatory Modules: De novo cis-regulatory module elicitation for eukaryotic genomes. Proc Nat’l Acad Sci USA, 102, 7079-84 Gene A Gene B Adapted from: www.stats.ox.ac.uk/~hein/Signals.ppt

Combining Signals and other Data Expression and motif regression: Integrating Motif Discovery and Expression Analysis, Proc.Natl.Acad.Sci. 100.3339-44 1. Rank genes by E=log2(expression fold change) 2. Find “many” (hundreds) candidate motifs 3. For each motif pattern m, compute the vector Sm of matching scores for genes with the pattern 4. Regress E on Sm [Figure: motifs upstream of coding regions] ChIP-on-chip - 1-2 kb information on protein/DNA interaction: An Algorithm for Finding Protein-DNA Interaction Sites with Applications to Chromatin Immunoprecipitation Microarray Experiments, Nature Biotechnology, 20, 835-39 [Figure: protein binding in neighborhood of coding regions] Adapted from: www.stats.ox.ac.uk/~hein/Signals.ppt

Assessment of evolutionary conservation Modules shared across species are most highly rated. For use of evolutionary conservation information w/ individual motifs see: Das & Dai 2007 BMC Bioinformatics 8:S21. For regulatory modules see: Su J, Teichmann SA, Down TA (2010) Assessing Computational Methods of Cis-Regulatory Module Prediction. PLoS Comput Biol 6(12): e1001020. doi:10.1371/journal.pcbi.1001020 http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1001020 Adapted from: www.stats.ox.ac.uk/~hein/Signals.ppt

Integrating data from multiple sources w/ permutation of average ranks Let’s say we want to combine data from several sources or metrics to decide which are the most relevant enriched TFs, e.g. 1) p. value in CentDist, 2) p. value in Storm & 3) p. value of homologous sequence in DME. Establish a ranking metric for each (e.g. 1 best to 10 worst). It doesn’t have to be the same for 1, 2 & 3, but you need to apply the same rank system across different biological conditions. For each TF compute the average rank. [Table: columns (1), (2), (3) & (avg), one row per TF]

Permutation of average ranks Now take the same columns of ranks for (1), (2) & (3) and randomize each one separately. [Table: permuted ranks in columns (1), (2) & (3) with their new averages] Repeat this many times (until you have thousands of random average ranks) & plot frequency vs. avg. rank. [Plot: frequency vs. average rank] Example: an average rank of 2.0 was observed 34/10,000 times in permuted averages; estimated FDR ~3.4e-3. The number of times a given value is observed divided by the total number of iterations gives an estimate of the false discovery rate.
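The permutation step lends itself to a short script. The sketch below shuffles each rank column independently, collects the resulting average ranks, and reports how often a permuted average is as good as or better than an observed one. Counting "as good or better" rather than exact matches is my choice, and a common one for empirical estimates. The ranks are toy values and the function names are made up for illustration.

```python
import random


def permuted_average_ranks(rank_columns, n_iter=1000, seed=0):
    """Shuffle each rank column separately and collect the per-TF average ranks."""
    rng = random.Random(seed)
    columns = [list(col) for col in rank_columns]   # copy so the input is untouched
    averages = []
    for _ in range(n_iter):
        for col in columns:
            rng.shuffle(col)
        averages.extend(sum(vals) / len(columns) for vals in zip(*columns))
    return averages


def empirical_rate(observed_avg, permuted_averages):
    """Fraction of permuted averages that are <= the observed average rank."""
    as_good = sum(1 for avg in permuted_averages if avg <= observed_avg)
    return as_good / len(permuted_averages)


# Toy data: 10 TFs ranked 1 (best) to 10 (worst) by three metrics.
toy_ranks = [list(range(1, 11)) for _ in range(3)]
permuted = permuted_average_ranks(toy_ranks, n_iter=200, seed=1)
print("permuted averages collected:", len(permuted))
print("rate for an observed average rank of 2.0:", empirical_rate(2.0, permuted))
```

Fixing the random seed makes a run repeatable. For a real analysis, increase n_iter until the estimate for the ranks of interest stops changing.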

Day 5 Outline Introduction to variations on ChIP-seq methods Extensions & variations on TFBS analysis Analyzing published data & across platforms Downloading & installing programs Writing your own programs

What if you want to know something from a published dataset, but they’ve only provided the raw data on SRA? Getting data from SRA Go to: http://www.ncbi.nlm.nih.gov/sra Find an experiment by searching, e.g. “encode h1-hesc h3k4me3” Click on the name to the left of the smaller file (1.9M) & then on the downloads tab. Right click on the ftp link for the run & copy the link location. Open putty & login to your account at cluster.uit.tufts.edu Go to your /cluster/shared/[userID]/chip directory & do: wget [pasted URL]

Decoding the .sra format The SRR227387.sra file you now have is in a special file format, but it does have all the original .fastq information in it. To get that info do: bsub /cluster/tufts/cbi*/Ch*/ESC*/sra*/bin/fastq-dump SRR227387.sra [fastq-dump is part of a package of programs for handling .sra files that you can download, unpack & run immediately from your shared directory – at least as far as simple files like fastq-dump are concerned] This gives you the same .fastq format you’re familiar with. Use head to confirm the format, but then you might as well delete the file with rm so as not to clutter up the cluster. After this week you are now ready to do any analysis you want on this data, from mapping reads to the genome (w/ bowtie) to peak calling (w/ MACS), to TFBS analysis.
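Beyond eyeballing the output of head, the four-line FASTQ record structure can be checked with a few lines of Python. This is a minimal sketch that assumes the simple layout of header, sequence, "+" line and quality line, with no wrapped sequences. The function name and the toy reads are made up for illustration.

```python
def read_fastq(lines):
    """Parse FASTQ text lines into (read_name, sequence, quality) tuples."""
    lines = [ln.rstrip("\n") for ln in lines]
    records = []
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"record starting at line {i + 1} is not FASTQ")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        records.append((header[1:], seq, qual))
    return records


# Two toy reads in FASTQ layout (names and bases are hypothetical).
toy_fastq = [
    "@read_1", "GATTACAGGT", "+", "IIIIIIIIII",
    "@read_2", "CCGTTAGCAA", "+", "IIIIHHHGGG",
]
toy_reads = read_fastq(toy_fastq)
print(len(toy_reads), "reads; first read length:", len(toy_reads[0][1]))
```

To check a real file, pass the first few hundred lines of the .fastq output rather than loading the whole file, since these files are very large.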

“Liftover” programs to convert between genomes & builds Several useful tools for this in Cistrome/Galaxy, under Liftover/Others:
Convert between RefSeq, Gene Symbols to Entrez IDs using Bioconductor
Liftover Wig Files: liftover wig files
[Galaxy] Convert genome coordinates: between assemblies and genomes
Extract data from Wiggle: extract data for certain chromosome from a wiggle file
Extract data from Bed: extract data for certain chromosome from a BED file
In the UCSC genome browser: Tools-> Liftover Choose the starting genome/build & the one you want to convert to. Upload a .bed file w/ the ranges you want & hit go (only works for bed files… may work with bedGraph, although I haven’t confirmed this)

Day 5 Outline Introduction to variations on ChIP-seq methods Extensions & variations on TFBS analysis Analyzing published data & across platforms Downloading & installing programs Writing your own programs

Don’t be intimidated! There’s nothing to prevent you from installing a program you want to run in your cluster account. Before you begin, though, type “module available” to see if it’s already installed as a module. Also go to /cluster/tufts/ngsp/ngsp/ to see if it’s installed there. Read the documentation from the creator’s lab, download, unzip &/or unpack the file, read the INSTALL or README files included, & give it a try. You may need to be running a specific version of perl or python, etc. If so, check “module available” to see if it’s installed on the cluster & use “module load [name]” to add it. You may also need to set system variables using “export VARIABLE=$VARIABLE:/new/path”. README files should tell you enough to know what to try. If you get stuck, the cluster support folks are friendly & helpful (and respond moderately fast). Contact them at: cluster-support@tufts.edu.

A different integrated package of tools to run in UNIX
HOMER: Software for motif discovery and next-gen sequencing analysis http://biowhat.ucsd.edu/homer/ngs/index.html
Mapping to the genome (NOT performed by HOMER, but important to understand)
Creation of tag directories, quality control, and normalization (makeTagDirectory)
UCSC visualization (makeUCSCfile, makeBigWig.pl)
Peak finding / Transcript detection / Feature identification (findPeaks)
Motif analysis (findMotifsGenome.pl)
Annotation of Peaks (annotatePeaks.pl)
Quantification of Data at Peaks/Regions in the Genome/Histograms and Heatmaps (annotatePeaks.pl)
Quantification of Transcripts (analyzeRNA.pl)
Additional analysis strategies:
General sequence manipulation tools (homerTools)
Miscellaneous Tools for Sharing Data between programs, etc. (tagDir2bed.pl, bed2pos.pl, pos2bed.pl ...)
Finding overlapping or differentially bound peaks (mergePeaks, getDifferentialPeaks)
ChIP-Seq analysis automation (analyzeChIP-Seq.pl)
Description of file formats
Could be very useful… & with (only a bit of) luck, you’ll be able to install & run them yourself.

Installing a program in R Check out the Key R Commands link at http://sites.tufts.edu/cbi/resources/chip-seq/ This is not an introduction to programming in R! Instead it gives basic instructions for how to: 1) install & run R packages that may be needed for your research, 2) how to move data files into R 3) how to perform simple edits on this data that may be required by the package & 4) how to output your results. Note: I find that the documentation for R packages is generally quite good.

Day 5 Outline Introduction to variations on ChIP-seq methods Extensions & variations on TFBS analysis Analyzing published data & across platforms Downloading & installing programs Writing your own programs

Mastering simple UNIX tools find, awk, grep, sort, sed & more One line commands to let you search and manipulate large data files w/o writing a program or trying to use the kludgy and limited tools in Galaxy. Find out more at: http://sites.tufts.edu/cbi/resources/rna-seq-course/unix-resources/

Programming: Get your feet wet Perl Tutorials - learn.perl.org learn.perl.org/tutorials/ Many tutorials are available if you are interested in learning Perl. These tutorials are introductions. Beginning Perl (free) - www.perl.org www.perl.org/books/beginning-perl/ This book is for those new to programming who want to learn with Perl. A ton of Perl programs for you to use/adapt/modify: http://www.bioperl.org/wiki/Main_Page For learning R: Check out Josh’s links at: http://sites.tufts.edu/cbi/resources/rna-seq-course/r-resources/ Also check out my notes on using R (specifically geared to the minimum you need to install & use existing programs) & a brief reference sheet on Perl at http://sites.tufts.edu/cbi/resources/chip-seq/

Look at examples, check the web… If you’re looking for a command in UNIX, R, Perl, Python, etc. do a Google search (for R add “statistical” to your search to specify what you mean). If you’re wondering how to get a program to do something, look at other programs & see how they did it. You don’t need to memorize the language, beyond a few basics, just look at what you (or someone else) did before & copy it.

Questions? What would you like to explore? What’s the next bioinformatics challenge in your research?

Course evaluation forms…