ChIP-seq Methods & Analysis

Presentation on theme: "ChIP-seq Methods & Analysis"— Presentation transcript:

1 ChIP-seq Methods & Analysis
Gavin Schnitzler Asst. Prof. Medicine TUSM, Investigator at MCRI, TMC

2 ChIP-seq COURSE OUTLINE
Day 1: ChIP techniques, library production, UCSC browser tracks.
Day 2: QC on reads, mapping binding site peaks, examining read density maps.
Day 3: Analyzing peaks in relation to genomic features, etc.
Day 4: Analyzing peaks for transcription factor binding site consensus sequences.
Day 5: Variants & advanced approaches.

3 Day 5 Outline
Introduction to variations on ChIP-seq methods
Extensions & variations on TFBS analysis
Analyzing published data & across platforms
Downloading & installing programs
Writing your own programs

4 Next-Generation Sequencing Analysis
“ChIP-Seq is the best thing that happened to ChIP since the antibody.  It is 100x better than ChIP-Chip since it escapes most of the problems of microarray probe hybridization.  Plus it is cheaper, and genome wide.  But ChIP-Seq is only the tip of the iceberg - there are many inventive ways to use a sequencer.”  Quote from intro to Homer software at:

5 Extensions of ChIP-seq
ChIP-Seq: Isolation and sequencing of genomic DNA "bound" by a specific transcription factor, covalently modified histone, or other nuclear protein. This methodology provides genome-wide maps of factor binding. Most of HOMER's routines cater to the analysis of ChIP-Seq data.
DNase-Seq: Treatment of nuclei with a nuclease such as DNase I results in cleavage of DNA at accessible regions. Isolation of these regions and their detection by sequencing allows the creation of DNase hypersensitivity maps, providing information about which regulatory elements are accessible in the genome. (FAIRE-seq is a related technique.)
MNase-Seq: Micrococcal Nuclease (MNase) is a nuclease that degrades genomic DNA not wrapped around histones. The remaining DNA represents nucleosomal DNA, and can be sequenced to reveal nucleosome positions along the genome. This method can also be combined with ChIP to map nucleosomes that contain specific histone modifications.
RNA-Seq: Extraction, fragmentation, and sequencing of RNA populations within a sample. The replacement for gene expression measurements by microarray. There are many variants on this, such as Ribo-Seq (isolation of ribosomes translating RNA), small RNA-Seq (to identify miRNAs), etc.
GRO-Seq: RNA-Seq of nascent RNA. Transcription is halted, nuclei are isolated, labeled nucleotides are added back, and transcription is briefly restarted, resulting in labeled RNA molecules. These newly created, nascent RNAs are isolated and sequenced to reveal "rates of transcription" as opposed to the total number of stable transcripts measured by normal RNA-seq.
Hi-C: Genomic interaction assay for understanding genome 3D structure. This assay is much more specialized. For more information about how to use HOMER to analyze Hi-C data, check out the Hi-C analysis section.

6 Examining long-range interactions by ChIP-seq
Two DNA fragments associated with the same IP’d protein are ligated together. Sequencing identifies both short-range and long-range interactions. Nature Reviews Genetics :840

7 Fine-scale information from DNase-seq
Sequencing the ends of DNase cuts identifies regions of bare DNA. Fine-scale analysis of these data can identify individual TF binding sites. Nature Reviews Genetics :840

8 Capturing allele-specific information using SNPs in reads
CTCF binds better to the A variant

9 Mapping CpG DNA methylation patterns
Approaches:
IP of DNA fragments using antibodies against meC or meCpG binding proteins.
Selection of DNA fragments using methyl-sensitive restriction enzymes.
Whole genome bisulfite sequencing.
Bormann Chung CA, Boyd VL, McKernan KJ, Fu Y, et al. (2010) Whole Methylome Analysis by Ultra-Deep Sequencing Using Two-Base Encoding. PLoS ONE 5(2): e9320. doi: /journal.pone

10 Mapping nucleosome positions
Approaches:
1) Fragmentation to mononucleosome size by sonication or micrococcal nuclease (MNase).
ChIP w/ antibody against a histone modification (e.g. H3K4me1): can map positions of nucleosomes with this mark.
Whole genome sequencing.
Nat Struct Mol Biol June; 18(6): 742–746.

11 Plotting ChIP-seq read density versus genomic features
Taking average normalized .bedgraph data relative to TSSes…
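The averaging step can be sketched in a few lines of Python. This is only an illustration, assuming a standard four-column bedGraph (chrom, start, end, value; 0-based, half-open) and a list of TSS positions with strand. The function names `parse_bedgraph` and `average_tss_profile` are made up for this example and are not part of any package used in the course.

```python
from collections import defaultdict


def parse_bedgraph(lines):
    """Return {chrom: [(start, end, value), ...]} from bedGraph lines."""
    signal = defaultdict(list)
    for line in lines:
        # Skip blank lines and browser/track/comment header lines.
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, value = line.split()[:4]
        signal[chrom].append((int(start), int(end), float(value)))
    return signal


def average_tss_profile(signal, tss_list, flank=1000):
    """Average signal at each offset from -flank to +flank around the TSSes.

    tss_list holds (chrom, position, strand) tuples. Offsets are flipped for
    minus-strand genes so that negative offsets are always upstream.
    Bases not covered by any bedGraph interval count as zero signal.
    """
    totals = [0.0] * (2 * flank + 1)
    for chrom, pos, strand in tss_list:
        # Simple linear scan; fine for a sketch, slow for whole genomes.
        for start, end, value in signal.get(chrom, []):
            lo = max(start, pos - flank)
            hi = min(end, pos + flank + 1)
            for p in range(lo, hi):
                offset = p - pos
                if strand == "-":
                    offset = -offset
                totals[offset + flank] += value
    n = len(tss_list)
    return [t / n for t in totals] if n else totals
```

Plotting the returned list against offsets -flank..+flank gives the kind of TSS-centered density curve shown on this slide. For real data sets an indexed format (e.g. bigWig) would be far faster than this linear scan.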

12 Using input chromatin read density to measure nucleosome densities
Hypothesis: Sonication mostly cuts in nucleosome free regions or inter-nucleosomal spacers. Thus, read positions give information about nucleosome positions. Initial support: Average normalized .bedgraph data from INPUT sample relative to TSSes recapitulates the low nucleosome occupancy seen genomewide over promoters.

13 Day 5 Outline
Introduction to variations on ChIP-seq methods
Extensions & variations on TFBS analysis
Analyzing published data & across platforms
Downloading & installing programs
Writing your own programs

14 Many approaches to TFBS analysis
Outline of the review. The overall goal is to identify transcription factor binding sites on a genome-wide scale. Starting with a few experimentally determined sites, a model of the binding site is constructed which is then used in a genome-wide scan to search for additional instances of the binding site. Besides enhanced motif models, additional evolutionary, genomic, epigenomic, transcriptomic and proteomic data can be used in an integrative fashion to improve the accuracy of binding site search. Hannenhalli S Bioinformatics 2008;24:
Also: Ladunga I. An overview of the computational analyses and discovery of transcription factor binding sites. Methods Mol Biol. 2010;674:1-22. doi: / _1. Introduction to a set of about a dozen methods papers.

15 De Novo Search Algorithms
The Gibbs sampler approach
Objective: Find a conserved segment of length k in n unrelated sequences.
The program will need to run once for each k (e.g. 6 bp, 7 bp, 8 bp sequences, etc.), either automatically or by hand.
From: Lawrence, C. et al. (1993) Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment. Science
The EM approach (in MEME etc.)
Expectation Maximization algorithm: alternates E and M steps in iterations until the estimates converge. For an explanation of the process see Nature Biotechnology 26, (2008). Adapted from:
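To make the Gibbs sampler idea concrete, here is a minimal Python sketch of a site sampler for one fixed k. It is a simplified illustration, not the published implementation: it assumes exactly one site per sequence, a uniform background, and sequences made only of uppercase A/C/G/T that are at least k long. The name `gibbs_motif_search` and its parameters are invented for this example.

```python
import random


def gibbs_motif_search(seqs, k, iterations=200, seed=0, pseudocount=1.0):
    """Return one sampled motif start position per sequence.

    Assumes one site per sequence, a uniform background model, and
    sequences containing only A, C, G, T (each at least k long).
    """
    rng = random.Random(seed)
    alphabet = "ACGT"
    # Start from a random site in every sequence.
    starts = [rng.randrange(len(s) - k + 1) for s in seqs]
    for _ in range(iterations):
        for i, seq in enumerate(seqs):
            # Build a position-specific count matrix from all OTHER sequences.
            counts = [{b: pseudocount for b in alphabet} for _ in range(k)]
            for j, other in enumerate(seqs):
                if j == i:
                    continue
                site = other[starts[j]:starts[j] + k]
                for col, base in enumerate(site):
                    counts[col][base] += 1
            total = (len(seqs) - 1) + 4 * pseudocount
            # Score every window of the held-out sequence under the profile.
            weights = []
            for pos in range(len(seq) - k + 1):
                w = 1.0
                for col in range(k):
                    w *= counts[col][seq[pos + col]] / total
                weights.append(w)
            # Sample a new site in proportion to its score (the Gibbs step).
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts
```

As the slide notes, a search over several motif lengths means calling the function once per k (e.g. 6, 7, 8). Because the sampler is stochastic, it is usually worth running it from several seeds and comparing the resulting alignments.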

16 Two de novo search methods
DME is part of the same CREAD package that Storm is in (run in UNIX). SEME uses some of the same refinements as CentDist to do de novo searches:

17 Extensions to Basic Models
Composite Patterns: BioOptimizer: the Bayesian Scoring Function Approach to Motif Discovery. Bioinformatics
[Figure labels: M1, M2, M3, Stop, Start]
Regulatory Modules: De novo cis-regulatory module elicitation for eukaryotic genomes. Proc Nat’l Acad Sci USA, 102,
[Figure labels: Gene A, Gene B]
Adapted from:

18 Combining Signals and other Data
[Figure labels: Motifs, Coding regions]
Expression and Motif Regression: Integrating Motif Discovery and Expression Analysis. Proc. Natl. Acad. Sci.
1. Rank genes by E = log2(expression fold change)
2. Find “many” (hundreds) candidate motifs
3. For each motif pattern m, compute the vector Sm of matching scores for genes with the pattern
4. Regress E on Sm
ChIP-on-chip kb information on protein/DNA interaction: An Algorithm for Finding Protein-DNA Interaction Sites with Applications to Chromatin Immunoprecipitation Microarray Experiments. Nature Biotechnology, 20,
[Figure labels: Protein binding in neighborhood, Coding regions]
Adapted from:
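The final regression step of the motif-regression procedure (regress E on Sm) can be illustrated with ordinary least squares on a single motif. This is a hedged sketch in plain Python, not the published method, which handles many motifs and model selection. The function name `regress_expression_on_motif` is hypothetical.

```python
def regress_expression_on_motif(expression, motif_scores):
    """Simple linear regression of E (log2 fold change) on Sm (motif scores).

    Both lists are indexed by gene. Returns (slope, intercept, r_squared).
    """
    n = len(expression)
    if n != len(motif_scores) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 genes")
    mean_s = sum(motif_scores) / n
    mean_e = sum(expression) / n
    sxx = sum((s - mean_s) ** 2 for s in motif_scores)
    syy = sum((e - mean_e) ** 2 for e in expression)
    sxy = sum((s - mean_s) * (e - mean_e)
              for s, e in zip(motif_scores, expression))
    if sxx == 0:
        raise ValueError("motif scores are constant; slope is undefined")
    slope = sxy / sxx
    intercept = mean_e - slope * mean_s
    # Fraction of expression variance explained by this motif.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return slope, intercept, r_squared
```

Running this once per candidate motif and sorting by r_squared (or by the slope's significance) gives a first-pass ranking of motifs whose presence tracks expression change.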

19 Assessment of evolutionary conservation
Modules shared across species are most highly rated. For use of evolutionary conservation information w/ individual motifs see: Das & Dai 2007 BMC Bioinformatics 8:S21. For regulatory modules see: Su J, Teichmann SA, Down TA (2010) Assessing Computational Methods of Cis-Regulatory Module Prediction. PLoS Comput Biol 6(12): e doi: /journal.pcbi Adapted from:

20 Integrating data from multiple sources w/ permutation of average ranks
Let’s say we want to combine data from several sources or metrics to decide which are the most relevant enriched TFs, e.g. 1) p.value in CentDist, 2) p.value in Storm & 3) p.value of homologous sequence in DME.
Establish a ranking metric for each (e.g. 1 best to 10 worst). It doesn’t have to be the same for 1, 2 & 3, but you need to apply the same rank system across different biological conditions.
For each TF compute the average rank.
(1) (2) (3) (avg)
4 2 3
9 7 8

21 Permutation of average ranks
Now take the same columns of ranks for (1), (2) & (3) and randomize each one separately, then recompute the average for each row.
Repeat this several times (until you have thousands of random average ranks) & plot frequency vs. avg. rank…
The number of times a given value is observed divided by the total number of iterations gives an estimate of false discovery rate.
Example: an average rank of 2.0 is observed 34/10,000 times in permuted averages, for an estimated FDR of ~3.4e-3.
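The permutation procedure on these two slides can be sketched as follows. The function names are invented for this example. One assumption differs slightly from the slide's wording: the rate is computed as the fraction of permuted averages that are as good as or better than (less than or equal to) the observed one, a common choice for empirical estimates, rather than the fraction exactly equal to it.

```python
import random


def average_ranks(rank_columns):
    """Average rank per TF; rank_columns is one list of ranks per metric."""
    n_tfs = len(rank_columns[0])
    n_metrics = len(rank_columns)
    return [sum(col[i] for col in rank_columns) / n_metrics
            for i in range(n_tfs)]


def permuted_average_ranks(rank_columns, n_permutations=1000, seed=0):
    """Shuffle each rank column independently and pool the average ranks."""
    rng = random.Random(seed)
    null_averages = []
    for _ in range(n_permutations):
        # rng.sample returns a shuffled copy, leaving the input untouched.
        shuffled = [rng.sample(col, len(col)) for col in rank_columns]
        null_averages.extend(average_ranks(shuffled))
    return null_averages


def empirical_rate(observed_avg, null_averages):
    """Fraction of permuted averages at or below the observed average rank."""
    hits = sum(1 for value in null_averages if value <= observed_avg)
    return hits / len(null_averages)
```

With three metrics and ten TFs, 1,000 permutations already yield 10,000 random average ranks, matching the scale of the example on the slide.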

22 Day 5 Outline
Introduction to variations on ChIP-seq methods
Extensions & variations on TFBS analysis
Analyzing published data & across platforms
Downloading & installing programs
Writing your own programs

23 What if you want to know something from a published dataset, but they’ve only provided the raw data on SRA?
Getting data from SRA
Go to:
Find an experiment by searching, e.g. “encode h1-hesc h3k4me3”
Click on the name to the left of the smaller file (1.9M) & then on the downloads tab.
Right click on the ftp link for the run & copy the link location.
Open putty & log in to your account at cluster.uit.tufts.edu
Go to your /cluster/shared/[userID]/chip directory & do: wget [pasted URL]

24 Decoding the .sra format
The SRR sra file you now have is in a special file format, but it does have all the original .fastq information in it. To get that info do:
bsub /cluster/tufts/cbi*/Ch*/ESC*/sra*/bin/fastq-dump SRR sra
[fastq-dump is part of a package of programs for handling .sra files that you can download, unpack & run immediately from your shared directory, at least as far as simple files like fastq-dump are concerned]
This gives you the same .fastq format you’re familiar with. Use head to confirm the format, but then you might as well delete the file with rm so as not to clutter up the cluster.
After this week you are now ready to do any analysis you want on this data, from mapping reads to the genome (w/ bowtie) to peak calling (w/ MACS), to TFBS analysis.
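Beyond eyeballing the first lines with head, a short script can confirm that the dumped file really follows the four-line FASTQ layout. The sketch below assumes unwrapped records (one sequence line and one quality line per read). `read_fastq` is a name made up for this example, not part of the SRA toolkit.

```python
def read_fastq(lines):
    """Yield (name, sequence, quality) from four-line FASTQ records.

    Raises ValueError on a truncated or malformed record.
    """
    it = iter(lines)
    for header in it:
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        # Pull the remaining three lines of this record.
        record = [next(it, None) for _ in range(3)]
        if None in record:
            raise ValueError("truncated FASTQ record: " + header)
        seq, plus, qual = (field.rstrip("\r\n") for field in record)
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch: " + header)
        yield header[1:], seq, qual
```

An open file handle can be passed directly, since iterating over it yields lines, so checking the first few reads of a large file does not require loading it all into memory.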

25 “Liftover” programs to convert between genomes & builds
Several useful tools for this in Cistrome/Galaxy (Liftover/Others):
Convert between RefSeq, Gene Symbols to Entrez IDs using Bioconductor.
Liftover Wig Files: Liftover wig files
[Galaxy] Convert genome coordinates between assemblies and genomes
Extract data from Wiggle: Extract data for certain chromosome from a wiggle file
Extract data from Bed: Extract data for certain chromosome from a BED file
In the UCSC genome browser: Tools -> Liftover
Choose the starting genome/build & the one you want to convert to. Upload a .bed file w/ the ranges you want & hit go (only works for bed files; it may work with bedGraph, although I haven’t confirmed this).

26 Day 5 Outline
Introduction to variations on ChIP-seq methods
Extensions & variations on TFBS analysis
Analyzing published data & across platforms
Downloading & installing programs
Writing your own programs

27 Don’t be intimidated!
There’s nothing to prevent you from installing a program you want to run in your cluster account.
Before you begin, though, type “module available” to see if it’s already installed as a module. Also go to /cluster/tufts/ngsp/ngsp/ to see if it’s installed there.
Read the documentation from the creator’s lab, download, unzip &/or unpack the file, read the INSTALL or README files included, & give it a try.
You may need to be running a specific version of perl or python, etc. If so, check “module available” to see if it’s installed on the cluster & use “module load [name]” to add it.
You may also need to set system variables using “export VARIABLE=$VARIABLE:/new/path”. README files should tell you enough to know what to try.
If you get stuck, the cluster support folks are friendly & helpful (and respond moderately fast). Contact them at:

28 A different integrated package of tools to run in UNIX
HOMER: Software for motif discovery and next-gen sequencing analysis
Mapping to the genome (NOT performed by HOMER, but important to understand)
Creation of tag directories, quality control, and normalization (makeTagDirectory)
UCSC visualization (makeUCSCfile, makeBigWig.pl)
Peak finding / Transcript detection / Feature identification (findPeaks)
Motif analysis (findMotifsGenome.pl)
Annotation of Peaks (annotatePeaks.pl)
Quantification of Data at Peaks/Regions in the Genome/Histograms and Heatmaps (annotatePeaks.pl)
Quantification of Transcripts (analyzeRNA.pl)
Additional analysis strategies:
General sequence manipulation tools (homerTools)
Miscellaneous Tools for Sharing Data between programs, etc. (tagDir2bed.pl, bed2pos.pl, pos2bed.pl ...)
Finding overlapping or differentially bound peaks (mergePeaks, getDifferentialPeaks)
ChIP-Seq analysis automation (analyzeChIP-Seq.pl)
Description of file formats
Could be very useful… & with (only a bit of) luck, you’ll be able to install & run them yourself.

29 Installing a program in R
Check out the Key R Commands link at:
This is not an introduction to programming in R! Instead it gives basic instructions for how to: 1) install & run R packages that may be needed for your research, 2) move data files into R, 3) perform simple edits on this data that may be required by the package & 4) output your results.
Note: I find that the documentation for R packages is generally quite good.

30 Day 5 Outline
Introduction to variations on ChIP-seq methods
Extensions & variations on TFBS analysis
Analyzing published data & across platforms
Downloading & installing programs
Writing your own programs

31 Mastering simple UNIX tools
find, awk, grep, sort, sed & more
One-line commands to let you search and manipulate large data files w/o writing a program or trying to use the kludgy and limited tools in Galaxy.
Find out more at:

32 Programming: Get your feet wet
Perl Tutorials: learn.perl.org/tutorials/ Many tutorials are available if you are interested in learning Perl. These tutorials are introductions.
Beginning Perl (free): This book is for those new to programming who want to learn with Perl.
A ton of Perl programs for you to use/adapt/modify:
For learning R: Check out Josh’s links at:
Also check out my notes on using R (specifically geared to the minimum you need to install & use existing programs) & a brief reference sheet on Perl at

33 Look at examples, check the web…
If you’re looking for a command in UNIX, R, Perl, Python, etc. do a Google search (for R add “statistical” to your search to specify what you mean). If you’re wondering how to get a program to do something, look at other programs & see how they did it. You don’t need to memorize the language, beyond a few basics, just look at what you (or someone else) did before & copy it.

34 Questions. What would you like to explore
Questions? What would you like to explore? What’s the next bioinformatics challenge in your research?

35 Course evaluation forms…

