ChIP-seq Robert J. Trumbly

Department of Biochemistry and Cancer Biology Block Health Science 448, UTHSC

ChIP-seq
ChIP-seq (chromatin immunoprecipitation followed by DNA sequencing) has become the preferred method for analyzing protein-DNA interactions and chromatin structure on a genomic scale.
ChIP-seq has become practical because of rapid developments in NGS (next-generation sequencing).

NGS
The transition from microarrays to NGS creates not just more data but a different type of data.
Microarray data are analog: how much expression (signal) for a gene?
NGS data are digital: e.g., which splicing variant is expressed?

NGS
RNA-seq: can detect splicing variants, allelic expression, novel mRNAs
ChIP-seq: can detect differential binding to allelic variants, leading to information about binding specificity

Park, Oct 2009

NF-kappaB from JASPAR. Multiple sources combined

TFs: sharp binding sites
RNA Pol II: sharp and extended
Histone modifications: extended domains
Park, Oct 2009

Park, Oct 2009

ChIP-seq and RNA-seq analysis
Pepke et al., Nature Methods 6:S22-S

This example shows a workflow for the analysis of data from chromatin immunoprecipitation followed by sequencing (ChIP–seq). This analysis can be done by a bench scientist using current resources, and a similar strategy could be used for other types of next-generation sequencing data. Blue boxes show steps that can be performed using Galaxy. Integration or cross-sectioning of data can often be done in the University of California-Santa Cruz (UCSC) Genome Browser or by joining lists in Galaxy (purple box). Downstream steps, such as known motif analysis and Gene Ontology analysis, can be achieved with online or stand-alone tools (orange boxes). Galaxy can also be used to establish analytical pipelines for calling SNPs that could then be integrated into sequencing-based data, such as reads from ChIP–seq. CEAS, Cis-regulatory Element Annotation System; MACS, Model-based Analysis of ChIP–Seq; TSS, transcription start site.
Hawkins 2010

Furey TS Nat Rev Genet. 2012 Dec;13(12):840-52.

FASTQ files
Output of NGS is usually in FASTQ files, containing millions of short reads. Each read is stored as a four-line record:
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
Line 1: @ followed by sequence id
Line 2: sequence
Line 3: +, sometimes followed by text
Line 4: quality score for each base, encoded as ASCII symbol
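The four-line record layout can be read with a few lines of Python. This is a minimal sketch, assuming strictly four lines per record (no wrapped sequences); the names `FastqRecord` and `read_fastq` are made up for illustration and do not come from any particular library.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    seq_id: str    # line 1 without the leading "@"
    sequence: str  # line 2
    comment: str   # line 3 without the leading "+", often empty
    quality: str   # line 4, one ASCII symbol per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming four lines per record."""
    while True:
        header = handle.readline()
        if not header:            # end of file
            return
        header = header.rstrip("\n")
        if not header:            # tolerate blank lines between records
            continue
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, plus[1:], quality)


# The example record from the slide.
EXAMPLE = (
    "@SEQ_ID\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65\n"
)

records = list(read_fastq(io.StringIO(EXAMPLE)))
for rec in records:
    print(rec.seq_id, len(rec.sequence), "bases")
```

For real data an established parser is the safer choice; the sketch only makes the record structure concrete.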

Quality scores
Phred quality score: Q = -10 log10(p), where p = the probability that the corresponding base call is incorrect.
Example: p = 0.001, log10(0.001) = -3, so Q = -10 × (-3) = 30
For the FASTQ file, an offset of 33 (for the most common encoding) is added to the raw quality score, and the ASCII symbol corresponding to that number is stored and displayed.
There are several variations on the quality score encoding, so programs that interpret the scores must know the proper version.
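The Phred arithmetic and the ASCII offset can be checked directly. The following is a small sketch, assuming the offset-33 encoding described above; the function names are illustrative only.

```python
import math


def phred_from_error_prob(p: float) -> float:
    """Q = -10 * log10(p), where p is the probability of a wrong base call."""
    return -10.0 * math.log10(p)


def error_prob_from_phred(q: float) -> float:
    """Invert the Phred formula: p = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


def decode_quality(quality: str, offset: int = 33) -> list[int]:
    """Turn a FASTQ quality string into integer scores (offset 33 assumed)."""
    return [ord(symbol) - offset for symbol in quality]


def encode_quality(scores: list[int], offset: int = 33) -> str:
    """Turn integer scores back into a FASTQ quality string."""
    return "".join(chr(score + offset) for score in scores)


print(phred_from_error_prob(0.001))  # the worked example: p = 0.001
print(decode_quality("!5I"))         # "!" is the lowest symbol at offset 33
```

A file that uses a different encoding needs a different `offset`, which is why the encoding version has to be known before the scores are interpreted.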

Short-read aligners
Most computationally demanding step: aligning millions of short reads to the genome
Popular programs: Bowtie, BWA
Both programs use the Burrows-Wheeler transform for efficient alignment
Programs run on Unix platforms
Common output files are SAM or BAM (binary SAM for data compression)
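To give a feel for the Burrows-Wheeler transform itself, here is a deliberately naive sketch that builds it by sorting all rotations of a string. It is a toy for short strings only; it is not how Bowtie or BWA implement the transform, since real aligners build compressed indexes of the genome rather than sorting rotations in memory. The `$` terminator and the function names are assumptions of this example.

```python
def bwt(text: str, terminator: str = "$") -> str:
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    if terminator in text:
        raise ValueError("text must not contain the terminator symbol")
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, terminator: str = "$") -> str:
    """Rebuild the original text by repeatedly prepending and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(symbol + row for symbol, row in zip(transformed, table))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


encoded = bwt("GATTACA")
print(encoded, inverse_bwt(encoded))
```

The transform is reversible and tends to group identical symbols together, which is what makes compact, searchable indexes of a genome feasible.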

SAM file
Example of extended CIGAR and the pileup output. (a) Alignments of one pair of reads and three single-end reads. (b) The corresponding SAM file. The @SQ line in the header section gives the order of reference sequences. Notably, r001 is the name of a read pair. According to FLAG 163 (= 1 + 2 + 32 + 128), the read mapped to position 7 is the second read in the pair (128) and regarded as properly paired (1 + 2); its mate is mapped to 37 on the reverse strand (32). Read r002 has three soft-clipped (unaligned) bases. The coordinate shown in SAM is the position of the first aligned base. The CIGAR string for this alignment contains a P (padding) operation which correctly aligns the inserted sequences. Padding operations can be absent when an aligner does not support multiple sequence alignment. The last six bases of read r003 map to position 9, and the first five to position 29 on the reverse strand. The hard clipping operation H indicates that the clipped sequence is not present in the sequence field. The NM tag gives the number of mismatches. Read r004 is aligned across an intron, indicated by the N operation. (c) Simplified pileup output by SAMtools. Each line consists of reference name, sorted coordinate, reference base, the number of reads covering the position and read bases. In the fifth field, a dot or a comma denotes a base identical to the reference; a dot or a capital letter denotes a base from a read mapped on the forward strand, while a comma or a lowercase letter on the reverse strand.
Heng Li et al. Bioinformatics 2009;25: © 2009 The Author(s)
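The FLAG and CIGAR fields are easier to follow once decoded programmatically. The sketch below assumes the bit meanings defined in the SAM specification; the CIGAR string `6M14N5M` is an illustrative value for a spliced alignment, not copied from the figure, and the helper names are hypothetical.

```python
import re

# Bit values of the SAM FLAG field (per the SAM specification).
SAM_FLAGS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read on reverse strand",
    0x20: "mate on reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "failed quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def decode_flag(flag: int) -> list[str]:
    """List the properties whose bits are set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def parse_cigar(cigar: str) -> list[tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"unrecognised CIGAR string: {cigar!r}")
    return ops


def reference_span(cigar: str) -> int:
    """Number of reference bases covered (M, D, N, = and X consume reference)."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MDN=X")


print(decode_flag(163))           # 163 = 1 + 2 + 32 + 128
print(parse_cigar("6M14N5M"))     # illustrative spliced alignment
print(reference_span("6M14N5M"))
```

Decoding 163 this way reproduces the reading given in the caption: a paired, properly paired, second-in-pair read whose mate is on the reverse strand.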

Peak-finding programs
Most popular: Model-based Analysis of ChIP-Seq (MACS)
MACS empirically models the shift size of ChIP-Seq tags to improve the spatial resolution of predicted binding sites.
MACS uses a dynamic Poisson distribution to effectively capture local biases in the genome.
Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nusbaum C, Myers RM, Brown M, Li W, Liu XS. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008;9(9):R citations (3/2016)

MACS model for FoxA1 ChIP-Seq. (a,b) The 5' ends of strand-separated tags from a random sample of 1,000 model peaks, aligned by the center of their Watson and Crick peaks (a) and by the FKHR motif (b).
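The idea behind tag shifting can be sketched with a toy example: forward-strand (Watson) and reverse-strand (Crick) tags pile up on either side of the binding site, the distance d between the two pile-ups is estimated, and every tag is moved d/2 toward its 3' end so that both strands converge on the site. The code below is a simplified illustration under those assumptions, using the most frequent 5' position on each strand as the peak mode; it is not the MACS implementation, and all names and coordinates are made up.

```python
from collections import Counter


def mode(positions: list[int]) -> int:
    """Most frequent position; ties are resolved toward the smaller coordinate."""
    counts = Counter(positions)
    return min(counts, key=lambda pos: (-counts[pos], pos))


def estimate_d(forward_5p: list[int], reverse_5p: list[int]) -> int:
    """Distance between the reverse-strand and forward-strand peak modes."""
    return mode(reverse_5p) - mode(forward_5p)


def shift_tags(forward_5p: list[int], reverse_5p: list[int], d: int):
    """Shift every tag d/2 toward its 3' end (right for forward, left for reverse)."""
    half = d // 2
    return [p + half for p in forward_5p], [p - half for p in reverse_5p]


# Toy data: 5' ends of tags around a hypothetical binding site near position 150.
forward = [100, 100, 101, 99]
reverse = [200, 200, 199, 201]

d = estimate_d(forward, reverse)
shifted_forward, shifted_reverse = shift_tags(forward, reverse, d)
print(d, shifted_forward, shifted_reverse)
```

After shifting, tags from both strands stack up at the same coordinates, which is what sharpens the predicted summit.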

Dynamic Poisson Distribution
Tag distribution along the genome could be modeled by a Poisson distribution.
The advantage of this model is that one parameter, λBG, can capture both the mean and the variance of the distribution.
Strong local biases in tag distributions: tag counts are well correlated between ChIP and control samples (see following figure).
MACS uses a dynamic parameter, λlocal, defined for each candidate peak as:
λlocal = max(λBG, [λ1k,] λ5k, λ10k)
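The effect of the dynamic parameter can be illustrated numerically. The sketch below takes the background rate and the local rates (assumed here to be already scaled to the same window size as the candidate peak), picks the largest as λlocal, and computes the Poisson probability of seeing at least the observed tag count. The numbers are invented, and the functions are illustrative rather than the MACS code.

```python
import math
from typing import Optional


def poisson_upper_tail(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)      # P(X = 0)
    cdf = term
    for i in range(1, k):      # accumulate P(X = 1) .. P(X = k - 1)
        term *= lam / i
        cdf += term
    return max(0.0, 1.0 - cdf)


def lambda_local(lam_bg: float, lam_5k: float, lam_10k: float,
                 lam_1k: Optional[float] = None) -> float:
    """lambda_local = max(lambda_BG, [lambda_1k,] lambda_5k, lambda_10k)."""
    candidates = [lam_bg, lam_5k, lam_10k]
    if lam_1k is not None:     # the bracketed term is optional
        candidates.append(lam_1k)
    return max(candidates)


# Invented rates: the local windows are richer in tags than the genome average.
lam = lambda_local(lam_bg=2.0, lam_5k=3.5, lam_10k=3.0)
observed_tags = 12
print(lam, poisson_upper_tail(observed_tags, lam))
print(poisson_upper_tail(observed_tags, 2.0))  # same count against λBG only
```

Because λlocal is never smaller than λBG, the p-value of a candidate peak in a locally biased region can only become less significant, which is how the dynamic parameter guards against false positives. The `1 - cdf` form loses precision for extremely small p-values, so a statistics library would be preferable in practice.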

(c) The tag count in ChIP versus control in 10 kb windows across the genome. Each dot represents a 10 kb window; red dots are windows containing ChIP peaks and black dots are windows containing control peaks used for FDR calculation.

(e,f) MACS improves the motif occurrence in the identified peak centers (e) and the spatial resolution (f) for FoxA1 ChIP-Seq through tag shifting and λlocal. Peaks are ranked by p-value. The motif occurrence is calculated as the percentage of peaks with the FKHR motif within 50 bp of the peak summit. The spatial resolution is calculated as the average distance from the summit to the nearest FKHR motif. Peaks with no FKHR motif within 150 bp of the peak summit are removed from the spatial resolution calculation.

Integration of External Signaling Pathways with the Core Transcriptional Network in Embryonic Stem Cells
Chen et al., Cell 133, 13 June 2008, Pages 1106–1117
ChIP-seq (chromatin immunoprecipitation coupled with ultra-high-throughput DNA sequencing) was used to map the locations of 13 sequence-specific TFs (Nanog, Oct4, STAT3, Smad1, Sox2, Zfx, c-Myc, n-Myc, Klf4, Esrrb, Tcfcp2l1, E2f1, and CTCF) and 2 transcription regulators (p300 and Suz12).

Figure 1. Genome-Wide Mapping of 13 Factors in ES Cells by Using ChIP-seq Technology. TFBS profiles for the sequence-specific transcription factors and mock ChIP control at the Pou5f1 and Nanog gene loci are shown.

Figure 2. Identification of Enriched Motifs by Using a De Novo Approach. Matrices predicted by the de novo motif-discovery algorithm Weeder.

ChIP-seq tutorial
ChIP-seq Analysis with Galaxy: from reads to peaks (and motifs)
2 - Obtaining the raw data: Accessing ChIP-seq reads from ArrayExpress database
3 - Upload the reads in the Galaxy server
4 - Some statistics on the raw data
5 - Mapping the reads with Bowtie
6 - Peak calling with MACS
7 - Retrieving the peak sequences
8 - Visualize the peak regions in UCSC genome browser
9 - Try to identify over-represented motifs

References
For tutorial: Integration of External Signaling Pathways with the Core Transcriptional Network in Embryonic Stem Cells. Chen et al., Cell Volume 133, 13 June 2008, Pages 1106–1117
The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Cock et al., Nucleic Acids Research, 2010, Vol. 38, No –1771.
Computation for ChIP-seq and RNA-seq studies. Pepke et al., Nature Methods Supplement, Vol. 6 No. 11s, November 2009, S23.
ChIP–seq: advantages and challenges of a maturing technology. Park et al., Nature Reviews Genetics 10, October 2009
Next-generation genomics: an integrative approach. Hawkins et al., Nature Reviews Genetics 11, July 2010
ChIP-seq and beyond: new and improved methodologies to detect and characterize protein-DNA interactions. Furey TS. Nat Rev Genet. 2012 Dec;13(12):840-52.
