ChIP-seq Olivier Elemento, PhD TA: Jenny Giannopoulou, PhD Institute for Computational Biomedicine CSHL High Throughput Data Analysis Workshop, June 2012.

1 ChIP-seq Olivier Elemento, PhD TA: Jenny Giannopoulou, PhD Institute for Computational Biomedicine CSHL High Throughput Data Analysis Workshop, June 2012

2 Plan
  1. ChIP-seq
  2. Quality Control of ChIP-seq data
  3. ChIP-seq Peak detection
  4. Peak Analysis and Interpretation
  5. A few interesting ChIP-seq papers

3 1. ChIP-seq

4 ChIP-seq (figure labels: antibody; transcription factor of interest (or histone modification); Illumina)

5 Control: input DNA (figure label: Illumina). IgG can be used as an additional control.

6 ChIP-seq methodology
– Identify ChIP-grade antibody, determine specificity (Western, histone peptide array)
– Optimize conditions using single-locus ChIP-PCR (positive and negative controls)
– Sequence ChIP product using 1 Illumina lane per sample (no TruSeq ChIP-seq), single end
– Sequence input/IgG as control
Figure: Assessing the specificity of a commercial H3K9me3 antibody using histone peptide arrays, K. Bunting & B. Swed, WCMC. Antibody: Abcam H3K9Me3 rabbit polyclonal (ab8898)

7 Double-stranded ChIP fragment (average length ~170 bp):
ACCAATAACCGAGGCTCATGCTAAGGCGTTAGCCACAGATGGAAGTCCGACGGCTTGATCCAGAATGGTGTGTGGATTGCCTTGGAACTGATTAGTGAATTC
TGGTTATTGGCTCCGAGTACGATTCCGCAATCGGTGTCTACCTTCAGGCTGCCGAACTAGGTCTTACCACACACCTAACGGAACCTTGACTAATCACTTAAG

10 BWA tutorial (for aligning single-end reads to the genome)
– Get genome, e.g., from UCSC, and combine into 1 file
    tar zvfx chromFa.tar.gz
    cat *.fa > wg.fa
– Index the genome
    bwa index -p hg19bwaidx -a bwtsw wg.fa
– Align ChIP reads to reference genome
    bwa aln -t 4 hg19bwaidx s_3_sequence.txt.gz > s_3_sequence.txt.bwa
– Convert to SAM format
    bwa samse hg19bwaidx s_3_sequence.txt.bwa s_3_sequence.txt.gz > s_3_sequence.txt.sam
– Align input reads to same reference genome
    bwa aln -t 4 hg19bwaidx s_4_sequence.txt.gz > s_4_sequence.txt.bwa
– Convert to SAM format
    bwa samse hg19bwaidx s_4_sequence.txt.bwa s_4_sequence.txt.gz > s_4_sequence.txt.sam

11 Reads can map to multiple locations/chromosomes (figure: Read 1 and Read 2 aligned to the reference human genome, hg18)

12 Reads map to one strand or the other (figure: Read 1 and Read 2 aligned to hg18)

13 SAM format
DH1608P1_0130:6:1103:10579:166379#TTAGGC 16 chr M * 0 0 GGGCGTGACTCTGATCTCAGGCATCGTCTCCGCCGCGCTCCCGGACCCGCG eb`XXYbZdadee^ceV]X][ccTcc^ebeece eeeWbeeeeeeeceeaee XX:Z:NM_017871,32 NM:i:0 MD:Z:51
DH1608P1_0130:6:1102:3415:150915#TTAGGC 16 chr M * 0 0 GGGCGGGACTCTGATCTCAGGCATCGTCTCCGCCGCGCTCCCGGACCCGCG BBBBBBBBBBBac]bbbceedaeddeZceeea_ba_\_eee eeeedaeeee XX:Z:NM_017871,32 NM:i:1 MD:Z:5T45
DH1608P1_0130:6:1102:13118:62644#TTAGGC 16 chr M * 0 0 GGGCGTGCCTCGGATCTCAGGCATCGTCTCCGCCGCGCTCCCGGACCCGCG BBBBBBBBBBBBBBBBBBBBB`XTbSa`cffegdggeccbe effdeggggg XX:Z:NM_017871,32 NM:i:2 MD:Z:7A3T39
DH1608P1_0130:6:1203:3012:157120#TTAGGC 16 chr M * 0 0 AAGGCCGTGACTCTGATCTCAGCCCTCGTCTCCGCCGCGCTCCCGGACCCG BBBBBBBB^`QWZZ]UXYSZSTFRU]Z__SO[adcc[acdV \`Y]YWY][_ XX:Z:NM_017871,34 NM:i:3 MD:Z:4G17G1A26
DH1608P1_0130:6:2206:4445:12756#TTAGGC 16 chr M3487N50M * 0 0 CCAAAGGGTGTGACTCTGATCTCGGGCATCGTCTCCGCCGCGCTCCCGGAC BBBBBBBBBBBBBBBBBBBBBBBB`YdddYdc\ cacaNddddcdddaeeee XX:Z:NM_017871,37 NM:i:3 MD:Z:2C5C14A27
DH1608P1_0130:6:2203:7903:43788#TTAGGC 16 chr M3487N50M * 0 0 CCCAAGGGCGTGACTCTGATCTCAGGCATCGTCTCCGCCGCGCTCCCGGAC adbe[fbcbccb_cb^cb^^c^edgegggggdf ggefffgfbfggggegeg XX:Z:NM_017871,37 NM:i:0 MD:Z:51
– MD tag, e.g. MD:Z:4T46 = 4 matches, 1 mismatch (reference base is T), 46 matches
– CIGAR string, e.g. 5M3487N46M = 5 bp aligned block, 3487 bp skipped on the reference (a gap such as an intron, not an insertion), 46 bp aligned block
– XT tag, e.g. XT:A:U = unique mapper; XT:A:R = more than one high-scoring match
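
The CIGAR and MD fields can be decoded with a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, and it follows the SAM specification, in which `N` is a region skipped on the reference and the letters in an MD value are reference bases.

```python
import re

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def read_length(cigar):
    """Read bases covered by the CIGAR (M, I, S, = and X consume the read)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")

def reference_span(cigar):
    """Reference bases spanned (M, D, N, = and X consume the reference)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")

def parse_md(md):
    """Split an MD value (without the 'MD:Z:' prefix) into match runs (ints),
    mismatched reference bases (single letters) and deletions ('^' + bases)."""
    tokens = re.findall(r"\d+|\^[A-Za-z]+|[A-Za-z]", md)
    return [int(t) if t.isdigit() else t for t in tokens]

print(parse_cigar("5M3487N46M"))     # [(5, 'M'), (3487, 'N'), (46, 'M')]
print(read_length("5M3487N46M"))     # 51
print(reference_span("5M3487N46M"))  # 3538
print(parse_md("4T46"))              # [4, 'T', 46]
```

The 51 read bases agree with the 51-nt reads in the records above, while the alignment spans 3538 bp of reference because of the skipped region.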

14 Quality Control

15 Clonal reads

16 Fragment size analysis

17

18 Fragment size analysis using opposite strand autocorrelation
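
The slide gives only the title, so the following is a minimal sketch of one common way to do this, not necessarily the method used in the workshop. It assumes that reads are summarized by their 5' end positions per strand, and that forward-strand 5' ends sit roughly one fragment length upstream of reverse-strand 5' ends. A simple coincidence count stands in for a true correlation coefficient, and all names are hypothetical.

```python
from collections import Counter

def strand_cross_correlation(fwd_starts, rev_starts, max_shift):
    """For each shift d, count forward 5' ends that land on a reverse 5' end
    after being moved d bp downstream."""
    fwd_counts = Counter(fwd_starts)
    rev_counts = Counter(rev_starts)
    scores = {}
    for d in range(max_shift + 1):
        scores[d] = sum(n * rev_counts.get(pos + d, 0)
                        for pos, n in fwd_counts.items())
    return scores

def estimate_fragment_size(fwd_starts, rev_starts, max_shift=500):
    """Return the shift with the highest coincidence count."""
    scores = strand_cross_correlation(fwd_starts, rev_starts, max_shift)
    return max(scores, key=scores.get)

# Toy data: five fragments of 170 bp, one read from each end.
forward = [100, 220, 305, 1000, 1040]
reverse = [pos + 170 for pos in forward]
print(estimate_fragment_size(forward, reverse))  # 170
```

On real data the peak is broader, so the counts are usually computed on binned or smoothed positions.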

19 Fragment size analysis

20 GC-content analysis

21 GC-content analysis

22 Other QC measures
– Number of peaks: 0 or very few peaks, even at permissive peak calling thresholds = bad experiment
– Motif enrichment: is the expected motif enriched in peaks?

23 ChIP-seq peak calling

24 MACS

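25 The Poisson distribution (MACS)
– λ = expected # of reads within an interval of 2d bp
– Estimate d based on high-quality peaks
– In R, P(X>=5|λ=0.001) is ppois(4, 0.001, lower.tail = FALSE). The form 1-sum(dpois(0:4, 0.001)) is equivalent in exact arithmetic, but for a tail this small the subtraction from 1 loses all precision.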
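
The same tail probability can be computed in Python without subtracting from 1. This is a minimal sketch with a hypothetical helper name: it sums the tail terms P(X = k), P(X = k+1), ... directly, which stays accurate even when the answer is far below double-precision resolution near 1.

```python
import math

def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), summed directly over the tail."""
    if k <= 0:
        return 1.0
    if lam <= 0:
        return 0.0
    # First tail term P(X = k), built in log space to avoid overflow in k!.
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    i = k
    # Keep adding terms until they no longer change the sum.
    while term > total * 1e-17:
        total += term
        i += 1
        term *= lam / i
    return total

print(poisson_upper_tail(5, 0.001))  # on the order of 8e-18
```

For λ = 0.001 the tail is roughly 8e-18, which is smaller than the spacing between doubles near 1; that is why subtracting the lower tail from 1 is unreliable here.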

26 BayesPeak

27 BayesPeak (Bayesian Hidden Markov Models): parameters estimated using a Bayesian treatment (figure labels: observed variable, hidden states)

28 BayesPeak

29 Peak detection using ChIPseeqer (Elemento and Giannopoulou, 2011)

30

31 A nice peak

32 Not all peaks are that nice

33 Peak detection
– Calculate read count at each position (bp) in the genome (we don't use a sliding window)
– Determine if read count is greater than expected (at each position, bp)

34 Peak detection
We need to correct for input DNA reads (control):
– non-uniformly distributed (form peaks too)
– vastly different numbers of reads between ChIP and input

35 Expected read count = total number of reads * extended fragment length / chr length
– Fragment length: use Bioanalyzer (remove adapter lengths)
(Figure: read count and expected read count along the genome)

36 Is the observed read count at a given genomic position greater than expected?
– x = observed read count, λ = expected read count
– Read counts follow a Poisson distribution
(Figure: the Poisson distribution, frequency versus read count)

37 Is the observed read count at a given genomic position greater than expected?
– x = 10 reads (observed), λ = 0.5 reads (expected)
– P(X>=10) = 1.7 x 10^-10
– log10 P(X>=10) = -9.77
– -log10 P(X>=10) = 9.77
– In R, P(X>=10|λ=0.5) is 1-sum(dpois(0:9, 0.5))

38 Expected read count = total number of reads * extended frag len / chr len (figure tracks: read count, expected read count, -Log(p))

39 Expected read count = total number of reads * extended frag len / chr len (figure tracks: read count, expected read count, input reads, -Log(p))

40 Score at each genome position (bp) = [-Log(P c)] - [-Log(P i)], compared against a threshold (figure tracks for ChIP and INPUT: read count, expected read count, -Log(P c) and -Log(P i))

41 Normalized peak score (at each bp): R = -log10( P(X ChIP) / P(X input) )
– Will detect peaks with high read counts in ChIP, low in input
– Works when there is no input DNA (x=0)
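
As a rough illustration of this score, the sketch below computes R for a single position as the difference between the ChIP and input -log10 Poisson tail probabilities. It is not the ChIPseeqer implementation; the function names and the example counts are assumptions, and the tail is evaluated in log space so that very small p-values do not underflow.

```python
import math

def neg_log10_poisson_tail(x, lam):
    """-log10 P(X >= x) for X ~ Poisson(lam), assuming lam > 0."""
    if x <= 0:
        return 0.0  # P(X >= 0) = 1
    # ln P(X = x), the first term of the tail.
    log_first = -lam + x * math.log(lam) - math.lgamma(x + 1)
    # Sum of the remaining terms relative to the first one:
    # 1 + lam/(x+1) + lam^2/((x+1)(x+2)) + ...
    ratio_sum, term, i = 1.0, 1.0, x
    while term > 1e-17 * ratio_sum:
        i += 1
        term *= lam / i
        ratio_sum += term
    return -(log_first + math.log(ratio_sum)) / math.log(10)

def peak_score(chip_count, chip_expected, input_count, input_expected):
    """R = [-log10 P(X_ChIP)] - [-log10 P(X_input)] at one position."""
    return (neg_log10_poisson_tail(chip_count, chip_expected)
            - neg_log10_poisson_tail(input_count, input_expected))

# 10 ChIP reads where 0.5 are expected, no input reads at this position.
print(round(peak_score(10, 0.5, 0, 0.5), 2))  # 9.77
```

When the input shows the same enrichment as the ChIP sample, the two terms cancel and R is 0, so input peaks are not reported.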

42 Mappability

43 Non-mappable fraction of the genome
– We enumerated all 30-mers, counted # occurrences, and calculated the non-unique fraction of the genome
– Unique/mappable fraction = 1 – non-unique fraction
(Table: non-mappable fraction for each of the 25 chromosomes, including chrX, chrY and chrM; surviving values are 12% for the first chromosome listed, chrM 4628/ and 74% for chrY)
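
The enumeration can be sketched on a toy sequence. This example is deliberately simplified and is an assumption about the details: it uses k = 4 instead of 30, looks at one strand only (no reverse complements), and holds everything in memory, which would not scale to a whole genome.

```python
from collections import Counter

def non_unique_fraction(sequence, k):
    """Fraction of k-mer start positions whose k-mer occurs more than once."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    if not kmers:
        raise ValueError("sequence is shorter than k")
    counts = Counter(kmers)
    repeated = sum(1 for kmer in kmers if counts[kmer] > 1)
    return repeated / len(kmers)

toy = "ACGTACGTTTGCA"  # ACGT occurs twice among the ten 4-mers
non_unique = non_unique_fraction(toy, 4)
print(non_unique)      # 0.2
print(1 - non_unique)  # unique/mappable fraction, 0.8
```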

44 Expected read count = total number of reads * extended frag len / (chr len * mappable fraction) (figure tracks: read count, expected read count, -Log(p))
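
A short worked example of this formula follows. The numbers are made up for illustration (1 million reads, 170 bp fragments, a 100 Mb chromosome, 88% mappable), and the function name is hypothetical.

```python
def expected_read_count(total_reads, fragment_length, chrom_length,
                        mappable_fraction=1.0):
    """Expected reads covering one position, corrected for mappability."""
    return total_reads * fragment_length / (chrom_length * mappable_fraction)

# Hypothetical chromosome: 1,000,000 reads, 170 bp fragments, 100 Mb long.
print(expected_read_count(1_000_000, 170, 100_000_000))        # about 1.7
print(expected_read_count(1_000_000, 170, 100_000_000, 0.88))  # about 1.93
```

Dividing by the mappable fraction raises the expected count, because the same reads are spread over a smaller usable portion of the chromosome.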

45 Peak detection
– Determine all genomic regions with R >= 15
– Merge peaks separated by less than 100 bp
– Output all peaks with length >= 100 bp
– Process 23M reads in < 5 min
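
The three steps above can be sketched as follows, assuming the per-bp scores R are already available as a list. This is an illustration rather than the ChIPseeqer code: the function name, the half-open interval convention and the toy score track are all assumptions.

```python
def call_peaks(scores, threshold=15.0, max_gap=100, min_length=100):
    """Return (start, end) half-open intervals for peaks in a per-bp score list."""
    # 1. Contiguous runs of positions with score >= threshold.
    regions = []
    start = None
    for pos, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = pos
        elif start is not None:
            regions.append((start, pos))
            start = None
    if start is not None:
        regions.append((start, len(scores)))

    # 2. Merge regions separated by fewer than max_gap bp.
    merged = []
    for region_start, region_end in regions:
        if merged and region_start - merged[-1][1] < max_gap:
            merged[-1] = (merged[-1][0], region_end)
        else:
            merged.append((region_start, region_end))

    # 3. Keep only peaks of at least min_length bp.
    return [(s, e) for s, e in merged if e - s >= min_length]

# Toy track: two nearby regions that merge, plus one short isolated region.
track = [0] * 50 + [20] * 60 + [0] * 30 + [18] * 40 + [0] * 200 + [25] * 20 + [0] * 10
print(call_peaks(track))  # [(50, 180)]
```

The two regions 30 bp apart are merged into a 130 bp peak, while the isolated 20 bp region is dropped by the length filter.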

46 BCL6 ChIP-seq
– Lymphoma cell line (OCI-Ly1)
– Illumina, 6 GA2x lanes for ChIP, 1 for input DNA, 1 for QC
– 36 nt long sequences
– 32 million reads
– Aligned/mapped to hg18 with BWA
– With Melnick lab at WCMC

47 BCL6: 18,814 peaks; 80% are within 20 kb of a known gene (figure tracks: ChIP reads, input reads, detected peaks)

48

49 Loading peaks into GRange
# split the aligned reads into ChIP and input directories
system("split_samfile s_1_sequence.txt.sam -outdir CHIP/")
system("split_samfile s_2_sequence.txt.sam -outdir INPUT/")
# call peaks
system("ChIPseeqer.bin -chipdir CHIP -inputdir INPUT -t 15 -fold 2 -outfile peaks.txt")
# read the peak table and wrap it in a RangedData object
tpeaks = read.table(paste(dataFolder, "peaks.txt", sep = ""), header = F)
peaks = RangedData(ranges = IRanges(start = tpeaks[, 2], end = tpeaks[, 3]), space = tpeaks[, 1], summit = tpeaks[, 6], score = tpeaks[, 5])
...

50 Other peak finders

51 Promoter-based analysis (not peak-based): maximum peak height in the 2 kb promoter, for all TSS (figure: peak heights h1, h2, h3, h4, h5, … across promoters)
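
A minimal sketch of this promoter-based summary is shown below. The slide does not say how the 2 kb window is placed, so centring it on the TSS and ignoring strand are assumptions, as are the function name and the toy coverage values.

```python
def promoter_max_heights(coverage, tss_positions, promoter_length=2000):
    """Maximum per-bp coverage in a window of promoter_length bp centred on each TSS."""
    half = promoter_length // 2
    heights = {}
    for gene, tss in tss_positions.items():
        start = max(0, tss - half)
        end = min(len(coverage), tss + half)
        heights[gene] = max(coverage[start:end], default=0)
    return heights

# Toy chromosome of 5 kb with three covered positions.
coverage = [0] * 5000
coverage[1200] = 7    # inside the geneA window (500-2500)
coverage[2700] = 30   # outside both windows
coverage[3500] = 12   # inside the geneB window (3000-5000)
print(promoter_max_heights(coverage, {"geneA": 1500, "geneB": 4000}))
# {'geneA': 7, 'geneB': 12}
```

Every gene gets a value, including genes without a called peak, which is what distinguishes this from a peak-based analysis.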

52 4. Peak analysis and interpretation

53 Gene-based peak annotation

54 Integration of multiple peak lists: RangedData in R

55 Conservation analysis
fixedStep chrom=chr1 start= step=
fixedStep chrom=chr1 start= step=

56 What is the cis-regulatory code of each factor? Do they require any co-factors? (figure: DNA, activation, repression)

57

58

59 Discovering regulatory sequences associated with peak regions
– Target regions (true TF binding peak? Yes) are compared with random regions (No)
– For each motif: motif absent/present versus true TF peak no/yes
– Motif correlation is quantified using the mutual information

60 Motif Search Algorithm
k-mers are ranked by MI, from highly informative (MI=0.081, MI=0.045, MI=0.040, ...) to not informative.
k-mers shown: CTCATCG, TCATCGC, AAAATTT, GATGAGC, AAAAATT, ATGAGCT, TTGCCAC, TGCCACC, ATCTCAT, ACGCGCG, CGACGCG, TACGCTA, ACCCCCT, CCACGGC, TTCAAAA, AGACGCG, CGAGAGC, CTTATTA
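
To make the ranking concrete, here is a small sketch that scores every k-mer by the mutual information between its presence in a sequence and the sequence's label (target versus random). It is not FIRE itself: the function names and the four toy sequences are invented for illustration, and it ignores reverse complements and degenerate motifs.

```python
import math
from collections import Counter

def mutual_information(presence, labels):
    """Mutual information (in bits) between two equally long binary sequences."""
    n = len(presence)
    joint = Counter(zip(presence, labels))
    presence_counts = Counter(presence)
    label_counts = Counter(labels)
    mi = 0.0
    for (p, l), count in joint.items():
        mi += (count / n) * math.log2(
            count * n / (presence_counts[p] * label_counts[l]))
    return mi

def rank_kmers(sequences, labels, k):
    """Score every k-mer found in the sequences; most informative first."""
    kmers = {seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1)}
    scored = [(mutual_information([kmer in seq for seq in sequences], labels), kmer)
              for kmer in kmers]
    return sorted(scored, reverse=True)

# Two target (peak) sequences sharing CTCATCG, and two random sequences.
sequences = ["AACTCATCGTT", "GGCTCATCGAA", "AAAATTTGGCC", "TTGACCAGTAC"]
labels = [True, True, False, False]
best_mi, best_kmer = rank_kmers(sequences, labels, 7)[0]
print(best_kmer, best_mi)  # CTCATCG 1.0
```

A k-mer present in every target and no random sequence reaches the maximum of 1 bit for balanced labels, while k-mers seen in a single sequence score much lower.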

61 Motif co-occurrence analysis (figure: discovered motifs, enrichment and depletion). FIRE automatically compares discovered motifs to known motifs in TRANSFAC and JASPAR

62 5. A few interesting papers

63 First ChIP-seq paper

64 Epigenetic modifications at enhancer regions

65 Chromatin states

66 Nucleosome localization

67 Whole-genome nucleosome location mapping in B cells (Yanwen Jiang, PhD): Principal Component Analysis of nucleosome profiles

