[BejeranoWinter12/13] 1 MW 11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim Notwell & Harendra Guturu CS173 Lecture 15:


1 [BejeranoWinter12/13] 1 MW 11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim Notwell & Harendra Guturu CS173 Lecture 15: TF Motifs (Harendra)

2 Project milestones due Today [BejeranoWinter12/13] 2 Announcements

3 Review: Transcriptional regulation of genes Transcription Start Site (TSS) Thousands of transcription factor-CRM interactions that control gene expression in each cell type [BejeranoWinter12/13]3 Enhancer (CRM)

4 [BejeranoWinter12/13] 4 Last Time: ChIP-Seq - a first glimpse of the regulatory genome in action Cis-regulatory peak Peak Calling

5 [BejeranoWinter12/13] 5 Last Time: Infer functions of ChIP-seq binding profile using GREAT. GREAT = Genomic Regions Enrichment of Annotations Tool. Figure labels: gene transcription start site; SRF binding ChIP-seq peak; ontology term (e.g. ‘actin cytoskeleton’). p = 0.33 of genome annotated with the term; n = 6 genomic regions; k = 5 genomic regions hit annotation. P = Pr_binom(k ≥ 5 | n = 6, p = 0.33)
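The binomial tail on this slide can be evaluated directly. The following is a minimal sketch, not GREAT's own code; the function name `binomial_tail` is hypothetical, and the inputs are the toy values from the GREAT review slides (n = 6 regions, with p = 0.33 and k ≥ 5 for ‘actin cytoskeleton’, and p = 0.5 and k ≥ 4 for ‘actin binding’).

```python
from math import comb


def binomial_tail(k: int, n: int, p: float) -> float:
    """Return Pr(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Toy values from the slides: 6 regions, annotation covers a fraction p of the genome.
p_actin_cytoskeleton = binomial_tail(5, 6, 0.33)  # k >= 5 of n = 6
p_actin_binding = binomial_tail(4, 6, 0.5)        # k >= 4 of n = 6
```

A smaller tail probability means the observed number of hits is harder to explain by the annotation's genomic coverage alone.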

6 [BejeranoWinter12/13] 6 GREAT gives you a table of functions. Top GREAT enrichments of SRF, grouped by ontology, with experimental support*:
Gene Ontology: actin cytoskeleton; actin binding (Miano et al)
Pathway Commons: TRAIL signaling; Class I PI3K signaling (Bertolotto et al; Poser et al)
TreeFam: FOS gene family (Chai & Tarnawski 2002)
TF Targets: Targets of SRF; Targets of GABP; Targets of YY1; Targets of EGR1 (positive control; ChIP-Seq support; Natesan & Gilman)
* Known from literature: the function is known, SOME of the genes are known, and the binding sites highlighted are NOT

7 [BejeranoWinter12/13] 7 Last Time: Infer functions of ChIP-seq binding profile using GREAT. GREAT = Genomic Regions Enrichment of Annotations Tool. Figure labels: gene transcription start site; SRF binding ChIP-seq peak; ontology term (e.g. ‘actin binding’). p = 0.5 of genome annotated with the term; n = 6 genomic regions; k = 4 genomic regions hit annotation. P = Pr_binom(k ≥ 4 | n = 6, p = 0.5)

8 [BejeranoWinter12/13] 8 GREAT gives you a table of functions (the same top GREAT enrichments of SRF as on slide 6)

9 [BejeranoWinter12/13] 9 GREAT gives you a table of functions (the same top GREAT enrichments of SRF as on slide 6) Different

10 [BejeranoWinter12/13] 10 But doing the experiment is the hard part! Hard or impossible to get the required cells: some cells don’t occur in enough quantity to ChIP; others are hard to dissect; certain human tissues are hard to obtain. Hard to get a good antibody. Ex: We have ChIP results for a factor in brain. We have not been able to repeat it since we can’t find the same antibody. Lots of time and money to do one experiment. Only information for one context (cell type or time). Can we computationally predict the binding sites for many contexts and factors?

11 [BejeranoWinter12/13] 11 Recall: TFBS Position Weight Matrix (PWM). Experimentally determined sites: ATGGCATG, AGGGTGCG, ATCGCATG, TTGCCACG, ATGGTATT, ATTCGACG, AGGGCGTT, ATGACATG, ATGGCATG, ACTGGATG. Consensus: ATGGCATG. The slide shows the alignment (count) matrix and the frequency weight matrix built from these sites (rows A, C, G, T). Can we use a PWM to predict where the TF will bind in the genome (without doing ChIP-seq)?
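As a sketch of how the matrices on this slide are derived, the code below tallies the ten listed sites into a count matrix, normalizes it to frequencies, and reads off the consensus. The function names are hypothetical, and no pseudocounts are applied, which is an assumption; the slide's own matrix values did not survive in this transcript.

```python
SITES = [
    "ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
    "ATTCGACG", "AGGGCGTT", "ATGACATG", "ATGGCATG", "ACTGGATG",
]
BASES = "ACGT"


def count_matrix(sites):
    """Count each base at each position; returns {base: [count per position]}."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    counts = {base: [0] * length for base in BASES}
    for site in sites:
        for position, base in enumerate(site):
            counts[base][position] += 1
    return counts


def frequency_matrix(counts):
    """Divide each count by the number of sites (no pseudocounts: an assumption)."""
    n_sites = sum(counts[base][0] for base in BASES)
    return {base: [c / n_sites for c in counts[base]] for base in BASES}


def consensus(counts):
    """Most frequent base at each position."""
    length = len(counts["A"])
    return "".join(max(BASES, key=lambda b: counts[b][i]) for i in range(length))


counts = count_matrix(SITES)
freqs = frequency_matrix(counts)
```

For these sites the consensus comes out as ATGGCATG, matching the slide.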

12 [BejeranoWinter12/13]12 Binding Site Prediction using Match Problem: High number of false positives.
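To make the false-positive problem concrete, here is a minimal sketch of PWM scanning with a score threshold. It is not the Match tool's scoring scheme; it uses a plain log-odds score against a uniform background, a pseudocount of 1, and the forward strand only, all of which are assumptions, and the function names are hypothetical.

```python
from math import log2

SITES = [
    "ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
    "ATTCGACG", "AGGGCGTT", "ATGACATG", "ATGGCATG", "ACTGGATG",
]
BASES = "ACGT"


def log_odds_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a list of {base: log2(freq / background)} dicts, one per position."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for position in range(len(sites[0])):
        column = [site[position] for site in sites]
        pwm.append({
            base: log2(((column.count(base) + pseudocount) / total) / background)
            for base in BASES
        })
    return pwm


def scan(sequence, pwm, threshold):
    """Return (start, window, score) for every window scoring >= threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in BASES for base in window):
            continue  # skip windows containing N or other symbols
        score = sum(column[base] for column, base in zip(pwm, window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


pwm = log_odds_pwm(SITES)
hits = scan("CCCATGGCATGCCC", pwm, threshold=5.0)
```

Lowering the threshold recovers weaker true sites but also admits many more chance matches, which is the trade-off behind the high false-positive count.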

13 [BejeranoWinter12/13] 13 Recall: TFBS Position Weight Matrix (PWM). Experimentally determined sites: ATGGCATG, AGGGTGCG, ATCGCATG, TTGCCACG, ATGGTATT, ATTCGACG, AGGGCGTT, ATGACATG, ATGGCATG, ACTGGATG. Consensus: ATGGCATG. The slide shows the alignment (count) matrix and the frequency weight matrix (rows A, C, G, T) with the information content of each column. Information content of a motif = sum of all columns = 6.0
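One common way to compute per-column information content is 2 bits minus the Shannon entropy of the column. The sketch below assumes a uniform background and no small-sample correction, and its function names are hypothetical. Because the slide's matrix values are not recoverable from this transcript, the total it gives for the ten listed sites need not equal the 6.0 quoted on the slide.

```python
from math import log2

SITES = [
    "ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
    "ATTCGACG", "AGGGCGTT", "ATGACATG", "ATGGCATG", "ACTGGATG",
]


def column_information(column_counts):
    """Information content in bits: 2 - H(column), uniform background assumed."""
    total = sum(column_counts.values())
    entropy = 0.0
    for count in column_counts.values():
        if count > 0:  # 0 * log2(0) is taken as 0
            p = count / total
            entropy -= p * log2(p)
    return 2.0 - entropy


def motif_information(sites):
    """Sum the per-column information content over all motif positions."""
    total_bits = 0.0
    for position in range(len(sites[0])):
        counts = {base: 0 for base in "ACGT"}
        for site in sites:
            counts[site[position]] += 1
        total_bits += column_information(counts)
    return total_bits


total_bits = motif_information(SITES)
```

A fully conserved column contributes 2 bits and a column with equal base frequencies contributes 0 bits, so longer and more conserved motifs accumulate more information.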

14 [BejeranoWinter12/13]14 Information content is a measure of motif specificity SRF REST SPIB (IC ~ 12) (IC ~ 5) (IC ~ 25) How do these compare to a library of many PWMs?

15 [BejeranoWinter12/13]15 PWMs have a range of information content SRF REST SPIB

16 [BejeranoWinter12/13] 16 Measure of motif specificity: information content determines how accurately we can predict the binding site. SRF: 2 million

17 [BejeranoWinter12/13] 17 Measure of motif specificity: information content determines how accurately we can predict the binding site. SRF: 2 million matches to the SRF motif, but ChIP-seq and other estimates suggest ≈ 10,000 actual binding sites. Can we do better?
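A rough back-of-the-envelope estimate shows why a motif of modest information content matches so often by chance. The sketch assumes a uniform base composition, a genome of about 3 billion bases scanned on both strands, and an illustrative information content of 12 bits; none of these values is taken from the slide, and the function name is hypothetical.

```python
def expected_chance_matches(genome_length, information_bits, strands=2):
    """Approximate chance matches: one per 2**information_bits positions per strand."""
    return strands * genome_length * 2.0 ** (-information_bits)


# Assumed values for illustration only.
estimate = expected_chance_matches(genome_length=3_000_000_000, information_bits=12)
```

Under these assumptions the estimate is on the order of a million chance matches, the same order of magnitude as the 2 million motif matches quoted on the slide and far above the roughly 10,000 actual binding sites.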

18 [BejeranoWinter12/13] 18 Use excess conservation to improve prediction accuracy Aaron Shoa Wenger et al., PRISM offers a comprehensive genomic approach to transcription factor function prediction. 2013

19 [BejeranoWinter12/13] 19 Use shuffled motifs to calculate confidence of excess conservation binding site prediction. Transcription factor motif: genome-wide binding site predictions. 10 shuffled transcription factor motifs: genome-wide binding site predictions. Plot: fraction conserved vs. branch length (subst / site), with curves for real and shuffled motifs. Confidence is the fraction conserved in excess: excess = 0.12, total = 0.32, confidence = excess / total.

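20 [BejeranoWinter12/13] 20 Probabilistic interpretation: confidence is the probability that a motif instance is functional given its observed conservation. R: real motif; S: average shuffled motif; C: conservation as branch length (subst / site); F: functional. Plot example: Pr_R(C ≥ 1.5) = 0.2.

$$
\begin{aligned}
\Pr_R(\text{functional} \mid C \ge c) &= 1 - \Pr_R(\text{not functional} \mid C \ge c) \\
&= 1 - \frac{\Pr_R(C \ge c \mid \text{not } F)\,\Pr_R(\text{not } F)}{\Pr_R(C \ge c)} \\
&= 1 - \frac{\Pr_S(C \ge c)\,\Pr_R(\text{not } F)}{\Pr_R(C \ge c)} \\
&= \frac{\Pr_R(C \ge c) - \Pr_S(C \ge c)\,\Pr_R(\text{not } F)}{\Pr_R(C \ge c)} \\
&\approx \frac{\Pr_R(C \ge c) - \Pr_S(C \ge c)}{\Pr_R(C \ge c)} = \frac{\text{excess}}{\text{total}}
\end{aligned}
$$

The second line is Bayes' rule, the third uses the shuffled motifs as the model for non-functional instances, and the final approximation amounts to treating Pr_R(not F) as close to 1.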
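The excess / total ratio can be sketched in code as follows. This is an illustration under stated assumptions, not the PRISM implementation: each motif instance is represented only by its conservation value (branch length), the shuffled fraction is averaged over the shuffled motif sets, and negative excess is clamped to zero, which is a choice made here. All names are hypothetical.

```python
def fraction_at_least(values, c):
    """Fraction of motif instances whose conservation is >= c."""
    return sum(1 for v in values if v >= c) / len(values)


def confidence(real_conservation, shuffled_conservation_sets, c):
    """Excess / total at threshold c; shuffled fraction averaged over shuffles."""
    total = fraction_at_least(real_conservation, c)
    if total == 0:
        return 0.0
    shuffled = sum(
        fraction_at_least(values, c) for values in shuffled_conservation_sets
    ) / len(shuffled_conservation_sets)
    excess = total - shuffled
    return max(0.0, excess / total)  # clamping at 0 is an assumption of this sketch


# Toy data reproducing the fractions on the previous slide: total 0.32, shuffled 0.20.
real = [2.0] * 32 + [0.5] * 68
shuffles = [[2.0] * 20 + [0.5] * 80 for _ in range(10)]
example = confidence(real, shuffles, c=1.5)
```

With a total of 0.32 and an excess of 0.12, the confidence is 0.12 / 0.32 = 0.375.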

21 [BejeranoWinter12/13] 21 Excess conservation score defined by genomic background http://cs173.stanford.edu

22 Excess conservation score also defined by motif [BejeranoWinter12/13]22

23 ARE THE PREDICTIONS ANY GOOD? Perform genome-wide binding site predictions… [BejeranoWinter12/13]23

24 [BejeranoWinter12/13]24 Use ChIP-seq overlap as a measure of sensitivity Genome-wide binding site predictions for one factor (Ex: E2F4) ChIP-seq for same factor (Ex: E2F4) Sensitivity = Overlapping ChIP-peaks / Total ChIP-peaks But how do you assess if your overlap is good? Compare to the best tool out there (or all the tools, if there is no “best”)
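The sensitivity measure on this slide can be sketched with simple interval overlap. The tuple layout (chromosome, start, end) with half-open coordinates and the function names are assumptions, and the nested loop is quadratic, so a real genome-wide comparison would use sorted intervals or an interval tree instead.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def sensitivity(chip_peaks, predictions):
    """Overlapping ChIP peaks / total ChIP peaks."""
    if not chip_peaks:
        raise ValueError("need at least one ChIP-seq peak")
    overlapping = sum(
        1 for peak in chip_peaks
        if any(overlaps(peak, site) for site in predictions)
    )
    return overlapping / len(chip_peaks)


# Toy coordinates for illustration only.
peaks = [("chr1", 100, 300), ("chr1", 1000, 1200), ("chr2", 50, 250), ("chr2", 900, 1100)]
predicted_sites = [("chr1", 150, 158), ("chr2", 240, 248), ("chr3", 10, 18)]
value = sensitivity(peaks, predicted_sites)
```

In the toy data two of the four peaks contain a predicted site, giving a sensitivity of 0.5. Sensitivity alone rewards predicting more sites, which is why the slide compares against other tools.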

25 [BejeranoWinter12/13] 25 Excess conservation binding site prediction is more accurate than existing methods (prior state of the art) http://cs173.stanford.edu

26 [BejeranoWinter12/13] 26 Excess conservation captures binding site profile similar to ChIP-seq. Plot: conservation (% identity) for ChIP-seq, MotifMap, and PRISM

27 [BejeranoWinter12/13] 27 Submit predictions to GREAT. Now we have good genome-wide binding site predictions for many factors. Let's submit them to GREAT and find out what they are doing…

28 [BejeranoWinter12/13] 28 Comparing binding site prediction to ChIP-seq http://cs173.stanford.edu
Transcription factor | Ontology | Top-ranked biological context | GREAT rank for ChIP-seq | Experimental support
GABPA | GO Biological Process | translation | 2 | (Genuario and Perry, 1996)
GABPA | GO Cellular Component | membrane coat | 14 | Novel
GABPA | GO Molecular Function | translation initiation factor activity | 4 | (Genuario and Perry, 1996)
GABPA | Mouse Phenotypes | increased single-positive T cell number | None | (Yu et al., 2010)
GABPA | PANTHER Pathway | general transcription by RNA polymerase I | 1 | (Hauck et al., 2002)
GABPA | Pathway Commons | transcription | 3 | (Hauck et al., 2002)
REST (NRSF) | GO Biological Process | neurotransmitter transport | 1 | (Schoenherr et al., 1996)
REST (NRSF) | GO Cellular Component | neuronal cell body | None | (Schoenherr et al., 1996)
REST (NRSF) | GO Molecular Function | cation channel activity | 1 | (Schoenherr et al., 1996)
REST (NRSF) | Mouse Phenotypes | abnormal synaptic transmission | 1 | (Schoenherr et al., 1996)
REST (NRSF) | PANTHER Pathway | synaptic vesicle trafficking | 2 | (Schoenherr et al., 1996)
REST (NRSF) | Pathway Commons | transmission across chemical synapses | 3 | (Schoenherr et al., 1996)
SRF (in Jurkat) | GO Biological Process | muscle structure development | None | (Miano et al., 2007)
SRF (in Jurkat) | GO Cellular Component | actin cytoskeleton | 1 | (Miano et al., 2007)
SRF (in Jurkat) | GO Molecular Function | structural constituent of muscle | None | (Miano et al., 2007)
SRF (in Jurkat) | Mouse Phenotypes | dilated heart ventricles | None | (Parlakian et al., 2004)
SRF (in Jurkat) | PANTHER Pathway | cytoskeletal regulation by Rho GTPase | None | (Hill et al., 1995)
SRF (in Jurkat) | Pathway Commons | regulation of insulin secretion by acetylcholine | None | Novel
STAT3 (in mESC) | GO Biological Process | negative regulation of signal transduction | None | (Naka et al., 1997)
STAT3 (in mESC) | GO Molecular Function | transforming growth factor beta binding | None | (Kinjyo et al., 2006)
STAT3 (in mESC) | Mouse Phenotypes | abnormal spleen B cell follicle morphology | None | (Schmidlin et al., 2009)
STAT3 (in mESC) | Pathway Commons | Signaling events mediated by TCPTP | None | (Yamamoto et al., 2002)

29 [BejeranoWinter12/13] 29 PRISM re-discovers known functions
TF | function | p-value | target genes
SRF | muscle structure development | 7.43×
GLI2 | skeletal system development | 7.07×
CRX | retinal photoreceptor degeneration | 1.30×
AR | abnormal spermiogenesis | 1.19×
Is the number of re-discovered known functions impressive?

30 Evaluate re-discovery of known function using “closed loops” How can we assess if the functional associations predicted by PRISM for a particular TF are reasonable without reading a lot of papers? One way is to check if the TFs are annotated with the function (form a closed loop) 30 SRF Genes involved in “muscle structure development” SRF Is SRF itself annotated with the term “muscle structure development”? YES – a “closed loop” [BejeranoWinter12/13]
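The closed-loop check amounts to a lookup of the TF's own annotations. The sketch below uses a toy annotation dictionary; the SRF entry mirrors the slide's example, while the GATA6 entry, the data layout, and the function names are hypothetical placeholders rather than real ontology data.

```python
def is_closed_loop(tf, term, gene_annotations):
    """True if the TF gene itself is annotated with the predicted term."""
    return term in gene_annotations.get(tf, set())


def closed_loop_fraction(predictions, gene_annotations):
    """Fraction of (TF, term) predictions that form a closed loop."""
    if not predictions:
        return 0.0
    closed = sum(1 for tf, term in predictions if is_closed_loop(tf, term, gene_annotations))
    return closed / len(predictions)


# Toy annotations for illustration only.
annotations = {
    "SRF": {"muscle structure development"},
    "GATA6": {"placeholder term"},
}
predicted = [
    ("SRF", "muscle structure development"),
    ("SRF", "actin cytoskeleton"),
    ("GATA6", "abnormal pancreas development"),
]
fraction = closed_loop_fraction(predicted, annotations)
```

A closed loop is evidence that a prediction is consistent with known biology, while a missing loop is not evidence against the prediction, since annotations are incomplete.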

31 [BejeranoWinter12/13] 31 PRISM predictions are consistent with known transcription factor biology. Null Model: How many closed loops using 50,000 random shuffled PWM libraries?
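One standard way to use such a null model is an empirical p-value: the share of shuffled libraries that yield at least as many closed loops as the real library. The sketch below is an assumption about how the comparison could be scored, not the lecture's stated procedure; the add-one correction and all names and numbers are illustrative.

```python
def empirical_p_value(observed, null_counts):
    """(1 + number of null values >= observed) / (1 + number of null values)."""
    at_least = sum(1 for count in null_counts if count >= observed)
    return (1 + at_least) / (1 + len(null_counts))


# Toy numbers for illustration only: closed-loop counts from five shuffled libraries.
null_closed_loops = [1, 2, 3, 10, 0]
p_value = empirical_p_value(observed=10, null_counts=null_closed_loops)
```

With the add-one correction the smallest attainable p-value is 1 / (N + 1), so 50,000 shuffled libraries can resolve p-values down to about 2 × 10^-5.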

32 [BejeranoWinter12/13] 32 Many non-closed loops are still true: 1. Incomplete annotation 2. “Regulation of” annotation
TF | function | p-value | target genes
GATA6 | abnormal pancreas development | 5.69×
SRF | actin cytoskeleton | 4.84×
Nature Genetics, December. SRF acts in the nucleus, where it regulates actin cytoskeleton genes.

33 Now we have good genome-wide binding site predictions for many factors AND we have functional predictions without ChIP-seq Was it as easy as creating binding sites and submitting the results to GREAT? …not quite… [BejeranoWinter12/13]33 Raw GREAT results need cleaning for conserved TFBS

34 [BejeranoWinter12/13] 34 Shuffled motifs also give GREAT enrichments http://cs173.stanford.edu Examine closely. Transcription factor motif: genome-wide binding site predictions. 10 shuffled transcription factor motifs: genome-wide binding site predictions. Run GREAT and observe biological function. Filter. PRISM

35 [BejeranoWinter12/13] 35 Shuffled motifs are used to create an “E-value” metric to blacklist enrichments that show up for shuffles
Metric | Stage 1: GREAT on binding site prediction (Obtained = GREAT) | Stage 2: Top significant GREAT terms (Kept) | Stage 3: PRISM terms, via blacklisting (Kept = PRISM) | PRISM vs. GREAT on b.s. prediction
# TF-term associations | 31,946 | 7,529 | 1,658 | GREAT predictions kept: 5.2%
TF-term FDR (from shuffles) | 50.5% | 49.5% | 16.4% | FDR improvement: 308%
closed loop % | 3.3% | 5.3% | 10.9% | fraction loops improvement: 329%
What are all the terms we are throwing away?
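The summary column of this table follows from the stage values by simple ratios, which the short sketch below recomputes. The helper name `percent` is hypothetical, and the inputs are the rounded figures shown on the slide.

```python
def percent(numerator, denominator):
    """Ratio expressed as a percentage."""
    return 100.0 * numerator / denominator


# Stage 3 (PRISM) relative to stage 1 (raw GREAT), using the slide's figures.
kept_percent = percent(1658, 31946)      # share of GREAT predictions kept
fdr_improvement = percent(50.5, 16.4)    # stage 1 FDR over stage 3 FDR
loop_improvement = percent(10.9, 3.3)    # stage 3 closed-loop rate over stage 1
```

The first two ratios round to the slide's 5.2% and 308%. The third comes out near 330% rather than 329%, presumably because the slide's figure was computed from unrounded closed-loop percentages.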

36 [BejeranoWinter12/13] 36 GREAT enrichments from shuffles are due to conservation bias. Shuffles (2488), CNEs (2279). Create 10,000 random sets of random conserved non-coding regions. Run GREAT. How do the enrichments compare to those from shuffled motifs? Pro: E-value helps us get more accurate predictions by removing false predictions. Con: The conservation bias filter causes us to lose potentially real enrichments in systems that are more often conserved

37 [BejeranoWinter12/13] 37 So far… “Excess Conservation” advanced the state of the art for binding site prediction. “PRISM pipeline” combined accurate binding site prediction with GREAT. Publicly offered as a web application: bejerano.stanford.edu/prism

38 [BejeranoWinter12/13]38 The rest of the talk includes pre-publication work

