Cbio course, spring 2005, Hebrew University Class: Motif Finding CS-67693, Spring 2005 School of Computer Science & Engineering Hebrew University, Jerusalem.


1 cbio course, spring 2005, Hebrew University. Class: Motif Finding, CS-67693, Spring 2005, School of Computer Science & Engineering, Hebrew University, Jerusalem. *A few slides were adapted and edited from www.cs.ucsb.edu/~ambuj/Courses/bioinformatics/motif%20finding.ppt

2 Background
- Basic dogma:
  - Information is coded in the genome
  - Information includes:
    - Where the genes are coded, including: transcription start, UTR, exons and introns, alternative splicing

3 Eukaryotic Gene [Figure; adapted in part from http://online.itp.ucsb.edu/online/infobio01/burge/]

4 Background
- Basic dogma:
  - Information is coded in the genome
  - Information includes:
    - Where the genes are coded, including: transcription start, UTR, exons and introns, alternative splicing
    - Functional units in proteins

5 Proteins: Local structure motifs
[Figure: example motifs: diverging type-2 turn, serine hairpin, Type-I hairpin, frayed helix, proline helix C-cap, alpha-alpha corner, glycine helix N-cap]
I-sites Library = a catalog of local sequence-structure correlations

6 Background
- Basic dogma:
  - Information is coded in the genome
  - Information includes:
    - Where the genes are coded, including: transcription start, UTR, exons and introns, alternative splicing
    - Functional units in proteins
    - RNA family structure

7 RNA: multiple alignment + structure [Figure from Biological Sequence Analysis; Durbin, Eddy, Krogh, Mitchison; Cambridge University Press, 1998]

8 Background
- Basic dogma:
  - Information is coded in the genome
  - Information includes:
    - Where the genes are coded, including: transcription start, UTR, exons and introns, alternative splicing
    - Functional units in proteins
    - RNA family structure
    - How to control which gene to turn on/off and when

9 Background
- In many cases, we can relate such functions to recurring “motifs” in the genome:
  - Splice/start/end site signals in coding genes
  - Binding sites of regulatory elements controlling transcription of nearby genes
  - A certain function of a protein “domain”
- The definition of what a sequence “motif” is depends on the context!

10 Background
- Basic dogma:
  - Information is coded in the genome
  - Information includes:
    - Where the genes are coded, including: transcription start, UTR, exons and introns, alternative splicing
    - Functional units in proteins
    - RNA family structure
    - How to control which gene to turn on/off and when
[Slide annotation: Future Classes]

11 Expression of Genes in Cells
- To produce a protein, a gene (DNA) has to be converted to an intermediary molecule called RNA, in a process called transcription.
- Each cell contains the same genome. Different cells have a different set of genes which are turned on (expressed) by allowing the genes to be transcribed.
- Different cells have different mixtures of gene regulatory proteins to turn genes on or off.

12 Regulation of Gene Expression
- Gene regulatory proteins bind to specific places (regulatory sites) on DNA. These sites are usually close to the gene.
[Figure: a gene that is off, with a free site, and a gene that is on, with a regulatory protein bound at the site]

13 Regulatory Sites
- Regulatory sites are sometimes divided into two types:
  - Promoter sites: usually upstream of a gene in non-translated (non-coding) regions. In some cases, these sites can be in exonic or intronic regions.
  - Enhancer sites: can be very far away (either upstream or downstream).
- Regulatory proteins recognize sites by conserved DNA patterns, which consist of a short stretch of “partially specific” nucleotide sequences.

14 lac operon in E. coli

15 Figure 13.16 The lac Operon of E. coli

16 Promoter…

17

18

19 Transcription Factor Binding Sites. Non-coding regions → gene regulation. We want to describe this site.

20 Difficulty of Finding Regulatory Elements
- Regulatory sites are short (up to 30 nucleotides).
- Non-coding regions are very long (they include all regions which are not translated into proteins).
- Experiments to find regulatory sites are tedious and time-consuming. One approach is to mutate different combinations of nucleotides until functionality changes.
- We don’t have a good understanding of what makes a site active, or how active, in terms of the chemical/physical constraints.

21 Why Not Use Multiple Alignment?
- The motif is short and may appear at different locations in different sequences. Most other areas are random.
- Not all positions within a binding site should be treated in the same way, and usually we don’t know in advance how. Therefore the use of a general scoring matrix is not adequate.
- The problem is made more complicated since not every sequence contains a motif, due to:
  - The upstream region used may not be long enough to include a regulatory site in every sequence.
  - Usually, potentially co-regulated genes are used to construct the sample, which means that we don’t know for sure whether all these genes are really co-regulated.

22 Computational Approach
- Identify a set of genes believed to be controlled by the same regulatory mechanism (co-regulated genes).
- Extract regulatory regions of the genes (usually upstream sequences) to form a sample of sequences.
- Find some way to identify “conserved” elements in these sequences, resulting in a list of potential regulatory sites.

23 How to Find Regulatory Sites [Figure: a sample of genes, each shown with a regulatory site]

24 Formulating Motif Finding Task
- Given a set of sequences, find a common motif shared by these sequences.
- Steps:
  - Construct a model of what we mean by common motif.
  - Solve the problem within the model on simulated samples.
  - Evaluate performance on real-life biological samples.

25 Formulating Motif Finding Task (2)
- This means we need to define:
  - Input of the algorithm: this implicitly defines various assumptions we have on the problem (e.g., do we have a different belief for each sequence that it belongs to the group?)
  - Type of “motif” class
  - Search algorithm: how do we search the space of possible motifs?
  - Scoring function: how do we score putative motifs?
  - Output of the algorithm: should it give us just putative sites, or maybe a binding site model to predict sites?
  - Evaluation technique: how do we test our algorithm?

26 Task Definition Example
- Given a sample of sequences and an unknown pattern (motif) that appears at different unknown positions in each sequence, can we find the unknown pattern?
- Input: a set of sequences, each one with an unknown pattern at an unknown position.
- Output: a set of starting positions of the pattern in each sequence.

27 Pattern == Subsequence
atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGa
tgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttatag
gtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa
Subsequence = AAAAAAAAGGGGGGG

28 Pattern == (l,d)
atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa
tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag
gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat

AgAAgAAAGGttGGG
..|..|||.|..|||
cAAtAAAAcGGcGGG
All variants of AAAAAAAAGGGGGGG
- First formulated by Pevzner (ISMB 2000)
- Pattern = subsequence of length l with exactly d random mismatches in it
- All other sequence is assumed random
- Assumes exactly one “true” occurrence of the motif in each sequence
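The (l,d) formulation can be made concrete with a brute-force search. The following is only a minimal sketch, not the algorithm from the cited work: it enumerates all 4^l candidate patterns, so it is usable only for small l, and it accepts occurrences with at most d mismatches rather than exactly d. The function names (`hamming`, `min_distance`, `planted_motifs`) are hypothetical.

```python
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def min_distance(pattern, seq):
    """Smallest Hamming distance between pattern and any window of seq."""
    l = len(pattern)
    return min(hamming(pattern, seq[i:i + l]) for i in range(len(seq) - l + 1))


def planted_motifs(seqs, l, d):
    """All length-l patterns that occur with <= d mismatches in every sequence.

    Assumption: every sequence has length >= l. Exhaustive over 4**l patterns.
    """
    seqs = [s.upper() for s in seqs]
    hits = []
    for letters in product("ACGT", repeat=l):
        pattern = "".join(letters)
        if all(min_distance(pattern, s) <= d for s in seqs):
            hits.append(pattern)
    return hits
```

For the instances shown on the slide, `hamming("AgAAgAAAGGttGGG".upper(), "AAAAAAAAGGGGGGG")` gives 4, so they correspond to l = 15 and d = 4, far beyond what this exhaustive sketch can enumerate.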

29 Formulating Motif Finding Task (2)
- We need to define:
  - Input of the algorithm: this implicitly defines various assumptions we have on the problem (e.g., do we have a different belief for each sequence that it belongs to the group?)
  - Type of “motif” class
  - Search algorithm: how do we search the space of possible motifs?
  - Scoring function: how do we score putative motifs?
  - Output of the algorithm: should it give us just putative sites, or maybe a binding site model to predict sites?
  - Evaluation technique: how do we test our algorithm?
- Think: how does the (l,d) problem define these? How does it relate to “real” biology?

30 How to Define Motif Class?
- Subsequences: ACTCTT
- IUPAC alphabet: {A, C, G, T, R, Y, M, K, S, W, B, D, H, V, N} = all non-empty subsets of {A,C,G,T}
- PSSM / PWM (Position Specific Score Matrix or Position Weight Matrix)
- More general probabilistic/other models: e.g. using the Bayesian Networks modeling language
- Refined definition based on prior knowledge:
  - Homo/Hetero dimers
  - Variable gaps
  - Bias to some characteristic information profile (Van, 2003)
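As an illustration of the IUPAC motif class, the sketch below expands each IUPAC letter into the set of bases it stands for and scans a sequence for matching sites. The letter-to-base table is the standard IUPAC nucleotide code; the helper names (`iupac_to_regex`, `find_sites`) are made up for this example.

```python
import re

# Standard IUPAC nucleotide codes: each letter denotes a non-empty subset of {A,C,G,T}.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def iupac_to_regex(motif):
    """Translate an IUPAC motif into a regular expression of character classes."""
    return "".join("[" + IUPAC[c] + "]" for c in motif.upper())


def find_sites(motif, seq):
    """Start positions (0-based) of all, possibly overlapping, matches of motif in seq."""
    # A zero-width lookahead lets finditer report overlapping occurrences.
    pattern = re.compile("(?=" + iupac_to_regex(motif) + ")")
    return [m.start() for m in pattern.finditer(seq.upper())]
```

An IUPAC motif either matches a k-mer or not; it cannot express that one base is preferred over another at a position, which is what the PSSM class adds.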

31 PSSM Representation of Binding Sites
- Position Specific Score Matrix: each possible k-mer $s = s_1 \dots s_k$ gets a “score” for being a binding site, which is: $\mathrm{Score}(s) = \sum_{i=1}^{k} w[i, s_i]$, where $w[i,c]$ is the weight of letter $c$ at position $i$.
  [Figure: matrix with rows A, C, G, T and columns 1, 2, …, k]
- Probabilistic interpretation: $P(s \mid M) = \prod_{i=1}^{k} P_i(s_i)$
- The score used in a probabilistic setting is the log-odds score: $\log \frac{P(s \mid M)}{P(s \mid BG)} = \sum_{i=1}^{k} \log \frac{P_i(s_i)}{Q(s_i)}$
- In many cases the BG is a simple, fixed background distribution $Q$ over {A,C,G,T}.
- The entries in the matrix can be $P_i(a)$, $\log P_i(a)$ or $\log(P_i(a)/Q(a))$, depending on the context of its usage!
- NOTE: independence assumption between binding site positions!
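The log-odds use of a PSSM can be sketched in a few lines. This is an illustrative sketch only: the representation (a list of per-position dictionaries), the choice of log base 2, and the function names are assumptions made for the example, not part of the original slides.

```python
import math

BASES = "ACGT"


def log_odds_matrix(probs, background):
    """Turn per-position probabilities P_i(a) into weights w[i][a] = log2(P_i(a) / Q(a)).

    Assumption: all probabilities are strictly positive (e.g. after pseudocounts).
    """
    return [{a: math.log2(col[a] / background[a]) for a in BASES} for col in probs]


def score(weights, kmer):
    """Score of a k-mer: the sum of position-specific weights (positions are independent)."""
    return sum(weights[i][c] for i, c in enumerate(kmer))


def best_site(weights, seq):
    """Highest-scoring window of seq, returned as (start position, score)."""
    k = len(weights)
    scores = [score(weights, seq[j:j + k]) for j in range(len(seq) - k + 1)]
    j = max(range(len(scores)), key=scores.__getitem__)
    return j, scores[j]
```

Because the score is a plain sum over positions, scanning a genome with it is cheap, but it also cannot capture dependencies between positions.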

32 PSSM vs. IUPAC
- ABF1 example (targets by Lee et al., 2002) [Figure: candidate site in YAL011W]
- PSSM:
  + Enables representing low/high affinity in different positions
  + Trades off sensitivity and specificity in genome-wide scans
  - Huge search space: how to cover it efficiently?

33 How to Learn PSSM Motif?
Easier task: we have aligned samples to learn from.
- We have a set of known BS, all of length k (e.g. verified by some biological experiment)
- Compute counts for each base in each position, and normalize == ML estimator: $P_i(a) = N_a / N$
- N is the number of sequences, $N_a$ the number of “a”s in position i
- Note:
  - This is the ML solution. As in many other cases, this might be problematic when we have very few samples to learn from (e.g., we can get probability 0 for base A in position i simply because we did not see enough examples).
  - Solution: use pseudocounts or some prior (e.g. a Dirichlet prior)
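The counting estimator and the pseudocount fix can be sketched as follows. This is a minimal illustration; the function name `estimate_pssm` and the use of one shared pseudocount for all four bases are assumptions made here, not details from the slides.

```python
from collections import Counter


def estimate_pssm(sites, pseudo=0.0):
    """Estimate P_i(a) from aligned binding sites of equal length.

    pseudo = 0 gives the ML estimator N_a / N; pseudo > 0 adds the same
    pseudocount to every base, giving (N_a + pseudo) / (N + 4 * pseudo).
    """
    k = len(sites[0])
    if any(len(s) != k for s in sites):
        raise ValueError("all sites must have the same length")
    n = len(sites)
    pssm = []
    for i in range(k):
        counts = Counter(s[i] for s in sites)  # missing bases count as 0
        pssm.append({a: (counts[a] + pseudo) / (n + 4 * pseudo) for a in "ACGT"})
    return pssm
```

With four sites `["ACG", "ACG", "ATG", "GCG"]`, the ML estimate assigns probability 0 to C at the first position, while a pseudocount of 1 gives it 1/8.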

34 How to Learn PSSM Motif? (2)
Remember: in the motif finding problem we have a much harder task.
- The input is a set of (long) sequences suspected to contain a common motif (a PSSM, according to our current model assumption), but we don’t know where!
- The output: prediction of new BS based on our learned PSSM motif.
[Figure: input sequences → BS model (matrix with rows A, C, G, T over positions 1 to 7) → predictions. Dark blue marks BS positions, which are hidden from us and which we are trying to learn.]

35 How to Learn PSSM Motif? (3) MEME Algorithm (Bailey T.L. and Elkan C.P. 1995)
- (Still) one of the most commonly used tools for motif (PSSM) search:

36 How to Learn PSSM Motif? (3) MEME Algorithm (Bailey T.L. and Elkan C.P. 1995)
- The basic probabilistic framework used by MEME:
  - Input: N sequences
  - Assume each sequence has exactly 1 BS in it
  - Assume a generative model: sequence is generated either by the BS model M (PSSM) or from a fixed background distribution BG
  - Scoring function: P(Seq | M, BG)
  - Try to maximize the likelihood scoring function by adjusting the parameters of M (PSSM)

37 How to Learn PSSM Motif? (4)
- What’s the problem? Why is it hard?
  - Think of the positions of the BS in each sequence as H, where H is a vector of dimension N
  - Given H we have complete data. Inferring M’s ML parameters is then just as we saw for the aligned case → easy
  - Problem 1: We don’t have H; we are trying to learn it too, and the ML parameters of M for each position become dependent if H is not given → we have no closed form to compute them analytically, and going over all possible H assignments is not feasible → we need to resort to some method to search the space of possible assignments to M’s parameters
  - Problem 2: The landscape of the likelihood function is typically far from convex → many local optima

38 How to Learn PSSM Motif? (5) MEME Algorithm
- MEME uses a technique called EM to search the space of model M’s parameters
- EM = Expectation Maximization
- We review how EM is used in the MEME algorithm in class…
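To make the EM idea concrete, here is a small sketch of EM for the "exactly one BS per sequence" setting described above. It is an illustration under simplifying assumptions and not MEME's actual implementation: the background Q is fixed and uniform by default, the prior over start positions is uniform, the PSSM is initialised from a seed k-mer, and a pseudocount keeps probabilities positive. All names (`em_one_site`, `seed`, `pseudo`) are hypothetical.

```python
BASES = "ACGT"


def em_one_site(seqs, seed, background=None, n_iter=20, pseudo=0.1):
    """EM for a PSSM of width len(seed), assuming one site per sequence.

    Returns (pssm, posteriors); posteriors[n][j] is the probability that the
    site in sequence n starts at position j (from the last E-step).
    Assumption: sequences are uppercase strings over ACGT, each of length >= k.
    """
    k = len(seed)
    q = background or {a: 0.25 for a in BASES}
    # Initialise the PSSM so that it favours the seed k-mer.
    pssm = [{a: (0.7 if a == c else 0.1) for a in BASES} for c in seed]
    posteriors = []
    for _ in range(n_iter):
        # E-step: posterior over the hidden start position in each sequence.
        posteriors = []
        for s in seqs:
            ratios = []
            for j in range(len(s) - k + 1):
                r = 1.0
                for i in range(k):
                    r *= pssm[i][s[j + i]] / q[s[j + i]]  # site vs. background odds
                ratios.append(r)
            total = sum(ratios)
            posteriors.append([r / total for r in ratios])
        # M-step: re-estimate P_i(a) from expected (posterior-weighted) counts.
        counts = [{a: pseudo for a in BASES} for _ in range(k)]
        for s, z in zip(seqs, posteriors):
            for j, w in enumerate(z):
                for i in range(k):
                    counts[i][s[j + i]] += w
        pssm = [{a: c[a] / sum(c.values()) for a in BASES} for c in counts]
    return pssm, posteriors
```

The result depends on the seed: a different starting point can converge to a different local optimum, which is why practical tools try many starting points.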

39 Problems with the MEME & other Models
- Think: in light of what we discussed, what assumptions are made in this model? What might cause us problems in “real” life data?
  - MEME also has other variants we did not discuss here (OOPS, ZOOPS, etc.)
- Also: EM is very sensitive to the starting point → we need a good way to find good ones

40 Other Algorithmic Techniques for Motif Finding
- MEME (Expectation Maximization)
- GibbsDNA, AlignACE (Gibbs Sampling)
- CONSENSUS (greedy multiple alignment)
- WINNOWER (Clique finding in graphs)
- SP-STAR (Sum of pairs scoring)
- MITRA (Mismatch trees to prune exhaustive search space)
More than one way to skin a cat…

41 How to Find Binding Sites - Revisited
- “Classical” solutions: find a common motif in the gene set (CONSENSUS, MITRA, MEME, AlignACE…)
  - Main problem: in many cases the motif is common not just to the subset of sequences we have, but to many others as well → not a good candidate to explain regulation
- Discriminative solutions: find a common & unique motif in the genes; extract the relevant bit from the sequences
[Figure: gene set and promoters]
“A simple hyper-geometric approach for discovering putative transcription factor binding sites” WABI 01

42 Finding Discriminative Motifs
[Flow diagram, labelled Step 1 and Step 2, with the following stages:]
- Define space of motifs: “mimic” motifs with a simpler class for efficient search
- Search space, evaluate motifs using discriminative scoring
- Choose significant motifs: correct for multiple hypotheses (Bonferroni or FDR criteria)
- Refine motifs
“A simple hyper-geometric approach for discovering putative transcription factor binding sites” WABI 01
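A discriminative score of the hyper-geometric kind can be sketched as a tail probability: how surprising is it to see a motif in at least k of the n selected genes, given that it occurs in K of all N genes? The sketch below is an illustration of that test plus a Bonferroni correction; the parameter names and the exact form of the score are assumptions, not taken from the cited paper.

```python
from math import comb


def hypergeom_tail(N, K, n, k):
    """P(X >= k) for X ~ Hypergeometric(N genes, K with the motif, n genes drawn)."""
    total = comb(N, n)
    upper = min(K, n)
    # math.comb returns 0 when the lower index exceeds the upper, so
    # impossible terms drop out of the sum automatically.
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def bonferroni(p_value, n_tests):
    """Bonferroni-corrected p-value for n_tests candidate motifs."""
    return min(1.0, p_value * n_tests)
```

A motif that is frequent in the gene set but equally frequent genome-wide gets a large p-value here, which is exactly the case the "classical" formulation cannot distinguish.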

43 Binding Sites - Revisited
[Figure: promoter, binding site, and gene; position-wise site model → independence assumption]
- Two relevant questions:
  - Are there dependencies in binding sites?
  - Do we gain an edge in computational tasks if we model such dependencies?
“Modeling Dependencies in Protein-DNA Binding Sites”, RECOMB 03

44 How to model binding sites?
Four model classes, each representing a distribution of binding sites [Figure: networks over positions X1…X5, two of them with an additional node T]:
- Profile: independence model
- Tree: direct dependencies
- Mixture of Profiles: global dependencies
- Mixture of Trees: both types of dependencies
“Modeling Dependencies in Protein-DNA Binding Sites”, RECOMB 03

45 Learning models: Aligned binding sites
Aligned binding sites:
GCGGGGCCGGGC
TGGGGGCGGGGT
AGGGGGCGGGGG
TAGGGGCCGGGC
TGGGGGCGGGGT
AAAGGGCCGGGC
GGGAGGCCGGGA
GCGGGGCGGGGC
GAGGGGACGAGT
CCGGGGCGGTCC
ATGGGGCGGGGC
- Learning machinery: learning procedure for Bayesian networks; select the maximum likelihood model
- Models: [Figure: the four candidate network structures over X1…X5]
“Modeling Dependencies in Protein-DNA Binding Sites”, RECOMB 03

46 Arabidopsis ABA binding factor 1 (49 examples)
- Profile: test LL per instance -19.93
- Mixture of Profiles (components weighted 76% and 24%): test LL per instance -18.70 (+1.23) (improvement in likelihood > 2-fold)
- Tree (over positions X4…X12): test LL per instance -18.47 (+1.46) (improvement in likelihood > 2.5-fold)
“Modeling Dependencies in Protein-DNA Binding Sites”, RECOMB 03
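The step from a log-likelihood gain to a "fold" improvement is just exponentiation. The small sketch below assumes the log-likelihoods are in base 2, which is a guess that happens to agree with the quoted "> 2-fold" and "> 2.5-fold" figures; the function name is hypothetical.

```python
def fold_improvement(delta_ll, base=2.0):
    """Likelihood ratio implied by a per-instance log-likelihood gain (base assumed)."""
    return base ** delta_ll


mixture_gain = fold_improvement(-18.70 - (-19.93))  # gain of about +1.23
tree_gain = fold_improvement(-18.47 - (-19.93))     # gain of about +1.46
```

Under the base-2 assumption, `mixture_gain` is a little above 2 and `tree_gain` a little above 2.5.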

47 Rap1 Example (Harbison et al. 04) (171 examples) [Figure: learned Profile, Mixture of Profiles, and Tree (over positions X4…X12) models]

48 Likelihood improvement over profiles. Significant improvement in generalization → data often exhibits dependencies. “Modeling Dependencies in Protein-DNA Binding Sites”, RECOMB 03

49 Learning models: unaligned data
- Use the EM algorithm to simultaneously:
  - Identify binding site positions
  - Learn a dependency model
[Figure: unaligned data → EM algorithm (learn a model ↔ identify binding sites) → models over X1…X5]
“Modeling Dependencies in Protein-DNA Binding Sites”, RECOMB 03

50 Evaluating Performance
- Detect target genes on a genomic scale [Figure: upstream sequence ACGTAT…AGGGATGCGAGC, with positions -10000 and -473 marked]
- Scoring rule: compares the probability of a site under the binding site model with its probability under the background model (order-3 Markov chain)
- Crucial issue: p-value of scores
“CIS: Compound Importance Sampling Method for Protein-DNA Binding Site p-value Estimation” Bioinformatics, 2004, ISMB 04
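A scoring rule of this shape, with a Markov-chain background, might look like the sketch below. It is an assumption-laden illustration: the binding site model is taken to be a plain PSSM, the score is a base-2 log ratio, unseen or too-short contexts fall back to a uniform 0.25, and all function names are invented for the example.

```python
import math

BASES = "ACGT"


def train_background(seqs, order=3, pseudo=1.0):
    """Count next-base frequencies for every length-`order` context (uppercase ACGT only)."""
    counts = {}
    for s in seqs:
        for i in range(order, len(s)):
            ctx = counts.setdefault(s[i - order:i], {a: pseudo for a in BASES})
            ctx[s[i]] += 1
    return counts


def background_log2prob(counts, order, seq, start, length):
    """log2 probability of seq[start:start+length] under the Markov background."""
    lp = 0.0
    for i in range(start, start + length):
        ctx = counts.get(seq[i - order:i]) if i >= order else None
        if ctx is None:
            lp += math.log2(0.25)  # uniform fallback: no usable context
        else:
            lp += math.log2(ctx[seq[i]] / sum(ctx.values()))
    return lp


def log_odds(pssm, counts, order, seq, start):
    """log2 of P(window | site model) / P(window | background) for the window at `start`."""
    site = sum(math.log2(pssm[i][seq[start + i]]) for i in range(len(pssm)))
    return site - background_log2prob(counts, order, seq, start, len(pssm))
```

Turning such scores into p-values requires the distribution of the score under the background model, which is the problem the cited importance-sampling method addresses.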

51 Example: ROC curve of HSF1
[Figure: ROC curves for Profile, Mixture of Profiles, Tree, and Mixture of Trees; x-axis: False Positive Rate (0% to 5%); y-axis: True Positive Rate (Sensitivity) (0% to 90%); annotation: ~60 FP]
“Modeling Dependencies in Protein-DNA Binding Sites”, RECOMB 03

52 Evaluation – Localization Data
- 5-fold cross validation on localization data [Lee et al 2002]
- Improvement by Mixture of Trees over PSSM, measured as:
  - Δ specificity (TP/Predicted)
  - Δ sensitivity (TP/True)
[Figure: overlap of the “True” and Predicted sets, with TP as their intersection]
“Modeling Dependencies in Protein-DNA Binding Sites”, RECOMB 03
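The two measures defined on this slide can be computed directly from the predicted and "true" gene sets. A minimal sketch, using the slide's definitions (note that specificity here means TP/Predicted); the function name and the gene identifiers in the usage note are hypothetical.

```python
def evaluate(predicted, true):
    """Return (sensitivity, specificity) with the slide's definitions.

    sensitivity = TP / |True|, specificity = TP / |Predicted|.
    Empty sets yield 0.0 to avoid division by zero.
    """
    predicted, true = set(predicted), set(true)
    tp = len(predicted & true)
    sensitivity = tp / len(true) if true else 0.0
    specificity = tp / len(predicted) if predicted else 0.0
    return sensitivity, specificity
```

For example, with predictions `{"g1", "g2", "g3", "g4"}` and true targets `{"g2", "g3", "g5"}`, two of the three true targets are recovered and half of the predictions are correct.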

53 Motif Finding - Evaluation
- Still an open problem
- We have seen several examples of how performance can be evaluated in different ways
- There is (still) no absolute solution for this
- Main problems:
  - No large data sets of known sites
  - No real annotation of negative samples
  - How to define a success measure?
  - Differences in input/output assumptions
  - …
- A recent effort in this direction: “Assessing computational tools for the discovery of transcription factor binding sites” (Nat. Biotech. Jan 05) → compared publicly available tools on the web on (small) data sets of known binding sites based on the TRANSFAC database

