cbio course, spring 2005, Hebrew University
Class: Motif Finding
CS-67693, Spring 2005
School of Computer Science & Engineering, Hebrew University, Jerusalem


* A few slides were adapted and edited from bioinformatics/motif%20finding.ppt

Background
- Basic dogma: information is coded in the genome
- Information includes:
  - Where the genes are coded, including:
    - Transcription start
    - UTR
    - Exons and introns
    - Alternative splicing

Eukaryotic Gene
Adapted in part from

Background
- Basic dogma: information is coded in the genome
- Information includes:
  - Where the genes are coded, including:
    - Transcription start
    - UTR
    - Exons and introns
    - Alternative splicing
  - Functional units in proteins

Proteins: Local structure motifs
[Figure panels: diverging type-2 turn; serine hairpin; type-I hairpin; frayed helix; proline helix C-cap; alpha-alpha corner; glycine helix N-cap]
I-sites Library = a catalog of local sequence-structure correlations

Background
- Basic dogma: information is coded in the genome
- Information includes:
  - Where the genes are coded, including:
    - Transcription start
    - UTR
    - Exons and introns
    - Alternative splicing
  - Functional units in proteins
  - RNA family structure

RNA: Multiple alignment + structure
Source: Biological Sequence Analysis; Durbin, Eddy, Krogh, Mitchison; Cambridge University Press, 1998

Background
- Basic dogma: information is coded in the genome
- Information includes:
  - Where the genes are coded, including:
    - Transcription start
    - UTR
    - Exons and introns
    - Alternative splicing
  - Functional units in proteins
  - RNA family structure
  - How to control which gene to turn on/off, and when

Background
- In many cases, we can relate such functions to recurring “motifs” in the genome:
  - Splice/start/end site signals in coding genes
  - Binding sites of regulatory elements controlling transcription of nearby genes
  - A certain function of a protein “domain”
The definition of a sequence “motif” depends on the context!

Background
- Basic dogma: information is coded in the genome
- Information includes:
  - Where the genes are coded, including:
    - Transcription start
    - UTR
    - Exons and introns
    - Alternative splicing
  - Functional units in proteins
  - RNA family structure
  - How to control which gene to turn on/off, and when
Future classes

Expression of Genes in Cells
- To produce a protein, a gene (DNA) has to be converted to an intermediary molecule called RNA, in a process called transcription.
- Each cell contains the same genome. Different cells have a different set of genes that are turned on (expressed) by allowing those genes to be transcribed.
- Different cells have different mixtures of gene regulatory proteins to turn genes on or off.

Regulation of Gene Expression
- Gene regulatory proteins bind to specific places (regulatory sites) on DNA. These sites are usually close to the gene.
[Figure labels: gene, site, regulatory protein; states: off, on]

Regulatory Sites
- Regulatory sites are sometimes divided into two types:
  - Promoter sites: usually upstream of a gene, in non-translated (non-coding) regions. In some cases, these sites can be in exonic or intronic regions.
  - Enhancer sites: can be very far away (either upstream or downstream).
- Regulatory proteins recognize sites by conserved DNA patterns, which consist of a short stretch of “partially specific” nucleotide sequences.

lac operon in E. coli

[Figure: The lac operon of E. coli]

Promoter…

Transcription Factor Binding Sites
Non-coding regions → gene regulation
We want to describe this site

Difficulty of Finding Regulatory Elements
- Regulatory sites are short (up to 30 nucleotides).
- Non-coding regions are very long (they include all regions that are not translated into proteins).
- Experiments to find regulatory sites are tedious and time-consuming. One approach is to mutate different combinations of nucleotides until functionality changes.
- We do not have a good understanding of what makes a site active, or how active, in terms of chemical/physical constraints.

Why Not Use Multiple Alignment?
- The motif is short and may appear at a different location in each sequence. Most other areas are random.
- Not all positions within a binding site should be treated in the same way, and usually we don’t know in advance how they differ. Therefore a general scoring matrix is not adequate.
- The problem is further complicated because not every sequence contains a motif:
  - The upstream region used may not be long enough to include a regulatory site in every sequence.
  - Usually, potentially co-regulated genes are used to construct the sample, which means we don’t know for sure whether all these genes are really co-regulated.

Computational Approach
- Identify a set of genes believed to be controlled by the same regulatory mechanism (co-regulated genes).
- Extract regulatory regions of the genes (usually upstream sequences) to form a sample of sequences.
- Find some way to identify “conserved” elements in these sequences, resulting in a list of potential regulatory sites.

How to Find Regulatory Sites
[Figure: the sample, shown as several genes, each with its regulatory site]

Formulating Motif Finding Task
- Given a set of sequences, find a common motif shared by these sequences.
- Steps:
  - Construct a model of what we mean by a common motif.
  - Solve the problem within the model on simulated samples.
  - Evaluate performance on real-life biological samples.

Formulating Motif Finding Task (2)
- This means we need to define:
  - Input of the algorithm: this implicitly defines various assumptions we make about the problem (e.g., do we have a different belief for each sequence that it belongs to the group?)
  - Type of “motif” class:
  - Search algorithm: how do we search the space of possible motifs?
  - Scoring function: how do we score putative motifs?
  - Output of the algorithm: should it give us just putative sites, or a binding site model that can predict sites?
  - Evaluation technique: how do we test our algorithm?

Task Definition Example
- Given a sample of sequences and an unknown pattern (motif) that appears at different unknown positions in each sequence, can we find the unknown pattern?
- Input: a set of sequences, each one with an unknown pattern at an unknown position.
- Output: a set of starting positions of the pattern in each sequence.

Pattern == Subsequence
  atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
  acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGa
  tgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
  gctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
  tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttatag
  gtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
  cggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcat
  aacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
  ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaag
  ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa
Subsequence = AAAAAAAAGGGGGGG

Pattern == (l,d)
  atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
  acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa
  tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
  gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
  tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag
  gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
  cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat
All instances are variants of AAAAAAAAGGGGGGG. Two instances compared position by position (| = match, . = mismatch):
  AgAAgAAAGGttGGG
  ..|..|||.|..|||
  cAAtAAAAcGGcGGG
- First formulated by Pevzner (ISMB 2000)
- Pattern = subsequence of length l with exactly d random mismatches in it
- All other sequence is assumed random
- Assumes exactly one “true” occurrence of the motif in each sequence
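As a rough illustration of the (l,d) formulation above, the sketch below plants one mutated copy of a motif into a random background sequence and measures its Hamming distance to the original motif. The function names (`hamming`, `plant_motif`), the uniform background, and the use of lowercase letters to mark mutated positions are assumptions made for this example; they are not taken from the slides.

```python
import random

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings (case-insensitive)."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a.upper(), b.upper()))

def plant_motif(motif, d, seq_len, rng):
    """Return (sequence, start): a random sequence with one copy of `motif`
    carrying exactly d substitutions, planted at a random start position."""
    l = len(motif)
    instance = list(motif)
    # Choose d distinct positions and replace each with a different base.
    for p in rng.sample(range(l), d):
        alternatives = [c for c in "ACGT" if c != motif[p].upper()]
        instance[p] = rng.choice(alternatives).lower()  # lowercase marks a mutation
    start = rng.randrange(seq_len - l + 1)
    background = [rng.choice("acgt") for _ in range(seq_len)]
    background[start:start + l] = instance
    return "".join(background), start

rng = random.Random(0)
seq, start = plant_motif("AAAAAAAAGGGGGGG", d=4, seq_len=80, rng=rng)
print(seq, start, hamming(seq[start:start + 15], "AAAAAAAAGGGGGGG"))
```

The two instances shown on the slide, AgAAgAAAGGttGGG and cAAtAAAAcGGcGGG, each differ from AAAAAAAAGGGGGGG in 4 positions, so the example data corresponds to l = 15, d = 4.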

Formulating Motif Finding Task (2)
- We need to define:
  - Input of the algorithm: this implicitly defines various assumptions we make about the problem (e.g., do we have a different belief for each sequence that it belongs to the group?)
  - Type of “motif” class:
  - Search algorithm: how do we search the space of possible motifs?
  - Scoring function: how do we score putative motifs?
  - Output of the algorithm: should it give us just putative sites, or a binding site model that can predict sites?
  - Evaluation technique: how do we test our algorithm?
Think: how does the (l,d) problem define each of these? How does it relate to “real” biology?

How to Define Motif Class?
- Subsequences: ACTCTT
- IUPAC alphabet: {A, C, G, T, R, Y, M, K, S, W, B, D, H, V, N} = all non-empty subsets of {A, C, G, T}
- PSSM / PWM (Position Specific Score Matrix or Position Weight Matrix)
- More general probabilistic/other models, e.g. using the Bayesian network modeling language
- Refined definitions based on prior knowledge:
  - Homo-/heterodimers
  - Variable gaps
  - Bias toward some characteristic information profile (Van, 2003)
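To make the IUPAC motif class concrete, here is a small sketch that expands each IUPAC symbol to the set of bases it stands for and scans a sequence for matches. The table uses the standard IUPAC nucleotide codes; the helper names `iupac_matches` and `find_sites` are hypothetical and chosen only for this example.

```python
# Standard IUPAC nucleotide codes: each symbol names a non-empty subset of {A, C, G, T}.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def iupac_matches(pattern, kmer):
    """True if every base of `kmer` belongs to the subset named by the pattern symbol."""
    if len(pattern) != len(kmer):
        return False
    return all(base in IUPAC[sym] for sym, base in zip(pattern.upper(), kmer.upper()))

def find_sites(pattern, seq):
    """Start positions of all windows of `seq` that match the IUPAC pattern."""
    k = len(pattern)
    return [j for j in range(len(seq) - k + 1) if iupac_matches(pattern, seq[j:j + k])]

print(find_sites("ACYCTT", "ggACTCTTccACCCTTaa"))
```

An IUPAC pattern gives a yes/no answer per position, whereas a PSSM assigns graded weights; that difference is what the later PSSM slides build on.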

PSSM Representation of Binding Sites
Position Specific Score Matrix: each possible k-mer $s = s_1 \dots s_k$ gets a “score” for being a binding site, which is
  $\mathrm{Score}(s) = \sum_{i=1}^{k} w[i, s_i]$
where w[i,c] is the weight of letter c at position i.
[Figure: matrix with rows A, C, G, T and columns 1, 2, ..., k]
- Probabilistic interpretation: $P(s \mid M) = \prod_{i=1}^{k} P_i(s_i)$
- NOTE: independence assumption between binding site positions!
- The score used in a probabilistic setting is the log-odds score: $\log \frac{P(s \mid M)}{P(s \mid BG)} = \sum_{i=1}^{k} \log \frac{P_i(s_i)}{Q(s_i)}$
- In many cases the BG is a simple, fixed background distribution Q over {A, C, G, T}.
- The entries in the matrix can be $P_i(a)$, $\log P_i(a)$, or $\log(P_i(a)/Q(a))$, depending on the context of its usage!
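The log-odds scoring described above can be sketched in a few lines. The representation of the PSSM as a list of per-position dictionaries and the names `log_odds_matrix` and `score_kmer` are assumptions for illustration; the toy probabilities are made up and do not describe any real binding site.

```python
import math

def log_odds_matrix(probs, background):
    """Convert per-position probabilities P_i(a) into weights w[i][a] = log(P_i(a) / Q(a)).
    All probabilities must be strictly positive."""
    return [{c: math.log(col[c] / background[c]) for c in "ACGT"} for col in probs]

def score_kmer(weights, kmer):
    """Sum of position-specific weights: the positions are treated as independent."""
    if len(kmer) != len(weights):
        raise ValueError("k-mer length must equal the PSSM width")
    return sum(weights[i][c] for i, c in enumerate(kmer.upper()))

# Toy 2-position model (made-up numbers) against a uniform background Q.
probs = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
uniform = {c: 0.25 for c in "ACGT"}
w = log_odds_matrix(probs, uniform)
print(score_kmer(w, "AG"), score_kmer(w, "CT"))
```

A positive score means the k-mer is more probable under the motif model than under the background; a negative score means the opposite.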

PSSM vs. IUPAC
PSSM:
+ Enables representing low/high affinity at different positions
+ Allows trading off sensitivity and specificity in genome-wide scans
- Huge search space: how can it be covered efficiently?
[Figure: ABF1 example, targets from Lee et al., 2002; sequence >YAL011W with fragments CGT, GTTA, G, A, TGA; marks: √, ?]

How to Learn PSSM Motif?
Easier task: we have aligned samples to learn from.
- We have a set of known BS, all of length k (e.g. verified by some biological experiment).
- Compute counts for each base in each position and normalize; this is the ML estimator.
- With N the number of sequences and $N_{i,a}$ the number of “a”s in position i: $P_i(a) = \frac{N_{i,a}}{N}$
- Note:
  - This is the ML solution. As in many other cases, it might be problematic when we have very few samples to learn from (e.g. we can get probability 0 for base A in position i simply because we did not see enough examples).
  - Solution: use pseudocounts or some prior (e.g. a Dirichlet prior).
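The counting-and-normalizing estimator, with an optional pseudocount, might look like the following sketch. The function name `estimate_pssm`, the choice of one shared pseudocount per base, and the four toy sites are assumptions for illustration only.

```python
def estimate_pssm(sites, pseudocount=1.0):
    """Estimate P_i(a) from aligned sites of equal length.
    pseudocount=0 gives the plain ML estimate N_{i,a} / N; a positive value
    adds that many virtual observations of every base at every position."""
    k = len(sites[0])
    if any(len(s) != k for s in sites):
        raise ValueError("all sites must have the same length")
    pssm = []
    for i in range(k):
        counts = {c: float(pseudocount) for c in "ACGT"}
        for s in sites:
            counts[s[i].upper()] += 1
        total = sum(counts.values())
        pssm.append({c: n / total for c, n in counts.items()})
    return pssm

toy_sites = ["ACG", "ACG", "ATG", "CCG"]  # made-up aligned sites
ml = estimate_pssm(toy_sites, pseudocount=0.0)
smoothed = estimate_pssm(toy_sites, pseudocount=1.0)
print(ml[0], smoothed[0])
```

In the ML estimate, G at the first position gets probability 0 because it was never observed; with a pseudocount of 1 it gets (0 + 1) / (4 + 4) = 0.125, so a single unseen base no longer vetoes a candidate site.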

How to Learn PSSM Motif? (2)
Remember: in the motif finding problem we have a much harder task.
- The input is a set of (long) sequences suspected to contain a common motif (a PSSM, according to our current model assumption), but we don’t know where!
- The output is a prediction of new BS based on our learned PSSM motif.
[Figure: input sequences, BS model (matrix over A, C, G, T), predictions. Dark blue marks the BS positions, which are hidden from us and which we are trying to learn.]

How to Learn PSSM Motif? (3): MEME Algorithm (Bailey T.L. and Elkan C.P.)
- (Still) one of the most commonly used tools for motif (PSSM) search

How to Learn PSSM Motif? (3): MEME Algorithm (Bailey T.L. and Elkan C.P.)
- The basic probabilistic framework used by MEME:
  - Input: N sequences.
  - Assume each sequence has exactly 1 BS in it.
  - Assume a generative model: each part of a sequence is generated either by the BS model M (PSSM) or by a fixed background distribution BG.
  - Scoring function: P(Seq | M, BG).
  - Try to maximize the likelihood scoring function by adjusting the parameters of M (PSSM).
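Under the assumptions listed above (exactly one BS per sequence, everything else drawn from a fixed background), the scoring function P(Seq | M, BG) can be written as a sum over the possible site positions. The sketch below additionally assumes a uniform prior over start positions and an order-0 background; those two choices, and the function name `seq_likelihood`, are assumptions of this example and not necessarily how MEME itself is implemented.

```python
import math

def seq_likelihood(seq, pssm, bg):
    """P(seq | M, BG) with exactly one site per sequence.
    pssm: list of dicts P_i(a); bg: dict Q(a) with strictly positive entries.
    Sums over all start positions j, each with prior 1 / (L - k + 1)."""
    k = len(pssm)
    n_starts = len(seq) - k + 1
    if n_starts < 1:
        raise ValueError("sequence is shorter than the motif")
    # Probability of the whole sequence under the background alone.
    bg_all = math.prod(bg[c] for c in seq)
    # For a site at j, swap the background terms inside the window for PSSM terms.
    total = 0.0
    for j in range(n_starts):
        total += math.prod(pssm[i][seq[j + i]] / bg[seq[j + i]] for i in range(k))
    return bg_all * total / n_starts

uniform = {c: 0.25 for c in "ACGT"}
one_col = [{"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}]  # toy width-1 "motif"
print(seq_likelihood("AC", one_col, uniform))
```

For the toy call, the site can sit on the A (probability 1 × 0.25 for the remaining C) or on the C (probability 0), each with prior 1/2, which gives 0.125. The likelihood of a whole sample is the product of this quantity over the N sequences.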

How to Learn PSSM Motif? (4)
- What’s the problem? Why is it hard?
  - Think of the positions of the BS in each sequence as H, where H is a vector of dimension N.
  - Given H we have complete data. Inferring the ML parameters of M is then just as we saw for the aligned case → easy.
  - Problem 1: We don’t have H; we are trying to learn it too, and the ML parameters of M for each position become dependent if H is not given → we have no closed form to compute them analytically, and going over all possible H assignments is not feasible → we need to resort to some method to search the space of possible assignments to M’s parameters.
  - Problem 2: The landscape of the likelihood function is typically far from convex → many local optima.

How to Learn PSSM Motif? (5): MEME Algorithm
- MEME uses a technique called EM to search the space of model M’s parameters.
- EM = Expectation Maximization
- We review how EM is used in the MEME algorithm in class…
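Since the EM details are left to the class discussion, the following is only a minimal sketch of how EM could be applied under the one-site-per-sequence assumption: the E-step computes a posterior over the hidden site position in each sequence, and the M-step re-estimates the PSSM from the expected counts. The uniform position prior, the fixed order-0 background, the pseudocount, the seeding from a single k-mer, and all function names are assumptions of this example; it is not MEME's actual implementation.

```python
import math

def pssm_from_kmer(kmer, strength=0.7):
    """Starting point: put `strength` on each letter of a seed k-mer, split the rest evenly."""
    rest = (1.0 - strength) / 3
    return [{c: (strength if c == x else rest) for c in "ACGT"} for x in kmer]

def em_one_site(seqs, init_pssm, bg, n_iter=20, pseudo=0.1):
    """EM for a PSSM with exactly one hidden site per sequence.
    Returns (pssm, posteriors), where posteriors[n][j] = P(site of sequence n starts at j)."""
    k = len(init_pssm)
    pssm = [dict(col) for col in init_pssm]
    posteriors = []
    for _ in range(n_iter):
        # E-step: posterior over start positions, proportional to the likelihood ratio.
        posteriors = []
        for s in seqs:
            w = [math.prod(pssm[i][s[j + i]] / bg[s[j + i]] for i in range(k))
                 for j in range(len(s) - k + 1)]
            z = sum(w)
            posteriors.append([x / z for x in w])
        # M-step: expected base counts per motif position, plus a pseudocount.
        counts = [{c: pseudo for c in "ACGT"} for _ in range(k)]
        for s, post in zip(seqs, posteriors):
            for j, p in enumerate(post):
                for i in range(k):
                    counts[i][s[j + i]] += p
        pssm = [{c: n / sum(col.values()) for c, n in col.items()} for col in counts]
    return pssm, posteriors

# Toy data (made up): GATTACA planted at positions 5, 0 and 8.
seqs = ["CCGCGGATTACACGCGC", "GATTACAGGCCGGCC", "CGGCCGGCGATTACA"]
uniform = {c: 0.25 for c in "ACGT"}
pssm, posteriors = em_one_site(seqs, pssm_from_kmer("GATTACA"), uniform)
best = [max(range(len(p)), key=p.__getitem__) for p in posteriors]
consensus = "".join(max(col, key=col.get) for col in pssm)
print(best, consensus)
```

Here the run is seeded with the planted motif itself, so it converges immediately. With a poor seed, the same code can settle on a different local optimum, which is why the choice of starting points matters so much for EM.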

Problems with the MEME & Other Models
- Think: in light of what we discussed, what assumptions are made in this model? What might cause us problems in “real” life data?
  - MEME also has other variants we did not discuss here (OOPS, ZOOPS, etc.)
- Also: EM is very sensitive to the starting point → we need a good way to find good ones.

Other Algorithmic Techniques for Motif Finding
- MEME (Expectation Maximization)
- GibbsDNA, AlignACE (Gibbs sampling)
- CONSENSUS (greedy multiple alignment)
- WINNOWER (clique finding in graphs)
- SP-STAR (sum-of-pairs scoring)
- MITRA (mismatch trees to prune the exhaustive search space)
More than one way to skin a cat…

How to Find Binding Sites: Revisited
- “Classical” solutions: find a common motif in the gene set (CONSENSUS, MITRA, MEME, AlignACE…)
- Main problem: in many cases the motif is common not just to the subset of sequences we have, but to many others as well → not a good candidate to explain regulation.
- Discriminative solutions: find a motif that is common & unique to the genes in the set, i.e. extract the relevant bit from the sequences.
[Figure labels: Gene Set, Promoter]
“A simple hyper-geometric approach for discovering putative transcription factor binding sites”, WABI 01

Finding Discriminative Motifs
Steps:
- Define the space of motifs: “mimic” motifs with a simpler class for efficient search.
- Search the space: evaluate motifs using discriminative scoring.
- Choose significant motifs: correct for multiple hypotheses (Bonferroni or FDR criteria).
- Refine motifs.
“A simple hyper-geometric approach for discovering putative transcription factor binding sites”, WABI 01
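One common way to score a motif discriminatively, in the spirit of the hypergeometric approach cited above, is to ask how surprising it is that so many genes in the selected set contain the motif, given how often it occurs across all genes. The sketch below computes that tail probability and a Bonferroni correction. The function names and the example numbers are hypothetical, and the cited paper's exact procedure may differ.

```python
from math import comb

def hypergeom_tail(n_total, n_with_motif, n_selected, n_hits):
    """P(X >= n_hits) when n_selected genes are drawn without replacement from
    n_total genes, of which n_with_motif contain the motif."""
    upper = min(n_with_motif, n_selected)
    numerator = sum(
        comb(n_with_motif, i) * comb(n_total - n_with_motif, n_selected - i)
        for i in range(n_hits, upper + 1)
    )
    return numerator / comb(n_total, n_selected)

def bonferroni(p_value, n_tests):
    """Bonferroni-corrected p-value: multiply by the number of motifs tested, cap at 1."""
    return min(1.0, p_value * n_tests)

# Made-up example: 6000 genes, 300 contain the motif, 40 selected genes, 12 of them hit.
p = hypergeom_tail(6000, 300, 40, 12)
print(p, bonferroni(p, n_tests=4 ** 6))
```

The correction matters because the search evaluates many candidate motifs; testing, say, all 4^6 hexamers means some will look enriched by chance alone.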

Binding Sites: Revisited
The profile (PSSM) representation → independence assumption between positions.
Two relevant questions:
- Are there dependencies in binding sites?
- Do we gain an edge in computational tasks if we model such dependencies?
[Figure labels: promoter, gene, binding site]
“Modeling Dependencies in Protein-DNA Binding Sites”, RECOMB 03

How to Model Binding Sites?
Each model represents a distribution of binding sites over positions X1, ..., X5:
- Profile: independency model
- Tree: direct dependencies
- Mixture of Profiles: global dependencies (hidden mixture variable T)
- Mixture of Trees: both types of dependencies (hidden mixture variable T)
“Modeling Dependencies in Protein-DNA Binding Sites”, RECOMB 03

Learning models: Aligned binding sites
Aligned binding sites:
  GCGGGGCCGGGC
  TGGGGGCGGGGT
  AGGGGGCGGGGG
  TAGGGGCCGGGC
  TGGGGGCGGGGT
  AAAGGGCCGGGC
  GGGAGGCCGGGA
  GCGGGGCGGGGC
  GAGGGGACGAGT
  CCGGGGCGGTCC
  ATGGGGCGGGGC
Learning machinery: learning procedure for Bayesian networks; select the maximum likelihood model.
[Figure: aligned binding sites → learning machinery → models (Profile, Tree, Mixture of Profiles, Mixture of Trees)]
“Modeling Dependencies in Protein-DNA Binding Sites”, RECOMB 03

Arabidopsis ABA binding factor 1 (49 examples)
- Profile: test LL per instance (reference for the improvements below)
- Mixture of Profiles (components: 76%, 24%): test LL per instance +1.23 (improvement in likelihood > 2-fold)
- Tree (over positions X4, ..., X12): test LL per instance +1.46 (improvement in likelihood > 2.5-fold)
“Modeling Dependencies in Protein-DNA Binding Sites”, RECOMB 03

Rap1 Example (Harbison et al., 04) (171 examples)
[Figure panels: Profile; Mixture of Profiles; Tree over positions X4, ..., X12]

Likelihood improvement over profiles
- Significant improvement in generalization → data often exhibits dependencies
“Modeling Dependencies in Protein-DNA Binding Sites”, RECOMB 03

Learning models: unaligned data
Use the EM algorithm to simultaneously:
- Identify binding site positions
- Learn a dependency model
[Figure: unaligned data → EM algorithm (alternating “learn a model” and “identify binding sites”) → models (Profile, Tree, Mixture of Profiles, Mixture of Trees)]
“Modeling Dependencies in Protein-DNA Binding Sites”, RECOMB 03

Evaluating Performance
Detect target genes on a genomic scale: ACGTAT…………….………………….AGGGATGCGAGC
Scoring rule: $\mathrm{Score}(s) = \log \frac{P(s \mid \text{binding site model})}{P(s \mid \text{background model})}$, where the numerator is the probability under the binding site model and the denominator is the probability under the background model (an order-3 Markov chain).
Crucial issue: p-value of scores
“CIS: Compound Importance Sampling Method for Protein-DNA Binding Site p-value Estimation”, Bioinformatics, 2004, ISMB 04
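The scan described above can be sketched as follows: train an order-3 Markov background from some sequences, then score every window by the log ratio of its probability under the binding site model to its probability under the background. Here the binding site model is a plain PSSM with strictly positive entries, contexts that are too short or unseen fall back to a uniform 0.25, and all names are hypothetical; the p-value estimation highlighted on the slide is not attempted.

```python
import math
from collections import defaultdict

def train_markov(seqs, order=3, pseudo=1.0):
    """Estimate P(next base | previous `order` bases) with pseudocounts."""
    counts = defaultdict(lambda: {c: pseudo for c in "ACGT"})
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1
    return {ctx: {c: n / sum(col.values()) for c, n in col.items()}
            for ctx, col in counts.items()}

def bg_logprob(model, seq, start, k, order=3):
    """log P(seq[start:start+k] | preceding bases) under the Markov background.
    Positions with a short or unseen context use a uniform probability of 0.25."""
    total = 0.0
    for i in range(start, start + k):
        ctx = seq[max(0, i - order):i]
        probs = model.get(ctx) if len(ctx) == order else None
        total += math.log(probs[seq[i]] if probs else 0.25)
    return total

def scan(seq, pssm, model, order=3):
    """Log-odds score of every window: log P(window | PSSM) - log P(window | background)."""
    k = len(pssm)
    return [
        sum(math.log(pssm[i][seq[j + i]]) for i in range(k)) - bg_logprob(model, seq, j, k, order)
        for j in range(len(seq) - k + 1)
    ]

model = train_markov(["ACGTACGTACGT"])          # toy training data
flat = [{c: 0.25 for c in "ACGT"}] * 2           # uninformative width-2 "motif"
scores = scan("ACGTA", flat, model)
print(scores)
```

A higher-order background penalizes windows that are already likely given their genomic context, which a fixed order-0 background cannot do.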

Example: ROC curve of HSF1
[Figure: ROC curves for Profile, Mixture of Profiles, Tree, and Mixture of Trees; x-axis: False Positive Rate (0% to 5%); y-axis: True Positive Rate (Sensitivity) (0% to 90%); annotation: ~60 FP]
“Modeling Dependencies in Protein-DNA Binding Sites”, RECOMB 03

Evaluation: Localization Data [Lee et al 2002]
- 5-fold cross validation
- Improvement by Mixture of Trees over PSSM, measured as:
  - Δ specificity (TP / Predicted)
  - Δ sensitivity (TP / True)
[Figure: overlap of the “True” and Predicted sets; the intersection is TP]
“Modeling Dependencies in Protein-DNA Binding Sites”, RECOMB 03
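The two measures on this slide are simple set ratios. A minimal sketch, using the slide's own definitions (note that TP / Predicted, called specificity here, is what is often called precision or positive predictive value elsewhere); the function name and the toy gene labels are made up:

```python
def evaluate_predictions(true_sites, predicted_sites):
    """Return (sensitivity, specificity) as defined on the slide:
    sensitivity = TP / True, specificity = TP / Predicted."""
    true_set, pred_set = set(true_sites), set(predicted_sites)
    tp = len(true_set & pred_set)
    sensitivity = tp / len(true_set) if true_set else 0.0
    specificity = tp / len(pred_set) if pred_set else 0.0
    return sensitivity, specificity

sens, spec = evaluate_predictions({"g1", "g2", "g3", "g4"}, {"g2", "g3", "g5"})
print(sens, spec)
```

In the toy call, 2 of the 4 true targets are recovered (sensitivity 0.5) and 2 of the 3 predictions are correct (specificity 2/3).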

Motif Finding - Evaluation
- Still an open problem
- We have seen several examples of how performance can be evaluated in different ways
- There is (still) no absolute solution for this
- Main problems:
  - No large data sets of known sites
  - No real annotation of negative samples
  - How to define a success measure?
  - Differences in input/output assumptions
  - …
- A recent effort in this direction: “Assessing computational tools for the discovery of transcription factor binding sites” (Nat. Biotech., Jan 05)
  - Compared publicly available web tools on (small) data sets of known binding sites based on the TRANSFAC database