Motif Refinement using Hybrid Expectation Maximization Algorithm Chandan Reddy Yao-Chung Weng Hsiao-Dong Chiang School of Electrical and Computer Engr.

Motif Refinement using Hybrid Expectation Maximization Algorithm Chandan Reddy Yao-Chung Weng Hsiao-Dong Chiang School of Electrical and Computer Engr. Cornell University, Ithaca, NY

Motif Finding Problem
- Motifs are patterns in DNA and protein sequences that are strongly conserved, i.e. they have important biological functions such as gene regulation and gene interaction.
- Finding these conserved patterns might be very useful for controlling the expression of genes.
- The motif finding problem is to detect novel, over-represented, unknown signals in a set of sequences (e.g. transcription factor binding sites in a genome).

Motif Finding Problem
Consensus pattern: 'CCGATTACCGA'
(l, d) = (11, 2) consensus pattern

Problem Definition
Without any prior knowledge of the consensus pattern, discover all instances (alignment positions) of the motif, and then recover the final pattern from which every instance differs by at most a given number of mutations.
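The "within a given number of mutations" condition can be read as a Hamming-distance test. The sketch below assumes that reading; the consensus string comes from the earlier slide, while the candidate l-mers and the function names are made up for illustration.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def is_instance(lmer: str, consensus: str, d: int) -> bool:
    """True if lmer is within d mutations of the consensus pattern."""
    return hamming(lmer, consensus) <= d


consensus = "CCGATTACCGA"  # (11, 2) consensus pattern from the slide
# Hypothetical candidates: exact copy, 2 mutations, 3 mutations.
candidates = ["CCGATTACCGA", "CCGAATACCGT", "ACGTTTACCGG"]
flags = [is_instance(c, consensus, 2) for c in candidates]
print(flags)
```

Only the first two candidates qualify as instances of the (11, 2) motif; the third differs in three positions.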

Complexity of the Problem
Let
- n be the length of each DNA sequence,
- l be the length of the motif,
- t be the number of sequences,
- d be the number of mutations in a motif.
The running time of a brute-force approach: there are (n-l+1) l-mers in each of the t sequences, so the total number of combinations of one l-mer per sequence is (n-l+1)^t. Typically n is much larger than l, e.g. n = 600, t = 20.
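To get a feel for the size of this search space, the count can be evaluated directly. This is a small sketch: n and t are the values quoted on the slide, while l = 15 is an assumed motif length chosen only for illustration.

```python
def alignment_count(n: int, l: int, t: int) -> int:
    """Number of ways to pick one l-mer start position in each of t sequences."""
    per_sequence = n - l + 1
    return per_sequence ** t


n, t = 600, 20  # values from the slide
l = 15          # assumed motif length (not given on this slide)
count = alignment_count(n, l, t)
print(f"{n - l + 1} start positions per sequence")
print(f"{len(str(count))} decimal digits in the number of combinations")
```

Python integers are arbitrary precision, so the exact count is available without overflow.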

Existing methodologies
Generative probabilistic representation (continuous):
- Gibbs Sampling
- Expectation Maximization
- Greedy CONSENSUS
- HMM based
Mismatch representation (discrete consensus):
- Projection Methods
- Multiprofiler
- Suffix Trees

Existing methodologies
Global solvers (e.g. Random Projection, Pattern Branching)
- Advantage: find the neighborhood of globally optimal solutions.
- Disadvantage: miss better solutions locally.
Local solvers (e.g. EM, Gibbs Sampling, Greedy CONSENSUS)
- Advantage: return the best solution in a neighborhood.
- Disadvantage: rely heavily on initial conditions.

Our Approach
- Run a global solver (Random Projection) to estimate the neighborhood of a promising solution.
- Using this neighborhood as the initial guess, apply a local solver (Expectation Maximization) to refine the solution toward the global optimal solution.
- Perform an efficient neighborhood search to jump out of the convergence region and find other local solutions systematically.
- The hybrid approach combines the advantages of both the global and local solvers.

Random Projection
- Implements a hash function h(x) that maps each l-mer onto a k-dimensional space.
- Hashes all possible l-mers in the t sequences into 4^k buckets, where each bucket corresponds to a unique k-mer.
- After imposing certain conditions and setting a reasonable bucket threshold S, the buckets whose size exceeds S are returned as the solution.
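The hashing step above can be sketched as follows. This is a minimal illustration, assuming that h(x) reads k randomly chosen positions of each l-mer; the function names, the toy sequences and the seed are hypothetical, and the "certain conditions" mentioned on the slide are not modelled.

```python
import random
from collections import defaultdict


def random_projection_buckets(sequences, l, k, seed=0):
    """Hash every l-mer into a bucket keyed by k randomly chosen positions."""
    rng = random.Random(seed)
    positions = sorted(rng.sample(range(l), k))  # the projection h(x)
    buckets = defaultdict(list)
    for i, seq in enumerate(sequences):
        for start in range(len(seq) - l + 1):
            lmer = seq[start:start + l]
            key = "".join(lmer[p] for p in positions)
            buckets[key].append((i, start))  # (sequence index, start position)
    return positions, buckets


def enriched_buckets(buckets, threshold):
    """Keep only the buckets whose size exceeds the threshold S."""
    return {key: hits for key, hits in buckets.items() if len(hits) > threshold}


# Toy input: the hypothetical motif ACGTAC is planted at position 2 of each sequence.
sequences = ["TTACGTACGT", "GGACGTACCA", "CCACGTACTT"]
positions, buckets = random_projection_buckets(sequences, l=6, k=3)
print(positions, enriched_buckets(buckets, threshold=2))
```

Because all planted instances agree on every position, they land in the same bucket whichever k positions are sampled; with mutated instances, several independent projections would normally be tried.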

Expectation Maximization
Expectation Maximization is a local optimal solver that we use to refine the solution yielded by the random projection methodology. The EM method iteratively updates the solution until it converges to a locally optimal one.
Steps:
- Compute the scoring function.
- Iterate the Expectation step and the Maximization step.

Profile Space

j     k=b     k=1     k=2     k=3     k=4     …   k=l
{A}   C_0,1   C_1,1   C_2,1   C_3,1   C_4,1   …   C_l,1
{T}   C_0,2   C_1,2   C_2,2   C_3,2   C_4,2   …   C_l,2
{G}   C_0,3   C_1,3   C_2,3   C_3,3   C_4,3   …   C_l,3
{C}   C_0,4   C_1,4   C_2,4   C_3,4   C_4,4   …   C_l,4

A profile is a matrix of probabilities, where the rows represent possible bases and the columns represent consecutive sequence positions.
Applying the profile space to the coefficient formula constructs the PSSM.
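As an illustration of the motif columns k = 1…l, a profile can be estimated from a set of aligned instances. This sketch is an assumption rather than the slide's coefficient formula: it uses plain relative frequencies with a hypothetical pseudocount, stores the matrix as a list of columns, and omits the background column k = b.

```python
BASES = "ATGC"  # row order used on the slide


def build_profile(instances, pseudocount=1.0):
    """Return one {base: probability} dict per motif position."""
    l = len(instances[0])
    total = len(instances) + pseudocount * len(BASES)
    profile = []
    for k in range(l):
        column = [inst[k] for inst in instances]
        profile.append({b: (column.count(b) + pseudocount) / total for b in BASES})
    return profile


def consensus(profile):
    """Most probable base at each position."""
    return "".join(max(col, key=col.get) for col in profile)


instances = ["ACGT", "ACGA", "ACCT"]  # hypothetical aligned motif instances
profile = build_profile(instances)
print(consensus(profile))
```

The pseudocount keeps every entry strictly positive, which matters later when probabilities are multiplied along an l-mer.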

Scoring function- Maximum Likelihood

Expectation Step
The Expectation step returns the expected number of occurrences of the j-th residue at each position of the motif instance and in the overall sequence. The algorithm is as follows:
- Obtain the probabilities for position k and residue j from the previous M-step iteration.
- Use them to calculate the probability of every possible l-mer under the expected motif.
- Given the probability of each l-mer, calculate the probability that it is the correct starting position using Bayes' formula.
- Weighting each position of each l-mer by that probability, calculate the expected number of residue j at position k.

Maximization Step
The Maximization step receives the expected values E[k,j] and E[0,j] from the E-step, calculates the new motif probabilities (position k, residue j) and background probabilities (position 0, residue j), and returns them to the E-step.
- At iteration q: motif probability for (k, j) = E[k,j] / t; background probability for (0, j) = E[0,j] / (t (n - l)).
- If the probabilities at iteration q equal those at iteration q-1, the iteration ends. All the locally optimal solution sites are returned, with the consensus made up of the residue j that has the highest probability at each position k.
- Otherwise, the iteration-q probabilities are used in iteration q+1 of the E-step.
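The two steps can be put together in a compact sketch. It assumes exactly one motif occurrence per sequence and a uniform prior over start positions, adds a small pseudocount (so the normalisation differs slightly from the E[k,j] / t form on the slide), and starts from a seed l-mer instead of a random-projection bucket. All names and the toy data are hypothetical.

```python
BASES = "ACGT"


def e_step(sequences, motif, background, l):
    """Posterior probability of every start position in every sequence."""
    posteriors = []
    for seq in sequences:
        weights = []
        for start in range(len(seq) - l + 1):
            ratio = 1.0  # likelihood of the l-mer under the motif vs. the background
            for k in range(l):
                base = seq[start + k]
                ratio *= motif[k][base] / background[base]
            weights.append(ratio)
        total = sum(weights)
        posteriors.append([w / total for w in weights])  # Bayes' formula
    return posteriors


def m_step(sequences, posteriors, l, pseudocount=0.1):
    """Re-estimate motif and background probabilities from expected counts."""
    motif_counts = [{b: pseudocount for b in BASES} for _ in range(l)]
    bg_counts = {b: pseudocount for b in BASES}
    for seq, post in zip(sequences, posteriors):
        for base in seq:
            bg_counts[base] += 1.0  # all residues; expected motif residues removed below
        for start, w in enumerate(post):
            for k in range(l):
                base = seq[start + k]
                motif_counts[k][base] += w
                bg_counts[base] -= w
    motif = [{b: c[b] / sum(c.values()) for b in BASES} for c in motif_counts]
    bg_total = sum(bg_counts.values())
    return motif, {b: bg_counts[b] / bg_total for b in BASES}


def run_em(sequences, seed_lmer, iterations=50, tol=1e-6):
    l = len(seed_lmer)
    motif = [{b: (0.7 if b == seed_lmer[k] else 0.1) for b in BASES} for k in range(l)]
    background = {b: 0.25 for b in BASES}
    for _ in range(iterations):
        post = e_step(sequences, motif, background, l)
        new_motif, background = m_step(sequences, post, l)
        change = max(abs(new_motif[k][b] - motif[k][b]) for k in range(l) for b in BASES)
        motif = new_motif
        if change < tol:  # probabilities stopped changing: converged
            break
    consensus = "".join(max(col, key=col.get) for col in motif)
    sites = [max(range(len(p)), key=p.__getitem__) for p in post]
    return consensus, sites


# Toy data: the hypothetical motif ACGTAC is planted once in each sequence.
sequences = ["TTGACGTACGG", "CAACGTACTTG", "GGTTACGTACA", "ACGTACTTGGC"]
consensus, sites = run_em(sequences, "ACGTAC")
print(consensus, sites)
```

As on the slide, the result is only locally optimal: a different seed l-mer can converge to a different profile, which is what motivates the neighborhood search in the rest of the talk.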

Basic Idea
One-to-one correspondence of the critical points:
- Local minimum <-> stable equilibrium point
- Saddle point <-> decomposition point
- Local maximum <-> source

Theoretical Background
Practical stability boundary: the problem of finding all the Tier-1 stable equilibrium points of x_s is the problem of finding all the decomposition points on its stability boundary.

Theoretical background
Theorem (Unstable manifold of type-1 equilibrium point): Let x_s^1 be a stable e.p. of the gradient system (2) and x_d be a type-1 e.p. on the practical stability boundary ∂A_p(x_s). Assume that there exist ε and δ such that |∇f(x)| > ε unless x ∈ {x : ∇f(x) = 0}. If every e.p. of (1) is hyperbolic and its stable and unstable manifolds satisfy the transversality condition, then there exists another stable e.p. x_s^2 to which the one-dimensional unstable manifold of x_d converges.
Our method finds the stability boundary between the two local minima and traces the stability boundary to find the saddle point. We used a new trajectory adjustment procedure to move along the practical stability boundary.

Definitions
Def 1: x is said to be a critical point of (1) if it satisfies the condition ∇f(x) = 0, where f(x) is the objective function, assumed to be in C²(ℝ^n, ℝ). The corresponding nonlinear dynamical system is Eq. (1). The solution curve of Eq. (1) starting from x at time t = 0 is called a trajectory and is denoted by Φ(x, ·) : ℝ → ℝ^n. A state vector x is called an equilibrium point (e.p.) of Eq. (3) if f(x) = 0.

Definitions (contd.)
Def 2: An equilibrium point is said to be hyperbolic if the Jacobian of f at point x has no eigenvalues with zero real part. A hyperbolic e.p. is called a
- stable e.p. if all the eigenvalues of its Jacobian have negative real part;
- unstable e.p. if some eigenvalues have positive real part;
- type-k e.p. if its Jacobian has exactly k eigenvalues with positive real part.
We propose to build a negative gradient system associated with (1), as shown below:
dx/dt = -∇f(x)    Eq. (2)
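For the negative gradient system of Eq. (2), the Jacobian at an equilibrium point is the negated Hessian of f, so the classification in Def 2 reduces to counting eigenvalue signs. The sketch below assumes a two-dimensional objective with a symmetric Hessian; the test function f(x, y) = (x² - 1)² + y² and the helper names are chosen only for illustration.

```python
import math


def symmetric_2x2_eigenvalues(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]]."""
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    return mean - radius, mean + radius


def classify_equilibrium(hessian):
    """Classify an e.p. of dx/dt = -grad f(x) from the Hessian of f at that point."""
    (a, b), (_, d) = hessian
    # The Jacobian of the negative gradient system is minus the Hessian.
    jacobian_eigs = [-ev for ev in symmetric_2x2_eigenvalues(a, b, d)]
    if any(abs(ev) < 1e-12 for ev in jacobian_eigs):
        return "non-hyperbolic"
    positive = sum(ev > 0 for ev in jacobian_eigs)
    return "stable" if positive == 0 else f"type-{positive}"


def hessian_of_example(x, y):
    """Hessian of the illustrative objective f(x, y) = (x^2 - 1)^2 + y^2."""
    return [[12.0 * x * x - 4.0, 0.0], [0.0, 2.0]]


print(classify_equilibrium(hessian_of_example(1.0, 0.0)))  # a local minimum of f
print(classify_equilibrium(hessian_of_example(0.0, 0.0)))  # the saddle between the minima
```

In this example the two minima at (±1, 0) are stable equilibrium points and the saddle at the origin is a type-1 equilibrium point, the kind of point the method searches for on the stability boundary.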

Definitions (contd.)
A dynamical system is completely stable if every trajectory of the system leads to one of its stable equilibrium points.
Def 3: The stability region (or region of attraction) of a stable equilibrium point x_s of a nonlinear dynamical system (1) is denoted by A(x_s) and is defined as
A(x_s) = {x ∈ ℝ^n : lim_{t→∞} Φ(x, t) = x_s}
The boundary of the stability region is called the stability boundary of x_s and is denoted by ∂A(x_s).

Definitions (contd.)
Def 4: The practical stability region of a stable equilibrium point x_s of a nonlinear dynamical system (1) is denoted by A_p(x_s). The practical stability boundary ∂A_p(x_s) is a subset of the stability boundary; it eliminates the complex portion of the stability boundary that has no "contact" with the complement of the closure of the stability region.
Def 5: A decomposition point is a type-1 equilibrium point x_d on the practical stability boundary of a stable equilibrium point x_s.

Our Method

Search Directions

Our Method
The exit point method is implemented so that EM can move out of its convergence region to seek out other locally optimal solutions.
1. Construct a PSSM from the initial alignments.
2. Calculate the eigenvectors of the Hessian matrix.
3. Find exit points (or saddle points) along each eigenvector.
4. Apply EM from the new stability/convergence region.
5. Repeat the first step.
6. Return max score {A, a_1i, a_2j}.
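The exit-point idea can be illustrated on a one-dimensional toy problem. The sketch makes several assumptions that are not stated on the slide: the objective is minimised (for motif finding this would be the negated likelihood score), the exit point along a direction is taken to be the first sampled point after which f stops increasing, and plain gradient descent stands in for EM as the local solver. The double-well function and all names are hypothetical.

```python
def f(x):
    """Toy objective with two local minima, at x = -1 and x = +1."""
    return (x * x - 1.0) ** 2


def grad(x):
    return 4.0 * x * (x * x - 1.0)


def find_exit_point(f, start, direction, step=0.05, max_steps=1000):
    """Walk from a local minimum along a direction until f starts to decrease."""
    prev_x, prev_val = start, f(start)
    for i in range(1, max_steps + 1):
        x = start + i * step * direction
        val = f(x)
        if val < prev_val:
            return prev_x  # last sample before the objective started to fall
        prev_x, prev_val = x, val
    return None  # no exit point found in this direction


def descend(grad, x, rate=0.05, iterations=2000):
    """Local solver stand-in: fixed-step gradient descent."""
    for _ in range(iterations):
        x -= rate * grad(x)
    return x


def next_tier_minimum(f, grad, x_s, direction, step=0.05):
    """Cross the exit point and run the local solver from the other side."""
    exit_point = find_exit_point(f, x_s, direction, step)
    if exit_point is None:
        return None
    return descend(grad, exit_point + step * direction)


print(find_exit_point(f, -1.0, +1.0))
print(next_tier_minimum(f, grad, -1.0, +1.0))
print(next_tier_minimum(f, grad, -1.0, -1.0))
```

Starting from the minimum at x = -1, the search in the positive direction crosses the barrier near x = 0 and the local solver then settles in the neighboring minimum, while the negative direction yields no exit point. In the method described on the slide, the directions would be the Hessian eigenvectors in profile space.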

Results

Improvements in the Alignment Scores

Motif    Original Pattern       Score   Second Tier Pattern    Score
(11,2)   AACGGTCGCAG            125.1   CCCGGGAGCTG            153.3
(11,2)   ATACCAGTTAC            145.7   ATACCAGGGTC            153.6
(13,3)   CTACGGTCGTCTT          142.6   CCTCGGGTTTGTC          158.7
(13,3)   GACGCTAGGGGGT          158.3   GACCTTGGGTATT          165.8
(15,4)   CCGAAAAGAGTCCGA        147.5   CCGAAAGGACTGCGT        176.2
(15,4)   TGGGTGATGCCTATG        164.6   TGAGAGATGCCTATG        170.4
(17,5)   TTGTAGCAAAGGCTAAA      143.3   CAGTAGCAAAGACTTCC      175.8
(17,5)   ATCGCGAAAGGTTGTGG      174.1   ATTGCGAAAGAATGTGG      178.3
(20,6)   CTGGTGATTGAGATCATCAT   165.9   CATTTAGCTGAGTTCACCTT   194.9
(20,6)   GGTCACTTAGTGGCGCCATG   216.3   CGTCACTTAGTCGCGCCATG   219.7

Improvements in the Alignment Scores
Random Projection method results

Motif    Original Pattern       Score   Second Tier Pattern    Score
(11,2)   TATCGCTGGGC            147.5   TCTCGCTGGGC            161.1
(13,3)   CACCTTGGTAATT          168.4   GACCATGGGTATT          181.5
(15,4)   ATGGCGTCCGCAATG        174.7   ATGGCGTCCGAAAGA        188.5
(17,5)   CGACACTTTCTCAATGT      178.8   CGACACTATCTTAAGAT      196.2
(20,6)   TCAAATAGACTAGAGGCGAC   189.0   TCTACTAGACTGGAGGCGGC   201.1

Performance Coefficient K is the set of the residue positions of the planted motif instances, and P is the corresponding set of positions predicted
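The formula itself did not survive in the transcript. The sketch below assumes the definition commonly used in the motif-finding literature, |K ∩ P| / |K ∪ P|, with K and P built from the start position of one site per sequence; the function names and the example sites are hypothetical.

```python
def site_positions(starts, l):
    """Set of (sequence index, residue index) pairs covered by the motif sites."""
    return {(i, s + k) for i, s in enumerate(starts) for k in range(l)}


def performance_coefficient(known_starts, predicted_starts, l):
    """Overlap of known and predicted residue positions: |K & P| / |K | P|."""
    K = site_positions(known_starts, l)
    P = site_positions(predicted_starts, l)
    return len(K & P) / len(K | P)


# Two sequences: the first site is predicted exactly, the second is off by two.
score = performance_coefficient([3, 2], [3, 4], l=6)
print(round(score, 3))
```

A value of 1 means every planted residue position was recovered and nothing else was predicted; a shifted prediction is penalised in both the intersection and the union.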

Results Different Motifs and the average score using random starts. The first tier and second tier improvements on synthetic data.

Results Different Motifs and the average score using random projection. The first tier and second tier improvements on synthetic data.

Results Different Motifs and the average score using random projections and the first tier and second tier improvements on real human sequences.

Results on Real data

Concluding discussion
- Using a dynamical systems approach, we have shown that the EM algorithm can be improved significantly.
- In the context of motif finding, there are many locally optimal solutions, and it is important to search the neighborhood space.
- Try different global methods and other techniques like GibbsDNA.

Questions and suggestions!