Advanced Algorithms and Models for Computational Biology -- a machine learning approach Computational Genomics III: Gibbs motif sampler & advanced motif.

Slides:



Advertisements
Similar presentations
Hierarchical Dirichlet Processes
Advertisements

Hidden Markov Model in Biological Sequence Analysis – Part 2
. Context-Specific Bayesian Clustering for Gene Expression Data Yoseph Barash Nir Friedman School of Computer Science & Engineering Hebrew University.
Comparative Genomics & Annotation The Foundation of Comparative Genomics Non-Comparative Annotation Three methodological tasks of CG Annotation: Protein.
Ulf Schmitz, Statistical methods for aiding alignment1 Bioinformatics Statistical methods for pattern searching Ulf Schmitz
Bioinformatics Motif Detection Revised 27/10/06. Overview Introduction Multiple Alignments Multiple alignment based on HMM Motif Finding –Motif representation.
Regulatory Motifs. Contents Biology of regulatory motifs Experimental discovery Computational discovery PSSM MEME.
Profile Hidden Markov Models Bioinformatics Fall-2004 Dr Webb Miller and Dr Claude Depamphilis Dhiraj Joshi Department of Computer Science and Engineering.
Introduction to Sampling based inference and MCMC Ata Kaban School of Computer Science The University of Birmingham.
CHAPTER 16 MARKOV CHAIN MONTE CARLO
Ka-Lok Ng Dept. of Bioinformatics Asia University
Bioinformatics Finding signals and motifs in DNA and proteins Expectation Maximization Algorithm MEME The Gibbs sampler Lecture 10.
Gibbs sampling for motif finding in biological sequences Christopher Sheldahl.
Genome-wide prediction and characterization of interactions between transcription factors in S. cerevisiae Speaker: Chunhui Cai.
Regulatory Network (Part II) 11/05/07. Methods Linear –PCA (Raychaudhuri et al. 2000) –NIR (Gardner et al. 2003) Nonlinear –Bayesian network (Friedman.
Lecture 5: Learning models using EM
Transcription factor binding motifs (part I) 10/17/07.
A Very Basic Gibbs Sampler for Motif Detection Frances Tong July 28, 2004 Southern California Bioinformatics Summer Institute.
Lecture 9 Hidden Markov Models BioE 480 Sept 21, 2004.
Comparative ab initio prediction of gene structures using pair HMMs
The Model To model the complex distribution of the data we used the Gaussian Mixture Model (GMM) with a countable infinite number of Gaussian components.
Biological Sequence Pattern Analysis Liangjiang (LJ) Wang March 8, 2005 PLPTH 890 Introduction to Genomic Bioinformatics Lecture 16.
Cis-regultory module 10/24/07. TFs often work synergistically (Harbison 2004)
Counting position weight matrices in a sequence & an application to discriminative motif finding Saurabh Sinha Computer Science University of Illinois,
Systematic Analysis of Interactome: A New Trend in Bioinformatics KOCSEA Technical Symposium 2010 Young-Rae Cho, Ph.D. Assistant Professor Department of.
Cis-regulatory element study in transcriptome Jin Chen CSE Fall
Marcin Pacholczyk, Silesian University of Technology.
Gene finding with GeneMark.HMM (Lukashin & Borodovsky, 1997 ) CS 466 Saurabh Sinha.
BINF6201/8201 Hidden Markov Models for Sequence Analysis
Expectation Maximization and Gibbs Sampling – Algorithms for Computational Biology Lecture 1- Introduction Lecture 2- Hashing and BLAST Lecture 3-
Sequence analysis – an overview A.Krishnamachari
Motif finding with Gibbs sampling CS 466 Saurabh Sinha.
Using Mixed Length Training Sequences in Transcription Factor Binding Site Detection Tools Nathan Snyder Carnegie Mellon University BioGrid REU 2009 University.
Sampling Approaches to Pattern Extraction
Unraveling condition specific gene transcriptional regulatory networks in Saccharomyces cerevisiae Speaker: Chunhui Cai.
CS5263 Bioinformatics Lecture 20 Practical issues in motif finding Final project.
Gibbs Sampler in Local Multiple Alignment Review by 온 정 헌.
HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.
Computational Genomics and Proteomics Lecture 8 Motif Discovery C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.
Localising regulatory elements using statistical analysis and shortest unique substrings of DNA Nora Pierstorff 1, Rodrigo Nunes de Fonseca 2, Thomas Wiehe.
Gibbs sampling for motif finding Yves Moreau. 2 Overview Markov Chain Monte Carlo Gibbs sampling Motif finding in cis-regulatory DNA Biclustering microarray.
Cis-regulatory Modules and Module Discovery
Multiple Species Gene Finding using Gibbs Sampling Sourav Chatterji Lior Pachter University of California, Berkeley.
The generalization of Bayes for continuous densities is that we have some density f(y|  ) where y and  are vectors of data and parameters with  being.
Lecture #9: Introduction to Markov Chain Monte Carlo, part 3
CS 6243 Machine Learning Advanced topic: pattern recognition (DNA motif finding)
Learning Sequence Motifs Using Expectation Maximization (EM) and Gibbs Sampling BMI/CS 776 Mark Craven
A shared random effects transition model for longitudinal count data with informative missingness Jinhui Li Joint work with Yingnian Wu, Xiaowei Yang.
Markov Chain Models BMI/CS 576 Colin Dewey Fall 2015.
Special Topics in Genomics Motif Analysis. Sequence motif – a pattern of nucleotide or amino acid sequences GTATGTACTTACTATGGGTGGTCAACAAATCTATGTATGA TAACATGTGACTCCTATAACCTCTTTGGGTGGTACATGAA.
The Unscented Particle Filter 2000/09/29 이 시은. Introduction Filtering –estimate the states(parameters or hidden variable) as a set of observations becomes.
(H)MMs in gene prediction and similarity searches.
MGM workshop. 19 Oct 2010 Some frequently-used Bioinformatics Tools Konstantinos Mavrommatis Prokaryotic Superprogram.
CS Statistical Machine learning Lecture 25 Yuan (Alan) Qi Purdue CS Nov
Co-Modelling and Conditional Modelling Observable Unobservable Goldman, Thorne & Jones, 96 U C G A C A U A C Knudsen.., 99 Eddy & co. Meyer and Durbin.
Pattern Discovery and Recognition for Understanding Genetic Regulation Timothy L. Bailey Institute for Molecular Bioscience University of Queensland.
Transcription factor binding motifs (part II) 10/22/07.
Motif identification with Gibbs Sampler Xuhua Xia
Finding Regulatory Signals in Genomes Regulatory signals know from molecular biology Different Kinds of Signals Promotors Enhancers Splicing Signals The.
HW7: Evolutionarily conserved segments ENCODE region 009 (beta-globin locus) Multiple alignment of human, dog, and mouse 2 states: neutral (fast-evolving),
1 Discovery of Conserved Sequence Patterns Using a Stochastic Dictionary Model Authors Mayetri Gupta & Jun S. Liu Presented by Ellen Bishop 12/09/2003.
Generalization Performance of Exchange Monte Carlo Method for Normal Mixture Models Kenji Nagata, Sumio Watanabe Tokyo Institute of Technology.
Advanced Statistical Computing Fall 2016
A Very Basic Gibbs Sampler for Motif Detection
Gibbs sampling.
Learning Sequence Motif Models Using Expectation Maximization (EM)
Transcription factor binding motifs
Remember that our objective is for some density f(y|) for observations where y and  are vectors of data and parameters,  being sampled from a prior.
Sequential Pattern Discovery under a Markov Assumption
Learning Sequence Motif Models Using Gibbs Sampling
Presentation transcript:

Advanced Algorithms and Models for Computational Biology -- a machine learning approach Computational Genomics III: Gibbs motif sampler & advanced motif detection algorithms Eric Xing Lecture 8, February 13, 2005 Reading: Chap. 9, DTW book, additional papers

The product multinomial model [Lawrence et al. Science 1993] Positional specific multinomial distribution:  l = [  l A, …,  l C ] T  Position weight matrix (PWM):  The nucleotide distributions at different positions are independent A1A1   ... LL A2A2 A3A3 ALAL AAAAGAGTCA AAATGACTCA AAGTGAGTCA AAAAGAGTCA GGATGCGTCA AAATGAGTCA GAATGAGTCA AAAAGAGTCA

For model P ( A |  ): Treat parameters  as unobserved random variable(s) probabilistic statements of  is conditioned on the values of the observed variables A obs Bayes’ rule: Bayesian estimation: p()p() posterior likelihood prior Bayesian approach

Bayesian missing data problem  : parameter of interest X={ x 1,…, x N }: a set of complete i.i.d. observations from a density that depends upon θ: p (X ┃ θ) p (θ ┃ X) = Π i=1,…,n p ( x i ┃ θ) p (θ) / p (X) In practical situations, x i may not be completely observed. Assuming the unobserved values are missing completely at random, let X=(Y,Z), x i =( y i,z i ) i =1,…, n y i : observed part, z i : missing part p (θ ┃ Y) = ∫ p (θ ┃ Y, Z) p (Z ┃ Y)dZ This integration is usually hard obtain in close-form  Imputation

Multiple values, z (1),…, z ( m ) are drawn from p ( Z ┃ Y) to form m complete data sets. With these imputed data sets and the ergodicity theorem, p (θ ┃ Y) ≈ 1/m  { p (θ ┃ Y, z (1) )+ … + p (θ ┃ Y, z (m) )} But in most applied problems it is impossible to draw Z from (Z ┃ Y) directly. Data Imputation

Data Augmentation Tanner and Wong’s data augmentation (DA) apply Gibbs Sampler to draw multiples of θ’s and multiples of Z ’s jointly from p (θ, Z ┃ Y) DA algorithm Notice that: p (Z ┃ Y) = ∫ p (θ, Z ┃ Y)dθ =∫ p (Z ┃ θ, Y) p (θ ┃ Y)dθ ≈1/m  { p (Z ┃ θ (1),Y)+ … + p (Z ┃ θ (m),Y)} I-step z ( m ) ~ p (Z ┃ θ ( m ), Y) Recall that: p (θ ┃ Y) ≈ 1/m  { p (θ ┃ Y, z (1) )+ … + p (θ ┃ Y, z (m) )} P-stepθ ( m ) ~ p (θ ┃ Y, z ( m ) ) By iterating between drawing θ from p (θ ┃ Y,Z) and drawing Z from p (Z ┃ θ,Y), DA constructs a Markov chain whose equilibrium distribution is p (θ, Z ┃ Y)

Collapsed Gibbs Sampler (J. Liu) Consider Sampling from p (θ ┃ D ), θ=(θ 1,θ 2,θ 3 ) Original Gibbs Sampler Collapsed Gibbs Sampler (J. Liu):

The oops model (one occurrence per sequence) Parameter of Interest Missing DataA= {a 1,a 2,…,a k } Observed DataY= Given Sequences Back to de novo Motif Detection  Bayesian missing data problem

The Gibbs Motif Sampler Standard Gibbs Sampler (Iterative sampling): Standard Gibbs Sampler (Iterative sampling): p(  |A,Y); p(A| ,Y)  Draw from [  | A, Data], then draw from [A | , Data] Collapsed Gibbs Sampler (Predictive Updating) Collapsed Gibbs Sampler (Predictive Updating): A ~ p(A ┃ Y) v pretend that K-1 motif instances have been found. We stochastically predict for the K-th instance!! Step0. choose an arbitrary starting point A (0) =(a 1 (0),a 2 (0),…,a K (0) ); Step1. Generate A (t+1) =(a 1 (t+1), a 2 (t+1),…,a N (t+1) ) as follows: Generate a 1 (t+1) ~ p (a 1 ┃ a 2 (t),…,a K (t), Y); Generate a 2 (t+1) ~ p (a 2 ┃ a 1 (t+1), a 3 (t),…,a K (t), Y); … Generate a K (t+1) ~ p (a N ┃ a 1 (t+1), a 2 (t+1),…,a K-1 (t+1), Y); Generate  (t+1) ~ p (  ┃ a 1 (t+1), a 2 (t+1),…,a K (t+1), Y); Step2. Set t=t+1, and go to step 1

Initialized by choosing random starting positions Iterate the following steps many times: Randomly or systematically choose a sequence, say, sequence k, to exclude. Carry out the predictive-updating step to update a k (no need to sample  (t) at each t, we can compute it in close-form, see next slide Notations: Stop when not much change observed, or some criterion met. The Predictive Update Version ak ?ak ? a1a1 a2a2 a3a3

The predictive distribution for a k : Predictive update: Sampling: every segment of width W in Y k has probability (*). Choose one at random according to these probabilities, and this becomes the new a k. assuming: θ 0 ~ Dirichlet(α), θ ~ Product Dirichlet(B), B = (β l,j ), a ~ Uniform (*) Predictive Distribution

Derivation (1) (2) (3) (4) (5) …

ak ?ak ? : True motif locations Compare entropies between new columns and left-out ones. Phase-shift and fragmentation Sometimes get stuck in a local shift optimum How to “escape” from this local optimum? Simultaneous move: A  A+   A+   a 1 + , …, a K +  Use a Metropolis step: accept the move with prob=p,

Convergence

Model Selection

Multiple Pattern Occurances in the same sequences: Liu, J. `The collapsed Gibbs sampler with applications to a gene regulation problem," Journal of the American Statistical Association Prior: any position i has a small probability e to start a binding site: Recall augmented proposal distribution for the Gibbs motif sampler width = w length n L akak Natural Extensions to Basic Model I

Although sampling z directly is prohibitive, a very simple form of the conditional distribution of any z n given all the rest z [-n] is available where are estimated from y and z [-n]. z1z1 z2z2 zNzN … y1y1 y2y2 yNyN … Z  {0, 1} N Back to the binary indicator model

z1z1 z2z2 zNzN … y1y1 y2y2 yNyN … Z  {0, 1} N How about multiple types of motifs? Can motif overlap? Are motif sites independent? This model can be easily extended to solving K different types motif simultaneously by letting Z  {0, …, K} NT A Markov chain for Z. See next... Any problem with the model?

Composite Patterns: BioOptimizer: the Bayesian Scoring Function Approach to Motif Discovery Bioinformatics Multiply the motif p -values Natural Extensions to Basic Model II Combined p-value = * * * = * Seq1

The product of p -values Theorem: The probability Fn( p ) that the product of n independent, uniform [0,1] random variables has an observed value less than or equal to p, is given by for 0 < p  1, and is zero when p is zero. Timothy L. Bailey and Michael Gribskov, “Combining evidence using p-values: application to sequence homology searches,” Bioinformatics, 14:48-54, 1998.

1 K w1w1 w2w2 w3w3 w4w4 Natural Extensions to Basic Model III Correlated in Nucleotide Occurrence in Motif: Modeling within-motif dependence for transcription factor binding site predictions. Bioinformatics, 6, Insertion-Deletion BALSA: Bayesian algorithm for local sequence alignment Nucl. Acids Res.,

The PM parameter,  l = [  lA, …,  lC ] T, corresponds exactly to the PWM of a motif The score (likelihood-ratio) of a candidate substring: AAAAGAGTCA The nucleotide distributions at different sites are independent ! Recall the PM Model

The likelihood of a candidate substring: AAAAGAGTCA Weighted PWMs The nt-distributions at different sites are conditionally independent but marginally dependent ! (Barash et al. 2002) Mixture of PM models

(Barash et al. 2002) Mixture of trees The likelihood of a candidate substring: AAAAGAGTCA The nt-distributions at different sites are pairwisely dependent ! … Tree CPTs Tree models

Dirichlet parameters Local HMM parameters Learning: empirical Bayes estimation: a family of training motifs {A} k  hyperparameters { , , B} k [Xing and Karp, PNAS 2004] MotifPrototyper: A Bayesian Markovian model

Enhanceosome: Cis-Regulatory Modules

Cluster Finding Methods Poisson model A. Wagner Cister (CIS-element clusTER finder) M. Frith et al. Comet (Cluster Of Motifs E-value Tool) M. Frith et al. cis-regulatory module finder Gupta M, Liu JS. LOGOS Xing et al.

Finding Motif Clusters Poisson Method Regulatory Modules: Cister & Comet De novo cis-regulatory module elicitation for eukaryotic genomes. Proc Nat’l Acad Sci USA, 102, LOGOS “Pearson type III distribution”:Exponential distribution: M1 M2 M3 Stop Start Gene A Gene B

Cister & Comet: Introduction DNA sequencesegment Cluster model: Poisson-distributed cis-elements, embedded in random DNA

Hidden Markov Model Cis-element Random DNA

Cis-element Random DNA Cluster model Random model Cister

Cister applied to Human c-fos

Comet DNA sequencesegment Parse: One particular arrangement of cis-elements in the segment

The CisModuler global HMM Space of hidden states –all possible functional annotations Parameters –transition parameters {  } Empirical priors for {  } Posterior decoding [Xing, Wu, Jordan and Karp. JBCB 2004]

Motifs Coding regions Protein binding in neighborhood Coding regions Systems Biology Approach: Combining Signals and other Data Expresssion and Motif Regression: Integrating Motif Discovery and Expression Analysis Proc.Natl.Acad.Sci Rank genes by E=log2(expression fold change) 2. Find “many” (hundreds) candidate motifs 3. For each motif pattern m, compute the vector S m of matching scores for genes with the pattern 4. Regress E on S m ChIP-on-chip kb information on protein/DNA interaction: An Algorithm for Finding Protein-DNA Interaction Sites with Applications to Chromatin Immunoprecipitation Microarray Experiments Nature Biotechnology, 20,

Blanchette and Tompa (2003) “FootPrinter: a program designed for phylogenetic footprinting” NAR beginsignalend Phylogenetic Footprinting (homologous detection) Term originated in 1988 in Tagle et al. Blanchette et al.: For unaligned sequences related by phylogenetic tree, find all segments of length k with a history costing less than d. Motif loss an option.

Binding Sites Databases TRANSFAC:

Species-specific: SCPD (yeast) DPInteract (e. coli) Drosophila DNase I Footprint Database (v2.0) More Databases

Logos:

Gibbs Motif Sampler

MEME

Reviews Stormo GD (2000), Bioinformatics, 16:16-23 Bulyk (2003), Genome Biology 5:201 Logos Schneider & Stephens (1990), Nucleic Acids Res. 18: Probabilistic GMS: Lawrence et al. (1993), Science, 262: MEME: Bailey & Elkan (1995), Machine Learning, 21:51-80 Bayesian models for multiple local sequence alignment and Gibbs sampling strategies. J. S. Liu, A. F. Neuwald, and C. E. Lawrence. Journal of the American Statistical Association, 90, , References

Structural Constraints Modeling dependencies in protein-dna binding sites, Y. Barash et al. ISMB 2003 MotifPrototyper: a profile Bayesian model for motif family. E.P. Xing and R. M. Karp, Proc. Natl. Acad. Sci. vol. 101, no. 29, , Kechris et al. (2004), Genome Biology, 5(7):R50. Regulatory models Frith MC, Hansen U, Weng Z (2001) Bioinformatics 17(10): Frith MC, Spouge JL, Hansen U, Weng Z (2002) Nucleic Acids Research (in press) LOGOS: A modular Bayesian model for de novo motif detection. E.P. Xing, et al., Journal of Bioinformatics and Computational biology 2004, 2(1), CisModule: De novo discovery of cis-regulatory modules by hierarchical mixture modeling. Proc. Natl. Acad. Sci. USA, 101: De novo cis-regulatory module elicitation for eukaryotic genomes M. Gupta and J. Liu, PNAS, vol. 101, References