1 Discovery of Conserved Sequence Patterns Using a Stochastic Dictionary Model Authors Mayetri Gupta & Jun S. Liu Presented by Ellen Bishop 12/09/2003.

Presentation transcript:

1 Discovery of Conserved Sequence Patterns Using a Stochastic Dictionary Model Authors Mayetri Gupta & Jun S. Liu Presented by Ellen Bishop 12/09/2003

2 “ofallthewordsinthisunsegmentedphrasetherearesomehidden”. The challenge is to develop an algorithm for DNA sequences that can partition the sequence into meaningful “words”.

3 Presentation Outline: Introduction; MobyDick; Stochastic Dictionary-based Data Augmentation (SDDA); Algorithm Extensions; Results.

4 Introduction. Publicly available databases of genome sequences raise new challenges: How is gene expression regulated to meet the requirements of specific cells, or to let cells respond to changes? How can gene regulatory networks be analyzed more efficiently?

5 Gene Regulation. Transcription factors (TFs) play a critical role in gene expression: they can enhance it or inhibit it. Short DNA motifs often correspond to TF binding sites. Goal: build a model for TF binding sites given a set of DNA sequences thought to be regulated together.

6 MobyDick. Dictionary-building algorithm developed in 2000 by Bussemaker, Li and Siggia. It decomposes sequences into the most probable set of words: start with a dictionary of single letters; test the concatenation of each pair of words against its frequency; update the dictionary.

7 MobyDick results. Tested on the first 10 chapters of Moby Dick, which contain 4214 unique words, 1600 of them repeats. The resulting dictionary had 3600 unique words and included virtually all 1600 repeated words.

8 SDDA: Stochastic Dictionary-based Data Augmentation. Stochastic words are represented by a probabilistic word matrix (PWM). Some definitions:
$D$ = dictionary size
$S$ = sequence data, generated by concatenation of words
$\mathcal{D} = \{M_1, \ldots, M_D\}$ = the dictionary of words, including single letters
$P = (p(M_1), \ldots, p(M_D))$ = word probability vector
$A_k = \{A_{1k}, \ldots, A_{nk}\}$ = site indicators for motif $M_k$ ($A_{ik} = 1$ or $0$)

9 SDDA: Some definitions
$q = 4$; A, G, C, T are the first 4 words in the dictionary
$\Pi = \{P_1, \ldots, P_k\}$ = sequence partition, where each part $P_i$ corresponds to a dictionary word
$N(\Pi)$ = total number of parts in the partition
$N_{M_j}(\Pi)$ = number of occurrences of word type $M_j$ in the partition
$w_j$ ($j = 1, \ldots, D$) = word lengths
The $D - q$ motif matrices are denoted by $\{\Theta_{q+1}, \ldots, \Theta_D\} = \Theta^{(D)}$
If the $k$-th word has width $w$, its probability matrix is $\Theta_k = \{\theta_{1k}, \ldots, \theta_{wk}\}$

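10 Probabilistic Word Matrix. The matrix has one row per nucleotide (A, C, G, T) and one column per position of the word. The probability of a word is the product of its per-position entries: P(ACAGG) = 0.85 × 0.78 × 0.80 × 0.96 × 0.85 = 0.4328 and P(GCAGA) = 0.10 × 0.78 × 0.80 × 0.96 × 0.12 = 0.0072.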
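The product rule on this slide can be checked with a few lines of Python. This is only a sketch: the seven entries used by the two worked examples come from the slide, while every other entry of `PWM` is a hypothetical filler chosen so that each column sums to 1. The name `word_probability` is also an assumption, not something from the paper.

```python
# One dict per word position. Entries used by the slide's two examples are
# taken from the slide; all remaining entries are hypothetical fillers.
PWM = [
    {"A": 0.85, "C": 0.03, "G": 0.10, "T": 0.02},
    {"A": 0.10, "C": 0.78, "G": 0.07, "T": 0.05},
    {"A": 0.80, "C": 0.05, "G": 0.10, "T": 0.05},
    {"A": 0.02, "C": 0.01, "G": 0.96, "T": 0.01},
    {"A": 0.12, "C": 0.02, "G": 0.85, "T": 0.01},
]


def word_probability(word, pwm=PWM):
    """Probability of `word` under a PWM: product of per-position entries."""
    if len(word) != len(pwm):
        raise ValueError("word length must equal the matrix width")
    prob = 1.0
    for letter, column in zip(word, pwm):
        prob *= column[letter]
    return prob


print(round(word_probability("ACAGG"), 4))  # 0.4328
print(round(word_probability("GCAGA"), 4))  # 0.0072
```

A stochastic word therefore assigns a nonzero probability to every string of the right length, which is what lets one matrix stand for a whole family of similar binding sites.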

11 General idea of the algorithm. Start with D(1) = {A, G, C, T} and estimate the likelihood of those 4 words in the dataset. Then look at any pair of letters, say AT. If it is over-represented in comparison to what D(1) predicts, it is added to the dictionary D(2); this is repeated for all pairs. In general, consider the concatenations of all pairs of words in D(n) and form a new dictionary D(n+1) by including those new words that are over-represented, that is, more abundant than expected by chance.
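The first growth step, from single letters to letter pairs, can be sketched as follows. This is a deliberately simplified stand-in, not the statistical test used by MobyDick or SDDA: it assumes letters are independent under the current dictionary and flags a pair when its observed count exceeds the expected count by a z-like score. The function name and the threshold of 3.0 are assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def overrepresented_pairs(seq, z_threshold=3.0):
    """Return adjacent letter pairs that occur more often than expected
    if letters were drawn independently (a simplified over-representation test)."""
    n = len(seq)
    letter_freq = {a: c / n for a, c in Counter(seq).items()}
    pair_counts = Counter(seq[i:i + 2] for i in range(n - 1))
    hits = {}
    for pair, observed in pair_counts.items():
        # Expected count of the pair under independence of adjacent letters.
        expected = (n - 1) * letter_freq[pair[0]] * letter_freq[pair[1]]
        z = (observed - expected) / sqrt(expected)
        if z >= z_threshold:
            hits[pair] = z
    return hits


seq = "AT" * 50 + "GC" * 50
print(sorted(overrepresented_pairs(seq)))  # ['AT', 'CG', 'GC', 'TA']
```

Pairs that pass the test would become candidate words for the next dictionary, and the same idea is then applied to concatenations of longer words.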

12 SDDA Algorithm
1) Partitioning: sample words given the current values of the stochastic word matrix and the word usage probabilities.
Do a recursive summation of probabilities to evaluate the partial likelihood up to every point in the sequence:
$$L_i(\Theta) = \sum_{k} P(S_{[i-w_k+1:i]} \mid \Theta)\, L_{i-w_k}(\Theta)$$
Words are sampled sequentially backward, starting at the end of the sequence. Sample for a word starting at position $i$ according to the conditional probability
$$P(A_{ik} = 1 \mid A_{i+w_k}, \Theta) = \frac{P(S_{[i:i+w_k-1]} \mid \Theta_k, p)\, L_{i-1}(\Theta)}{L_{i+w_k-1}(\Theta)}$$
If none of the words is selected, the appropriate single-letter word is assumed and the position index $i$ is decremented by 1.
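The forward summation and backward sampling of the partitioning step can be sketched as below. To keep the example short it assumes deterministic words (exact string matches) with usage probabilities stored in a dict; for stochastic words the exact match would be replaced by the PWM probability of the substring. The function names are illustrative and do not come from the paper.

```python
import random


def forward_likelihood(seq, word_probs):
    """L[i] = total probability of all partitions of seq[:i] into dictionary words."""
    L = [0.0] * (len(seq) + 1)
    L[0] = 1.0
    for i in range(1, len(seq) + 1):
        for word, p in word_probs.items():
            w = len(word)
            if w <= i and seq[i - w:i] == word:
                L[i] += p * L[i - w]
    return L


def sample_partition(seq, word_probs, rng=random):
    """Sample one partition by walking backward from the end of the sequence."""
    L = forward_likelihood(seq, word_probs)
    if L[-1] == 0.0:
        raise ValueError("sequence cannot be partitioned with this dictionary")
    parts, i = [], len(seq)
    while i > 0:
        words, weights = [], []
        for word, p in word_probs.items():
            w = len(word)
            if w <= i and seq[i - w:i] == word:
                words.append(word)
                # Weight of the word ending at i, proportional to p * L[i - w] / L[i].
                weights.append(p * L[i - w])
        word = rng.choices(words, weights=weights, k=1)[0]
        parts.append(word)
        i -= len(word)
    return parts[::-1]


word_probs = {"A": 0.2, "C": 0.2, "G": 0.2, "T": 0.2, "AT": 0.2}
print(sample_partition("ATGAT", word_probs, random.Random(0)))
```

Because the forward table already sums over every partition of each prefix, the backward pass draws an exact sample from the posterior over partitions rather than a greedy one.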

13 SDDA Algorithm
2) Parameter update: given the partition A, update the stochastic word matrix $\Theta_D$ and the word probability vector P by sampling from their posterior distributions.
3) Repeat steps 1 and 2 until convergence, that is, until the MAP (maximum a posteriori) score stops increasing. The MAP score is a way of scoring the optimal alignment and is calculated at each iteration.
4) Increase the dictionary size, D = D + 1, and repeat from step 1, now treating $\Theta_{D-1}$ as a known word matrix.
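The word-probability part of the parameter update can be illustrated with a conjugate draw. This sketch assumes a symmetric Dirichlet prior on the word usage probabilities, so that the posterior given the word counts of the current partition is again Dirichlet; the prior, the `pseudocount` parameter and the function name are assumptions made for illustration, not details taken from the slides.

```python
import random
from collections import Counter


def sample_word_probs(partition, dictionary, pseudocount=1.0, rng=random):
    """Draw word usage probabilities from Dirichlet(counts + pseudocount)."""
    counts = Counter(partition)
    # A Dirichlet draw is a vector of independent Gamma draws, normalized.
    draws = {w: rng.gammavariate(counts[w] + pseudocount, 1.0) for w in dictionary}
    total = sum(draws.values())
    return {w: g / total for w, g in draws.items()}


dictionary = ["A", "C", "G", "T", "AT"]
partition = ["AT", "G", "AT", "C", "AT"]
print(sample_word_probs(partition, dictionary, rng=random.Random(0)))
```

The columns of a stochastic word matrix could be updated the same way, using the letter counts observed at each position of the sampled motif sites.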

14 Algorithm Extensions: phase shift via Metropolis steps; patterns with variable insertions and deletions (gaps); patterns of unknown widths; motif detection in the presence of “low complexity” regions.

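15 Phase Shift. If positions 7, 19, 8, 23 mark the strongest pattern but the algorithm chooses $a_1 = 9$, $a_2 = 21$ early on, then it is likely to also choose $a_3 = 10$, $a_4 = 25$, so the whole alignment is shifted by the same offset. Solution via Metropolis steps: let $a = \{a_1, \ldots, a_m\}$ be the starting positions of the occurrences of a motif. Choose $\delta = \pm 1$ with probability 0.5 each, then update the motif positions to $a + \delta$ with probability $\min\{1,\ p(a + \delta \mid \Theta) / p(a \mid \Theta)\}$.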
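A single phase-shift move can be sketched as a standard Metropolis step. The callable `log_post` is hypothetical: it stands for the log posterior of a set of motif start positions and would in practice be computed from the current model. The toy score used in the usage lines is made up purely to reproduce the slide's example.

```python
import math
import random


def phase_shift_step(positions, log_post, rng=random):
    """Propose shifting all motif starts by +1 or -1 and accept with
    probability min(1, p(a + delta) / p(a))."""
    delta = rng.choice([-1, 1])
    proposal = [a + delta for a in positions]
    log_ratio = log_post(proposal) - log_post(positions)
    if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
        return proposal
    return positions


# Toy posterior, sharply peaked at the slide's "strongest pattern" positions.
true_sites = [7, 19, 8, 23]


def toy_log_post(positions):
    return -10.0 * sum((a - t) ** 2 for a, t in zip(positions, true_sites))


rng = random.Random(1)
current = [9, 21, 10, 25]  # shifted by +2, as in the slide's example
for _ in range(200):
    current = phase_shift_step(current, toy_log_post, rng)
print(current)  # [7, 19, 8, 23]
```

Because all sites move together, the step can escape a shifted local mode that site-by-site updates tend to stay in.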

16 Patterns with gaps or unknown widths.
Gaps: an additional recursive sum in the partitioning step (1), using four parameters: insertion-opening probability (io), insertion-extension probability (ie), deletion-opening probability (do) and deletion-extension probability (de).
Unknown widths: the authors also enhanced their algorithm to determine the likely pattern width when it is unspecified.

17 Motif Detection with “low complexity” regions. Examples: AAAAAAA…, CGCGCGCG… The stochastic dictionary model is expected to control for this by treating these repeats as a series of adjacent words.

18 Results. Two case studies are provided: a simulated dataset with background polynucleotide repeats, and CRP binding sites.

19 Relative performance of SDDA compared to BioProspector (BP) and AlignACE (AA). [Table: EVAL2, with success and false-positive columns for each of SDDA, BP and AA; rows (a) and (b).]

20 Credits
Slides 6, 7: Bussemaker, H.J., Li, H. and Siggia, E.D. (2000), “Building a Dictionary for Genomes: Identification of Presumptive Regulatory Sites by Statistical Analysis”, Proceedings of the National Academy of Sciences USA, 97.
Slide 9: Liu, J.S., Gupta, M., Liu, X., Mayerhofere, L. and Lawrence, C.E., “Statistical Models for Biological Sequence Motif Discovery”, 1-19.
Slide 14: Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F. and Wootton, J.C. (1993), “Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment”, Science, 262.

21 Bibliography
Bussemaker, H.J., Li, H. and Siggia, E.D. (2000), “Building a Dictionary for Genomes: Identification of Presumptive Regulatory Sites by Statistical Analysis”, Proceedings of the National Academy of Sciences USA, 97.
Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F. and Wootton, J.C. (1993), “Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment”, Science, 262.
Liu, J.S., Gupta, M., Liu, X., Mayerhofere, L. and Lawrence, C.E., “Statistical Models for Biological Sequence Motif Discovery”, 1-19.