Motif identification with Gibbs Sampler Xuhua Xia



Xuhua Xia Slide 2 Background
Named after Josiah Willard Gibbs (February 11, 1839 – April 28, 1903), winner of the Copley Medal of the Royal Society of London.
One of the Markov chain Monte Carlo algorithms.
Biological applications:
–Identification of regulatory sequences of genes (Aerts et al., 2005; Coessens et al., 2003; Lawrence et al., 1993; Qin et al., 2003; Thijs et al., 2001; Thijs et al., 2002a; Thijs et al., 2002b; Thompson et al., 2004; Thompson et al., 2003) and functional motifs in proteins (Mannella et al., 1996; Neuwald et al., 1995; Qu et al., 1998)
–Classification of biological images (Samso et al., 2002)
–Pairwise sequence alignment (Zhu et al., 1998) and multiple sequence alignment (Holmes and Bruno, 2001; Jensen and Hein, 2005)

Xuhua Xia Slide 3 Motif Identification by Gibbs sampler
Other outputs of the Gibbs sampler:
–Position weight matrix that can be used to scan other sequences for motifs, with the associated significance tests
–Position weight matrix scores for identified motifs

Xuhua Xia Slide 4 Gibbs sampler in motif finding
–Site sampler
–Motif sampler

Xuhua Xia Slide 5 Algorithm details: Initialization
S1 TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT
S2 CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG
S3 TCGACCCTCTGAACCTATCAGGGACCACAGTCAGCCAGGCAAG
S4 AAAACACTTGAGGGAGCAGATAACTGGGCCAACCATGACTC
S5 GGGTGAATGGTACTGCTGATTACAACCTCTGGTGCTGC
S11 CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG
Table 7-1. Site-specific distribution of nucleotides from the 29 random motifs of length 6. The second column lists the distribution of nucleotides outside the 29 random motifs.
[Table 7-1: rows A, C, G, T by motif site; the count values are not preserved in the transcript.]
Randomly choose the motif start A_i in each sequence.
F_A: 325, F_C: 316, F_G: 267, F_T: 301; sum: 1209
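The initialization step above can be sketched in Python as follows. This is a minimal illustration, not the author's implementation: the function name `init_site_sampler`, the 0-based start positions, and the use of only three of the slide's sequences are assumptions made for the example.

```python
import random
from collections import Counter

ALPHABET = "ACGT"

def init_site_sampler(seqs, motif_len, rng=None):
    """Pick one random motif start per sequence and tally two count tables.

    Hypothetical helper: returns 0-based starts, per-site counts inside the
    chosen motifs, and counts of nucleotides outside the motifs.
    """
    rng = rng or random.Random()
    # One random start A_i per sequence, chosen so the motif fits.
    starts = [rng.randrange(len(s) - motif_len + 1) for s in seqs]
    # site_counts[k][nuc]: occurrences of nuc at motif site k.
    site_counts = [Counter({n: 0 for n in ALPHABET}) for _ in range(motif_len)]
    # background[nuc]: occurrences of nuc outside the chosen motifs.
    background = Counter({n: 0 for n in ALPHABET})
    for seq, a in zip(seqs, starts):
        for pos, nuc in enumerate(seq):
            if a <= pos < a + motif_len:
                site_counts[pos - a][nuc] += 1
            else:
                background[nuc] += 1
    return starts, site_counts, background

# Three of the sequences from the slide, used only as sample input.
seqs = [
    "TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT",
    "CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG",
    "TCGACCCTCTGAACCTATCAGGGACCACAGTCAGCCAGGCAAG",
]
starts, site_counts, background = init_site_sampler(seqs, 6, random.Random(0))
```

Each site's counts sum to the number of sequences, and the background counts plus the motif positions account for every nucleotide, which gives a quick sanity check on the tallies.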

Xuhua Xia Slide 6 Algorithm details: Predictive update
[Two site-specific count tables (rows A, C, G, T); the values are not preserved in the transcript.]
S11 CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG

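Xuhua Xia Slide 7 Predictive update: Frequencies
Table 7-3. Site-specific distribution of nucleotide frequencies derived from data in Table 7-2, with [parameter symbol and value not preserved in the transcript]. The second column lists the distribution of nucleotide frequencies outside the 28 random motifs.
[Table 7-3: rows A, C, G, T by motif site; the frequency values Q are not preserved in the transcript.]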
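A possible way to turn site-specific counts into frequencies is sketched below. The smoothing formula, the parameter name `alpha`, and its default value are assumptions, because the transcript does not preserve the parameter used for Table 7-3; the per-site counts are invented, and the counts 325, 316, 267, 301 from the initialization slide serve only as illustrative background counts.

```python
def site_frequencies(site_counts, background_counts, alpha=1.0):
    """Convert counts to frequencies with a pseudocount (assumed formula).

    Assumed rule: Q[k][n] = (C[k][n] + alpha * q0[n]) / (N + alpha),
    where q0 is the background frequency and N the number of motifs.
    """
    total_bg = sum(background_counts.values())
    bg_freq = {n: c / total_bg for n, c in background_counts.items()}
    freqs = []
    for counts in site_counts:
        n_motifs = sum(counts.values())
        freqs.append({
            n: (counts[n] + alpha * bg_freq[n]) / (n_motifs + alpha)
            for n in bg_freq
        })
    return bg_freq, freqs

# Illustrative inputs only: two motif sites tallied over 28 motifs.
background_counts = {"A": 325, "C": 316, "G": 267, "T": 301}
site_counts = [
    {"A": 2, "C": 10, "G": 4, "T": 12},
    {"A": 3, "C": 1, "G": 6, "T": 18},
]
bg_freq, freqs = site_frequencies(site_counts, background_counts, alpha=1.0)
```

With this rule every site's frequencies sum to 1, and a nucleotide with a zero count still receives a small nonzero frequency, so a later logarithm stays defined.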

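Xuhua Xia Slide 8 Predictive update: PWM
[Site-specific frequency table Q and the derived position weight matrix (rows A, C, G, T); the values are not preserved in the transcript.]
S11 CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG
Odds ratio for CATGCC = e^[exponent not preserved in the transcript] = 0.153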
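The relation between frequencies, the position weight matrix, and the odds ratio of a candidate motif can be sketched as follows. The sketch assumes the PWM entries are ln(Q_ij/Q_0j), as labeled in the final report of this deck, and that the odds ratio of a k-mer is e raised to the sum of its PWM entries. The function names and the two-site toy frequencies are hypothetical.

```python
import math

def build_pwm(site_freqs, bg_freq):
    """PWM entry for site k and nucleotide n: ln(Q[k][n] / Q0[n])."""
    return [
        {n: math.log(site[n] / bg_freq[n]) for n in bg_freq}
        for site in site_freqs
    ]

def odds_ratio(kmer, pwm):
    """Odds ratio of a k-mer: exp of the summed PWM entries."""
    score = sum(pwm[k][nuc] for k, nuc in enumerate(kmer))
    return math.exp(score)

# Toy two-site motif with a uniform background (illustrative values).
bg_freq = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
site_freqs = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
pwm = build_pwm(site_freqs, bg_freq)
best = odds_ratio("AT", pwm)   # (0.7 / 0.25) ** 2
worst = odds_ratio("CC", pwm)  # (0.1 / 0.25) ** 2
```

An odds ratio above 1 means the k-mer is more probable under the motif model than under the background; a value below 1, such as the 0.153 reported for CATGCC, means the opposite.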

Xuhua Xia Slide 9 Predictive update
S11 CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG
Table 7-4. Possible locations of the 6-mer motif along S11, together with the corresponding motifs and their position weight matrix scores expressed as odds ratios. The last column lists the odds ratios normalized to have a sum of 1.
Site  6-mer   Odds ratio  P_norm
1     CATGCC
2     ATGCCC
3     TGCCCT
4     GCCCTC
5     CCCTCA
6     CCTCAA
7     CTCAAG
8     TCAAGT
9     CAAGTG
10    AAGTGT
11    AGTGTG
35    TCAAGG
[Odds ratio and P_norm values are not preserved in the transcript; only sites 1 to 11 and 35 appear.]
Number of possible sites: 40 – 6 + 1 = 35. The odds ratios are scaled to sum to 1.
Pick the site with the largest odds ratio, update the A_i value, and generate a new frequency matrix.
Randomly pick another sequence and update it in the same way to obtain a new frequency matrix and a new A_i value.
Once all sequences are updated and a new set of A_i values is obtained, compute F [formula not preserved in the transcript].
Update all the sequences again to obtain a new set of A_i values and a new F.
If the new F is greater than the old F, replace the old set of A_i values with the new set.
Repeat until the F value no longer increases or the maximum number of local iterations is reached.
This (from initialization to this slide) completes one global cycle of iteration.
Repeat a number of global cycles until F does not increase.
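The scan of every possible motif location along one sequence can be sketched as below. This is an illustration under stated assumptions: the PWM is a toy matrix built to favor TGGTCA (not the matrix from the slides), positions are 0-based, and the helper names are hypothetical. `choose_start` follows the slide by taking the largest value; passing a random generator instead draws a position in proportion to the normalized odds ratios, which is the usual Gibbs sampling step.

```python
import math
import random

def window_odds(seq, pwm):
    """Odds ratio of every window of len(pwm) along seq."""
    w = len(pwm)
    return [
        math.exp(sum(pwm[k][seq[i + k]] for k in range(w)))
        for i in range(len(seq) - w + 1)
    ]

def normalize(odds):
    """Scale the odds ratios so that they sum to 1."""
    total = sum(odds)
    return [o / total for o in odds]

def choose_start(p_norm, rng=None):
    """Largest value if rng is None, else a draw weighted by p_norm."""
    if rng is None:
        return max(range(len(p_norm)), key=p_norm.__getitem__)
    return rng.choices(range(len(p_norm)), weights=p_norm, k=1)[0]

# Toy PWM (assumption): frequency 0.7 for the favored nucleotide at each
# site, 0.1 for the others, against a uniform background of 0.25.
favored = "TGGTCA"
pwm = [
    {n: math.log((0.7 if n == c else 0.1) / 0.25) for n in "ACGT"}
    for c in favored
]

s11 = "CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG"
odds = window_odds(s11, pwm)
p_norm = normalize(odds)
greedy_start = choose_start(p_norm)                     # largest odds ratio
sampled_start = choose_start(p_norm, random.Random(1))  # weighted draw
```

For the 40-nucleotide S11 and a 6-site matrix the scan yields 40 – 6 + 1 = 35 windows, matching the count on the slide.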

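Xuhua Xia Slide 10 Final Report: Final Frequency
Final site-specific counts: [rows A, C, G, U; values not preserved in the transcript]
Final site-specific frequencies: [rows A, C, G, U; values not preserved in the transcript]
Final PWM [ln(Qij/Q0)]: [rows A, C, G, U; values not preserved in the transcript]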

Xuhua Xia Slide 11 Motif alignment
[The V markers flanking the aligned motif columns and the alignment offsets are not preserved in the transcript.]
Seq
1 UCAGAACCAGUUAUAAAUUUAUCAUUUCCUUCUCCACUCCU
2 CCCACGCAGCCGCCCUCCUCCCCGGUCACUGACUGGUCCUG
3 UCGACCCUCUGAACCUAUCAGGGACCACAGUCAGCCAGGCAAG
4 AAAACACUUGAGGGAGCAGAUAACUGGGCCAACCAUGACUC
5 GGGUGAAUGGUACUGCUGAUUACAACCUCUGGUGCUGC
6 AGCCUAGAGUGAUGACUCCUAUCUGGGUCCCCAGCAGGA
7 GCCUCAGGAUCCAGCACACAUUAUCACAAACUUAGUGUCCA
8 CAUUAUCACAAACUUAGUGUCCAUCCAUCACUGCUGACCCU
9 UCGGAACAAGGCAAAGGCUAUAAAAAAAAUUAAGCAGC
10 GCCCCUUCCCCACACUAUCUCAAUGCAAAUAUCUGUCUGAAACGGUUCC
11 CAUGCCCUCAAGUGUGCAGAUUGGUCACAGCAUUUCAAGG
12 GAUUGGUCACAGCAUUUCAAGGGAGAGACCUCAUUGUAAG
13 UCCCCAACUCCCAACUGACCUUAUCUGUGGGGGAGGCUUUUGA
14 CCUUAUCUGUGGGGGAGGCUUUUGAAAAGUAAUUAGGUUUAGC
15 AUUAUUUUCCUUAUCAGAAGCAGAGAGACAAGCCAUUUCUCUUUCCUCCC
23 GAAAAAAAAUAAAUGAAGUCUGCCUAUCUCCGGGCCAGAGCCCCU
24 UGCCUUGUCUGUUGUAGAUAAUGAAUCUAUCCUCCAGUGACU
25 GGCCAGGCUGAUGGGCCUUAUCUCUUUACCCACCUGGCUGU
26 CAACAGCAGGUCCUACUAUCGCCUCCCUCUAGUCUCUG
27 CCAACCGUUAAUGCUAGAGUUAUCACUUUCUGUUAUCAAGUGGCUUCAGC
28 GGGAGGGUGGGGCCCCUAUCUCUCCUAGACUCUGUG
29 CUUUGUCACUGGAUCUGAUAAGAAACACCACCCCUGC

Xuhua Xia Slide 12 Motif scores
[The Start and PWMS columns are not preserved in the transcript.]
SeqName  Motif
S1       UUAUCA
S2       CGGUCA
S3       CUAUCA
S4       AGAUAA
S5       UGAUUA
S6       CUAUCU
S7       UUAUCA
S8       UUAUCA
S9       CUAUAA
S10      CUAUCU
S11      UGGUCA
S12      UUGUAA
S13      UUAUCU
S14      UUAUCU
S15      UUAUCA
S27      UUAUCA
S28      CUAUCU
S29      UUGUCA

Xuhua Xia Slide 13 Motif sampler output
[N is the number of motifs found in the sequence; each entry is start(motif, score), and the scores are not preserved in the transcript.]
SeqName  N  1             2             3
Seq1     2  10(TTATAA, )  18(TTATCA, )
Seq2     1  22(CGGTCA, )
Seq3     1  14(CTATCA, )
Seq4     0
Seq5     1  16(TGATTA, )
Seq6     1  18(CTATCT, )
Seq7     1  20(TTATCA, )
Seq8     2  2(TTATCA, )   24(CCATCA, )
Seq9     1  17(CTATAA, )
Seq10    3  14(CTATCT, )  28(ATATCT, )  32(CTGTCT, )
Seq11    1  21(TGGTCA, )
Seq12    2  3(TGGTCA, )   33(TTGTAA, )
Seq13    1  20(TTATCT, )
Seq14    1  2(TTATCT, )
Seq15    3  1(TTATTT, )   10(TTATCA, )  36(TTCTCT, )
Seq25    1  17(TTATCT, )
Seq26    1  15(CTATCG, )
Seq27    3  19(TTATCA, )  25(CTTTCT, )  32(TTATCA, )
Seq28    1  15(CTATCT, )
Seq29    2  2(UUGUCA, )   15(TGATAA, )