Guiding Motif Discovery by Iterative Pattern Refinement
Zhiping Wang, Mehmet Dalkilic, Sun Kim
School of Informatics, Indiana University

Outline
Introduction and motivation
Our framework for motif discovery
  1. Initial pattern discovery
  2. Build seed motif
  3. Extract subsequences
  4. Motif discovery
  5. Iterative refinement
Experiment and results
Discussion and future work

Introduction – motifs & their applications
Protein motifs are short patterns conserved in proteins. They are generally important for the function of a protein or for maintaining its structure:
1. Enzyme catalytic sites.
2. Regions involved in binding a molecule (ADP/ATP, DNA, …) or another protein.
3. A fold important for the overall 3D structure.
Applications:
Distinguish protein groups based on such patterns.
Classify a sequenced protein into a specific protein family.

Introduction - motif discovery
PROSITE: patterns are found manually.
Deterministic algorithms (expectation-maximization based): MEME (time-consuming).
Stochastic algorithms (Gibbs sampling, random jumps in the search space): Gibbs Sampler; AlignACE.
Performance varies with the characteristics of the input sequences. For example, all known motifs in the disease resistance genes of Arabidopsis thaliana were found by MEME after the sequences were split into two distinct categories of resistance genes, but no motifs were found when all disease resistance genes were given to MEME as a single input file.

Motivation
Motif discovery is, in a sense, a comparison of two models: a model for the pattern (signal model) and a model for negative examples (noise model).
The input sequences determine the background noise model.
The performance of motif discovery algorithms can be significantly improved by clustering input sequences into smaller groups.
Thus, the motivation for our research is to use subsequences, instead of whole sequences, for motif discovery.
However, it is quite difficult to select the correct subsequence regions without prior knowledge, e.g., genes of the same type. We use an iterative algorithm to solve this problem.

Motivation – an example
Figure 1. Motif logo for the multiple sequence alignment of a family.
Figure 2. Motif logo for conserved subsequences of the protein family.
PS00343 (L-P-x-T-G-[STGAVDE])

Test Data Preparation
1. Download the PROSITE pattern and sequence databases.
2. Parse all true positive sequences for each PROSITE ID and store them as a PROSITE family.
3. All sequences of one family contain the same PROSITE pattern.
4. We used PROSITE families to discover motifs and to test the performance of our framework.
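To illustrate what "contain the same PROSITE pattern" means in practice, here is a minimal sketch that converts a PROSITE pattern such as L-P-x-T-G-[STGAVDE] into a regular expression and checks a sequence against it. The function name prosite_to_regex and the example sequence are hypothetical, and the conversion covers only the common pattern elements (x, [..], {..}, repetition counts, and the < and > anchors); it is not the parser used by the authors.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE pattern into a Python regular expression.

    Simplified sketch: handles x (any residue), [..] (allowed residues),
    {..} (forbidden residues), x(n) / x(n,m) repeats, and < / > anchors.
    """
    parts = []
    for elem in pattern.rstrip(".").split("-"):
        elem = elem.replace("{", "[^").replace("}", "]")  # forbidden residues
        elem = elem.replace("x", ".")                      # any residue
        elem = elem.replace("(", "{").replace(")", "}")    # repeat counts
        elem = elem.replace("<", "^").replace(">", "$")    # terminal anchors
        parts.append(elem)
    return "".join(parts)


# Hypothetical sequence fragment used only for illustration.
regex = prosite_to_regex("L-P-x-T-G-[STGAVDE]")
match = re.search(regex, "AALPETGSS")
```

With this input, regex is "LP.TG[STGAVDE]" and the match covers the fragment "LPETGS".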

Framework Overview 1
STEP 1. Extract a set S of subsequences around a set of motifs M.
STEP 2. Input S to a motif discovery algorithm, producing a new set of motifs M’.
STEP 3. Search the entire sequences for more occurrences of M’, producing M’. Replace M with M’ and go to STEP 1.
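The three steps above can be sketched as a loop. This is only an illustration under stated assumptions: discover and search are hypothetical callables standing in for a motif discovery program and a motif search program, the flank width and the iteration cap are made-up parameters, and "stable" is taken to mean that the set of motif occurrences stops changing.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Hit = Tuple[str, int, int]  # (sequence id, start, end) of one motif occurrence


def extract_subsequences(seqs: Dict[str, str], hits: Iterable[Hit],
                         flank: int) -> List[str]:
    """STEP 1: cut a window of `flank` residues around each occurrence."""
    out = []
    for sid, start, end in hits:
        seq = seqs[sid]
        out.append(seq[max(0, start - flank):min(len(seq), end + flank)])
    return out


def refine(seqs: Dict[str, str], seed_hits: Iterable[Hit],
           discover: Callable[[List[str]], list],
           search: Callable[[list, Dict[str, str]], Iterable[Hit]],
           flank: int = 10, max_iter: int = 5) -> Tuple[list, Set[Hit]]:
    """Iterate STEP 1-3 until the occurrences stop changing (assumed criterion)."""
    hits = set(seed_hits)
    motifs: list = []
    for _ in range(max_iter):
        subseqs = extract_subsequences(seqs, sorted(hits), flank)  # STEP 1
        motifs = discover(subseqs)                                 # STEP 2
        new_hits = set(search(motifs, seqs))                       # STEP 3
        if new_hits == hits:  # stable: nothing new was found
            break
        hits = new_hits
    return motifs, hits
```

Passing the two programs in as callables keeps the loop independent of any particular tool, which matches the framework's claim that different discovery and search algorithms can be plugged in.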

Framework Overview 2

Initial Pattern Discovery - thresholds
Three thresholds for pattern discovery:
1. Pattern length (L = 3; exact patterns longer than 3 do not occur frequently, even in conserved motif regions).
2. Log-odds value of a first-order Markov model relative to a random model (statistically significant patterns occur more frequently than random patterns).
3. Support value (a pattern should be present in a certain number of sequences).
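The three thresholds can be illustrated with a short sketch. The slides do not give the exact log-odds formula, so the version below is an assumption: the Markov model uses transition probabilities P(b | a) estimated from the input sequences, the random model uses the residue frequencies P(b), and the score is the sum of log2(P(b | a) / P(b)) over the pattern's adjacent residue pairs. All function names and default threshold values are hypothetical.

```python
import math
from collections import Counter
from typing import List, Set, Tuple


def background_counts(seqs: List[str]) -> Tuple[Counter, Counter]:
    """Count single residues and adjacent residue pairs in the input."""
    singles = Counter(c for s in seqs for c in s)
    pairs = Counter(s[i:i + 2] for s in seqs for i in range(len(s) - 1))
    return singles, pairs


def log_odds(pattern: str, singles: Counter, pairs: Counter) -> float:
    """Assumed score: first-order Markov model versus residue frequencies.

    Only meaningful for patterns that actually occur in the counted input.
    """
    total = sum(singles.values())
    score = 0.0
    for a, b in zip(pattern, pattern[1:]):
        starts_with_a = sum(n for pair, n in pairs.items() if pair[0] == a)
        p_transition = pairs[a + b] / starts_with_a  # P(b | a)
        p_background = singles[b] / total            # P(b)
        score += math.log2(p_transition / p_background)
    return score


def support(pattern: str, seqs: List[str]) -> int:
    """Number of sequences that contain the pattern at least once."""
    return sum(pattern in s for s in seqs)


def qualified_patterns(seqs: List[str], k: int = 3, min_log_odds: float = 0.0,
                       min_support: int = 2) -> Set[str]:
    """Exact k-mers that pass both the support and the log-odds threshold."""
    singles, pairs = background_counts(seqs)
    candidates = {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}
    return {p for p in candidates
            if support(p, seqs) >= min_support
            and log_odds(p, singles, pairs) >= min_log_odds}
```

For example, qualified_patterns(["ACLGA", "TCLGT", "GCLGG", "ATATA"], min_support=3) keeps only "CLG", the one 3-mer present in three of the four toy sequences.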

Initial Pattern Discovery - algorithm
1. Use the thresholds to scan one set of sequences and find the qualified patterns in each sequence.
2. Rank the sequences by the number of qualified patterns each contains.
3. Save the qualified patterns found in the top half of the ranked sequences, then remove those sequences.
4. Repeat on the remaining half of the sequences (go to step 1) until no more patterns can be found.
The saved patterns are used later.
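The halving procedure above can be sketched as follows. The function find_qualified is a hypothetical callable that applies the thresholds to the current set of sequences, and rounding the "top half" up for odd-sized sets is an assumption made here so that every round removes at least one sequence.

```python
from typing import Callable, List, Set


def iterative_pattern_discovery(
        seqs: List[str],
        find_qualified: Callable[[List[str]], Set[str]]) -> Set[str]:
    """Collect qualified patterns while repeatedly discarding the top half."""
    saved: Set[str] = set()
    remaining = list(seqs)
    while remaining:
        qualified = find_qualified(remaining)            # step 1
        if not qualified:
            break                                        # nothing left to find
        # Step 2: rank sequences by how many qualified patterns they contain.
        ranked = sorted(remaining,
                        key=lambda s: sum(p in s for p in qualified),
                        reverse=True)
        half = (len(ranked) + 1) // 2                    # assumed rounding
        top, remaining = ranked[:half], ranked[half:]
        # Step 3: keep the patterns seen in the top half, drop those sequences.
        saved |= {p for p in qualified if any(p in s for s in top)}
    return saved                                         # step 4 is the loop
```

Because the qualified patterns are recomputed on the shrinking set, patterns that are frequent only among the lower-ranked sequences get a chance to qualify in later rounds.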

Initial Pattern Discovery - example Qualified Patterns (p1, p2, p3)

Build Seed Motif
1. Start from the pattern with maximal support and use it as the seed motif.
2. Calculate the score of each candidate pattern (in sequences not covered by the seed motif) against the seed motif:
   $S_i = \sum_{j=1}^{n} S_{i-j} W_j$
   $S_i$: score of candidate pattern i against the seed motif
   $S_{i-j}$: score of candidate pattern i against the j-th pattern in the seed motif
   $W_j$: the weight (support ratio) of the j-th pattern in the seed motif
3. Add the pattern with the highest score (which must also exceed a score threshold) to the seed motif.
4. Go to step 2, until no more patterns can be added to the seed motif.
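As a small illustration of the scoring formula in step 2, the sketch below computes the weighted score of one candidate. The function name score_to_seed is hypothetical, and each weight is taken to be the pattern's support divided by the total support of the seed motif, which is how the worked example on the following slides computes it.

```python
from typing import Sequence


def score_to_seed(pair_scores: Sequence[float],
                  supports: Sequence[int]) -> float:
    """Weighted sum of pairwise scores, with W_j = support_j / total support.

    pair_scores[j] is the score of the candidate against the j-th seed
    pattern; supports[j] is that seed pattern's support.
    """
    total = sum(supports)
    return sum(s * w for s, w in zip(pair_scores, supports)) / total


# Pairwise scores 10 and 4 against seed patterns with supports 4 and 2.
example = score_to_seed([10, 4], [4, 2])
```

Here example evaluates to 8.0, that is 10 · 4/6 + 4 · 2/6.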

Build Seed Motif - example
Calculate pattern scores (threshold = 5). Supports assume no shared sequences.
Pattern  Sequence  Support  Weight     Score to motif
P1       CLG       4        $W_1 = 1$  -
P2       CLN       2        -          13
P3       ALG       2        -          10
P4       ALN       2        -          4
$S_{2-1} = 13$; $S_2 = S_{2-1} W_1 = 13$
P1: C L G
P2: C L N

Build Seed Motif - example
Calculate pattern scores (threshold = 5). Supports assume no shared sequences.
Pattern  Sequence  Support  Weight             Score to motif
P1       CLG       4        $W_1 = 4 / (4+2)$  -
P2       CLN       2        $W_2 = 2 / (4+2)$  -
P3       ALG       2        -                  8
P4       ALN       2        -                  6
$S_{3-1} = 10$, $S_{3-2} = 4$; $S_3 = S_{3-1} W_1 + S_{3-2} W_2 = 8 > 5$
$S_{4-1} = 4$, $S_{4-2} = 10$; $S_4 = S_{4-1} W_1 + S_{4-2} W_2 = 6 > 5$

Build Seed Motif - example
Calculate pattern scores (threshold = 5). Supports assume no shared sequences.
Pattern  Sequence  Support  Weight         Score to motif
P1       CLG       4        $W_1 = 4 / 8$  -
P2       CLN       2        $W_2 = 2 / 8$  -
P3       ALG       2        $W_3 = 2 / 8$  -
P4       ALN       2        -              6.5
$S_{4-1} = 4$, $S_{4-2} = 10$, $S_{4-3} = 8$
$S_4 = S_{4-1} W_1 + S_{4-2} W_2 + S_{4-3} W_3 = 6.5 > 5$
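The worked example can be replayed with a short greedy sketch. The supports and pairwise scores are the ones listed on the example slides; the function name build_seed_motif and the data layout are hypothetical, and the weights are computed as support divided by the total support of the current seed motif. With those inputs, the weighted sum for P4 in the last round evaluates to 4 · 4/8 + 10 · 2/8 + 8 · 2/8 = 6.5.

```python
from typing import Dict, FrozenSet, List, Tuple


def build_seed_motif(supports: Dict[str, int],
                     pair_score: Dict[FrozenSet[str], float],
                     threshold: float) -> Tuple[List[str], List[Tuple[str, float]]]:
    """Greedily grow a seed motif from the pattern with maximal support."""
    seed = [max(supports, key=supports.get)]
    trace = []  # (pattern added, its score against the seed motif)
    candidates = set(supports) - set(seed)
    while candidates:
        total = sum(supports[p] for p in seed)
        # Score of each candidate: weighted sum over the current seed patterns.
        scores = {
            c: sum(pair_score[frozenset((c, p))] * supports[p] for p in seed) / total
            for c in sorted(candidates)
        }
        best = max(scores, key=scores.get)
        if scores[best] <= threshold:
            break  # no remaining candidate is similar enough
        seed.append(best)
        trace.append((best, scores[best]))
        candidates.remove(best)
    return seed, trace


# Values taken from the example slides (supports assume no shared sequences).
supports = {"P1": 4, "P2": 2, "P3": 2, "P4": 2}
pair_score = {
    frozenset(("P1", "P2")): 13, frozenset(("P1", "P3")): 10,
    frozenset(("P1", "P4")): 4, frozenset(("P2", "P3")): 4,
    frozenset(("P2", "P4")): 10, frozenset(("P3", "P4")): 8,
}
seed, trace = build_seed_motif(supports, pair_score, threshold=5)
```

The patterns are added in the order P2, P3, P4 with scores 13, 8 and 6.5, each above the threshold of 5.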

Build Seed Motif

Extract Subsequences

Motifs Discovery MEME

Iterative refinement
Flowchart: sub-sequences → motif discovery (MEME) → motif → search of the entire protein family (MAST) → sub-sequences → Stable? If no, repeat motif discovery; if yes, stop.

Iterative refinement

Experiment
1. We used 108 PROSITE families as test data.
2. We ran MEME directly on these families and took the best motif for each of them.
3. We ran our framework and took the best motif for each family. (Because of time constraints, our framework performed only a single iteration.)
4. We compared the results.

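Performance
The result of the comparison, by group of PROSITE patterns:
patterns (PS00010, PS00011, …)
23 patterns (PS00014, PS00033, …)
15 patterns (PS00019, PS00035, …)
7 patterns (PS01345, PS01286, …)
(Table: one row for the framework and one for MEME, one column per group; each method is marked × in two of the four groups.)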

Performance The result of the comparison.

Discussion
To make our experiment more rigorous, we chose only the top motif reported by both MEME and our framework.
Among the 22 failed cases, our framework did discover 21 of the motifs, though not as the top-ranked motif.
One flaw: local optima.
This framework is general enough to include any motif discovery and search algorithms that report multiple motifs with a statistical score.

Future Work
On the theoretical side, we are interested in formalizing and understanding the role of noise.
How likely is it that the subsequences induced by our initial pattern discovery algorithm include true motifs?
Is convergence to the true motif regions guaranteed once the initial set of subsequences contains true motifs?
For the empirical study, we plan to perform multiple iterations using the whole PROSITE pattern set, and to embed different motif discovery and search programs into our framework.