Motif Discovery in Protein Sequences using Messy De Bruijn Graph Mehmet Dalkilic and Rupali Patwardhan.

Goal
The goal of this project is to develop an algorithm that takes advantage of the properties of De Bruijn graphs to discover motifs in protein sequences.

Outline of Presentation
- Motivation and Background
- Approach
- Implementation
- Applications
- Future Work

Motivation
Most popular motif discovery algorithms in current use rely on statistical significance to find motifs. This project explores computational and graph-theoretic ways of doing the same thing without using statistical significance. Such an approach could drastically reduce the time required to search for motifs.

What is a De Bruijn Graph?
- A De Bruijn graph is a graph whose nodes are sequences of symbols from some alphabet and whose edges indicate which sequences overlap.
- The parameters are node length (n) and overlap (k).
- So if n=4 and k=3, an edge ACAT → CATS represents the sequence 'ACATS'.

Example
If we have a sequence ABCDEFG and take node length = 4 and overlap = 3, we can represent the same sequence by the following De Bruijn graph.

[Figure: ABCDEFG as a De Bruijn graph: ABCD → BCDE → CDEF → DEFG (node length = 4, overlap = 3)]
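The decomposition of ABCDEFG into overlapping nodes can be sketched in a few lines of Python. This is only an illustration of the idea on the slide, not the authors' implementation; the function name `de_bruijn_path` and its parameter names are assumptions made for this sketch.

```python
def de_bruijn_path(sequence, node_length=4, overlap=3):
    """Split one sequence into overlapping nodes and the edges joining them.

    Hypothetical helper for illustration; not taken from the original project.
    """
    if not 0 <= overlap < node_length:
        raise ValueError("overlap must be smaller than node_length")
    step = node_length - overlap  # how far each successive node is shifted
    nodes = [sequence[i:i + node_length]
             for i in range(0, len(sequence) - node_length + 1, step)]
    # Each edge joins two consecutive nodes that share `overlap` symbols.
    edges = list(zip(nodes, nodes[1:]))
    return nodes, edges


nodes, edges = de_bruijn_path("ABCDEFG", node_length=4, overlap=3)
print(nodes)  # ['ABCD', 'BCDE', 'CDEF', 'DEFG']
print(edges)
```

With node length 4 and overlap 3, each edge spells out five symbols, which matches the ACAT → CATS example given earlier.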

Applying this to Identify Repeating Sub-sequences
Given a set of sequences, we keep adding the corresponding nodes and edges to our De Bruijn graph. If any sub-sequence is repeated, the corresponding edge will already be present in the graph, so we just increment the weight of that edge. Eventually the edges corresponding to highly repeated sequences will have higher weights. We can then find the motif by following the graph along the edges whose weights are above a specified threshold.

Example
Sequence 1: PAKARCDEKD
Sequence 2: ARCDEKHKH
Constructing the De Bruijn graph for these sequences …

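[Figure: De Bruijn graph for the two sequences, with 4-residue nodes.
PAKARCDEKD: PAKA → AKAR → KARC → ARCD → RCDE → CDEK → DEKD
ARCDEKHKH: ARCD → RCDE → CDEK → DEKH → EKHK → KHKH
The edges ARCD → RCDE and RCDE → CDEK are shared by both sequences.]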
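The weighting and path-following steps can be sketched for these two example sequences as follows. This is a minimal illustration under stated assumptions, not the authors' code: the names `build_weighted_graph` and `heavy_paths` are hypothetical, "above a threshold" is read as strictly greater, and when a node has several heavy outgoing edges the sketch simply follows the heaviest one.

```python
from collections import Counter


def build_weighted_graph(sequences, node_length=4, overlap=3):
    """Count how often each edge (pair of overlapping nodes) occurs."""
    step = node_length - overlap
    weights = Counter()
    for seq in sequences:
        nodes = [seq[i:i + node_length]
                 for i in range(0, len(seq) - node_length + 1, step)]
        for edge in zip(nodes, nodes[1:]):
            weights[edge] += 1  # repeated sub-sequence: bump the existing edge
    return weights


def heavy_paths(weights, threshold, overlap=3):
    """Chain edges whose weight exceeds `threshold` into candidate motifs.

    Assumption of this sketch: keep only the heaviest heavy successor per node.
    Paths that form a pure cycle have no start node and are skipped.
    """
    succ = {}
    for (src, dst), w in weights.items():
        if w > threshold and w > succ.get(src, (None, 0))[1]:
            succ[src] = (dst, w)
    targets = {dst for dst, _ in succ.values()}
    motifs = []
    for node in (s for s in succ if s not in targets):
        text, seen = node, {node}
        while node in succ and succ[node][0] not in seen:
            node = succ[node][0]
            seen.add(node)
            text += node[overlap:]  # append only the non-overlapping suffix
        motifs.append(text)
    return motifs


weights = build_weighted_graph(["PAKARCDEKD", "ARCDEKHKH"])
print(heavy_paths(weights, threshold=1))  # ['ARCDEK']
```

Only the edges ARCD → RCDE and RCDE → CDEK reach weight 2 here, so the path spells out ARCDEK, the sub-sequence common to both inputs.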

Making them Messy
In the context of protein sequences, some amino acid residues can be substituted without affecting the function of the protein. So a sequence can be considered 'similar' to an edge even though it is not exactly the same. Similarity is determined using a standard scoring matrix, such as BLOSUM62. In that case, we increment the weights of all edges that represent sequences 'similar' to the one in question.

Example
Consider the same two sequences as before, but with K replaced by R in one of them.
PAKARCDERD
ARCDEKHKH
As per BLOSUM62, K and R have a positive substitution score.

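[Figure: messy De Bruijn graph for the two sequences, with 4-residue nodes.
PAKARCDERD: PAKA → AKAR → KARC → ARCD → RCDE → CDER → DERD
ARCDEKHKH: ARCD → RCDE → CDEK → DEKH → EKHK → KHKH]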
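One way the 'messy' update could work on this example is sketched below. The slides do not state the exact similarity rule, so the rule used here is an assumption: two edge sequences count as similar when every aligned pair of residues is identical or has a positive substitution score. The `POSITIVE_PAIRS` set is a tiny hand-picked excerpt (K/R and D/E both score positive in BLOSUM62), not the full matrix, and all function names are hypothetical.

```python
# Small excerpt of residue pairs with a positive BLOSUM62 score (assumption:
# a real implementation would consult the whole matrix).
POSITIVE_PAIRS = {frozenset("KR"), frozenset("DE")}


def edge_text(edge, overlap=3):
    """Sequence spelled out by an edge, e.g. ('RCDE', 'CDEK') -> 'RCDEK'."""
    src, dst = edge
    return src + dst[overlap:]


def similar(s, t):
    """Assumed rule: same length, each position identical or a positive pair."""
    return len(s) == len(t) and all(
        a == b or frozenset((a, b)) in POSITIVE_PAIRS for a, b in zip(s, t))


def add_sequence_messy(weights, seq, node_length=4, overlap=3):
    """Add one sequence, incrementing every edge similar to each new edge."""
    step = node_length - overlap
    nodes = [seq[i:i + node_length]
             for i in range(0, len(seq) - node_length + 1, step)]
    for edge in zip(nodes, nodes[1:]):
        weights.setdefault(edge, 0)  # make sure the exact edge exists
        text = edge_text(edge, overlap)
        for other in weights:  # includes `edge` itself
            if similar(text, edge_text(other, overlap)):
                weights[other] += 1


weights = {}
for seq in ["PAKARCDERD", "ARCDEKHKH"]:
    add_sequence_messy(weights, seq)
print(weights[("RCDE", "CDER")])  # 2: also credited by the similar RCDEK
```

In this sketch the edge RCDE → CDER from the first sequence is incremented again when RCDEK arrives from the second, because K and R are treated as interchangeable.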

Another Example
> Sequence 1
DMLKLCDKADDKMNDRLDDYLKLDD
> Sequence 2
EAKDKFDFKDFKLCDKADDARTYVH
> Sequence 3
GTYYYCPGHKLCDEADDFFHVDDTE
> Sequence 4
LKLCDKANDYRPYYPITDPLMMNHI
> Sequence 5
GTYKPGHKLCDEADDFFHENDTEKYC
> Sequence 6
KLCDKADDYRPYYPITDPLGATAKHI

Sample output …
patward/L519/project/ex1.html
atward/L519/project/ttt.gif

Results
When 41 sequences belonging to the PS00021 family were given as input, the best motif output was YCRNPD. The Prosite regular expression for this family is [FY]-C-R-N-P-[DNR].
patward/L519/project/PS00021_op.html
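That the reported motif is consistent with the Prosite pattern can be checked mechanically. The sketch below converts the pattern to a Python regular expression; the helper `prosite_to_regex` is hypothetical and handles only the elements that occur here (single residues, bracketed alternatives, and `x` for any residue), not the full Prosite syntax.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple Prosite pattern into a Python regex string.

    Sketch only: repeats such as x(2,4), exclusions {..} and anchors
    are deliberately not supported.
    """
    return "".join("." if element == "x" else element
                   for element in pattern.split("-"))


regex = prosite_to_regex("[FY]-C-R-N-P-[DNR]")
print(regex)                                 # [FY]CRNP[DNR]
print(bool(re.fullmatch(regex, "YCRNPD")))   # True
```

The motif YCRNPD is one of the six strings the pattern accepts, so the output matches a single concrete instance of the family signature rather than the full pattern.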

Possible Applications
- To predict whether a given protein sequence is likely to belong to a particular protein family.
- To construct regular expressions for protein families.
- To fine-tune the results of clustering algorithms, by helping to decide whether to merge two clusters.
- To preprocess input in order to improve the performance of other motif discovery algorithms.

Limitation of this Approach
The motif should have at least 3 contiguous amino acid residues. So the program runs into trouble if the motif consists of alternating residues, for example something like AxAxCxDxAxGxC (where x could be any residue). The problem is due to the need for overlaps, which is inherent to De Bruijn graphs.
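The limitation can be seen on a small made-up case. The two sequences below are invented for illustration (they are not from the project's data); both fit the alternating pattern AxAxCxDxAxGxC, yet they have no contiguous 3-residue window in common, so no edge weight would ever be incremented in an exact-match graph.

```python
import re


def kmers(seq, k=3):
    """Set of all contiguous windows of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


# Hypothetical sequences that both match the alternating motif.
GAPPED_MOTIF = re.compile("A.A.C.D.A.G.C")
seq_a = "AQAWCEDRAYGTC"
seq_b = "ALAKCMDNAHGSC"

print(bool(GAPPED_MOTIF.fullmatch(seq_a)),
      bool(GAPPED_MOTIF.fullmatch(seq_b)))   # True True
print(kmers(seq_a) & kmers(seq_b))           # set()
```

Because every conserved residue is separated by a variable one, the shared signal never forms a contiguous window that a node or edge could capture.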

Future Work
We would like to integrate a machine-learning aspect to dynamically change the node length and other parameters to find the optimal motif. We also want to try to extend this approach to do clustering itself.

Link to the Implementation
atward/L519/project.html

Acknowledgement
I would like to thank Dr. Mehmet Dalkilic for his ideas and support.