Motif Discovery in Protein Sequences using Messy De Bruijn Graph Mehmet Dalkilic and Rupali Patwardhan.


1 Motif Discovery in Protein Sequences using Messy De Bruijn Graph Mehmet Dalkilic and Rupali Patwardhan

2 Goal The goal of this project is to develop an algorithm that can take advantage of the properties of De Bruijn graphs for discovering motifs in protein sequences.

3 Outline of Presentation: Motivation and Background, Approach, Implementation, Applications, Future Work

4 Motivation Most of the popular motif discovery algorithms in use right now depend on statistical significance to find the motif. This project explores computational and graph-theoretic ways of doing the same thing without using statistical significance. Such an approach could drastically reduce the time required to search for motifs.

5 What is a De Bruijn Graph? A De Bruijn graph is a graph whose nodes are sequences of symbols from some alphabet and whose edges indicate which sequences overlap. The parameters are node length (n) and overlap (k). So if n=4 and k=3, an edge ACAT → CATS represents the sequence 'ACATS'.

6 Example If we have a sequence ABCDEFG and we take node length = 4 and overlap = 3, we can represent this same sequence by the following De Bruijn graph:

7 [Figure: De Bruijn graph for ABCDEFG with node length = 4 and overlap = 3: ABCD → BCDE → CDEF → DEFG]
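The construction on slides 5 to 7 can be sketched in a few lines of Python. This is only an illustration under the stated parameters; the function names `de_bruijn_path` and `edge_string` are hypothetical and do not come from the authors' implementation.

```python
def de_bruijn_path(sequence, node_length=4, overlap=3):
    """Split a sequence into overlapping nodes and the edges linking them."""
    step = node_length - overlap  # how far each node is shifted from the last
    nodes = [sequence[i:i + node_length]
             for i in range(0, len(sequence) - node_length + 1, step)]
    edges = list(zip(nodes, nodes[1:]))  # consecutive nodes share `overlap` symbols
    return nodes, edges


def edge_string(edge, overlap=3):
    """Spell out the sequence an edge represents, e.g. ACAT -> CATS gives ACATS."""
    left, right = edge
    return left + right[overlap:]


nodes, edges = de_bruijn_path("ABCDEFG")
print(nodes)                           # ['ABCD', 'BCDE', 'CDEF', 'DEFG']
print(edge_string(("ACAT", "CATS")))   # ACATS
```

With node length 4 and overlap 3, each node is shifted by one symbol, so a sequence of length L yields L - 3 nodes and L - 4 edges.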

8 Applying this to Identify Repeating Sub-sequences If we have a set of sequences, we can go on adding the corresponding nodes and edges to our De Bruijn graph. If any sub-sequence is repeated, the corresponding edge will already be present in the graph, so we just increment the weight of that edge. Eventually the edges corresponding to highly repeated sequences will have higher weights. Now we can find the motif by simply following the graph along the edges with weights above a specified threshold.

9 Example Sequence 1: PAKARCDEKD Sequence 2: ARCDEKHKH Constructing the De Bruijn graph for these sequences …

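10 [Figure: weighted De Bruijn graph for PAKARCDEKD and ARCDEKHKH. Nodes: PAKA, AKAR, KARC, ARCD, RCDE, CDEK, DEKD, DEKH, EKHK, KHKH. The two edges shared by both sequences (ARCD → RCDE → CDEK) have weight 2; all other edges have weight 1.]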
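The weighting and edge-following steps from slides 8 to 10 might look roughly like the sketch below. It assumes node length 4 with overlap 3, treats "above a threshold" as "at least the threshold", and assumes each node has at most one heavy outgoing edge. The names `edge_weights` and `follow_heavy_edges` are hypothetical, not the authors' code.

```python
from collections import Counter


def edge_weights(sequences, node_length=4):
    """Count how often each edge (pair of consecutive nodes) occurs."""
    weights = Counter()
    for seq in sequences:
        nodes = [seq[i:i + node_length]
                 for i in range(len(seq) - node_length + 1)]
        for left, right in zip(nodes, nodes[1:]):
            weights[(left, right)] += 1  # a repeated edge just gains weight
    return weights


def follow_heavy_edges(weights, threshold):
    """Walk chains of edges whose weight is at least `threshold`."""
    # Simplification: keep one heavy successor per node.
    heavy = {a: b for (a, b), w in weights.items() if w >= threshold}
    targets = set(heavy.values())
    motifs = []
    for start in heavy:
        if start in targets:
            continue  # not the beginning of a chain
        motif, node, seen = start, start, {start}
        while node in heavy and heavy[node] not in seen:
            node = heavy[node]
            seen.add(node)
            motif += node[-1]  # each step adds one new residue
        motifs.append(motif)
    return motifs


weights = edge_weights(["PAKARCDEKD", "ARCDEKHKH"])
print(follow_heavy_edges(weights, threshold=2))  # ['ARCDEK']
```

For the two example sequences, only ARCD → RCDE and RCDE → CDEK reach weight 2, so the walk spells out ARCDEK, the sub-sequence the two inputs have in common.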

11 Making them Messy In the context of protein sequences, some amino acid residues can be substituted without affecting the function of the protein. So a sequence could be considered 'similar' to an edge even though it is not exactly the same. Similarity is determined in the context of a standard scoring matrix, such as BLOSUM62. In that case, we increment the weights of all edges that represent sequences 'similar' to the one in question.

12 Example Consider the same 2 sequences as before, but with K replaced by R in one of them: PAKARCDERD and ARCDEKHKH. As per BLOSUM62, K and R have a positive substitution score.

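13 [Figure: messy De Bruijn graph for PAKARCDERD and ARCDEKHKH. Nodes: PAKA, AKAR, KARC, ARCD, RCDE, CDER, CDEK, DERD, DEKH, EKHK, KHKH. Edge weights shown: 1, 1, 1, 2, 1.75, 1 and 1, 1, 1, 1.]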
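One way the "messy" update could work is sketched below. Several things here are assumptions rather than facts from the slides: the similarity test (every mismatched position must be a pair with a positive BLOSUM62 score), the fractional increment `partial=0.75` for similar edges, and the tiny `POSITIVE_PAIRS` excerpt, which lists only K/R and D/E (both score +2 in BLOSUM62) instead of the full matrix. Edges are stored as the 5-residue strings they spell out.

```python
# Excerpt only: two amino acid pairs with a positive BLOSUM62 score.
POSITIVE_PAIRS = {frozenset("KR"), frozenset("DE")}


def similar(a, b):
    """True if a and b differ, but only at positively scoring positions."""
    return len(a) == len(b) and a != b and all(
        x == y or frozenset((x, y)) in POSITIVE_PAIRS for x, y in zip(a, b))


def messy_edge_weights(sequences, node_length=4, partial=0.75):
    """Weight edges, giving `partial` credit to existing similar edges."""
    edge_len = node_length + 1  # an edge spans two nodes overlapping by n - 1
    weights = {}
    for seq in sequences:
        for i in range(len(seq) - edge_len + 1):
            edge = seq[i:i + edge_len]
            for other in weights:
                if similar(edge, other):
                    weights[other] += partial  # assumed fractional increment
            weights[edge] = weights.get(edge, 0) + 1  # exact occurrence
    return weights


weights = messy_edge_weights(["PAKARCDERD", "ARCDEKHKH"])
print(weights["ARCDE"], weights["RCDER"], weights["RCDEK"])  # 2 1.75 1
```

Under these assumptions the edge RCDE → CDER ends at 1.75, the same value that appears on slide 13, although the slides do not state how that number was computed. Note that the result depends on the order in which sequences are processed.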

14 Another Example
> Sequence 1
DMLKLCDKADDKMNDRLDDYLKLDD
> Sequence 2
EAKDKFDFKDFKLCDKADDARTYVH
> Sequence 3
GTYYYCPGHKLCDEADDFFHVDDTE
> Sequence 4
LKLCDKANDYRPYYPITDPLMMNHI
> Sequence 5
GTYKPGHKLCDEADDFFHENDTEKYC
> Sequence 6
KLCDKADDYRPYYPITDPLGATAKHI

15 Another Example
> Sequence 1
DMLKLCDKADDKMNDRLDDYLKLDD
> Sequence 2
EAKDKFDFKDFKLCDKADDARTYVH
> Sequence 3
GTYYYCPGHKLCDEADDFFHVDDTE
> Sequence 4
LKLCDKANDYRPYYPITDPLMMNHI
> Sequence 5
GTYKPGHKLCDEADDFFHENDTEKYC
> Sequence 6
KLCDKADDYRPYYPITDPLGATAKHI

16 Sample output … http://biokdd.informatics.indiana.edu/rpatward/L519/project/ex1.html http://biokdd.informatics.indiana.edu/rpatward/L519/project/ttt.gif

17 Results When 41 sequences belonging to the PS00021 family were given as input, the best motif output was YCRNPD. The Prosite regular expression for this family is [FY]-C-R-N-P-[DNR]. http://biokdd.informatics.indiana.edu/rpatward/L519/project/PS00021_op.html
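To check a discovered motif against a Prosite pattern such as the one above, the pattern can be translated into a standard regular expression. The sketch below handles only the basic Prosite elements (`x`, `[...]`, `{...}` and repeat counts in parentheses) and ignores anchors; `prosite_to_regex` is a hypothetical helper, not part of the authors' tool.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic Prosite pattern into Python regex syntax."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("x", ".")                       # any residue
        element = element.replace("{", "[^").replace("}", "]")    # excluded residues
        element = element.replace("(", "{").replace(")", "}")     # repeat counts
        parts.append(element)
    return "".join(parts)


regex = prosite_to_regex("[FY]-C-R-N-P-[DNR]")
print(regex)                                   # [FY]CRNP[DNR]
print(bool(re.fullmatch(regex, "YCRNPD")))     # True
```

The reported motif YCRNPD is one of the six strings the Prosite pattern accepts, which is what the match confirms.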

18 Possible Applications To predict whether a given protein sequence is likely to belong to a particular protein family. To construct regular expressions for protein families. To fine-tune the results of clustering algorithms, by helping to decide whether to merge two clusters. To do preprocessing that improves the performance of other motif discovery algorithms.

19 Limitation of this Approach The motif should have at least 3 contiguous amino acid residues. So the program runs into trouble if the motif consists of alternate residues, for example something like AxAxCxDxAxGxC (where x could be any residue). The problem is due to the need for overlaps, which is inherent to De Bruijn graphs.
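A small experiment illustrates the limitation. The sketch assumes node length 4 and overlap 3, so each edge spells out 5 contiguous residues, and uses made-up toy sequences; `shared_edges` is a hypothetical helper.

```python
from collections import Counter


def shared_edges(sequences, node_length=4):
    """Return edge strings that occur in more than one sequence."""
    edge_len = node_length + 1
    counts = Counter()
    for seq in sequences:
        # Use a set so each sequence contributes an edge at most once.
        counts.update({seq[i:i + edge_len]
                       for i in range(len(seq) - edge_len + 1)})
    return [edge for edge, count in counts.items() if count > 1]


gapped = ["AGAHCKDL", "ATAPCWDM"]        # both follow the pattern AxAxCxDx
contiguous = ["GGAACDEHH", "TTAACDEPP"]  # both contain the run AACDE

print(shared_edges(gapped))      # []
print(shared_edges(contiguous))  # ['AACDE']
```

The gapped pair shares a clear pattern, yet no edge is repeated, because the variable positions break every window of contiguous residues.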

20 Future Work We would like to integrate a machine-learning aspect to dynamically change the node length and other parameters to find the optimal motif. We also want to try to extend this approach to do clustering itself.

21 Link to the Implementation http://biokdd.informatics.indiana.edu/rpatward/L519/project.html

22 Acknowledgement I would like to thank Dr. Mehmet Dalkilic for his ideas and support.

