Motif Discovery in Protein Sequences using Messy de Bruijn Graph

Motif Discovery in Protein Sequences using Messy de Bruijn Graph
Rupali Patwardhan Advisors: Dr. Mehmet Dalkilic Dr. Haixu Tang April 27, 2005 Rupali Patwardhan, Capstone Presentation

Outline of Presentation
Goal Background and Motivation Approach Results Future Work April 27, 2005 Rupali Patwardhan, Capstone Presentation

Rupali Patwardhan, Capstone Presentation
Goal To develop an algorithm that can take advantage of the properties of de Bruijn graph to discover motifs in protein sequences April 27, 2005 Rupali Patwardhan, Capstone Presentation

What is a motif ? A repeating pattern VSKLIPKNRLMISTEWRSLGQQSPGWMHYMP VMLPKDIAKLVPKTHLMSTEWRNRLGVQQSQG SGVPRLLTASREWRNLGEPFIDQIHYSPRYAD YRHVMLPKAMSTEWRSLGLKNPETGTLRILQE GLGITQSLGWSREWRHTLGEPHILLFKREKDYQ A sub sequence that is repeated in a given set of sequences. Motifs need not be continuous, also they might not be exact repeats and a multiple sequence alignment does not always yield the motif, might not even be of fixed length, need heuristic approaches. April 27, 2005 Rupali Patwardhan, Capstone Presentation

Why are motifs interesting ?
They represent regions that have been conserved through evolution So those regions are likely to be important for the function of the protein (e.g. an active site) Motifs can be used to classify proteins into families based on their functions, or predict the function of a new protein Interesting to biologists April 27, 2005 Rupali Patwardhan, Capstone Presentation

Zinc-containing alcohol dehydrogenases signature G-H-E-x(2)-G-x(5)-[GA]-x(2)-[IVSAC] H is a zinc ligand April 27, 2005 Rupali Patwardhan, Capstone Presentation

Motif Discovery Algorithms
There are two main categories Stochastic Algorithms Based on Statistical Significance e.g. MEME, GIBBS Combinatorial Algorithms Based on Enumeration e.g. PRATT, SPLASH April 27, 2005 Rupali Patwardhan, Capstone Presentation

Then why one more ? Existing algorithms Are too slow or computationally expensive for massive inputs (e.g. MEME) Do not handle gapped motifs effectively Need the length/number of the motifs to be specified in advance April 27, 2005 Rupali Patwardhan, Capstone Presentation

What is a de Bruijn Graph?
A graph whose nodes are subsequences of same length (l- tuples) and whose edges indicate the subsequences of the two connected nodes overlap E.g. An edge ACAT  CATS represents the sequence “ACATS” April 27, 2005 Rupali Patwardhan, Capstone Presentation

ABCDEFG ABCD BCDE CDEF DEFG April 27, 2005 Rupali Patwardhan, Capstone Presentation

Applying this to Identify Repeating Subsequences
If we have a set of sequences, we can go on adding corresponding nodes and edges to our de Bruijn graph. If any sub-sequence is repeated, the corresponding edge will already be present in that graph. So we just increment the weight of that edge. Eventually the edges corresponding to highly repeated sequences will have higher weights. Now we can find the motif by simply following the graph along these edges with weights above a specified threshold . Change bunch to something more formal. April 27, 2005 Rupali Patwardhan, Capstone Presentation

1. PAKARCDEKD PAKA AKAR KARC ARCD 1 1 1 1 1 CDEK 1 RCDE DEKD April 27, 2005 Rupali Patwardhan, Capstone Presentation

1. PAKARCDEKD 2. NARCDEKHKH NARC 1 PAKA AKAR KARC ARCD 1 1 1 2 1 CDEK 2 RCDE DEKD 1 DEKH KHKH 1 EKHK 1 April 27, 2005 Rupali Patwardhan, Capstone Presentation

Making them Messy In the context of protein sequences, some amino acid residues can be substituted by some others without affecting the function of the protein. So a sequence could be considered 'similar' to an edge even though its not identical. Similarity between amino acid residues is determined using standard scoring matrices, such as BLOSUM62. In that case, we increment weights of all edges that represent sequences that are ‘similar’ to the one in question. April 27, 2005 Rupali Patwardhan, Capstone Presentation

Example Consider the same 2 sequences as before, but with K replaced by R in one of them. PAKARCDERD NARCDEKHKH As per BLOSUM62, K  R substitution has a positive substitution score. April 27, 2005 Rupali Patwardhan, Capstone Presentation

PAKARCDERD NARCDEKHKH NARC 1 PAKA AKAR KARC ARCD 1 1 1 2 1 CDER 1 RCDE DERD 1 KHKH CDEK EKHK 1 1 DEKH 1 April 27, 2005 Rupali Patwardhan, Capstone Presentation

PAKARCDERD NARCDEKHKH NARC 1 PAKA AKAR KARC ARCD 1 1 1 2 1 CDER 1.4 RCDE DERD 1.4 KHKH CDEK EKHK 1 1 DEKH 1 April 27, 2005 Rupali Patwardhan, Capstone Presentation

Adjusting the weights to account for messiness
Suppose edge A is under consideration, and edges B and C originating from the same node as A are similar to A. WA’  WA + WB*s(A,B) + WC*s(A,C) April 27, 2005 Rupali Patwardhan, Capstone Presentation

Limitation of this Approach
The motif should have at least a few continuous amino acid residues So the method may fail if the motif consists of alternate residues E.g. AxAxCxDxAxGxC (x could be any residue) or AxCDxGxRGxC, since these motifs would not lead to high-weight edges in the de Bruijn graph The problem is due to the need for overlaps, which is inherent nature of de Bruijn Graphs April 27, 2005 Rupali Patwardhan, Capstone Presentation

Gapped Version For each node, we also create nodes obtained by applying a gap mask (or “Dont care” mask) on that node We currently restrict the maximal number of “Dont cares” in a node to 2 There are 10 such masks April 27, 2005 Rupali Patwardhan, Capstone Presentation

Gapped Version Let ‘1’ represent a conserved amino acid and ‘0’ represent a gap or “Don’t care” Then the 10 masks can be represented as: 1111, 0111, 1110, 1011, 1101, 1100, 0011, 1001, 0110, 1010, 0101 April 27, 2005 Rupali Patwardhan, Capstone Presentation

Masking Example If ANCD is the node that we are applying the mask to ANCD * 1001 = AxxD ANCD * 1101 = ANxD ANCD * 1011 = AxCD April 27, 2005 Rupali Patwardhan, Capstone Presentation

1. ….ARCDM… 2. ….ANCDE… 3. ….ASCDT… ARCD RCDM 1 ANCD NCDE 1 ASCD SCDT 1 April 27, 2005 Rupali Patwardhan, Capstone Presentation

1. ….ARCDM… 2. ….ANCDE… 3. ….ASCDT… AxCD xCDx 1 AxCD xCDx 1 AxCD xCDx 1 April 27, 2005 Rupali Patwardhan, Capstone Presentation

1. ….ARCDM… 2. ….ANCDE… 3. ….ASCDT… AxCD xCDx AxCD xCDx 3 AxCD xCDx April 27, 2005 Rupali Patwardhan, Capstone Presentation

AxCDxxGH ANCD NCDE CDEF DEFG EFGH AxCD NxDE CxEF DxFG ExGH ANxD NCxE
CDxF DExG EFxH ANxx NCxx CDxx DExx EFxx xxCD xxDE xxEF xxFG xxGH AxxD NxxE CxxF DxxG ExxH xNCx xCDx xDEx xEFx xFGx AxCx NxDx CxEx DxFx ExGx xNxD xCxE xDxF xExG xFxH . . . . .

Implementation The algorithm is implemented in Perl Web Interface April 27, 2005 Rupali Patwardhan, Capstone Presentation

Issues in Testing Motif Discovery Algorithms
No Benchmarking dataset Difficult to compare different algorithms since they have very different kinds of parameters. Some motifs are easier to find than others. April 27, 2005 Rupali Patwardhan, Capstone Presentation

Test I First 100 PROSITE patterns and their corresponding protein families were used as the test dataset to test the accuracy of the output. The output of the program was compared to MEME and PRATT. April 27, 2005 Rupali Patwardhan, Capstone Presentation

Results I MDMD 62 gMDMD 75 MEME 80 PRATT 61
Program No. of Families for which motif matched PROSITE pattern MDMD 62 gMDMD 75 MEME 80 PRATT 61 For MEME and PRATT, the top 3 motifs were considered. April 27, 2005 Rupali Patwardhan, Capstone Presentation

Test II We also tested families corresponding to 162 PROSITE patterns that did not have any continuous conserved amino acid residues, but had at least one occurrence of alternate conserved amino acid residues. April 27, 2005 Rupali Patwardhan, Capstone Presentation

Results II Rank of motif MEME PRATT MDMD gMDMD 1 50 77 57 68 2 26 13 22 26 3 10 4 17 12 6 6 14 5 5 10 Total 102 105 115 128 April 27, 2005 Rupali Patwardhan, Capstone Presentation

MEME was run on IBM SP cluster on 8 processors in parallel April 27, 2005 Rupali Patwardhan, Capstone Presentation

Future Work Categorizing easy and difficult motifs. Extending this approach to consensus-based multiple sequence alignment. Predicting if a given protein sequence is likely to belong to a particular family or not. April 27, 2005 Rupali Patwardhan, Capstone Presentation

Acknowledgements Dr. Mehmet Dalkilic Dr. Haixu Tang Dr. Sun Kim Bioinformatics Research Group April 27, 2005 Rupali Patwardhan, Capstone Presentation

Motif Discovery in Protein Sequences using Messy de Bruijn Graph

Similar presentations

Presentation on theme: "Motif Discovery in Protein Sequences using Messy de Bruijn Graph"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Motif Discovery in Protein Sequences using Messy de Bruijn Graph

Similar presentations

Presentation on theme: "Motif Discovery in Protein Sequences using Messy de Bruijn Graph"— Presentation transcript:

Similar presentations

About project

Feedback