Guiding motif discovery by iterative pattern refinement Zhiping Wang Advisor: Sun Kim, Mehmet Dalkilic School of Informatics, Indiana University.


1 Guiding motif discovery by iterative pattern refinement Zhiping Wang Advisor: Sun Kim, Mehmet Dalkilic School of Informatics, Indiana University

2 Outline
Introduction and motivation
Our framework for motif discovery
  1. Pattern finding
  2. Build seed motif
  3. Extract subsequences
  4. Find motif
  5. Iterative refinement
Performance of our framework
Discussion and future work

3 Introduction – motifs & their applications
Protein motifs are short patterns conserved across proteins. They are generally important for a protein's function or for maintaining its structure:
1. Enzyme catalytic sites.
2. Regions involved in binding a molecule (ADP/ATP, DNA, ...) or another protein.
3. A fold important for the overall 3D structure.
Applications: distinguishing protein groups based on such patterns, and classifying a newly sequenced protein into a specific protein family.

4 Introduction - motif discovery
PROSITE: patterns are found manually.
Deterministic algorithms, based on expectation maximization:
1. MEME (time consuming)
Stochastic algorithms (Gibbs sampling), which make random jumps in the search space:
1. Gibbs Sampler
2. AlignACE

5 Motivation
Motif discovery is, in a sense, the search for signal against noise. The noise model depends largely on the input sequences (see previous capstones). Our goal is to use “subsequences” to guide motif discovery, and we apply an iterative pattern refinement procedure to improve its performance.

7 Test Data Preparation
1. Download the PROSITE pattern and sequence databases.
2. Parse all positive sequences for each PROSITE ID and store them as a PROSITE family.
3. All sequences of one family contain the same PROSITE pattern.
4. We used PROSITE families for motif discovery.

8 Framework Overview
1. Find patterns in a PROSITE family.
2. Build seed motifs from the patterns.
3. Select subsequences based on the seed motifs.
4. Run a motif finding program (MEME) on the subsequences.
5. Search for the motifs over the entire family using MAST.
6. Select subsequences around the motif regions.
7. Return to step 4 and repeat until the final motif is stable.
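The loop in steps 4 to 7 can be sketched as follows. This is a minimal illustration, not the authors' program: `refine_motif`, `max_rounds`, and the two toy helpers are hypothetical names, and the helpers only stand in for MEME (step 4) and MAST plus subsequence selection (steps 5 and 6), which are not invoked here.

```python
from collections import Counter
from typing import Callable, List, Optional, Sequence


def refine_motif(
    family: Sequence[str],
    initial_subsequences: List[str],
    find_motif: Callable[[List[str]], str],
    search_and_extract: Callable[[str, Sequence[str]], List[str]],
    max_rounds: int = 20,  # assumed safety cap, not from the slides
) -> Optional[str]:
    """Repeat steps 4-7 until the motif no longer changes."""
    subsequences = initial_subsequences
    previous = None
    for _ in range(max_rounds):
        motif = find_motif(subsequences)                   # step 4
        if motif == previous:                              # step 7: stable
            return motif
        subsequences = search_and_extract(motif, family)   # steps 5-6
        previous = motif
    return previous


def toy_find_motif(subsequences: List[str], k: int = 3) -> str:
    """Toy stand-in for MEME: the most frequent k-mer (ties broken alphabetically)."""
    counts = Counter(s[i:i + k] for s in subsequences for i in range(len(s) - k + 1))
    return min(counts, key=lambda m: (-counts[m], m))


def toy_search_and_extract(motif: str, family: Sequence[str], flank: int = 1) -> List[str]:
    """Toy stand-in for MAST + selection: exact match plus `flank` residues per side."""
    windows = []
    for seq in family:
        i = seq.find(motif)
        if i >= 0:
            windows.append(seq[max(0, i - flank): i + len(motif) + flank])
    return windows


family = ["AACLGTT", "GGCLGAA", "TTCLGCC"]
seeds = ["ACLG", "GCLG", "TCLG"]
print(refine_motif(family, seeds, toy_find_motif, toy_search_and_extract))
```

Passing the motif finder and the search step as callables keeps the control flow separate from the external tools, so the stopping rule can be checked on toy data.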

10 Pattern Finding - thresholds
For each PROSITE family, we first find conserved patterns. A qualified pattern must pass three thresholds:
1. Length of the pattern.
2. Log-odds value of a 1st-order Markov model relative to a random model.
3. Support value: the number of different sequences in which the pattern occurs.
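One possible reading of the three thresholds is sketched below. The slides do not say how the models are estimated, so several things here are assumptions: transition and background frequencies are counted from the family itself, the "random model" is taken to be independent residue frequencies, the length threshold is treated as a minimum, and all function names are hypothetical. Under these assumptions the first-residue term is the same in both models and cancels.

```python
import math
from collections import Counter


def train_counts(seqs):
    """Count single residues, adjacent pairs, and pair-initial residues."""
    singles, pairs, firsts = Counter(), Counter(), Counter()
    for s in seqs:
        singles.update(s)
        for a, b in zip(s, s[1:]):
            pairs[(a, b)] += 1
            firsts[a] += 1
    return singles, pairs, firsts


def log_odds(pattern, singles, pairs, firsts):
    """Sum of log P(b | a) / P(b) over adjacent residues of the pattern."""
    total = sum(singles.values())
    score = 0.0
    for a, b in zip(pattern, pattern[1:]):
        if pairs[(a, b)] == 0:          # transition never observed
            return float("-inf")
        p_markov = pairs[(a, b)] / firsts[a]
        p_random = singles[b] / total
        score += math.log(p_markov / p_random)
    return score


def support(pattern, seqs):
    """Number of different sequences that contain the pattern."""
    return sum(pattern in s for s in seqs)


def is_qualified(pattern, seqs, min_len, min_log_odds, min_support):
    """Apply the three thresholds; the threshold values are caller-supplied."""
    counts = train_counts(seqs)
    return (
        len(pattern) >= min_len
        and log_odds(pattern, *counts) >= min_log_odds
        and support(pattern, seqs) >= min_support
    )


toy = ["ACAC", "ACGT"]
print(log_odds("AC", *train_counts(toy)), support("AC", toy))
```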

11 Pattern Finding - algorithm
1. Scan the sequences of one family with the thresholds and find the qualified patterns in each sequence.
2. Rank the sequences by the number of qualified patterns each contains.
3. Output the qualified patterns found in the top half of the ranked sequences.
4. Repeat the algorithm (go to step 1) on the remaining half of the sequences until no more patterns can be found.
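The halving procedure above can be sketched as follows. To keep the example short, the qualification test is simplified to "a fixed-length k-mer whose support within the current set reaches `min_support`"; the log-odds threshold is left out. The names `qualified_patterns`, `find_patterns`, `k`, and `min_support` are illustrative assumptions, not taken from the slides.

```python
from collections import Counter
from typing import List, Set


def qualified_patterns(seqs: List[str], k: int, min_support: int) -> List[Set[str]]:
    """Step 1 (simplified): per sequence, the k-mers with enough support."""
    support = Counter()
    for s in seqs:
        support.update({s[i:i + k] for i in range(len(s) - k + 1)})
    return [
        {s[i:i + k] for i in range(len(s) - k + 1) if support[s[i:i + k]] >= min_support}
        for s in seqs
    ]


def find_patterns(seqs: List[str], k: int = 3, min_support: int = 2) -> List[str]:
    found: List[str] = []
    remaining = list(seqs)
    while remaining:
        per_seq = qualified_patterns(remaining, k, min_support)
        if not any(per_seq):                      # no more patterns can be found
            break
        # Step 2: rank sequences by how many qualified patterns they contain.
        order = sorted(range(len(remaining)), key=lambda i: -len(per_seq[i]))
        half = (len(remaining) + 1) // 2
        top, rest = order[:half], order[half:]
        # Step 3: output the patterns of the top half.
        for i in top:
            for p in sorted(per_seq[i]):
                if p not in found:
                    found.append(p)
        # Step 4: repeat on the remaining half.
        remaining = [remaining[i] for i in rest]
    return found


print(find_patterns(["MCLGK", "TCLGR", "QALNQ", "WALNW"]))
```

Because the support is recomputed on the remaining half in each round, a pattern shared only by lower-ranked sequences can still qualify later, which appears to be the point of repeating from step 1.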

12 Pattern Finding - example Qualified Patterns (p1, p2, p3)

14 Build Seed Motif
1. Start from the pattern with maximal support and use it as the seed motif.
2. Score each candidate pattern (from sequences not covered by the seed motif) against the seed motif:
   $S_i = \sum_{j=1}^{n} S_{i-j} W_j$
   $S_i$: score of candidate pattern $i$ against the seed motif
   $S_{i-j}$: score of candidate pattern $i$ against the $j$-th pattern in the seed motif
   $W_j$: weight (support ratio) of the $j$-th pattern in the seed motif
3. Add the pattern with the highest score to the seed motif, provided that the score also exceeds a score threshold.
4. Go to step 2 and repeat until no more patterns can be added to the seed motif.

15 Build Seed Motif - example
Calculate pattern scores (threshold = 5; suppose no shared sequences).

Pattern  Sequence  Support  Weight      Score to motif
P1       CLG       4        $W_1 = 1$   -
P2       CLN       2        -           13
P3       ALG       2        -           10
P4       ALN       2        -           4

Position-wise scores of P2 against P1:
P1   C  L  G
     9  4  0
P2   C  L  N

$S_{2-1} = 9 + 4 + 0 = 13$; $S_2 = S_{2-1} W_1 = 13$

16 Build Seed Motif - example
Calculate pattern scores (threshold = 5; suppose no shared sequences).

Pattern  Sequence  Support  Weight              Score to motif
P1       CLG       4        $W_1 = 4 / (4+2)$   -
P2       CLN       2        $W_2 = 2 / (4+2)$   -
P3       ALG       2        -                   8
P4       ALN       2        -                   6

$S_{3-1} = 10$, $S_{3-2} = 4$; $S_3 = S_{3-1} W_1 + S_{3-2} W_2 = 8 > 5$
$S_{4-1} = 4$, $S_{4-2} = 10$; $S_4 = S_{4-1} W_1 + S_{4-2} W_2 = 6 > 5$

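17 Build Seed Motif - example
Calculate pattern scores (threshold = 5; suppose no shared sequences).

Pattern  Sequence  Support  Weight          Score to motif
P1       CLG       4        $W_1 = 4 / 8$   -
P2       CLN       2        $W_2 = 2 / 8$   -
P3       ALG       2        $W_3 = 2 / 8$   -
P4       ALN       2        -               9

$S_{4-1} = 4$, $S_{4-2} = 10$, $S_{4-3} = 8$; $S_4 = S_{4-1} W_1 + S_{4-2} W_2 + S_{4-3} W_3 = 9 > 5$
The slide reports $S_4 = 9$; evaluated with the weights listed here, the sum is 4(4/8) + 10(2/8) + 8(2/8) = 6.5, which also exceeds the threshold of 5.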
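The greedy procedure and the worked example can be reproduced with a short sketch. The substitution values below are chosen to match the pairwise scores quoted on the slides (they coincide with BLOSUM62 entries, but the slides do not name the matrix, so that is an assumption). The function names are hypothetical, and the code assumes equal-length patterns with no shared sequences, as in the example.

```python
# Per-residue substitution scores, keyed by the sorted residue pair.
# Assumed values: consistent with the slide's numbers (e.g. C/C = 9, L/L = 4, G/N = 0).
SUB = {
    ("C", "C"): 9, ("L", "L"): 4, ("G", "G"): 6, ("N", "N"): 6,
    ("A", "A"): 4, ("A", "C"): 0, ("G", "N"): 0,
}

SUPPORT = {"CLG": 4, "CLN": 2, "ALG": 2, "ALN": 2}  # P1..P4 from the example


def pair_score(a: str, b: str) -> int:
    """S_{i-j}: position-wise score between two equal-length patterns."""
    return sum(SUB[tuple(sorted((x, y)))] for x, y in zip(a, b))


def score_to_motif(candidate: str, motif: list, support: dict) -> float:
    """S_i = sum_j S_{i-j} W_j, with W_j the support ratio within the motif."""
    total = sum(support[m] for m in motif)
    return sum(pair_score(candidate, m) * support[m] / total for m in motif)


def build_seed_motif(support: dict, threshold: float):
    """Greedy build: start from maximal support, add the best candidate above threshold."""
    motif = [max(support, key=support.get)]
    candidates = [p for p in support if p not in motif]
    history = []  # (pattern added, its score) per round
    while candidates:
        scores = {c: score_to_motif(c, motif, support) for c in candidates}
        best = max(candidates, key=scores.get)
        if scores[best] <= threshold:
            break
        motif.append(best)
        candidates.remove(best)
        history.append((best, scores[best]))
    return motif, history


motif, history = build_seed_motif(SUPPORT, threshold=5)
print(motif)
print(history)
```

With this table the first two rounds give 13 for CLN and 8 for ALG, as on the slides; in the third round the listed weights give 6.5 for ALN, which still exceeds the threshold, so all four patterns join the seed motif.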

18 Build Seed Motif

20 Extract Subsequences

21 Find Motif MEME

22 Iterative refinement
Flowchart: motif1, motif2, motif3 → MAST over the entire PROSITE family → sub1, sub2, sub3 → MEME → motif1’, motif2’, motif3’ → Stable? If no, repeat with the new motifs; if yes, choose the best motif.
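The step that turns motif hits back into subsequences can be sketched as a simple windowing function. This is an illustration under assumptions: the hit positions would come from the MAST search (parsing its output is not shown), and `extract_windows`, `hits`, and the `flank` width are hypothetical names and parameters that the slides do not specify.

```python
from typing import Dict, List, Sequence


def extract_windows(
    family: Sequence[str],
    hits: Dict[int, int],   # sequence index -> 0-based start of the motif hit
    motif_width: int,
    flank: int,             # residues kept on each side (assumed parameter)
) -> List[str]:
    """Cut a window around each motif hit, clipped to the sequence ends."""
    windows = []
    for index, start in hits.items():
        seq = family[index]
        lo = max(0, start - flank)
        hi = min(len(seq), start + motif_width + flank)
        windows.append(seq[lo:hi])
    return windows


family = ["AAAACLGTTTT", "CLGTT"]
print(extract_windows(family, {0: 4, 1: 0}, motif_width=3, flank=2))
```

Clipping at the sequence ends matters for hits near a terminus, where a full flank is not available.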

24 Experiment
1. We randomly chose 17 PROSITE families as the test data set.
2. We ran MEME directly on these families and took the best motif for each.
3. We ran our framework on the same families and took the best motif for each.
4. We compared the results.

25 PROSITE Patterns
PS00010 C-x-[DN]-x(4)-[FY]-x-C-x-C.
PS00011 x(12)-E-x(3)-E-x-C-x(6)-[DEN]-x-[LIVMFY]-x(9)-[FYW].
PS00014 [KRHQSA]-[DENQ]-E-L>.
PS00018 D-x-[DNS]-{ILVFYW}-[DENSTG]-[DNQGHRK]-{GP}-[LIVMC]-[DENQSTAGC]-x(2)-[DE]-[LIVMFYW].
PS00020 [LIVM]-x-[SGN]-[LIVM]-[DAGHE]-[SAG]-x-[DNEAG]-[LIVM]-x-[DEAG]-x(4)-[LIVM]-x-[LM]-[SAG]-[LIVM]-[LIVMT]-W-x-[LIVM](2).
PS00099 [AG]-[LIVMA]-[STAGCLIVM]-[STAG]-[LIVMA]-C-x-[AG]-x-[AG]-x-[AG]-x-[SAG].
PS00342 [STAGCN]-[RKH]-[LIVMAFY]>.
PS00343 L-P-x-T-G-[STGAVDE].
PS00409 [KRHEQSTAG]-G-[FYLIVM]-[ST]-[LT]-[LIVP]-E-[LIVMFWSTAG](14).
PS00881 [DNEG]-x-[LIVFA]-[LIVMY]-[LVAST]-H-N-[STC].
PS01286 P-x(8,10)-[LM]-R-x-[GE]-[LIVP]-x-G-C.
PS00012 [DEQGSTALMKRH]-[LIVMFYSTAC]-[GNQ]-[LIVMFYAG]-[DNEKHS]-S-[LIVMST]-{PCFY}-[STAGCPQLIVMF]-[LIVMATN]-[DENQGTAKRHLM]-[LIVMWSTA]-[LIVGSTACR]-x(2)-[LIVMFA].
PS00019 [EQ]-x(2)-[ATV]-[FY]-x(2)-W-x-N.
PS00660 W-[LIV]-x(3)-[KRQ]-x-[LIVM]-x(2)-[QH]-x(0,2)-[LIVMF]-x(6,8)-[LIVMF]-x(3,5)-F-[FY]-x(2)-[DENS].
PS00661 [HYW]-x(9)-[DENQSTV]-[SA]-x(3)-[FY]-[LIVM]-x(2)-[ACV]-x(2)-[LM]-x(2)-[FY]-G-x-[DENQST]-[LIVMFYS].
PS00889 [LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV].
PS01177 [CSH]-C-x(2)-[GAP]-x(7,8)-[GASTDEQR]-C-[GASTDEQL]-x(3,9)-[GASTDEQN]-x(2)-[CE]-x(6,7)-C-C.
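For readers who want to scan sequences with these patterns, PROSITE syntax maps closely onto regular expressions. The converter below is a minimal sketch, not part of the presented framework: `prosite_to_regex` is a hypothetical helper that handles the constructs appearing in the list above (`x`, `[...]`, `{...}`, repeat counts, and the `<` / `>` terminal anchors) and does not cover rarer forms such as an anchor inside square brackets.

```python
import re

# One pattern element: a residue, x, [allowed] or {forbidden}, with optional (n) or (n,m).
_ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE pattern string into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        element = element.strip()
        prefix = suffix = ""
        if element.startswith("<"):        # N-terminal anchor
            prefix, element = "^", element[1:]
        if element.endswith(">"):          # C-terminal anchor
            suffix, element = "$", element[:-1]
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core in ("x", "X"):             # any residue
            core = "."
        elif core.startswith("{"):         # any residue except those listed
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


print(prosite_to_regex("C-x-[DN]-x(4)-[FY]-x-C-x-C."))
print(prosite_to_regex("[KRHQSA]-[DENQ]-E-L>."))
print(bool(re.search(prosite_to_regex("L-P-x-T-G-[STGAVDE]."), "AALPETGEE")))
```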

26 Performance
The result of the comparison.
Families: PS00010 PS00011 PS00018 PS00020 PS00409 PS00881 PS00012 PS01286 PS00099 PS00019 PS00660 PS00014 PS00342 PS00343 PS00661 PS00889 PS01177
MEME: ××
Framework: ××

28 Discussion
One flaw: local optima. PS01286 is the only family on which our framework performs worse.
- PROSITE pattern: P-x(8,10)-[LM]-R-x-[GE]-[LIVP]-x-G-C
- MEME: [TNS] W [HE] [GN] [RG] I [AGS] [LM] R [LV] E [LV] [YLF] G C
- Our framework:
  1. [EP] W x(4) L G x L [KM] x [VI] T [GA] [VI] [IA] T Q G
  2. x(4)-P-x(8)-[LM]-R-x-E-[LV]-x-G-C

29 Future Work
1. Design our own motif discovery algorithm.
2. Convert the framework into a complete program.
3. Test the performance of our program on more PROSITE patterns.

30 Acknowledgement Prof. Sun Kim Prof. Mehmet Dalkilic (Memo) Arvind Gopu Scott Martin

