Conditional Random Fields for the Prediction of Signal Peptide Cleavage Sites. M.W. Mak and S.Y. Kung, ICASSP’09. M.W. Mak, The Hong Kong Polytechnic University.


1 Conditional Random Fields for the Prediction of Signal Peptide Cleavage Sites. M.W. Mak (The Hong Kong Polytechnic University) and S.Y. Kung (Princeton University), ICASSP’09.

2 Contents. 1. Introduction: Proteins and Their Subcellular Locations; Importance of Protein Cleavage-Site Prediction; Information in Amino Acid Sequences; Existing Approaches to Cleavage Site Prediction. 2. Conditional Random Field (CRF): CRF for Cleavage Site Prediction. 3. Experiments and Results: Effectiveness of Different Feature Functions; Effect of Varying Window Size; Fusion with SignalP.

3 Proteins and Their Destination. A protein consists of a sequence of amino acids. Newly synthesized proteins need to pass across intracellular membranes to reach their destination. Figure source: http://redpoll.pharmacy.ualberta.ca

4 Signal Peptide. A short segment of 20 to 100 amino acids (known as the signal peptide) contains information about the destination (address) of the protein. The signal peptide is cleaved off from the resulting mature protein when it passes across the membrane. Figure labels: signal peptide, cleavage site, mature protein. Sources: S. R. Goodman, Medical Cell Biology, Elsevier, 2008; http://nobelprize.org

5 Importance of Cleavage Site Prediction. Defects in the protein sorting process can cause serious diseases, e.g., kidney stones. Source: http://nobelprize.org/nobel_prizes/medicine/laureates/1999/illpres/diseases.html

6 Importance of Cleavage Site Prediction. Many proteins (e.g., insulin) are produced in living cells. To cause the proteins to be secreted out of the cell, they are provided with a signal peptide. Figure label: bioreactor. Source: http://nobelprize.org/nobel_prizes/medicine/laureates/1999/illpres/diseases.html

7 Information in Sequences. Signal peptides contain some regular patterns. Although the patterns exhibit substantial variation, they can be detected by machine learning tools. Figure labels: cleavage site; region rich in hydrophobic amino acids (AA).

8 Existing Methods. Weight matrices (PrediSi); neural networks (SignalP 1.1); HMMs (SignalP 3.0).

9 Weight Matrices. Example sequence: M A R S S L F T F L C L A V F I N G C L S Q I E Q Q. The weight matrix covers 20 amino acids (AA) by 15 positions. Score at position t = 16 + 0 + 8 + 6 + 78 + 7 + 7 + 13 + 10 + 6 + 8 + 6 + 0 + 6 + 7 = 178. Figure labels: t-1, t, t+1.
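The weight-matrix score on this slide is a sum of one weight per window position. The sketch below illustrates that idea only: the matrix values are random placeholders (not PrediSi's weights), and the function names and the way the window is slid along the sequence are assumptions made for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
N_POSITIONS = 15                      # window length, as on the slide

def make_placeholder_matrix(seed=0):
    """Build a 20 x 15 table of made-up integer weights (NOT real PrediSi values)."""
    rng = random.Random(seed)
    return {aa: [rng.randint(0, 20) for _ in range(N_POSITIONS)] for aa in AMINO_ACIDS}

def window_score(seq, start, matrix):
    """Sum the weight of each residue at its position inside the 15-residue window."""
    window = seq[start:start + N_POSITIONS]
    return sum(matrix[aa][i] for i, aa in enumerate(window))

def best_window(seq, matrix):
    """Return (start, score) of the highest-scoring window in the sequence."""
    scored = [(window_score(seq, s, matrix), s)
              for s in range(len(seq) - N_POSITIONS + 1)]
    score, start = max(scored)
    return start, score

# The slide's own arithmetic: 15 per-position weights.
slide_weights = [16, 0, 8, 6, 78, 7, 7, 13, 10, 6, 8, 6, 0, 6, 7]
slide_total = sum(slide_weights)

seq = "MARSSLFTFLCLAVFINGCLSQIEQQ"  # example sequence from the slide
matrix = make_placeholder_matrix()
start, score = best_window(seq, matrix)
```

With a trained matrix, the highest-scoring window would be taken as the predicted cleavage-site neighbourhood; how the window is aligned to the site is a design choice of the predictor and is not stated on the slide.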

10 SignalP-HMM. Figure source: Nielsen and Krogh. Figure labels: signal peptide, mature protein.

11 Contents. 1. Introduction: Proteins and Their Subcellular Locations; Importance of Protein Cleavage-Site Prediction; Information in Amino Acid Sequences; Existing Approaches to Cleavage Site Prediction. 2. Conditional Random Field (CRF): CRF for Cleavage Site Prediction. 3. Experiments and Results: Effectiveness of Amino Acid Properties; Effectiveness of Different Feature Functions; Fusion with SignalP.

12 Conditional Random Fields. Given a sequence of observations (e.g., words), a CRF attempts to find the most likely label sequence, i.e., it gives a label for each of the observations. Conditional Random Fields (CRFs) were originally designed for sequence labeling tasks such as Part-of-Speech (POS) tagging.
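To make "find the most likely label sequence" concrete, here is a minimal Viterbi decoding sketch for a linear-chain model. It is an illustration under assumptions: the scores are toy numbers rather than learned CRF weights, and the function name `viterbi`, the label set, and the three-word example are hypothetical.

```python
def viterbi(state_scores, trans_scores, labels):
    """Return the highest-scoring label sequence.

    state_scores: list over positions of {label: score}
    trans_scores: {(prev_label, label): score}
    Scores are additive, i.e. they play the role of log-domain potentials.
    """
    best = [{y: state_scores[0][y] for y in labels}]
    back = []  # back[t-1][y] = best previous label for label y at position t
    for t in range(1, len(state_scores)):
        best.append({})
        back.append({})
        for y in labels:
            prev = max(labels, key=lambda p: best[t - 1][p] + trans_scores[(p, y)])
            best[t][y] = best[t - 1][prev] + trans_scores[(prev, y)] + state_scores[t][y]
            back[t - 1][y] = prev
    # Trace back from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Toy POS-tagging example for the words "the dog barks" (made-up scores).
labels = ["DET", "NOUN", "VERB"]
state_scores = [
    {"DET": 2.0, "NOUN": 0.0, "VERB": 0.0},
    {"DET": 0.0, "NOUN": 1.5, "VERB": 1.0},
    {"DET": 0.0, "NOUN": 1.0, "VERB": 1.2},
]
trans_scores = {(p, y): 0.0 for p in labels for y in labels}
trans_scores[("DET", "NOUN")] = 1.0
trans_scores[("NOUN", "VERB")] = 1.0

tags = viterbi(state_scores, trans_scores, labels)
```

Dynamic programming keeps the search linear in the sequence length even though the number of candidate label sequences grows exponentially.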

13 HMM vs. CRF. Hidden Markov Models: learn [equation missing in transcript]. Conditional Random Fields: learn [equation missing in transcript] (more direct). Figure: two chain diagrams, each relating labels y1, y2, ..., yT to observations x1, x2, ..., xT.

14 Advantages of CRF. Avoids computing the likelihood p(observation | label); instead, the posterior p(label | observation) is computed directly. Able to model long-range dependencies without making the inference problem intractable. Guarantees a global optimum. Figure: example sequence M A R S S L F T F L C L A V F I N G C L S Q I E Q Q with a "depends on" annotation.

15 CRF for Cleavage Site Prediction. Equation labels: cleavage site, transition features, state features, weights, length of sequence, n-grams of amino acids.
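The equation on this slide is not present in the transcript. For orientation only, the standard linear-chain CRF posterior, which involves the quantities the slide's labels name (transition features, state features, weights, sequence length), can be written as follows. The notation here is assumed and may differ from the authors': f_j are transition features with weights \lambda_j, g_k are state features with weights \mu_k, and T is the sequence length.

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\!\Bigg( \sum_{t=1}^{T} \Big[
        \sum_{j} \lambda_j \, f_j(y_{t-1}, y_t, \mathbf{x}, t)
      + \sum_{k} \mu_k \, g_k(y_t, \mathbf{x}, t)
    \Big] \Bigg),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'}
    \exp\!\Bigg( \sum_{t=1}^{T} \Big[
        \sum_{j} \lambda_j \, f_j(y'_{t-1}, y'_t, \mathbf{x}, t)
      + \sum_{k} \mu_k \, g_k(y'_t, \mathbf{x}, t)
    \Big] \Bigg)
```

In this general form the state features may look at any part of the observation sequence x, which is what lets n-grams of amino acids around position t enter the model.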

16 CRF for Cleavage Site Prediction. E.g., bi-gram and query sequence = T Q T W A G S H S...
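As a small illustration of what "bi-gram" means for the query sequence on this slide, the sketch below reads the n-grams off a sequence. The helper name `ngrams` is hypothetical, and this shows only the n-gram extraction, not the slide's feature functions, which are not present in the transcript.

```python
def ngrams(seq, n):
    """Return all contiguous substrings of length n, left to right."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

query = "TQTWAGSHS"          # the visible part of the slide's query sequence
bigrams = ngrams(query, 2)   # ['TQ', 'QT', 'TW', 'WA', 'AG', 'GS', 'SH', 'HS']
unigrams = ngrams(query, 1)  # the individual amino acids
```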

17 CRF for Cleavage Site Prediction. Figure label: position.

18 Contents. 1. Introduction: Proteins and Their Subcellular Locations; Importance of Protein Cleavage-Site Prediction; Information in Amino Acid Sequences; Existing Approaches to Cleavage Site Prediction. 2. Conditional Random Field (CRF): CRF for Cleavage Site Prediction. 3. Experiments and Results: Effectiveness of Different Feature Functions; Effect of Varying Window Size; Fusion with SignalP.

19 Experiments. Data: 1937 protein sequences extracted from Swissprot 56.5. The cleavage site locations of these sequences were biologically determined. Ten-fold cross-validation. For 1st-order state features, up to 5-grams of amino acids; for 2nd-order state features, up to bi-grams of amino acids. The CRF++ software was used.
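The ten-fold cross-validation protocol can be sketched as follows. This is a generic illustration, not the authors' script: the function name `k_fold_indices`, the shuffling, and the seed are assumptions, and only the dataset size (1937 sequences) and the number of folds come from the slide.

```python
import random

def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k near-equal, disjoint folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(1937, k=10))
fold_sizes = [len(test) for _, test in splits]
```

Each sequence is held out exactly once, so every prediction used for evaluation comes from a model that did not see that sequence during training.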

20 Results. Effectiveness of using AA properties. Observations: (1) Amino acids provide the most relevant information. (2) Hydrophobicity and charge/polarity can help.

21 Results. Effectiveness of different feature functions. Observations: (1) Transition features by themselves perform poorly. (2) Once they are combined with state features, performance improves. Figure labels: (Transition only), (Transition + State).

22 Results. Effect of varying the window size, e.g., query sequence = T Q T W A G S H S...
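One way to picture a window over the query sequence is sketched below. How the slide defines the window is not present in the transcript, so this assumes a symmetric window centred on position t; the function name `window_at`, the 0-based indexing, and the `-` padding symbol are all hypothetical.

```python
def window_at(seq, t, half_width, pad="-"):
    """Return the residues from t - half_width to t + half_width (0-based t),
    padding with `pad` where the window runs past either end of the sequence."""
    padded = pad * half_width + seq + pad * half_width
    return padded[t:t + 2 * half_width + 1]

query = "TQTWAGSHS"                 # the visible part of the slide's query sequence
centre = window_at(query, 3, 2)     # window around 'W'
left_edge = window_at(query, 0, 2)  # window that needs left padding
```

A wider window gives each position more context at the cost of more features, which is the trade-off a window-size experiment explores.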

23 Results. Comparison with other predictors. Observations: (1) CRF is slightly better than SignalP. (2) CRF is complementary to SignalP.

24 Web Server. http://158.132.148.85:8080/CSitePred/faces/Page1.jsp

25 Web Server. http://158.132.148.85:8080/CSitePred/faces/Page1.jsp (available in May 2009)


27 Conditional Random Fields. Given a sequence of observations, a CRF attempts to find the most likely label sequence, i.e., it gives a label for each of the observations. Conditional Random Fields (CRFs) were originally designed for sequence labeling tasks such as Part-of-Speech (POS) tagging. Figure labels: observations x, labels y.

