
1 Using a Mixture of Probabilistic Decision Trees for Direct Prediction of Protein Functions
Paper by Umar Syed and Golan Yona, Department of Computer Science, Cornell University
Presentation by Andrejus Parfionovas, Department of Mathematics & Statistics, USU

2 Classical methods to predict the structure of a new protein:
• Sequence comparison to known proteins, in search of similarities
  - Sequences often diverge until they become unrecognizable
• Structure comparison to known structures in the PDB database
  - Structural data is sparse and not available for newly sequenced genes

3 What other features can be used to improve prediction?
• Domain content
• Subcellular location
• Tissue specificity
• Species type
• Pairwise interactions
• Enzyme cofactors
• Catalytic activity
• Expression profiles, etc.

4 With so many features, it is important:
• To extract relevant information
  - Directly from the sequence
  - From the predicted secondary structure
  - From features extracted from databases
• To combine the data in a feasible model
  - A mixture model of Probabilistic Decision Trees (PDTs) was used

5 Features extracted directly from the sequence (as percentages):
• 20 individual amino acid frequencies
• 16 amino acid group frequencies (groups such as positively or negatively charged, polar, aromatic, hydrophobic, acidic, etc.)
• The 20 most informative dipeptides
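
A minimal Python sketch of how such composition features could be computed from a raw sequence. The group definitions and the dipeptide choice below are illustrative stand-ins, not the paper's exact 16 groups or its 20 most informative dipeptides:

    from collections import Counter

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
    # Illustrative groups only; the paper uses 16 physicochemical groups.
    GROUPS = {
        "positive": set("KRH"),
        "negative": set("DE"),
        "aromatic": set("FWY"),
        "hydrophobic": set("AVLIMFWC"),
    }

    def composition_features(seq):
        """Percentages of amino acids, groups, and dipeptides in a sequence."""
        n = len(seq)
        counts = Counter(seq)
        feats = {aa: 100.0 * counts[aa] / n for aa in AMINO_ACIDS}
        for name, members in GROUPS.items():
            feats[name] = 100.0 * sum(counts[aa] for aa in members) / n
        dipeptides = Counter(seq[i:i + 2] for i in range(n - 1))
        # Stand-in for the paper's 20 most informative dipeptides:
        # simply keep the 20 most frequent ones.
        for dp, c in dipeptides.most_common(20):
            feats["dp_" + dp] = 100.0 * c / (n - 1)
        return feats

    print(composition_features("MKVLAAGIVLLLSAACSSP"))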

6 Features predicted from the sequence:
• Secondary structure predicted by PSIPRED:
  - Coil
  - Helix
  - Strand
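
PSIPRED labels each residue coil (C), helix (H), or strand (E). One simple way to turn the per-residue prediction into fixed-length features is its composition; the encoding below is an assumption, not necessarily the paper's exact one:

    def ss_composition(ss):
        """Fractions of coil (C), helix (H), and strand (E) residues
        in a PSIPRED-style secondary structure string."""
        return {state: ss.count(state) / len(ss) for state in "CHE"}

    print(ss_composition("CCHHHHHHCCEEEEECC"))  # {'C': ..., 'H': ..., 'E': ...}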

7 Features extracted from the SWISS-PROT database:
• Binary features (presence/absence)
  - Alternative products
  - Enzyme cofactors
  - Catalytic activity
• Nominal features
  - Tissue specificity (2 different definitions)
  - Subcellular location
  - Organism and species classification
• Continuous features
  - Number of patterns exhibited by each protein (the "complexity" of a protein)
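
Part of the appeal of decision trees here is that these heterogeneous feature types need no common numeric encoding. A hypothetical record (field names invented for illustration) might look like:

    # Hypothetical SWISS-PROT-derived record; None marks a missing value,
    # which the PDT framework tolerates without imputation.
    protein_record = {
        "alternative_products": True,    # binary
        "enzyme_cofactor": False,        # binary
        "catalytic_activity": True,      # binary
        "tissue_specificity": "liver",   # nominal
        "subcellular_location": None,    # nominal, missing
        "organism": "Homo sapiens",      # nominal
        "num_patterns": 7,               # continuous ("complexity")
    }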

8 Mixture model of PDTs (Probabilistic Decision Trees)
• Can handle nominal data
• Robust to errors
• Missing data is allowed
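
The mixture idea itself is compact: average the class distributions returned by the individual probabilistic trees, weighting each tree by its estimated performance. A minimal sketch with toy trees and weights:

    def mixture_predict(trees, weights, x):
        """Weighted average of P(class | x) over several probabilistic
        decision trees; weights reflect each tree's performance."""
        total = sum(weights)
        mixed = {}
        for tree, w in zip(trees, weights):
            for cls, p in tree(x).items():
                mixed[cls] = mixed.get(cls, 0.0) + (w / total) * p
        return mixed

    # Toy trees: functions returning a class distribution for an input x.
    t1 = lambda x: {"kinase": 0.8, "other": 0.2}
    t2 = lambda x: {"kinase": 0.5, "other": 0.5}
    print(mixture_predict([t1, t2], [0.7, 0.3], None))  # kinase: 0.71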

9 How to select an attribute for a decision node?
• Use entropy to measure the impurity
• Impurity must decrease after the split
• Alternative measure: the de Mántaras distance metric (has a lower bias towards attributes with low split information)
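
A minimal sketch of the entropy-based criterion (the standard ID3/C4.5 information gain; the de Mántaras variant additionally normalizes the gain to reduce the bias towards many-valued attributes):

    import math
    from collections import Counter

    def entropy(labels):
        """Shannon entropy of a list of class labels (impurity measure)."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(rows, labels, attr):
        """Reduction in entropy after splitting on attribute `attr`."""
        n = len(labels)
        by_value = {}
        for row, y in zip(rows, labels):
            by_value.setdefault(row[attr], []).append(y)
        after = sum(len(ys) / n * entropy(ys) for ys in by_value.values())
        return entropy(labels) - after

    rows = [{"loc": "nucleus"}, {"loc": "nucleus"}, {"loc": "membrane"}]
    labels = ["enzyme", "enzyme", "structural"]
    print(information_gain(rows, labels, "loc"))  # 0.918..., a perfect split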

10 Enhancements of the algorithm:
• Dynamic attribute filtering
• Discretizing numerical features (see the sketch after this list)
• Multiple values for attributes
• Missing attributes
• Binary splitting
• Leaf weighting
• Post-pruning
• 10-fold cross-validation
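
As one example from the list above, a sketch of discretizing a numerical feature by picking the binary cut point with the lowest weighted entropy, in the spirit of C4.5 (the paper's exact procedure may differ):

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def best_threshold(values, labels):
        """Cut point on a continuous feature minimizing the weighted
        entropy of the two resulting sides."""
        pairs = sorted(zip(values, labels))
        n = len(pairs)
        best_score, best_cut = float("inf"), None
        for i in range(1, n):
            if pairs[i - 1][0] == pairs[i][0]:
                continue  # no cut between equal values
            cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            left = [y for v, y in pairs if v <= cut]
            right = [y for v, y in pairs if v > cut]
            score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
            if score < best_score:
                best_score, best_cut = score, cut
        return best_cut

    print(best_threshold([1.0, 2.0, 7.0, 8.0], ["a", "a", "b", "b"]))  # 4.5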

11 The probabilistic framework
• An attribute is selected with a probability that depends on its information gain
• Trees are weighted by their performance
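
Concretely: instead of always splitting on the single best attribute, each tree samples its split attribute with probability proportional to the information gain, so repeated runs grow diverse trees that can then be mixed. A sketch (attribute names and gains are made up):

    import random

    def sample_attribute(gains):
        """Choose a split attribute with probability proportional to its
        information gain, rather than always taking the maximum."""
        attrs, weights = zip(*gains.items())
        return random.choices(attrs, weights=weights, k=1)[0]

    # Growing many trees this way and weighting each by its validation
    # performance yields the mixture of probabilistic decision trees.
    print(sample_attribute({"tissue": 0.40, "location": 0.35, "cofactor": 0.10}))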

12 Evaluation of decision trees
• Accuracy = (tp + tn) / total
• Sensitivity = tp / (tp + fn)
• Selectivity = tp / (tp + fp)
• Jensen-Shannon divergence score
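
The first three metrics come straight from the confusion counts; the Jensen-Shannon score compares two discrete probability distributions (e.g., the class distribution a leaf predicts versus the one observed). A sketch:

    import math

    def accuracy(tp, tn, fp, fn):
        return (tp + tn) / (tp + tn + fp + fn)

    def sensitivity(tp, fn):
        return tp / (tp + fn)

    def selectivity(tp, fp):  # also known as precision
        return tp / (tp + fp)

    def js_divergence(p, q):
        """Jensen-Shannon divergence between two discrete distributions
        given as equal-length lists of probabilities."""
        def kl(a, b):
            return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
        m = [(x + y) / 2 for x, y in zip(p, q)]
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

    print(js_divergence([1.0, 0.0], [0.5, 0.5]))  # about 0.311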

13 Handling skewed distributions (unequal class sizes)
• Re-weight cases by 1/(# of class counts)
  - Increases the impurity and the # of false positives
• Mixed entropy
  - Uses an average of the weighted and unweighted information gain to split and prune trees
• Interlaced entropy
  - Start with weighted samples and later switch to the unweighted entropy
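
A sketch of the re-weighting and mixed-entropy ideas: every example is down-weighted by the size of its class, and the mixed variant averages the weighted and unweighted impurities (the 50/50 mixing constant below is an assumption):

    import math
    from collections import Counter

    def weighted_entropy(labels, class_weights):
        """Entropy where each example counts 1/(class size), so small
        protein families are not swamped by large ones."""
        w = Counter()
        for y in labels:
            w[y] += class_weights[y]
        total = sum(w.values())
        return -sum((v / total) * math.log2(v / total) for v in w.values())

    def mixed_entropy(labels, class_weights, alpha=0.5):
        """Average of the weighted and plain entropies (alpha assumed)."""
        n = len(labels)
        plain = -sum((c / n) * math.log2(c / n)
                     for c in Counter(labels).values())
        return alpha * weighted_entropy(labels, class_weights) + (1 - alpha) * plain

    labels = ["big"] * 90 + ["small"] * 10
    weights = {"big": 1 / 90, "small": 1 / 10}
    print(mixed_entropy(labels, weights))  # between 0.469 (plain) and 1.0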

14 Model selection (simplification)
• Occam's razor: of two models with the same results, choose the simpler one
• Bayesian approach: the most probable model has the maximum posterior probability
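
One way to make the Bayesian reading concrete: score each candidate tree by its log-likelihood plus a log-prior that shrinks with tree size, and keep the maximum-a-posteriori model. The per-leaf penalty below is an illustrative constant, not a value from the paper:

    def log_posterior(log_likelihood, num_leaves, penalty_per_leaf=0.5):
        """MAP-style tree score: data fit plus a prior favoring small trees."""
        log_prior = -penalty_per_leaf * num_leaves
        return log_likelihood + log_prior

    # Two trees that fit the data equally well: the prior (Occam's razor)
    # prefers the smaller one.
    print(log_posterior(-120.0, num_leaves=8))   # -124.0
    print(log_posterior(-120.0, num_leaves=20))  # -130.0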

15 Learning strategy optimization

Configuration               Sensitivity (selectivity)   Accepted/rejected
Basic (C4.5)                0.35                        initial
de Mántaras metric          0.36                        accepted
Binary branching            0.45                        accepted
Weighted entropy            0.56                        accepted
10-fold cross-validation    0.65                        accepted
20-fold cross-validation    0.64                        rejected
JS-based post-pruning       0.67                        accepted
Sen/sel post-pruning        0.68                        rejected
Weighted leaves             0.68                        rejected
Mixed entropy               0.63                        rejected
Dipeptide information       0.73                        accepted
Probabilistic trees         0.81                        accepted

16 Pfam classification test (comparison to BLAST)
• PDT performance: 81%
• BLAST performance: 86%
• Main reasons for the gap:
  - Nodes remain impure because weighted entropy stops learning too early
  - Important branches were eliminated by post-pruning when the validation set is small

17 EC classification test (comparison to BLAST)
• PDT performance on average: 71%
• BLAST performance was often lower

18 Conclusions
• Many protein families cannot be defined by sequence similarity alone
• The new method makes use of other features (structure, dipeptides, etc.)
• Besides classification, PDTs allow feature selection for further use
• Results are comparable to BLAST

19 Modifications and improvements
• Use global optimization for pruning
• Use probabilities for attribute values
• Use boosting techniques (combine weighted trees)
• Use the Gini index to measure node impurity (sketched after this list)
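
For reference, the Gini index suggested above measures node impurity as the probability that two labels drawn at random from the node disagree:

    from collections import Counter

    def gini_index(labels):
        """Gini impurity: 0 for a pure node, approaching 1 - 1/k for a
        node spread evenly over k classes."""
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    print(gini_index(["a", "a", "b", "b"]))  # 0.5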

