1 Semi-supervised learning for protein classification
  Brian R. King; Chittibabu Guda, Ph.D.
  Department of Computer Science, University at Albany, SUNY
  Gen*NY*sis Center for Excellence in Cancer Genomics, University at Albany, SUNY
2 The problem
  Goal: develop computational models of characteristics of protein structure and function from sequence alone, using machine-learned classifiers.
  - Input: data
  - Output: a model (function) h : X → Y
  Traditional approach: supervised learning
  Challenges:
  - Experimentally determined data: expensive, limited, subject to noise and error
  - Large repositories of unannotated data
      Repository sizes: Swiss-Prot 54.5: 289,473; TrEMBL 37.5: 5,035,267
  - Data representation, bias from unbalanced or underrepresented classes, etc.
  Aim: develop a method that uses both labeled and unlabeled data and improves performance despite the challenges of small, unbalanced data.
3 Solution
  Semi-supervised learning: use both D_l (labeled) and D_u (unlabeled) for model induction
  Method: generative, Bayesian probabilistic model
  - Based on ngLOC, a supervised Naïve Bayes classification method
  - Input / feature representation: sequence n-gram model
  - Assumptions: multinomial distribution; sequences and n-grams are IID
  - Use expectation maximization (EM)
  Test setup: prediction of subcellular localization (eukaryotic, non-plant sequences only)
  - D_l: data annotated with subcellular localization for eukaryotic, non-plant sequences
      DL-2: EXT / PLA (~5,500 sequences, balanced)
      DL-3: GOL [65%] / LYS [14%] / POX [21%] (~600 sequences, unbalanced)
  - D_u: set drawn from ~75K eukaryotic, non-plant protein sequences
  Comparative method: transductive SVM (TSVM)
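  The EM procedure named on this slide can be sketched as follows. This is a minimal illustration, not the ngLOC implementation: it assumes a multinomial Naive Bayes model over overlapping n-grams with Laplace smoothing. The function names (ngrams, m_step, posterior, em_naive_bayes), the smoothing constant alpha, and the toy sequences are all hypothetical.

```python
import math
from collections import Counter

def ngrams(seq, n=2):
    """Return the overlapping n-grams of a sequence string."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

def m_step(docs, resp, classes, vocab, alpha=1.0):
    """M-step: estimate class priors and n-gram probabilities from
    (possibly fractional) class memberships, with Laplace smoothing."""
    prior, cond = {}, {}
    total = sum(sum(r.values()) for r in resp)
    for c in classes:
        counts, mass = Counter(), 0.0
        for grams, r in zip(docs, resp):
            w = r.get(c, 0.0)          # membership of this sequence in class c
            mass += w
            for g in grams:
                counts[g] += w
        denom = sum(counts.values()) + alpha * len(vocab)
        cond[c] = {g: (counts[g] + alpha) / denom for g in vocab}
        prior[c] = (mass + alpha) / (total + alpha * len(classes))
    return prior, cond

def posterior(grams, prior, cond):
    """E-step for one sequence: P(class | n-grams) under the multinomial model.
    N-grams outside the training vocabulary are skipped."""
    logp = {c: math.log(prior[c])
               + sum(math.log(cond[c][g]) for g in grams if g in cond[c])
            for c in prior}
    top = max(logp.values())           # subtract the max for numerical stability
    z = sum(math.exp(v - top) for v in logp.values())
    return {c: math.exp(v - top) / z for c, v in logp.items()}

def em_naive_bayes(labeled, unlabeled, n=2, iters=10):
    """Basic EM: start from the labeled data, then alternate E- and M-steps."""
    classes = sorted({y for _, y in labeled})
    l_docs = [ngrams(s, n) for s, _ in labeled]
    u_docs = [ngrams(s, n) for s in unlabeled]
    vocab = {g for d in l_docs + u_docs for g in d}
    l_resp = [{y: 1.0} for _, y in labeled]   # labeled memberships stay fixed
    prior, cond = m_step(l_docs, l_resp, classes, vocab)
    for _ in range(iters):
        u_resp = [posterior(d, prior, cond) for d in u_docs]
        prior, cond = m_step(l_docs + u_docs, l_resp + u_resp, classes, vocab)
    return prior, cond

# Toy data (made up for illustration; not real protein sequences).
labeled = [("AAAAAA", "EXT"), ("CCCCCC", "PLA")]
unlabeled = ["AAAAAC", "CCCCCA", "AAAA"]
prior, cond = em_naive_bayes(labeled, unlabeled)
post = posterior(ngrams("AAAA", 2), prior, cond)
```

  In this sketch the labeled sequences keep hard memberships in every iteration, while the unlabeled sequences contribute fractional counts that are re-estimated on each pass.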
4 Algorithms based on EM
  EM-λ on DL-3 data
  - λ controls the effect of unlabeled (UL) data on parameter adjustments
  - All labeled data (~600 sequences); varied UL data
  - EM-λ outperforms TSVM on this problem (TSVM failed to converge on large amounts of UL data, despite parameter selection)
  - Note: TSVM performed very well on binary, balanced classification problems
  Basic EM on DL-2 data
  - Varied labeled data; 25,000 UL sequences
  - Most improvement when labeled data is limited
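  One common way to let λ control the influence of UL data is to down-weight the expected counts from unlabeled sequences in the M-step. The slide does not give the exact update, so the sketch below is an assumption: the function names, the restriction 0 ≤ λ ≤ 1, the smoothing constant alpha, and the toy counts are all hypothetical.

```python
from collections import Counter

def weighted_counts(labeled_counts, unlabeled_expected_counts, lam):
    """Combine labeled n-gram counts with expected unlabeled counts,
    scaling the unlabeled contribution by lam (assumed to lie in [0, 1])."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be between 0 and 1")
    combined = Counter()
    for gram, count in labeled_counts.items():
        combined[gram] += count
    for gram, count in unlabeled_expected_counts.items():
        combined[gram] += lam * count
    return combined

def smoothed_probs(counts, vocab, alpha=1.0):
    """Turn combined counts into Laplace-smoothed n-gram probabilities."""
    denom = sum(counts.values()) + alpha * len(vocab)
    return {g: (counts.get(g, 0.0) + alpha) / denom for g in vocab}

# Toy counts for a single class (made up for illustration).
labeled_counts = {"AA": 4, "AC": 1}
unlabeled_expected = {"AA": 10.0, "CC": 30.0}
vocab = {"AA", "AC", "CC"}

supervised = weighted_counts(labeled_counts, unlabeled_expected, 0.0)
half = weighted_counts(labeled_counts, unlabeled_expected, 0.5)
basic_em = weighted_counts(labeled_counts, unlabeled_expected, 1.0)
probs_half = smoothed_probs(half, vocab)
```

  Under this assumption, λ = 0 ignores the unlabeled counts and λ = 1 treats them like labeled counts, as in basic EM; intermediate values keep a large unlabeled set from swamping ~600 labeled sequences.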
5 Algorithm: EM-CS
  - The core ngLOC method outputs a confidence score (CS)
  - Improve running time through intelligent selection of unlabeled instances: if CS(x_i) > CSthresh, use the instance
  Test on DL-3 data:
  - First, determine the range of CS values through cross-validation without UL data: 33.5-47.8 (dependent on the level of similarity in the data and the size of the dataset)
  - Using only sequences that meet or exceed CSthresh significantly reduces the UL data required (97.5% eliminated)
  - Note: it is possible to reduce the UL data too much
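  The selection step can be sketched as a simple filter. The slide does not define how ngLOC computes CS, so the scoring function here is a hypothetical stand-in backed by made-up scores; select_confident and the threshold value are likewise illustrative. The comparison uses >= to follow the "meet or exceed" wording.

```python
def select_confident(sequences, score_fn, cs_thresh):
    """Keep only the unlabeled sequences whose confidence score meets or
    exceeds cs_thresh; also report the fraction eliminated."""
    kept = [s for s in sequences if score_fn(s) >= cs_thresh]
    eliminated = 1.0 - len(kept) / len(sequences) if sequences else 0.0
    return kept, eliminated

# Made-up confidence scores standing in for the classifier's CS output.
toy_scores = {"seqA": 48.0, "seqB": 12.5, "seqC": 30.0, "seqD": 5.0}

kept, eliminated = select_confident(list(toy_scores), toy_scores.get, cs_thresh=40.0)
```

  Only the kept sequences would then be passed to the EM iterations, which is where the running-time saving comes from.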
6 Conclusion
  Benefits:
  - Probabilistic
      Extract unlabeled sequences of "high confidence"
      Difficult with SVM or TSVM
  - Extraction of knowledge from the model
      Discriminative n-grams and anomalies
      Information-theoretic measures, KL-divergence, etc.
      Again, difficult with SVM or TSVM
  - Computational resources
      Time: significantly lower than SVM and TSVM
      Space: dependent on the n-gram model
  - Can use large amounts of unlabeled data
  - Applicable toward prediction of any structural or functional characteristic
  - Outputs a global model (transduction is not global)
  - Most substantial gain with limited labeled data
  Current work in progress:
  - TSVMs
  - Improve performance on smaller, unbalanced data
  - Select an improved, lower-dimensional feature space representation
  - Ensemble classifiers, Bayesian model averaging, mixture of experts
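  As one illustration of the information-theoretic measures mentioned above, the per-n-gram terms of the KL divergence between two class-conditional n-gram distributions point to the n-grams that most separate the classes. This is a sketch under assumptions: the distributions are made up, kl_terms is a hypothetical helper, and both distributions are assumed to be smoothed so that no probability in q is zero.

```python
import math

def kl_terms(p, q):
    """Per-n-gram contributions p(g) * log2(p(g) / q(g)) to KL(p || q), in bits.
    Assumes q(g) > 0 wherever p(g) > 0, for example after smoothing."""
    return {g: p[g] * math.log2(p[g] / q[g]) for g in p if p[g] > 0}

# Made-up class-conditional n-gram distributions for two classes.
p = {"AA": 0.5, "AC": 0.25, "CC": 0.25}
q = {"AA": 0.25, "AC": 0.25, "CC": 0.5}

terms = kl_terms(p, q)
kl = sum(terms.values())                     # total divergence KL(p || q)
most_discriminative = max(terms, key=terms.get)
```

  A generative model exposes these distributions directly, which is why this kind of inspection is easier than with an SVM or TSVM.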