Comparison of Data Mining Algorithms on Bioinformatics Dataset Melissa K. Carroll Advisor: Sung-Hyuk Cha March 4, 2003

Overview
Began as independent study project completed with Dr. Cha in Spring 2002
Initial goal: compare data mining algorithms on a public bioinformatics dataset
Later: evaluate stacked generalization approach
Organization of presentation
–Introduction to task
–Base models and performance
–"Stacked" models and performance
–Conclusion and Future Work

Introduction: Data Mining
Application of machine learning algorithms to large databases
Often used to generate models to classify future data based on a "training" dataset of known classifications
If data is organized well, domain knowledge is not necessary for the data mining practitioner

Introduction: Bioinformatics and Protein Localization
Bioinformatics: use of computational methods, e.g. data mining, to provide insights into molecular biology
Have large databases of information about genes; want to figure out the function of their encoded proteins
Proteins are expressed in a specific tissue, cell type, or subcellular component (localization)
Knowledge of protein localization can shed light on a protein's function

Introduction

Introduction: KDD Cup Dataset
KDD Cup: annual data mining competition sponsored by ACM SIGKDD
A training set with the target variable supplied, and a test set with the target variable missing
Participants submit predictions for the test set's target variable
Submissions with the highest accuracy rate (correct predictions / total instances in test set) win
The test set's target variable is made publicly available once the competition is over
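
The scoring metric is simple enough to state as code. A minimal sketch, assuming the predictions and true labels are plain Python lists (the label values below are illustrative, not taken from the dataset):

```python
def accuracy_rate(predicted, actual):
    """KDD Cup scoring metric: correct predictions / total instances in the test set."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Illustrative call with made-up labels:
print(accuracy_rate(["nucleus", "cytoplasm"], ["nucleus", "mitochondria"]))  # 0.5
```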

Introduction: KDD Cup Dataset Continued
The 2001 competition focused on bioinformatics, including a protein localization task
Dataset consisted of various information about anonymized genes of a particular organism, including class, phenotype, chromosome, whether essential, and other genes with which each interacts
Purpose of this project: compare data mining algorithms on the KDD Cup 2001 protein localization dataset

Methods
Simplify dataset: reduce number of variables to facilitate working with a commercial data mining package (SAS Enterprise Miner)
Decided to eliminate variables pertaining to interactions between genes
–there were more of these variables than other types
–a sophisticated relational algorithm would be necessary to take full advantage of them
Correspondingly, decreased number of target values

Frequency of Classes in KDD Cup Training Set

Frequency of Classes in KDD Cup Test Set

Methods Continued
Created subsets by selecting only instances whose target was among nucleus, cytoplasm, and mitochondria, and only non-relational variables
Divided training subset into two random subsets of 314 and 313 instances (training and validation)
Two actual training datasets were derived from this training set (see the sketch below):
–non-sampled raw data (314 instances)
–sampled dataset in which each target value appeared in equal amounts and which contained a frequency variable
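
A minimal sketch of how the split and the equally distributed training set could be built, assuming the training subset is a pandas DataFrame named `train` with the target in a column called `localization` (both names are illustrative; the original work used SAS Enterprise Miner rather than Python):

```python
import pandas as pd

def split_and_balance(train, target="localization", seed=0):
    """Split into training/validation parts and build an equally distributed copy."""
    shuffled = train.sample(frac=1, random_state=seed)
    train_part = shuffled.iloc[:314]      # non-sampled raw training data
    valid_part = shuffled.iloc[314:627]   # validation set for stacking

    # Equally distributed training set: the same number of rows per class
    # (sampling with replacement for the smaller classes), plus a frequency
    # variable recording each class's share of the raw training data.
    n = train_part[target].value_counts().max()
    balanced = (train_part.groupby(target, group_keys=False)
                .apply(lambda g: g.sample(n, replace=True, random_state=seed)))
    balanced["frequency"] = balanced[target].map(
        train_part[target].value_counts(normalize=True))
    return train_part, valid_part, balanced
```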

Methods Continued
A variable was excluded as an input if:
–more than 50% of data missing (none excluded)
–effectively unary (274 variables excluded)
–in hierarchy and not most detailed (none excluded)
Resulting training sets: 171 variables (170 binary, 1 non-binary categorical)
No missing values in any variables
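
A sketch of the screening rules above, assuming the candidate inputs sit in a pandas DataFrame `X` (the name and the "effectively unary" cutoff are assumptions; the hierarchy check is omitted because it depends on dataset-specific metadata):

```python
def screen_inputs(X, missing_threshold=0.5, unary_threshold=0.99):
    """Drop inputs that are mostly missing or effectively unary."""
    keep = []
    for col in X.columns:
        if X[col].isna().mean() > missing_threshold:
            continue  # more than 50% of the data missing
        top_share = X[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share > unary_threshold:
            continue  # effectively unary: a single value dominates the column
        keep.append(col)
    return X[keep]
```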

Base Models

Models
Artificial Neural Network
–Fully connected feedforward network
–One input node for each dummy variable from the 171 inputs
–1 hidden node and 2 output nodes: dummy values for nucleus and mitochondria
–191 randomly initialized weights
–Trained using dual quasi-Newton optimization to minimize misclassification rate on the training set
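
The original network was built in SAS Enterprise Miner; the following is only a rough scikit-learn analogue, assuming numerically encoded inputs in `X_train`/`y_train` (placeholder names), with LBFGS standing in for the dual quasi-Newton optimizer:

```python
from sklearn.neural_network import MLPClassifier

# Fully connected feedforward net with a single hidden node, trained with a
# quasi-Newton solver. Hyperparameters here are illustrative assumptions.
ann = MLPClassifier(hidden_layer_sizes=(1,), solver="lbfgs",
                    max_iter=1000, random_state=0)
# ann.fit(X_train, y_train)
# predictions = ann.predict(X_test)
```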

Models Continued
Decision Tree
–Used a CHAID-like algorithm with a chi-squared p-value splitting criterion of 0.2 and model selection based on proportion of instances correctly classified
Hybrid ANN/Tree
–Difficult for ANN to learn with so many variables
–Used decision tree as a feature selector to determine variables to use in training the ANN
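
A sketch of the hybrid idea under stated assumptions: scikit-learn has no CHAID implementation, so a CART-style tree stands in as the feature selector, and `X_train`/`y_train` are placeholder NumPy arrays:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

def hybrid_tree_ann(X_train, y_train):
    """Use a decision tree to pick variables, then train the network on them."""
    tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    selected = np.where(tree.feature_importances_ > 0)[0]  # variables the tree actually used
    ann = MLPClassifier(hidden_layer_sizes=(1,), solver="lbfgs",
                        max_iter=1000, random_state=0)
    ann.fit(X_train[:, selected], y_train)
    return tree, ann, selected
```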

Nearest Neighbor
–Simple nearest neighbor algorithm: assigned each instance in the dataset to be predicted to the class of the training instance that matched on the greatest number of variables
–Match defined as having the exact same value
–In case of ties, the value from among the possible classes that occurred most frequently in the raw training set was used, including when applying to the equally distributed training set
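
Matching on the greatest number of variables amounts to taking the training instance with the smallest Hamming distance, so a rough scikit-learn equivalent looks like the sketch below (inputs are assumed to be numerically encoded, and the majority-class tie-breaking rule described above is not what scikit-learn does by default):

```python
from sklearn.neighbors import KNeighborsClassifier

# 1-nearest-neighbor with Hamming distance: the neighbor that agrees on the
# most variables wins. X_train / y_train / X_test are placeholder names.
nn = KNeighborsClassifier(n_neighbors=1, metric="hamming")
# nn.fit(X_train, y_train)
# predictions = nn.predict(X_test)
```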

Preliminary Results
Accuracy rates
Statistical comparisons (test sketched below)
–Hybrid Tree-ANN significantly better for non-sampled than equally distributed on the test dataset (p < 0.01)
–Non-sampled dataset: Hybrid Tree-ANN not significantly better than non-sampled Tree (p = 0.06) but significantly better than non-sampled ANN (p < 0.05)
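
The slides do not spell out which test produced these p-values; one plausible form, consistent with the chi-square tests mentioned under Future Work, compares the correct/incorrect counts of two models on the same test set (the counts below are made up purely to show the call):

```python
from scipy.stats import chi2_contingency

#                  correct, incorrect   (illustrative counts, not real results)
table = [[200, 50],   # model A
         [180, 70]]   # model B
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.3f}")
```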

Reference Point for Results
Highest accuracy rate in the actual competition: 71.1%
Next 5 between 68.5% and 70.6%
My accuracy rates are just slightly off due to a gene with two localizations
The actual competition required prediction with many more possible values for the target variable
However, actual competitors had more variables with which to work (the relational ones)

“Stacked” Models

Stacking
Method for combining models
Not as common as other methods, and no standard way of doing it
Part of the training set is used to train the level-0, or base, models as usual
A dataset is built from the predictions of the base models on the remainder of the set (the validation set in this project)
The level-1 model is derived from this prediction dataset, rather than from training-set predictions, to prevent the level-1 model from favoring overfit base models
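
A minimal sketch of that workflow, assuming placeholder arrays `X_train`, `y_train`, `X_valid`, `y_valid`, `X_test` and simpler stand-in models than those actually used:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

def stack(X_train, y_train, X_valid, y_valid, X_test):
    """Fit level-0 models on the training part; fit a level-1 model on their validation-set predictions."""
    level0 = [DecisionTreeClassifier(random_state=0),
              KNeighborsClassifier(n_neighbors=1, metric="hamming")]
    for model in level0:
        model.fit(X_train, y_train)

    # The level-1 dataset: each column holds one base model's class predictions.
    valid_preds = np.column_stack([m.predict(X_valid) for m in level0])
    test_preds = np.column_stack([m.predict(X_test) for m in level0])

    enc = OrdinalEncoder().fit(np.vstack([valid_preds, test_preds]))
    level1 = CategoricalNB().fit(enc.transform(valid_preds), y_valid)
    return level1.predict(enc.transform(test_preds))
```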

Methods Continued
Level-1 ANN
–Same as level-0 ANN (used Levenberg-Marquardt optimization because of the smaller number of weights)
Level-1 Decision Tree
–Same as level-0 Tree
Level-1 Naïve Bayes
–Calculated likelihood of each target value based on Bayes rule applied to the level-0 predictions
–Predicted the value with the highest likelihood
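
A sketch of the Bayes-rule calculation behind the level-1 Naïve Bayes model, with the conditional probabilities estimated from the level-1 (prediction) dataset; the variable names and the add-one smoothing are assumptions:

```python
def naive_bayes_level1(meta_train, y_train, meta_row):
    """Return the most likely class for one row of level-0 predictions."""
    classes = sorted(set(y_train))
    scores = {}
    for c in classes:
        rows = [m for m, y in zip(meta_train, y_train) if y == c]
        score = len(rows) / len(y_train)  # class prior P(c)
        for j, pred in enumerate(meta_row):
            # P(level-0 model j predicts `pred` | true class is c), smoothed
            n_values = len(set(m[j] for m in meta_train))
            matches = sum(1 for r in rows if r[j] == pred)
            score *= (matches + 1) / (len(rows) + n_values)
        scores[c] = score
    return max(scores, key=scores.get)
```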

Results of Stacking Approach Continued
Accuracy rates
Statistical comparisons
–For non-sampled, all level-1 models significantly better than level-0 ANN
–For equally distributed, no level-1 models significantly better than level-0 ANN
–For non-sampled, no level-1 models significantly better than level-0 NN on the same dataset

Conclusion and Future Work

Conclusion
Stacked generalization produced more accurate predictors of test data than the base models overall, though not necessarily significantly so
–Consistent with intuition and other findings
Nearest Neighbor and Hybrid Tree-ANN more accurate than ANN and Tree alone, though not necessarily significantly so
–May need better trained ANN and tree

Conclusion Continued
Three types of level-1 models performed comparably
–Other research suggests linear models may work best for stacking, so the Bayesian model might be expected to perform best
–An a priori-type search on the prediction dataset before Bayesian training, to reject conclusions without enough support, may improve performance of the Bayesian model

Conclusion Continued
Non-sampled training dataset (with the target distribution found in the raw data) produced more accurate models than the equally distributed training dataset
–Sample size may have been too small
–Could try without the weight variable, since it is likely that prior probabilities are not known (unless the localization of all genes for this organism is known)

Future Work
Use cross-validation to obtain better estimates of error, both overall and for creating the level-1 training dataset (see the sketch after this list)
–Dividing training into two may have resulted in too few instances and inputs
Changing the stacking approach
–Use posterior probabilities instead of predictions
–Use different or modified algorithms (more linear, add a priori to Bayesian)
–Use a level-2 model on these level-1 models
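
A sketch of the proposed cross-validation change, assuming scikit-learn stand-ins for the base models: out-of-fold predictions would supply the level-1 training data, and cross-validated scores the error estimates (`X` and `y` are placeholder names):

```python
import numpy as np
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

def level1_inputs_cv(X, y, n_folds=10):
    """Build level-1 inputs from out-of-fold predictions instead of a single split."""
    base = [DecisionTreeClassifier(random_state=0),
            KNeighborsClassifier(n_neighbors=1, metric="hamming")]
    # Each column holds one base model's out-of-fold predictions for every instance.
    meta = np.column_stack([cross_val_predict(m, X, y, cv=n_folds) for m in base])
    errors = [1 - cross_val_score(m, X, y, cv=n_folds).mean() for m in base]
    return meta, errors
```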

Future Work Continued
Stratify training and validation datasets to keep the distribution the same as in the original training set
Run chi-squares on all combinations of models and adjust for multiple comparisons (cross-validation is usually the preferred method); a possible procedure is sketched below
Try on the complete KDD Cup dataset
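
A possible form of that procedure, with a Bonferroni adjustment standing in for "adjust for multiple comparisons"; `results` maps model names to (correct, incorrect) counts on the test set, and the counts shown are placeholders:

```python
from itertools import combinations
from scipy.stats import chi2_contingency

def compare_all(results, alpha=0.05):
    """Pairwise chi-square tests over all model combinations, Bonferroni-adjusted."""
    pairs = list(combinations(results, 2))
    adjusted_alpha = alpha / len(pairs)  # Bonferroni correction
    for a, b in pairs:
        _, p, _, _ = chi2_contingency([results[a], results[b]])
        print(f"{a} vs {b}: p={p:.4f}, significant at adjusted alpha: {p < adjusted_alpha}")

# Illustrative counts only:
compare_all({"ANN": (150, 60), "Tree": (160, 50), "Hybrid": (170, 40)})
```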