1 An Empirical Comparison of Supervised Learning Algorithms. Rich Caruana and Alexandru Niculescu-Mizil (ICML 2006). Presented by Varun Sudhakar

2 Importance: An empirical comparison of different learning algorithms answers questions such as: Which learning algorithm is best? How well does a particular learning algorithm perform compared to another on the same data?

3 The last comprehensive empirical comparison was STATLOG in 1995. Several new learning algorithms have been developed since STATLOG (random forests, bagging, SVMs), and there has been no extensive evaluation of these new methods.

4 Algorithms compared: SVMs, ANNs, logistic regression, naive Bayes, KNN, random forests, decision trees (BAYES, CART, CART0, ID3, C4, MML, SMML), bagged trees, boosted trees, boosted stumps

5 Threshold metrics: Accuracy: the proportion of correct predictions the classifier makes relative to the size of the dataset. F-score: the harmonic mean of precision and recall at a given threshold. Lift: (% of true positives above the threshold) / (% of the dataset above the threshold).
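
A minimal NumPy sketch of these three threshold metrics may make the definitions concrete (the function and variable names here are this writeup's own, not the paper's):

```python
import numpy as np

def threshold_metrics(y_true, scores, threshold=0.5):
    """Accuracy, F-score, and lift at a fixed decision threshold."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(scores) >= threshold).astype(int)

    # Accuracy: proportion of correct predictions over the whole dataset.
    accuracy = np.mean(y_pred == y_true)

    # F-score: harmonic mean of precision and recall at this threshold.
    tp = np.sum((y_pred == 1) & (y_true == 1))
    precision = tp / max(np.sum(y_pred == 1), 1)
    recall = tp / max(np.sum(y_true == 1), 1)
    f_score = 2 * precision * recall / max(precision + recall, 1e-12)

    # Lift: % of true positives above the threshold divided by
    # % of the dataset above the threshold.
    pct_positives_found = tp / max(np.sum(y_true == 1), 1)
    pct_dataset_above = np.mean(y_pred == 1)
    lift = pct_positives_found / max(pct_dataset_above, 1e-12)

    return accuracy, f_score, lift
```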

6 Ordering/rank metrics: ROC curve: a plot of sensitivity vs. (1 - specificity) for all possible thresholds. APR: average precision. BEP (break-even point): the precision at the point (threshold value) where precision and recall are equal.
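
As an illustration only (scikit-learn did not exist in this form when the paper was written), these ordering metrics can be computed as follows; `bep` takes the precision at the point where precision and recall are closest:

```python
import numpy as np
from sklearn.metrics import (average_precision_score,
                             precision_recall_curve, roc_auc_score)

def rank_metrics(y_true, scores):
    auc = roc_auc_score(y_true, scores)            # area under the ROC curve
    apr = average_precision_score(y_true, scores)  # average precision

    # Break-even point: precision where precision == recall
    # (in practice, where the two are closest).
    precision, recall, _ = precision_recall_curve(y_true, scores)
    bep = precision[np.argmin(np.abs(precision - recall))]
    return auc, apr, bep
```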

7 Probability metrics: RMSE (root mean squared error): a measure of total error, defined as the square root of the sum of the variance and the squared bias. MXE (mean cross entropy): used in the probabilistic setting when we are interested in predicting the probability that an example is positive: MXE = -1/N Σ [true(c)·ln(pred(c)) + (1 - true(c))·ln(1 - pred(c))]
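
A direct transcription of the two probability metrics, assuming `y` holds the 0/1 labels true(c) and `p` the predicted probabilities pred(c):

```python
import numpy as np

def rmse(y, p):
    # Root mean squared error between labels and predicted probabilities.
    return np.sqrt(np.mean((np.asarray(y) - np.asarray(p)) ** 2))

def mxe(y, p, eps=1e-12):
    # Mean cross entropy, exactly as on the slide; probabilities are
    # clipped away from 0 and 1 to keep the logarithms finite.
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```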

8 Lift is appropriate for marketing; medicine prefers ROC; precision/recall is used in information retrieval. It is also possible for an algorithm to perform well on one metric and poorly on another.

9 Datasets: Letter, Cover Type, Adult, Protein coding (COD), MEDIS, MG, IndianPine92, California Housing, Bacteria, SLAC (Stanford Linear Accelerator)

10 For each dataset, 5000 random instances are used for training and the rest are used as one large test set. 5-fold cross-validation on the 5000 training instances is used to select the best parameters for each learning algorithm, and the cross-validation folds are also used to calibrate the different algorithms with either Platt scaling or isotonic regression.
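
A hypothetical scikit-learn sketch of this protocol; synthetic data stands in for one of the real datasets, and the SVM grid here is only a placeholder (the paper used its own implementations):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for one of the benchmark datasets.
X, y = make_classification(n_samples=20000, n_features=20, random_state=0)

# 5000 random instances for training; the rest form one large test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=5000, random_state=0)

# 5-fold cross-validation on the 5000 training instances selects the
# best parameters for the learning algorithm (here an SVM).
search = GridSearchCV(SVC(), {"C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```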

11 SVM predictions are transformed into posterior probabilities by passing them through a sigmoid (Platt scaling). Platt's method also works well for boosted trees and boosted stumps, but a sigmoid may not be the correct transformation for all learning algorithms. Isotonic regression provides a more general solution, since the only restriction it places on the mapping function is that it be isotonic (monotonically increasing).
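
A minimal sketch of both calibration methods, assuming an SVM's decision values are calibrated on held-out data (illustrative only; the slide describes the methods abstractly):

```python
from sklearn.datasets import make_classification
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=4000, random_state=0)
X_fit, X_cal, y_fit, y_cal = train_test_split(X, y, test_size=0.5,
                                              random_state=0)

svm = LinearSVC().fit(X_fit, y_fit)
raw = svm.decision_function(X_cal)   # uncalibrated SVM margins

# Platt scaling: fit a sigmoid (a 1-D logistic regression) that maps
# the raw scores to posterior probabilities.
platt = LogisticRegression().fit(raw.reshape(-1, 1), y_cal)
p_platt = platt.predict_proba(raw.reshape(-1, 1))[:, 1]

# Isotonic regression: the only restriction on the learned mapping is
# that it be monotone, so it can correct non-sigmoid distortions too.
iso = IsotonicRegression(out_of_bounds="clip").fit(raw, y_cal)
p_iso = iso.predict(raw)
```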

12 SVMs: radial kernel width in {0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 2}; the regularization parameter is varied by factors of ten from 10^-7 to 10^3. ANNs: hidden units in {1, 2, 4, 8, 32, 128}; momentum in {0, 0.2, 0.5, 0.9}.

13 Logistic regression: the ridge (regularization) parameter is varied by factors of 10 from 10^-8 to 10^4. KNN: 26 values of K ranging from K = 1 to K = |trainset|. Random forests: the size of the feature set considered at each split is 1, 2, 4, 6, 8, 12, 16, or 20.

14 Boosted trees: 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, and 2048 steps of boosting. Boosted stumps: single-level decision trees generated with 5 different splitting criteria, each boosted for 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, and 8192 steps.
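
The grids on slides 12-14 can be collected into one hypothetical search-space table; the key names ("gamma", "C", "n_neighbors", ...) are modern conventions, not the paper's, and the paper used its own implementation of each learner:

```python
import numpy as np

param_grids = {
    "SVM": {
        "gamma": [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 2],  # radial width
        "C": [10.0 ** k for k in range(-7, 4)],               # 1e-7 ... 1e3
    },
    "ANN": {
        "hidden_units": [1, 2, 4, 8, 32, 128],
        "momentum": [0, 0.2, 0.5, 0.9],
    },
    "LogisticRegression": {
        "ridge": [10.0 ** k for k in range(-8, 5)],           # 1e-8 ... 1e4
    },
    "KNN": {
        # 26 values of K from 1 to |trainset| = 5000; geometric spacing
        # is this sketch's assumption, the slide gives only the range.
        "n_neighbors": sorted(set(np.geomspace(1, 5000, 26).astype(int))),
    },
    "RandomForest": {
        "max_features": [1, 2, 4, 6, 8, 12, 16, 20],          # per split
    },
    "BoostedTrees": {
        "n_rounds": [2 ** k for k in range(1, 12)],           # 2 ... 2048
    },
    "BoostedStumps": {
        "n_rounds": [2 ** k for k in range(1, 14)],           # 2 ... 8192
    },
}
```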

15 Without calibration, the best algorithms were bagged trees, random forests, and neural nets. After calibration, the best were calibrated boosted trees, calibrated random forests, bagged trees, PLT-calibrated SVMs, and neural nets. SVMs and boosted trees improve in the rankings with calibration.

16 Interestingly, calibrating neural nets with PLT or ISO hurts their calibration, and some algorithms, such as memory-based methods (e.g., KNN), are unaffected by calibration.

17 [no text on this slide in the transcript]

18 Letter - boosted DT (PLT); Cover Type - boosted DT (PLT); Adult - boosted stumps (PLT); Protein coding - boosted DT (PLT); MEDIS - random forest (PLT); MG - bagged DT; IndianPine92 - boosted DT (PLT); California Housing - boosted DT (PLT); Bacteria - bagged DT; SLAC - random forest (ISO)

19 Neural nets perform well on all metrics on 10 of the 11 problems, but perform poorly on COD. If the COD problem had not been included, neural nets would move up 1-2 places in the rankings.

20 Bootstrap analysis

21 Randomly select a bootstrap sample from the original 11 problems; randomly select a bootstrap sample of 8 metrics from the original 8 metrics; rank the ten algorithms by mean performance across the sampled problems and metrics; repeat the bootstrap sampling 1000 times, yielding 1000 potentially different rankings of the learning methods (a sketch of this procedure follows below).
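
A sketch of this bootstrap, assuming `perf[a, p, m]` holds algorithm a's (normalized) score on problem p under metric m; the array here is random placeholder data:

```python
import numpy as np

rng = np.random.default_rng(0)
n_alg, n_prob, n_met = 10, 11, 8
perf = rng.random((n_alg, n_prob, n_met))   # placeholder scores

rank_counts = np.zeros((n_alg, n_alg))      # [algorithm, rank] tallies
for _ in range(1000):
    # Bootstrap samples (with replacement) of problems and of metrics.
    probs = rng.integers(0, n_prob, size=n_prob)
    mets = rng.integers(0, n_met, size=n_met)
    # Rank the algorithms by mean performance on the sampled cells.
    mean_perf = perf[:, probs][:, :, mets].mean(axis=(1, 2))
    for rank, alg in enumerate(np.argsort(-mean_perf)):   # best first
        rank_counts[alg, rank] += 1

rank_probs = rank_counts / 1000   # P(algorithm finishes at each rank)
```

The table on slide 24 reports rank probabilities of exactly this form for the real results.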

22 [no text on this slide in the transcript]

23 [no text on this slide in the transcript]

24 Model     1st   2nd   3rd   4th   5th   6th   7th   8th   9th   10th
   Bst DT   .580  .228  .160  .023  .009  .000  .000  .000  .000  .000
   RF       .390  .525  .084  .001  .000  .000  .000  .000  .000  .000
   Bag DT   .030  .232  .571  .150  .017  .000  .000  .000  .000  .000
   SVM      .000  .008  .148  .574  .240  .029  .001  .000  .000  .000
   ANN      .000  .007  .035  .230  .606  .122  .000  .000  .000  .000
   KNN      .000  .000  .000  .009  .114  .592  .245  .038  .002  .000
   Bst stm  .000  .000  .002  .013  .014  .257  .710  .004  .000  .000
   DT       .000  .000  .000  .000  .000  .000  .004  .616  .291  .089
   logreg   .000  .000  .000  .000  .000  .000  .040  .312  .423  .225
   NB       .000  .000  .000  .000  .000  .000  .000  .030  .284  .686

25 The models that performed worst were naive Bayes, logistic regression, decision trees, and boosted stumps. Bagged trees, random forests, and neural nets give the best average performance without calibration. After calibration with Platt's method, boosted trees predict better probabilities than all other methods. At the same time, boosted stumps and logistic regression, which perform poorly on average, are the best models for some metrics. The effectiveness of an algorithm depends on the metric used and the dataset.

26 The End

