Pfizer HTS Machine Learning Algorithms: November 2002

1 Pfizer HTS Machine Learning Algorithms: November 2002
Paul Hsiung, Paul Komarek, Ting Liu, Andrew W. Moore
Auton Lab, School of Computer Science, Carnegie Mellon University (www.autonlab.org)

2 Datasets

Our Name   Records  Attributes  Non-zero input cells  Positive outputs  Description
train1     26,733   6,348       3.7M                  804               The original dataset sent to CMU in Feb 2002
test1      1,456    6,121       0.2M                  878               The test set associated with the above training set
jun-3-1    88,358   1,143,054   30M                   423               The large "TEST3" dataset sent to us in May; the "-1" denotes that we used the first of the four activation columns
combined   -        -           -                     211               The "TEST3" datasets combined; the activation is positive if and only if at least two of the four original activations were positive

3 Projections

Original name  100-dimensional projection  10-dimensional projection
train1         train100                    train10
test1          test100                     test10
train1         train-pls-100               train-pls-10
test1          test-pls-100                test-pls-10
jun-3-1        n/a                         n/a
combined       n/a                         n/a

(The -pls- names denote projections made with PLS rather than PCA; see slide 5.)

4 Previous Algorithms
BC: Bayes Classifier. On the original data a naïve categorical classifier was used; on real-valued projected data a naïve Gaussian classifier was used.
Dtree: Decision Tree. This technique is also known as Recursive Partitioning and CART. Implemented only for the original data.
SVM: Support Vector Machine. Except where stated otherwise, a linear SVM was used; we could find no significant performance difference between the linear SVM and a radial-basis-function SVM across a variety of RBF parameters.
k-NN: k-nearest neighbor. Except where stated otherwise, k=9 neighbors were used. Implemented only for projected data.
LR: Logistic Regression. Except where stated otherwise, conjugate gradient was used to perform the intermediate weighted regressions, using a newly developed technique.
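The LR entry above mentions using conjugate gradient for the intermediate weighted regressions. As a rough illustration of that idea only (this is a generic IRLS-with-CG outline, not the Auton Lab code; the small ridge term `lam` and all names are our own additions for numerical stability):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cg_solve(A, b, iters=50, tol=1e-10):
    """Plain conjugate gradient for the symmetric positive-definite system A w = b."""
    w = np.zeros_like(b)
    r = b.copy()           # residual at w = 0
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        if rs < tol:
            break
        Ap = A @ p
        alpha = rs / (p @ Ap)
        w += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return w

def logistic_regression_irls_cg(X, y, outer_iters=20, lam=1e-3):
    """IRLS for (lightly ridge-regularized) logistic regression: each Newton
    step is a weighted least-squares problem, solved here with conjugate
    gradient rather than a direct matrix factorization."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(outer_iters):
        mu = sigmoid(X @ w)
        s = mu * (1.0 - mu)                           # IRLS weights
        A = X.T @ (X * s[:, None]) + lam * np.eye(d)  # Hessian of penalized loss
        g = X.T @ (y - mu) - lam * w                  # ascent direction
        w += cg_solve(A, g)
    return w
```

Avoiding the direct solve matters when the attribute count is in the thousands (as in train1) because CG only needs matrix-vector products.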

5 New Algorithms
new-KNN: Tractable high-dimensional k-nearest neighbor. Can work on the 1,000,000-dimensional "June" data.
EFP: Explicit False Positive logistic regression. Logistic regression that accounts for the high false-positive rate.
SMod: Super Model. Automatically combines the predictions of multiple algorithms with a "meta-level" of logistic regression.
PLS-proj: Partial Least Squares projection. Uses PLS instead of PCA to project the data down.
PLS: Partial Least Squares prediction. Uses the PLS algorithm itself as a predictor.
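The slides do not spell out the EFP likelihood. One plausible way to make logistic regression "account for the high false positive rate", offered purely as an illustrative assumption (the mixture below and the fixed rate `f` are our own guesses, not the presenters' model), is to say an observed positive is either a true activation or a screen false positive:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def efp_nll(w, X, y, f):
    """Negative log-likelihood of the assumed model
    P(y = 1 | x) = f + (1 - f) * sigmoid(w . x),
    i.e. a label can be positive via a false-positive event of rate f."""
    p = f + (1.0 - f) * sigmoid(X @ w)
    eps = 1e-12
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def fit_efp(X, y, f, lr=0.1, iters=2000):
    """Fit w by plain gradient descent on the assumed EFP negative log-likelihood."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = sigmoid(X @ w)
        p = f + (1.0 - f) * mu
        # chain rule: dp/dw = (1 - f) * mu * (1 - mu) * x
        grad = -X.T @ ((y / p - (1 - y) / (1 - p)) * (1.0 - f) * mu * (1 - mu))
        w -= lr * grad
    return w
```

Under this assumption the decision boundary is pushed away from the false-positive-contaminated region, which is consistent with the 2-D examples on the following slides.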

6 Explicit False Positive Model

7 Explicit False Positive Model

8 Example in 2 dimensions: Decision Boundary

9 Example in 2 dimensions: 100 true positives

10 100 true positives and 100 true negatives

11 100 TP, 100 TN, 10 FP

12 Using regular logistic regression

13 Using EFP Model

14 Example: 10000 true positives

15 10000 true positives, 10000 true negatives

16 10000 TP, 10000 TN, 1000 FP

17 Using regular logistic regression

18 Using EFP Model

19 EFP Model: Real Data Results (k-fold)

20 EFP Effect …Very impressive on Train1 / Test1

21 Log X-axis

22 EFP Effect …Unimpressive on jun-3-1 / jun-3-2

23 Super Model
1. Divide the training set into Compartment A and Compartment B.
2. Learn each of the N models on Compartment A.
3. Predict with each of the N models on Compartment B.
4. Learn the best weighting of their opinions by logistic regression over the predictions on Compartment B.
5. Apply the models and their weights to the test data.
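The five steps above can be sketched as follows. This is a generic stacking outline under our own assumptions (a 50/50 compartment split and a small gradient-descent logistic regression for the meta-level), not the SMod implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_meta_lr(P, y, lr=0.5, iters=500):
    """Meta-level logistic regression over base-model predictions P
    (one column per base model; an intercept column is added here)."""
    Z = np.hstack([np.ones((P.shape[0], 1)), P])
    w = np.zeros(Z.shape[1])
    for _ in range(iters):
        mu = sigmoid(Z @ w)
        w += lr * Z.T @ (y - mu) / len(y)   # gradient ascent on log-likelihood
    return w

def super_model(train_X, train_y, test_X, base_fits):
    """base_fits: list of functions (X, y) -> model, where a model maps X -> scores."""
    n = len(train_y)
    # 1. Divide training data into Compartment A and Compartment B.
    A, B = slice(0, n // 2), slice(n // 2, n)
    # 2. Learn each base model on Compartment A.
    models = [fit(train_X[A], train_y[A]) for fit in base_fits]
    # 3. Predict with each model on Compartment B.
    P_B = np.column_stack([m(train_X[B]) for m in models])
    # 4. Learn the best weighting of opinions by logistic regression on B.
    w = fit_meta_lr(P_B, train_y[B])
    # 5. Apply the models and their weights to test data.
    P_test = np.column_stack([m(test_X) for m in models])
    Z = np.hstack([np.ones((P_test.shape[0], 1)), P_test])
    return sigmoid(Z @ w)
```

Fitting the meta-level on a compartment the base models never saw keeps the learned weights from simply rewarding whichever model memorized the training data best.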

24 Comparison

25 Log X-Axis Scale

26 Comparison on 100-dims

27 Log X-axis

28 Comparison on 10 dims

29 Log X-axis

30 NewKNN summary of results and timings

45 PLS summary of results
PLS projections did not do well. PLS as a predictor, however, performed well, especially on train100/test100. PLS is fast: runtimes vary from 1 to 10 minutes. But PLS takes large amounts of memory and cannot be used with a sparse representation, because of the update it performs on each iteration.
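The memory point above comes from the deflation step in the standard NIPALS formulation of PLS: each component subtracts a dense rank-one matrix from X, so zeros fill in and a sparse X cannot stay sparse. A minimal PLS1 sketch (standard NIPALS, not the presenters' code) showing that step:

```python
import numpy as np

def pls1_nipals(X, y, n_components):
    """PLS1 via NIPALS. The deflation X -= t p^T is a dense rank-one
    update: it fills in zeros, which is why a sparse X representation
    cannot survive across iterations (the memory issue noted above)."""
    X = X.astype(float).copy()
    y = y.astype(float).copy()
    W, P, Q = [], [], []
    for _ in range(n_components):
        w = X.T @ y
        w /= np.linalg.norm(w)      # weight vector
        t = X @ w                   # component scores
        tt = t @ t
        p = X.T @ t / tt            # X loadings
        q = (y @ t) / tt            # y loading
        X -= np.outer(t, p)         # dense deflation: destroys sparsity
        y -= q * t
        W.append(w); P.append(p); Q.append(q)
    return np.array(W).T, np.array(P).T, np.array(Q)

def pls1_predict(X, W, P, Q):
    """Regression coefficients B = W (P^T W)^{-1} q, applied to new data."""
    B = W @ np.linalg.solve(P.T @ W, Q)
    return X @ B
```

By contrast, the conjugate-gradient logistic regression used elsewhere in the deck only ever multiplies by X, which is why it can keep the data sparse.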

72 Summary of results
- SVM is best early on for Train1; LR is better over the long haul.
- Projecting to 10 dimensions is always a disaster.
- Projecting to 100 dimensions is often indistinguishable from using the original data (and much cheaper).
- The naïve Gaussian Bayes classifier is best on JUN-3-1 (k-NN is better over the long haul).
- The naïve Gaussian Bayes classifier is best on combined.
- A non-linear SVM never seems distinguishable from a linear SVM.
- Every method has won in at least one context, except Dtree.

73 Some AUC Results

Experiment                        Algorithm            AUC
Train on Train1, test on Test1    Linear SVM           0.876*
                                  Best non-linear SVM  0.875*
                                  BC                   0.867*
                                  LR                   0.71
                                  KNN                  0.872*
                                  DTree                0.70
Combined                          SVM                  0.638
                                                       0.700
                                                       0.606
                                                       0.603

* = not statistically significantly different

74 Some AUC Results

Experiment                          Algorithm   AUC
10-fold cross-validation on Train1  Linear SVM  0.919
                                    BC          0.885
                                    LR          0.933
                                    DTree       0.894
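The AUC figures on these two slides can be computed via the rank-based (Mann-Whitney) identity: AUC equals the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counting one half. A minimal sketch of that computation:

```python
def auc(scores, labels):
    """AUC via the Mann-Whitney identity: fraction of (positive, negative)
    pairs in which the positive outscores the negative, ties worth 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

The pairwise loop is O(|pos| * |neg|); for large screens one would sort once and use ranks instead, but the quantity computed is the same.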

