A Brief Introduction and Issues on the Classification Problem Jin Mao Postdoc, School of Information, University of Arizona Sept 18, 2015.


1 A Brief Introduction and Issues on the Classification Problem Jin Mao Postdoc, School of Information, University of Arizona Sept 18, 2015

2 Outline  Classification Examples  Classic Classifiers  Train the Classifier  Evaluation Method  Apply the Classifier

3 Classification Examples  Spam email filtering  Fraud detection  Self-driving automobiles

4–6 The Classification Problem (three figure slides illustrating the classification problem; images not preserved in the transcript)

7 Classic Classifiers  Naïve Bayes  Decision Tree: J48 (C4.5)  KNN  Random Forest  SVM: SMO, LibSVM  Neural Network  …

8 How to Choose the Classifier?  Observe your data: amount, features  Consider your application: precision/recall needs, explainability, incremental updates, complexity  Decision trees are easy to interpret, but cannot predict numerical values and can be slow.  Naïve Bayes is fairly robust and easy to update incrementally.  Neural networks and SVMs are "black boxes"; a trained SVM is fast at yes/no prediction.  ! Never mind: you can try all of them.  Model selection with cross validation
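The slides contain no code, but the "try them all and compare with cross validation" advice can be sketched in plain Python. The example below is a minimal illustration, not the talk's method: it compares two hypothetical classifiers (a majority-class baseline and 1-nearest-neighbour) on a made-up two-blob dataset using k-fold cross validation.

```python
import random

def majority_predict(train_y, _x):
    # Baseline: always predict the most common training label.
    return max(set(train_y), key=train_y.count)

def nn_predict(train_X, train_y, x):
    # 1-nearest-neighbour: return the label of the closest training point.
    dists = [(sum((a - b) ** 2 for a, b in zip(p, x)), y)
             for p, y in zip(train_X, train_y)]
    return min(dists)[1]

def cv_accuracy(X, y, predict, k=5):
    # k-fold cross validation: each fold serves once as the test set.
    idx = list(range(len(X)))
    random.Random(0).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    accs = []
    for f in folds:
        train = [i for i in idx if i not in f]
        tX = [X[i] for i in train]
        ty = [y[i] for i in train]
        hits = sum(predict(tX, ty, X[i]) == y[i] for i in f)
        accs.append(hits / len(f))
    return sum(accs) / k

# Toy data: two well-separated clusters, one per class.
rng = random.Random(1)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)] + \
    [(rng.gauss(4, 1), rng.gauss(4, 1)) for _ in range(50)]
y = [0] * 50 + [1] * 50

base = cv_accuracy(X, y, lambda tX, ty, x: majority_predict(ty, x))
knn = cv_accuracy(X, y, nn_predict)
print(f"majority baseline: {base:.2f}, 1-NN: {knn:.2f}")
```

The same loop extends to any number of candidate classifiers; the one with the best cross-validated score is kept.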

9 How to Choose the Classifier?

10  Discussions:  http://stackoverflow.com/questions/2595176/when-to-choose-which-machine-learning-classifier  http://stats.stackexchange.com/questions/7610/top-five-classifiers-to-try-first  http://nlp.stanford.edu/IR-book/html/htmledition/choosing-what-kind-of-classifier-to-use-1.html  http://www.researchgate.net/post/How_to_decide_the_best_classifier_based_on_the_data-set_provided

11 Train Your Classifier

12 Obtain Training Set  Instances should be labeled.  From running systems in practice  Annotation by multiple experts (check inter-rater agreement)  Crowdsourcing (e.g., Google's reCAPTCHA)  …

13 Obtain Training Set  Large enough  More data can reduce the effect of noise  The benefit of enough data can even dominate the choice of classification algorithm  Redundant data helps little.  Selection strategies: nearest neighbors, ordered removals, random sampling, particle swarms, or evolutionary methods

14 Obtain Training Set  Unbalanced training instances across classes  Evaluation: with simple measures such as precision/recall, a classifier that favors the majority class (the class with many samples) can still score high; AUC is a better measure.  The minority class may not provide enough information for the features to locate the class boundaries.

15 Obtain Training Set  Strategies:  Divide the majority class into L distinct clusters, train L predictors, and average them into the final one.  Generate synthetic data for the rare class (SMOTE).  Reduce the imbalance level: cut down the majority class.  …
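As a minimal sketch of the "cut down the majority class" strategy, the plain-Python helper below (the name `undersample` is illustrative, not from the slides) randomly drops majority-class samples until every class matches the size of the smallest one.

```python
import random
from collections import Counter

def undersample(X, y, seed=0):
    # Randomly drop samples so every class shrinks to the minority size.
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    n_min = min(len(v) for v in by_class.values())
    rng = random.Random(seed)
    Xb, yb = [], []
    for label, samples in by_class.items():
        for xi in rng.sample(samples, n_min):  # sample without replacement
            Xb.append(xi)
            yb.append(label)
    return Xb, yb

# 90 majority-class vs 10 minority-class samples.
X = [[i] for i in range(100)]
y = [0] * 90 + [1] * 10
Xb, yb = undersample(X, y)
print(Counter(yb))  # prints Counter({0: 10, 1: 10})
```

Undersampling discards data, which is why the slide's other strategies (cluster-and-average, or synthesizing minority samples as SMOTE does) are often preferred when the minority class is very small.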

16 Obtain Training Set  More materials  https://www.quora.com/In-classification-how-do-you-handle-an-unbalanced-training-set  http://stats.stackexchange.com/questions/57259/highly-unbalanced-test-data-set-and-balanced-training-data-in-classification  He, Haibo, and Edwardo A. Garcia. "Learning from imbalanced data." IEEE Transactions on Knowledge and Data Engineering 21, no. 9 (2009): 1263–1284.

17 Feature Selection  Why  Unrelated features → noise, heavy computation  Interdependent features → redundant features  Better model  http://machinelearningmastery.com/an-introduction-to-feature-selection/  Guyon and Elisseeff, "An Introduction to Variable and Feature Selection" (PDF)

18 Feature Selection  Feature Selection Methods  Filter methods: apply a statistical measure to assign a score to each feature. E.g., the chi-squared test, information gain, and correlation coefficient scores.  Wrapper methods: treat the selection of a set of features as a search problem.  Embedded methods: learn which features best contribute to the accuracy of the model while the model is being created. E.g., LASSO, Elastic Net, and Ridge Regression.
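A filter method fits in a few lines. The sketch below scores each feature by its absolute correlation with the label (one of the measures the slide names) and keeps the top k; `filter_select` is an illustrative helper, not a library function.

```python
def pearson(a, b):
    # Pearson correlation; returns 0.0 for constant (zero-variance) inputs.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb) if sa * sb else 0.0

def filter_select(X, y, k):
    # Score each feature column by |correlation with the label|, keep top k.
    n_feat = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_feat)]
    ranked = sorted(range(n_feat), key=lambda j: scores[j], reverse=True)
    return ranked[:k], scores

# Feature 0 tracks the label, feature 1 is constant, feature 2 anti-correlates.
X = [[0, 5, 1], [0, 5, 1], [1, 5, 0], [1, 5, 0], [0, 5, 1], [1, 5, 0]]
y = [0, 0, 1, 1, 0, 1]
top, scores = filter_select(X, y, 2)
print(top)  # prints [0, 2]
```

Note that a correlation filter scores features independently, so it cannot detect the interdependent (redundant) features mentioned on the previous slide; wrapper or embedded methods handle those cases.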

19 Evaluation Method  Basic evaluation measures  Precision  Confusion matrix  Per-class accuracy  AUC (Area Under the Curve)  The ROC curve shows the sensitivity of the classifier by plotting the true-positive rate against the false-positive rate
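The measures above can be computed directly. A small plain-Python sketch, on made-up predictions, of a confusion matrix, per-class accuracy, and AUC; the AUC here uses its rank interpretation (the probability that a randomly chosen positive is scored above a randomly chosen negative), which equals the area under the ROC curve.

```python
def confusion_matrix(y_true, y_pred, labels):
    # m[(t, p)] counts instances of true class t predicted as class p.
    m = {(t, p): 0 for t in labels for p in labels}
    for t, p in zip(y_true, y_pred):
        m[(t, p)] += 1
    return m

def auc(y_true, scores):
    # Fraction of (positive, negative) pairs ranked correctly; ties count half.
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]  # classifier confidence for class 1

cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
per_class = {c: cm[(c, c)] / sum(cm[(c, p)] for p in [0, 1]) for c in [0, 1]}
print(cm[(0, 0)], cm[(0, 1)], cm[(1, 0)], cm[(1, 1)])  # prints 2 1 1 2
print(round(auc(y_true, scores), 3))                    # prints 0.889
```

Unlike accuracy, the AUC is computed from scores rather than hard predictions, which is why it remains informative on the unbalanced datasets discussed earlier.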

20 Evaluation Method  Cross Validation  Random Subsampling  K-fold Cross Validation  Leave-one-out Cross Validation
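The splitting schemes listed above differ only in how the sample indices are partitioned. A minimal sketch (the helper name `kfold_indices` is illustrative); leave-one-out falls out of k-fold by setting k equal to the number of samples.

```python
def kfold_indices(n, k):
    # Yield (train, test) index lists; each sample is tested exactly once.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

splits = list(kfold_indices(10, 5))   # 5-fold: 5 splits, 2 test samples each
loo = list(kfold_indices(10, 10))     # leave-one-out: k = n
print(len(splits), len(loo))          # prints 5 10
```

Random subsampling differs in that test sets are drawn independently each round, so a sample may appear in several test sets or in none.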

21 Cross Validation  Random Subsampling

22 Cross Validation  K-fold Cross Validation

23 Cross Validation  Leave-one-out Cross Validation

24 Cross Validation  Three-way data splits

25 Apply the Classifier  Save the model  Make the model dynamic (update it as new labeled data arrives)
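Saving the model can be done with Python's standard pickle module, assuming the trained model is a picklable object; the dict below is a stand-in for any fitted classifier.

```python
import os
import pickle
import tempfile

# Stand-in for a trained classifier (any picklable object works).
model = {"weights": [0.3, -1.2], "bias": 0.5}

path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)      # persist once, right after training

with open(path, "rb") as f:
    restored = pickle.load(f)  # reload at prediction time

print(restored == model)       # prints True
```

Persisting the model separates the (slow) training phase from the (fast) prediction phase, and re-running the dump step on a schedule is one simple way to keep the deployed model dynamic.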

26 Thank you!

