A Brief Introduction and Issues on the Classification Problem Jin Mao Postdoc, School of Information, University of Arizona Sept 18, 2015.


1 A Brief Introduction and Issues on the Classification Problem Jin Mao Postdoc, School of Information, University of Arizona Sept 18, 2015

2 Outline  Classification Examples  Classic Classifiers  Train the Classifier  Evaluation Method  Apply the Classifier

3 Classification Examples  Spam email filtering  Fraud detection  Self-driving automobiles

4–6 The Classification Problem (three figure slides illustrating the classification problem; images not preserved in the transcript)

7 Classic Classifiers  Naïve Bayes  Decision Tree: J48 (C4.5)  KNN  Random Forest  SVM: SMO, LibSVM  Neural Network  …

8 How to Choose the Classifier?  Observe your data: amount, features  Consider your application: precision/recall needs, explainability, incremental updates, complexity  Decision trees are easy to interpret, but cannot predict numerical values and can be slow.  Naïve Bayes is fairly robust and easy to update incrementally.  Neural networks and SVMs are "black boxes"; a trained SVM is fast at yes/no prediction.  ! Never mind: you can try all of them.  Model selection with cross validation
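The slides contain no code, but the "try them all and compare with cross validation" advice can be sketched in plain Python. The example below is a minimal illustration, not the talk's method: it compares two hypothetical classifiers (a majority-class baseline and 1-nearest-neighbour) on a made-up two-blob dataset using k-fold cross validation.

```python
import random

def majority_predict(train_y, _x):
    # Baseline: always predict the most common training label.
    return max(set(train_y), key=train_y.count)

def nn_predict(train_X, train_y, x):
    # 1-nearest-neighbour: return the label of the closest training point.
    dists = [(sum((a - b) ** 2 for a, b in zip(p, x)), y)
             for p, y in zip(train_X, train_y)]
    return min(dists)[1]

def cv_accuracy(X, y, predict, k=5):
    # k-fold cross validation: each fold serves once as the test set.
    idx = list(range(len(X)))
    random.Random(0).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    accs = []
    for f in folds:
        train = [i for i in idx if i not in f]
        tX = [X[i] for i in train]
        ty = [y[i] for i in train]
        hits = sum(predict(tX, ty, X[i]) == y[i] for i in f)
        accs.append(hits / len(f))
    return sum(accs) / k

# Toy data: two well-separated clusters, one per class.
rng = random.Random(1)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)] + \
    [(rng.gauss(4, 1), rng.gauss(4, 1)) for _ in range(50)]
y = [0] * 50 + [1] * 50

base = cv_accuracy(X, y, lambda tX, ty, x: majority_predict(ty, x))
knn = cv_accuracy(X, y, nn_predict)
print(f"majority baseline: {base:.2f}, 1-NN: {knn:.2f}")
```

The same loop extends to any number of candidate classifiers; the one with the best cross-validated score is kept.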

9 How to Choose the Classifier?

10  Discussions:  http://stackoverflow.com/questions/2595176/when-to-choose-which-machine-learning-classifier  http://stats.stackexchange.com/questions/7610/top-five-classifiers-to-try-first  http://nlp.stanford.edu/IR-book/html/htmledition/choosing-what-kind-of-classifier-to-use-1.html  http://www.researchgate.net/post/How_to_decide_the_best_classifier_based_on_the_data-set_provided

11 Train Your Classifier

12 Obtain Training Set  Instances should be labeled.  From running systems in practice  Annotation by multiple experts (check inter-rater agreement)  Crowdsourcing (e.g., Google's reCAPTCHA)  …

13 Obtain Training Set  Large enough  More data can reduce the effect of noise  The benefit of enough data can even dominate the choice of classification algorithm  Redundant data helps little.  Selection strategies: nearest neighbors, ordered removals, random sampling, particle swarms, or evolutionary methods

14 Obtain Training Set  Unbalanced training instances across classes  Evaluation: with simple measures such as precision/recall, a classifier that favors the majority class (the class with many samples) can still score high; AUC is a better measure.  The minority class may not provide enough information for the features to locate the class boundaries.

15 Obtain Training Set  Strategies:  Divide the majority class into L distinct clusters, train L predictors, and average them into the final one.  Generate synthetic data for the rare class (SMOTE).  Reduce the imbalance level: cut down the majority class.  …
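As a minimal sketch of the "cut down the majority class" strategy, the plain-Python helper below (the name `undersample` is illustrative, not from the slides) randomly drops majority-class samples until every class matches the size of the smallest one.

```python
import random
from collections import Counter

def undersample(X, y, seed=0):
    # Randomly drop samples so every class shrinks to the minority size.
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    n_min = min(len(v) for v in by_class.values())
    rng = random.Random(seed)
    Xb, yb = [], []
    for label, samples in by_class.items():
        for xi in rng.sample(samples, n_min):  # sample without replacement
            Xb.append(xi)
            yb.append(label)
    return Xb, yb

# 90 majority-class vs 10 minority-class samples.
X = [[i] for i in range(100)]
y = [0] * 90 + [1] * 10
Xb, yb = undersample(X, y)
print(Counter(yb))  # prints Counter({0: 10, 1: 10})
```

Undersampling discards data, which is why the slide's other strategies (cluster-and-average, or synthesizing minority samples as SMOTE does) are often preferred when the minority class is very small.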

16 Obtain Training Set  More materials  https://www.quora.com/In-classification-how-do-you-handle-an-unbalanced-training-set  http://stats.stackexchange.com/questions/57259/highly-unbalanced-test-data-set-and-balanced-training-data-in-classification  He, Haibo, and Edwardo A. Garcia. "Learning from imbalanced data." IEEE Transactions on Knowledge and Data Engineering 21, no. 9 (2009): 1263–1284.

17 Feature Selection  Why  Unrelated features → noise, heavy computation  Interdependent features → redundant features  Better model  http://machinelearningmastery.com/an-introduction-to-feature-selection/  Guyon and Elisseeff, "An Introduction to Variable and Feature Selection" (PDF)

18 Feature Selection  Feature Selection Methods  Filter methods: apply a statistical measure to assign a score to each feature. E.g., the chi-squared test, information gain, and correlation coefficient scores.  Wrapper methods: treat the selection of a set of features as a search problem.  Embedded methods: learn which features best contribute to the accuracy of the model while the model is being created. E.g., LASSO, Elastic Net, and Ridge Regression.
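A filter method fits in a few lines. The sketch below scores each feature by its absolute correlation with the label (one of the measures the slide names) and keeps the top k; `filter_select` is an illustrative helper, not a library function.

```python
def pearson(a, b):
    # Pearson correlation; returns 0.0 for constant (zero-variance) inputs.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb) if sa * sb else 0.0

def filter_select(X, y, k):
    # Score each feature column by |correlation with the label|, keep top k.
    n_feat = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_feat)]
    ranked = sorted(range(n_feat), key=lambda j: scores[j], reverse=True)
    return ranked[:k], scores

# Feature 0 tracks the label, feature 1 is constant, feature 2 anti-correlates.
X = [[0, 5, 1], [0, 5, 1], [1, 5, 0], [1, 5, 0], [0, 5, 1], [1, 5, 0]]
y = [0, 0, 1, 1, 0, 1]
top, scores = filter_select(X, y, 2)
print(top)  # prints [0, 2]
```

Note that a correlation filter scores features independently, so it cannot detect the interdependent (redundant) features mentioned on the previous slide; wrapper or embedded methods handle those cases.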

19 Evaluation Method  Basic evaluation measures  Precision  Confusion matrix  Per-class accuracy  AUC (Area Under the Curve)  The ROC curve shows the sensitivity of the classifier by plotting the true-positive rate against the false-positive rate
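The measures above can be computed directly. A small plain-Python sketch, on made-up predictions, of a confusion matrix, per-class accuracy, and AUC; the AUC here uses its rank interpretation (the probability that a randomly chosen positive is scored above a randomly chosen negative), which equals the area under the ROC curve.

```python
def confusion_matrix(y_true, y_pred, labels):
    # m[(t, p)] counts instances of true class t predicted as class p.
    m = {(t, p): 0 for t in labels for p in labels}
    for t, p in zip(y_true, y_pred):
        m[(t, p)] += 1
    return m

def auc(y_true, scores):
    # Fraction of (positive, negative) pairs ranked correctly; ties count half.
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]  # classifier confidence for class 1

cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
per_class = {c: cm[(c, c)] / sum(cm[(c, p)] for p in [0, 1]) for c in [0, 1]}
print(cm[(0, 0)], cm[(0, 1)], cm[(1, 0)], cm[(1, 1)])  # prints 2 1 1 2
print(round(auc(y_true, scores), 3))                    # prints 0.889
```

Unlike accuracy, the AUC is computed from scores rather than hard predictions, which is why it remains informative on the unbalanced datasets discussed earlier.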

20 Evaluation Method  Cross Validation  Random Subsampling  K-fold Cross Validation  Leave-one-out Cross Validation
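The splitting schemes listed above differ only in how the sample indices are partitioned. A minimal sketch (the helper name `kfold_indices` is illustrative); leave-one-out falls out of k-fold by setting k equal to the number of samples.

```python
def kfold_indices(n, k):
    # Yield (train, test) index lists; each sample is tested exactly once.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

splits = list(kfold_indices(10, 5))   # 5-fold: 5 splits, 2 test samples each
loo = list(kfold_indices(10, 10))     # leave-one-out: k = n
print(len(splits), len(loo))          # prints 5 10
```

Random subsampling differs in that test sets are drawn independently each round, so a sample may appear in several test sets or in none.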

21 Cross Validation  Random Subsampling

22 Cross Validation  K-fold Cross Validation

23 Cross Validation  Leave-one-out Cross Validation

24 Cross Validation  Three-way data splits

25 Apply the Classifier  Save the model  Make the model dynamic (update it as new labeled data arrives)
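Saving the model can be done with Python's standard pickle module, assuming the trained model is a picklable object; the dict below is a stand-in for any fitted classifier.

```python
import os
import pickle
import tempfile

# Stand-in for a trained classifier (any picklable object works).
model = {"weights": [0.3, -1.2], "bias": 0.5}

path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)      # persist once, right after training

with open(path, "rb") as f:
    restored = pickle.load(f)  # reload at prediction time

print(restored == model)       # prints True
```

Persisting the model separates the (slow) training phase from the (fast) prediction phase, and re-running the dump step on a schedule is one simple way to keep the deployed model dynamic.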

26 Thank you!

