Classification of class-imbalanced data


Classification of class-imbalanced data
Advisor: Dr. Vahidipour
Zahra Salimian, Shaghayegh Jalali
Dec 2017

Class Imbalance

Class Imbalance Problem

What is the Class Imbalance Problem? It is the situation where the total number of instances of one class (the positive class) is far smaller than the total number of instances of another class (the negative class). This problem is extremely common in practice and can be observed in various disciplines, including fraud detection, anomaly detection, medical diagnosis, oil spill detection, and facial recognition.

Why is it a problem? Most machine learning algorithms work best when the number of instances of each class is roughly equal. When the number of instances of one class far exceeds the other, problems arise.

Solutions to imbalanced learning
- Sampling methods
- Cost-sensitive methods
- Kernel-based and active learning methods

Sampling methods
- Oversampling: adding more copies of the minority class so it has more effect on the machine learning algorithm
- Under-sampling: removing some of the majority class so it has less effect on the machine learning algorithm

Oversampling
Simply duplicating minority-class examples can lead the classifier to overfit to those few examples, as illustrated below:

The left-hand side shows the data before oversampling, whereas the right-hand side shows the data after oversampling has been applied. On the right, the thick positive signs indicate multiple repeated copies of the same data instance. The machine learning algorithm sees these cases many times and therefore overfits to them specifically, resulting in the blue decision boundary shown above.
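A minimal sketch of random oversampling, assuming a NumPy feature matrix X and label vector y (hypothetical data, not from the slides); the rows sampled with replacement are exactly the repeated "thick" points in the illustration.

```python
import numpy as np

def random_oversample(X, y, minority_label=1, seed=0):
    """Duplicate random minority rows until both classes are the same size."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == minority_label)
    n_majority = int(np.sum(y != minority_label))  # assumes minority is the smaller class
    # Sampling with replacement creates the repeated copies the classifier overfits to.
    extra = rng.choice(minority_idx, size=n_majority - len(minority_idx), replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])
```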

Under-sampling
By undersampling, we risk removing majority-class instances that are more representative, thus discarding useful information. This can be illustrated as follows:

Here the green line is the ideal decision boundary we would like to have, and the blue line is the actual result. The left side shows the result of applying a general machine learning algorithm without undersampling. On the right, we undersampled the negative class but removed some informative negative examples, which caused the blue decision boundary to become slanted, so some negative instances are wrongly classified as positive.
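The mirror-image sketch for random under-sampling, under the same hypothetical X, y assumption; the rows dropped here are where informative majority examples can be lost.

```python
import numpy as np

def random_undersample(X, y, majority_label=0, seed=0):
    """Keep all minority rows plus an equal-sized random subset of majority rows."""
    rng = np.random.default_rng(seed)
    majority_idx = np.flatnonzero(y == majority_label)
    minority_idx = np.flatnonzero(y != majority_label)
    # Discarding the unchosen majority rows may remove informative examples.
    kept = rng.choice(majority_idx, size=len(minority_idx), replace=False)
    keep = np.concatenate([minority_idx, kept])
    return X[keep], y[keep]
```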

Threshold-moving
Sampling and cost-sensitive methods require a performance measure to be specified a priori, before learning. An alternative is a so-called threshold-moving method, which changes the decision threshold of a model a posteriori to counteract the imbalance, and thus has the potential to adapt to the performance measure of interest. Surprisingly, little attention has been paid to the potential of combining a bagging ensemble with threshold-moving. This combination preserves the natural class distribution of the data, resulting in well-calibrated posterior probabilities.
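A minimal sketch of bagging plus threshold-moving, assuming scikit-learn (my choice of library; the threshold value is illustrative): the ensemble is trained on the natural class distribution, and only the decision threshold is moved a posteriori.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

def bagged_threshold_predict(X_train, y_train, X_test, threshold=0.2):
    """Train on the natural distribution, then move the default 0.5 threshold."""
    ensemble = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                                 random_state=0)
    ensemble.fit(X_train, y_train)
    # Averaged tree votes serve as (approximately calibrated) posterior probabilities.
    p_minority = ensemble.predict_proba(X_test)[:, 1]
    # A threshold below 0.5 counteracts the bias against the minority (positive) class.
    return (p_minority >= threshold).astype(int)
```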

Cost-Sensitive Methods
- Utilize cost-sensitive methods for imbalanced learning
- Consider the cost of misclassification, instead of modifying the data

Cost-Sensitive Learning Framework
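The body of this slide appears to have been lost in the transcript. As background (an assumption, not recovered slide content), the standard cost-sensitive framework defines a cost matrix C(i, j), the cost of predicting class i when the true class is j, and classifies by minimizing the conditional risk:

R(i \mid x) = \sum_j P(j \mid x)\, C(i, j), \qquad \hat{y}(x) = \arg\min_i R(i \mid x)

With equal costs this reduces to ordinary Bayes classification; raising the misclassification cost of the minority class moves the decision boundary so that more instances are labeled as minority.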

Cost-Sensitive Dataspace Weighting with Adaptive Boosting

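The body of this slide is likewise missing from the transcript. As an assumption about the intended content: cost-sensitive boosting variants from the imbalanced-learning literature (AdaC1, AdaC2, AdaC3) weight the data space by inserting a per-example misclassification cost C_i into the AdaBoost weight update. In the AdaC2 form, for example:

D_{t+1}(i) = \frac{C_i \, D_t(i) \, \exp(-\alpha_t \, y_i \, h_t(x_i))}{Z_t}

where Z_t is a normalization factor. Costly (typically minority) examples keep more weight between rounds, so successive weak learners focus on them; standard AdaBoost is recovered when every C_i = 1.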

Cost-Sensitive Decision Trees
- Cost-sensitive adjustments for the decision threshold: the final decision threshold shall yield the most dominant point on the ROC curve
- Cost-sensitive considerations for split criteria: the impurity function shall be insensitive to unequal costs
- Cost-sensitive pruning schemes: the probability estimate at each node needs improvement to reduce the removal of leaves describing the minority concept, e.g., the Laplace smoothing method and Laplace pruning techniques
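A minimal sketch of the split-criterion item, assuming scikit-learn (not the slides' own code): class_weight feeds per-class weights into the impurity computation, making splits cost-sensitive; the Laplace leaf estimate from the pruning item is shown by hand.

```python
from sklearn.tree import DecisionTreeClassifier

# Assumed costs: minority-class (label 1) errors are treated as 10x more costly;
# these weights enter the Gini impurity, making the split criterion cost-sensitive.
tree = DecisionTreeClassifier(class_weight={0: 1.0, 1: 10.0}, random_state=0)
# tree.fit(X_train, y_train)  # X_train, y_train are hypothetical

def laplace_estimate(n_class, n_total, n_classes=2):
    """Laplace-smoothed leaf probability: (k + 1) / (n + C)."""
    return (n_class + 1) / (n_total + n_classes)
```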

Cost-Sensitive Neural Networks
Four ways of applying cost sensitivity in neural networks:
- Modifying the probability estimates of the outputs: applied only at the testing stage, so the original network is maintained
- Altering the outputs directly: bias the network during training to focus on the expensive class
- Modifying the learning rate: set η higher for costly examples and lower for low-cost examples
- Replacing the error-minimizing function: use an expected-cost minimization function instead
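A hedged sketch of the last item, assuming PyTorch (the framework, architecture, and cost values are my assumptions, not the slides'): weighting the cross-entropy loss by per-class cost makes training minimize expected cost rather than plain error.

```python
import torch
import torch.nn as nn

# Hypothetical two-class network on 20 input features.
model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 2))

# Assumed costs: errors on class 1 (minority) are 10x more expensive.
class_costs = torch.tensor([1.0, 10.0])
loss_fn = nn.CrossEntropyLoss(weight=class_costs)  # expected-cost objective
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# One training step (X: float tensor [N, 20], y: long tensor [N]):
# loss = loss_fn(model(X), y)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```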

Kernel-Based Learning Framework
- Based on statistical learning theory and the Vapnik-Chervonenkis (VC) dimension
- Problems with kernel-based support vector machines (SVMs) on imbalanced data:
  - Support vectors from the minority concept may contribute less to the final hypothesis
  - The optimal hyperplane is biased toward the majority class, because minimizing the total error is itself biased toward the majority
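One standard remedy, sketched under the assumption that scikit-learn's SVC is used: give each class an error penalty inversely proportional to its frequency, which counteracts the hyperplane bias described above.

```python
from sklearn.svm import SVC

# class_weight='balanced' scales each class's error penalty C by
# n_samples / (n_classes * n_samples_in_class), so minority errors cost more
# and the separating hyperplane is pushed back toward the majority side.
svm = SVC(kernel="rbf", class_weight="balanced")
# svm.fit(X_train, y_train)  # hypothetical training data
```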

Thank you