
Instance Selection

1.Introduction 2.Training Set Selection vs. Prototype Selection 3.Prototype Selection Taxonomy 4.Description of Methods 5.Related and Advanced Topics 6.Experimental Comparative Analysis in PS

Instance Selection 1.Introduction 2.Training Set Selection vs. Prototype Selection 3.Prototype Selection Taxonomy 4.Description of Methods 5.Related and Advanced Topics 6.Experimental Comparative Analysis in PS

Introduction Instance selection (IS) is the complementary process to feature selection (FS): instead of reducing the number of attributes, it reduces the number of examples. The major issue in scaling down the data is the selection or identification of relevant data from an immense pool of instances. Instance selection aims to choose a subset of the data that achieves the original purpose of the DM application as if the whole data set were used.

Introduction IS vs. data sampling: IS is an intelligent operation that categorizes instances according to their degree of irrelevance or noise, rather than sampling blindly. The optimal outcome of IS is a minimum, model-independent data subset that can accomplish the same task with no performance loss. Formally, an IS method aims at obtaining a subset S ⊂ TR such that S contains no superfluous instances and Acc(S) is similar to Acc(TR), where Acc(X) is the classification accuracy obtained using X as a training set.
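A minimal Python sketch (not part of the original slides) of this goal, assuming scikit-learn is available: Acc(S) should stay close to Acc(TR) while the reduction rate grows. The random 20% selector below is only a placeholder for a real IS method.

```python
# Illustrative check of the IS goal: Acc(S) ~ Acc(TR) with a high reduction rate.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def accuracy_with(train_X, train_y, test_X, test_y, k=1):
    """Acc(X): accuracy of a k-NN classifier trained on the given subset."""
    return KNeighborsClassifier(n_neighbors=k).fit(train_X, train_y).score(test_X, test_y)

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Placeholder selector: a random 20% sample; a real IS method would categorize
# instances by relevance/noise instead of sampling blindly.
rng = np.random.default_rng(0)
idx = rng.choice(len(X_tr), size=len(X_tr) // 5, replace=False)

print("Acc(TR) =", accuracy_with(X_tr, y_tr, X_te, y_te))
print("Acc(S)  =", accuracy_with(X_tr[idx], y_tr[idx], X_te, y_te))
print("Reduction rate =", 1 - len(idx) / len(X_tr))
```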

Introduction IS has the following outstanding functions: – Enabling: IS enables a DM algorithm to work with huge data. – Focusing: a concrete DM task is focused on only one aspect of interest of the domain; IS focuses the data on the relevant part. – Cleaning: redundant as well as noisy instances are usually removed, improving the quality of the input data.

Instance Selection 1.Introduction 2.Training Set Selection vs. Prototype Selection 3.Prototype Selection Taxonomy 4.Description of Methods 5.Related and Advanced Topics 6.Experimental Comparative Analysis in PS

Training Set Selection vs. Prototype Selection Several terms have been used for selecting the most relevant data from the training set. Instance selection is the most general one, intended to work with any learning method, such as decision trees, ANNs or SVMs. Nowadays, there are two clear distinctions in the literature: Prototype Selection (PS), when the selected instances are used by instance-based classifiers such as kNN, and Training Set Selection (TSS), when they are used to train any other predictive model.

Training Set Selection vs. Prototype Selection Prototype Selection

Training Set Selection vs. Prototype Selection Training Set Selection

Instance Selection 1.Introduction 2.Training Set Selection vs. Prototype Selection 3.Prototype Selection Taxonomy 4.Description of Methods 5.Related and Advanced Topics 6.Experimental Comparative Analysis in PS

Prototype Selection Taxonomy Common properties: – Direction of Search. Incremental: an incremental search begins with an empty subset S and adds each instance in TR to S if it fulfills some criterion. Decremental: the decremental search begins with S = TR and then searches for instances to remove from S. Batch: a decremental search in which more than one instance can be removed at each step. Mixed: the search begins with a pre-selected subset S and can iteratively add or remove any instance that meets the specific criterion. Fixed: the final number of instances is user-defined.
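A hedged skeleton of the two basic search directions; `keep_criterion` and `remove_criterion` are hypothetical placeholders for the rules of a concrete PS method.

```python
# Generic incremental vs. decremental search skeletons (illustrative only).
import numpy as np

def incremental_search(X, y, keep_criterion):
    """Start from an empty S and add instances that satisfy the criterion."""
    S = []
    for i in range(len(X)):
        if keep_criterion(i, S, X, y):
            S.append(i)
    return np.array(S, dtype=int)

def decremental_search(X, y, remove_criterion):
    """Start from S = TR and drop instances that satisfy the criterion."""
    S = list(range(len(X)))
    for i in list(S):
        if remove_criterion(i, S, X, y):
            S.remove(i)
    return np.array(S, dtype=int)
```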

Prototype Selection Taxonomy Common properties: – Type of Selection. Condensation: these methods retain the points closer to the decision boundaries, also called border points. Edition: these methods seek to remove border points, eliminating instances that are noisy or do not agree with their neighbors. Hybrid: these methods try to find the smallest subset S that maintains or even increases the generalization accuracy on test data.

Prototype Selection Taxonomy Common properties: – Evaluation of Search. Filter: the kNN rule is applied over partial data to decide whether to add or remove an instance, and no leave-one-out validation scheme is used to estimate the generalization accuracy. Wrapper: the kNN rule is applied over the complete training set together with a leave-one-out validation scheme. Combining these two factors gives a better estimation of the generalization accuracy, which usually helps to obtain better accuracy over test data.
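A short sketch of the wrapper-style evaluation, assuming scikit-learn: the leave-one-out accuracy of the 1NN rule over the complete training set.

```python
# Leave-one-out 1-NN accuracy, the estimate used by wrapper-style PS methods.
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
loo_acc = cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y,
                          cv=LeaveOneOut()).mean()
print("Leave-one-out 1-NN accuracy estimate:", loo_acc)
```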

Prototype Selection Taxonomy Common properties: – Criteria to Compare PS Methods: storage reduction, noise tolerance, generalization accuracy and time requirements.

Prototype Selection Taxonomy Prototype Selection Methods (1)–(6): tables listing the PS methods considered in the taxonomy.

Prototype Selection Taxonomy Taxonomy of PS methods (figure).

Instance Selection 1.Introduction 2.Training Set Selection vs. Prototype Selection 3.Prototype Selection Taxonomy 4.Description of Methods 5.Related and Advanced Topics 6.Experimental Comparative Analysis in PS

Description of Methods Condensation Condensed Nearest Neighbor (CNN) — This algorithm finds a subset S of the training set TR such that every member of TR is closer to a member of S of the same class than to a member of S of a different class. It begins by randomly selecting one instance belonging to each output class from TR and putting them in S. Then each instance in TR is classified using only the instances in S. If an instance is misclassified, it is added to S, thus ensuring that it will be classified correctly. This process is repeated until there are no instances in TR that are misclassified.
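A sketch of CNN following the description above, assuming scikit-learn's 1NN classifier; it is illustrative rather than an optimized implementation.

```python
# Hart's CNN sketch: seed S with one instance per class, then absorb any
# instance of TR that the current S misclassifies with 1-NN.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def cnn_select(X, y, random_state=0):
    rng = np.random.default_rng(random_state)
    S = [rng.choice(np.flatnonzero(y == c)) for c in np.unique(y)]  # one seed per class
    changed = True
    while changed:
        changed = False
        knn = KNeighborsClassifier(n_neighbors=1).fit(X[S], y[S])
        for i in range(len(X)):
            if i not in S and knn.predict(X[i:i + 1])[0] != y[i]:
                S.append(i)                     # absorb the misclassified instance
                knn.fit(X[S], y[S])             # S changed: refit before continuing
                changed = True
    return np.array(S)
```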

Description of Methods Condensation Fast Condensed Nearest Neighbor family (FCNN) — The FCNN1 algorithm starts by introducing into S the centroid of each class. Then, for each prototype p in S, its nearest enemy inside its Voronoi region is found and added to S. This process is performed iteratively until no enemies are found in a single iteration. The FCNN2 algorithm is similar to FCNN1 but, instead of adding the nearest enemy of each Voronoi region, the centroid of the enemies found in the region is added. The FCNN3 algorithm is similar to FCNN1 but, instead of adding one prototype per region in each iteration, only one prototype is added (the one belonging to the Voronoi region with the most enemies). In FCNN3, S is initialized only with the centroid of the most populated class.

Description of Methods Condensation Reduced Nearest Neighbor (RNN) — RNN starts with S = TR and removes each instance from S if such a removal does not cause any other instance in TR to be misclassified by the instances remaining in S. It always generates a subset of the result of the CNN algorithm.
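A brute-force sketch of the RNN removal loop, starting from S = TR as in the description above; it is illustrative only (roughly cubic cost) and assumes scikit-learn.

```python
# RNN sketch: drop an instance only if every instance of TR is still
# correctly classified by 1-NN on the remaining subset.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def rnn_select(X, y):
    S = list(range(len(X)))
    for i in list(S):
        candidate = [j for j in S if j != i]
        if not candidate:
            break                                          # nothing left to train on
        knn = KNeighborsClassifier(n_neighbors=1).fit(X[candidate], y[candidate])
        if np.all(knn.predict(X) == y):                    # removal hurts no instance of TR
            S = candidate
    return np.array(S)
```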

Description of Methods Condensation Patterns by Ordered Projections (POP) — This algorithm eliminates the examples that are not at the limits of the regions to which they belong. To do so, each attribute is studied separately: the instances are sorted by that attribute, and a value called weakness, associated with each instance, is increased whenever the instance is not a limit (border) for that attribute. The instances whose weakness equals the number of attributes are eliminated.
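A rough sketch of the POP idea (an approximation, not the exact published algorithm): per attribute, an instance whose class agrees with both of its sorted neighbours accumulates weakness; instances weak on every attribute are removed.

```python
# Approximate POP sketch: weakness counts how many attributes see the
# instance as an interior (non-border) point of its class interval.
import numpy as np

def pop_select(X, y):
    n, d = X.shape
    weakness = np.zeros(n, dtype=int)
    for a in range(d):
        order = np.argsort(X[:, a])
        ys = y[order]
        inner = np.zeros(n, dtype=bool)                    # extremes are always limits
        inner[1:-1] = (ys[1:-1] == ys[:-2]) & (ys[1:-1] == ys[2:])
        weakness[order] += inner
    return np.flatnonzero(weakness < d)                    # keep instances that are a limit somewhere
```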

Description of Methods Edition Edited Nearest Neighbor (ENN) — Wilson developed this algorithm, which starts with S = TR and then removes each instance from S if it does not agree with the majority of its k nearest neighbors.
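A compact sketch of ENN, assuming scikit-learn and integer-coded class labels.

```python
# Wilson's ENN sketch: keep only instances whose class matches the majority
# vote of their k nearest neighbours in TR.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def enn_select(X, y, k=3):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    neigh = nn.kneighbors(X, return_distance=False)[:, 1:]     # drop the point itself
    votes = np.array([np.bincount(y[row]).argmax() for row in neigh])  # integer labels assumed
    return np.flatnonzero(votes == y)                          # indices kept in S
```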

Description of Methods Edition Multiedit — This method randomly partitions TR into several blocks, classifies the instances of each block with the NN rule using the next block as the training set, discards the misclassified instances, pools the remaining ones, and repeats the process until no instance is removed for a number of consecutive iterations.

Description of Methods Edition Relative Neighborhood Graph Edition (RNGE) — This method builds a proximity graph (the relative neighborhood graph) in which two instances are connected if no third instance is closer to both of them than they are to each other; an instance is removed if it disagrees with the majority class of its graph neighbors.

Description of Methods Edition All kNN — All kNN is an extension of ENN. For i = 1 to k, the algorithm flags as bad any instance not classified correctly by its i nearest neighbors. When the loop has been executed for all values of i, it removes the instances flagged as bad.
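A sketch of All kNN along the same lines as the ENN sketch (integer-coded labels assumed).

```python
# All k-NN sketch: an instance is bad if any of the i-NN rules, i = 1..k,
# misclassifies it; bad instances are removed at the end.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def all_knn_select(X, y, k=3):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    neigh = nn.kneighbors(X, return_distance=False)[:, 1:]     # drop self
    bad = np.zeros(len(X), dtype=bool)
    for i in range(1, k + 1):
        votes = np.array([np.bincount(y[row[:i]]).argmax() for row in neigh])
        bad |= votes != y
    return np.flatnonzero(~bad)
```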

Description of Methods Hybrid Instance-Based Learning Algorithms Family (IB3) — IB3 is an incremental algorithm: an instance is added to S when it is misclassified by the currently accepted instances, and a classification record is maintained for each saved instance. Only instances whose accuracy is significantly greater than the observed frequency of their class are accepted, while those performing significantly worse are dropped.

Description of Methods Hybrid Decremental Reduction Optimization Procedure Family (DROP) — Each instance Xi has k nearest neighbors, where k is typically a small odd integer. Xi also has a nearest enemy, which is the nearest instance with a different output class. Those instances that have Xi as one of their k nearest neighbors are called associates of Xi.
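A sketch of the bookkeeping the DROP family relies on: the stored nearest neighbours, the nearest-enemy distance, and the associate lists of each instance. The name `drop_structures` is illustrative, not from the original algorithm.

```python
# Data structures used by DROP-style methods (illustrative helper).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def drop_structures(X, y, k=3):
    # k + 1 neighbours are stored per instance (as DROP2/DROP3 do); one extra
    # query slot is needed because every point returns itself first.
    nn = NearestNeighbors(n_neighbors=k + 2).fit(X)
    neigh = nn.kneighbors(X, return_distance=False)[:, 1:]
    # nearest enemy: the closest instance with a different class label
    enemy_dist = np.array([np.linalg.norm(X[y != y[i]] - X[i], axis=1).min()
                           for i in range(len(X))])
    # associates of i: the instances that list i among their stored neighbours
    associates = [[] for _ in range(len(X))]
    for i, row in enumerate(neigh):
        for j in row:
            associates[j].append(i)
    return neigh, enemy_dist, associates
```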

Description of Methods Hybrid DROP1: Its basic removal rule is: remove Xi from S if at least as many of its associates in S would be classified correctly without Xi.

Description of Methods Hybrid DROP2: In this method, the removal criterion can be restated as: remove Xi if at least as many of its associates in TR would be classified correctly without Xi. Using this modification, each instance Xi in the original training set TR continues to maintain a list of its k + 1 nearest neighbors in S, even after Xi is removed from S. DROP2 also changes the order of removal of instances: it initially sorts the instances in S by the distance to their nearest enemy. DROP3: It is a combination of the DROP2 and ENN algorithms. DROP3 uses a noise-filtering pass before sorting the instances in S (Wilson's ENN editing). After this, it works identically to DROP2.

Description of Methods Hybrid Iterative Case Filtering (ICF) — ICF defines the local set L(Xi), which contains all cases inside the largest hypersphere centered on Xi such that the hypersphere contains only cases of the same class as Xi. The authors define two properties, reachability and coverage. In the first phase, ICF uses the ENN algorithm to remove noise from the training set. In the second phase, ICF removes each instance Xi for which Reachability(Xi) is bigger than Coverage(Xi). This procedure is repeated for each instance in TR. After that, ICF recalculates the reachability and coverage properties and restarts the second phase.
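A sketch of the reachability and coverage quantities as interpreted here (the local set of Xi being the same-class cases closer to Xi than its nearest enemy); the full pairwise-distance matrix limits this to small data sets.

```python
# ICF quantities: reachability(x) = size of its local set,
# coverage(x) = number of cases whose local set contains x.
import numpy as np

def icf_sets(X, y):
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances (small n only)
    enemy_dist = np.array([D[i, y != y[i]].min() for i in range(n)])
    # local set of i: same-class cases strictly closer than the nearest enemy
    local = [np.flatnonzero((D[i] < enemy_dist[i]) & (y == y[i])) for i in range(n)]
    reachability = np.array([len(s) for s in local])
    coverage = np.array([sum(i in s for s in local) for i in range(n)])
    return reachability, coverage

# One pass of the second ICF phase (after ENN noise filtering):
# reach, cover = icf_sets(X, y); keep = np.flatnonzero(reach <= cover)
```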

Description of Methods Hybrid Random Mutation Hill Climbing (RMHC) — It randomly selects a subset S from TR containing a fixed number of instances s (a percentage of |TR|). In each iteration, the algorithm interchanges an instance from S with another from TR − S. The change is kept if it offers better accuracy.
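A sketch of RMHC assuming a 1NN fitness measured over TR; the `fraction` parameter plays the role of the user-given percentage s.

```python
# RMHC sketch: keep a fixed-size subset and accept a random swap only if
# it improves the 1-NN accuracy over the whole training set.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def rmhc_select(X, y, fraction=0.1, iterations=500, random_state=0):
    rng = np.random.default_rng(random_state)
    n = len(X)
    s = max(1, int(fraction * n))                          # fixed subset size
    S = rng.choice(n, size=s, replace=False)

    def fitness(idx):
        # accuracy of 1-NN trained on the subset and evaluated over all of TR
        return KNeighborsClassifier(n_neighbors=1).fit(X[idx], y[idx]).score(X, y)

    best = fitness(S)
    for _ in range(iterations):
        trial = S.copy()
        trial[rng.integers(s)] = rng.choice(np.setdiff1d(np.arange(n), S))
        trial_fit = fitness(trial)
        if trial_fit > best:                               # keep the swap only if it improves accuracy
            S, best = trial, trial_fit
    return np.sort(S)
```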

Description of Methods Hybrid Steady-state memetic algorithm (SSMA) — SSMA encodes each candidate subset S as a binary chromosome and evolves a population with a steady-state genetic algorithm, combining classification accuracy and reduction rate in the fitness function; a local search (meme) is adaptively applied to the offspring to refine the selected subsets.

Instance Selection 1.Introduction 2.Training Set Selection vs. Prototype Selection 3.Prototype Selection Taxonomy 4.Description of Methods 5.Related and Advanced Topics 6.Experimental Comparative Analysis in PS

Related and Advanced Topics Prototype Generation Prototype generation methods are not limited to selecting examples from the training set. They can also modify the values of the samples, changing their position in the d-dimensional space considered. Most of them use merging or divide-and-conquer strategies to generate new artificial samples, or are based on clustering approaches, Learning Vector Quantization hybrids, advanced proposals and evolutionary-algorithm-based schemes.
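A minimal prototype-generation sketch, not any specific published method: class-wise k-means centroids serve as artificial prototypes that need not coincide with any original training instance.

```python
# Class-wise centroid prototypes (illustrative prototype-generation scheme).
import numpy as np
from sklearn.cluster import KMeans

def class_wise_centroids(X, y, prototypes_per_class=5, random_state=0):
    P, labels = [], []
    for c in np.unique(y):
        Xc = X[y == c]
        k = min(prototypes_per_class, len(Xc))
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(Xc)
        P.append(km.cluster_centers_)          # artificial prototypes for class c
        labels.append(np.full(k, c))
    return np.vstack(P), np.concatenate(labels)
```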

Related and Advanced Topics Distance Metrics, Feature Weighting and Combinations with Feature Selection This area refers to the combination of IS and PS methods with other well-known schemes used to improve accuracy in classification problems. For example, weighting schemes combine PS with FS or Feature Weighting, where a vector of weights associated with the attributes influences the distance computations.
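A small sketch of feature weighting combined with kNN: scaling each attribute by the square root of its weight makes plain Euclidean distance equal to the weighted Euclidean distance, so a standard kNN implementation can be reused unchanged.

```python
# Weighted-Euclidean kNN via feature scaling (sqrt of the weight vector).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def weighted_knn(X_train, y_train, weights, k=1):
    w = np.sqrt(np.asarray(weights, dtype=float))
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train * w, y_train)
    # apply the same scaling to query points: knn.predict(X_query * w)
    return knn, w
```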

Related and Advanced Topics Hybridizations with Other Learning Methods and Ensembles This family includes all the methods that simultaneously use instances and rules to classify a new object. If the values of the object fall within the range of a rule, its consequent predicts the class; otherwise, if no rule matches the object, the most similar rule or instance stored in the database is used to estimate the class. This area also covers ensemble learning, where an IS method is run several times and the classification decision is made according to the majority class obtained over the resulting subsets, possibly weighted by a performance measure given by a learner.

Related and Advanced Topics Scaling-Up Approaches One of the disadvantages of IS methods is that most of them have prohibitive running times, or even cannot be applied at all, over large data sets. Recent improvements in this field cover the stratification of data and the development of distributed approaches for PS.
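A sketch of the stratification idea, assuming scikit-learn; `selector` is a placeholder for any PS method that returns the indices it keeps within its stratum.

```python
# Stratification sketch: run a PS method independently on class-balanced
# strata of TR and join the selected indices.
import numpy as np
from sklearn.model_selection import StratifiedKFold

def stratified_selection(X, y, selector, n_strata=5):
    kept = []
    skf = StratifiedKFold(n_splits=n_strata, shuffle=True, random_state=0)
    for _, stratum in skf.split(X, y):             # each test fold is one stratum
        local = selector(X[stratum], y[stratum])   # indices relative to the stratum
        kept.append(stratum[local])
    return np.concatenate(kept)
```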

Related and Advanced Topics Data Complexity This area studies the effect of applying PS methods prior to classification on the complexity of the data, and how to diagnose, from the complexity of the data, whether applying PS methods will be beneficial.

Instance Selection 1.Introduction 2.Training Set Selection vs. Prototype Selection 3.Prototype Selection Taxonomy 4.Description of Methods 5.Related and Advanced Topics 6.Experimental Comparative Analysis in PS

Experimental Comparative Analysis in PS Framework 10-fold cross-validation (10-FCV). Parameters recommended by the authors of the algorithms. Euclidean distance. Three runs for stochastic methods. 42 PS methods involved. 39 small data sets. 19 medium data sets. Reduction, accuracy, kappa and time as evaluation measures.
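A sketch of this evaluation protocol, assuming scikit-learn; `selector` stands for any PS method that returns the indices it keeps from the training fold.

```python
# 10-fold CV of a 1-NN classifier trained on the selected subset, reporting
# average reduction rate, accuracy and Cohen's kappa.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

def evaluate_ps(X, y, selector, n_splits=10):
    red, acc, kap = [], [], []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for tr, te in skf.split(X, y):
        S = selector(X[tr], y[tr])                        # indices into the training fold
        knn = KNeighborsClassifier(n_neighbors=1).fit(X[tr][S], y[tr][S])
        pred = knn.predict(X[te])
        red.append(1 - len(S) / len(tr))
        acc.append(np.mean(pred == y[te]))
        kap.append(cohen_kappa_score(y[te], pred))
    return np.mean(red), np.mean(acc), np.mean(kap)
```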

Experimental Comparative Analysis in PS Results in small data sets (1)

Experimental Comparative Analysis in PS Results in small data sets (2)

Experimental Comparative Analysis in PS Analysis in small data sets Best condensation methods: FCNN and MCNN among incremental methods, and RNN and MSS among decremental ones. Best edition methods: ENN, RNGE and NCNEdit obtain the best results in accuracy/kappa, while MENN and ENNTh offer a good tradeoff considering the reduction rate. Best hybrid methods: CPruner, HMNEI, CCIS, SSMA, CHC and RMHC. Best global methods: in terms of accuracy or kappa, MoCS, RNGE and HMNEI; considering the tradeoff reduction-accuracy/kappa, RMHC, RNN, CHC, Explore and SSMA.

Experimental Comparative Analysis in PS Results in medium data sets

Experimental Comparative Analysis in PS Analysis in medium data sets Five techniques outperform 1NN in terms of accuracy/kappa over medium data sets: RMHC, SSMA, HMNEI, MoCS and RNGE. Some techniques have prohibitive running times when the data scales up; this is the case for RNN, RMHC, CHC and SSMA. The best methods in terms of accuracy or kappa are RNGE and HMNEI. The best methods considering the tradeoff reduction-accuracy/kappa are RMHC, RNN and SSMA.

Experimental Comparative Analysis in PS Final suggestions: For the tradeoff reduction-accuracy rate: the algorithms that obtain the best behavior are RMHC and SSMA; DROP3 and CCIS slightly harm the accuracy in exchange for a great reduction in time complexity. If the interest is the accuracy rate: the best results are achieved with RNGE as an editor and HMNEI as a hybrid method. When the key factor is condensation: FCNN is the highlighted one, being also one of the fastest.

Experimental Comparative Analysis in PS Visualization of selected data subsets on the Banana data set (reduction rate, test accuracy, test kappa): figure slides.