FEATURE WEIGHTING THROUGH A GENERALIZED LEAST SQUARES ESTIMATOR

Presentation transcript:

FEATURE WEIGHTING THROUGH A GENERALIZED LEAST SQUARES ESTIMATOR J.M. Sotoca (Pattern Recognition in Information Systems, PRIS, 2003)

Feature selection process with validation
Flow diagram: original set of features -> selection -> subset of features -> evaluation (goodness of the subset) -> stopping criterion -> selected subset -> validation.
Selection: the process of searching for a subset of features to evaluate.
Evaluation: measures the goodness of the subset under examination.
Stopping criterion: is this the best subset? The criterion can be a threshold on a fixed number of features to select, or the maximisation of classification accuracy (or of the a posteriori probability) for a given classification rule. If the criterion is not met, a new subset of features is searched for.
Validation: classify a test set using the subset of features chosen on the training set.
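
The following is a concrete illustration of this loop, not code from the paper: a greedy sequential forward selection in Python, with k-NN cross-validation accuracy as a (hypothetical) evaluation measure and "no further improvement" as the stopping criterion.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def forward_selection(X_train, y_train, X_test, y_test):
    remaining = list(range(X_train.shape[1]))
    selected, best_score = [], -np.inf
    while remaining:                                            # Selection: propose a new subset
        scores = {f: cross_val_score(KNeighborsClassifier(1),
                                     X_train[:, selected + [f]],
                                     y_train, cv=5).mean()
                  for f in remaining}                           # Evaluation: goodness of the subset
        f, score = max(scores.items(), key=lambda kv: kv[1])
        if score <= best_score:                                 # Stopping criterion: no improvement
            break
        selected.append(f)
        remaining.remove(f)
        best_score = score
    knn = KNeighborsClassifier(1).fit(X_train[:, selected], y_train)
    return selected, knn.score(X_test[:, selected], y_test)     # Validation on a held-out test set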

Filter and wrapper methods
Diagrams: filter approach — set of input variables -> subset selection algorithm -> learning algorithm; wrapper approach — set of input variables -> subset selection algorithm <-> subset evaluation with the learning algorithm -> learning algorithm.
There are two groups of feature selection methods.
Filter methods: the selected subset is independent of the learning method that will use the selected features. We obtain a ranking of relevance over the feature set, and then choose a subset containing the best features.
Wrapper methods: the evaluation function is based on the same learning algorithm that will later be used for learning on the domain represented with the selected features.
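
As a small contrast of the two approaches (using generic scikit-learn tools that the paper itself does not prescribe): a filter ranks the features without consulting the learner, whereas a wrapper scores candidate subsets with the same learning algorithm that will be used afterwards.

import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def filter_top_m(X, y, m):
    scores = mutual_info_classif(X, y)       # relevance ranking, independent of the learner
    return np.argsort(scores)[::-1][:m]      # keep the m best-ranked features

def wrapper_score(X, y, subset):
    # the evaluation uses the same learning algorithm (here 1-NN)
    # that will later be trained on the selected features
    return cross_val_score(KNeighborsClassifier(1), X[:, subset], y, cv=5).mean()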

Validation: weighting-selection
In the machine learning literature, two degrees of feature relevance can be distinguished:
Strongly relevant: removing a strongly relevant feature adds ambiguity and generally decreases classifier performance.
Weakly relevant: the effect of eliminating a weakly relevant feature depends on which other attributes are removed; otherwise, it can be considered an irrelevant feature.
Validation: the goodness of these weights can be shown through the NN rule using weighted distances. This is the validation criterion for the ordering and quality of the features.
Feature weighting: we obtain a set of weights expressing the degree of relevance of the features.
Feature selection: keep only the most relevant features, by using a binary weight vector (that is, assigning a value of 1 to relevant features and 0 to irrelevant ones).
Filter methods: when filter methods are used we obtain a ranking of relevance. The data set in the figure has 40 continuous features, of which 19 are irrelevant. The features are sorted and the one with the lowest weight value is progressively discarded, following a Sequential Backward Selection (SBS) scheme. The dotted line represents the accuracy when a weight of 1 is assigned to the selected features and 0 to the removed features (feature selection). The solid line represents the accuracy when the weight values obtained by ReliefF are assigned to the retained features and 0 to the discarded ones (feature weighting on the features not eliminated). When the irrelevant and some weakly relevant features are eliminated, the classification accuracy improves; however, removing a strongly relevant feature decreases classifier performance. A minimal sketch of this comparison is given below.
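
A minimal sketch (not the paper's code) of the comparison just described: features are dropped from the lowest weight upwards, and a 1-NN classifier is evaluated with either binary weights (feature selection, dotted line) or the learned real-valued weights (feature weighting, solid line). The weight vector is assumed to come from ReliefF or any other feature weighting method.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def accuracy_with_weights(X_train, y_train, X_test, y_test, w):
    # scaling every feature by its weight before fitting amounts to a
    # weighted Euclidean distance (with squared weights) in the 1-NN rule
    knn = KNeighborsClassifier(n_neighbors=1)
    knn.fit(X_train * w, y_train)
    return knn.score(X_test * w, y_test)

def sbs_curves(X_train, y_train, X_test, y_test, weights):
    order = np.argsort(weights)                  # lowest-weight feature first
    selection_acc, weighting_acc = [], []
    for n_removed in range(len(weights)):
        removed = order[:n_removed]
        binary = np.ones_like(weights, dtype=float)
        binary[removed] = 0.0                    # dotted line: 1/0 weights (selection)
        real = np.asarray(weights, dtype=float).copy()
        real[removed] = 0.0                      # solid line: learned weights (weighting)
        selection_acc.append(accuracy_with_weights(X_train, y_train, X_test, y_test, binary))
        weighting_acc.append(accuracy_with_weights(X_train, y_train, X_test, y_test, real))
    return selection_acc, weighting_acc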

Comparison of feature weighting methods
Nearest hit: for each instance x, the nearest neighbour with the same class.
Nearest miss: for each instance x, the nearest neighbour with a different class.
ReliefF algorithm (Kononenko, 1994): for each feature, and for m instances drawn randomly from the TS, the algorithm accumulates the difference between the nearest miss and the nearest hit. ReliefF is an extension of Relief to multi-class data sets. Intuitively, Relief chooses instances in the training set and, for each one, finds its nearest neighbour of the same class (near hit) and its nearest neighbour of a different class (near miss); a feature is more relevant if it distinguishes between an instance and its near miss, and less relevant if it distinguishes between an instance and its near hit. (A simplified sketch is given below.)
Class Weighted-L2 (CW_L2) (Paredes and Vidal, 2000): this method obtains a set of weights (one weight per attribute and class) by gradient-descent minimisation of an appropriate criterion function based on the ratio between the nearest-hit and nearest-miss distances.
GLS (Generalized Least Squares): we use generalized least squares to minimise a criterion function, in order to obtain a set of weights that captures the order of relevance of the feature set. We have compared these distance-based methods to observe their behaviour: GLS behaves similarly to ReliefF on data sets with irrelevant features, and obtains better results than Relief when all attributes have similar relevance.
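
The sketch below is a simplified Relief-style weighting for illustration only: two classes, a single nearest hit and miss per sampled instance, and numeric features assumed scaled to [0, 1]. The full ReliefF of Kononenko (1994) additionally averages over k neighbours per class and handles multi-class problems and missing values.

import numpy as np

def relief_weights(X, y, m=100, seed=0):
    # X: (n, d) array of numeric features in [0, 1]; y: (n,) class labels
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(m):
        i = rng.integers(n)
        x, c = X[i], y[i]
        dists = np.abs(X - x).sum(axis=1)                   # distance from x to every instance
        dists[i] = np.inf                                   # exclude x itself
        hit = np.argmin(np.where(y == c, dists, np.inf))    # nearest hit (same class)
        miss = np.argmin(np.where(y != c, dists, np.inf))   # nearest miss (different class)
        # a feature gains weight if it separates x from its miss,
        # and loses weight if it separates x from its hit
        w += np.abs(x - X[miss]) - np.abs(x - X[hit])
    return w / m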

The Generalized Least Squares (GLS)
Initialisation: w_i = 1.0; n = (d x K) + 2 is the number of observations for each instance x. Q_ll is the identity matrix, assuming isotropic error in the observations.
In each iteration t:
Calculate the matrices A, B, Q_ww = B Q_ll B^T and the vector of residual functions W.
Calculate the new weights w_t (a hedged form is given below).
Repeat until the residual or the leave-one-out error rate is minimum.
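
The weight update is shown as an image on the original slide. Purely as a reference point, and as an assumption rather than the paper's exact expression, the standard Gauss-Helmert GLS step built from the quantities defined above would read

w_t = w_{t-1} + \left(A^{\top} Q_{ww}^{-1} A\right)^{-1} A^{\top} Q_{ww}^{-1} W, \qquad Q_{ww} = B\, Q_{ll}\, B^{\top},

where the sign of the correction depends on the sign convention used for the residual vector W.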

A class-intensity-based model
Class intensity: the sum of the influences of each neighbour p_k, with class label c(p_k), over an instance x of the Training Set (TS). This influence is the inverse of the squared distance D.
w: the weight vector, i.e. the parameters of the model.
Observations vector in the TS: formed by the d x K differences taking part in the neighbourhood, where K is the number of neighbours and d is the number of dimensions.
The charge of class C is defined from these quantities; a hedged sketch of the expressions is given below.
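
The equations on this slide are images; the following is only a plausible reconstruction from the verbal description above, and the exact placement of terms is an assumption. The intensity of class c at instance x, accumulated over its K neighbours, could be written as

E_x(c) = \sum_{k=1}^{K} \frac{\delta\big(c(p_k),\, c\big)}{D^2(x, p_k; w)},

where \delta(a, b) is 1 if a = b and 0 otherwise, so that only the neighbours carrying label c contribute to the intensity of class c at x.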

A class-intensity-based model
The squared criterion distance D combines the feature weights with range-normalised feature differences, where max(x_i) and min(x_i) are the maximum and minimum values of feature i (a hedged form is sketched below).
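
A hedged reconstruction of the weighted, range-normalised distance implied by the text (whether the weights enter as w_i or w_i^2 is an assumption):

D^2(x, p; w) = \sum_{i=1}^{d} w_i^{2} \left( \frac{x_i - p_i}{\max(x_i) - \min(x_i)} \right)^{2}.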

Feature Weight Estimation
For each instance x in the TS, a criterion function is minimised, where E_x^1(w, ·) is the class intensity in the current iteration and E_x^2(w_a, ·) is the class intensity obtained when all neighbours are assumed to have the same class label; w_a is the weight vector produced by the model in the previous iteration. The model parameters w = {w_1, ..., w_d} in the d-dimensional feature space capture the relevance of the features. (A hedged form of the criterion is given below.)
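
The criterion itself appears as an image on the slide; a form consistent with the description, offered only as an assumption, is the per-instance residual

F_x(w) = E_x^{1}(w, \ell) - E_x^{2}(w_a, \ell),

which the GLS estimator drives towards zero over all instances x of the TS, so that the intensity of the true class approaches the intensity that would be obtained if all K neighbours shared that class (here \ell denotes the observations vector).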

Feature Weight Estimation
The observations vector is the set of all difference components, one per neighbour k = 1, ..., K and feature i = 1, ..., d. In addition, E_x^1 and E_x^2 are added to the observations for instance x (hence the n = (d x K) + 2 observations per instance). The vector of residual functions is built from these quantities.

Descriptions of data sets
The main characteristics are summarised in the table below (the number of irrelevant features is given in brackets). Six artificial databases (Led+17, Monk 1-3, Waveform and Waveform+40) have been chosen to evaluate performance under controlled conditions.

Data set       Features   Classes   Instances
Led+17         24 (17)    10        2000
Waveform       21         3         5000
Waveform+40    40 (19)    3         5000
Monk1          6 (3)      2         556
Monk2          6          2         601
Monk3          6          2         494
Diabetes       8          2         768
Glass          9          6         214
Heart          13         2         270
Vowel          10         11        528
Vehicle        18         4         848
Wine           13         3         178

Empirical Results
Validation with the k-NN classification rule. The non-weighted k-NN classifier is denoted (w_i = 1.0). In the results table, the first five columns correspond to the 1-NN rule, while the last columns are those obtained with the best k-NN classifiers (1 <= k <= 21).
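
A minimal sketch (with placeholder data handling, not the paper's protocol) of how a "best k-NN" figure could be produced: leave-one-out accuracy is computed for k = 1, ..., 21 with the features scaled by the learned weights, and the best value of k is reported.

import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def best_knn_accuracy(X, y, w):
    Xw = X * w                                   # weighted distance via feature scaling
    scores = {}
    for k in range(1, 22):
        knn = KNeighborsClassifier(n_neighbors=k)
        scores[k] = cross_val_score(knn, Xw, y, cv=LeaveOneOut()).mean()
    best_k = max(scores, key=scores.get)
    return best_k, scores[best_k]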

Learning capability
Figures: effect of TS size in the Led+17 database; effect of TS size in the Monk2 database.
Validation of learning ability: we study the effect of using training sets of different sizes on the classification accuracy of the feature weighting methods.
Right: a binary data set (Led+17) with 24 features, 17 of which are irrelevant. Both algorithms (ReliefF and GLS) find the 7 relevant features and need only around 100 prototypes to reach the optimal classification.
Left: a categorical data set (Monk2) with 6 features, all of them relevant, although there are small differences in relevance between the features. The experiments with this and other data sets suggest that, when all attributes are relevant, GLS learns faster than ReliefF.

Concluding remarks
A new feature weighting method has been introduced. It basically consists of minimising a criterion function through generalized least squares (GLS).
The behaviour of the GLS algorithm proposed here is similar to that of the well-known ReliefF approach.
Studying the learning rate of the ReliefF and GLS models, both obtain good results in the presence of irrelevant attributes, while GLS is able to obtain better results when all attributes are relevant.

Further work
Movement of the set of observed data.
Detection of outliers.
Simultaneous fit of multiple models.
Feature selection by class.