
1 Artificial Intelligence: Feature Selection
Dae-Won Kim
School of Computer Science & Engineering, Chung-Ang University

2 What is feature selection?

3 We predict the grade of 'AI' for a new student (age, sex, .., smoke) using the following training data.

Name      | Age | Sex | Alg. | C++ | Military | Height | Weight | C | Smoke | Class label
Student-1 | 21  | M   | A    | B   | Y        | 178    | 80     |   | Yes   |
Student-2 | 23  | F   |      |     | N        | 165    | 50     |   | No    |
Student-3 | 22  |     |      |     |          | 160    | 45     |   |       |
Student-4 |     |     |      |     |          | 180    | 70     |   |       |

4 Definition: Given a set of features, select a subset that performs best.

5 Q: Advantages? (4 viewpoints)

6 Benefits-1: simpler and faster

7 It reduces the number of features.

8 Benefits-2: cost-effective

9 It lowers computational cost, the cost of testing, etc.

10 Benefits-3: better accuracy

11 It may improve accuracy by removing useless features.

12 Benefits-4: deeper insight

13 Knowledge of good features gives insights into the underlying structure.

14 Caution: Feature Selection vs. Feature Transformation

15 Feature transformation is related to scaling and normalization.

16 Certain transformation of features may lead to the discovery of structures that were not obvious on the original scale.

17

18 Y = 2X or Y = X. Given X, the result Y is predictable.

19

20 For scaling, we usually take square roots, reciprocals, and logarithms.
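A minimal sketch (in Python/NumPy, which the slides do not prescribe) of these scaling transforms applied to a hypothetical positive-valued feature:

```python
import numpy as np

x = np.array([1.0, 4.0, 9.0, 100.0, 400.0])   # hypothetical skewed feature values

sqrt_scaled = np.sqrt(x)      # square roots compress large values
recip_scaled = 1.0 / x        # reciprocals (values must be nonzero)
log_scaled = np.log(x)        # logarithms (values must be positive)
```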

21 We also handle the question of normalization.

22 Features are often normalized to lie in a fixed range, from zero to one.

23 1. By dividing all values by the maximum value encountered.

24 2. By subtracting the minimum value and dividing by the range between the max. and the min.

25 3. By calculating the mean and standard deviation of the feature, subtracting the mean from each value, and dividing the result by the standard deviation. Q: Try these three options.

26 This is also called standardization (zero mean, unit standard deviation).
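A minimal NumPy sketch of the three options above (max-scaling, min-max normalization, and standardization), applied to a hypothetical feature:

```python
import numpy as np

f = np.array([160.0, 165.0, 178.0, 180.0])    # hypothetical height feature

# Option 1: divide all values by the maximum value encountered
opt1 = f / f.max()

# Option 2: subtract the minimum and divide by the (max - min) range
opt2 = (f - f.min()) / (f.max() - f.min())

# Option 3: subtract the mean and divide by the standard deviation
opt3 = (f - f.mean()) / f.std()               # standardization: mean 0, s.d. 1
```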

27 However, we may sacrifice the way it represents the underlying data.

28 Caution: Feature Selection vs. Feature Extraction

29 Definition: It creates new, good features from combinations or transformations of the original features (dimension reduction).

30 1. By using linear combinations that are simple to compute & tractable.

31 2. By projecting high dimensional data onto a lower dimensional space.

32 PCA (principal component analysis)
SVD (singular value decomposition)
LDA (linear discriminant analysis)
MDA (multiple discriminant analysis)
ICA (independent component analysis)
MDS (multi-dimensional scaling)
NMF (nonnegative matrix factorization)

33 PCA extracts principal components calculated from the covariance matrix.

34 New features are obtained by a linear combination of PCs.

35 Have you heard about eigenvectors and eigenvalues in linear algebra?

36

37

38 PCA (Principal Component Analysis)
Given four data points, each with three attributes, reduce the dimension of the attributes to two.
Step 1. Find the eigenvectors (v) and corresponding eigenvalues (λ) of the covariance matrix.
Step 2. Sort the eigenvectors by their eigenvalues in descending order:
λ1, v1 = [ ... ]^T (referred to as the first principal component)
λ2, v2 = [ ... ]^T (referred to as the second principal component)
λ3, v3 = [ ... ]^T (referred to as the third principal component)
Step 3. Create new attributes using the top k eigenvectors, where k = 2:
original attributes × v1^T = [1 2 1] × [ ... ]^T = 0.41
original attributes × v2^T = [1 2 1] × [ ... ]^T = -1.42
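A minimal NumPy sketch of the three PCA steps above; the 4x3 data matrix is hypothetical, since the slide's numeric values are not reproduced here:

```python
import numpy as np

# Hypothetical data: 4 points, 3 attributes each (the slide's actual values are omitted)
X = np.array([[1.0, 2.0, 1.0],
              [2.0, 3.0, 1.5],
              [3.0, 5.0, 2.0],
              [4.0, 6.0, 3.0]])

# Step 1: eigenvectors and eigenvalues of the covariance matrix (features as columns)
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)       # eigh works for symmetric matrices

# Step 2: sort eigenvectors by eigenvalue in descending order
order = np.argsort(eigvals)[::-1]
eigvecs = eigvecs[:, order]                  # columns are the principal components

# Step 3: project onto the top k = 2 principal components
# (the slide multiplies the original attributes directly; centering X first is also common)
k = 2
X_new = X @ eigvecs[:, :k]                   # new 4x2 attribute matrix
print(X_new)
```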

39 SVD (Singular Value Decomposition)
Given the relation matrix of documents and terms, recompute the relation matrix using the top 2 singular values.
Step 1. Decompose the original matrix A into three matrices (e.g., using MATLAB):
A = [ (1/singular values) × A × eigenvectors ] × [ singular values ] × [ eigenvectors ]^T, where each singular value = (eigenvalue)^(1/2).
Step 2. Create the new matrix A_new using the top k entries of the decomposed matrices, where k = 2.
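A minimal NumPy sketch of the rank-2 reconstruction described above; the document-term matrix is hypothetical:

```python
import numpy as np

# Hypothetical document-term relation matrix (rows: documents, columns: terms)
A = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 1.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]])

# Step 1: decompose A = U * diag(s) * Vt
# (each singular value in s is the square root of an eigenvalue of A^T A)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Step 2: rebuild the matrix using only the top k = 2 singular values
k = 2
A_new = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(A_new)
```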

40 PCA is useful for representing data.

41 LDA is useful for discriminating data.

42

43 PCA seeks orthogonal directions.

44 ICA seeks independent directions.

45 ICA is useful for blind source separation.
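A minimal sketch of blind source separation with ICA, using scikit-learn's FastICA (a library choice not prescribed by the slides) on two hypothetical mixed signals:

```python
import numpy as np
from sklearn.decomposition import FastICA

# Two hypothetical source signals (a sine wave and a square wave), linearly mixed
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]
A = np.array([[1.0, 0.5],
              [0.5, 2.0]])                   # mixing matrix
X = S @ A.T                                  # observed mixtures

# ICA recovers statistically independent components from the mixtures
ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)                 # estimated sources (up to order and scale)
```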

46 Let us go back to feature selection.

47 How can we select good features?

48 Quiz: propose your idea on feature selection using examples

49 We can find a set of good features.

50 We can find a set of bad features.

51 We need definitions of good and bad features.

52 redundant, relevant, inconsistent,…

53 Feature selection is a special case of a search problem (an optimization problem).

54 We can use all search techniques.

55 Among many approaches, we learn filter and wrapper techniques.

56 The difference: whether a learning algorithm is used.

57 Wrapper uses learning algorithms.

58 Filter: greedy selection by univariate ranking (a sketch follows below)
1. Score each feature
2. Sort them
3. Select the top-ranked features
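A minimal sketch of this univariate filter; the scoring function (absolute correlation with the class label) is one possible choice, since the slides do not fix a particular measure:

```python
import numpy as np

def filter_select(X, y, k):
    """Rank features by a univariate score and keep the top k."""
    # Score each feature: absolute correlation between the feature and the label
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    ranking = np.argsort(scores)[::-1]       # sort features, best score first
    return ranking[:k]                       # select the top-ranked features

# Hypothetical data: 6 samples, 4 features, binary class labels
X = np.array([[1, 5, 0.1, 3], [2, 5, 0.2, 1], [3, 6, 0.1, 2],
              [4, 6, 0.3, 3], [5, 7, 0.2, 1], [6, 7, 0.1, 2]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])
print(filter_select(X, y, k=2))              # indices of the two best features
```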

59 Q: pros and cons of univariate filter

60 Filter: multivariate correlation issue

61 Wrapper: Search + Learning-based Evaluation
1. Form feature subsets by search
2. Evaluate them using classifiers
3. Expand or modify the subsets

62 Wrapper: Revisit your solution to TSP

63 Wrapper (see the sketch below):
1. Search by greedy search, B&B, hill climbing, or GA
2. Evaluation by learning accuracy (Bayesian or k-NN)
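A minimal sketch of a wrapper: a greedy forward search over feature subsets, each subset evaluated by the leave-one-out accuracy of a 1-NN classifier (one of the learners the slide mentions); the function names are illustrative:

```python
import numpy as np

def loo_1nn_accuracy(X, y):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    correct = 0
    for i in range(len(y)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                        # exclude the sample itself
        correct += y[d.argmin()] == y[i]
    return correct / len(y)

def wrapper_forward_select(X, y, k):
    """Greedy search: repeatedly add the feature that most improves accuracy."""
    selected = []
    while len(selected) < k:
        best_j, best_acc = None, -1.0
        for j in range(X.shape[1]):          # expand the subset by one feature
            if j in selected:
                continue
            acc = loo_1nn_accuracy(X[:, selected + [j]], y)
            if acc > best_acc:
                best_j, best_acc = j, acc
        selected.append(best_j)
    return selected
```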

64 Q: pros and cons of wrapper

65 Q: which feature handling technique is appropriate to your project?

