Feature Selection Methods

Feature Selection Methods
Qiang Yang
MSC IT 5210 Data Mining: Concepts and Techniques

Feature Selection
Also known as dimensionality reduction or subspace learning.
Two types: selecting a subset of the original features (F → F'), or constructing new features from the original ones (F → F').

Motivation
The objective of feature reduction is three-fold:
- Improving the accuracy of classification
- Providing faster and more cost-effective predictors (less CPU time)
- Providing a better understanding of the underlying process that generated the data

Filtering Methods
Assume you have both the features Xi and the class attribute Y.
Associate a weight Wi with each Xi and choose the features with the largest weights. Common choices of weight:
- Information Gain(Xi, Y)
- Mutual Information(Xi, Y)
- Chi-square value of (Xi, Y)
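A minimal sketch of this weight-and-rank filter, assuming scikit-learn is available; the iris dataset and the choice of k = 2 are purely illustrative:

```python
# Filter-style feature selection: weight each feature, keep the top k.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif, chi2

X, y = load_iris(return_X_y=True)

# Weight each feature Xi by its mutual information with the class Y ...
mi_weights = mutual_info_classif(X, y, random_state=0)

# ... or by its chi-square statistic (requires non-negative features).
chi2_weights, _ = chi2(X, y)

# Keep the k features with the largest weights.
k = 2
selector = SelectKBest(score_func=mutual_info_classif, k=k).fit(X, y)
X_reduced = selector.transform(X)

print("MI weights:      ", np.round(mi_weights, 3))
print("Chi-square stats:", np.round(chi2_weights, 1))
print("Selected columns:", selector.get_support(indices=True))
```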

Wrapper Methods
The classifier is treated as a black box (say, KNN).
Loop:
- Choose a subset of features
- Classify the test data using the classifier
- Obtain the error rate
Until the error rate is low enough (below a threshold).
One needs to define:
- How to search the space of all possible feature subsets (see the greedy sketch below)
- How to assess the prediction performance of a learner
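A minimal greedy forward-selection wrapper, assuming scikit-learn is available; the wine dataset, the 5-NN classifier and the "stop when no improvement" rule are illustrative choices, not the slide's exact procedure:

```python
# Wrapper feature selection: greedily add the feature that most improves
# the cross-validated accuracy of the black-box learner.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=5)          # the "black box" learner

selected, remaining = [], list(range(X.shape[1]))
best_score = 0.0
while remaining:
    # Try adding each remaining feature and keep the one that helps most.
    scores = {f: cross_val_score(clf, X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    f_best, s_best = max(scores.items(), key=lambda kv: kv[1])
    if s_best <= best_score:                       # no further improvement
        break
    selected.append(f_best)
    remaining.remove(f_best)
    best_score = s_best

print("Selected features:", selected, "CV accuracy: %.3f" % best_score)
```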

The space of choices is large
Multivariate feature selection implies a search in the space of all possible combinations of features: for n features there are 2^n possible feature subsets (Kohavi & John, 1997). This leads to both high computational and high statistical complexity.
Wrappers use the performance of a learning machine to evaluate each subset. Training 2^n learning machines is infeasible for large n, so most wrapper algorithms resort to greedy or heuristic search.
The statistical complexity of the learning problem is governed by the ratio n/m, where m is the number of examples, but it can be reduced to (log n)/m for regularized algorithms and for forward or backward selection procedures.
Filters work analogously to wrappers, but their evaluation function is something cheaper to compute than the performance of the target learning machine (e.g. a correlation coefficient or the performance of a simpler learning machine). Embedded methods perform the selection during the training of the learning machine itself.

Comparison of filter and wrapper methods for feature selection:
Wrapper method: (+) optimized for the learning algorithm; (-) tied to one classification algorithm and very time-consuming.
Filter method: (+) fast; (-) tied to a statistical measure that is not directly related to the learning objective.

Feature Selection using Chi-Square
Question: are attributes A1 and A2 independent?
If they are highly dependent, they are redundant and we can remove either A1 or A2.
If A1 is independent of the class attribute A2, it carries no information about the class and we can remove A1 from our training data.

Chi-Squared Test (cont.)
Question: are attributes A1 and A2 independent?
These features are nominal-valued (discrete).
Null hypothesis: we expect independence.
Running example: Outlook (Sunny, Cloudy) and Temperature (High, Low).

The Weather example: Observed Counts
Observed counts for the two attributes Outlook (Sunny, Cloudy) and Temperature (High, Low):
Outlook subtotals: Sunny = 2, Cloudy = 1
Temperature subtotals: High = 2, Low = 1
Total count in table = 3

The Weather example: Expected Counts
If the attributes were independent, the cell counts would follow the subtotals (this is the expected-count table). Expected count for each cell = total × (row probability) × (column probability):
- (Sunny, High): 3 × (2/3) × (2/3) = 4/3 ≈ 1.33
- (Sunny, Low): 3 × (2/3) × (1/3) = 2/3 ≈ 0.67
- (Cloudy, High): 3 × (1/3) × (2/3) = 2/3 ≈ 0.67
- (Cloudy, Low): 3 × (1/3) × (1/3) = 1/3 ≈ 0.33
Row subtotals: Sunny = 2 (prob = 2/3), Cloudy = 1 (prob = 1/3); column subtotals: High = 2 (prob = 2/3), Low = 1 (prob = 1/3); total count in table = 3.

Question: how different are the observed counts from the expected counts?
The chi-squared statistic measures this difference: χ² = Σ (Observed - Expected)² / Expected, summed over all cells.
If the χ² value is very large, then A1 and A2 are not independent, that is, they are dependent.
Thus: a large χ² value → attributes A1 and A2 are dependent; a small χ² value → attributes A1 and A2 are independent.

Chi-Squared Table: what does it mean?
If the calculated value is much greater than the value in the table, you have reason to reject the independence assumption.
When your calculated χ² value is greater than the value shown in the 0.05 column of the table (3.84, for one degree of freedom), you can be 95% confident that the attributes are actually dependent, i.e. there is only a 5% probability that a χ² value this large would occur by chance under independence.
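A minimal sketch of this test on the weather example, assuming the three records are (Sunny, High), (Sunny, High) and (Cloudy, Low) and that SciPy is available:

```python
# Chi-squared independence test for the Outlook x Temperature example.
import numpy as np
from scipy.stats import chi2_contingency

# Observed contingency table: rows = Outlook (Sunny, Cloudy),
# columns = Temperature (High, Low). The cell counts are assumed as above.
observed = np.array([[2, 0],
                     [0, 1]])

chi2_stat, p_value, dof, expected = chi2_contingency(observed, correction=False)
print("Expected counts:\n", np.round(expected, 2))   # [[1.33, 0.67], [0.67, 0.33]]
print("chi2 = %.2f, dof = %d, p = %.3f" % (chi2_stat, dof, p_value))
# A chi2 value above 3.84 (the 0.05 critical value for 1 degree of freedom)
# would let us reject independence; with only 3 records the test has little power.
```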

Principal Component Analysis (PCA)
See online tutorials such as http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
[Figure: data points in the (X1, X2) plane with rotated axes Y1 and Y2. Y1 is the first eigenvector, Y2 is the second; Y2 is ignorable.]
Key observation: the first axis Y1 is the direction along which the variance is largest.

Principal Component Analysis (PCA)
PCA projects the data onto the subspace with the most variance. It is unsupervised: it does not take the class label y into account.

Principal Component Analysis: one attribute first
Temperature values: 42, 40, 24, 30, 15, 18, 35
Question: how much spread is there in the data along this axis, i.e. how far are the values from the mean?
Variance = (standard deviation)², e.g. the sample variance Var(X) = Σᵢ (xᵢ - x̄)² / (n - 1).
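A quick check of this spread measure on the temperature values above, assuming the sample variance (division by n - 1) is intended:

```python
# Spread of a single attribute: mean, variance, standard deviation.
import numpy as np

temperature = np.array([42, 40, 24, 30, 15, 18, 35], dtype=float)
print("mean     = %.2f" % temperature.mean())
print("variance = %.2f" % temperature.var(ddof=1))    # = (std dev)^2
print("std dev  = %.2f" % temperature.std(ddof=1))
```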

Now consider two dimensions
X = Temperature (e.g. 40, 30, 15, ...), Y = Humidity (e.g. 90, 70, ...).
Covariance measures how X and Y vary together:
- cov(X, Y) = 0: uncorrelated (no linear relationship)
- cov(X, Y) > 0: X and Y tend to move in the same direction
- cov(X, Y) < 0: X and Y tend to move in opposite directions
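A minimal covariance sketch; the temperature/humidity pairs below are made up for illustration, since the slide's exact pairs are not fully recoverable:

```python
# Sample covariance between two attributes.
import numpy as np

temperature = np.array([40.0, 30.0, 15.0, 25.0, 35.0])   # illustrative values
humidity    = np.array([90.0, 70.0, 40.0, 60.0, 80.0])   # illustrative values

cov_xy = np.cov(temperature, humidity)[0, 1]   # sample covariance (divides by n-1)
print("cov(X, Y) = %.1f" % cov_xy)
# cov > 0: X and Y tend to move in the same direction;
# cov < 0: opposite directions; cov close to 0: no linear relationship.
```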

More than two attributes: the covariance matrix
The covariance matrix contains the covariance between every pair of dimensions (attributes). For three attributes (x, y, z):

C = [ cov(x,x)  cov(x,y)  cov(x,z) ]
    [ cov(y,x)  cov(y,y)  cov(y,z) ]
    [ cov(z,x)  cov(z,y)  cov(z,z) ]

Background: eigenvalues and eigenvectors
An eigenvector e of C with eigenvalue λ satisfies C e = λ e.
How to calculate e and λ:
- Compute det(C - λI); this yields a polynomial of degree n in λ
- The roots of det(C - λI) = 0 are the eigenvalues λ
Check any linear algebra book, such as Elementary Linear Algebra by Howard Anton (John Wiley & Sons), or use a math package such as MATLAB.
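A small NumPy check of the definition C e = λ e; the 2 × 2 matrix below is illustrative:

```python
# Eigen-decomposition of a small symmetric matrix, verifying C e = lambda * e.
import numpy as np

C = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigvals, eigvecs = np.linalg.eigh(C)      # eigh: for symmetric matrices
for lam, e in zip(eigvals, eigvecs.T):    # eigenvectors are the columns
    print("lambda = %.2f, e =" % lam, np.round(e, 3),
          "residual ||C e - lambda e|| = %.1e" % np.linalg.norm(C @ e - lam * e))
```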

An Example
mean1 = 24.1 (mean of X1), mean2 = 53.8 (mean of X2); X1' and X2' are the mean-centred values.

X1    X2    X1'     X2'
19    63    -5.1     9.25
39    74    14.9    20.25
30    87     5.9    33.25
      23            -30.75
15    35    -9.1   -18.75
      43           -10.75
      32           -21.75
      73            19.25

Covariance Matrix

C = [  75   106 ]
    [ 106   482 ]

Using MATLAB, we find the eigenvectors and eigenvalues:
e1 = (-0.98, 0.21), λ1 = 51.8
e2 = ( 0.21, 0.98), λ2 = 560.2
Since λ2 is much larger, the second eigenvector is the more important one!

If we only keep one dimension: e2
We keep the direction e2 = (0.21, 0.98) and obtain the final one-dimensional data by projecting each mean-centred example onto e2:
yᵢ: -10.14, -16.72, -31.35, 31.374, 16.464, 8.624, 19.404, -17.63

Summary of PCA
PCA is used for reducing the number of numerical attributes.
The key is the data transformation:
- Adjust the data by subtracting the mean
- Find the eigenvectors of the covariance matrix
- Transform the data onto the leading eigenvectors
Note: PCA produces only linear combinations of the data (weighted sums of the original attributes). A minimal end-to-end sketch follows.
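A minimal end-to-end sketch of these three steps in NumPy; the random data matrix and the choice of k = 2 components are illustrative:

```python
# PCA from scratch: centre, eigen-decompose the covariance matrix, project.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.5, 0.2, 0.1]])   # correlated data

# 1. Adjust the data by subtracting the mean.
X_centred = X - X.mean(axis=0)

# 2. Find the eigenvectors of the covariance matrix.
C = np.cov(X_centred, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)          # ascending eigenvalues
order = np.argsort(eigvals)[::-1]             # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 3. Transform the data: keep the k leading components (linear combinations only).
k = 2
X_reduced = X_centred @ eigvecs[:, :k]

print("explained variance ratio:", np.round(eigvals[:k] / eigvals.sum(), 3))
print("reduced shape:", X_reduced.shape)
```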

Linear Method: Linear Discriminant Analysis (LDA)
LDA finds the projection that best separates the two classes.
Multiple discriminant analysis (MDA) extends LDA to more than two classes.
[Figure: two class clusters and the best projection direction for classification.]
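A minimal two-class Fisher LDA sketch in NumPy, using the classical closed form w proportional to Sw^{-1}(mu1 - mu2); the Gaussian blobs are illustrative data, not from the slides:

```python
# Two-class Fisher LDA: find the direction that best separates the classes.
import numpy as np

rng = np.random.default_rng(1)
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))   # class 1
X2 = rng.normal(loc=[3.0, 2.0], scale=1.0, size=(50, 2))   # class 2

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
# Within-class scatter matrix (sum of per-class scatter matrices).
Sw = np.cov(X1, rowvar=False) * (len(X1) - 1) + np.cov(X2, rowvar=False) * (len(X2) - 1)

w = np.linalg.solve(Sw, mu1 - mu2)       # best separating direction
w /= np.linalg.norm(w)

# Project both classes onto w: the 1-D projections should be well separated.
print("class means along w: %.2f vs %.2f" % ((X1 @ w).mean(), (X2 @ w).mean()))
```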

PCA vs. LDA
PCA is unsupervised, while LDA is supervised.
PCA can extract up to r principal features, where r is the rank of the data, while LDA can find at most (c - 1) features, where c is the number of classes.
Both can be computed using SVD techniques.
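A short side-by-side sketch of this comparison, assuming scikit-learn; on iris (4 features, 3 classes) LDA can return at most c - 1 = 2 components:

```python
# PCA (unsupervised) vs. LDA (supervised) on the same dataset.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)                    # 4 features, 3 classes
X_pca = PCA(n_components=2).fit_transform(X)         # ignores y
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # uses y
print(X_pca.shape, X_lda.shape)                      # (150, 2) (150, 2)
```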