Machine Learning Feature Creation and Selection

Feature creation
Well-conceived new features can sometimes capture the important information in a dataset much more effectively than the original features. Three general methodologies:
- Feature extraction: domain-specific; typically results in a significant reduction in dimensionality
- Mapping existing features to a new space
- Feature construction: combining existing features

Scale-invariant feature transform (SIFT)
Image content is transformed into local feature coordinates that are invariant to translation, rotation, scale, and other imaging parameters.
[Figure: examples of SIFT features detected in an image]
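
As a concrete illustration, here is a minimal sketch of extracting SIFT features with OpenCV, which exposes SIFT as cv2.SIFT_create in recent versions; the image path is a placeholder and not part of the original slides.

```python
import cv2

# Load an image in grayscale (the path is a placeholder)
img = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE)
assert img is not None, "replace 'example.jpg' with a real image path"

# Detect SIFT keypoints and compute their 128-dimensional descriptors
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)

print(f"{len(keypoints)} keypoints, descriptor matrix shape {descriptors.shape}")
```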

Extraction of power bands from EEG
1. Select a time window.
2. Fourier transform each EEG channel to give the corresponding channel power spectrum.
3. Segment the power spectrum into bands.
4. Create a channel-band feature by summing the values in each band.
[Figure: time window of a multi-channel EEG recording (time domain) and the corresponding multi-channel power spectrum (frequency domain)]
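
A minimal NumPy sketch of this procedure, assuming a single (channels x samples) window of EEG, a 256 Hz sampling rate, and example theta/alpha/beta band edges; none of these specifics come from the slides.

```python
import numpy as np

# Assumed: eeg is (n_channels, n_samples) for one time window, sampled at fs Hz
fs = 256
n_channels, n_samples = 8, fs * 2               # e.g. a 2-second window
eeg = np.random.randn(n_channels, n_samples)    # stand-in for real data

# Power spectrum of each channel via the FFT
freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
power = np.abs(np.fft.rfft(eeg, axis=1)) ** 2   # (n_channels, n_freqs)

# Example band definitions in Hz (theta, alpha, beta); adjust as needed
bands = {"theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}

# One feature per (channel, band): sum of spectral power inside the band
features = {
    name: power[:, (freqs >= lo) & (freqs < hi)].sum(axis=1)
    for name, (lo, hi) in bands.items()
}
print({name: vals.shape for name, vals in features.items()})   # each is (n_channels,)
```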

Map existing features to new space
The Fourier transform eliminates noise that is present in the time domain.
[Figure: two sine waves; two sine waves + noise; their frequency spectrum]
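
A small sketch of the idea, assuming two arbitrary sine frequencies (50 Hz and 120 Hz) and Gaussian noise: the two components that are hard to see in the noisy time-domain signal appear as the two dominant peaks in the frequency domain.

```python
import numpy as np

fs, duration = 1000, 1.0                     # sampling rate (Hz), signal length (s)
t = np.arange(0, duration, 1.0 / fs)

clean = np.sin(2 * np.pi * 50 * t) + np.sin(2 * np.pi * 120 * t)   # two sine waves
noisy = clean + 1.5 * np.random.randn(t.size)                      # two sine waves + noise

freqs = np.fft.rfftfreq(t.size, d=1.0 / fs)
spectrum = np.abs(np.fft.rfft(noisy))

# The two strongest frequency components stand out despite the time-domain noise
top2 = freqs[np.argsort(spectrum)[-2:]]
print(sorted(top2))   # typically close to [50.0, 120.0]
```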

Attribute transformation: simple functions
Examples of transform functions: x^k, log(x), e^x, |x|
Often used to make the data more like some standard distribution, to better satisfy the assumptions of a particular algorithm. Example: discriminant analysis explicitly models each class distribution as a multivariate Gaussian.
[Figure: effect of the log(x) transform]
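
A short sketch of these transforms in NumPy; the right-skewed synthetic sample and the choice k = 2 are assumptions for illustration.

```python
import numpy as np

x = np.random.lognormal(mean=0.0, sigma=1.0, size=1000)   # right-skewed positive data

x_pow = x ** 2          # x^k with k = 2
x_log = np.log(x)       # log(x), only valid for x > 0
x_exp = np.exp(x)       # e^x
x_abs = np.abs(x)       # |x|

# The log transform makes this skewed sample much closer to a normal distribution
print(f"raw skewed mean={x.mean():.2f}, log-transformed mean={x_log.mean():.2f}")
```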

Feature subset selection
Reduces the dimensionality of the data without creating new features. Motivations:
- Redundant features: highly correlated features contain duplicate information (example: purchase price and sales tax paid)
- Irrelevant features: contain no information useful for discriminating the outcome (example: student ID number does not predict student's GPA)
- Noisy features: signal-to-noise ratio too low to be useful for discriminating the outcome (example: high random measurement error on an instrument)

Feature subset selection: benefits
- Alleviate the curse of dimensionality
- Enhance generalization
- Speed up the learning process
- Improve model interpretability

Curse of dimensionality
As the number of features increases:
- The volume of the feature space increases exponentially.
- Data becomes increasingly sparse in the space it occupies.
- Sparsity makes it difficult to achieve statistical significance for many methods.
- Definitions of density and distance (critical for clustering and other methods) become less useful: all distances start to converge to a common value.

Curse of dimensionality
Illustration: randomly generate 500 points, then compute the difference between the maximum and minimum distance between any pair of points as the number of dimensions grows.
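
A sketch reproducing this experiment, assuming points drawn uniformly from the unit cube and an arbitrary set of dimensions; the relative gap between the farthest and closest pair shrinks as the dimension grows.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

for d in (2, 10, 50, 200):
    points = rng.random((500, d))   # 500 points uniform in the d-dimensional unit cube
    dists = pdist(points)           # all pairwise Euclidean distances
    # Relative contrast between farthest and closest pair shrinks as d grows
    print(f"d={d:4d}  (max - min) / min = {(dists.max() - dists.min()) / dists.min():.2f}")
```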

Approaches to feature subset selection
- Filter approaches: features are selected before the machine learning algorithm is run.
- Wrapper approaches: use the machine learning algorithm as a black box to find the best subset of features.
- Embedded approaches: feature selection occurs naturally as part of the machine learning algorithm (example: L1-regularized linear regression).
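
A minimal sketch of the embedded case, using scikit-learn's Lasso on a synthetic dataset (the data, the alpha value, and which features are informative are assumptions): the L1 penalty drives the coefficients of uninformative features to exactly zero during training, so the surviving features are effectively selected.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = 3 * X[:, 0] - 2 * X[:, 3] + 0.1 * rng.standard_normal(200)   # only features 0 and 3 matter

model = Lasso(alpha=0.1).fit(X, y)

# Nonzero coefficients identify the features kept by the L1 penalty
selected = np.flatnonzero(model.coef_)
print("selected feature indices:", selected)   # typically [0 3]
```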

Approaches to feature subset selection
Both filter and wrapper approaches require:
- A way to measure the predictive quality of a subset
- A strategy for searching the possible subsets; exhaustive search is usually infeasible, since the search space is the power set (2^d subsets for d features)

Filter approaches
Most common search strategy:
1. Score each feature individually for its ability to discriminate the outcome.
2. Rank features by score.
3. Select the top k ranked features.
Common scoring metrics for individual features:
- t-test or ANOVA (continuous features)
- chi-square test (categorical features)
- Gini index
- etc.
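
A minimal sketch of this strategy with scikit-learn's SelectKBest, using the ANOVA F-score; the synthetic dataset and k = 5 are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=300, n_features=20, n_informative=4, random_state=0)

# Score every feature independently with an ANOVA F-test, then keep the top k
selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)
print("per-feature scores:", selector.scores_.round(1))
print("kept feature indices:", selector.get_support(indices=True))
```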

Filter approaches
Other strategies look at interaction among features:
- Eliminate features based on correlation between pairs of features.
- Eliminate features based on the statistical significance of individual coefficients from a linear model fit to the data (example: t-statistics of individual coefficients from linear regression).
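
A sketch of the first strategy above, dropping one member of each highly correlated feature pair with pandas; the synthetic data and the 0.95 correlation threshold are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((200, 4)), columns=list("abcd"))
df["e"] = df["a"] * 0.98 + 0.05 * rng.standard_normal(200)   # nearly duplicates column "a"

# Absolute correlations, upper triangle only, so each pair is considered once
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]

print("dropping:", to_drop)          # typically ['e']
reduced = df.drop(columns=to_drop)
```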

Wrapper approaches
Most common search strategies:
- Random selection
- Forward selection
- Backward elimination
Scoring uses some chosen machine learning algorithm: each feature subset is scored by training the model using only that subset, then assessing accuracy in the usual way (e.g. cross-validation).

Forward selection
Assume d features available in the dataset: |FUnsel| == d
Optional: target number of selected features k
Set of selected features initially empty: FSel = ∅
Best feature set score initially 0: scoreBest = 0
Do
    Best next feature initially null: FBest = ∅
    For each feature F in FUnsel
        Form a trial set of features FTrial = FSel + F
        Run wrapper algorithm, using only features FTrial
        If score( FTrial ) > scoreBest
            FBest = F; scoreBest = score( FTrial )
    If FBest ≠ ∅
        FSel = FSel + FBest; FUnsel = FUnsel – FBest
Until FBest == ∅ or FUnsel == ∅ or |FSel| == k
Return FSel
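
A Python sketch of the pseudocode above, using the cross-validated score of a user-supplied scikit-learn estimator as the wrapper score; the function name, defaults, and the example estimator in the usage comment are illustrative, not from the slides.

```python
from sklearn.model_selection import cross_val_score

def forward_selection(estimator, X, y, k=None, cv=5):
    """Greedy forward selection; returns the list of selected column indices."""
    n_features = X.shape[1]
    k = n_features if k is None else k
    selected, unselected = [], list(range(n_features))
    best_score = 0.0

    while unselected and len(selected) < k:
        best_feature, best_trial_score = None, best_score
        for f in unselected:
            trial = selected + [f]
            # Wrapper score: cross-validated performance using only the trial subset
            score = cross_val_score(estimator, X[:, trial], y, cv=cv).mean()
            if score > best_trial_score:
                best_feature, best_trial_score = f, score
        if best_feature is None:        # no remaining feature improves the score: stop
            break
        selected.append(best_feature)
        unselected.remove(best_feature)
        best_score = best_trial_score

    return selected

# Example usage (estimator and data are assumed):
# from sklearn.linear_model import LogisticRegression
# chosen = forward_selection(LogisticRegression(max_iter=1000), X, y, k=10)
```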

Random selection
Number of features available in dataset: d
Target number of selected features: k
Target number of random trials: T
Set of selected features initially empty: FSel = ∅
Best feature set score initially 0: scoreBest = 0
Number of trials conducted initially 0: t = 0
Do
    Choose trial subset of features FTrial randomly from the full set of d available features, such that |FTrial| == k
    Run wrapper algorithm, using only features FTrial
    If score( FTrial ) > scoreBest
        FSel = FTrial; scoreBest = score( FTrial )
    t = t + 1
Until t == T
Return FSel
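
A matching sketch of the random-selection wrapper, under the same assumptions as the forward-selection sketch above (cross-validated score of an assumed scikit-learn estimator).

```python
import numpy as np
from sklearn.model_selection import cross_val_score

def random_selection(estimator, X, y, k, n_trials, cv=5, seed=0):
    """Score n_trials random subsets of size k; return the best subset found."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    best_subset, best_score = None, 0.0

    for _ in range(n_trials):
        trial = rng.choice(n_features, size=k, replace=False)
        score = cross_val_score(estimator, X[:, trial], y, cv=cv).mean()
        if score > best_score:
            best_subset, best_score = trial, score

    return best_subset, best_score
```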

Other wrapper approaches
If d and k are not too large, it is feasible to check all possible subsets of size k. This is essentially the same as random selection, but done exhaustively.
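
A sketch of this exhaustive variant using itertools.combinations, again with a cross-validated score as the assumed wrapper criterion; it is only practical when the number of size-k subsets, C(d, k), is small.

```python
from itertools import combinations
from sklearn.model_selection import cross_val_score

def exhaustive_selection(estimator, X, y, k, cv=5):
    """Evaluate every subset of exactly k features; feasible only for small d and k."""
    best_subset, best_score = None, 0.0
    for trial in combinations(range(X.shape[1]), k):
        score = cross_val_score(estimator, X[:, list(trial)], y, cv=cv).mean()
        if score > best_score:
            best_subset, best_score = trial, score
    return best_subset, best_score
```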