COMP61011 Foundations of Machine Learning Feature Selection


1 COMP61011 Foundations of Machine Learning Feature Selection

2 Only 200 papers in the world! I wish!

3

4

5 Square Kilometre Array (due 2024)
World’s largest radio telescope array. One terabyte of data per second. Need to classify stellar objects in real time.

6 Supervised Learning

7 Supervised Learning
The standard supervised learning scenario: training data + labels (possibly high dimensional) → Model → label prediction for a test input.

8 High Dimensional Data (this is real, on a USB stick on my desk – 41,672 features, 59 patients)

9 Supervised Learning
Training data + labels → Model → label prediction for a test input.

10 Supervised Learning + Feature Selection
Training data + labels → select a subset of features (i.e. columns) → Model → label prediction for a test input.

11 The “Wrapper” approach
“You want to build a model… so just do it.” Try a feature set → Model → evaluate the model. Can we just do an exhaustive search…? A bit set to 1 means we use that feature, otherwise 0 … so the bit string shown uses 8 features. With M total features there are 2^M possible sets! 20 features … about 1 million feature sets to check. 25 features … 33.5 million sets. 30 features … 1.1 billion sets.
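A quick check of the counts quoted above (my own snippet, not part of the slides): each of M features is either in or out of the subset, giving 2^M candidate sets, so the numbers grow exactly as stated.

# Each of M features is either in or out of the subset: 2**M candidate sets.
for M in (20, 25, 30):
    print(f"M={M}: {2**M:,} feature subsets")
# M=20: 1,048,576   M=25: 33,554,432   M=30: 1,073,741,824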

12 The “Wrapper” approach
“You want to build a model… so just do it.” Simplest strategy: greedy search (still the same loop: try a feature set → Model → evaluate the model; a code sketch follows below).
REPEAT:
1. Try out each of the remaining features with your model.
2. Add the “best” one.
UNTIL satisfied with accuracy/error.
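A minimal sketch of this greedy wrapper loop (not from the slides; the choice of scikit-learn's LogisticRegression and cross_val_score as the "model" and "evaluate" steps is an illustrative assumption):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def greedy_forward_selection(X, y, max_features=10):
    """Wrapper-style greedy forward search: repeatedly add the single
    feature whose addition gives the best cross-validated accuracy."""
    remaining = list(range(X.shape[1]))
    selected, best_score = [], -np.inf
    while remaining and len(selected) < max_features:
        trials = []
        for f in remaining:
            cols = selected + [f]
            model = LogisticRegression(max_iter=1000)
            trials.append((cross_val_score(model, X[:, cols], y, cv=5).mean(), f))
        score, f = max(trials)
        if score <= best_score:      # UNTIL: stop when accuracy no longer improves
            break
        best_score, selected = score, selected + [f]
        remaining.remove(f)
    return selected, best_score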

13

14 Visualising the search space…
Greedy forward search evaluates far fewer sets than exhaustive search – at most M + (M−1) + (M−2) + … = M(M+1)/2 of the 2^M possible sets.

15

16 Maybe we cannot, or don’t want to, build a classifier.
How inherently “useful” is a feature?

17 Can we say how “useful” a feature is?
Imagine you’re trying to guess the price of a car. Relevant : engine size, age, mileage, presence of rust, … Irrelevant : color of windscreen wipers, size of wheels, stickers on window, … Redundant : age / mileage.

18 “Filters”

19 Relevancy = Correlation?
How often have you heard the phrase “X is correlated with Y” ?

20

21

22 All these have r = 0.81.
… Pearson only detects LINEAR relationships.
… and it is only for one feature (“univariate”).
… and it assumes two real-valued variables.
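A small illustration (my own example, not on the slides) of the first point: a relationship can be perfectly deterministic yet have a Pearson correlation of essentially zero if it is non-linear.

import numpy as np

x = np.linspace(-1, 1, 201)
y = x ** 2                      # y is completely determined by x, but not linearly
r = np.corrcoef(x, y)[0, 1]
print(round(r, 6))              # ~0.0 : Pearson sees no (linear) relationship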

23

24 How about a classification problem?
Let’s use a simple “threshold” on variable X. Each point is a person in your database: green stars = “good” health, red circles = “bad” health. Useful feature – it “discriminates” very well.

25 How about a classification problem?
Let’s use a simple “threshold” on variable X. Each point is a person in your database: green stars = “good” health, red circles = “bad” health. No useful threshold! The feature is not “discriminative”.
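A minimal sketch (not from the slides) of scoring a single feature by the best simple threshold, as in these two pictures:

import numpy as np

def best_threshold_accuracy(x, y):
    """Score one real-valued feature x against binary labels y (0/1) by the
    accuracy of the best rule of the form 'predict 1 if x > t' (either polarity)."""
    best = 0.0
    for t in np.unique(x):
        pred = (x > t).astype(int)
        acc = max(np.mean(pred == y), np.mean(pred != y))   # try both polarities
        best = max(best, acc)
    return best

A discriminative feature gives an accuracy near 1; a non-discriminative one stays near 0.5.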

26 Fisher Score: F = (m1 – m2)² / (v1 + v2), where m1, m2 are the two class means and v1, v2 the class variances of the feature.
(m1 – m2)² is called the between-class scatter – BIG for good features. v1 + v2 is called the within-class scatter – SMALL for good features.

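A minimal numpy sketch (not from the slides) of the Fisher score for every feature of a two-class dataset:

import numpy as np

def fisher_scores(X, y):
    """F = (m1 - m2)^2 / (v1 + v2) per column of X, given labels y with two classes."""
    c1, c2 = np.unique(y)
    X1, X2 = X[y == c1], X[y == c2]
    between = (X1.mean(axis=0) - X2.mean(axis=0)) ** 2   # between-class scatter
    within = X1.var(axis=0) + X2.var(axis=0)             # within-class scatter
    return between / within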

28 How useful is a single measurement?
Imagine a feature, with values ranging from small to big. (Figure from Guyon & Elisseeff, An Introduction to Variable and Feature Selection, Journal of Machine Learning Research, 2003.)

29 Considering features together…
(Figure: two features plotted jointly, each ranging from small to big values; from Guyon & Elisseeff, An Introduction to Variable and Feature Selection, Journal of Machine Learning Research, 2003.)

30 Two irrelevant features may be relevant together
(Figure from Guyon & Elisseeff, An Introduction to Variable and Feature Selection, Journal of Machine Learning Research, 2003.)

31 How useful is a feature? Need some kind of “dependency” measure…
e.g. Pearson’s correlation … but assumes linearity. Fisher score … but assumes Gaussianity. And both ignore feature interactions.

32

33 Mutual Information
Measures the dependency of X and Y: zero when they are independent, maximal when they are identical.
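The slides do not spell out the formula; for discrete variables it is I(X;Y) = Σ p(x,y) log[ p(x,y) / (p(x) p(y)) ], summed over all values x, y. A minimal plug-in estimate from samples (my own sketch, not from the slides):

import numpy as np

def mutual_information(x, y):
    """Plug-in estimate of I(X;Y) in nats for two discrete-valued sample vectors."""
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))          # joint probability
            px, py = np.mean(x == xv), np.mean(y == yv)   # marginals
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi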

34 “Filter” methods: Three Ingredients
1. Dependency measure – J(X;Y) is the dependency criterion (e.g. Pearson’s correlation, Fisher score, Mutual Information). Given a feature X and target Y, compute J(X;Y) (say 0.6) and decide: select or discard?
2. Search procedure – iteratively add/remove features: select the most relevant features, discard the irrelevant ones, maintaining a selected set S.
3. Stopping criterion – I have a bag of features S which I add to or remove from according to the criterion, until a stop criterion is met.
A sketch of this loop follows below.
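A minimal sketch (not from the slides) of the simplest filter built from these ingredients: the dependency measure J is any univariate relevance score (Pearson, Fisher score, or the mutual_information sketch above), the "search" is a ranking, and the stopping criterion is a fixed number k of features:

import numpy as np

def filter_selection(X, y, J, k=10):
    """Score each column of X with the dependency criterion J(feature, labels)
    and keep the k most relevant features."""
    scores = np.array([J(X[:, i], y) for i in range(X.shape[1])])
    selected = np.argsort(scores)[::-1][:k]    # indices of the top-k scores
    return list(selected), scores[selected]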

35

36 Feature Selection
Useful to: reduce the chance of overfitting, reduce computational complexity at test time, and increase interpretability. Many methods: Wrappers vs Filters (pros and cons of each), and many variants of filters.

37 Projects due next Friday, 4pm
This is the end of COMP61011. That’s it. We’re done. Exam in January – past papers on the website. Projects are due next Friday, 4pm. You need to submit a hardcopy to SSO: your 6-page (maximum) report. You also need to send the report as a PDF, and a ZIP file of your code.

