CS 760 – Machine Learning (UW-Madison): Presentation Transcript

1 CS 760 – Machine Learning (UW-Madison)
Course Instructor: David Page. Office: MSC 6743 (University & Charter). Hours: 1pm Tuesdays and Fridays.
Teaching Assistant: Nathanael Fillmore. Office: CS 3379. Hours: 8:50am Mondays.

2 Textbooks & Reading Assignment
Machine Learning (Tom Mitchell), plus selected on-line readings. Read in Mitchell: Preface; Chapter 1; Sections 2.1 and 2.2; Chapter 8; Chapter 3 (for the next lecture).

3 Monday, Wednesday, and Friday?
We’ll meet 30 times this term (this may or may not include the exam). We’ll meet on FRIDAY this week and next, in order to cover material for HW 1 (plus I have some business travel this term). Default: we will NOT meet on Fridays unless I announce it (with at least one week’s notice).

4 Course "Style"
Primarily algorithmic & experimental, with some theory, both mathematical & conceptual (much on statistics). "Hands on" experience, interactive lectures/discussions. A broad survey of many ML subfields, including:
  "symbolic" (rules, decision trees, ILP)
  "connectionist" (neural nets)
  support vector machines, nearest-neighbors
  theoretical ("COLT")
  statistical ("Bayes rule")
  reinforcement learning, genetic algorithms

5 Two Major Goals
To understand what a learning system should do, and to understand how (and how well) existing systems work. This covers issues in algorithm design and choosing algorithms for applications.

6 Background Assumed
Languages: Java (see the CS 368 tutorial online); C or C++ are OK.
AI topics: search, FOPC, unification, formal deduction.
Math: calculus (partial derivatives), simple probability & statistics.
No previous ML experience is assumed (so there is some overlap with CS 540).

7 Requirements
Some written and programming HWs ("hands on" experience is valuable): HW0 – build a dataset; HW1 – experimental methodology. I’m updating the website as we go, so please wait for me to assign HWs in class. A "midterm" exam (in class, about 90% of the way through the semester). A project of your choosing during the last 4-5 weeks of class.

8 Grading
HW's %
Exam 40%
Project 25%

9 Late HW's Policy
HW's are due at 2:30pm. You have 5 late days to use over the semester (Fri 4pm → Mon 4pm counts as 1 late "day"), so SAVE UP late days! Extensions are granted only in extreme cases. Penalty points apply after late days are exhausted, and no HW can be more than ONE WEEK late.

10 Academic Misconduct (also on course homepage)
All examinations, programming assignments, and written homeworks must be done individually. Cheating and plagiarism will be dealt with in accordance with University procedures (see the Academic Misconduct Guide for Students). Hence, for example, code for programming assignments must not be developed in groups, nor should code be shared. You are encouraged to discuss ideas, approaches, and techniques broadly with your peers, the TAs, or the instructor, but not at a level of detail where specific implementation issues are described by anyone. If you have any questions on this, please ask the instructor before you act.

11 A Few Examples of Machine Learning
Movie recommenders (the Netflix prize… ensembles)
Your spam filter (probably naïve Bayes)
Google, Microsoft, and Yahoo
Predictive models for medicine (e.g., see news on Health Discovery Corporation and SVMs)
Wall Street (e.g., Rebellion Research)
Speech recognition (hidden Markov models) and natural language translation
Identifying the proteins of an organism from its genome (also using HMMs… see CS/BMI 576)
Many examples in scientific data analysis…

12 Some Quotes (taken from P. Domingos’ ML class notes at U-Washington)
"A breakthrough in machine learning would be worth ten Microsofts" – Bill Gates, Chairman, Microsoft
"Machine learning is the next Internet" – Tony Tether, previous Director, DARPA
"Machine learning is the hot new thing" – John Hennessy, President, Stanford
"Web rankings today are mostly a matter of machine learning" – Prabhakar Raghavan, Director of Research, Yahoo
"Machine learning is going to result in a real revolution" – Greg Papadopoulos, CTO, Sun
"Machine learning is today’s discontinuity" – Jerry Yang, founder and former CEO, Yahoo

13 What Do You Think Learning Means?

14 What is Learning?
"Learning denotes changes in the system that … enable the system to do the same task … more effectively the next time." – Herbert Simon
"Learning is making useful changes in our minds." – Marvin Minsky

15 Today's Topics
Memorization as Learning
Feature Space
Supervised ML
K-NN (K-Nearest Neighbor)

16 Memorization (Rote Learning)
Employed by the first machine learning systems, in the 1950s: Samuel’s checkers program; Michie’s MENACE (Matchbox Educable Noughts and Crosses Engine). Prior to these, some people believed computers could not improve at a task with experience.

17 Rote Learning is Limited
Memorize I/O pairs and perform exact matching with new inputs. If the computer has not seen the precise case before, it cannot apply its experience. We want the computer to "generalize" from prior experience.

18 Some Settings in Which Learning May Help
Given an input, what is the appropriate response (output/action)?
Game playing – board state / move
Autonomous robots (e.g., driving a vehicle) – world state / action
Video game characters – state / action
Medical decision support – symptoms / treatment
Scientific discovery – data / hypothesis
Data mining – database / regularity

19 Broad Paradigms of Machine Learning
Inducing Functions from I/O Pairs:
  Decision trees (e.g., Quinlan’s C4.5 [1993])
  Connectionism / neural networks (e.g., backprop)
  Nearest-neighbor methods
  Genetic algorithms
  SVMs
Learning without Feedback/Teacher:
  Conceptual clustering
  Self-organizing systems
  Discovery systems
(The latter group is not in Mitchell’s textbook; covered in CS 776.)

20 IID
We are assuming examples are IID: independent and identically distributed. E.g., we are ignoring temporal dependencies (covered in time-series learning). E.g., we assume the learner has no say in which examples it gets (covered in active learning).

21 Supervised Learning Task Overview
Real World → Feature Selection (usually done by humans; HW 0) → Feature Space → Classification Rule Construction (done by the learning algorithm; HW 1-3) → Concepts/Classes/Decisions

22 Empirical Learning: Task Definition
Given: a collection of positive examples of some concept/class/category (i.e., members of the class) and, possibly, a collection of negative examples (i.e., non-members).
Produce: a description that covers (includes) all/most of the positive examples and none/few of the negative examples and, hopefully, properly categorizes most future examples (the key point!).
Note: one can easily extend this definition to handle more than two classes.

23 Example
[Figure: a page of positive-example and negative-example figures.] Candidate concept: a solid red circle in a (regular?) polygon. How does this symbol classify? What about the figures on the left side of the page?

24 Concept Learning
Learning systems differ in how they represent concepts. From the same training examples:
  Backpropagation → a neural net
  C4.5, CART → a decision tree
  AQ, FOIL → rules, e.g. Φ ← X ∧ Y; Φ ← Z
  SVMs → a numeric threshold, e.g. if 5x1 + 9x2 – 3x3 > 12 then +

25 Feature Space
If examples are described in terms of values of features, they can be plotted as points in an N-dimensional space. [Figure: a 3D feature space with axes Size, Color, and Weight; a query point at (Big, Gray, 2500) is marked "?".] A "concept" is then a (possibly disjoint) volume in this space.

26 Learning from Labeled Examples
Most common and successful form of ML. [Venn diagram of + and - points in a feature space.] Examples: points in a multi-dimensional "feature space". Concepts: "functions" that label every point in feature space (as +, -, and possibly ?).

27 Brief Review
Instances
Conjunctive concept ("and"): Color(?obj1, red) ∧ Size(?obj1, large)
Disjunctive concept ("or"): Color(?obj2, blue) ∨ Size(?obj2, small)
More formally, a "concept" is of the form: ∀x ∀y ∀z F(x, y, z) → Member(x, Class1)

28 Empirical Learning and Venn Diagrams
[Venn diagram: a feature space of labeled + and - points, with two clusters of + points circled as regions A and B.] Concept = A or B (a disjunctive concept). Examples = labeled points in feature space. Concept = a label for a set of points.

29 Aspects of an ML System
A "language" for representing classified examples (HW 0)
A "language" for representing "concepts"
A technique for producing a concept "consistent" with the training examples (the other HW's)
A technique for classifying new instances
Each of these limits the expressiveness/efficiency of the supervised learning algorithm.

30 Nearest-Neighbor Algorithms
(aka exemplar models, instance-based learning (IBL), case-based learning) Learning ≈ memorize training examples. Problem solving = find the most similar example in memory; output its category. [Venn diagram: + and - points with a query point "?"; see "Voronoi diagrams" (pg 233).]

31 Simple Example: 1-NN
(1-NN ≡ one nearest neighbor)
Training set:
  Ex 1: a=0, b=0, c=1 → +
  Ex 2: a=0, b=0, c=0 → -
  Ex 3: a=1, b=1, c=1 → -
Test example: a=0, b=1, c=0 → ?
"Hamming distance" from the test example: Ex 1 = 2, Ex 2 = 1, Ex 3 = 2. Ex 2 is nearest, so output -.
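The slide's computation is easy to check in code. A minimal Python sketch (the three training examples and the test example come from the slide; everything else is illustrative):

    # 1-NN with Hamming distance over the slide's Boolean features (a, b, c)
    def hamming(x, y):
        # Count the feature positions where the two examples disagree
        return sum(xi != yi for xi, yi in zip(x, y))

    train = [((0, 0, 1), '+'), ((0, 0, 0), '-'), ((1, 1, 1), '-')]
    test = (0, 1, 0)

    # Output the category of the single nearest training example
    features, label = min(train, key=lambda ex: hamming(ex[0], test))
    print(label)  # '-' (Ex 2, at distance 1)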

32 Sample Experimental Results (see UCI archive for more)
Testset correctness:
  Testbed           | 1-NN | D-Trees | Neural Nets
  Wisconsin Cancer  | 98%  | 95%     | 96%
  Heart Disease     | 78%  | 76%     | ?
  Tumor             | 37%  | 38%     | ?
  Appendicitis      | 83%  | 85%     | 86%
A simple algorithm works quite well!

33 K-NN Algorithm
Collect the K nearest neighbors and take their majority classification (or somehow combine their classes). What should K be? It is probably problem dependent; one can use tuning sets (covered later) to select a good setting for K. [Plot: tuning-set error rate vs. K = 1, 2, 3, 4, 5; one shouldn't really "connect the dots" (why?).]
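A minimal Python sketch of both pieces, assuming numeric features, Euclidean distance, and a held-out tuning set (these choices are ours, not the slide's):

    import math
    from collections import Counter

    def euclidean(x, y):
        return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

    def knn_predict(train, x, k):
        # Majority class among the k nearest training examples
        neighbors = sorted(train, key=lambda ex: euclidean(ex[0], x))[:k]
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]

    def choose_k(train, tune, candidates=(1, 2, 3, 4, 5)):
        # Select the K with the lowest error rate on the tuning set
        def error_rate(k):
            wrong = sum(knn_predict(train, x, k) != y for x, y in tune)
            return wrong / len(tune)
        return min(candidates, key=error_rate)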

34 HW0 – Create Your Own Dataset (repeated from lecture #1)
Think about this before next class. Read HW0 (on-line). Google to find the UCI archive (or UCI KDD archive) and the UCI ML archive (UCI ML repository). More links are on HW0's web page.

35 HW0 – Your “Personal Concept”
Step 1: Choose a Boolean (true/false) concept. Some options:
  Subjective judgment (can't articulate): books I like/dislike; movies I like/dislike; www pages I like/dislike.
  "Time will tell" concepts: stocks to buy; medical treatment (given data at time t, predict outcome at time t + ∆t).
  Sensory interpretation (hard-to-program functions): face recognition (see textbook); handwritten digit recognition; sound recognition.

36 Some Real-World Examples
Car steering (Pomerleau, Thrun): digitized camera image → learned function → steering angle.
Medical diagnosis (Quinlan): medical record (age=13, sex=M, wgt=18, …) → learned function → sick vs. healthy.
Other examples: DNA categorization, TV-pilot rating, chemical-plant control, backgammon playing.

37 HW0 – Your “Personal Concept”
Step 2: Choose a feature space. We will use fixed-length feature vectors: choose N features; each feature has Vi possible values (this defines a space); each example is represented by a vector of N feature values (i.e., is a point in the feature space), e.g. <red, 50, round> for color, weight, shape. Feature types: Boolean, nominal, ordered, hierarchical; in HW0 we will use a subset (see next slide).
Step 3: Collect examples ("I/O" pairs).

38 Standard Feature Types
(for representing training examples – a source of "domain knowledge")
Nominal: no relationship among possible values, e.g., color ∈ {red, blue, green} (vs. color = 1000 Hertz).
Linear (or ordered): possible values are totally ordered, e.g., size ∈ {small, medium, large} (discrete), or weight ∈ [0…500] (continuous).
Hierarchical: possible values are partially ordered in an ISA hierarchy, e.g., for shape: closed → polygon or continuous; polygon → triangle, square; continuous → circle, ellipse.
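To make the distinction concrete, here is one illustrative Python encoding of feature metadata; the dictionary layout is our invention, not part of the HWs:

    # Illustrative metadata for the three standard feature types
    FEATURES = [
        {"name": "color",  "type": "nominal", "values": {"red", "blue", "green"}},
        {"name": "size",   "type": "linear",  "values": ["small", "medium", "large"]},  # discrete, ordered
        {"name": "weight", "type": "linear",  "range": (0, 500)},                       # continuous
        {"name": "shape",  "type": "hierarchical",  # child -> parent in the ISA hierarchy
         "isa": {"polygon": "closed", "continuous": "closed",
                 "triangle": "polygon", "square": "polygon",
                 "circle": "continuous", "ellipse": "continuous"}},
    ]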

39 Another View of Std Datasets - a Single Table (2D array)
            Feature 1 | Feature 2 | ... | Feature N | Output Category
  Example 1 | 0.0     | small     | ... | red       | true
  Example 2 | 9.3     | medium    | ... |           | false
  Example 3 | 8.2     |           | ... | blue      |
  ...
  Example M | 5.7     |           | ... | green     |

40 Our Feature Types (for CS 760 HW’s)
Discrete: tokens (char strings, w/o quote marks and spaces). Continuous: numbers (ints or floats); if a feature has only a few possible values (e.g., 0 & 1), use discrete. I.e., we merge nominal and discrete-ordered (or convert discrete-ordered into 1, 2, …). We will ignore hierarchical info and only use the leaf values (a common approach).

41 HW0: Creating Your Dataset
Ex: IMDB has a lot of data that are not discrete or continuous or binary-valued for the target function (category). Entities and relations:
  Studio (name, country, list of movies) – made → Movie
  Actor (name, year of birth, gender, Oscar nominations, list of movies) – acted in → Movie
  Director/Producer (name, year of birth, list of movies) – directed / produced → Movie
  Movie (title, genre, year, opening weekend BO receipts, list of actors/actresses, release season)

42 HW0: Sample DB
Choose a Boolean or binary-valued target function (category), e.g.:
  Opening weekend box-office receipts > $2 million?
  Movie is drama? (action, sci-fi, …)
  Movies I like/dislike (e.g., TiVo)

43 HW0: Representing as a Fixed-Length Feature Vector
<discuss on chalkboard> Note: some advanced ML approaches do not require such "feature mashing" (e.g., ILP).

44 CS 760 – Machine Learning (UW-Madison)
David Jensen’s group at UMass uses naïve Bayes and other ML algorithms on the IMDB:
  Opening weekend box-office receipts > $2 million: 25 attributes; accuracy = 83.3%; default accuracy = 56% (what is the default algorithm?).
  Movie is drama?: 12 attributes; accuracy = 71.9%; default accuracy = 51%.

45–50 (Review) Slides 45–50 repeat earlier material verbatim: Memorization (slide 16), Rote Learning is Limited (slide 17), Nearest-Neighbor Algorithms (slide 30), Simple Example: 1-NN (slide 31), Sample Experimental Results (slide 32), and K-NN Algorithm (slide 33).

51 In More Detail
K-Nearest Neighbors / Instance-Based Learning (k-NN/IBL): distance functions; kernel functions; feature selection (applies to all ML algorithms); IBL summary. Reading: Chapter 8 of Mitchell.

52 Some Common Jargon
Classification: learning a discrete-valued function. Regression: learning a real-valued function. IBL is easily extended to regression tasks (and to multi-category classification).

53 Variations on a Theme
(From Aha, Kibler, and Albert in the Machine Learning Journal)
IB1 – keep all examples.
IB2 – keep the next instance only if it is incorrectly classified using the previous instances. Uses less storage (good); order dependent (bad); sensitive to noisy data (bad).

54 Variations on a Theme (cont.)
IB3 – extends IB2 to decide more intelligently which examples to keep (see the article); better handling of noisy data. Another idea: cluster into groups and keep one example from each (median/centroid); less storage, faster lookup.

55 Distance Functions
The key issue in IBL (instance-based learning). One approach: assign weights to each feature.

56 Distance Functions (sample)
dist(ex1, ex2) = Σ_i w_i × dist_i(ex1, ex2), where dist(ex1, ex2) is the distance between examples 1 and 2, w_i is a numeric weighting factor, and dist_i(ex1, ex2) is the distance for feature i only between examples 1 and 2.
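A Python sketch of such a weighted, per-feature distance. The per-type rules here are common conventions, not something the slide prescribes: a mismatch counts 1 for nominal features, and numeric features use absolute difference:

    def feature_distance(xi, yi):
        # Nominal features: 0 if equal, 1 otherwise; numeric features: absolute difference
        if isinstance(xi, str) or isinstance(yi, str):
            return 0.0 if xi == yi else 1.0
        return abs(xi - yi)

    def weighted_distance(x, y, weights):
        # dist(ex1, ex2) = sum over features i of w_i * dist_i(ex1, ex2)
        return sum(w * feature_distance(xi, yi)
                   for w, xi, yi in zip(weights, x, y))

    print(weighted_distance(("red", 50), ("blue", 45), weights=(2.0, 0.1)))  # 2.0 + 0.5 = 2.5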

57 Kernel Functions and k-NN
The term "kernel" comes from statistics. Kernels are a major topic in support vector machines (SVMs). A kernel weights the interaction between pairs of examples.

58 Kernel Functions and k-NN (continued)
Assume we have the k nearest neighbors e1, …, ek with associated output categories O1, …, Ok. Then the output for test case et is: output(et) = argmax over categories c of Σ_{i=1…k} K(ei, et) × δ(Oi, c), where δ is the "delta" function (δ(Oi, c) = 1 if Oi = c, else 0) and K is the kernel.

59 Sample Kernel Functions K(ei , et)
[Diagram: test example '?' has three neighbors, two of which are '-' and one of which is '+'.]
K(ei, et) = 1: simple majority vote ('?' classified as -).
K(ei, et) = 1 / dist(ei, et): inverse-distance weighting ('?' could be classified as +).
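A Python sketch combining the last two slides: kernel-weighted voting over the k nearest neighbors, using the two kernels above. The one-dimensional toy data mimic the diagram (two far '-' neighbors, one close '+' neighbor); all names are ours:

    import math

    def euclidean(x, y):
        return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

    def kernel_vote(neighbors, x, kernel):
        # output = argmax over c of sum_i K(e_i, x) * delta(O_i, c)
        scores = {}
        for e, label in neighbors:
            scores[label] = scores.get(label, 0.0) + kernel(e, x)
        return max(scores, key=scores.get)

    uniform = lambda e, x: 1.0                    # K = 1: simple majority vote
    inverse = lambda e, x: 1.0 / euclidean(e, x)  # K = 1/dist: inverse-distance weighting

    neighbors = [((5.0,), '-'), ((6.0,), '-'), ((1.1,), '+')]
    x = (1.0,)
    print(kernel_vote(neighbors, x, uniform))  # '-' (two votes to one)
    print(kernel_vote(neighbors, x, inverse))  # '+' (1/0.1 outweighs 1/4 + 1/5)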

60 Gaussian Kernel
Heavily used in SVMs: K(ei, et) = e^(-||ei - et||² / (2σ²)), where e is Euler's number.
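A one-function Python sketch; σ is a bandwidth parameter one would tune, and the default below is arbitrary:

    import math

    def gaussian_kernel(x, y, sigma=1.0):
        # K(x, y) = exp(-||x - y||^2 / (2 * sigma^2))
        sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
        return math.exp(-sq_dist / (2 * sigma ** 2))

    print(gaussian_kernel((0.0, 0.0), (1.0, 1.0)))  # exp(-1) ≈ 0.368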

61 Local Learning
Collect the k nearest neighbors; give them to some supervised ML algorithm; apply the learned model to the test example. [Diagram: + and - points around a test point '?'; the model is trained only on the nearest examples.]
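A sketch of the idea using scikit-learn, assuming it is available; any supervised learner could stand in for the decision tree, and the data here are toy values:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors
    from sklearn.tree import DecisionTreeClassifier

    def local_learn_predict(X, y, x_test, k=3):
        # Train a model only on the k nearest neighbors of x_test
        nn = NearestNeighbors(n_neighbors=k).fit(X)
        _, idx = nn.kneighbors([x_test])  # indices of the k nearest training examples
        local_model = DecisionTreeClassifier().fit(X[idx[0]], y[idx[0]])
        return local_model.predict([x_test])[0]

    X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
    y = np.array(['-', '-', '-', '+', '+', '+'])
    print(local_learn_predict(X, y, [5.2, 5.1]))  # '+'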

62 Instance-Based Learning (IBL) and Efficiency
IBL algorithms postpone work from training to testing; pure k-NN/IBL just memorizes the training data. This is sometimes called lazy learning. It is computationally intensive: we match all features of all training examples.

63 Instance-Based Learning (IBL) and Efficiency
Possible speed-ups: use a subset of the training examples (Aha); use clever data structures (A. Moore), such as KD trees, hash tables, and Voronoi diagrams; use a subset of the features. A KD-tree sketch appears below.
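For instance, a KD tree replaces the linear scan over all training examples with a tree search. A sketch using scikit-learn's KDTree (assumed available; the random data are placeholders):

    import numpy as np
    from sklearn.neighbors import KDTree

    X = np.random.rand(10000, 3)        # 10,000 training examples, 3 features
    tree = KDTree(X)                    # built once, at "training" time

    query = np.random.rand(1, 3)
    dist, ind = tree.query(query, k=5)  # distances and indices of the 5 nearest neighbors
    print(ind[0])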

64 Where Will kNN FAIL?
Learning "juntas" (Blum, Langley ’94): the target concept is a function of a small subset of the features, the relevant features; most features are irrelevant (and not correlated with the relevant features). In this case, nearness for kNN is based mostly on irrelevant features.

65 Looking Ahead (Trees)
The ML method we will discuss next time is decision tree learning. Tree learners focus on choosing the most relevant features, so they address junta-learning better. They choose features one at a time, in a greedy fashion.

66 Looking Ahead (SVMs)
Later we will cover support vector machines (SVMs). Like kNN, SVMs classify a new instance based on its similarity to other instances, and they use kernels to capture similarity. But SVMs also assign intrinsic weights to examples (apart from distance): the "support vectors" are the examples with weight > 0.

67 Number of Features and Performance for ML
Too many features can hurt test-set performance: too many irrelevant features mean many spurious-correlation possibilities for an ML algorithm to detect. This is the "curse of dimensionality", and kNN is especially susceptible.

68 Feature Selection and ML (general issue for ML)
Filtering-based feature selection: all features → FS algorithm → subset of features → ML algorithm → model.
Wrapper-based feature selection: all features → FS algorithm → ML algorithm → model, where the FS algorithm calls the ML algorithm many times and uses it to help select features.

69 Feature Selection as Search Problem
State = a set of features. Start state = the empty set (forward selection) or the full set (backward selection). Goal test = the highest-scoring state. Operators = add/subtract a feature. Scoring function = accuracy on the training (or tuning) set of the ML algorithm using this state's feature set.

70 Forward and Backward Selection of Features
Hill-climbing ("greedy") search over feature sets, scored by accuracy on a tuning set (our heuristic function). Forward selection starts from the empty set {} (50%) and tries adding each feature ({F1} 62%, …, {FN} 71%), keeping the best. Backward selection starts from the full set {F1, F2, …, FN} (73%) and tries subtracting each feature ({F2, …, FN} 79%, …). A sketch of forward selection follows below.
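A Python sketch of greedy forward selection under these assumptions: score(features) is any caller-supplied function that trains the ML algorithm with just that feature subset and returns tuning-set accuracy (the names are illustrative):

    def forward_selection(all_features, score):
        # all_features: a set of feature names
        # score(features): tuning-set accuracy of the ML algorithm using `features`
        selected = set()
        best = score(selected)  # e.g., 50% for {}
        while selected != all_features:
            gains = [(score(selected | {f}), f) for f in all_features - selected]
            new_best, f = max(gains)
            if new_best <= best:  # hill-climbing: stop when no single addition helps
                break
            selected, best = selected | {f}, new_best
        return selected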

71 Forward vs. Backward Feature Selection
Forward: faster in the early steps because there are fewer features to test; fast for choosing a small subset of the features; but misses useful features whose usefulness requires other features (feature synergy).
Backward: fast for choosing all but a small subset of the features; preserves useful features whose usefulness requires other features. Example: area is important, but the features are length and width.

72 Some Comments on k-NN
Positive: easy to implement; a good "baseline" algorithm / experimental control; incremental learning is easy; a psychologically plausible model of human memory.
Negative: led astray by irrelevant features; no insight into the domain (no explicit model); the choice of distance function is problematic; doesn't exploit/notice structure in examples.

73 Questions about IBL (Breiman et al. - CART book)
Computationally expensive to save all examples, and classification of new examples is slow. Addressed by IB2/IB3 of Aha et al. and by the work of A. Moore (CMU; now Google). Is this really a problem?

74 Questions about IBL (Breiman et al. - CART book)
Intolerant of noise: addressed by IB3 of Aha et al.; by the k-NN version; and by feature selection (one can discard the noisy feature).
Intolerant of irrelevant features: since the algorithm is very fast, one can experimentally choose good feature sets (Kohavi, Ph.D. – now at Amazon).

75 More IBL Criticisms
High sensitivity to the choice of similarity (distance) function: Euclidean distance might not be the best choice. Handling non-numeric features and missing feature values is not natural, but doable. No insight into the task (the learned concept is not interpretable).

76 Summary
IBL can be a very effective machine learning algorithm and a good "baseline" for experiments.
© Jude Shavlik 2006, David Page 2010

