Supervised Learning I, Cont’d

Administrivia Machine learning reading group: not part of (or related to) this class; we read advanced (current research) papers in the ML field. Might be of interest; all are welcome. Meets Fri, 3:00-4:30, FEC349 conf room. More info: Lecture notes online. Pretest/solution set online.

5 minutes of math... Solve the linear system

5 minutes of math... What if this were a scalar equation?

5 minutes of math... Not much different for linear systems. Linear algebra was developed to make working w/ linear systems as easy as working w/ linear scalar equations. BUT matrix multiplication doesn’t commute! NOTE: in general AB ≠ BA, so it matters on which side you multiply by an inverse.

5 minutes of math... So when does this work? When does a solution for V exist, and when is it unique? Think back to the scalar version: when does it have a solution? What’s the moral equivalent for linear systems?

5 minutes of math... The moral equivalent of a scalar “0” is a “singular matrix”. There are many ways to detect singularity; the simplest is the determinant: the system has a (unique) solution iff det(T) ≠ 0.
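The equations on these math slides were images and did not survive in the transcript. As a sketch, assuming (consistent with the symbols T, V, and R used on the surrounding slides) that the system has the form T V = R with V unknown:

\[ t\,v = r \;\Rightarrow\; v = r/t \qquad \text{(scalar case, provided } t \neq 0\text{)} \]
\[ T\,V = R \;\Rightarrow\; V = T^{-1}R \qquad \text{(matrix case, provided } \det(T) \neq 0\text{)} \]

Note that the inverse is applied on the left of both sides, \( T^{-1}(T\,V) = T^{-1}R \); because matrix multiplication does not commute, multiplying on the right would not cancel T.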

5 minutes of math... Finally, what “shapes” are all of the parts? The RHS and LHS must have the same shape... so R must be a column vector. What about c T V? Column vector.

5 minutes of math... Consider some cases. What if T is a vector? What about a rectangular matrix?

5 minutes of math... ⇒ For the term c T V to be a column vector, T must be a square matrix.

Review of notation: feature (attribute), instance (example), label (class), feature space, training data.
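The notation symbols themselves were images and are missing from the transcript. One common set of conventions (these particular symbols are assumptions on my part, chosen to match the X, Y, and H that appear later in the lecture) is:

\[ \text{feature: } x_j, \qquad \text{instance: } \mathbf{x} = (x_1, \dots, x_d) \in \mathcal{X}, \qquad \text{label: } y \in \mathcal{Y}, \]
\[ \text{feature space: } \mathcal{X}, \qquad \text{training data: } X = \{\mathbf{x}_1, \dots, \mathbf{x}_N\},\; Y = \{y_1, \dots, y_N\}. \]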

Hypothesis spaces The “true” function we want is usually called the target concept (also the true model, target function, etc.). The set of all possible functions we’ll consider is called the hypothesis space, H. NOTE! The target concept is not necessarily part of the hypothesis space!!! Example hypothesis spaces: all linear functions; quadratic & higher-order functions.
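For concreteness, the “all linear functions” example can be written out explicitly (the w, b notation here is mine, not the slide’s):

\[ H_{\text{lin}} = \{\, h_{\mathbf{w},b}(\mathbf{x}) = \mathbf{w}^{\top}\mathbf{x} + b \;:\; \mathbf{w} \in \mathbb{R}^{d},\ b \in \mathbb{R} \,\}; \]

the quadratic and higher-order spaces are obtained by also allowing squared and cross terms of the features.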

Visually... [Figure: the space of all functions on the feature space, with the hypothesis space H drawn as a region inside it; the target concept might be inside H, or it might be elsewhere in the larger space.]

More hypothesis spaces: Rules
    if (x.skin == "fur") {
        if (x.liveBirth == "true") { return "mammal"; }
        else { return "marsupial"; }
    } else if (x.skin == "scales") {
        switch (x.color) {
            case ("yellow") { return "coral snake"; }
            case ("black")  { return "mamba snake"; }
            case ("green")  { return "grass snake"; }
        }
    } else {
        ...
    }

More hypothesis spaces Decision Trees

Finding a good hypothesis Our job is now: given a set of training instances in some feature space, together with their labels, find the best hypothesis we can by searching the hypothesis space H. [Figure: H shown as a region inside the space of all functions on the feature space; the search takes place within H.]

Measuring goodness What does it mean for a hypothesis to be “as close as possible”? It could be a lot of things. For the moment, we’ll think about accuracy (or, with a higher sigma-shock factor...).
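The accuracy formula itself was an image on the slide. A standard definition over the N training examples, in the notation assumed earlier, is:

\[ \operatorname{acc}(h) = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\big[\, h(\mathbf{x}_i) = y_i \,\big], \]

i.e., the fraction of training instances that the hypothesis h labels correctly.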

Constructing DTs, intro Hypothesis space: the set of all trees, w/ all possible node labelings and all possible leaf labelings. How many are there? Proposed search procedure: 1. Propose a candidate tree. 2. Evaluate the accuracy of the candidate w.r.t. the training data X and Y. 3. Keep the maximum-accuracy tree seen so far. 4. Go to 1. Will this work?
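To get a feel for how hopeless exhaustive search is (a rough bound under an assumption not on the slide, namely binary features and binary labels): with d binary features there are \(2^{2^d}\) distinct boolean labelings of the instance space, each realizable by at least one tree, so the number of distinct trees is at least

\[ 2^{2^{d}}, \]

which is doubly exponential in d; already for d = 6 that exceeds \(10^{19}\).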

A more practical alg Can’t really search all possible trees; instead, build the tree greedily and recursively:
    DecisionTree buildDecisionTree(X, Y)
    Input:  InstanceSet X, LabelSet Y
    Output: decision tree
    if (pure(X, Y)) {
        return new Leaf(Y);                               // labels agree: make a leaf
    } else {
        Attribute a = getBestSplitAttribute(X, Y);        // pick the split attribute
        DecisionNode n = new DecisionNode(a);
        [X1, ..., Xk, Y1, ..., Yk] = splitData(X, Y, a);  // partition by a's values
        for (i = 1; i <= k; ++i) {
            n.addChild(buildDecisionTree(Xi, Yi));        // recurse on each subset
        }
        return n;
    }

A bit of geometric intuition [Figure: scatter plot of training data with features x1 = petal length and x2 = sepal width as the two axes.]

The geometry of DTs A decision tree splits space w/ a series of axis-orthogonal decision surfaces, a.k.a. axis-parallel. Each split is equivalent to a half-space; the intersection of these half-spaces yields a set of hyper-rectangles (rectangles in d>3 dimensional space). In each hyper-rectangle, the DT assigns a constant label, so a DT is a piecewise-constant approximator over a set of hyper-rectangular regions.
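As a toy illustration of this geometry (the feature names, thresholds, and labels below are made up, not from the slides), a depth-2 tree over two numeric features is just nested axis-parallel threshold tests, returning a constant label in each resulting rectangle:

    // Hypothetical depth-2 decision tree over (petalLength, sepalWidth).
    // Each comparison defines an axis-parallel half-space; each return
    // statement is the constant label for one hyper-rectangle.
    public class TinyTree {
        public static String classify(double petalLength, double sepalWidth) {
            if (petalLength < 2.5) {              // split on x1
                return "setosa";                  // region: x1 < 2.5
            } else if (sepalWidth < 3.0) {        // split on x2, within x1 >= 2.5
                return "versicolor";              // region: x1 >= 2.5, x2 < 3.0
            } else {
                return "virginica";               // region: x1 >= 2.5, x2 >= 3.0
            }
        }

        public static void main(String[] args) {
            System.out.println(classify(1.4, 3.5));   // prints "setosa"
        }
    }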

Filling out the algorithm Still need to specify a couple of functions: pure(X): determine whether we’re done splitting the set X. getBestSplitAttribute(X,Y): find the best attribute to split X on. pure(X) is the easy (easier, anyway) one...
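A minimal sketch of the purity test (the signature and types here are assumptions; the slides leave them abstract): all that matters is whether every remaining label agrees.

    import java.util.List;

    public class Purity {
        // Returns true when every label in Y is identical, i.e. the node
        // is pure and can become a leaf. An empty label set counts as pure.
        public static boolean pure(List<String> Y) {
            for (String y : Y) {
                if (!y.equals(Y.get(0))) {
                    return false;
                }
            }
            return true;
        }
    }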

Splitting criteria What properties do we want our getBestSplitAttribute() function to have? Increase the purity of the data: after the split, the new sets should be closer to a uniform labeling than before the split. Want the subsets to have roughly the same purity. Want the subsets to be as balanced as possible. These choices are designed to produce small trees. Definition: learning bias == the tendency to find one class of solution out of H in preference to another.
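One standard way to make “increase the purity” precise, previewing the entropy material that follows (this particular measure is not stated on this slide), is information gain:

\[ \operatorname{Ent}(Y) = -\sum_{c} p_c \log_2 p_c, \qquad \operatorname{Gain}(a) = \operatorname{Ent}(Y) - \sum_{i=1}^{k} \frac{|Y_i|}{|Y|}\,\operatorname{Ent}(Y_i), \]

where \(p_c\) is the fraction of examples in Y with label c and \(Y_1, \dots, Y_k\) are the label subsets produced by splitting on attribute a. Picking the attribute with the largest gain favors exactly the kind of splits described above: subsets that are purer than the parent set.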