Machine Learning Week 3 Lecture 1

Programming Competition

Hand In

Student Comparison. After 5 minutes: training MCC with a 10×10 confusion matrix, and test MCC with a 10×10 confusion matrix (the numeric values did not survive the transcript).

Today: recap; learning with infinite hypothesis sets; bias-variance; the end of learning theory.

Recap

The Test Set – a look at E_out. Fix a hypothesis h, N independent data points, and any ε > 0. Split your data into two parts, D_train and D_test. Train on D_train and select a hypothesis h. Test h on D_test to get the test error E_test(h), and apply the Hoeffding bound to E_test(h). The test set cannot be used for anything else (otherwise h is no longer fixed with respect to it).
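For reference, the bound being applied is the standard Hoeffding inequality for a single fixed hypothesis, with N_test the size of D_test:

P[ |E_test(h) − E_out(h)| > ε ] ≤ 2·exp(−2·ε²·N_test)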

Generalization Bound Goal: Extend to infinite hypothesis spaces

Dichotomies. Fix a set of N points X = (x_1,…,x_N). Each hypothesis h in the hypothesis set H labels every point, so it induces a dichotomy (h(x_1),…,h(x_N)), a ±1 bit string of length N. The growth function m_H(N) counts the maximum number of distinct dichotomies H can produce on any N points.

Growth, Shattering, Breaks. If H produces all 2^N possible dichotomies on (x_1,…,x_N), we say that H shatters (x_1,…,x_N); equivalently, m_H(N) = 2^N for some set of N points. If no data set of size K can be shattered by H, then K is a break point for H.

Revisit Examples: positive rays (threshold a), intervals (endpoints a_1, a_2), and convex sets; for each restricted model the slide's figure marks an impossible dichotomy.

2D Hyperplanes. 3 is not a break point: there is some point set of size 3 (three points not on a line) that hyperplanes can shatter.

2D Hyperplanes. 4 is a break point: every set of 4 points admits an impossible dichotomy. Cases: if 3 of the points lie on a line, label the middle one differently from the outer two; if one point lies inside the triangle formed by the other three, label it differently from them; else (the 4 points form a convex quadrilateral), give opposite corners the same label.

Very Important Theorem. If H has a break point k, then the growth function is polynomial in N, of degree at most k−1.
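Concretely, the bound behind this theorem (Sauer's lemma, in the form used in the book) is, for a break point k:

m_H(N) ≤ Σ_{i=0}^{k−1} C(N, i) ≤ N^{k−1} + 1,

which is polynomial in N of degree at most k−1.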

Growth Function Generalization Bound. If the growth function is polynomial, we can learn with infinite hypothesis sets. This is the VC theorem. Proof: book appendix (intuition given in the book).

VC Dimension. The VC dimension of a hypothesis set H is the maximal number of points it can shatter, i.e. the largest N such that m_H(N) = 2^N. It is denoted d_vc. Equivalently, it is the smallest break point minus one.

Revisit Examples. VC dimensions: positive rays, d_vc = 1; intervals, d_vc = 2; convex sets, d_vc = ∞ (no break point, "unbreakable"); 2D hyperplanes, d_vc = 3.

VC Dimension For Linear Classification. For d-dimensional inputs (d+1 parameters including the bias), the VC dimension is d+1. Proof coming up: show d_vc ≥ d+1, then show d_vc < d+2.

d_vc ≥ d+1. Goal: find a set of d+1 points we can shatter. Idea: make the points (as vectors, including the constant bias coordinate) linearly independent by using one dimension for each point. Stack them as the rows of a (d+1)×(d+1) matrix X; by construction X is invertible.

d_vc ≥ d+1. Consider any dichotomy y ∈ {−1,+1}^(d+1). Find θ such that sign(Xθ) = y: solve Xθ = y, i.e. θ = X^(−1)·y. Then sign(Xθ) = sign(y) = y, so every dichotomy is realized and this point set is shattered.
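A small numerical sketch of this construction in Octave (my own illustration; the slides' exact point matrix is not shown, so one convenient choice is the origin plus the standard basis vectors, each with a leading bias coordinate of 1):

% Sketch (Octave/MATLAB): check that d+1 points built this way are shattered
% by linear classifiers. Illustration only; notation follows the slides loosely.
d = 3;                                    % input dimension (hypothetical choice)
X = [ones(d+1,1) [zeros(1,d); eye(d)]];   % rows = points, first column = bias coordinate
assert(abs(det(X)) > 0);                  % X is invertible by construction
ok = true;
for k = 0:2^(d+1)-1
  y = 2*(dec2bin(k, d+1) - '0')' - 1;     % one of the 2^(d+1) dichotomies, as a ±1 column
  theta = X \ y;                          % solve X*theta = y
  ok = ok && all(sign(X*theta) == y);     % this dichotomy is realized
end
disp(ok)                                  % prints 1: all dichotomies realized, set is shattered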

d_vc < d+2. For any d+2 points we must prove there is a dichotomy hyperplanes cannot capture. View the points as (d+1)-dimensional vectors (including the bias coordinate). Since there are more vectors than dimensions, they must be linearly dependent: some x_j = Σ_{i≠j} a_i·x_i, where we can ignore the zero coefficients.

d_vc < d+2. Dichotomy: give each point x_i with a nonzero coefficient the label y_i = sign(a_i), and give x_j the label y_j = −1. Claim: this dichotomy is impossible to capture with hyperplanes.
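Why the claim holds (the standard argument; the slide itself only states the claim): suppose some weight vector θ realized the dichotomy. Then sign(θᵀx_i) = sign(a_i) for every i with a_i ≠ 0, so a_i·θᵀx_i > 0 for those i. Summing,

θᵀx_j = Σ_{i≠j} a_i·θᵀx_i > 0,

so sign(θᵀx_j) = +1, contradicting the label y_j = −1. Hence no set of d+2 points can be shattered.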

d_vc = d+1. For hyperplanes, d_vc equals the number of free parameters (d weights plus the bias).

Vapnik-Chervonenkis (VC) Generalization Bound. For any tolerance δ > 0, the bound holds with probability at least 1−δ. Quote from the book: "The VC generalization bound is the most important mathematical result in the theory of learning."
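The bound itself, in the form the book uses (reconstructed here, since the formula did not survive the transcript): for any δ > 0, with probability at least 1−δ,

E_out(g) ≤ E_in(g) + sqrt( (8/N)·ln( 4·m_H(2N) / δ ) ).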

VC Generalization Bound. It is independent of the learning algorithm, the target function, the input distribution P(x), and the out-of-sample error: general indeed. The price is that we need to bound the growth function for every hypothesis space we use. We showed that for hyperplanes the VC dimension equals the number of free parameters.

Exercise 2.5: with N = 100, what probability does the VC bound give for E_in being within 0.1 of E_out? The guarantee comes out as "with probability 1−δ" where δ exceeds 1, i.e. 1−δ < 0. Ridiculous: the bound is vacuous at this sample size.
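Working the numbers (my reconstruction, assuming the book's form of the bound P[ |E_in − E_out| > ε ] ≤ 4·m_H(2N)·exp(−ε²·N/8) and the exercise's simple model with m_H(N) = N + 1): with N = 100 and ε = 0.1,

4·m_H(200)·exp(−0.1²·100/8) = 4·201·exp(−0.125) ≈ 804·0.88 ≈ 709,

so the failure probability the bound can certify is δ ≈ 709, far above 1, and "within 0.1 with probability at least 1−δ" says nothing.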

Cost of Generality. The growth function is really a worst case, independent of P(x), the target, and the out-of-sample error; that generality is what makes the bound so loose.

Sample Complexity. Fix a tolerance δ (success probability ≥ 1−δ) and a generalization error of at most ε. How big must N be? Upper-bound the growth function further using the VC-dimension polynomial.
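Solving the VC bound for N and using the polynomial bound m_H(2N) ≤ (2N)^d_vc + 1 gives the implicit condition

N ≥ (8/ε²)·ln( 4·((2N)^d_vc + 1) / δ ),

which is exactly what the plot on the next slide evaluates for ε = δ = 0.1 (giving the factor 800·ln(40·…)).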

Sampling Complexity. Plug in ε = δ = 0.1 and let's plot the required N for several VC dimensions (Octave):

function sc_vcdim()
  dat = 5000:2500:100000;                 % candidate sample sizes N
  hold off;
  plot(dat, dat, 'r-', 'linewidth', 3);   % reference line y = N
  hold on;
  for i = 3:9                             % VC dimensions d_vc = 3,...,9
    tmp = 800*log(40*((2.*dat).^i + 1));  % right-hand side of the bound with eps = delta = 0.1
    plot(dat, tmp, 'b-', 'linewidth', 3); % sample size required by the bound
  end
end

Sampling Complexity. Book statement: the bound suggests N proportional to d_vc, but with a huge constant (on the order of 10,000·d_vc). In practice, the rule of thumb N ≥ 10·d_vc is usually enough.

VC Interpretation. We can learn with infinite hypothesis sets. The VC dimension captures the effective number of parameters / degrees of freedom, and the bound has the shape: out-of-sample error ≤ in-sample error + model complexity.

As a figure: plotted against the VC dimension, the in-sample error decreases, the model complexity term increases, and the out-of-sample error (in-sample error + model complexity) is U-shaped. Balance these.

Model Selection. Given t models m_1,…,m_t, which is better? Compute E_in(m_1) + Ω(m_1), E_in(m_2) + Ω(m_2), …, E_in(m_t) + Ω(m_t) and pick the minimum one. Problem: there is a lot of slack in Ω.

Vapnik-Chervonenkis (VC) Theorem versus the test set estimate: the test set estimate (Hoeffding with M = 1) is a lot tighter. For any δ > 0, it holds with probability at least 1−δ.
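For comparison, the test set estimate comes from inverting the single-hypothesis Hoeffding bound: with probability at least 1−δ,

E_out(h) ≤ E_test(h) + sqrt( ln(2/δ) / (2·N_test) ),

with no growth-function factor inside the logarithm, which is why it is so much tighter than the VC bound.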

Learning Summary. There is theory for regression as well, but we will not go there. We move on to bias-variance, the last piece of learning theory in this course.

Bias Variance Decomposition. Consider the least squares error measure again and see if we can understand the out-of-sample error. For simplicity, assume our target function is noiseless, i.e. it is an exact function of the input.

Experiments. Take two model classes, lines ax+b and quadratics cx² + dx + e, and a target function defined on ¼ < x < ¾. Repeatedly pick 3 random data points (x, target(x)), fit both models, plot, and see what happens (see the sketch below).
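A minimal sketch of this experiment in Octave (my own illustration; the lecture's exact target function is not recoverable from the transcript, so sin(2πx) is used as a stand-in):

% Sketch: repeatedly sample 3 points from a target on (1/4, 3/4),
% fit a line and a quadratic, and overlay the fits.
% The target below is a stand-in; the lecture's target is not recoverable here.
target = @(x) sin(2*pi*x);
xs = linspace(0.25, 0.75, 200);
hold off; plot(xs, target(xs), 'k-', 'linewidth', 2); hold on;
for rep = 1:20
  x = 0.25 + 0.5*rand(3,1);        % 3 random inputs in (1/4, 3/4)
  y = target(x);                   % noiseless labels
  p1 = polyfit(x, y, 1);           % fit ax + b
  p2 = polyfit(x, y, 2);           % fit cx^2 + dx + e
  plot(xs, polyval(p1, xs), 'b-');
  plot(xs, polyval(p2, xs), 'r-');
end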

Bias Variance. The out-of-sample error we get depends on the final hypothesis. The hypothesis is the result of the learning algorithm, which is a function of the training data, so the training data affects the out-of-sample error. Think of the data set as a random variable and analyze what happens if we repeatedly sample data and run our learning algorithm.

Notation. g^(D) denotes the hypothesis learned on data set D. Its average over data sets, g_bar(x) = E_D[ g^(D)(x) ], is the "average hypothesis" used below.

Bias Variance. Taking the expectation over data sets, the expected out-of-sample error decomposes as

E_D[ E_out(g^(D)) ] = E_x[ bias(x) + var(x) ],

with bias(x) = (g_bar(x) − f(x))² and var(x) = E_D[ (g^(D)(x) − g_bar(x))² ].
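A rough Monte Carlo sketch of how these quantities could be estimated for the linear model (again an illustration with the stand-in target, not code from the lecture):

% Sketch: estimate bias and variance of the line model by repeated sampling.
% Stand-in target; assumes the same setup as the experiment sketch above.
target = @(x) sin(2*pi*x);
xs = linspace(0.25, 0.75, 200);          % evaluation grid for E_x[.]
R = 5000;                                % number of simulated data sets
preds = zeros(R, numel(xs));
for r = 1:R
  x = 0.25 + 0.5*rand(3,1);
  y = target(x);
  p = polyfit(x, y, 1);                  % learn g^(D) on this data set
  preds(r,:) = polyval(p, xs);           % g^(D)(x) on the grid
end
gbar = mean(preds, 1);                             % average hypothesis g_bar(x)
bias = mean((gbar - target(xs)).^2);               % E_x[(g_bar(x) - f(x))^2]
variance = mean(mean((preds - gbar).^2, 1));       % E_x[E_D[(g^(D)(x) - g_bar(x))^2]]
fprintf('bias = %.3f, variance = %.3f\n', bias, variance);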

Bias Variance (figure slide; the illustration did not survive the transcript).

Learning Curves. A plot of the in-sample and out-of-sample error as a function of the training set size N.
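A minimal sketch of how such a plot could be produced for least-squares regression (illustration only, with a hypothetical noisy linear target):

% Sketch: average learning curves for least-squares linear regression
% on a hypothetical noisy target y = 2x + 1 + noise.
Ns = 5:5:100;                      % training set sizes
reps = 200;                        % data sets per size
Ein = zeros(size(Ns)); Eout = zeros(size(Ns));
for k = 1:numel(Ns)
  N = Ns(k);
  for r = 1:reps
    x  = rand(N,1);    y  = 2*x  + 1 + 0.3*randn(N,1);      % training data
    xt = rand(1000,1); yt = 2*xt + 1 + 0.3*randn(1000,1);   % fresh test data
    p = polyfit(x, y, 1);
    Ein(k)  = Ein(k)  + mean((polyval(p, x)  - y ).^2) / reps;
    Eout(k) = Eout(k) + mean((polyval(p, xt) - yt).^2) / reps;
  end
end
plot(Ns, Ein, 'b-', Ns, Eout, 'r-', 'linewidth', 2);
legend('E_{in}', 'E_{out}'); xlabel('training set size N');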