Machine Learning Week 2 Lecture 2.


Hand In It is online. Use the web board forum for Matlab questions. Comments and corrections are very welcome; I will upload new versions as we go along (currently we are at version 3). Your data is coming, and we might change it over time.

Quiz Go through all Questions

Recap

Impossibility of Learning! Three binary inputs x1, x2, x3 give 2^3 = 8 possible input points, so there are 2^8 = 256 potential target functions f. After observing the labels of 5 of the points, 8 of the 256 functions still have in-sample error 0, and the data cannot distinguish between them. What is f? Assumptions are needed.

No Free Lunch "All models are wrong, but some models are useful." (George Box) Machine learning has many different models and algorithms. There is no single model that works best for all problems (the No Free Lunch Theorem); assumptions that work well in one domain may fail in another.

Probabilistic Approach Flip a coin with unknown bias μ N times independently; sample: h,h,h,t,t,h,t,t,h. Sample mean: ν = #heads/N. Hoeffding's Inequality: P(|ν − μ| > ε) ≤ 2e^(−2ε²N). The sample mean is probably approximately correct (PAC).
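Hoeffding's inequality can be checked numerically. A small Python sketch (not part of the original slides; the function names and the choice μ = 0.5, N = 100, ε = 0.1 are illustrative): simulate repeated coin-flip samples and compare the empirical rate of large deviations to the bound 2e^(−2ε²N).

```python
import math
import random

def hoeffding_bound(eps, n):
    """Hoeffding's inequality: P(|nu - mu| > eps) <= 2*exp(-2*eps^2*n)."""
    return 2 * math.exp(-2 * eps**2 * n)

def empirical_violation_rate(mu, n, eps, trials=20000, seed=0):
    """Fraction of repeated size-n samples whose sample mean nu
    deviates from the true mean mu by more than eps."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        nu = sum(rng.random() < mu for _ in range(n)) / n
        if abs(nu - mu) > eps:
            bad += 1
    return bad / trials

bound = hoeffding_bound(0.1, 100)                 # 2*e^(-2) ~ 0.271
rate = empirical_violation_rate(0.5, 100, 0.1)    # much smaller in practice
assert rate <= bound
```

Note that the bound is loose: for a fair coin the true deviation probability is far below 2e^(−2), which is exactly what makes the bound safe to use without knowing μ.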

Classification Connection Testing a fixed hypothesis h against an unknown target f, with a probability distribution over x. μ is the probability of picking x such that f(x) ≠ h(x); 1 − μ is the probability of picking x such that f(x) = h(x). That is, μ is the sum of the probabilities of all points x where the hypothesis is wrong: the true error rate. The sample mean ν is the error rate observed on the sample.

Learning? This is only verification, not learning: Hoeffding applies to a hypothesis fixed before seeing the data. For finite hypothesis sets we used the union bound. The goal is to make sure E_in is close to E_out and to minimize E_in.

Error Functions Walmart, giving a discount to a given person: estimating a true customer as lying costs 1000 (an annoyed lost customer), while estimating a liar as true costs only 1. CIA access (to the Friday bar stock): estimating a liar as true costs 1000 (a security breach), while estimating a true employee as lying costs only 1. Point being: the right error function depends on the application.

Final Diagram Unknown target f, unknown probability distribution P(x) (we learn importance-weighted behavior, P(y | x)), data set, hypothesis set, learning algorithm, final hypothesis, error measure e. If x has very low probability then it is not really going to count.

Today We are still only talking about classification: test sets, working towards learning with infinite hypothesis sets, reinvestigating the union bound, dichotomies, break points.

The Test Set Split your data into two parts, D-train and D-test. Train on D-train and select hypothesis h; test h on D-test to get the test error E_test. Since h is fixed before D-test is used, the Hoeffding bound applies: for a fixed hypothesis h, N independent test points, and any ε > 0, P(|E_test − E_out| > ε) ≤ 2e^(−2ε²N).

Test Set Strong bound: with 1000 test points, the test error is within 5% of the out-of-sample error with roughly 98% probability. Unbiased: the test error is just as likely to be better than worse. Problems: we lose data for training, and if the error is high it is no help that it will also be high in practice. The test set can NOT be used to select h (contamination).
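The "1000 points, 98%, 5%" numbers come straight out of the Hoeffding bound. A quick Python check (a sketch, not from the slides):

```python
import math

def hoeffding_failure_prob(eps, n):
    """Probability that a fixed hypothesis's test error deviates from
    E_out by more than eps on n independent test points."""
    return 2 * math.exp(-2 * eps**2 * n)

# 1000 test points, tolerance eps = 0.05
delta = hoeffding_failure_prob(0.05, 1000)
print(f"P(|E_test - E_out| > 0.05) <= {delta:.4f}")
```

The failure probability is 2e^(−5) ≈ 0.0135, so the deviation stays below 5% with probability at least about 98.7%, consistent with the slide's "98%".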

Learning With Probability 1−δ Pick a tolerance (risk) δ of failing that you can accept. Set the RHS of the union bound, 2Me^(−2ε²N), equal to δ and solve for ε: ε = sqrt((1/(2N)) ln(2M/δ)). Then with probability at least 1−δ we get the generalization bound E_out(h) ≤ E_in(h) + sqrt((1/(2N)) ln(2M/δ)). This is why we minimize in-sample error.
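Solving for ε is one line of code. A sketch (not from the slides; the numbers N = 1000, M = 100, δ = 0.05 are hypothetical):

```python
import math

def generalization_eps(n, m, delta):
    """Solve 2*M*exp(-2*eps^2*N) = delta for eps:
    the generalization-bound slack at confidence 1 - delta."""
    return math.sqrt(math.log(2 * m / delta) / (2 * n))

# Hypothetical numbers: N = 1000 samples, M = 100 hypotheses, risk delta = 0.05
eps = generalization_eps(1000, 100, 0.05)   # about 0.064
```

Note how ε shrinks like 1/sqrt(N) but grows only logarithmically in M: a larger hypothesis set costs surprisingly little, which foreshadows replacing M with the growth function.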

Union Bound

Union Bound Learning The learning algorithm picks some hypothesis h_l. P(h_l is bad) is at most the probability that some hypothesis in the set is bad, which the union bound caps at the sum of the individual probabilities: at most 2Me^(−2ε²N) for M hypotheses. But we did not subtract overlapping events!

Hypotheses seem correlated Change h1 slightly into h2: the difference in E_out is just the small triangle between them, so if h1 is bad (poor generalization) then probably so is h2. Hope: improve on the union bound result.

Goal Replace M with something like the effective number of hypotheses. We want a general bound, i.e. independent of the target function and the input distribution. Simple would be nice.

Look at finite point sets Fix a set of N points X = (x1,...,xN) and a hypothesis set H. Each h in H gives a dichotomy (h(x1),...,h(xN)): a bit string of length N. How many different dichotomies do we get? At most 2^N. This captures the "expressiveness" of the hypothesis set on X.

Growth Function For a fixed set of N points X = (x1,...,xN) and hypothesis set H, let H(X) = {(h(x1),...,h(xN)) : h in H} be the set of dichotomies H realizes on X. The growth function is the maximum over all choices of N points: m_H(N) = max_X |H(X)| ≤ 2^N.

Example 1: Positive Rays 1-dimensional input space (points on the real line): h(x) = +1 for x > a, −1 otherwise. The dichotomy only changes when a moves to a different gap between consecutive points, so m_H(N) = N + 1.

Example 2: Intervals 1-dimensional input space (points on the real line): h(x) = +1 inside [a1, a2], −1 outside. Putting a1 and a2 in separate gaps between points gives C(N+1, 2) dichotomies; putting them in the same gap gives the all-minus dichotomy, so m_H(N) = C(N+1, 2) + 1.
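Both growth functions can be verified by brute force. A Python sketch (not from the slides; the helper names and the five sample points are illustrative), enumerating the distinct dichotomies one threshold per gap:

```python
from itertools import combinations

def gaps(xs):
    """One representative threshold in each of the n+1 gaps between sorted points."""
    return ([xs[0] - 1]
            + [(xs[i] + xs[i + 1]) / 2 for i in range(len(xs) - 1)]
            + [xs[-1] + 1])

def ray_dichotomies(xs):
    """Dichotomies of positive rays h(x) = +1 iff x > a on sorted distinct points."""
    return {tuple(1 if x > a else -1 for x in xs) for a in gaps(xs)}

def interval_dichotomies(xs):
    """Dichotomies of intervals: +1 inside (a1, a2), -1 outside."""
    dichos = {tuple([-1] * len(xs))}          # a1, a2 in the same gap: all minus
    for a1, a2 in combinations(gaps(xs), 2):  # a1 < a2 in separate gaps
        dichos.add(tuple(1 if a1 < x < a2 else -1 for x in xs))
    return dichos

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
n = len(xs)
assert len(ray_dichotomies(xs)) == n + 1                      # N + 1
assert len(interval_dichotomies(xs)) == n * (n + 1) // 2 + 1  # C(N+1, 2) + 1
```

Only one threshold per gap is needed because the dichotomy cannot change while a threshold stays inside a gap, which is exactly the argument on the slides.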

Example 3: Convex Sets 2-dimensional input space (points in the plane): h(x) = +1 inside a convex region. Place the N points on a circle (the circle is just for illustration of the worst case): any subset of them is captured by the convex hull of the positive points, so m_H(N) = 2^N.

Goal Continued Imagine we can replace M with the growth function in the generalization bound. The exponential factor e^(−2ε²N) drops exponentially fast in N, so if the growth function is a polynomial in N, the RHS still goes to 0 exponentially fast. Bright idea: (1) prove the growth function is polynomial in N; (2) prove we can replace M with the growth function.

Bounding the Growth Function The growth function might be hard to compute exactly. Instead of computing the exact value, we prove that it is bounded by a polynomial.

Shattering and Break Point If |H(x1,...,xN)| = 2^N then we say that H shatters (x1,...,xN). If no data set of size k can be shattered by H, then k is a break point for H. If k is a break point for H then so are all numbers larger than k. Why? Shattering a set of size k+1 would in particular shatter each of its k-point subsets.

Revisit Examples Positive rays: no 2 points can be shattered (the pattern (+,−) is impossible), so the break point is 2. Intervals: no 3 points can be shattered (the pattern (+,−,+) is impossible), so the break point is 3. Convex sets: points on a circle can always be shattered, so there is no break point.

2D Linear Classification (Hyperplanes)

2D Linear Classification 3 points on a line cannot be shattered, but 3 points in general position can. No 4 points can be shattered (some labeling, e.g. the XOR pattern, always fails), so for the 2D linear classification hypothesis set, 4 is a break point.
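This can be checked exhaustively for small point sets. A Python sketch (not from the slides; the separability test and the example coordinates are my own): it tests linear separability exactly by trying candidate normal directions taken from max-margin geometry, where the optimal normal is either along the difference of two opposite-class support vectors or perpendicular to the difference of two same-class support vectors.

```python
from itertools import product

def separable(points, labels):
    """Exact 2D linear-separability test for small point sets.
    Candidate normals: for each pair of points, the direction between them
    and its perpendicular (sufficient by max-margin support-vector geometry)."""
    cands = []
    for p in points:
        for q in points:
            if p != q:
                dx, dy = q[0] - p[0], q[1] - p[1]
                cands.append((dx, dy))    # along p -> q
                cands.append((-dy, dx))   # perpendicular to p -> q
    pos = [p for p, l in zip(points, labels) if l == 1]
    neg = [p for p, l in zip(points, labels) if l == -1]
    if not pos or not neg:
        return True  # single-class labelings are trivially separable
    for dx, dy in cands:
        proj_pos = [dx * x + dy * y for x, y in pos]
        proj_neg = [dx * x + dy * y for x, y in neg]
        if max(proj_neg) < min(proj_pos) or max(proj_pos) < min(proj_neg):
            return True
    return False

def shattered(points):
    """Can 2D linear classifiers realize all 2^N labelings of these points?"""
    return all(separable(points, labs)
               for labs in product([-1, 1], repeat=len(points)))

triangle  = [(0, 0), (1, 0), (0, 1)]          # 3 points in general position
collinear = [(0, 0), (1, 0), (2, 0)]          # 3 points on a line
square    = [(0, 0), (1, 0), (0, 1), (1, 1)]  # any 4 points fail
assert shattered(triangle)
assert not shattered(collinear)   # (+,-,+) is impossible on a line
assert not shattered(square)      # the XOR labeling fails: 4 is a break point
```

The growth function of 2D hyperplanes is therefore 2^N for N ≤ 3 but strictly less from N = 4 on.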

Break Points and Growth Function If H has a break point then the growth function is polynomial (needs proof). If not, then it is not: by definition of break point, m_H(N) = 2^N for all N.

Break Point Game Assume break point 2 and try to list dichotomies on x1, x2, x3 such that no pair of points is shattered, i.e. no two columns show all four patterns (0,0), (0,1), (1,0), (1,1). After listing a few rows, every further dichotomy becomes impossible: it would complete all four patterns on some pair of points, shattering a set of size 2.

Proof Coming If H has a break point then the growth function is polynomial.

B(N,k) is the maximal number of dichotomies Definition: B(N,k) is the maximal number of dichotomies possible on N points such that no subset of k points can be shattered by the dichotomies. This is more general than hypothesis sets: it is a purely combinatorial quantity (B stands for binomial). If k is a break point for H, then m_H(N) ≤ B(N,k) for any H with break point k.

Computing B(N,k): Boundary Cases B(N,1) = 1: we cannot shatter any set of size 1, so no two dichotomies may give different classes for any point; there is only one dichotomy, since a second one would differ on at least one point. B(1,k) = 2 for k > 1: there is only one point, so only 2 dichotomies are possible.

Computing B(N,k): Recursion List all B(N,k) dichotomies in a table L and group them by their values on the first N−1 points: S1 contains the α dichotomies whose first N−1 bits appear exactly once, and S2 contains the 2β dichotomies whose first N−1 bits appear twice, once with xN = 0 (sub-group S2⁰) and once with xN = 1 (sub-group S2¹). Then B(N,k) = α + 2β.

Recursion Consider the first N−1 points: there are only α + β distinct dichotomies on them, since S2⁰ and S2¹ are identical on the first N−1 points by construction. These dichotomies still cannot shatter any subset of size k (shattering on the first N−1 points would also shatter on N points), so α + β ≤ B(N−1, k). Now consider S2⁰ on the first N−1 points. If it could shatter a set of size k−1, then extending with the last point, where both values of xN occur for every such dichotomy (taking each dichotomy from both S2⁰ and S2¹), we would shatter a set of size k on N points: a contradiction. Hence β ≤ B(N−1, k−1), and B(N,k) = α + 2β = (α + β) + β ≤ B(N−1, k) + B(N−1, k−1).

Proof Coming Claim: B(N,k) ≤ sum over i = 0,...,k−1 of C(N,i). Base cases: B(N,1) = 1 = C(N,0), and B(1,k) = 2 = C(1,0) + C(1,1) for k > 1.

Induction Step Assume the claim holds for N−1; show it for N when k > 1 (k = 1 was a base case). By the recursion, B(N,k) ≤ B(N−1,k) + B(N−1,k−1) ≤ sum_{i=0}^{k−1} C(N−1,i) + sum_{i=0}^{k−2} C(N−1,i). In the second sum, change the summation parameter to start the combined sum from i = 1: sum_{i=1}^{k−1} C(N−1,i−1).

Continue Make it into one sum: B(N,k) ≤ C(N−1,0) + sum_{i=1}^{k−1} [C(N−1,i) + C(N−1,i−1)]. By the recurrence for binomials (Pascal's rule), C(N−1,i) + C(N−1,i−1) = C(N,i). Adding the zero index back in, C(N−1,0) = C(N,0) = 1, so B(N,k) ≤ sum_{i=0}^{k−1} C(N,i). QED
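The recursion and the binomial-sum bound can be sanity checked in a few lines of Python (a sketch, not from the slides). With the boundary cases above, the recursive upper bound taken with equality in fact equals the binomial sum exactly, which is what the induction shows:

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def B_upper(n, k):
    """Recursive upper bound on B(n, k), taken with equality:
    B(n, 1) = 1, B(1, k) = 2 for k > 1,
    B(n, k) <= B(n-1, k) + B(n-1, k-1)."""
    if k == 1:
        return 1
    if n == 1:
        return 2
    return B_upper(n - 1, k) + B_upper(n - 1, k - 1)

def binom_sum(n, k):
    """sum_{i=0}^{k-1} C(n, i): the polynomial (degree k-1 in n) bound."""
    return sum(comb(n, i) for i in range(k))

# The recursion with these boundaries reproduces the binomial sum exactly
for n in range(1, 15):
    for k in range(1, 8):
        assert B_upper(n, k) == binom_sum(n, k)
```

Since binom_sum(n, k) is a polynomial of degree k−1 in n, any hypothesis set with break point k has a polynomially bounded growth function, which is exactly the "bright idea" from earlier in the lecture.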