Extending linear models by transformation (section 3.4 in text) (lectures 3&4 on amlbook.com)

Presentation transcript:

Usually the only way to determine whether data are linearly separable is to try a linear model. When the number of attributes exceeds 2, viewing the training data as a scatter plot is not practical.

A linear model with a small E_in(g) means the bulk of the training data is linearly separable. Since linear models usually generalize well, a linear model with small E_in(g) is probably the best choice.

When members of a class tend to cluster, an elliptical transformation, z = Φ(x) = (1, x_1^2, x_2^2), might lead to linearly separable features. (Figure: attribute space vs. feature space.)
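
A minimal sketch of this transform, assuming NumPy (the function name is illustrative, not from the slides):

    import numpy as np

    def elliptical_transform(X):
        """Map 2-D attribute vectors x = (x1, x2) to features z = (1, x1^2, x2^2)."""
        x1, x2 = X[:, 0], X[:, 1]
        return np.column_stack([np.ones(len(X)), x1**2, x2**2])

    # A class that clusters around the origin in attribute space lies on one
    # side of a plane in this feature space, so a linear model can separate it.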

When a linear model in attribute space separates most of the data, a transform to a space where E_in(g) = 0 (linearly separable) is likely to be complex. (Figure: attribute space showing a linear boundary and complex boundaries back-transformed from feature space.)

Data snooping: choosing a transform by looking at a scatter plot can be dangerous; the characteristics you see may apply only to this particular dataset.

A non-linear transform is usually discovered as an improvement on a linear model. To find the optimum weight vector w, replace the attribute vectors x_n in the X matrix by the corresponding feature vectors z_n = Φ(x_n). Minimizing E_in in attribute space gives X^T X w_lin = X^T y; in feature space it gives Z^T Z w_lin = Z^T y.
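
A minimal sketch of fitting in feature space by solving the normal equations (assumes Z^T Z is invertible; names are illustrative):

    import numpy as np

    def fit_in_feature_space(X, y, transform):
        """Least-squares weights in feature space: solve Z^T Z w = Z^T y."""
        Z = transform(X)                        # replace x_n by z_n = Phi(x_n)
        return np.linalg.solve(Z.T @ Z, Z.T @ y)

    # Usage with the elliptical transform sketched above:
    # w_lin = fit_in_feature_space(X_train, y_train, elliptical_transform)
    # y_hat = np.sign(elliptical_transform(X_new) @ w_lin)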

Learning curves: simple vs. complex models. Complex models require more data points for good performance. For N smaller than the crossover (dotted line in the figure), the simple model is better. Even for large N, the expected error stays above the bound set by noise.

Extending linear models by transforms can lead to over-fitting (smaller E_in but larger E_out). VC dimension is a measure of complexity: a 2D linear model has d_VC = 3, while a 2D full quadratic model has d_VC = 6. The model with the optimal d*_VC has the minimum E_out, not the smallest E_in.

d_VC as a measure of complexity is usually not known. What are some more useful measures of complexity? How do we estimate a good level of complexity?

An “elbow” in the estimate of E_out indicates the best complexity (figure adapted from lecture notes for E. Alpaydın, Introduction to Machine Learning, 2e, The MIT Press, 2010). The approach used for 1D polynomial fitting applies to any measure of complexity: use a validation set to estimate E_out.
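
A minimal sketch of this validation procedure for the 1D polynomial case, with degree standing in for complexity (names are illustrative):

    import numpy as np

    def validation_errors(x_tr, y_tr, x_val, y_val, max_degree=10):
        """E_val for polynomial fits of increasing degree; look for the elbow."""
        errors = []
        for degree in range(1, max_degree + 1):
            coeffs = np.polyfit(x_tr, y_tr, degree)
            e_val = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
            errors.append((degree, e_val))
        return errors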

The number of features expands rapidly in multivariate polynomial models. For the full 2D quadratic, z = Φ(x) = (1, x_1, x_2, x_1^2, x_2^2, x_1 x_2). Add terms sequentially and see how E_val changes, as sketched below.
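
A minimal sketch of adding quadratic terms one at a time and watching E_val, assuming a least-squares linear classifier and a held-out validation set (all names are illustrative):

    import numpy as np

    TERMS = [                      # candidate terms of the full 2-D quadratic
        ("1",     lambda x: np.ones(len(x))),
        ("x1",    lambda x: x[:, 0]),
        ("x2",    lambda x: x[:, 1]),
        ("x1^2",  lambda x: x[:, 0] ** 2),
        ("x2^2",  lambda x: x[:, 1] ** 2),
        ("x1*x2", lambda x: x[:, 0] * x[:, 1]),
    ]

    def sequential_e_val(X_tr, y_tr, X_val, y_val):
        """Add terms sequentially and record the validation error after each."""
        funcs, history = [], []
        for name, f in TERMS:
            funcs.append(f)
            Z_tr = np.column_stack([g(X_tr) for g in funcs])
            Z_val = np.column_stack([g(X_val) for g in funcs])
            w, *_ = np.linalg.lstsq(Z_tr, y_tr, rcond=None)
            history.append((name, np.mean(np.sign(Z_val @ w) != y_val)))
        return history   # keep only terms that reduce E_val significantly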

Curse of dimensionality (glass data): extending the linear beer-bottle classifier to a full quadratic changes the size of the Z matrix from 9 to 81. Some quadratic terms are more important than others; ignore terms that do not decrease E_val significantly. A large validation set makes this technique more effective.

Classification for digit recognition. (Figure: examples of hand-written digits from zip codes.)

2-attribute digit model: intensity and symmetry. Intensity: how much black is in the image. Symmetry: how similar the image is to its mirror image. (Figure: scatter plot of intensity vs. symmetry.)
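
A minimal sketch of these two features for a grayscale digit image, assuming pixel values where larger means darker (these conventions are assumptions, not from the slides):

    import numpy as np

    def intensity(img):
        """Average pixel darkness: how much 'black' is in the image."""
        return img.mean()

    def symmetry(img):
        """Negative mean absolute difference between the image and its
        left-right mirror; values closer to 0 mean more symmetric."""
        return -np.mean(np.abs(img - np.fliplr(img)))

    # Each digit image becomes a 2-attribute vector x = (intensity, symmetry).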

A linear classifier has accuracy ~0.99. (Figure: ones vs. fives.)

One vs. not-one: linear is good; cubic is slightly better.

One vs. not-one: finding the best complexity. (Figure: error vs. additional terms beyond linear, L, +x_1^2, +x_2^2, +x_1 x_2, +x_1^3, +x_2^3, +x_1 x_2^2, +x_1^2 x_2; E_val estimated on 8798 samples, E_in on 500 samples.)

Discriminants in 2D binary classification. (Figure: ones vs. fives.)

Discriminants: linear 2D binary classifier. y_fit(x) = w_0 + w_1 x_1 + w_2 x_2, where r_1 and r_2 are the numerical class labels. Setting y_fit(x) = (r_1 + r_2)/2 defines a function of x_1 and x_2 that is the discriminant. Solving for x_2 as a function of x_1 gives x_2 = ((r_1 + r_2)/2 - w_0 - w_1 x_1) / w_2.

Discriminants: non-linear binary classifiers

By analogy with the linear 2D case: y_fit = w^T Φ(x) and r_b = (r_1 + r_2)/2. Setting y_fit = r_b defines the discriminant. For a given x_1, define f(x_2) = w^T Φ(x) - r_b and find the zeros of f(x_2); the resulting points (x_1, x_2) lie on the discriminant.
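
A minimal sketch of tracing the discriminant numerically, assuming SciPy's brentq root finder and a feature transform phi that maps one point to its feature vector (names are illustrative):

    import numpy as np
    from scipy.optimize import brentq

    def discriminant_points(w, phi, r_b, x1_grid, x2_lo, x2_hi):
        """For each x1, find x2 with f(x2) = w . phi(x1, x2) - r_b = 0."""
        points = []
        for x1 in x1_grid:
            f = lambda x2: w @ phi(np.array([x1, x2])) - r_b
            if f(x2_lo) * f(x2_hi) < 0:        # a sign change brackets a root
                points.append((x1, brentq(f, x2_lo, x2_hi)))
        return points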