Linear Regression & Classification


Linear Regression & Classification Prof. Navneet Goyal CS & IS BITS, Pilani

Fundamentals of Modeling A model is an abstract representation of a real-world process. Y = 3X + 2 is a very simple model of how variable Y might relate to variable X. It is an instance of a more general model structure Y = aX + b, where a and b are parameters. θ is generally used to denote a generic parameter or a set (or vector) of parameters, e.g. θ = {a, b}. Values of the parameters are chosen by estimation, that is, by minimizing or maximizing an appropriate score function measuring the fit of the model to the data. Before we can estimate the parameters, we must choose an appropriate functional form of the model itself.
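A minimal numerical sketch of this idea (not from the slides; the synthetic data and variable names are illustrative): the parameters θ = {a, b} of Y = aX + b are estimated by minimizing a squared-error score function.

```python
# Estimate theta = {a, b} for Y = aX + b by least squares (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3 * x + 2 + rng.normal(0, 1, size=50)        # data roughly following Y = 3X + 2

# Design matrix [x, 1] so that y is approximated by a*x + b
A = np.column_stack([x, np.ones_like(x)])
(a_hat, b_hat), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"estimated a = {a_hat:.2f}, b = {b_hat:.2f}")   # close to 3 and 2
```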

Fundamentals of Modeling Predictive modeling can be thought of as learning a mapping from an input vector of measurements x to a scalar output y (a vector output is also possible but rarely used in practice). One of the variables is expressed as a function of the others (the predictor variables): the response variable Y and the predictor variables Xi, with ŷ = f(x1, x2, …, xp; θ). When Y is quantitative, the task of estimating a mapping from the p-dimensional X to Y is called regression. When Y is categorical, the task of learning a mapping from X to Y is called classification learning or supervised classification.

Predictive Modeling Predictive modeling predicts the value of some target characteristic of an object on the basis of observed values of other characteristics of the object. Examples: regression (prediction in DM) and classification.

Predictive Modeling Two broad tasks: prediction (regression) – linear regression, nonlinear regression; classification (supervised learning) – decision trees, k-NN, SVM, ANN.

Definition of Regression Regression is a (statistical) methodology that utilizes the relation between two or more quantitative variables so that one variable can be predicted from the other, or others. Examples: Sales of a product can be predicted by using the relationship between sales volume and amount of advertising The performance of an employee can be predicted by using the relationship between performance and aptitude tests The size of a child’s vocabulary can be predicted by using the relationship between the vocabulary size, the child’s age and the parents’ educational input.

Regression Problem Visualisation (figure: scatter of (x, y) data points with the fitted values ŷ; the residual spread gives the RMSE, s)

Structure of a Linear Regression Model Given a set of features x = (x1, …, xd), a linear predictor has the form ŷ = w0 + w1x1 + … + wdxd. The output ŷ is a real-valued, quantitative variable.

Classification Problem Given a database D = {t1, t2, …, tn} and a set of classes C = {C1, …, Cm}, the classification problem is to define a mapping f: D → C, where each ti is assigned to one class. Prediction is similar, but may be viewed as having an infinite number of classes.

Classification Classification is the task of assigning an object, described by a feature vector, to one of a set of mutually exclusive groups. A linear classifier has a linear decision boundary. The perceptron training algorithm is guaranteed to converge in finite time when the data set is linearly separable.
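As a hedged illustration of the perceptron statement above, here is a minimal sketch of the perceptron training algorithm (the toy data and function name are assumptions, not the lecture's code):

```python
# Perceptron: learn a linear boundary w.x + b = 0; stops on linearly separable data.
import numpy as np

def perceptron(X, y, max_epochs=100):
    """X: (n, d) feature vectors; y: labels in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:       # misclassified: nudge the boundary
                w += yi * xi
                b += yi
                errors += 1
        if errors == 0:                      # converged: every point on the right side
            break
    return w, b

# Toy linearly separable data
X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron(X, y)
print(np.sign(X @ w + b))                    # matches y
```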

What is Classification? Classification is also known as (statistical) pattern recognition. The aim is to build a machine/algorithm that can assign appropriate qualitative labels to new, previously unseen quantitative data, using a priori knowledge and/or information contained in a training set. The patterns to be classified are usually groups of measurements/observations that are believed to be informative for the classification task. Example: face recognition. (figure: training data D = {X, y} and prior knowledge are used to design/learn a classifier, which predicts a class label ŷ for a new pattern x)

Classification: Applications Spam mail detection, intrusion detection (rare-event classification), credit rating, medical diagnosis, categorizing cells as malignant or benign based on MRI scans, classifying galaxies based on their shapes, predicting preterm births, crop yield prediction, identifying mushrooms as poisonous or edible, …

Classification: Applications Example: a credit card company. Every purchase is placed into one of 4 classes: (1) authorize, (2) ask for further identification before authorizing, (3) do not authorize, (4) do not authorize but contact police. Two functions of data mining: examine historical data to determine how the data fit into the 4 classes, then apply the model to each new purchase.

Classification: a 3-phase job – the model building phase (learning phase), the testing phase, and the model usage phase.

Distance-based Classification: Nearest Neighbors If it walks like a duck, quacks like a duck, and looks like a duck, then it is probably a duck. (figure: compute the distance from the test record to the training records and choose the k "nearest" records)

Definition of Nearest Neighbor The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.
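A short illustrative k-NN sketch, assuming Euclidean distance and majority voting (the helper name and toy data are made up for illustration):

```python
# k-NN: classify a test record by majority vote among its k nearest training records.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    dists = np.linalg.norm(X_train - x_test, axis=1)   # distance to every training record
    nearest = np.argsort(dists)[:k]                    # indices of the k nearest records
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]                  # majority class label

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array(["duck", "duck", "goose", "goose"])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))   # -> "duck"
```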

Support Vector Machines Find a linear hyperplane (decision boundary) that will separate the data

Support Vector Machines One Possible Solution

Support Vector Machines Another possible solution

Support Vector Machines Other possible solutions

Support Vector Machines Which one is better? B1 or B2? How do you define better?

Support Vector Machines Find a hyperplane that maximizes the margin => B1 is better than B2

Support Vector Machines What if the problem is not linearly separable?

Nonlinear Support Vector Machines What if decision boundary is not linear?

Support Vector Machines The solid line is preferred. Geometrically, we can characterize the solid plane as being "furthest" from both classes. How can we construct the plane "furthest" from both classes?

Support Vector Machines Figure – Best plane bisects closest points in the convex hulls Examine the convex hull of each class’ training data (indicated by dotted lines) and then find the closest points in the two convex hulls (circles labeled d and c). The convex hull of a set of points is the smallest convex set containing the points. If we construct the plane that bisects these two points (w=d-c), the resulting classifier should be robust in some sense.

Convex Sets (figure: a convex set vs. a non-convex, or concave, set) A function (in blue) is convex if and only if the region above its graph (in green) is a convex set.

Convex Hulls Convex hull: elastic band analogy. For planar objects, i.e., objects lying in the plane, the convex hull may be easily visualized by imagining an elastic band stretched open to encompass the given object; when released, it will assume the shape of the required convex hull.

Disadvantages of Linear Decision Surfaces (figure: two classes plotted in the Var1–Var2 plane)

Advantages of Non-Linear Surfaces (figure: two classes plotted in the Var1–Var2 plane)

Linear Classifiers in High-Dimensional Spaces Find a function Φ(x) to map the data to a different space. (figure: the original Var1–Var2 space mapped to Constructed Feature 1 and Constructed Feature 2, where a linear classifier separates the classes)
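A small sketch of the feature-map idea above, assuming a toy radial data set and the constructed features Φ(x) = (x1², x2²); after the mapping, a linear boundary separates the classes (the data and function name phi are illustrative):

```python
# Points inside vs. outside a circle are not linearly separable in (x1, x2),
# but become linearly separable after the constructed features Phi(x) = (x1^2, x2^2).
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 < 1.0, 1, -1)     # class depends on the radius only

def phi(X):
    """Map each point to the constructed feature space."""
    return np.column_stack([X[:, 0]**2, X[:, 1]**2])

# In the new space the boundary z1 + z2 = 1 is a straight line (it was a circle before)
Z = phi(X)
y_hat = np.where(Z[:, 0] + Z[:, 1] < 1.0, 1, -1)
print((y_hat == y).mean())                              # 1.0: perfectly separated
```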

Handwriting Recognition Task T: recognizing and classifying handwritten words within images. Performance measure P: percent of words correctly classified. Training experience E: a database of handwritten words with given classifications.

Handwriting Recognition

Pattern Recognition Example Handwriting Digit Recognition

Pattern Recognition Example Handwriting Digit Recognition Each digit is represented by a 28x28 pixel image and can therefore be represented by a vector of 784 real numbers. Objective: an algorithm that takes such a vector as input and identifies the digit it represents – a non-trivial problem due to the variability in handwriting. Take images of a large number of digits (N) as the training set and use the training set to tune the parameters of an adaptive model. Each digit in the training set has been identified by a target vector t, which represents the identity of the corresponding digit. The result of running a machine learning algorithm can be expressed as a function y(x) which takes a new digit image x as input and outputs a vector y, encoded in the same way as t. The form of y(x) is determined during the learning (training) phase.
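A tiny illustration of the representation described above (the array shapes follow the slide; the code itself is an assumption): flattening a 28x28 image into a 784-dimensional input vector x and encoding the digit identity as a target vector t.

```python
# Digit image -> 784-dimensional input vector x; label -> 1-of-10 target vector t.
import numpy as np

image = np.zeros((28, 28))          # stand-in for a grayscale digit image
x = image.reshape(784)              # input vector fed to the learned function y(x)

digit = 7                           # known label from the training set
t = np.zeros(10)
t[digit] = 1.0                      # target vector t, same encoding used for the output y
print(x.shape, t)
```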

Pattern Recognition Example Generalization: the ability to correctly categorize new examples that differ from those used in training. Generalization is a central goal in pattern recognition. Preprocessing: the input variables are transformed into some new space of variables where, it is hoped, the problem will be easier to solve (see fig.). Images of digits are translated and scaled so that each digit is contained within a box of fixed size; this reduces variability. The preprocessing stage is also referred to as feature extraction. New test data must be preprocessed using the same steps as the training data.

Pattern Recognition Example Preprocessing can also speed up computations. For example, in face detection in a high-resolution video stream, we want features that are fast to compute and yet preserve useful discriminatory information enabling faces to be distinguished from non-faces. The average image intensity over a rectangular sub-region can be evaluated extremely efficiently, and a set of such features is very effective in fast face detection. Because such features are far fewer in number than the pixels, this is referred to as a form of dimensionality reduction. Care must be taken that important information is not discarded during preprocessing.

Pattern Recognition Example Supervised & unsupervised learning. If the training data consist of both input vectors and target vectors, we have supervised learning (the digit recognition problem is classification; predicting crop yield is regression). If the training data consist of only input vectors, we have unsupervised learning: discover groups of similar examples within the data (clustering), find the distribution of the data within the input space (density estimation), or project the data from a high-dimensional space down to 2 or 3 dimensions for visualization.

Reinforcement Learning The problem of finding suitable actions to take in a given situation in order to maximize a reward

Polynomial Curve Fitting Observe a real-valued input variable x and use x to predict the value of a target variable t. The synthetic data are generated from sin(2πx), with random noise added to the target values.

Polynomial Curve Fitting N observations of x: x = (x1, …, xN)ᵀ with targets t = (t1, …, tN)ᵀ. The goal is to exploit the training set to predict the value of t for a new value of x – an inherently difficult problem. Data generation: N = 10 inputs, spaced uniformly in the range [0, 1]; targets generated from sin(2πx) by adding small Gaussian noise. Such noise is typical of real data, arising from unobserved variables.

Polynomial Curve Fitting Fit the data with a polynomial y(x, w) = w0 + w1x + w2x^2 + … + wMx^M, where M is the order of the polynomial. Is a higher value of M better? We'll see shortly! The coefficients w0, …, wM are denoted by the vector w. y(x, w) is a nonlinear function of x but a linear function of the coefficients w – such models are called linear models.

Sum-of-Squares Error Function E(w) = ½ Σₙ {y(xₙ, w) − tₙ}², the sum of the squares of the errors between the predictions y(xₙ, w) and the target values tₙ.
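A minimal sketch of this setup in code, assuming NumPy and the synthetic sin(2πx) data described above; np.polyfit minimizes the same sum-of-squares criterion:

```python
# Synthetic data, an order-M polynomial fit, and the sum-of-squares error E(w).
import numpy as np

rng = np.random.default_rng(2)
N = 10
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, N)   # targets with small Gaussian noise

M = 3                                               # order of the polynomial
w = np.polyfit(x, t, M)                             # coefficients w_M ... w_0
y = np.polyval(w, x)                                # y(x_n, w) at the training inputs

E = 0.5 * np.sum((y - t) ** 2)                      # sum-of-squares error E(w)
print(f"E(w) = {E:.4f}")
```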

Polynomial curve fitting

Polynomial curve fitting Choice of M? This is called model selection or model comparison.

0th Order Polynomial Poor representations of sin(2πx)

1st Order Polynomial Poor representations of sin(2πx)

3rd Order Polynomial Best Fit to sin(2πx)

9th Order Polynomial Over Fit: Poor representation of sin(2πx)

Polynomial Curve Fitting Good generalization is the objective. How does generalization performance depend on M? Consider a separate test set of 100 points, calculate E(w*) for both the training data and the test data, and choose the M which minimizes the test error. Root-Mean-Square (RMS) error: E_RMS = √(2E(w*)/N). It is sometimes convenient to use this because division by N allows us to compare different sizes of data sets on an equal footing, and the square root ensures that E_RMS is measured on the same scale (and in the same units) as the target variable t.
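A hedged sketch of the model-selection experiment, assuming a 10-point training set and a 100-point test set from the same process; it prints E_RMS on both sets for each order M:

```python
# Fit polynomials of increasing order M and compare E_RMS on training and test data.
import numpy as np

rng = np.random.default_rng(3)
def make_data(N):
    x = rng.uniform(0, 1, N)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.3, N)

x_train, t_train = make_data(10)
x_test, t_test = make_data(100)

def e_rms(w, x, t):
    return np.sqrt(np.mean((np.polyval(w, x) - t) ** 2))   # equals sqrt(2 E(w)/N)

for M in range(10):
    w = np.polyfit(x_train, t_train, M)
    print(M, round(e_rms(w, x_train, t_train), 3), round(e_rms(w, x_test, t_test), 3))
```

Typically the training E_RMS keeps falling as M grows, while the test E_RMS rises again for large M, which is the over-fitting behaviour discussed next.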

Over-fitting Why is it happening? For small M (0, 1, 2) the polynomial is too inflexible to capture the oscillations of sin(2πx). For M = 3–8 it is flexible enough to capture the oscillations. For M = 9 it is too flexible: the training error is 0 but the generalization (test) error is high.

Polynomial Coefficients

Data Set Size: M = 9 The larger the data set, the more complex the model we can afford to fit to the data. A rough heuristic: the number of data points should be no less than 5–10 times the number of adaptive parameters in the model.

Over-fitting Problem Should we have to limit the number of parameters according to the size of the available training set? The complexity of the model should depend only on the complexity of the problem! Least-squares estimation is a specific case of maximum likelihood, and over-fitting is a general property of maximum likelihood. The over-fitting problem can be avoided by using the Bayesian approach.

Regularization Penalize large coefficient values by minimizing the regularized error Ẽ(w) = ½ Σₙ {y(xₙ, w) − tₙ}² + (λ/2)‖w‖², where the regularization coefficient λ controls the relative importance of the penalty term.
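A small sketch of regularized (ridge) least squares for the M = 9 polynomial, assuming the closed-form solution w = (ΦᵀΦ + λI)⁻¹Φᵀt; the λ values are chosen only to mirror the weak and strong regularization cases:

```python
# Ridge-regularized polynomial fit: the penalty (lam/2)*||w||^2 shrinks the coefficients.
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 10)

M = 9
Phi = np.vander(x, M + 1, increasing=True)          # design matrix, columns x^0 ... x^M

def ridge_fit(Phi, t, lam):
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ t)

for lam in (np.exp(-18), 1.0):                      # weakly vs. strongly regularized
    w = ridge_fit(Phi, t, lam)
    print(f"lambda={lam:.2e}  max|w|={np.abs(w).max():.1f}")
```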

Regularization:

Regularization:

Regularization: vs.

Polynomial Coefficients

Linear Models for Regression The role of regression is to predict the value of one or more continuous target variables t given the value of a D-dimensional vector x of input variables. We have already discussed polynomial curve fitting for regression; a polynomial is a specific example of a broad class of functions called linear regression models, i.e. functions that are linear in the adjustable parameters. The simplest linear regression models are also linear functions of the input variables. A more useful class of functions is obtained by taking a linear combination of a fixed set of nonlinear functions of the input variables, known as basis functions: such models are linear functions of the parameters but nonlinear with respect to the input variables.

Linear Models for Regression Linear models have significant limitations as practical techniques for ML, particularly for problems involving high dimensionality Linear models possess nice analytical properties and form the foundation for more sophisticated models

Linear Basis Function Models The simplest linear model for regression with d input variables x1, …, xd is y(x, w) = w0 + w1x1 + … + wdxd. Compare this with linear regression in one variable, y(x, w) = w0 + w1x, and with polynomial regression in one variable. It is linear in both the parameters and the input variables, which imposes significant limitations; in the 1-D case it is just a straight-line fit.

Linear Basis Function Models More generally, take a linear combination of fixed nonlinear basis functions of the input: y(x, w) = Σⱼ wⱼφⱼ(x) = wᵀφ(x), with φ0(x) = 1 so that w0 acts as a bias term.

Linear Basis Function Models Polynomial regression is a particular example of this model. How? With a single input variable x, choose the polynomial basis φⱼ(x) = xʲ. Limitation of the polynomial basis: it is global – a change in one region of input space affects all other regions. One remedy is to divide the input space into regions and use a different polynomial in each region, which is equivalent to spline functions.

Linear Basis Function Models Polynomial basis functions φⱼ(x) = xʲ: these are global; a small change in x affects all basis functions.

Linear Basis Function Models (4) Gaussian basis functions: φⱼ(x) = exp(−(x − μⱼ)²/(2s²)). These are local; a small change in x only affects nearby basis functions. μⱼ and s control location and scale (width).
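An illustrative sketch of a linear basis function model with Gaussian basis functions, assuming hand-chosen centres μⱼ and width s (these particular choices are not from the slides):

```python
# Linear basis function model: linear in w, nonlinear in x via Gaussian basis functions.
import numpy as np

def gaussian_design(x, centres, s):
    """Design matrix Phi[n, j] = exp(-(x_n - mu_j)^2 / (2 s^2)), plus a bias column."""
    Phi = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * s ** 2))
    return np.column_stack([np.ones_like(x), Phi])

rng = np.random.default_rng(5)
x = np.linspace(0, 1, 25)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 25)

centres = np.linspace(0, 1, 9)                       # mu_j spread over the input range
Phi = gaussian_design(x, centres, s=0.1)
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)          # least-squares weights
print(np.round(Phi @ w - t, 2))                      # small residuals on the training data
```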

Linear Basis Function Models (5) Sigmoidal basis functions: φⱼ(x) = σ((x − μⱼ)/s), where σ(a) = 1/(1 + e⁻ᵃ). These are also local; a small change in x only affects nearby basis functions. μⱼ and s control location and scale (slope).

Home Work Read about Gaussian, Sigmoidal, & Fourier basis functions Sequential Learning & Online algorithms Will discuss in the next class!

The Bias-Variance Decomposition The bias-variance decomposition is a formal method for analyzing the prediction error of a predictive model. In the projectile analogy: bias is the average distance between the target and the point where the projectile hits the ground (it depends on the angle); variance is the deviation between x and the average position where the projectile hits the ground (it depends on the force); noise arises when the target is not stationary, so the observed distance is also affected by changes in the location of the target.

The Bias-Variance Decomposition A low-degree polynomial has high bias (it fits poorly) but low variance across different data sets. A high-degree polynomial has low bias (it fits well) but high variance across different data sets. Interactive demo: http://www.aiaccess.net/English/Glossaries/GlosMod/e_gm_bias_variance.htm

The Bias-Variance Decomposition True height of Chinese emperor: 200cm, about 6’6”. Poll a random American: ask “How tall is the emperor?” We want to determine how wrong they are, on average

The Bias-Variance Decomposition Each scenario has an expected value of 180 (i.e. a bias error of 20), but increasing variance in the estimate. Expected squared error = (bias error)² + variance, so as the variance increases, the error increases.
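A quick numerical check of this identity (a sketch with assumed numbers, following the emperor example): guesses are centred at 180 with growing spread, and the mean squared error equals bias² plus variance in each case.

```python
# Check that expected squared error = bias^2 + variance for guesses of the true height.
import numpy as np

rng = np.random.default_rng(6)
true_height = 200.0
for spread in (0.0, 10.0, 30.0):
    guesses = rng.normal(180.0, spread, size=100_000)
    mse = np.mean((guesses - true_height) ** 2)
    bias_sq = (guesses.mean() - true_height) ** 2
    var = guesses.var()
    print(f"spread={spread:4.0f}  mse={mse:8.1f}  bias^2+var={bias_sq + var:8.1f}")
```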

Effect of the regularization parameter λ on the bias and variance terms (figure: small λ gives low bias but high variance; large λ gives low variance but high bias)

An example of the bias-variance trade-off

Beating the bias-variance trade-off We can reduce the variance term by averaging lots of models trained on different datasets. This seems silly: if we had lots of different datasets it would be better to combine them into one big training set, since with more training data there will be much less variance. Weird idea: we can create different datasets by bootstrap sampling of our single training dataset. This is called "bagging" and it works surprisingly well. But if we have enough computation it is better to do the right Bayesian thing: combine the predictions of many models using the posterior probability of each parameter vector as the combination weight.
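A minimal bagging sketch (illustrative only, using high-variance order-9 polynomial fits as the base models): average the predictions of models trained on bootstrap resamples of one training set.

```python
# Bagging: average high-variance models fitted to bootstrap resamples of one data set.
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0, 1, 30)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 30)

x_grid = np.linspace(0, 1, 50)
preds = []
for _ in range(50):                                   # 50 bootstrap datasets
    idx = rng.integers(0, len(x), len(x))             # sample with replacement
    w = np.polyfit(x[idx], t[idx], 9)
    preds.append(np.polyval(w, x_grid))

bagged = np.mean(preds, axis=0)                       # averaging reduces the variance term
single = np.polyval(np.polyfit(x, t, 9), x_grid)
truth = np.sin(2 * np.pi * x_grid)
print("single model MSE:", np.mean((single - truth) ** 2))
print("bagged model MSE:", np.mean((bagged - truth) ** 2))
```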