Linear Regression & Classification

Presentation on theme: "Linear Regression & Classification"— Presentation transcript:

1 Linear Regression & Classification
Prof. Navneet Goyal CS & IS BITS, Pilani

2 Fundamentals of Modeling
A model is an abstract representation of a real-world process. Y = 3X + 2 is a very simple model of how variable Y might relate to variable X. It is an instance of a more general model structure Y = aX + b, where a and b are parameters. θ is generally used to denote a generic parameter or a set (or vector) of parameters, e.g. θ = {a, b}. Values of the parameters are chosen by estimation – that is, by minimizing or maximizing an appropriate score function measuring the fit of the model to the data. Before we can choose the parameters, we must choose an appropriate functional form for the model itself.
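As a concrete sketch of this estimation step (illustrative numpy code, not from the slides; the synthetic data and seed are assumptions), the parameters θ = {a, b} of Y = aX + b can be chosen by minimizing a sum-of-squares score function:

```python
import numpy as np

# Synthetic data drawn from the "true" process Y = 3X + 2 plus a little noise,
# so the estimated parameters theta = {a, b} should land near (3, 2).
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3 * x + 2 + rng.normal(scale=0.1, size=x.size)

# Estimate a and b by minimizing the sum-of-squares score function:
# polyfit with deg=1 returns the least-squares slope and intercept.
a_hat, b_hat = np.polyfit(x, y, deg=1)
```

With 50 low-noise points the estimates land very close to the true parameter values.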

3 Fundamentals of Modeling
Predictive modeling (PM) can be thought of as learning a mapping from an input set of vector measurements x to a scalar output y (a vector output is also possible but rarely used in practice). One of the variables is expressed as a function of the others (the predictor variables): response variable Y, predictor variables Xi, with ŷ = f(x1, x2, …, xp; θ). When Y is quantitative, the task of estimating a mapping from the p-dimensional X to Y is called regression. When Y is categorical, the task of learning a mapping from X to Y is called classification learning or supervised classification.

4 Predictive Modeling
Predictive modeling predicts the value of some target characteristic of an object on the basis of observed values of other characteristics of the object. Examples: Regression (called Prediction in data mining) and Classification.

5 Predictive Modeling
Prediction: linear regression, nonlinear regression. Classification (supervised learning): decision trees, k-NN, SVM, ANN.

6 Definition of Regression
Regression is a (statistical) methodology that utilizes the relation between two or more quantitative variables so that one variable can be predicted from the other, or others. Examples: Sales of a product can be predicted by using the relationship between sales volume and amount of advertising The performance of an employee can be predicted by using the relationship between performance and aptitude tests The size of a child’s vocabulary can be predicted by using the relationship between the vocabulary size, the child’s age and the parents’ educational input.

7 Regression Problem Visualisation
[Figure: data points in the (x, y) plane with a fitted curve ŷ; rmse = s]

8 Structure of a Linear Regression Model
Given a set of features x, a linear predictor has the form ŷ = w0 + w1x1 + … + wpxp. The output ŷ is a real-valued, quantitative variable. [Figure: fitted line ŷ over the data (x, y).]

9 Classification Problem
Given a database D = {t1, t2, …, tn} and a set of classes C = {C1, …, Cm}, the Classification Problem is to define a mapping f: D → C, where each ti is assigned to one class. Prediction is similar, but may be viewed as having an infinite number of classes. Dr. Navneet Goyal, BITS, Pilani

10 Classification Classification is the task of assigning an object, described by a feature vector, to one of a set of mutually exclusive groups A linear classifier has a linear decision boundary The perceptron training algorithm is guaranteed to converge in a finite time when the data set is linearly separable

11 What is Classification?
Classification is also known as (statistical) pattern recognition. The aim is to build a machine/algorithm that can assign appropriate qualitative labels to new, previously unseen quantitative data, using a priori knowledge and/or information contained in a training set. The patterns to be classified are usually groups of measurements/observations that are believed to be informative for the classification task. Example: face recognition. [Figure: training data D = {X, y} and prior knowledge are used to design/learn a classifier m(θ, x), which predicts a class label ŷ for a new pattern x.]

12 Classification: Applications
Spam mail; IDS (rare-event classification); credit rating; medical diagnosis; categorizing cells as malignant or benign based on MRI scans; classifying galaxies based on their shapes; predicting preterm births; crop yield prediction; identifying mushrooms as poisonous or edible.

13 Classification: Applications
Example: credit card company. Every purchase is placed in 1 of 4 classes: authorize; ask for further identification before authorizing; do not authorize; do not authorize but contact police. Two functions of data mining: examine historical data to determine how the data fit into the 4 classes, then apply the model to each new purchase.

14 Classification: 3 phase job
Model building phase (learning phase) Testing phase Model usage phase

15 Distance-based Classification
Nearest Neighbors: if it walks like a duck, quacks like a duck, and looks like a duck, then it is probably a duck. [Figure: given the training records and a test record, compute the distance to each training record and choose the k "nearest" records.]

16 Definition of Nearest Neighbor
The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.
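This definition translates almost directly into code (an illustrative numpy sketch; the toy "duck"/"goose" data are invented for the example):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training records."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to each record
    nearest = np.argsort(dists)[:k]               # indices of the k smallest distances
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Two well-separated clusters of training records.
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array(["duck", "duck", "duck", "goose", "goose", "goose"])
label = knn_predict(X_train, y_train, np.array([0.5, 0.5]), k=3)
```

A test record near the first cluster gets that cluster's label by majority vote.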

17 Support Vector Machines
Find a linear hyperplane (decision boundary) that will separate the data

18 Support Vector Machines
One Possible Solution

19 Support Vector Machines
Another possible solution

20 Support Vector Machines
Other possible solutions

21 Support Vector Machines
Which one is better? B1 or B2? How do you define better?

22 Support Vector Machines
Find a hyperplane that maximizes the margin => B1 is better than B2
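The margin criterion can be made concrete with a small numpy sketch (the toy data and the two candidate boundaries are assumptions standing in for B1 and B2, which are only drawn, not given numerically, on the slides):

```python
import numpy as np

def geometric_margin(w, b, X, y):
    # Smallest signed distance from any training point to the hyperplane
    # w.x + b = 0 (positive only if every point is correctly classified).
    return np.min(y * (X @ w + b) / np.linalg.norm(w))

# Toy separable data: class +1 in the upper-right, class -1 in the lower-left.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1, 1, -1, -1])

# Two hypothetical candidate boundaries standing in for B1 and B2.
m1 = geometric_margin(np.array([1.0, 1.0]), 0.0, X, y)  # B1: x1 + x2 = 0
m2 = geometric_margin(np.array([1.0, 0.0]), 0.0, X, y)  # B2: x1 = 0
```

B1 leaves more room to both classes, so the maximum-margin criterion prefers it.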

23 Support Vector Machines
What if the problem is not linearly separable?

24 Nonlinear Support Vector Machines
What if decision boundary is not linear?

25 Support Vector Machines
The solid line is preferred. Geometrically we can characterize the solid plane as being "furthest" from both classes. How can we construct the plane "furthest" from both classes?

26 Support Vector Machines
[Figure: the best plane bisects the closest points in the convex hulls.] Examine the convex hull of each class's training data (indicated by dotted lines) and then find the closest points in the two convex hulls (the circles labeled d and c). The convex hull of a set of points is the smallest convex set containing the points. If we construct the plane that bisects these two points (w = d − c), the resulting classifier should be robust in some sense.
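A rough numpy sketch of this bisecting-plane construction (a simplification: the closest pair of training points stands in for the closest points c and d of the two convex hulls, and the toy coordinates are invented):

```python
import numpy as np

A = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # class -1
B = np.array([[3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])   # class +1

# Closest pair of points between the two clouds (stand-ins for the
# closest points c and d of the two convex hulls in the figure).
diffs = A[:, None, :] - B[None, :, :]
i, j = np.unravel_index(np.argmin((diffs ** 2).sum(-1)), (len(A), len(B)))
c, d = A[i], B[j]

# Plane bisecting the segment cd: normal w = d - c, through the midpoint.
w = d - c
b = -w @ (c + d) / 2
scores_A = A @ w + b        # should all fall on the negative side
scores_B = B @ w + b        # should all fall on the positive side
```

The resulting plane separates the two toy classes, illustrating why bisecting the closest hull points yields a sensible classifier.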

27 Non-Convex or Concave Set
[Figure: a convex set and a non-convex (concave) set.] A function (in blue) is convex if and only if the region above its graph (in green) is a convex set.

28 Convex hull: elastic band analogy
For planar objects, i.e., objects lying in the plane, the convex hull may be easily visualized by imagining an elastic band stretched open to encompass the given object; when released, it will assume the shape of the required convex hull.

29 Disadvantages of Linear Decision Surfaces
[Figure: data in the (Var1, Var2) plane that a linear decision surface separates poorly.]

30 Advantages of Non-Linear Surfaces
[Figure: data in the (Var1, Var2) plane separated cleanly by a non-linear decision surface.]

31 Linear Classifiers in High-Dimensional Spaces
[Figure: data mapped from (Var1, Var2) into Constructed Feature 1 and Constructed Feature 2, where a linear boundary separates the classes.] Find a function Φ(x) to map the data to a different space.

32 Handwriting Recognition
Task T: recognizing and classifying handwritten words within images. Performance measure P: percent of words correctly classified. Training experience E: a database of handwritten words with given classifications.

33 Handwriting Recognition

34 Pattern Recognition Example
Handwriting Digit Recognition

35 Pattern Recognition Example
Handwritten digit recognition: each digit is represented by a 28×28 pixel image, which can be represented by a vector of 784 real numbers. Objective: an algorithm that takes such a vector as input and identifies the digit it represents. This is a non-trivial problem due to the variability in handwriting. Take images of a large number (N) of digits – the training set. Use the training set to tune the parameters of an adaptive model. Each digit in the training set has been identified by a target vector t, which represents the identity of the corresponding digit. The result of running a machine learning algorithm can be expressed as a function y(x), which takes a new digit x as input and outputs a vector y, encoded in the same way as t. The form of y(x) is determined during the learning (training) phase.

36 Pattern Recognition Example
Generalization: the ability to correctly categorize new examples that differ from those used in training. Generalization is a central goal in pattern recognition. Preprocessing: input variables are preprocessed to transform them into some new space of variables where, it is hoped, the problem will be easier to solve (see figure). Images of digits are translated and scaled so that each digit is contained within a box of fixed size; this reduces variability. The preprocessing stage is also referred to as feature extraction. New test data must be preprocessed using the same steps as the training data.

37 Pattern Recognition Example
Preprocessing can also speed up computations. For example, consider face detection in a high-resolution video stream: find useful features that are fast to compute yet preserve useful discriminatory information enabling faces to be distinguished from non-faces. The average value of the image intensity in a rectangular sub-region can be evaluated extremely efficiently, and a set of such features is very effective in fast face detection. Because such features are smaller in number than the number of pixels, this is referred to as a form of dimensionality reduction. Care must be taken that important information is not discarded during preprocessing.

38 Pattern Recognition Example
Supervised & unsupervised learning: if the training data consist of both input vectors and target vectors, the task is supervised learning (e.g. the digit recognition problem – classification; predicting crop yield – regression). If the training data consist of only input vectors, the task is unsupervised learning: discovering groups of similar examples within the data (clustering), finding the distribution of data within the input space (density estimation), or projecting data from a high-dimensional space down to two or three dimensions for visualization.

39 Reinforcement Learning
The problem of finding suitable actions to take in a given situation in order to maximize a reward

40 Polynomial Curve Fitting
Observe a real-valued input variable x and use x to predict the value of the target variable t. Synthetic data are generated from sin(2πx), with random noise added to the target values. [Figure: target variable t plotted against input variable x.]

41 Polynomial Curve Fitting
N observations of x: x = (x1, …, xN)T with targets t = (t1, …, tN)T. The goal is to exploit the training set to predict the value of t̂ for a new x – an inherently difficult problem. Data generation: N = 10 points spaced uniformly in the range [0, 1], generated from sin(2πx) by adding small Gaussian noise. Such noise is typical of the effect of unobserved variables. [Figure: target variable t against input variable x.]

42 Polynomial Curve Fitting
We fit the polynomial y(x, w) = w0 + w1x + w2x² + … + wMx^M, where M is the order of the polynomial. Is a higher value of M better? We'll see shortly! The coefficients w0, …, wM are denoted by the vector w. y(x, w) is a nonlinear function of x but a linear function of the coefficients w; such functions are called linear models.
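The whole curve-fitting setup fits in a few lines of numpy (an illustrative sketch; the seed and noise level are assumptions):

```python
import numpy as np

# Recreate the slides' synthetic data: N = 10 points, uniform in [0, 1],
# targets from sin(2*pi*x) plus small Gaussian noise.
rng = np.random.default_rng(1)
N = 10
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=N)

# Fit the order-M polynomial y(x, w) = w0 + w1 x + ... + wM x^M (here M = 3).
M = 3
w = np.polyfit(x, t, deg=M)        # least-squares coefficients, highest power first
y_fit = np.polyval(w, x)           # fitted values at the training points
```

With M = 3 the fitted curve follows sin(2πx) closely, matching the "best fit" slide.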

43 Sum-of-Squares Error Function
The fit is scored by the sum-of-squares error E(w) = ½ Σn {y(xn, w) − tn}², measuring the misfit between the predictions y(xn, w) and the corresponding target values tn.

44 Polynomial curve fitting

45 Polynomial curve fitting
How do we choose M? This is called model selection or model comparison.

46 0th Order Polynomial: poor representation of sin(2πx)

47 1st Order Polynomial: poor representation of sin(2πx)

48 3rd Order Polynomial: best fit to sin(2πx)

49 9th Order Polynomial: over-fit – poor representation of sin(2πx)

50 Polynomial Curve Fitting
Good generalization is the objective. How does generalization performance depend on M? Consider a separate test set of 100 points; calculate E(w*) for both the training data and the test data, and choose the M which minimizes the test error. The root-mean-square (RMS) error ERMS = √(2E(w*)/N) is sometimes more convenient: division by N allows us to compare data sets of different sizes on an equal footing, and the square root ensures ERMS is measured on the same scale (and in the same units) as the target variable t.
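This model-selection experiment can be sketched directly (illustrative numpy code; the seed and the random test-set draw are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

def targets(x):
    # sin(2*pi*x) plus small Gaussian noise, as on the earlier slides.
    return np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.size)

x_train = np.linspace(0, 1, 10)          # N = 10 points, uniform in [0, 1]
t_train = targets(x_train)
x_test = rng.uniform(0, 1, 100)          # held-out set of 100 points
t_test = targets(x_test)

def e_rms(w, x, t):
    # Root-mean-square error: on the same scale and units as t.
    return np.sqrt(np.mean((np.polyval(w, x) - t) ** 2))

train_err, test_err = {}, {}
for M in (0, 1, 3, 9):
    w = np.polyfit(x_train, t_train, deg=M)
    train_err[M] = e_rms(w, x_train, t_train)
    test_err[M] = e_rms(w, x_test, t_test)
```

Training error keeps falling as M grows (M = 9 interpolates all 10 points), but the test error is minimized at a moderate M – the over-fitting pattern from the slides.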

51 Over-fitting
Why is it happening? For small M (0, 1, 2), the polynomial is too inflexible to handle the oscillations of sin(2πx). For M = 3–8 it is flexible enough to handle the oscillations. For M = 9 it is too flexible: the training error is 0, but the generalization error is high.

52 Polynomial Coefficients

53 Data Set Size: M = 9
The larger the data set, the more complex the model we can afford to fit to the data. A rough heuristic: the number of data points should be no less than 5–10 times the number of adaptive parameters in the model.

54 Over-fitting Problem
Should we limit the number of parameters according to the available training set? The complexity of the model should depend only on the complexity of the problem! Least-squares estimation represents a specific case of maximum likelihood, and over-fitting is a general property of maximum likelihood. The over-fitting problem can be avoided using the Bayesian approach!

55 Regularization: Penalize large coefficient values
Add a penalty term to the error function: Ẽ(w) = ½ Σn {y(xn, w) − tn}² + (λ/2)‖w‖², where λ governs the relative importance of the penalty term.
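Penalized least squares has a closed-form solution, sketched below in numpy (the λ values are illustrative assumptions; the monomial design matrix mirrors the M = 9 example):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=10)

# Degree-9 polynomial design matrix: columns x^0 .. x^9.
Phi = np.vander(x, 10, increasing=True)

def ridge_fit(Phi, t, lam):
    # Minimize ||Phi w - t||^2 / 2 + lam * ||w||^2 / 2 via the normal equations.
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ t)

w_unreg = ridge_fit(Phi, t, lam=1e-12)   # effectively unregularized: huge coefficients
w_reg = ridge_fit(Phi, t, lam=1e-3)      # penalty shrinks the coefficients
```

The penalty tames the wildly large coefficients that the unregularized M = 9 fit produces.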

56 Regularization:

57 Regularization:

58 Regularization: vs.

59 Polynomial Coefficients

60 Linear Models for Regression
A polynomial is a specific example of a broad class of functions called linear regression models: functions which are linear in the adjustable parameters. The role of regression is to predict the value of one or more continuous target variables t given the value of a D-dimensional vector x of input variables; we have already discussed polynomial curve fitting for regression. The simplest linear regression models are also linear functions of the input variables. A more useful class of functions is obtained by taking a linear combination of a fixed set of nonlinear functions of the input variables, known as basis functions: such models are linear in the parameters but nonlinear with respect to the input variables.

61 Linear Models for Regression
Linear models have significant limitations as practical techniques for machine learning, particularly for problems involving high-dimensional input spaces. Nonetheless, they possess nice analytical properties and form the foundation for more sophisticated models.

62 Linear Basis Function Models
The simplest linear model for regression with d input variables is y(x, w) = w0 + w1x1 + … + wdxd, where x1, …, xd are the input variables. Compare this with linear regression in one variable and with polynomial regression in one variable. The model is linear in both the parameters and the input variables – a significant limitation, since it is a linear function of the inputs: in the 1-D case it is a straight-line fit.

63 Linear Basis Function Models

64 Linear Basis Function Models
Polynomial regression is a particular example of this model. How? With a single input variable x, choose the basis functions to be the polynomial basis. A limitation of polynomial basis functions is that they are global: a change in one region of input space affects all others. One remedy is to divide the input space into regions and use a different polynomial in each region, which is equivalent to spline functions.

65 Linear Basis Function Models
Polynomial basis functions φj(x) = x^j: these are global; a small change in x affects all basis functions.

66 Linear Basis Function Models
Gaussian basis functions φj(x) = exp(−(x − μj)²/(2s²)): these are local; a small change in x only affects nearby basis functions. μj and s control location and scale (width).

67 Linear Basis Function Models
Sigmoidal basis functions φj(x) = σ((x − μj)/s), where σ(a) = 1/(1 + e^(−a)). These too are local; a small change in x only affects nearby basis functions. μj and s control location and scale (slope).

68 Home Work Read about Gaussian, Sigmoidal, & Fourier basis functions
Sequential Learning & Online algorithms Will discuss in the next class!

69 The Bias-Variance Decomposition
The bias-variance decomposition is a formal method for analyzing the prediction error of a predictive model. In the projectile analogy: bias = the average distance between the target and the location where the projectile hits the ground (depends on the angle); variance = the deviation between x and the average position where the projectile hits the ground (depends on the force); noise: if the target is not stationary, the observed distance is also affected by changes in the location of the target.

70 The Bias-Variance Decomposition
A low-degree polynomial has high bias (fits poorly) but low variance across different data sets. A high-degree polynomial has low bias (fits well) but high variance across different data sets.

71 The Bias-Variance Decomposition
True height of Chinese emperor: 200cm, about 6’6”. Poll a random American: ask “How tall is the emperor?” We want to determine how wrong they are, on average

72 The Bias-Variance Decomposition
Each scenario has an expected value of 180 (i.e. a bias error of 20) but increasing variance in the estimate. Squared error = (bias error)² + variance, so as the variance increases, the error increases.
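The emperor example can be checked numerically (a sketch; the three spreads below are invented to mimic "increasing variance in the estimate"):

```python
import numpy as np

rng = np.random.default_rng(4)
true_height = 200.0

# Three polling scenarios: mean guess 180 (bias error 20), growing spread.
results = []
for spread in (0.0, 10.0, 20.0):
    guesses = 180.0 + rng.normal(scale=spread, size=100_000)
    mse = np.mean((guesses - true_height) ** 2)
    bias_sq = (np.mean(guesses) - true_height) ** 2
    variance = np.var(guesses)
    # Decomposition: mean squared error = bias^2 + variance (an exact identity
    # for the sample statistics used here).
    results.append((mse, bias_sq, variance))
```

The squared error starts at bias² = 400 and grows with the variance, as the slide states.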

73 Effect of regularization parameter on the bias and variance terms
[Figure: a small regularization parameter gives low bias but high variance; a large regularization parameter gives high bias but low variance.]

74 An example of the bias-variance trade-off

75 Beating the bias-variance trade-off
We can reduce the variance term by averaging lots of models trained on different datasets. This seems silly: if we had lots of different datasets, it would be better to combine them into one big training set – with more training data there will be much less variance. Weird idea: we can create different datasets by bootstrap sampling of our single training dataset. This is called "bagging", and it works surprisingly well. But if we have enough computation, it's better to do the right Bayesian thing: combine the predictions of many models using the posterior probability of each parameter vector as the combination weight.
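A numpy sketch of the variance-reduction claim (illustrative assumptions throughout: degree-5 polynomial fits as the high-variance base model, sin(2πx) data as on the earlier slides, and arbitrary ensemble sizes):

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0, 1, 15)
x0 = 0.5                                  # evaluate every model at this point

def fit_at(x_train, t_train, x0, M=5):
    # High-variance base model: degree-M polynomial fit, evaluated at x0.
    return np.polyval(np.polyfit(x_train, t_train, deg=M), x0)

def draw_targets():
    return np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

# Variance of a single model vs. an average of 10 models, each trained on
# its own independently drawn dataset (the "lots of datasets" ideal).
single = [fit_at(x, draw_targets(), x0) for _ in range(300)]
averaged = [np.mean([fit_at(x, draw_targets(), x0) for _ in range(10)])
            for _ in range(300)]

# Bagging approximates this ideal using bootstrap resamples of ONE dataset.
t = draw_targets()
idx = rng.integers(0, x.size, (20, x.size))   # 20 bootstrap index sets
bagged = np.mean([fit_at(x[i], t[i], x0) for i in idx])
```

Averaging over independent datasets cuts the prediction variance roughly by the ensemble size; bagging mimics that effect when only one dataset is available.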

