Regression

So far, we've been looking at classification problems, in which the y values are either 0 or 1. Now we'll briefly consider the case where the y's are numeric values. We'll see how to extend nearest neighbor and decision trees to solve regression problems.

Local Averaging As in nearest neighbor, we remember all the training data. When someone asks a question, we find the K nearest old data points and return the average of the answers (y values) associated with them.
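
As a concrete illustration, here is a minimal Python sketch of local averaging. The function name, the use of Euclidean distance, and the toy data are illustrative choices, not something specified in the slides.

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=3):
    """Predict y at x_query as the average y of the k nearest training points."""
    # Euclidean distance from the query point to every stored training point
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest points
    nearest = np.argsort(dists)[:k]
    # Local average: the mean of their y values
    return y_train[nearest].mean()

# Tiny usage example with made-up data
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.1, 0.9, 2.1, 2.9])
print(knn_regress(X_train, y_train, np.array([1.4]), k=2))  # averages the y's at x=1 and x=2
```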

K=1 When K = 1, this is like fitting a piecewise-constant function to our data. It will track the data very closely but, as in nearest neighbor, it will have high variance and be prone to overfitting.

Bigger K When K is larger, variations in the data will be smoothed out, but then there may be too much bias, making it hard to model the real variations in the function.

Locally weighted averaging One problem with plain local averaging, especially as K gets large, is that we are letting all K neighbors have equal influence on the prediction for the query point. In locally weighted averaging, we still average the y values of multiple neighbors, but we weight them according to how close they are to the target point. That way, we let nearby points have a larger influence than farther ones.

Kernel functions for weighting A "kernel" function K takes the query point and a training point and returns a weight, which indicates how much influence the y value of that training point should have on the predicted y value of the query point. Then, to compute the predicted y value, we add up the y values of the points used in the prediction, multiplied by their weights, and divide by the sum of the weights. With x the query point and k ranging over its neighbors x_k (with outputs y_k), the prediction is ŷ(x) = Σ_k K(x, x_k) y_k / Σ_k K(x, x_k).

Epanechnikov Kernel A common choice is the Epanechnikov kernel: with t = |x − x_k| / λ for some bandwidth λ, the weight is D(t) = (3/4)(1 − t²) when |t| ≤ 1, and 0 otherwise, so points farther than λ from the query point get no influence at all.
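
The sketch below combines the Epanechnikov kernel with the weighted-average prediction described above. The bandwidth parameter `lam`, the function names, and the fallback to the global mean when no point lies within the bandwidth are assumptions made for the example.

```python
import numpy as np

def epanechnikov(dist, lam=1.0):
    """Epanechnikov weight: (3/4)(1 - t^2) for t = dist/lam if |t| <= 1, else 0."""
    t = dist / lam
    return np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t**2), 0.0)

def weighted_average_predict(X_train, y_train, x_query, lam=1.0):
    """Kernel-weighted average of the training y values near x_query."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    w = epanechnikov(dists, lam)
    if w.sum() == 0.0:          # no training point within the bandwidth
        return y_train.mean()   # fall back to the global mean (one possible choice)
    return np.dot(w, y_train) / w.sum()

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.1, 0.9, 2.1, 2.9])
print(weighted_average_predict(X_train, y_train, np.array([1.4]), lam=1.5))
```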

Regression Trees Like decision trees, but with real-valued constant outputs at the leaves.

Leaf values Assume that multiple training points are in the leaf and we have decided, for whatever reason, to stop splitting. In the boolean case, we use the majority output value as the value for the leaf. In the numeric case, we'll use the average output value. So, if we're going to use the average value at a leaf as its output, we'd like to split up the data so that the leaf averages are not too far away from the actual items in the leaf. Statistics has a good measure of how spread out a set of numbers is (and, therefore, how different the individuals are from the average): it's called the variance of a set.

Variance A measure of how spread out a set of numbers is. The mean of m values z_1 through z_m is z̄ = (1/m) Σ_i z_i, and the variance is essentially the average of the squared distance between the individual values and the mean: Var = (1/(m−1)) Σ_i (z_i − z̄)². If it's an average, you might wonder why we're dividing by m−1 instead of m: dividing by m−1 makes it an unbiased estimator, which is a good thing.
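
A quick sketch of the mean and the unbiased variance in code; the helper name is made up for the example, and note that NumPy's `np.var` divides by m unless you pass `ddof=1`.

```python
import numpy as np

def unbiased_variance(z):
    """Variance with the m - 1 denominator, i.e. the unbiased estimator above."""
    z = np.asarray(z, dtype=float)
    m = len(z)
    mean = z.sum() / m                         # mean of the m values
    return ((z - mean) ** 2).sum() / (m - 1)   # average squared distance from the mean

z = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(unbiased_variance(z))   # 4.571...
print(np.var(z, ddof=1))      # same result using NumPy directly
```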

Splitting We're going to use the average variance of the children to evaluate the quality of splitting on a particular feature. Here we have a data set, for which I've just indicated the y values.

Splitting Just as we did in the binary case, we can compute a weighted average variance, depending on the relative sizes of the two sides of the split.
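
To make the splitting criterion concrete, here is a small sketch that scores a candidate split by the size-weighted average of the child variances (lower is better). The function name and the threshold-based split on a single feature are my own framing of the slide's idea.

```python
import numpy as np

def split_score(y_left, y_right):
    """Weighted average of the child variances; lower means a better split."""
    n_l, n_r = len(y_left), len(y_right)
    var_l = np.var(y_left, ddof=1) if n_l > 1 else 0.0
    var_r = np.var(y_right, ddof=1) if n_r > 1 else 0.0
    return (n_l * var_l + n_r * var_r) / (n_l + n_r)

# Example: split one feature column at a threshold and score the split
x = np.array([1.0, 2.0, 3.0, 8.0, 9.0, 10.0])   # one feature
y = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])    # targets
left = y[x <= 3.0]
right = y[x > 3.0]
print(split_score(left, right))   # small, since each side is nearly constant
```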

Splitting We can see that the average variance of splitting on feature 3 is much lower than that of splitting on f7, and so we'd choose to split on f3.

Stopping Stop when the variance at the leaf is small enough. Then, set the value at the leaf to be the mean of the y values of the elements.
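
Putting the pieces together, here is a hedged sketch of a tiny regression-tree builder: it greedily picks the feature and threshold with the lowest size-weighted child variance, stops when the leaf variance is small enough (or no useful split exists), and stores the leaf mean as the prediction. All names, the candidate-threshold scheme, and the stopping constants are illustrative rather than taken from the slides.

```python
import numpy as np

def build_tree(X, y, min_variance=0.01, min_samples=2):
    """Recursively build a regression tree; leaves store the mean y value."""
    # Stopping rule: too few points, or the y's in this node are already nearly constant
    if len(y) < min_samples or np.var(y) <= min_variance:   # (1/m variance, fine for stopping)
        return {"leaf": True, "value": y.mean()}

    best = None
    for f in range(X.shape[1]):                      # try every feature
        for t in np.unique(X[:, f])[:-1]:            # candidate thresholds on that feature
            mask = X[:, f] <= t
            y_l, y_r = y[mask], y[~mask]
            # Size-weighted average of the child variances
            score = (len(y_l) * np.var(y_l) + len(y_r) * np.var(y_r)) / len(y)
            if best is None or score < best["score"]:
                best = {"score": score, "feature": f, "threshold": t, "mask": mask}

    if best is None:                                 # no feature offers a split
        return {"leaf": True, "value": y.mean()}

    return {"leaf": False,
            "feature": best["feature"],
            "threshold": best["threshold"],
            "left": build_tree(X[best["mask"]], y[best["mask"]], min_variance, min_samples),
            "right": build_tree(X[~best["mask"]], y[~best["mask"]], min_variance, min_samples)}

def predict(tree, x):
    """Walk the tree until a leaf is reached and return its stored mean."""
    while not tree["leaf"]:
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["value"]

X = np.array([[1.0], [2.0], [3.0], [8.0], [9.0], [10.0]])
y = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])
tree = build_tree(X, y)
print(predict(tree, np.array([2.5])), predict(tree, np.array([9.5])))
```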