More Machine Learning: Linear Regression, Squared Error, L1 and L2 Regularization, Gradient Descent



Recall: Key Components of Intelligent Agents
Representation Language: Graph, Bayes Nets
Inference Mechanism: A*, variable elimination, Gibbs sampling
Learning Mechanism: Maximum Likelihood, Laplace Smoothing, and many more: linear regression, perceptron, k-Nearest Neighbor, …
Evaluation Metric: Likelihood, and many more: squared error, 0-1 loss, conditional likelihood, precision/recall, …

Recall: Types of Learning
The techniques we have discussed so far are examples of a particular kind of learning:
Supervised: the training examples included the correct labels or outputs. Vs. Unsupervised (or semi-supervised, or distantly-supervised, …): none (or some, or only part, …) of the labels in the training data are known.
Parameter Estimation: we only tried to learn the parameters in the BN, not the structure of the BN graph. Vs. Structure Learning: the BN graph is not given as an input, and the learning algorithm's job is to figure out what the graph should look like.
The distinctions below aren't actually about the learning algorithm itself, but rather about the type of model being learned:
Classification: the output is a discrete value, like Happy or Not Happy, or Spam or Ham. Vs. Regression: the output is a real number.
Generative: the model of the data represents a full joint distribution over all relevant variables. Vs. Discriminative: the model assumes some fixed subset of the variables will always be "inputs" or "evidence", and it creates a distribution for the remaining variables conditioned on the evidence variables.
Parametric vs. Nonparametric: I will explain this later.
We won't talk much about structure learning, but we will cover some other kinds of learning (regression, unsupervised, discriminative, nonparametric, …) in later lectures.

Regression vs. Classification
Our NBC spam detector was a classifier: the output Y was one of two options, Ham or Spam. More generally, classifiers give an output from a (usually small) finite (or countably infinite) set of options. E.g., predicting who will win the presidency in the next election is a classification problem: the set of possible outcomes (US citizens) is finite. Regression models give a real number as output. E.g., predicting what the temperature will be tomorrow is a regression problem: any real number greater than or equal to 0 (Kelvin) is a possible outcome.

Quiz: regression vs. classification
For each prediction task below, determine whether regression or classification is more appropriate.
Predict who will win the Super Bowl next year
Predict the gender of a baby when it's born
Predict the weight of a child one year from now
Predict the average life expectancy of all babies born today
Predict the price of Apple, Inc.'s stock at the close of trading tomorrow
Predict whether Microsoft or Apple will have a higher valuation at the close of trading tomorrow

Answers: regression vs. classification
For each prediction task below, determine whether regression or classification is more appropriate.
Predict who will win the Super Bowl next year: Classification
Predict the gender of a baby when it's born: Classification
Predict the weight of a child one year from now: Regression
Predict the average life expectancy of all babies born today: Regression
Predict the price of Apple, Inc.'s stock at the close of trading tomorrow: Regression
Predict whether Microsoft or Apple will have a higher valuation at the close of trading tomorrow: Classification

Concrete Example
Suppose I want to buy a house that's 2000 square feet. Predict how much it will cost.

More Realistic Data
(Figure: reported crime statistics for U.S. counties, relating violent crime per capita to the percentage of the population under the federal poverty level.)

Linear Regression
Suppose there are N input variables, X1, …, XN (all real numbers). A linear regression is a function that looks like this:
Y = w0 + w1*X1 + w2*X2 + … + wN*XN
The wi variables are called weights or parameters. Each one is a real number. The set of all functions that look like this (one function for each choice of weights w0 through wN) is called the hypothesis class for linear regression.
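To make the hypothesis class concrete, here is a minimal sketch in Python (the function name and example numbers are ours, not from the slides): a prediction is simply the intercept w0 plus the weighted sum of the inputs.

```python
def predict(weights, x):
    """Linear regression hypothesis: Y = w0 + w1*X1 + ... + wN*XN.

    weights: [w0, w1, ..., wN]; x: [X1, ..., XN].
    """
    return weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))


# With weights w0 = 1.0, w1 = 2.0, w2 = 3.0 and inputs X1 = 10.0, X2 = 20.0:
print(predict([1.0, 2.0, 3.0], [10.0, 20.0]))  # 1 + 2*10 + 3*20 = 81.0
```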

Hypotheses
In this example, there is only one input variable: X1 is square footage. The hypothesis class is all functions Y = w0 + w1 * (square footage).
(Figure: several example elements of the hypothesis class, i.e. candidate lines Y = w0 + w1*X1 with different weights, are drawn.)

Learning for Linear Regression
Linear regression tells us a whole set of possible functions to use for prediction. How do we choose the best one from this set? This is the learning problem for linear regression:
Input: a set of training examples, where each example contains a value for (X1, …, XN, Y).
Output: a set of weights (w0, …, wN) for the "best-fitting" linear regression model.

Quiz: Learning for Linear Regression
(Table of training data with columns X and Y.)
For the data on the left, what's the best-fit linear regression model?

Answer: Learning for Linear Regression
For the data on the left, the best-fit linear regression model can be found from two of the points, (X, Y) = (10, 80) and (30, 40):
80 = w0 + w1 * 10
40 = w0 + w1 * 30
Subtracting the second equation from the first: 80 - 40 = (w0 - w0) + w1 * 10 - w1 * 30, so 40 = w1 * (-20), giving w1 = -2.
Then 80 = w0 + (-2) * 10, so w0 = 100.
The best-fit model is Y = 100 + (-2) * X.
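As a quick sanity check (assuming, as in the reconstruction above, that the two points used are (10, 80) and (30, 40)):

```python
# Sanity check of the worked solution: the line Y = 100 - 2*X should pass
# exactly through both points used in the derivation.
points = [(10, 80), (30, 40)]
w0, w1 = 100, -2
for x, y in points:
    assert w0 + w1 * x == y
print("Y = 100 - 2*X fits both points")
```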

Linear Regression with Noisy Data
In the previous example, we could use only two points and find a line that passed through all of the remaining points. In this example, the points are only "approximately" linear: no single line passes through all of them exactly. We'll need a more complex algorithm to handle this.

Quadratic Loss (a.k.a. "Squared Error")
(Table: the training data consists of M examples; example i has input values Xi1, Xi2, …, XiN and output value Yi.)
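The loss formula on this slide did not survive in the transcript; the quadratic (squared error) loss it names is the sum, over the M training examples, of the squared difference between the true output Yi and the model's prediction. A small sketch in Python (names are ours):

```python
def quadratic_loss(weights, X, Y):
    """Sum of squared errors of a linear model over a dataset.

    weights: [w0, w1, ..., wN]; X: M input vectors [X1, ..., XN]; Y: M targets.
    """
    total = 0.0
    for x, y in zip(X, Y):
        prediction = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
        total += (y - prediction) ** 2
    return total


print(quadratic_loss([100, -2], [[10], [30]], [80, 40]))  # 0.0 for a perfect fit
```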

Objective Function
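The equation on this slide is not preserved in the transcript; the standard objective for linear regression with quadratic loss (in the notation used above, not copied from the slide) is to choose the weights that minimize the sum of squared errors:

```latex
\min_{w_0, \dots, w_N} \; \sum_{i=1}^{M} \Bigl( Y_i - \bigl( w_0 + \textstyle\sum_{j=1}^{N} w_j X_{ij} \bigr) \Bigr)^2
```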

Closed-form Solution for 1 input variable
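The derivation itself is likewise missing from the transcript; the usual closed-form answer for one input variable, obtained by setting the derivatives of the quadratic loss with respect to w0 and w1 to zero, is (a standard result, not copied from the slide):

```latex
w_1 = \frac{\sum_{i=1}^{M} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{M} (X_i - \bar{X})^2},
\qquad
w_0 = \bar{Y} - w_1 \bar{X}
```

where \bar{X} and \bar{Y} are the means of the X and Y values in the training data.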

“Closed-form” Result
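A minimal sketch of that closed-form result in Python (the helper name is ours; the example data points are the two reconstructed from the earlier quiz):

```python
def fit_simple_linear_regression(xs, ys):
    """Closed-form least-squares fit for Y = w0 + w1*X."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    w1 = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
          / sum((x - x_mean) ** 2 for x in xs))
    w0 = y_mean - w1 * x_mean
    return w0, w1


print(fit_simple_linear_regression([10, 30], [80, 40]))  # (100.0, -2.0)
```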

Quiz: Learning for Linear Regression
(Table of training data with columns X and Y, as before.)
Using the closed-form solution for quadratic loss, compute w0 and w1 for this dataset.

Answer: Learning for Linear Regression
Applying the closed-form solution for quadratic loss to this dataset gives w1 = -2 and w0 = 100. Note that w1 and w0 match what we calculated before!

Overfitting and Regularization
(Figure: loss plotted as a function of a parameter.)
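The regularized objectives themselves are not preserved in the transcript; for reference, the standard L1 and L2 penalties named in the lecture title add a term to the quadratic loss that punishes large weights (the notation and the constant λ below are standard, not copied from the slides):

```latex
\text{L2 (ridge)}: \quad \sum_{i=1}^{M} \bigl( Y_i - \hat{Y}_i \bigr)^2 + \lambda \sum_{j=1}^{N} w_j^2
\qquad
\text{L1 (lasso)}: \quad \sum_{i=1}^{M} \bigl( Y_i - \hat{Y}_i \bigr)^2 + \lambda \sum_{j=1}^{N} \lvert w_j \rvert
```

Here \hat{Y}_i = w_0 + \sum_j w_j X_{ij} is the model's prediction; larger λ pushes the weights toward zero, which helps control overfitting.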

Gradient Descent
For more complex loss functions, it is often not possible to find closed-form solutions. Instead, people resort to "iterative methods" that find better and better parameter estimates until they converge to the best setting. We'll go over one example of this kind of method, called "gradient descent".

Gradient Descent
(Figure: the gradient descent update rule, with the learning rate labeled.)
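The idea is that each weight is repeatedly moved a small step (scaled by the learning rate) in the direction opposite the gradient of the loss. A minimal sketch for one-variable linear regression with quadratic loss, reusing the data from the earlier quiz (the learning rate and step count are illustrative choices of ours, not values from the slides):

```python
def gradient_descent(xs, ys, learning_rate=0.0004, steps=100_000):
    """Fit Y = w0 + w1*X by repeatedly stepping opposite the gradient of the
    quadratic loss with respect to w0 and w1."""
    w0, w1 = 0.0, 0.0
    for _ in range(steps):
        # Gradient of sum_i (y_i - (w0 + w1*x_i))^2 with respect to w0 and w1.
        grad_w0 = sum(-2 * (y - (w0 + w1 * x)) for x, y in zip(xs, ys))
        grad_w1 = sum(-2 * (y - (w0 + w1 * x)) * x for x, y in zip(xs, ys))
        w0 -= learning_rate * grad_w0
        w1 -= learning_rate * grad_w1
    return w0, w1


print(gradient_descent([10, 30], [80, 40]))  # approaches (100.0, -2.0)
```

If the learning rate is too large the updates overshoot and diverge; too small and many more steps are needed to converge.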

Quiz: Gradient
(Figure: LOSS as a function of the parameter w, with three points labeled a, b, and c.)
At which of the points a, b, c is the gradient about zero? Check the boxes that apply.

Answer: Gradient
All three boxes are checked: the gradient is about zero at a, at b, and at c.

Quiz: Gradient
(Figure: LOSS as a function of the parameter w, with points a, b, and c marked.)
How does the gradient compare at points a, b, and c? Options: a, b, c, equal everywhere.

Answer: Gradient
Equal everywhere: the gradient is the same at a, b, and c.

Quiz: Gradient Descent
(Figure: LOSS as a function of the parameter w, with candidate initialization points a, b, and c marked.)
Which point will allow gradient descent to reach the global minimum, if it is used as the initialization for parameter w? Options: a, b, c.

Answer: Gradient Descent
Point c: initializing at c allows gradient descent to reach the global minimum.