
CS 478 – Tools for Machine Learning and Data Mining: Linear and Logistic Regression (adapted from various sources, e.g., Luiz Pessoa's PY 206 class at Brown University)

Regression A form of statistical modeling that evaluates the relationship between one variable (termed the dependent variable) and one or more other variables (termed the independent variables). It is a form of global analysis, because it produces only a single equation for the entire relationship. In short: a model for predicting one variable from another.

Linear Regression Regression used to fit a linear model to data where the dependent variable is continuous: given a set of points (Xi, Yi), we wish to find a linear function (a line in 2 dimensions) that “goes through” these points. In general, the points are not exactly aligned, so we find the line that best fits the points.

Residual Error or residual: observed value − predicted value, i.e., ei = yi − ŷi

Sum-squared Error (SSE) SSE = Σ(yi − ŷi)² = Σ(yi − (b0 + b1·xi))², summed over all data points

What is Best Fit? The smaller the SSE, the better the fit. Hence, – Linear regression attempts to minimize SSE (or, equivalently, to maximize R²) In what follows, assume 2 dimensions (a single independent variable).

Analytical Solution Setting the partial derivatives of SSE with respect to b0 and b1 to zero yields the least-squares estimates: b1 = (n·Σxiyi − Σxi·Σyi) / (n·Σxi² − (Σxi)²) and b0 = ȳ − b1·x̄
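As a minimal sketch of this closed-form solution (assuming NumPy; the helper names fit_line and sse are ours, not from the slides):

import numpy as np

def fit_line(x, y):
    """Closed-form simple linear regression: returns (b0, b1) minimizing SSE."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    b1 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

def sse(x, y, b0, b1):
    """Sum-squared error of the fitted line on the data."""
    resid = np.asarray(y, dtype=float) - (b0 + b1 * np.asarray(x, dtype=float))
    return float(np.sum(resid**2))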

Example (I) A worked example: a table with columns x, y, x², and xy, whose column sums plug into the formulas above. Target: y = 2x + 1.5

Example (II)

Example (III)
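As a stand-in for the sums worked through in Examples (I)–(III), here is a hypothetical run of the fit_line sketch above on noisy data generated around the target y = 2x + 1.5 (the data points are invented for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = np.arange(1, 9, dtype=float)                     # 8 sample points
y = 2 * x + 1.5 + rng.normal(0, 0.5, size=x.shape)   # noisy target y = 2x + 1.5

b0, b1 = fit_line(x, y)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")               # should land near 1.5 and 2
print(f"SSE = {sse(x, y, b0, b1):.3f}")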

Logistic Regression Regression used to fit a curve to data in which the dependent variable is binary, or dichotomous. Typical application: medicine, where we might want to predict response to treatment, coding survivors as 1 and those who do not survive as 0.

Example Observations: for each value of SurvRate, the number of dots is the number of patients with that value of NewOut. Regression: standard linear regression. Problem: extending the regression line a few units left or right along the X axis produces predicted probabilities that fall outside [0, 1].

A Better Solution Regression curve: a sigmoid function (bounded by the asymptotes y = 0 and y = 1)
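A minimal sketch of such a curve, assuming the standard logistic sigmoid (the curve on the slide itself is an image; the helper name is ours):

import numpy as np

def sigmoid(z):
    """Logistic sigmoid: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# The asymptotes: large negative inputs approach 0, large positive inputs approach 1.
print(sigmoid(np.array([-10.0, -1.0, 0.0, 1.0, 10.0])))
# ≈ [0.0000454, 0.269, 0.5, 0.731, 0.99995]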

Odds Given some event with probability p of occurring, the odds of that event are given by: odds = p / (1 − p). Consider the following data, cross-tabulating testosterone level and delinquency:

Testosterone   Delinquent   Not delinquent   Total
Normal         402          3614             4016
High           101          345              446

The odds of being delinquent if you are in the Normal group are: p_delinquent / (1 − p_delinquent) = (402/4016) / (1 − 402/4016) = 402/3614 = 0.111

Odds Ratio The odds of not being delinquent in the Normal group are the reciprocal of this: – 3614/402 = 8.99 Now, for the High-testosterone group – odds(delinquent) = 101/345 = 0.293 – odds(not delinquent) = 345/101 = 3.42 When we go from Normal to High, the odds of being delinquent nearly triple: – Odds ratio: 0.293/0.111 = 2.64 – 2.64 times more likely to be delinquent with high testosterone levels
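A quick check of these numbers in code, as a minimal sketch (the odds helper is ours; the counts come from the table above):

def odds(p):
    """Odds of an event with probability p."""
    return p / (1.0 - p)

# Counts from the testosterone/delinquency table.
normal_delinq, normal_total = 402, 4016
high_delinq, high_total = 101, 446

odds_normal = odds(normal_delinq / normal_total)   # ≈ 0.111
odds_high = odds(high_delinq / high_total)         # ≈ 0.293
print(odds_high / odds_normal)                     # odds ratio ≈ 2.6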

Logit Transform The logit is the natural log of the odds: logit(p) = ln(odds) = ln(p / (1 − p))

Logistic Regression In logistic regression, we seek a model of the form: logit(p) = ln(p / (1 − p)) = β0 + β1X That is, the log odds (logit) is assumed to be linearly related to the independent variable X. So now we can focus on solving an ordinary (linear) regression!

Recovering Probabilities Solving the logit equation for p: p = e^(β0 + β1X) / (1 + e^(β0 + β1X)) = 1 / (1 + e^−(β0 + β1X)) which gives p as a sigmoid function!
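A small sketch of the transform and its inverse (the function names are ours):

import numpy as np

def logit(p):
    """Log odds of probability p."""
    return np.log(p / (1.0 - p))

def inv_logit(z):
    """Recover probability from log odds (the sigmoid)."""
    return 1.0 / (1.0 + np.exp(-z))

p = np.array([0.1, 0.5, 0.9])
print(inv_logit(logit(p)))   # round-trips back to [0.1, 0.5, 0.9]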

Logistic Response Function When the response variable is binary, the shape of the response function is often sigmoidal:

Interpretation of β1 Let: – odds1 = odds for value X, i.e., p/(1 − p) – odds2 = odds for value X + 1 unit Then: odds2 / odds1 = e^(β0 + β1(X+1)) / e^(β0 + β1X) = e^β1 Hence, the exponentiated slope describes the proportionate rate at which the predicted odds change with each successive unit of X

Sample Calculations Suppose a cancer study yields: – log odds = −2.6836 + 0.0812 × SurvRate Consider a patient with SurvRate = 40 – log odds = −2.6836 + 0.0812(40) = 0.564 – odds = e^0.564 = 1.758 – the patient is 1.758 times more likely to be improved than not Consider another patient with SurvRate = 41 – log odds = −2.6836 + 0.0812(41) = 0.646 – odds = e^0.646 = 1.907 – the patient’s odds are 1.907/1.758 = 1.085 times (or 8.5%) better than those of the previous patient Using probabilities – p40 = 0.637 and p41 = 0.656 – Improvements appear different when expressed as odds and as probabilities
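The same calculations, scripted (the coefficients are those implied by the odds values on the slide):

import math

b0, b1 = -2.6836, 0.0812   # cancer-study coefficients implied by the slide's odds

def log_odds(survrate):
    return b0 + b1 * survrate

for sr in (40, 41):
    lo = log_odds(sr)
    o = math.exp(lo)
    p = o / (1.0 + o)
    print(f"SurvRate={sr}: log odds={lo:.3f}, odds={o:.3f}, p={p:.3f}")

# The odds ratio between adjacent SurvRate values is e^b1, about 1.085.
print(math.exp(b1))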

Example 1 (I) A systems analyst studied the effect of computer programming experience on the ability to complete a task within a specified time. Twenty-five persons were selected for the study, with varying amounts of computer experience (in months). Results are coded in binary fashion: Y = 1 if the task was completed successfully; Y = 0 otherwise. Loess: a form of local regression.

Example 1 (II) Results from a standard package give: – β0 = −3.0597 and β1 = 0.1615 Estimated logistic regression function: p̂ = e^(−3.0597 + 0.1615X) / (1 + e^(−3.0597 + 0.1615X)) For example, the fitted value for X = 14 is: p̂ = e^(−0.7987) / (1 + e^(−0.7987)) = 0.31 (the estimated probability that a person with 14 months of experience will successfully complete the task)
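A quick script for these fitted values, using the coefficients above; it also computes the odds ratios discussed on the next two slides:

import math

b0, b1 = -3.0597, 0.1615   # coefficients from the task-completion study

def p_hat(x):
    """Estimated probability of completing the task with x months' experience."""
    z = b0 + b1 * x
    return math.exp(z) / (1.0 + math.exp(z))

print(p_hat(14))           # ≈ 0.31
print(math.exp(b1))        # odds ratio per month ≈ 1.175
print(math.exp(15 * b1))   # odds ratio across 15 months ≈ 11.3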

Example 1 (III) We know that the probability of success increases sharply with experience – Odds ratio: exp(β1) = e^0.1615 = 1.175 – Odds increase by 17.5% with each additional month of experience A unit increase of one month is quite small, and we might want to know the change in odds for a longer stretch of time – For c units of X: exp(c·β1)

Example 1 (IV) Suppose we want to compare individuals with relatively little experience to those with extensive experience, say 10 months versus 25 months (c = 15) – Odds ratio: e^(15 × 0.1615) = 11.3 – The odds of completing the task increase 11-fold!

Example 2 (I) In a study of the effectiveness of coupons offering a price reduction, 1,000 homes were selected and coupons were mailed to them. Coupon price reductions: 5, 10, 15, 20, and 30 dollars. 200 homes were assigned at random to each coupon value. X: amount of price reduction Y: binary variable indicating whether or not the coupon was redeemed

Example 2 (II) Fitted response function: – fitted slope β1 ≈ 0.097 (with the intercept β0 fitted as well) – Odds ratio: exp(β1) = e^0.097 = 1.102 The odds of a coupon being redeemed are estimated to increase by 10.2% with each $1 increase in the coupon value (i.e., $1 in price reduction)

Putting it to Work For each value of X, you may not have a probability directly but rather a number of (x, y) pairs, from which you can extract frequencies and hence probabilities – Raw data: (x, y) pairs, with y ∈ {0, 1} – Probability data: (x, p, n) triples, where p is the proportion of y = 1 at that x and the 3rd entry n is the number of occurrences in the raw data – Odds and logit data then follow from the probabilities, as sketched below
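A hypothetical version of this pipeline in code (the raw pairs are invented for illustration; fit_line is the least-squares helper sketched earlier):

from collections import defaultdict
import math

# Hypothetical raw data: (x, y) pairs with y in {0, 1}.
raw = [(1, 0), (1, 0), (1, 1), (2, 0), (2, 1), (2, 1), (3, 1), (3, 1), (3, 1), (3, 0)]

# Aggregate into (x, p, n) triples: proportion of 1s and count at each x.
counts = defaultdict(lambda: [0, 0])          # x -> [number of 1s, total]
for x, y in raw:
    counts[x][0] += y
    counts[x][1] += 1
triples = [(x, ones / n, n) for x, (ones, n) in sorted(counts.items())]

# Transform probabilities to logits (requires 0 < p < 1 at each x).
xs = [x for x, p, n in triples]
logits = [math.log(p / (1 - p)) for x, p, n in triples]

# Fit an ordinary linear regression to the logits.
b0, b1 = fit_line(xs, logits)
print(b0, b1)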

Coronary Heart Disease (I)

Coronary Heart Disease (II)

Coronary Heart Disease (III) Note: the sums are weighted by the number of occurrences: Sum(X) = X1·#occ(X1) + … + X8·#occ(X8), and similarly for the other columns
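For instance, a minimal sketch of such occurrence-weighted sums over grouped data (the group values and counts here are invented):

# Hypothetical grouped data: (x value, number of occurrences) pairs.
groups = [(20, 10), (30, 15), (40, 12), (50, 8)]

sum_x = sum(x * n for x, n in groups)       # Sum(X) weighted by occurrences
sum_x2 = sum(x * x * n for x, n in groups)  # Sum(X^2), same weighting
total_n = sum(n for x, n in groups)
print(sum_x, sum_x2, total_n)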

Coronary Heart Disease (IV) Results from the regression give the fitted estimates β0 and β1

Summary Regression is a powerful data mining technique – It provides prediction – It offers insight into the relative predictive power of each variable We have focused on the case of a single independent variable – What about the general case?