
Logistic regression

Recall the simple linear regression model: $y = \beta_0 + \beta_1 x + \varepsilon$, where we are trying to predict a continuous dependent variable y from a continuous independent variable x. This model can be extended to the multiple linear regression model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon$. Here we are trying to predict a continuous dependent variable y from several continuous independent variables $x_1, x_2, \dots, x_p$.

Now suppose the dependent variable y is binary: it takes on two values, “Success” (1) or “Failure” (0). This is the situation in which logistic regression is used. We are interested in predicting y from a continuous independent variable x.

Example: We are interested in how the success (y) of a new antibiotic cream in curing “acne problems” depends on the amount (x) that is applied daily. The values of y are 1 (Success) or 0 (Failure). The values of x range over a continuum.

The Logistic Regression Model: Let p denote P[y = 1] = P[Success]. This quantity will increase with the value of x. The ratio $\frac{p}{1-p}$ is called the odds ratio (more precisely, the odds of success). This quantity will also increase with the value of x, ranging from zero to infinity. The quantity $\ln\left(\frac{p}{1-p}\right)$ is called the log odds ratio.

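Example: odds ratio, log odds ratio. Suppose a die is rolled: Success = “roll a six”, p = 1/6. The odds ratio: $\frac{p}{1-p} = \frac{1/6}{5/6} = \frac{1}{5} = 0.2$. The log odds ratio: $\ln\left(\frac{p}{1-p}\right) = \ln(0.2) \approx -1.609$.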
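The die example can be checked numerically. The following is a minimal sketch; the function names `odds` and `log_odds` are illustrative choices, not part of the slides.

```python
import math

def odds(p):
    """Odds of success, p / (1 - p), for a success probability 0 <= p < 1."""
    return p / (1.0 - p)

def log_odds(p):
    """Natural log of the odds of success."""
    return math.log(odds(p))

# Success = "roll a six" with a fair die
p_six = 1.0 / 6.0
print(round(odds(p_six), 4))      # 0.2
print(round(log_odds(p_six), 4))  # -1.6094
```

Note that the log odds is negative whenever p < 0.50 and positive whenever p > 0.50.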

The Logistic Regression Model assumes the log odds ratio is linearly related to x, i.e. $\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x$. In terms of the odds ratio: $\frac{p}{1-p} = e^{\beta_0 + \beta_1 x}$.

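The Logistic Regression Model: solving for p in terms of x gives $p = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$, or equivalently $p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$.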
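As a small illustration of the model solved for p, the sketch below evaluates p at a few values of x and confirms that taking the log odds recovers the linear predictor. The parameter values are hypothetical, chosen only for the demonstration.

```python
import math

def logistic_p(x, beta0, beta1):
    """P[y = 1] at x under the model ln(p / (1 - p)) = beta0 + beta1 * x."""
    eta = beta0 + beta1 * x
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical parameter values (not estimates from the slides)
beta0, beta1 = -2.0, 0.5

for x in (0.0, 4.0, 8.0):
    p = logistic_p(x, beta0, beta1)
    recovered = math.log(p / (1.0 - p))  # should equal beta0 + beta1 * x
    print(x, round(p, 4), round(recovered, 4))
```

With these values p equals 0.50 at x = 4, which is where the linear predictor is zero.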

Interpretation of the parameter $\beta_0$ (determines the intercept): at x = 0, $p = \frac{e^{\beta_0}}{1 + e^{\beta_0}}$. [Figure: p plotted against x]

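Interpretation of the parameter $\beta_1$ (determines when p is 0.50, along with $\beta_0$): p = 0.50 when $\beta_0 + \beta_1 x = 0$, i.e. when $x = -\beta_0 / \beta_1$. [Figure: p plotted against x]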

Also, when $x = -\beta_0 / \beta_1$, the derivative $\frac{dp}{dx} = \frac{\beta_1}{4}$ is the rate of increase in p with respect to x when p = 0.50.
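The rate of increase at p = 0.50 follows from differentiating the model solved for p. This is a short derivation using only the symbols already defined above:

```latex
\begin{align*}
p &= \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} \\
\frac{dp}{dx} &= \frac{\beta_1 e^{\beta_0 + \beta_1 x}}{\left(1 + e^{\beta_0 + \beta_1 x}\right)^2}
               = \beta_1 \, p \, (1 - p) \\
\left. \frac{dp}{dx} \right|_{p = 0.50} &= \beta_1 \cdot \tfrac{1}{2} \cdot \tfrac{1}{2} = \frac{\beta_1}{4}
\end{align*}
```

Since p(1 - p) is largest at p = 0.50, this is also the steepest point of the curve.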

Interpretation of the parameter $\beta_1$ (determines the slope when p is 0.50). [Figure: p plotted against x]

The data: for each case the data will consist of
1. a value for x, the continuous independent variable
2. a value for y (1 or 0) (Success or Failure)
Total of n = 250 cases.

Estimation of the parameters: The parameters are estimated by Maximum Likelihood estimation. The likelihood equations have no closed-form solution, so they are solved numerically by a statistical package such as SPSS.
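To give a sense of what a package does internally, here is a minimal sketch of maximum likelihood fitting by Newton-Raphson for a single covariate. It is an illustration under stated assumptions, not SPSS's actual algorithm: the function name `fit_logistic` and the small data set are hypothetical, and the data are deliberately not perfectly separated so that finite estimates exist.

```python
import math

def fit_logistic(xs, ys, iterations=25):
    """Newton-Raphson maximum likelihood for ln(p/(1-p)) = b0 + b1*x."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        # Score vector (g0, g1) and information matrix [[h00, h01], [h01, h11]]
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Newton step: solve the 2x2 system (information matrix) * step = score
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < 1e-10:
            break
    return b0, b1

# Hypothetical data: x = amount applied, y = 1 (Success) or 0 (Failure)
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

b0_hat, b1_hat = fit_logistic(xs, ys)
print(round(b0_hat, 3), round(b1_hat, 3))
```

At the maximum the score equations hold: the residuals y - p sum to zero, and so do the residuals weighted by x.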

Using SPSS to perform Logistic regression Open the data file:

Choose from the menu: Analyze -> Regression -> Binary Logistic

The following dialogue box appears Select the dependent variable (y) and the independent variable (x) (covariate). Press OK.

Here is the output: the estimates and their standard errors (S.E.).

The parameter Estimates

Interpretation of the parameter $\beta_0$ (determines the intercept). Interpretation of the parameter $\beta_1$ (determines when p is 0.50, along with $\beta_0$).

Another interpretation of the parameter $\beta_1$: the quantity $\beta_1 / 4$ is the rate of increase in p with respect to x when p = 0.50.

Nonparametric Statistical Methods

Definition: When the data are generated from a process (model) that is known except for a finite number of unknown parameters, the model is called a parametric model. Otherwise, the model is called a non-parametric model. Statistical techniques that assume a non-parametric model are called non-parametric.

Example – Parametric model: Normal distribution, known except for the two parameters $\mu$ and $\sigma$.

Example – Non-parametric model: No assumptions are made about the distribution; it could be normal, skewed, bimodal, etc.

The sign test: a nonparametric test for the central location of a distribution

We want to test $H_0$: median = $\mu_0$ against $H_A$: median $\ne \mu_0$ (or against a one-sided alternative).

The Sign test: 1. The test statistic: S = the number of observations that exceed $\mu_0$. Comment: If $H_0$: median = $\mu_0$ is true we would expect 50% of the observations to be above $\mu_0$ and 50% of the observations to be below $\mu_0$.

If $H_0$ is true then S will have a binomial distribution with p = 0.50, n = sample size. [Figure: distribution with 50% of the probability on each side of median = $\mu_0$]

If $H_0$ is not true then S will still have a binomial distribution. However, p will not be equal to 0.50. If $\mu_0$ > median then p < 0.50.

If $\mu_0$ < median then p > 0.50. Here p = the probability that an observation is greater than $\mu_0$.

Summarizing: If $H_0$ is true then S will have a binomial distribution with p = 0.50, n = sample size. [Figure: binomial distribution for n = 10]

The critical and acceptance region (n = 10): Choose the critical region so that $\alpha$ is close to 0.05 (or to another chosen level). E.g. if the critical region is {0, 1, 9, 10} then $\alpha$ = 0.0010 + 0.0098 + 0.0098 + 0.0010 = 0.0216.

E.g. (n = 10) if the critical region is {0, 1, 2, 8, 9, 10} then $\alpha$ = 0.0010 + 0.0098 + 0.0439 + 0.0439 + 0.0098 + 0.0010 = 0.1094.
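The size of any candidate critical region can be computed directly from the binomial distribution with p = 0.50. The sketch below does this for the two regions above; the helper names are illustrative. The exact value for {0, 1, 9, 10} is 22/1024, about 0.0215; the figure 0.0216 arises from adding probabilities that were first rounded to four decimal places.

```python
from math import comb

def binom_pmf(i, n, p=0.5):
    """P[S = i] when S is binomial with n trials and success probability p."""
    return comb(n, i) * p**i * (1 - p)**(n - i)

def size_of_region(region, n):
    """alpha = P[S falls in the critical region] when H0 (p = 0.50) is true."""
    return sum(binom_pmf(i, n) for i in region)

alpha_a = size_of_region({0, 1, 9, 10}, 10)
alpha_b = size_of_region({0, 1, 2, 8, 9, 10}, 10)
print(round(alpha_a, 4), round(alpha_b, 4))
```

Because S is discrete, only a limited set of $\alpha$ values is attainable, which is why the region is chosen to get close to the target level rather than to hit it exactly.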

If n is large we can use the Normal approximation to the Binomial. Namely, S has a Binomial distribution with p = ½ and n = sample size. Hence for large n, S has approximately a Normal distribution with mean $np = n/2$ and standard deviation $\sqrt{np(1-p)} = \sqrt{n}/2$.

Hence for large n, use $z = \frac{S - n/2}{\sqrt{n}/2}$ as the test statistic (in place of S). Choose the critical region for z from the Standard Normal distribution, i.e. reject $H_0$ if $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$ (two-tailed; a one-tailed test can also be set up).
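A minimal sketch of the large-sample version of the test follows. The counts used (62 of 100 observations above $\mu_0$) are hypothetical, and the two-sided p-value is obtained from the standard normal tail via `math.erfc`.

```python
import math

def sign_test_z(s, n):
    """z = (S - n/2) / (sqrt(n)/2), the standardized sign test statistic."""
    return (s - n / 2.0) / (math.sqrt(n) / 2.0)

def two_sided_p_value(z):
    """P[|Z| >= |z|] for a standard normal Z."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical sample: 62 of n = 100 observations exceed mu_0
z = sign_test_z(62, 100)
p_value = two_sided_p_value(z)
reject_at_5_percent = abs(z) > 1.96  # z_{alpha/2} for alpha = 0.05
print(round(z, 2), round(p_value, 4), reject_at_5_percent)
```

Here z = 2.4 exceeds 1.96, so $H_0$ would be rejected at the 5% level in a two-tailed test.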

Nonparametric Confidence Intervals

Assume that the data $x_1, x_2, x_3, \dots, x_n$ are a sample from an unknown distribution. Now arrange the data in increasing order: $x_{(1)} < x_{(2)} < x_{(3)} < \dots < x_{(n)}$, where $x_{(1)}$ = the smallest observation, $x_{(2)}$ = the 2nd smallest observation, and $x_{(n)}$ = the largest observation.

Consider the k-th smallest observation and the k-th largest observation in the data $x_1, x_2, x_3, \dots, x_n$, namely $x_{(k)}$ and $x_{(n-k+1)}$. If at least k observations lie below the median then $x_{(k)}$ < median. If at least k observations lie above the median then median < $x_{(n-k+1)}$. Hence $P[x_{(k)} < \text{median} < x_{(n-k+1)}]$ = P[at least k observations lie below the median and at least k observations lie above the median].

Thus $P[x_{(k)} < \text{median} < x_{(n-k+1)}]$ = P[at least k observations lie below the median and at least k observations lie above the median] = P[the number of observations below the median is at least k and at most n − k] = $P[k \le S \le n-k]$, where S = the number of observations below the median. S has a binomial distribution with n = the sample size and p = 1/2.

Hence $P[x_{(k)} < \text{median} < x_{(n-k+1)}] = P[k \le S \le n-k]$ = p(k) + p(k + 1) + … + p(n − k) = P, where the p(i)’s are binomial probabilities with n = the sample size and p = 1/2. This means that $x_{(k)}$ to $x_{(n-k+1)}$ is a 100P% confidence interval for the median.

Summarizing: $x_{(k)}$ to $x_{(n-k+1)}$ is a 100P% confidence interval for the median, where P = p(k) + p(k + 1) + … + p(n − k) and the p(i)’s are binomial probabilities with n = the sample size and p = 1/2.

Example: n = 10 and k = 2. Using the binomial probabilities, P = p(2) + p(3) + p(4) + p(5) + p(6) + p(7) + p(8) = 0.9784. Hence $x_{(2)}$ to $x_{(9)}$ is a 97.84% confidence interval for the median.
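The construction can be sketched in a few lines. The function name `median_ci` and the ten data values are hypothetical (they are not the data from the slides); the coverage is computed exactly from the binomial distribution, which for n = 10 and k = 2 gives 1002/1024, about 0.978.

```python
from math import comb

def median_ci(data, k):
    """Return (x_(k), x_(n-k+1), P) where P = P[k <= S <= n - k], S ~ Bin(n, 1/2)."""
    n = len(data)
    ordered = sorted(data)
    coverage = sum(comb(n, i) for i in range(k, n - k + 1)) / 2**n
    # ordered is 0-indexed: x_(k) is ordered[k - 1], x_(n-k+1) is ordered[n - k]
    return ordered[k - 1], ordered[n - k], coverage

# Hypothetical sample of n = 10 measurements
sample = [4.1, 7.3, 2.8, 9.0, 5.5, 6.2, 3.9, 8.4, 1.7, 6.8]

lower, upper, coverage = median_ci(sample, 2)
print(lower, upper, round(coverage, 4))
```

Increasing k narrows the interval and lowers its confidence level, so k is chosen as the largest value that still gives the desired coverage.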

Example Suppose that we are interested in determining if a new drug is effective in reducing cholesterol. Hence we administer the drug to n = 10 patients with high cholesterol and measure the reduction.

The data

The data arranged in order: $x_{(2)} = -3$ to $x_{(9)} = 15$ is a 97.84% confidence interval for the median.

Example: Suppose that we repeat the study in the previous example with n = 20 patients with high cholesterol.

The data

The binomial distribution with n = 20, p = 0.5. Note: p(6) + p(7) + p(8) + p(9) + p(10) + p(11) + p(12) + p(13) + p(14) = 0.9586. Hence $x_{(6)}$ to $x_{(15)}$ is a 95.86% confidence interval for the median reduction in cholesterol.

The data arranged in order: $x_{(6)} = -1$ to $x_{(15)} = 9$ is a 95.86% confidence interval for the median.

For large values of n one can use the normal approximation to the Binomial to find the value of k so that $x_{(k)}$ to $x_{(n-k+1)}$ is a 95% confidence interval for the median, i.e. we want to find k so that $P[k \le S \le n-k] \approx 0.95$.
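One way to carry this out is sketched below. It assumes a continuity correction and rounds k down, which is a common convention but an assumption here, since the slide's own formula is not shown. The exact binomial coverage is computed alongside as a check; the function names are illustrative.

```python
import math
from math import comb

def approx_k(n, z=1.96):
    """Normal-approximation k with continuity correction (an assumption):
    solve (k - 0.5 - n/2) / (sqrt(n)/2) = -z and round down."""
    return max(1, math.floor(n / 2.0 + 0.5 - z * math.sqrt(n) / 2.0))

def exact_coverage(n, k):
    """Exact P[k <= S <= n - k] for S binomial with n trials and p = 1/2."""
    return sum(comb(n, i) for i in range(k, n - k + 1)) / 2**n

for n in (20, 100):
    k = approx_k(n)
    print(n, k, round(exact_coverage(n, k), 4))
```

For n = 20 this gives k = 6, in agreement with the interval $x_{(6)}$ to $x_{(15)}$ used in the example above.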

Next day we will consider:
1. The Wilcoxon signed rank test. The Wilcoxon signed rank test is an alternative to the Sign test, a test for the central location of a single population.
2. The Wilcoxon rank sum test. The Wilcoxon rank sum test is a nonparametric test for comparing the central location of two populations.