Mean, Proportion, CLT Bootstrap


From Probability to Statistics

In all our probability calculations, we have assumed that we know all quantities needed to solve the problem:
- Portfolio problems: to find the expected return and standard deviation of a portfolio, we assumed we knew the mean and standard deviation of the returns of the underlying stocks.
- Potato chip example: to find the proportion of bags below the 8-ounce minimum, we assumed we knew the mean and standard deviation of the weight of chips in each bag.
In practice, these types of parameters are not given to us; we must estimate them from data. Statistical analysis usually proceeds along the following lines:
1. Postulate a probability model (usually including unknown parameters) for a situation involving uncertainty; e.g., assume that a certain quantity follows a normal distribution.
2. Use data to estimate the unknown parameters in the model.
3. Plug the estimated parameters into the model in order to make predictions from it.

How do we start?

The first step, picking a model, must be based on an understanding of the situation to be modeled. Which assumptions are plausible? Which are not? These questions are answered by judgment, not by precise statistical techniques. Examples:
- Assume that daily changes in a stock price follow a normal distribution. Use historical data to estimate the mean and standard deviation. Once we have estimates, we might use the model to predict future price ranges or to value an option on the stock.
- Assume that demand for a fashion item is normally distributed. Once we have estimates, we might use the model to set production levels.

How do we get data and make inferences?

The first step in understanding the process of estimation is understanding basic properties of sampled data and sample statistics, since these are the basis of estimation. When we talk about sampling, it is always in the context of a fixed underlying population:
- If we look at 50 daily changes in IBM stock, we are looking at a sample of size 50 from the population of all daily changes in IBM stock.
- If we ask 150 shoppers whether or not they buy corn flakes, we have a sample of size 150 from all possible shoppers.
If the population is very large (as in these examples), we generally treat it as though it were infinite; this simplifies matters. Thus, we are primarily concerned with finite samples from infinite populations.
A single sample from a population is a random variable. Its distribution is the population distribution; e.g., the distribution of a randomly selected daily change in IBM stock is the distribution over all daily changes; the probability that a randomly selected shopper buys corn flakes is the proportion of the entire population that buys corn flakes.

Random Sample

A random sample from a population is a set of randomly selected observations from that population. If X1, …, Xn are a random sample, then:
- they are independent;
- they are identically distributed, all with the distribution of the underlying population.
A sample statistic is any quantity calculated from a random sample. The most familiar example of a sample statistic is the sample mean X̄, given by
X̄ = (X1 + X2 + … + Xn)/n.
The sample mean gives an estimate of the population mean µ = E[Xi].
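As a quick illustration in R (a minimal sketch; the population parameters N(10, 2²) are made-up values for the example):

x <- rnorm(50, mean = 10, sd = 2)   # a random sample of size 50 from N(10, 4)
mean(x)                             # the sample mean: an estimate of mu = 10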

Distribution of the Sample Mean

Every sample statistic is a random variable. Randomness is introduced through the sampling mechanism. As noted above, the sample mean of a random sample X1, …, Xn is an estimate of the population mean µ = E[Xi]. How good an estimate is it? How can we assess the uncertainty in the estimate? To answer these questions, we need to examine the sampling distribution of the sample mean; that is, the distribution of the random variable X̄.
Assume that the underlying population is normal with mean µ and variance σ². This means that Xi ~ N(µ, σ²) for all i. The Xi's are independent, since we assume we have a random sample.
- The sum of independent normal random variables is normally distributed.
- The usual rules for means and variances apply: the expected value of the sum is the sum of the expected values; the variance of the sum is the sum of the variances (by independence).
- Any linear transformation of a normal random variable is normal; in particular, multiplication by a constant preserves normality.

Distribution of the Sample Mean

Using these two facts, we find that if Xi ~ N(µ, σ²) for all i, then
X1 + X2 + … + Xn ~ N(nµ, nσ²), and therefore X̄ ~ N(µ, σ²/n).
The sample mean from a normal population has a normal distribution.
- First consequence: the expected value of the sample mean is the population mean; "on average" the sample mean correctly estimates the underlying mean.
- The standard deviation of a sample statistic is called its standard error. Thus, we have shown that the standard error of the sample mean is σ/√n, where σ is the underlying standard deviation and n is the sample size.
- Second consequence: because the standard error of the sample mean is σ/√n, the uncertainty in this estimate decreases as the sample size n increases. (That's good.)
- The uncertainty (as measured by the standard error) decreases rather slowly: to cut the standard error in half, we need to collect four times as much data, because of the square root. (That's not so good, but that's life.)
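A short simulation makes the √n effect concrete (a sketch; the N(0, 1) population and the sample sizes 25 and 100 are arbitrary choices for illustration):

means_n25  <- replicate(10000, mean(rnorm(25)))   # 10,000 sample means with n = 25
means_n100 <- replicate(10000, mean(rnorm(100)))  # 10,000 sample means with n = 100
sd(means_n25)    # close to 1/sqrt(25)  = 0.20
sd(means_n100)   # close to 1/sqrt(100) = 0.10: quadrupling n halves the SE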

Example: Suppose the number of miles driven each week by US car owners is normally distributed with a standard deviation of σ = 75 miles. Suppose we plan to estimate the population mean number of miles driven per week by US car owners using a random sample of size n = 100. What is the probability that our estimate will differ from the true value by more than 10 miles?
Denote the population mean by µ and the sample mean by X̄. Then X̄ ~ N(µ, σ²/n), with standard error σ/√n = 75/√100 = 7.5. We need to find P(|X̄ − µ| > 10). By symmetry of the normal distribution, it is
P(|X̄ − µ| > 10) = 2·P(Z > 10/7.5) = 2·P(Z > 1.33) ≈ 2 × 0.0918 = 0.1836.
Thus, the probability that our estimate will be off by more than 10 miles is 18.36%.
If the underlying population is not normal, what can be done?
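The same calculation can be checked in R (note the 18.36% above uses the rounded table value z = 1.33; the exact quantile gives a slightly smaller answer):

2 * pnorm(-10 / (75 / sqrt(100)))   # = 2*P(Z > 1.333...) = 0.1824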

Central Limit Theorem

By the central limit theorem, regardless of the underlying population, the distribution of the sample mean tends towards N(µ, σ²/n) as n becomes large. If we accept the use of this approximation, we don't need to assume that the number of miles driven per week in the example is normally distributed (as long as our sample size n is large). We will use this approximation repeatedly to assess the error in X̄ as an estimate of µ.
How large should n be for the normal approximation to be accurate? There is no simple answer (it depends on the underlying distribution), but n ≥ 30 is a reasonable rule of thumb.
If the underlying population is finite of size N, and if the sample size n is not a small proportion of N, we use the following finite-population correction to the standard error:
standard error = (σ/√n)·√((N − n)/(N − 1)).
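A quick simulation illustrates the theorem (a sketch; the exponential population and the sample sizes 5 and 50 are illustrative choices):

means_n5  <- replicate(10000, mean(rexp(5)))    # n = 5: histogram still visibly skewed
means_n50 <- replicate(10000, mean(rexp(50)))   # n = 50: close to normal
hist(means_n5); hist(means_n50)                 # compare the two shapes
c(mean(means_n50), sd(means_n50))               # near mu = 1 and sigma/sqrt(50) = 0.141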

Sampling Distribution of the Sample Proportion

Consider estimating any of the following quantities:
- the proportion of voters who will vote for a third-party candidate in the next election;
- the proportion of visits to a web site that result in a sale;
- the proportion of shoppers who prefer crunchy over creamy.
In each of these examples, we are trying to estimate a population proportion. Denote a generic population proportion by the symbol p. We estimate a population proportion using a sample proportion. For example, if a poll surveys 1000 voters and finds that 85 of those surveyed plan to vote for a third-party candidate, then the sample proportion is 85/1000 = 8.5%. The population proportion is what the poll would find if it could ask every voter in the population.
Denote the sample proportion by the symbol p̂. Once we have collected a random sample, the sample proportion is known. We use it to estimate the true, unknown population proportion p.

Estimating a proportion can be formulated as a special case of estimating a population mean. Consider again the example of a poll of 1000 voters. Imagine encoding responses to a question about third-party candidates as follows: for the ith person polled,
Xi = 1 if the ith person plans to vote for a third-party candidate, and Xi = 0 otherwise.
Our random sample consists of X1, …, X1000. If 85 respondents indicated that they would vote for a third-party candidate, then X1 + … + X1000 = 85, because 85 of the Xi's are equal to 1 and all the rest are equal to 0. The sample proportion p̂ = (X1 + … + X1000)/1000 is just a special case of the sample mean.
How good an estimate of the population proportion p is the sample proportion? How effective are polls and surveys? By how much is the sample proportion likely to deviate from the true population proportion p? This is measured by the standard deviation (standard error) of the sample proportion:
SE(p̂) = √(p(1 − p)/n).
It is greatest when p = 0.5.
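A small simulation shows how sample proportions scatter around p (a sketch; p = 0.09 and n = 1000 anticipate the polling example below):

phat <- replicate(10000, mean(rbinom(1000, 1, 0.09)))  # 10,000 simulated polls of 1000 voters
sd(phat)                    # close to the theoretical standard error
sqrt(0.09 * 0.91 / 1000)    # sqrt(p(1-p)/n) = 0.00905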

EXAMPLE

Suppose that the true, unknown proportion p of voters who will vote for a third-party candidate in the next election is 9%. What is the probability that a poll of 1000 voters will find a sample proportion that differs from the true proportion by more than 2%? We need to find P(|p̂ − p| > 0.02). The standard error is √(0.09 × 0.91/1000) ≈ 0.00905, so
P(|p̂ − p| > 0.02) = 2·P(Z > 0.02/0.00905) = 2·P(Z > 2.21) ≈ 0.027.
We conclude that the probability that the poll will be off by more than two percentage points is 0.027.
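In R, using the normal approximation to the sampling distribution of p̂ as above:

2 * pnorm(-0.02 / sqrt(0.09 * 0.91 / 1000))   # = 0.0271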

Confidence Intervals

For the mean µ of a population, a 100(1 − α)% CI is constructed as follows:
- When the population is normal and the SD σ is known: X̄ ± z_{α/2}·σ/√n, where z_{α/2} comes from the normal table.
- When the population is normal, σ is not known, but n is large (maybe > 50): use the same formula with s in place of σ.
- When the population is not necessarily normal, but n is large (maybe > 50 to 100, depending on how close to normal the population is or seems to be): use the same formula with σ, if known, or with s if σ is not known.
Summary: these intervals have probability approximately 1 − α of containing the true value of µ.

Demonstration with R

Take 1000 samples of size 200 from a Normal(µ = 0, σ² = 1) population. Calculate a 95% CI for each sample. Check how many of these contain the true µ. Answer = ___. Check that the percentage is approximately 1 − α.

x <- rnorm(200)                     # generate 200 standard normal rvs
mu <- mean(x); s <- sd(x)           # sample mean and sd
q95 <- qt(0.975, df = 199)          # t quantile with n - 1 degrees of freedom
lower <- mu - q95 * s / sqrt(200)   # CI endpoints: mean +/- t * s/sqrt(n)
upper <- mu + q95 * s / sqrt(200)
if (lower * upper > 0) contain <- 0 else contain <- 1   # interval misses mu = 0 iff endpoints share a sign

Demonstration with R

Write a function to find whether the confidence interval contains the mean.

demons <- function(nsize, conf) {
  x <- rnorm(nsize)                     # generate nsize standard normal rvs
  mu <- mean(x); s <- sd(x)             # sample mean and sd
  q <- qt((1 + conf)/2, df = nsize - 1) # t quantile with n - 1 degrees of freedom
  lower <- mu - q * s / sqrt(nsize)     # CI endpoints: mean +/- t * s/sqrt(n)
  upper <- mu + q * s / sqrt(nsize)
  if (lower * upper > 0) contain <- 0 else contain <- 1  # 1 if the interval contains mu = 0
  contain
}

Conduct a simulation study to check the validity of the confidence interval based on the t-statistic.

nsimu <- 1000
contain <- 1:nsimu
for (i in 1:nsimu) contain[i] <- demons(200, 0.95)
mean(contain)   # coverage proportion; should be close to 0.95

Higher confidence (Good!) = wider interval (Bad!)

The only way to control both confidence and interval size is to choose a sufficiently large n. For confidence 100(1 − α)% and width w we need w = 2·z_{α/2}·σ/√n, i.e. n = (2·z_{α/2}·σ/w)². If σ is not known, use your best guess for it (or preliminary data).
Example: Fisher's Iris data had n = 50, s = 3.5, and a 95% CI of 5.0 ± 1.96 × (3.5/√50) = 5.0 ± 0.97 = (4.03, 5.97).* This CI has width w = 2 × 0.97 = 1.94.
* The sample size 50, here, is on the borderline of what could be acceptable for the use of this procedure. It would be (slightly) better to use the t-procedure discussed below.
Suppose we want a CI of total width w = 0.5 (ignoring the data we have already gathered). How large a sample size should we use? Our best guess for σ is 3.5. (We don't have any other information to give us a better idea.) We should choose n ≈ (2 × 1.96 × 3.5/0.5)² = 753.**
** This value of n is large. If the answer to a question like this works out to be a small n (suggesting use of a t-test), then it's not really a valid answer; at best, it should be thought of as only a very rough estimate.
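The sample-size arithmetic of the example, done in R (using the exact 97.5% normal quantile rather than the rounded 1.96):

w <- 0.5; s_guess <- 3.5                  # desired total width and guessed sigma
n <- (2 * qnorm(0.975) * s_guess / w)^2   # solve w = 2*z*sigma/sqrt(n) for n
ceiling(n)                                # = 753, matching the hand calculation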

t-Interval

When the population is normal but σ² is not known and n is not large, what we've done so far doesn't work. (p.s.: How do we tell whether the population is normal?)
Demonstration: Repeat the previous demonstration, but with 50,000 samples of size 4 from an exponential distribution.
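A sketch of that demonstration, in the style of the demons function above but with exponential data (Exp(1) has mean 1, so the check is against µ = 1):

demo_exp <- function(n, conf) {
  x <- rexp(n)                          # sample from Exp(1); true mean is 1
  q <- qt((1 + conf)/2, df = n - 1)
  lower <- mean(x) - q * sd(x) / sqrt(n)
  upper <- mean(x) + q * sd(x) / sqrt(n)
  as.numeric(lower <= 1 && upper >= 1)  # 1 if the t-interval contains the true mean
}
contain <- replicate(50000, demo_exp(4, 0.95))
mean(contain)   # noticeably below 0.95: the t-interval is unreliable here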

Bootstrap

As a general term, bootstrapping describes any operation which allows a system to generate itself from its own small well-defined subsets (e.g. compilers, software to read tapes written in computer-independent form). The word is borrowed from the saying "pull yourself up by your own bootstraps."
In statistics, the bootstrap is a method allowing one to judge the uncertainty of estimators obtained from small samples, without prior assumptions about the underlying probability distributions. The method consists of forming many new samples of the same size as the observed sample, by drawing a random selection of the original observations, usually including some of the observations several times. The estimator under study (e.g. a mean, a correlation coefficient) is then computed for every one of the samples thus generated, and will show a probability distribution of its own. From this distribution, confidence limits can be given.
For details, see B. Efron, "Computers and the Theory of Statistics", SIAM Review 21 (1979) 460, or B. Efron, The Jackknife, the Bootstrap and Other Resampling Plans, SIAM, Philadelphia, 1982.
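A minimal sketch of the percentile bootstrap for a sample mean (the data vector here is made up for illustration):

x <- c(12, 15, 9, 14, 22, 11, 13, 17, 10, 16)                    # illustrative observed sample
boot_means <- replicate(10000, mean(sample(x, replace = TRUE)))  # resample with replacement, recompute the mean
quantile(boot_means, c(0.025, 0.975))    # 95% percentile bootstrap CI for the mean
sd(boot_means)                           # bootstrap estimate of the standard error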

Random Numbers

Random numbers are particular occurrences of random variables. They are used in Monte Carlo calculations, where three different types may be distinguished according to the method used to generate them:
- Truly random numbers are unpredictable in advance and can only be generated by a physical process such as radioactive decay: in the presence of radiation, a Geiger counter will record particles at time intervals that follow a truly random (exponential) distribution.
- Pseudorandom numbers are those most often used in Monte Carlo calculations. They are generated by a numerical algorithm, and are therefore predictable in principle, but appear to be truly random to someone who does not know the algorithm.
- Quasirandom numbers are also generated by a numerical algorithm, but are not intended to appear to have the properties of a truly random sequence; rather, they are optimized to give the fastest convergence of the Monte Carlo calculation.

Pseudo Random Numbers

Generated in a digital computer by a numerical algorithm, pseudorandom numbers are not random, but should appear to be random when used in Monte Carlo calculations. The most widely used and best understood pseudorandom generator is the Lehmer multiplicative congruential generator, in which each number r is calculated as a function of the preceding number in the sequence:
ri ≡ a·ri−1 (mod m)   or   ri ≡ a·ri−1 + c (mod m),
where a and c are carefully chosen constants, and m is usually a power of two, 2^k. All quantities appearing in the formula (except m) are integers of k bits. The product a·ri−1 is an integer of length 2k bits, and the effect of the modulo m is to mask off the most significant part of the result of the multiplication.
r0 is the seed of a generation sequence; many generators allow one to start with a different seed for each run of a program, to avoid re-generating the same sequence, or to preserve the seed at the end of one run for the beginning of a subsequent one. Before being used in calculations, the ri are usually transformed to floating point numbers normalized into the range [0, 1].
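A minimal sketch of such a generator in R (the constants a = 69069, c = 1, m = 2^32 are illustrative choices, not those of any particular library; R's doubles represent these integers exactly):

lcg <- function(n, seed = 12345, a = 69069, c = 1, m = 2^32) {
  r <- numeric(n)
  state <- seed
  for (i in 1:n) {
    state <- (a * state + c) %% m   # next integer in the sequence
    r[i] <- state / m               # normalize into [0, 1)
  }
  r
}
lcg(5)   # five pseudorandom numbers from seed 12345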

Generators of this type can be found which attain the maximum possible period of 2^(k−2), and whose sequences pass all reasonable tests of "randomness", provided one does not exhaust more than a few percent of the full period. See D.E. Knuth, The Art of Computer Programming, Addison-Wesley, 1981. A detailed discussion can be found in G. Marsaglia, "A Current View of Random Number Generators", in Computer Science and Statistics, Elsevier, Amsterdam, 1985.

Jackknife The jackknife is a method in statistics allowing one to judge the uncertainties of estimators derived from small samples, without assumptions about the underlying probability distributions. The method consists of forming new samples by omitting, in turn, one of the observations of the original sample. For each of the samples thus generated, the estimator under study can be calculated, and the probability distribution thus obtained will allow one to draw conclusions about the estimator's sensitivity to individual observations.
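A minimal sketch of the leave-one-out jackknife for the standard error of the mean (reusing the illustrative data vector from the bootstrap example; the SE formula is the standard jackknife one):

x <- c(12, 15, 9, 14, 22, 11, 13, 17, 10, 16)    # illustrative sample
n <- length(x)
theta_i <- sapply(1:n, function(i) mean(x[-i]))  # estimator recomputed with observation i omitted
theta_bar <- mean(theta_i)
jack_se <- sqrt((n - 1) / n * sum((theta_i - theta_bar)^2))  # jackknife standard error
jack_se   # for the mean this equals the usual s/sqrt(n)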