# Sampling and estimation

## Presentation on theme: "Sampling and estimation"— Presentation transcript:


Parameters and statistics
A parameter is a quantity used to describe a population. A statistic is a quantity computed from a sample; it describes the sample and is used to estimate a population parameter. We typically use statistics to estimate parameters because:

- It isn’t possible to examine the entire population.
- It isn’t feasible to examine the entire population (e.g., too expensive).

In order for a calculated statistic to convey information about the related population parameter, certain conditions must be met; those conditions are generally satisfied if the sample used to calculate the statistic is random.

Pages 215–216

Although it is not in the LOS, a reminder of the relationship between statistics and parameters as the basic foundation for sampling processes motivates the discussion of samples, sampling error, sampling biases, etc.

Random samples
A simple random sample is a subset of the population drawn in such a way that each element of the population has an equal probability of being selected. The key to random sampling lies in the lack of any patterns in the collection of the data elements.

Finite and limited populations can be sampled by assigning random numbers to all of the elements in the population, and then selecting the sample elements by using a random number generator and matching the generated numbers to the assigned numbers. (If you can enumerate the population, why don’t you just use it?) When we can’t identify all the members of the population, we often use kth member sampling (systematic sampling), where we select every kth member we observe until we have the necessary sample size.

LOS: Define simple random sampling. Page 216

One of the most famous examples of a nonrandom sample occurred in the 1936 presidential poll predictions, when The Literary Digest, a famous magazine of the time, predicted a victory for Alfred Landon against President Franklin D. Roosevelt. I’ll carry this example across several different slides to illustrate several different concepts. Tell students that the Literary Digest had collected its mailing address sample of 10 million people (resulting in 2.4 million responses) using the phone book, club memberships, and magazine subscriptions. Everyone was mailed a mock ballot and asked to return it to the magazine. It was a very, very expensive process, but it resulted in a very large sample.

Survey prediction: Alfred Landon wins over FDR with 57% of the vote to 43% of the vote. – Literary Digest, 1936
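The two selection schemes described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of 100 labelled elements, the seed, and the function names `simple_random_sample` and `systematic_sample` are hypothetical and do not come from the slides.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct elements so that every element is equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def systematic_sample(stream, k, n):
    """Take every k-th member observed until n members have been collected."""
    sample = []
    for position, member in enumerate(stream, start=1):
        if position % k == 0:
            sample.append(member)
            if len(sample) == n:
                break
    return sample

# Hypothetical population of 100 labelled elements (an assumption for illustration).
population = list(range(1, 101))
srs = simple_random_sample(population, 10, seed=42)
kth = systematic_sample(population, k=10, n=5)  # every 10th member: 10, 20, 30, 40, 50
```

Note that the systematic scheme never needs the full list in advance; it only needs to see members one at a time, which is why it suits populations that cannot be enumerated.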

Sampling error and sampling distribution
The difference between the observed value of a statistic and the value of the parameter is known as the sampling error. Random sampling should reflect the characteristics of the underlying population in such a way that the sample statistics computed from the sample are valid estimates of the population parameter.

Because the sample represents only a fraction of the observations in the population, we can extract more than one sample from any given population. The samples drawn, if truly random, may contain common elements. Sample statistics, calculated from multiple samples from the same population, will then have a distribution of differing values that is known as the sampling distribution.

LOS: Define and interpret sampling error. Page 217

The Literary Digest predicted the outcome as 57% to 43%, Landon to FDR. The final outcome of the FDR/Landon race was 38% to 62%, Landon to FDR. This result is a sampling error of 19 percentage points, seen as 57 – 38 or 62 – 43. For such a large sample, this error is incredibly large. Why? The sample wasn’t random. We’ll come back to this story when we cover biases.

Outcome: FDR gets 62% of the vote, so the sampling error was 19 percentage points!

Sampling distribution
A sampling distribution is the distribution of possible outcomes of a sample statistic that would result from repeated sampling from the population. The samples drawn from the population to derive the distribution should be the same size and drawn from the same underlying population. We generally refer to a sampling distribution by indicating the statistic to which the distribution applies: “the sampling distribution of the sample mean.”

LOS: Define a sampling distribution. Page 217

For ease of discussion, the slide shows three random normal distributions taken from the same family with their means and sampling distributions plotted on top of the underlying normal population that was used to generate them.
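A small simulation may help make the idea concrete. The sketch below builds a hypothetical normal population (mean 0.05, standard deviation 0.20, both assumed purely for illustration), draws many same-sized random samples from it, and records each sample mean together with its sampling error.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical population of 100,000 values (parameters are assumptions).
population = [rng.gauss(0.05, 0.20) for _ in range(100_000)]
mu = statistics.fmean(population)  # the population parameter

# Repeated sampling: same sample size, same underlying population.
sample_size = 40
sample_means = [
    statistics.fmean(rng.sample(population, sample_size))
    for _ in range(2_000)
]

# Sampling error of each sample: observed statistic minus the parameter.
sampling_errors = [m - mu for m in sample_means]
```

The list `sample_means` is an empirical approximation of the sampling distribution of the sample mean; individual sampling errors are sometimes positive and sometimes negative, but they average out close to zero when the samples are random.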

Stratified random sampling
A stratified random sample is a set of simple random samples drawn from an overall population in such a way that subpopulations are accurately represented in the overall sample. In a large population, we may have subpopulations, known as strata, for which we want to ensure inclusion in a representative way in the sample. To do so, we can use stratified sampling, wherein we draw simple random samples from each stratum and then combine those samples to form the overall sample on which we perform our analysis.

- When each stratum is sampled in proportion to its size, this method guarantees proportional representation of each stratum relative to the population.
- Stratified random sample statistics have greater precision (less variance) than simple random sample statistics.
- Stratified random sampling is commonly used with bond indexing.

LOS: Distinguish between simple random and stratified random sampling. Pages 217–218

Stratified random sampling may be particularly important when the population has one classification that forms a very large portion of the population but the subpopulations have an important impact on the parameter estimates. If simple random sampling is used, the subpopulations are likely to be underrepresented in any given sample, and so their effect on any sample statistics, which is an important part of the population parameter, may be understated.

For bond indexing, contrast three approaches: full replication (an attempt to capture the population), which is expensive and time consuming; simple random sampling, which will likely not match the risk factors (duration in particular); and stratified sampling, in which we can ensure that the salient characteristics, such as duration, tax effects, and call exposure, are matched.
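As a rough sketch of the procedure, the code below groups a population into strata, draws a simple random sample of the same fraction from each stratum, and combines the results. The bond universe (900 short-duration and 100 long-duration bonds), the 5% sampling fraction, and the function name `stratified_sample` are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(items, stratum_of, fraction, seed=None):
    """Draw a simple random sample of the given fraction from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)
    sample = []
    for members in strata.values():
        # At least one element per stratum so that no stratum is left empty.
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical bond universe: one large stratum and one small stratum.
bonds = [("short", i) for i in range(900)] + [("long", i) for i in range(100)]
sample = stratified_sample(bonds, stratum_of=lambda b: b[0], fraction=0.05, seed=1)
counts = {name: sum(1 for b in sample if b[0] == name) for name in ("short", "long")}
# counts is {"short": 45, "long": 5}: the 9-to-1 population mix is reproduced exactly.
```

A simple random sample of 50 bonds from the same universe would contain 5 long-duration bonds only on average; any single draw could easily contain 2 or 9, which is the underrepresentation risk the notes describe.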

Time-series and cross-sectional data
Time-series samples are constructed by collecting the data of interest at regularly spaced intervals of time and are known as time-series data. Cross-sectional samples are constructed by collecting the data of interest across observational units (firms, people, precincts) at a single point in time and are known as cross-sectional data. The combination of the two is known as panel data.

LOS: Distinguish between time-series and cross-sectional data. Pages 219–221

The potential for the lack of a random sample is almost always a problem with both time-series and cross-sectional data. With time series, the biggest concerns are often regime shifts (changes in the underlying data-generating process resulting from structural changes) and changes in the macroeconomic environment. With cross-sectional data, the concerns are more likely to center around differences in the data-generating process resulting from industry or country differences.

There are generally no financial or economic theories guiding the “correct” interval of time to use. Common sense should prevail, and the focus should be on ensuring that the samples come from the same underlying population. We need to be careful with cross-sectional data as well, to ensure that the data come from the same underlying population. Depending on the issue being examined, we may need to group analysis by industry, country, exchange, and so on.

Central limit theorem
The central limit theorem (CLT) allows us to make precise probability statements about the population mean using the sample mean, regardless of the underlying distribution. Recall that we can estimate the population mean from a random sample by calculating the average value of the sample observations, a statistic known as the sample mean.

Given a population described by any probability distribution having mean $\mu$ and finite variance $\sigma^2$, the sampling distribution of the sample mean, $\bar{X}$, computed from samples of size $n$ from this population, will be approximately normal with mean $\mu$ (the population mean) and variance $\sigma^2/n$ (the population variance divided by $n$) when the sample size $n$ is large.

LOS: State the central limit theorem and describe its importance. Page 222

Arguably, as it is the engine that drives much of modern probability theory, the importance of the central limit theorem to statisticians can’t be overstated. In fact, it is called the “central limit theorem” because it is central to the functioning of statistics, NOT because it is most commonly used to describe the center of the sample distribution.

The standard error of the sample mean
By combining the CLT, the sample mean, and the standard error of the sample mean, we can make probability statements about the population mean. The standard deviation of the distribution of the sample mean is known as the standard error of the sample mean.

When the sample size is large (generally n > 30 or so), the distribution of the sample mean will be approximately normal when the sample is randomly generated (collected). This result holds regardless of the distribution from which the sample is drawn. The standard error of the mean can then be shown to take the value of

- Population variance known: $\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{n}}$
- Population variance unknown: $s_{\bar{X}} = \dfrac{s}{\sqrt{n}}$

where $\sigma$ is the population standard deviation, $s$ is the sample standard deviation, and $n$ is the sample size.

LOS: Calculate and interpret the standard error of the sample mean. Pages 222–223
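The claim that the sample mean has standard error $\sigma/\sqrt{n}$ even for a non-normal population can be checked by simulation. The sketch below assumes an exponential population with rate 1 (so mean 1 and variance 1, and a strongly skewed shape); the population, seed, sample size, and the helper name `standard_error` are illustrative choices, not part of the slides.

```python
import math
import random
import statistics

def standard_error(std_dev, n):
    """Standard error of the sample mean: standard deviation divided by sqrt(n)."""
    return std_dev / math.sqrt(n)

rng = random.Random(7)
n = 50          # sample size, above the "n > 30 or so" rule of thumb
trials = 5_000  # number of repeated samples

# Assumed population: exponential with rate 1 (mean 1, variance 1, skewed).
means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(trials)
]

theoretical_se = standard_error(1.0, n)   # sigma / sqrt(n), about 0.1414
simulated_se = statistics.stdev(means)    # spread of the simulated sample means
center = statistics.fmean(means)          # should be close to the population mean, 1
```

With these settings the simulated spread of the sample means lands close to the theoretical value, and a histogram of `means` would look roughly bell-shaped even though the underlying population is far from normal.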

Central limit theorem Focus On: Importance With a large, random sample
- The distribution of the sample mean will be approximately normal.
- The mean of the distribution will be equal to the mean of the population from which the samples are drawn.
- The variance of the distribution will be equal to the variance of the population divided by the sample size when the population variance is known, and is estimated by the sample variance divided by the sample size when the population variance is unknown.

LOS: State the central limit theorem and describe its importance. Pages 222–225

Point estimates & confidence intervals
Estimators are the generalized mathematical expressions for the calculation of sample statistics, and an estimate is a specific outcome of one estimation. Estimates take on a single numerical value and are, therefore, referred to as point estimates. A point estimate is a fixed number specific to that sample; it has no sampling distribution.

In contrast, a confidence interval (CI) specifies a range that contains the parameter in which we are interested $100(1 - \alpha)\%$ of the time. The $100(1 - \alpha)\%$ is known as the degree of confidence. Confidence intervals are generally expressed in terms of a lower confidence limit and an upper confidence limit.

LOS: Distinguish between a point estimate and a confidence interval estimate of a population parameter. Pages 225–227

Estimators do, of course, have sampling distributions. It is sometimes difficult to think of an estimator as having a distribution and an estimate not having one. Rather, think of the estimator as a general expression (function) for which many outcomes are possible (hence, there is a distribution) and the estimate as a single result of applying that function to a sample of data. This clarification will come in handy later when trying to explain the degrees of freedom concept.

Estimator properties
There are a variety of estimators for each population parameter; accordingly, we prefer estimators that exhibit certain valuable properties.

- Unbiasedness: occurs when the expected value of the estimator is equal to the value of the parameter being estimated. Examples: sample mean, sample variance (computed with divisor n – 1).
- Efficiency: occurs when no other unbiased estimator has a smaller variance.
- Consistency: asymptotic in nature, thereby requiring a large number of observations. Occurs when the probability of obtaining estimates close to the value of the population parameter increases as sample size increases.

LOS: Identify and describe the desirable properties of an estimator. Page 226

The first two hold for any sample size; unbiased is better than biased, efficient is better than not efficient, and unbiased and efficient is the best. Sometimes, there is no identifiable efficient estimator in small samples, so we turn to the property of consistency. A consistent estimator is better than one that is not consistent. It is worth noting that the asymptotic nature of this property means it only applies in (relatively) large samples. Another way of wording the consistency property is that a consistent estimator has a sampling distribution that becomes concentrated around the parameter value when the sample size approaches infinity.
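Unbiasedness can be illustrated by comparing two estimators of the population variance: one that divides the sum of squared deviations by n and one that divides by n – 1. The sketch below is a simulation under assumed settings (a normal population with standard deviation 2, samples of size 5, and an arbitrary seed); none of these numbers come from the slides.

```python
import random
import statistics

rng = random.Random(3)
true_variance = 4.0  # assumed population: normal with standard deviation 2
n = 5                # deliberately small so the bias is easy to see
trials = 20_000

divide_by_n = []
divide_by_n_minus_1 = []
for _ in range(trials):
    sample = [rng.gauss(0.0, 2.0) for _ in range(n)]
    divide_by_n.append(statistics.pvariance(sample))         # divisor n
    divide_by_n_minus_1.append(statistics.variance(sample))  # divisor n - 1

# Average outcome of each estimator across many samples.
mean_biased = statistics.fmean(divide_by_n)            # near (n - 1) / n * 4 = 3.2
mean_unbiased = statistics.fmean(divide_by_n_minus_1)  # near the true value, 4.0
```

The divisor-n estimator is systematically too low, while the divisor-(n – 1) estimator averages out to the true variance, which is what it means for an estimator's expected value to equal the parameter.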

Confidence intervals Focus On: Constructing Confidence Intervals (CIs)
Point estimate ± Reliability factor × Standard error

- Point estimate = A point estimate of the parameter (a value of a sample statistic), such as the sample mean.
- Reliability factor = A number based on the assumed distribution of the point estimate and the degree of confidence (1 − α) for the confidence interval.
- Standard error = The standard error of the sample statistic providing the point estimate.

The three cases are:

- Normal population, known $\sigma$: $\bar{X} \pm z_{\alpha/2} \dfrac{\sigma}{\sqrt{n}}$
- Unknown $\sigma$, large sample: $\bar{X} \pm z_{\alpha/2} \dfrac{s}{\sqrt{n}}$ or $\bar{X} \pm t_{\alpha/2} \dfrac{s}{\sqrt{n}}$
- Unknown $\sigma$, small sample: $\bar{X} \pm t_{\alpha/2} \dfrac{s}{\sqrt{n}}$

LOS: Explain the construction of confidence intervals. Pages 227–229, 232
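The general construction translates directly into code. The sketch below uses Python's standard-library `statistics.NormalDist` to obtain the z reliability factor; the function names and the example inputs (sample mean 0.05, known σ of 0.10, n = 100, 95% confidence) are hypothetical and chosen only for illustration.

```python
import math
from statistics import NormalDist

def confidence_interval(point_estimate, reliability_factor, standard_error):
    """Point estimate +/- reliability factor x standard error."""
    margin = reliability_factor * standard_error
    return point_estimate - margin, point_estimate + margin

def z_reliability_factor(confidence):
    """z cutoff that leaves alpha/2 in each tail of the standard normal."""
    alpha = 1.0 - confidence
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

# Hypothetical inputs: normal population with known sigma.
z = z_reliability_factor(0.95)                     # about 1.96
std_error = 0.10 / math.sqrt(100)                  # sigma / sqrt(n) = 0.01
low, high = confidence_interval(0.05, z, std_error)  # about (0.0304, 0.0696)
```

Keeping the reliability factor as a separate argument means the same `confidence_interval` function works unchanged when a t cutoff is supplied instead of a z cutoff.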

Student’s t-distribution
When the population variance is unknown and the sample is random, the distribution that correctly describes the standardized sample mean (for a normally distributed population) is known as the t-distribution.

- The t-distribution has larger reliability (cutoff) values for a given level of alpha than the normal distribution, but as the sample size increases, the cutoff values approach those of the normal distribution.
- For small sample sizes, use of the t-distribution instead of the z-distribution to determine reliability factors is critical.
- The t-distribution is a symmetrical distribution whose probability density function is defined by a single parameter known as the degrees of freedom (df).

LOS: Describe the properties of Student’s t-distribution. Pages 229–231

Source of Student’s t-distribution: The name of the distribution comes from the pseudonym (Student) of a researcher in Guinness Brewery’s quality control department. He used a pseudonym when he published his works, either because Guinness didn’t allow its scientists to publish or because it didn’t want its competition to figure out what it was using for quality control.

Degrees of freedom
The degrees of freedom is the parameter that completely characterizes a t-distribution. The degrees of freedom for a given t-distribution are equal to the sample size minus 1. For a sample size of 45, the degrees of freedom are 44.

Consider that our calculation of the sample standard deviation is

$$s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$

and that the sample mean is measured with error because it is not the true population mean, $\mu$. For our sample of 45, because we have already estimated our sample mean, when we have enumerated 44 of the sample observations, the 45th must be the one that ensures our estimated sample mean. Hence, we are only free to choose 44 of the observations; the 45th must be that value that gives the estimated sample mean.

LOS: Calculate and explain degrees of freedom. Page 230

The concept of degrees of freedom will come up again with almost every test we conduct later. It is important to understand why the df = n – 1 in this case. In future cases, it will be n – 2, n – 3, and so on.
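The "last observation is not free" argument can be shown with a tiny example. In the sketch below, the sample mean and the first n – 1 observations are given, and the final observation is then fully determined. The five return figures and the function name `implied_last_observation` are hypothetical.

```python
def implied_last_observation(sample_mean, n, first_observations):
    """Given the sample mean and n - 1 observations, the n-th one is forced."""
    if len(first_observations) != n - 1:
        raise ValueError("exactly n - 1 observations are required")
    # The sum of all n observations must equal n times the sample mean.
    return n * sample_mean - sum(first_observations)

# Hypothetical sample of n = 5 returns with an estimated mean of 0.03.
known = [0.02, 0.05, -0.01, 0.04]
last = implied_last_observation(0.03, 5, known)  # about 0.05
degrees_of_freedom = 5 - 1                       # only 4 values were free to vary
```

The same logic scales to the slide's sample of 45: once the mean is fixed, 44 observations can vary freely and the 45th cannot.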

Confidence Intervals Focus On: When to Use What

| Sampling from: | Statistic for Small Sample Size | Statistic for Large Sample Size |
| --- | --- | --- |
| Normal distribution with known variance | z | z |
| Normal distribution with unknown variance | t | t* |
| Non-normal distribution with known variance | Not available | z |
| Non-normal distribution with unknown variance | Not available | t* |

*Use of z is also acceptable.

LOS: Calculate and interpret a confidence interval for a population mean when sampling from a normal distribution with (1) a known population variance, (2) an unknown population variance, or (3) when sampling from a population with an unknown variance and the sample size is large. Pages 227–229, 232
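The decision rule for which reliability factor to use can be written as a small function. This is a sketch rather than a definitive rule: the function name `choose_statistic` is hypothetical, and the cutoff for a "large" sample is taken as n > 30, following the rule of thumb mentioned earlier in the deck.

```python
def choose_statistic(n, normal_population, variance_known, large_threshold=30):
    """Return "z", "t", or None (not available) for a confidence interval on the mean.

    The large-sample threshold of 30 is a rule of thumb, not a hard boundary.
    """
    large_sample = n > large_threshold
    if not normal_population and not large_sample:
        return None  # small sample from a non-normal population: no statistic available
    if variance_known:
        return "z"
    return "t"       # for large samples, z is also acceptable

# The three strategies from the worked example that follows in the deck:
choice_1 = choose_statistic(15, normal_population=True, variance_known=True)    # "z"
choice_2 = choose_statistic(20, normal_population=True, variance_known=False)   # "t"
choice_3 = choose_statistic(45, normal_population=False, variance_known=False)  # "t"
```

Returning `None` for the unavailable cases forces the caller to handle them explicitly instead of silently applying a statistic that is not justified.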

Confidence intervals Focus On: Calculations

| Portfolio | (1) Normal population, known variance | (2) Unknown variance, small sample | (3) Large sample |
| --- | --- | --- | --- |
| E(R) | 0.014 | 0.11 | 0.25 |
| Std Dev | 0.020 | 0.08 | 0.27 |
| n | 15 | 20 | 45 |

Target = 0.10

You have a client with a target rate of return of 10% who would like a 90% confidence interval for the return to include her target return. Construct a 90% confidence interval for each of the investments in the table, and determine whether each contains her target return.

LOS: Calculate and interpret a confidence interval for a population mean when sampling from a normal distribution with (1) a known population variance, (2) an unknown population variance, or (3) when sampling from a population with an unknown variance and the sample size is large. Pages 227–229, 232

We use a 90% interval here because so many textbook applications use 95%. Although 95% is certainly the most common application, one can gain a better understanding of the process by using different levels occasionally.

Confidence intervals Focus On: Calculations
For strategy (1), with E(R) = 0.014, Std Dev = 0.020, and n = 15, we use a z-statistic because the population variance (standard deviation) is known and the population is normally distributed.

$$\bar{X} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}} = 0.014 \pm 1.645 \times \frac{0.020}{\sqrt{15}} = [0.0055, 0.0225]$$

Page 228

Confidence intervals Focus On: Calculations
For strategy (2), with E(R) = 0.11, Std Dev = 0.08, and n = 20, we use a t-statistic because the population variance (standard deviation) is unknown, the sample is small, and the population is normally distributed. With 19 degrees of freedom, the cutoff is 1.729.

$$\bar{X} \pm t_{\alpha/2} \frac{s}{\sqrt{n}} = 0.11 \pm 1.729 \times \frac{0.08}{\sqrt{20}} = [0.0791, 0.1409]$$

Page 232

Confidence intervals Focus On: Calculations
For strategy (3), with E(R) = 0.25, Std Dev = 0.27, and n = 45, we can use a z-statistic or a t-statistic even though the population variance (standard deviation) is unknown because we have a large sample.

$$\bar{X} \pm z_{\alpha/2} \frac{s}{\sqrt{n}} = 0.25 \pm 1.645 \times \frac{0.27}{\sqrt{45}} = [0.1838, 0.3162]$$

Page 232

If we had used the t instead, the cutoff would be 1.68 and the interval [0.1824, 0.3176]. This is a nice place to demonstrate the more conservative t-statistic (broader interval) and the closeness of the t and z cutoffs as n gets large.
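The three intervals can be reproduced with a short script. The z cutoff comes from Python's standard-library `statistics.NormalDist`; the t cutoffs are entered by hand because the standard library has no t-distribution (1.68 for 44 degrees of freedom is quoted in the notes, while 1.729 for 19 degrees of freedom is taken from a standard t-table and is therefore an assumption here). The helper name `interval` is hypothetical.

```python
import math
from statistics import NormalDist

z_90 = NormalDist().inv_cdf(0.95)  # two-sided 90% z cutoff, about 1.645
t_90_df19 = 1.729                  # tabulated t cutoff for df = 19 (assumed from a t-table)
t_90_df44 = 1.68                   # t cutoff for df = 44, as quoted in the notes

def interval(mean, std_dev, n, factor):
    """Mean +/- reliability factor x (standard deviation / sqrt(n)), rounded to 4 places."""
    margin = factor * std_dev / math.sqrt(n)
    return round(mean - margin, 4), round(mean + margin, 4)

ci_1 = interval(0.014, 0.020, 15, z_90)       # about (0.0055, 0.0225)
ci_2 = interval(0.11, 0.08, 20, t_90_df19)    # about (0.0791, 0.1409)
ci_3 = interval(0.25, 0.27, 45, z_90)         # about (0.1838, 0.3162)
ci_3_t = interval(0.25, 0.27, 45, t_90_df44)  # about (0.1824, 0.3176)

target = 0.10
contains_target = [low <= target <= high for low, high in (ci_1, ci_2, ci_3)]
# Only strategy (2) has an interval that contains the 10% target.
```

Strategy (1) falls entirely below the target and strategy (3) entirely above it, so only strategy (2) produces a 90% interval that includes the client's 10% target return.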

Sample size selection
There are inherent trade-offs in selecting a sample based on both statistical and economic factors.

Point estimate ± Reliability factor × Standard error

Benefits of a large sample include increased precision through:

- Large samples that enable the use of z-statistics rather than t-statistics.
- The estimate of the standard error that decreases with increased sample size.

Drawbacks of a large sample include:

- Increased likelihood of sampling from more than one population.
- Increased cost.

LOS: Discuss the issues surrounding selection of the appropriate sample size. Page 233

Use the general confidence interval equation at the top of the slide to illustrate the benefits.
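The precision benefit can be quantified: because the standard error is proportional to $1/\sqrt{n}$, quadrupling the sample size only halves the width of the confidence interval. The sketch below assumes a sample standard deviation of 0.20 and a 95% z cutoff, both chosen purely for illustration.

```python
import math
from statistics import NormalDist

z = NormalDist().inv_cdf(0.975)  # two-sided 95% z cutoff
s = 0.20                         # hypothetical sample standard deviation

# Half-width of the confidence interval (reliability factor x standard error).
margins = {n: z * s / math.sqrt(n) for n in (25, 100, 400)}
# Each fourfold increase in n cuts the half-width in half.
```

This diminishing return is one reason cost matters: each additional halving of the interval width requires four times as many observations.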

Data-mining bias
No story, no future

“If you torture the data long enough, it will confess.” –reportedly said in a speech by Ronald Coase, Nobel laureate

Data-mining bias results from the overuse and/or repeated use of the same data to search for patterns in the data.

- If we were to test 1,000 different variables that have no true relationship with the outcome, we would expect about 50 of them to be significant at the 5% level, even though the significance is just an artifact of the testing error rate.
- This approach is sometimes called a “kitchen sink” problem.
- Economic and financial decisions made on the basis of these tests will be inherently flawed.
- There is no true underlying economic rationale for the relationship distinct from the testing phenomenon.

To verify the relationship and/or discover data-mining biases, we can conduct out-of-sample tests.

LOS: Define and discuss data-mining bias. Pages 236–238

For analysts, intergenerational data mining may actually be a worse subcategory. Most analysis is based on guidance from past research, which may already have data-mining bias in it. To the extent that the same databases are then used to confirm the findings of prior research, the data-mining bias is compounded.
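The "50 out of 1,000" figure can be checked by simulation. The sketch below generates 1,000 hypothetical variables that are pure noise, tests each one for a nonzero mean at the 5% level, and counts how many appear significant. The sample size of 100 observations per variable, the seed, and the use of the normal cutoff 1.96 are illustrative assumptions.

```python
import math
import random
import statistics

rng = random.Random(11)
n_variables = 1_000  # number of candidate variables tested
n_obs = 100          # observations per variable (assumed)
cutoff = 1.96        # two-sided 5% cutoff for a standard normal test statistic

false_positives = 0
for _ in range(n_variables):
    # Pure noise: the true mean is zero, so any "significant" result is spurious.
    noise = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
    std_error = statistics.stdev(noise) / math.sqrt(n_obs)
    test_stat = statistics.fmean(noise) / std_error
    if abs(test_stat) > cutoff:
        false_positives += 1

# Roughly 5% of the tests (on the order of 50) reject despite there being no real effect.
```

Rerunning the same tests on fresh data would flag a different set of variables, which is why out-of-sample testing is effective at exposing results of this kind.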

Even a large sample size doesn’t fix a biased sample.

Look-ahead bias occurs when researchers use data not available at the test date to test a model and use it for predictions.

- May be particularly pronounced when using accounting data, which is typically reported with a lag in time.

Time-period bias occurs when the model uses data from a time period when the data is not representative of all possible values of the data across time.

- Too short a time period increases the likelihood of period-specific results.
- Too long a time period increases the chance of a regime change.

LOS: Define and discuss sample selection bias, survivorship bias, look-ahead bias, and time-period bias. Page 240

Summary
The quality of the sample is critically important when conducting or evaluating the results of a study. To draw valid inferences, the sample must be random in order to avoid a host of potential, often insidious, biases. When we have a random sample or samples, we can use the central limit theorem to conduct tests that compare the sample mean with a possible value of the underlying population mean. The appropriate test will differ as a function of our knowledge of the underlying population.

Bond Index and Stratified Sampling
Suppose you are the manager of a mutual fund indexed to the Lehman Brothers Government Index. You are exploring several approaches to indexing, including a stratified sampling approach. You first distinguish agency bonds from US Treasury bonds. For each of these two groups, you define 10 maturity intervals: 1–2 years, 2–3 years, 3–4 years, 4–6 years, 6–8 years, 8–10 years, 10–12 years, 12–15 years, 15–20 years, and 20–30 years. You also separate the bonds with coupons (annual interest rates) of 6% or less from the bonds with coupons of greater than 6%.

Bond Index and Stratified Sampling
How many cells or strata does this sampling plan entail?
2(10)(2) = 40

If you use this sampling plan, what is the minimum number of issues the indexed portfolio can have?
40, so that you have no empty cells.

Suppose that in selecting among the securities that qualify for selection within each cell, you apply a criterion concerning the liquidity of the security’s market. Is the sample obtained random? Explain your answer.
No. Now not every bond has an equal probability of being accepted.
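The cell count can be verified by enumerating every combination of issuer type, maturity interval, and coupon class. The labels below are illustrative strings for the categories described in the example.

```python
from itertools import product

issuers = ["agency", "treasury"]
maturities = ["1-2", "2-3", "3-4", "4-6", "6-8", "8-10",
              "10-12", "12-15", "15-20", "20-30"]   # maturity intervals in years
coupons = ["6% or less", "greater than 6%"]

# Each cell (stratum) is one combination of the three classifications.
cells = list(product(issuers, maturities, coupons))
minimum_issues = len(cells)  # one bond per cell leaves no cell empty: 2 * 10 * 2 = 40
```

Adding another classification, such as a call-exposure flag with two values, would double the number of cells and therefore the minimum number of issues.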
