Slide 2: Parameters and statistics

- A parameter is a quantity used to describe a population; a statistic is a quantity computed from a sample, used both to describe the sample and to estimate a population parameter.
- We typically use statistics to estimate parameters because:
  - It isn't possible to examine the entire population, or
  - It isn't feasible to examine the entire population (e.g., too expensive).
- For a calculated statistic to convey information about the related population parameter, certain conditions must be met; those conditions are generally satisfied if the sample used to calculate the statistic is random.

Pages 215–216
Notes: Although it is not in the LOS, a reminder of the relationship between statistics and parameters as the basic foundation for sampling processes motivates the discussion of samples, sampling error, sampling biases, etc.
Slide 3: Random samples

- A simple random sample is a subset of the population drawn in such a way that each element of the population has an equal probability of being selected.
- The key to random sampling lies in the absence of any pattern in the collection of the data elements.
- Finite populations can be sampled by assigning random numbers to all of the elements in the population and then selecting the sample elements by matching numbers from a random number generator to the assigned numbers.
  - If you can enumerate the population, why not just use all of it?
- When we cannot identify all the members of the population, we often use systematic (kth-member) sampling, in which we select every kth member we observe until we have the necessary sample size.

LOS: Define simple random sampling.
Page 216
Notes: One of the most famous examples of a nonrandom sample occurred in the 1936 presidential poll in which The Literary Digest, a famous magazine of the time, predicted a victory for Alfred Landon over President Franklin D. Roosevelt. I'll carry this example across several slides to illustrate several different concepts. Tell students that the Literary Digest had collected its mailing-address sample of 10 million people (resulting in 2.4 million responses) using the phone book, club memberships, and magazine subscriptions. Everyone was mailed a mock ballot and asked to return it to the magazine, a very, very expensive process, but it resulted in a very large sample.
Survey prediction: Alfred Landon wins over FDR with 57% of the vote to 43%. (Literary Digest, 1936)
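The two sampling schemes on the slide can be sketched with the standard library. This is a minimal illustration, not part of the reading: the population of 100 labeled elements and k = 10 are hypothetical choices.

```python
import random

# Enumerate a small, finite population and draw a simple random
# sample: each element has an equal chance of being selected.
random.seed(42)
population = list(range(1, 101))          # elements labeled 1..100
sample = random.sample(population, k=10)  # sampling without replacement

# Systematic (kth-member) sampling, useful when the population cannot
# be fully enumerated in advance: take every kth element observed.
k = 10
systematic = population[::k]              # elements 1, 11, 21, ...
print(sample)
print(systematic)
```

Note that the systematic scheme is only as random as the order in which the elements arrive; any pattern in that order carries into the sample.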
Slide 4: Sampling error and sampling distribution

- The difference between the observed value of a statistic and the value of the population parameter is known as the sampling error.
- Random sampling should reflect the characteristics of the underlying population, so that the sample statistics computed from the sample are valid estimates of the population parameters.
- Because the sample represents only a fraction of the observations in the population, we can extract more than one sample from any given population.
  - The samples drawn, even if truly random, may contain common elements.
- Sample statistics calculated from multiple samples drawn from the same population will then have a distribution of differing values, known as the sampling distribution.

LOS: Define and interpret sampling error.
Page 217
Notes: The Literary Digest predicted the outcome as 57% to 43%, Landon over FDR. The final outcome was 62% to 38%, FDR over Landon. This is a sampling error of 19 percentage points (57 − 38, or 62 − 43). For such a large sample, this error is incredibly large. Why? The sample wasn't random. We'll come back to this story when we cover biases.
Outcome: FDR gets 62% of the vote. Sampling error was 19%!
Slide 5: Sampling distribution

- The sampling distribution of a statistic is the distribution of possible outcomes of that statistic that would result from repeated sampling from the population.
- The samples drawn from the population to derive the distribution should be of the same size and drawn from the same underlying population.
- We generally refer to a sampling distribution by naming the statistic to which it applies: "the sampling distribution of the sample mean."

LOS: Define a sampling distribution.
Page 217
Notes: For ease of discussion, the slide shows three random normal samples taken from the same family, with their means and sampling distributions plotted on top of the underlying normal population that was used to generate them.
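A sampling distribution can be simulated directly: draw many same-size samples from one population and collect each sample's mean. This sketch is illustrative; the uniform population and the sizes (10,000 elements, 1,000 samples of n = 30) are hypothetical.

```python
import random
import statistics

random.seed(7)
# An illustrative population of 10,000 uniform draws on [0, 100].
population = [random.uniform(0, 100) for _ in range(10_000)]

# Repeatedly sample (same size, same population) and record the mean.
sample_means = []
for _ in range(1_000):
    sample = random.sample(population, 30)   # each sample has n = 30
    sample_means.append(statistics.mean(sample))

# The sampling distribution centers near the population mean and is
# much narrower than the population itself.
print(statistics.mean(population), statistics.mean(sample_means))
print(statistics.stdev(population), statistics.stdev(sample_means))
```

The narrowing of the spread previews the standard error result on a later slide: the standard deviation of the sample means is roughly the population standard deviation divided by the square root of 30.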
Slide 6: Stratified random sampling

- A stratified random sample is a set of simple random samples drawn from an overall population in such a way that subpopulations are accurately represented in the overall sample.
- In a large population, we may have subpopulations, known as strata, whose inclusion in the sample we want to ensure in a representative way.
- To do so, we can use stratified sampling, wherein we draw simple random samples from each stratum and then combine those samples to form the overall sample on which we perform our analysis.
- This method guarantees that each stratum is represented in proportion to its share of the population.
- Stratified random sample statistics have greater precision (less variance) than simple random sample statistics.
- Stratified random sampling is commonly used in bond indexing.

LOS: Distinguish between simple random and stratified random sampling.
Pages 217–218
Notes: Stratified random sampling may be particularly important when one classification forms a very large portion of the population but the subpopulations have an important impact on the parameter estimates. If simple random sampling is used, the subpopulations are likely to be underrepresented in any given sample, so their effect on the sample statistics, an important part of the population parameter, may be understated. For bond indexing, contrast full replication (an attempt to capture the population), which is expensive and time consuming, with simple random sampling, which will likely not match the risk factors (duration in particular), and with stratified sampling, in which we can ensure that the salient characteristics, such as duration, tax effects, and call exposure, are matched.
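Proportional stratified sampling can be sketched as follows. The two strata labels and the 80/20 split are hypothetical, chosen only to make the proportions visible; real bond-index strata would use maturity, coupon, and issuer cells as in the exercise at the end of this section.

```python
import random
from collections import defaultdict

random.seed(1)
# Illustrative population: 80% "government" items, 20% "corporate".
population = (
    [("government", i) for i in range(800)] +
    [("corporate", i) for i in range(200)]
)

def stratified_sample(population, total_n):
    # Group the population into strata by label.
    strata = defaultdict(list)
    for label, item in population:
        strata[label].append((label, item))
    sample = []
    for label, members in strata.items():
        # Allocate each stratum its proportional share of the sample.
        # (With awkward proportions, rounding may need adjustment so
        # the allocations sum exactly to total_n.)
        n_stratum = round(total_n * len(members) / len(population))
        sample.extend(random.sample(members, n_stratum))
    return sample

sample = stratified_sample(population, total_n=50)
counts = {label: sum(1 for s, _ in sample if s == label)
          for label in ("government", "corporate")}
print(counts)  # {'government': 40, 'corporate': 10}
```

A simple random sample of 50 would match these proportions only on average; the stratified sample matches them by construction, which is the source of its lower variance.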
Slide 7: Time-series and cross-sectional data

- Time-series samples are constructed by collecting the data of interest at regularly spaced intervals of time; such data are known as time-series data.
- Cross-sectional samples are constructed by collecting the data of interest across observational units (firms, people, precincts) at a single point in time; such data are known as cross-sectional data.
- The combination of the two is known as panel data.

LOS: Distinguish between time-series and cross-sectional data.
Pages 219–221
Notes: The potential for a nonrandom sample is almost always a problem with both time-series and cross-sectional data. With time series, the biggest concerns are often regime shifts (changes in the underlying data-generating process resulting from structural changes) and changes in the macroeconomic environment. With cross-sectional data, the concerns are more likely to center on differences in the data-generating process resulting from industry or country differences. There are generally no financial or economic theories guiding the "correct" interval of time to use. Common sense should prevail, and the focus should be on ensuring that the samples come from the same underlying population. We need to be careful with cross-sectional data as well, to ensure that the data come from the same underlying population. Depending on the issue being examined, we may need to group the analysis by industry, country, exchange, and so on.
Slide 8: Central limit theorem

- The central limit theorem (CLT) allows us to make precise probability statements about the population mean using the sample mean, regardless of the underlying distribution.
- Recall that we can estimate the population mean from a random sample by calculating the average value of the sample observations, a statistic known as the sample mean.
- Given a population described by any probability distribution with mean µ and finite variance σ², the sampling distribution of the sample mean X̄, computed from samples of size n from this population, will be approximately normal with mean µ (the population mean) and variance σ²/n (the population variance divided by n) when the sample size n is large.

LOS: State the central limit theorem and describe its importance.
Page 222
Notes: Arguably, as the engine that drives much of modern probability theory, the importance of the central limit theorem to statisticians can't be overstated. In fact, it is called the "central limit theorem" because it is central to the functioning of statistics, NOT because it is most commonly used to describe the center of the sample distribution.
Slide 9: The standard error of the sample mean

- By combining the CLT, the sample mean, and the standard error of the sample mean, we can make probability statements about the population mean.
- The standard deviation of the distribution of the sample mean is known as the standard error of the sample mean.
- When the sample size is large (generally n > 30 or so) and the sample is randomly generated (collected), the distribution of the sample mean will be approximately normal.
  - This result holds regardless of the distribution from which the sample is drawn.
- The standard error of the sample mean is

  Population variance known:   σ_X̄ = σ/√n
  Population variance unknown: s_X̄ = s/√n

  where σ is the population standard deviation and s is the sample standard deviation.

LOS: Calculate and interpret the standard error of the sample mean.
Pages 222–223
Slide 10: Central limit theorem
Focus on: Importance

With a large, random sample:
- The distribution of the sample mean will be approximately normal.
- The mean of the distribution will be equal to the mean of the population from which the samples are drawn.
- The variance of the distribution will be equal to the variance of the population divided by the sample size (σ²/n); when the population variance is unknown, we estimate it with the sample variance divided by the sample size (s²/n).

LOS: State the central limit theorem and describe its importance.
Pages 222–225
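The three claims above can be checked numerically on a decidedly non-normal population. This sketch uses exponential draws purely as an illustration (an Exponential(1) population has mean 1 and variance 1, so with n = 50 the CLT predicts sample means near 1 with variance 1/50 = 0.02).

```python
import random
import statistics

random.seed(11)
n = 50  # size of each sample

# Build the sampling distribution of the mean from 20,000 samples.
means = []
for _ in range(20_000):
    sample = [random.expovariate(1.0) for _ in range(n)]
    means.append(statistics.mean(sample))

# CLT predictions: mean ~ 1.0 (population mean) and
# variance ~ 1/50 = 0.02 (population variance / n).
print(statistics.mean(means))
print(statistics.variance(means))
```

Even though the exponential distribution is strongly right-skewed, the histogram of these 20,000 means would already look close to a normal curve at n = 50.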
Slide 11: Point estimates and confidence intervals

- Estimators are the generalized mathematical expressions for the calculation of sample statistics; an estimate is a specific outcome of one estimation.
- Estimates take on a single numerical value and are, therefore, referred to as point estimates.
  - A point estimate is a fixed number specific to that sample.
  - It has no sampling distribution.
- In contrast, a confidence interval (CI) specifies a range that contains the parameter in which we are interested (1 − α)% of the time.
  - (1 − α)% is known as the degree of confidence.
- Confidence intervals are generally expressed by their lower and upper confidence limits.

LOS: Distinguish between a point estimate and a confidence interval estimate of a population parameter.
Pages 225–227
Notes: Estimators do, of course, have sampling distributions. It is sometimes difficult to see why an estimator has a distribution while an estimate does not. Think of the estimator as a general expression (function) for which many outcomes are possible (hence, a distribution exists) and the estimate as a single result of applying that function to one sample of data. This clarification will come in handy later when explaining the degrees of freedom concept.
Slide 12: Estimator properties

There are many possible estimators for each population parameter; accordingly, we prefer estimators that exhibit certain valuable properties.
- Unbiasedness
  - An estimator is unbiased when its expected value equals the value of the parameter being estimated.
  - Examples: the sample mean; the sample variance (computed with n − 1 in the denominator).
- Efficiency
  - An unbiased estimator is efficient when no other unbiased estimator has a smaller variance.
- Consistency
  - Asymptotic in nature, thereby requiring a large number of observations.
  - An estimator is consistent when the probability of obtaining estimates close to the value of the population parameter increases as the sample size increases.

LOS: Identify and describe the desirable properties of an estimator.
Page 226
Notes: The first two properties hold for any sample size; unbiased is better than biased, efficient is better than not efficient, and unbiased and efficient is best. Sometimes there is no identifiable efficient estimator in small samples, so we turn to the property of consistency. A consistent estimator is better than one that is not consistent. The asymptotic nature of this property means it applies only in (relatively) large samples. Another way of wording the consistency property: a consistent estimator has a sampling distribution that becomes concentrated around the parameter value as the sample size approaches infinity.
Slide 13: Confidence intervals
Focus on: Constructing confidence intervals (CIs)

  Point estimate ± Reliability factor × Standard error

- Point estimate = a point estimate of the parameter (a value of a sample statistic), such as the sample mean.
- Reliability factor = a number based on the assumed distribution of the point estimate and the degree of confidence (1 − α) for the confidence interval.
- Standard error = the standard error of the sample statistic providing the point estimate.

  Normal population, known σ:  X̄ ± z_(α/2) × σ/√n
  Unknown σ, large sample:     X̄ ± z_(α/2) × s/√n  or  X̄ ± t_(α/2) × s/√n
  Unknown σ, small sample:     X̄ ± t_(α/2) × s/√n

LOS: Explain the construction of confidence intervals.
Pages 227–229, 232
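The general recipe "point estimate ± reliability factor × standard error" translates directly into code. A minimal sketch for a population mean, using only the standard library; for small samples with unknown variance, a t cutoff from a table would be passed in as the reliability factor instead of the default z.

```python
import math
from statistics import NormalDist

def confidence_interval(mean, std_dev, n, confidence=0.90, cutoff=None):
    """Point estimate ± reliability factor × standard error."""
    if cutoff is None:
        # Default reliability factor: the standard normal z_(α/2).
        alpha = 1 - confidence
        cutoff = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = cutoff * std_dev / math.sqrt(n)
    return (mean - half_width, mean + half_width)

# 90% CI for a sample mean of 0.014 with σ = 0.020 and n = 15
# (the values used in the worked example a few slides later).
low, high = confidence_interval(0.014, 0.020, 15)
print(round(low, 4), round(high, 4))  # 0.0055 0.0225
```

The same function reproduces a t-based interval by supplying the table value explicitly, e.g. `confidence_interval(0.11, 0.08, 20, cutoff=1.729)`.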
Slide 14: Student's t-distribution

- When the population variance is unknown and the sample is random, the distribution that correctly describes the sample mean is the t-distribution.
- The t-distribution has larger reliability (cutoff) values than the normal distribution for a given level of α, but as the sample size increases, the cutoff values approach those of the normal distribution.
- For small sample sizes, using the t-distribution instead of the z-distribution to determine reliability factors is critical.
- The t-distribution is a symmetrical distribution whose probability density function is defined by a single parameter, known as the degrees of freedom (df).

LOS: Describe the properties of Student's t-distribution.
Pages 229–231
Notes: Source of Student's t-distribution: the name comes from the pseudonym ("Student") of a researcher in Guinness Brewery's quality control department. He published under a pseudonym either because Guinness didn't allow its scientists to publish or because it didn't want its competition to figure out what it was using for quality control.
Slide 15: Degrees of freedom

- The degrees of freedom parameter completely characterizes a t-distribution.
- The degrees of freedom for a given t-distribution equal the sample size minus 1.
  - For a sample size of 45, the degrees of freedom are 44.
- Consider that our calculation of the sample standard deviation is

  s = √[ Σ (X_i − X̄)² / (n − 1) ]

  and that the sample mean X̄ is measured with error because it is not the true population mean, µ.
- For our sample of 45, because we have already estimated the sample mean, once we have enumerated 44 of the sample observations, the 45th must be the value that produces the estimated sample mean. Hence, we are only free to choose 44 of the observations.

LOS: Calculate and explain degrees of freedom.
Page 230
Notes: The concept of degrees of freedom will come up again with almost every test we conduct later. It is important to understand why df = n − 1 in this case. In future cases, it will be n − 2, n − 3, and so on.
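The "only n − 1 observations are free" argument can be made concrete with a tiny hypothetical sample of n = 5: once the mean is fixed, the last observation is determined.

```python
# Four freely chosen observations and an already-estimated mean.
observations = [2.0, 4.0, 6.0, 8.0]
sample_mean = 5.0
n = 5

# The 5th observation is not free: the sum must equal n × mean.
forced = n * sample_mean - sum(observations)
print(forced)  # 5.0

full_sample = observations + [forced]
assert sum(full_sample) / n == sample_mean  # the mean is recovered
```

Only 4 of the 5 values could be chosen freely, which is exactly why the sample standard deviation (which uses X̄) has df = n − 1.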
Slide 16: Confidence intervals
Focus on: When to use what

  Sampling from:                              Small sample    Large sample
  Normal distribution, known variance         z               z
  Normal distribution, unknown variance       t               t*
  Non-normal distribution, known variance     Not available   z
  Non-normal distribution, unknown variance   Not available   t*

  *Use of z is also acceptable.

LOS: Calculate and interpret a confidence interval for a population mean when sampling from a normal distribution with (1) a known population variance, (2) an unknown population variance, or (3) when sampling from a population with an unknown variance and the sample size is large.
Pages 227–229, 232
Slide 17: Confidence intervals
Focus on: Calculations

  Portfolio       (1) Normal population,   (2) Unknown variance,   (3) Large
                      known variance           small sample            sample
  E(R)            0.014                    0.11                    0.25
  Std. dev.       0.020                    0.08                    0.27
  n               15                       20                      45
  Target = 0.10

You have a client with a target rate of return of 10% who would like to be 90% certain that the interval around her realized return includes her target return. Construct a 90% confidence interval for each of the investments in the table, and determine whether each contains her target return.

LOS: Calculate and interpret a confidence interval for a population mean when sampling from a normal distribution with (1) a known population variance, (2) an unknown population variance, or (3) when sampling from a population with an unknown variance and the sample size is large.
Pages 227–229, 232
Notes: We use a 90% interval here because so many book applications use 95%. Although 95% is certainly the most common application, one can gain a better understanding of the process by using different levels occasionally.
Slide 18: Confidence intervals
Focus on: Calculations

For strategy (1), we use a z-statistic because the population variance (standard deviation) is known and the population is normally distributed:

  X̄ ± z_(α/2) × σ/√n = 0.014 ± 1.645 × (0.020/√15) = [0.0055, 0.0225]

This interval does not contain the target return of 0.10.
Page 228
Slide 19: Confidence intervals
Focus on: Calculations

For strategy (2), we use a t-statistic because the population variance (standard deviation) is unknown and the sample is small, with the population normally distributed:

  X̄ ± t_(α/2) × s/√n = 0.11 ± 1.729 × (0.08/√20) = [0.0791, 0.1409]

with df = 20 − 1 = 19. This interval does contain the target return of 0.10.
Page 232
Slide 20: Confidence intervals
Focus on: Calculations

For strategy (3), we can use a z-statistic or a t-statistic even though the population variance (standard deviation) is unknown, because we have a large sample:

  X̄ ± z_(α/2) × s/√n = 0.25 ± 1.645 × (0.27/√45) = [0.1838, 0.3162]

This interval does not contain the target return of 0.10.
Page 232
Notes: If we had used the t-statistic instead, the cutoff would be 1.68 (df = 44) and the interval [0.1824, 0.3176]. This is a nice place to demonstrate the more conservative t-statistic (a broader interval) and the closeness of the t and z cutoffs as n gets large.
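The three worked intervals above can be reproduced in a few lines. The reliability factors are hardcoded standard table values for a 90% two-sided interval: z = 1.645, t(df = 19) = 1.729, and t(df = 44) = 1.680.

```python
import math

def interval(mean, sd, n, cutoff):
    # Point estimate ± reliability factor × standard error.
    half = cutoff * sd / math.sqrt(n)
    return (round(mean - half, 4), round(mean + half, 4))

strategies = {
    "1: normal, known variance":    interval(0.014, 0.020, 15, 1.645),
    "2: unknown variance, small n": interval(0.11,  0.08,  20, 1.729),
    "3: large sample (z)":          interval(0.25,  0.27,  45, 1.645),
    "3: large sample (t)":          interval(0.25,  0.27,  45, 1.680),
}
target = 0.10
for name, (low, high) in strategies.items():
    print(name, (low, high), "contains target:", low <= target <= high)
```

Comparing the two rows for strategy (3) shows the point made in the notes: the t interval is slightly wider (more conservative), and the t and z cutoffs are already close at n = 45.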
Slide 21: Sample size selection

There are inherent trade-offs, both statistical and economic, in selecting a sample size.

  Point estimate ± Reliability factor × Standard error

Benefits of a large sample include increased precision, because:
- Large samples enable the use of z-statistics rather than t-statistics.
- The estimate of the standard error decreases as the sample size increases.
Drawbacks of a large sample include:
- Increased likelihood of sampling from more than one population.
- Increased cost.

LOS: Discuss the issues surrounding selection of the appropriate sample size.
Page 233
Notes: Use the general confidence interval equation at the top of the slide to illustrate the benefits.
Slide 22: Data-mining bias

"If you torture the data long enough, it will confess." (reportedly said in a speech by Ronald Coase, Nobel laureate)

- Data-mining bias results from the overuse and/or repeated use of the same data to search for patterns.
  - If we were to test 1,000 different variables, about 50 of them would appear significant at the 5% level even if the significance were just an artifact of the testing error rate.
  - This approach is sometimes called a "kitchen sink" problem.
- Economic and financial decisions made on the basis of such tests will be inherently flawed: there is no true underlying economic rationale for the relationship, distinct from the testing phenomenon. No story, no future.
- To verify a relationship and/or detect data-mining bias, we can conduct out-of-sample tests.

LOS: Define and discuss data-mining bias.
Pages 236–238
Notes: For analysts, intergenerational data mining may actually be a worse subcategory. Most analysis is based on guidance from past research, which may already have data-mining bias in it. To the extent that the same databases are then used to confirm the findings of prior research, the data-mining bias is compounded.
Slide 23: Sample selection bias

- A sample selection bias effectively produces a nonrandom sample.
  - It is often caused by data being unavailable for some portion of the population.
  - If that portion differs in systematic ways, the result is a selection bias.
- Survivorship bias is a particular kind of selection bias in which we observe only those firms that have succeeded and, therefore, survived.
- Selection bias in the Literary Digest example: only wealthy people had phones.
- Moral of the story: even a large sample size doesn't fix a biased sample.

LOS: Define and discuss sample selection bias, survivorship bias, look-ahead bias, and time-period bias.
Pages 236–238
Notes: New asset classes, like hedge funds, often have few reporting requirements, which leads to self-reporting: the only units reporting are those that are succeeding. This is another form of sample bias known as self-selection. We return now to our Literary Digest example, in which the magazine collected its mailing-address sample of 10 million people (with 2.4 million responses) using the phone book, club memberships, and magazine subscriptions. Everyone was mailed a mock ballot and asked to return it to the magazine, a very, very expensive process, but it resulted in a very large sample. The sample had two primary selection biases: (1) Only wealthier people had phones and/or could afford club memberships and magazines (recall that this was during the Great Depression), so the sample came from upper- and middle-class people who were more likely to vote Republican (against FDR). (2) With only 2.4 million people responding, there was significant nonresponse bias (people who respond to surveys may differ from those who don't). Mail is known to have significant nonresponse biases.
(Think about what you call such surveys: junk mail.) End result: a sampling error of 19% in a huge sample shows that increasing sample size doesn't necessarily fix a biased sample. Additional moral of the story: a competitor used a sample of 55,000 and got better results. That competitor? Gallup polls!
Slide 24: Look-ahead bias and time-period bias

- Look-ahead bias occurs when researchers test a model using data that were not available on the test date and then use the model for predictions.
  - It may be particularly pronounced with accounting data, which are typically reported with a lag.
- Time-period bias occurs when the model uses data from a time period in which the data are not representative of all possible values across time.
  - Too short a time period increases the likelihood of period-specific results.
  - Too long a time period increases the chance of a regime change.

LOS: Define and discuss sample selection bias, survivorship bias, look-ahead bias, and time-period bias.
Page 240
Slide 25: Summary

- The quality of the sample is critically important when conducting or evaluating the results of a study.
- To draw valid inferences, the sample must be random in order to avoid a host of potential, often insidious, biases.
- When we have a random sample or samples, we can use the central limit theorem to conduct tests that compare the sample mean with a possible underlying population value.
- The appropriate test differs as a function of our knowledge of the underlying population.
Slide 26: Bond index and stratified sampling

Suppose you are the manager of a mutual fund indexed to the Lehman Brothers Government Index. You are exploring several approaches to indexing, including a stratified sampling approach. You first distinguish agency bonds from US Treasury bonds. For each of these two groups, you define 10 maturity intervals: 1–2 years, 2–3 years, 3–4 years, 4–6 years, 6–8 years, 8–10 years, 10–12 years, 12–15 years, 15–20 years, and 20–30 years. You also separate the bonds with coupons (annual interest rates) of 6% or less from the bonds with coupons greater than 6%.
Slide 27: Bond index and stratified sampling

- How many cells, or strata, does this sampling plan entail?
  2 × 10 × 2 = 40
- If you use this sampling plan, what is the minimum number of issues the indexed portfolio can have?
  40, so that no cell is empty.
- Suppose that, in selecting among the securities that qualify within each cell, you apply a criterion concerning the liquidity of the security's market. Is the resulting sample random? Explain your answer.
  No. Not every bond now has an equal probability of being selected.
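The strata count in the first answer can be enumerated directly: 2 issuer types × 10 maturity buckets × 2 coupon buckets, using the categories from the exercise.

```python
from itertools import product

issuers = ["agency", "treasury"]
maturities = ["1-2", "2-3", "3-4", "4-6", "6-8",
              "8-10", "10-12", "12-15", "15-20", "20-30"]  # years
coupons = ["<=6%", ">6%"]

# Every combination of the three classifications is one stratum.
cells = list(product(issuers, maturities, coupons))
print(len(cells))  # 40

# A minimum-size indexed portfolio holds one issue per cell: 40 issues.
```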