Transformations, Z-scores, and Sampling September 21, 2011.


Presentation on theme: "Transformations, Z-scores, and Sampling September 21, 2011."— Presentation transcript:

1 Transformations, Z-scores, and Sampling September 21, 2011

2 Changing the unit of measurement Variables can be recorded in different units of measurement. Most often, one measurement unit is a linear transformation of another measurement unit: x_new = a + bx. Temperatures can be expressed in degrees Fahrenheit or degrees Celsius: Temperature in Fahrenheit = 32 + (9/5) × Temperature in Celsius, which has the form a + bx with a = 32 and b = 9/5. Linear transformations do not change the basic shape of a distribution (skew, symmetry, multimodality). But they do change the measures of center and spread: – Multiplying each observation by a positive number b multiplies both measures of center (mean, median) and spread (IQR, s) by b. – Adding the same number a (positive or negative) to each observation adds a to measures of center and to quartiles, but it does not change measures of spread (IQR, s).
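
The two rules on this slide can be checked numerically. The sketch below is only an illustration: the Celsius readings are made-up values, and the variable names are not from the presentation.

```python
from statistics import mean, median, stdev

# Hypothetical Celsius readings, used only for illustration.
celsius = [12.0, 15.5, 18.0, 21.0, 30.5]

a, b = 32.0, 9.0 / 5.0  # x_new = a + b*x
fahrenheit = [a + b * x for x in celsius]

# Measures of center follow the full transformation a + b*x ...
mean_check = a + b * mean(celsius)
median_check = a + b * median(celsius)
# ... while the spread is only multiplied by b; the shift a drops out.
stdev_check = b * stdev(celsius)

print(mean(fahrenheit), mean_check)
print(median(fahrenheit), median_check)
print(stdev(fahrenheit), stdev_check)
```

Each printed pair should agree up to floating-point rounding.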

3 Density curves A density curve is a mathematical model of a distribution. The total area under the curve, by definition, is equal to 1, or 100%. The area under the curve for a range of values is the proportion of all observations that fall in that range. Figure: histogram of a sample with the smoothed density curve that theoretically describes the population.

4 Density curves come in any imaginable shape. Some are well known mathematically and others aren’t.

5 Median and mean of a density curve The median of a density curve is the equal-areas point: the point that divides the area under the curve in half. The mean of a density curve is the balance point, at which the curve would balance if it were made of solid material. The median and mean are the same for a symmetric density curve. The mean of a skewed curve is pulled in the direction of the long tail.

6 Normal distributions Normal (or Gaussian) distributions are a family of symmetrical, bell-shaped density curves defined by a mean µ (mu) and a standard deviation σ (sigma): N(µ, σ). The density is f(x) = (1 / (σ√(2π))) e^(−(x − µ)² / (2σ²)), where e = 2.71828… is the base of the natural logarithm and π = pi = 3.14159…

7 A family of density curves Here, means are different (µ = 10, 15, and 20) while standard deviations are the same (σ = 3). Here, means are the same (µ = 15) while standard deviations are different (σ = 2, 4, and 6).

8 The 68-95-99.7% Rule for Normal Distributions Example: mean µ = 64.5, standard deviation σ = 2.5, N(µ, σ) = N(64.5, 2.5). About 68% of all observations are within 1 standard deviation (σ) of the mean (µ). About 95% of all observations are within 2σ of the mean µ. Almost all (99.7%) observations are within 3σ of the mean. Reminder: µ (mu) is the mean of the idealized curve, while x̄ is the mean of a sample. σ (sigma) is the standard deviation of the idealized curve, while s is the s.d. of a sample. Figure label: inflection point.
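
The rule can be checked against the exact Normal areas. This is a minimal sketch using Python's standard-library `statistics.NormalDist`; the helper name `share_within` is an assumption made for illustration.

```python
from statistics import NormalDist

# The N(64.5, 2.5) curve used on the slide.
heights = NormalDist(mu=64.5, sigma=2.5)

def share_within(dist, k):
    """Proportion of the distribution within k standard deviations of its mean."""
    upper = dist.cdf(dist.mean + k * dist.stdev)
    lower = dist.cdf(dist.mean - k * dist.stdev)
    return upper - lower

for k in (1, 2, 3):
    print(k, round(share_within(heights, k), 4))
```

The same proportions come out for any choice of µ and σ, which is why the rule applies to every Normal distribution.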

9 The standard Normal distribution Because all Normal distributions share the same properties, we can standardize our data to transform any Normal curve N(µ, σ) into the standard Normal curve N(0,1). For each x we calculate a new value, z (called a z-score). Figure: N(64.5, 2.5) => N(0,1), with the horizontal axis relabeled as standardized height (no units).

10 Standardizing: calculating z-scores A z-score measures the number of standard deviations that a data value x is from the mean µ: z = (x − µ) / σ. When x is larger than the mean, z is positive. When x is smaller than the mean, z is negative. When x is 1 standard deviation larger than the mean, then z = 1. When x is 2 standard deviations larger than the mean, then z = 2.

11 Ex. Women heights Women’s heights follow the N(64.5”, 2.5”) distribution. What percent of women are shorter than 67 inches tall (that’s 5’7”)? mean µ = 64.5", standard deviation σ = 2.5", x (height) = 67". We calculate z, the standardized value of x: z = (x − µ) / σ = (67 − 64.5) / 2.5 = 1, so x is 1 standard deviation above the mean. Because of the 68-95-99.7 rule, we can conclude that the percent of women shorter than 67” should be, approximately, 0.68 + half of (1 − 0.68) = 0.84 or 84%. Figure: N(µ, σ) = N(64.5, 2.5) with µ = 64.5” at z = 0 and x = 67” at z = 1; the area to the left of x is the unknown.

12 Using the standard Normal table (…) Table A gives the area under the standard Normal curve to the left of any z value. 0.0082 is the area under N(0,1) left of z = −2.40. 0.0080 is the area under N(0,1) left of z = −2.41. 0.0069 is the area under N(0,1) left of z = −2.46.

13 Percent of women shorter than 67” For z = 1.00, the area under the standard Normal curve to the left of z is 0.8413. Conclusion: 84.13% of women are shorter than 67”. By subtraction, 1 − 0.8413, or 15.87% of women are taller than 67". Figure: N(µ, σ) = N(64.5”, 2.5”) with µ = 64.5” and x = 67” at z = 1; area ≈ 0.84 to the left of x and area ≈ 0.16 to the right.
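
The height example can be reproduced without a printed table. The sketch below assumes Python's standard-library `statistics.NormalDist` as a stand-in for Table A; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma, x = 64.5, 2.5, 67.0

# Standardize: how many standard deviations is x above the mean?
z = (x - mu) / sigma

# Area under the standard Normal curve N(0,1) to the left of z.
area_left = NormalDist().cdf(z)
area_right = 1 - area_left

print(z)                     # 1.0
print(round(area_left, 4))   # 0.8413
print(round(area_right, 4))  # 0.1587
```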

14 Tips on using Table A Because the Normal distribution is symmetrical, there are 2 ways that you can calculate the area under the standard Normal curve to the right of a z value: area right of z = 1 − area left of z, or area right of z = area left of −z. Figure: for z = −2.33, the area to the left is 0.0099 and the area to the right is 0.9901.

15 Tips on using Table A To calculate the area between 2 z-values, first get the area under N(0,1) to the left of each z-value from Table A. Then subtract the smaller area from the larger area: area between z1 and z2 = area left of z1 − area left of z2 (taking z1 as the larger value). A common mistake made by students is to subtract the two z values instead. But the Normal curve is not uniform, so a difference between z values is not an area. The area under N(0,1) for a single value of z is zero. (Try calculating the area to the left of z minus that same area!)
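
The Table A tips translate directly into three small helpers. This is a sketch under the assumption that `statistics.NormalDist().cdf` plays the role of Table A; the function names are hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()  # the standard Normal curve N(0,1)

def area_left(z):
    """Area under N(0,1) to the left of z (what Table A tabulates)."""
    return std_normal.cdf(z)

def area_right(z):
    """Area to the right of z, as 1 minus the area to the left."""
    return 1.0 - area_left(z)

def area_between(z1, z2):
    """Area between two z values: larger left-area minus smaller left-area."""
    low, high = sorted((z1, z2))
    return area_left(high) - area_left(low)

print(round(area_right(-2.33), 4))   # 0.9901
print(round(area_left(2.33), 4))     # same area, by symmetry
print(area_between(1.2, 1.2))        # 0.0 for a single value of z
```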

16 N(0,1) The cool thing about working with normally distributed data is that we can manipulate them and then answer questions that involve comparing seemingly non-comparable distributions. We do this by “standardizing” the data. All this involves is changing the scale so that the mean is 0 and the standard deviation is 1. Doing this to different distributions makes them comparable.

17 Population versus sample Population: The entire group of individuals in which we are interested but can’t usually assess directly. Example: all humans, all working-age people in California, all crickets. A parameter is a number describing a characteristic of the population. Sample: The part of the population we actually examine and for which we do have data. How well the sample represents the population depends on the sample design. A statistic is a number describing a characteristic of a sample.

18 Sampling methods Convenience sampling: Just ask whoever is around. – Example: “Man on the street” survey (cheap, convenient, often quite opinionated or emotional => now very popular with TV “journalism”). Which men, and on which street? – Ask about gun control or legalizing marijuana “on the street” in Berkeley or in some small town in Idaho and you would probably get totally different answers. – Even within an area, answers would probably differ if you did the survey outside a high school or a country western bar. Bias: Opinions limited to individuals present.

19 Voluntary Response Sampling: Individuals choose to be involved. These samples are very susceptible to bias because different people are motivated to respond or not. Often called “public opinion polls,” these are not considered valid or scientific. Bias: Sample design systematically favors a particular outcome. Example: Ann Landers summarizing responses of readers: 70% of the (10,000) parents who wrote in said that having kids was not worth it; if they had to do it over again, they wouldn’t. Bias: Most letters to newspapers are written by disgruntled people. A random sample showed that 91% of parents WOULD have kids again.

20 CNN on-line surveys: Bias: People have to care enough about an issue to bother replying. This sample is probably a combination of people who hate “wasting the taxpayers’ money” and “animal lovers.”

21 In contrast: Probability or random sampling: Individuals are randomly selected. No one group should be over-represented. Random samples rely on the objectivity of random numbers. There are tables and books of random digits available for random sampling. Statistical software can generate random numbers (e.g., Excel “=RAND()”). Sampling randomly removes bias in how individuals are selected.

22 Simple random samples A Simple Random Sample (SRS) is made of randomly selected individuals. Each individual in the population has the same probability of being in the sample. All possible samples of size n have the same chance of being drawn. The simplest way to use chance to select a sample is to place names in a hat (the population) and draw out a handful (the sample).
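
The "names in a hat" procedure can be imitated in software. The sketch below assumes a made-up population of 100 labeled individuals and uses `random.sample` from the Python standard library, which draws without replacement so that every subset of size k is equally likely.

```python
import random

# Hypothetical population: 100 labeled individuals (the "hat").
population = [f"person_{i}" for i in range(1, 101)]

# A seeded generator is used here only so the draw can be reproduced.
rng = random.Random(2011)

# Draw a simple random sample of size 10 (the "handful").
srs = rng.sample(population, k=10)
print(srs)
```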

23 Stratified samples There is a slightly more complex form of random sampling: A stratified random sample is essentially a series of SRSs performed on subgroups of a given population. The subgroups are chosen to contain all the individuals with a certain characteristic. For example: – Divide the population of UCI students into males and females. – Divide the population of California by major ethnic group. – Divide the counties in America into urban and rural based on population density. The SRSs taken within each group in a stratified random sample need not be of the same size. For example: – A stratified random sample of 100 male and 150 female UCI students – A stratified random sample of a total of 100 Californians, representing proportionately the major ethnic groups
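
A stratified random sample is one SRS per subgroup. The following sketch mirrors the 100 male / 150 female example; the stratum sizes (4,000 and 6,000) and the function name `stratified_sample` are assumptions made purely for illustration.

```python
import random

def stratified_sample(strata, sizes, rng):
    """Draw an SRS of sizes[name] individuals from each stratum."""
    return {name: rng.sample(members, sizes[name])
            for name, members in strata.items()}

# Hypothetical strata; the group sizes are invented for the example.
strata = {
    "male": [f"M{i}" for i in range(4000)],
    "female": [f"F{i}" for i in range(6000)],
}
sizes = {"male": 100, "female": 150}

sample = stratified_sample(strata, sizes, random.Random(0))
print({name: len(chosen) for name, chosen in sample.items()})
```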

24 What is a sampling distribution? The sampling distribution of a statistic is the distribution of all possible values taken by the statistic when all possible samples of a fixed size n are taken from the population. It is a theoretical idea—we do not actually build it. The sampling distribution of a statistic is the probability distribution of that statistic.

25 Sampling distribution of the sample mean We take many random samples of a given size n from a population with mean µ and standard deviation σ. Some sample means will be above the population mean µ and some will be below, making up the sampling distribution. Figure: histogram of some sample averages, illustrating the sampling distribution of “x bar”.

26 Sampling distribution of x bar For any population with mean µ and standard deviation σ: The mean, or center, of the sampling distribution of x̄ is equal to the population mean µ. The standard deviation of the sampling distribution is σ/√n, where n is the sample size.

27 Mean of a sampling distribution of x̄ There is no tendency for a sample mean to fall systematically above or below µ, even if the distribution of the raw data is skewed. Thus, the sample mean x̄ is an unbiased estimate of the population mean µ: it will be “correct on average” in many samples. Standard deviation of a sampling distribution of x̄ The standard deviation of the sampling distribution measures how much the sample statistic varies from sample to sample. It is smaller than the standard deviation of the population by a factor of √n. Averages are less variable than individual observations.
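
Both properties can be seen in a small simulation. This is a sketch, not part of the presentation: the skewed population is generated artificially, samples are drawn with replacement so that σ/√n applies directly, and the sizes (n = 25, 5,000 repetitions) are arbitrary choices.

```python
import random
from math import sqrt
from statistics import fmean, pstdev

rng = random.Random(27)  # seeded only for reproducibility

# Hypothetical right-skewed population (lognormal values).
population = [rng.lognormvariate(0.0, 0.5) for _ in range(10_000)]
mu, sigma = fmean(population), pstdev(population)

# Draw many samples of size n and record each sample mean.
n, reps = 25, 5_000
means = [fmean(rng.choices(population, k=n)) for _ in range(reps)]

print("population mean:", mu, "| mean of sample means:", fmean(means))
print("sigma / sqrt(n):", sigma / sqrt(n), "| s.d. of sample means:", pstdev(means))
```

The two numbers on each printed line should be close, though not identical, because only a finite number of samples is simulated.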

28 For normally distributed populations When a variable in a population is normally distributed, the sampling distribution of x̄ for all possible samples of size n is also normally distributed. If the population is N(µ, σ), then the distribution of sample means is N(µ, σ/√n). Figure panels: population; sampling distribution.

29 The central limit theorem Central Limit Theorem: When randomly sampling from any population with mean µ and standard deviation σ, when n is large enough, the sampling distribution of x̄ is approximately normal: x̄ ~ N(µ, σ/√n). Figure panels: population with strongly skewed distribution; sampling distribution of x̄ for n = 2 observations; sampling distribution of x̄ for n = 10 observations; sampling distribution of x̄ for n = 25 observations.
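
The effect in the four panels can be imitated by simulation. The sketch below assumes an exponential population as the strongly skewed example (the slide does not say which distribution was used) and measures the skewness of the simulated sample means, which should shrink toward 0, the value for a symmetric curve, as n grows.

```python
import random
from statistics import fmean

def skewness(values):
    """Moment-based skewness: 0 for a symmetric distribution."""
    m = fmean(values)
    m2 = fmean([(v - m) ** 2 for v in values])
    m3 = fmean([(v - m) ** 3 for v in values])
    return m3 / m2 ** 1.5

def sample_means(n, reps, rng):
    """Means of `reps` random samples of size n from an exponential population."""
    return [fmean([rng.expovariate(1.0) for _ in range(n)]) for _ in range(reps)]

rng = random.Random(29)  # seeded only for reproducibility
skew_by_n = {n: skewness(sample_means(n, 10_000, rng)) for n in (2, 10, 25)}

for n, skew in skew_by_n.items():
    print(f"n = {n:2d}: skewness of sample means = {skew:.2f}")
```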

30

31 The National Collegiate Athletic Association (NCAA) requires Division I athletes to score at least 820 on the combined math and verbal SAT exam to compete in their first college year. The SAT scores of 2003 were approximately normal with mean 1026 and standard deviation 209. What proportion of all students would be NCAA qualifiers (SAT ≥ 820)? z = (820 − 1026) / 209 ≈ −0.99, and Table A gives an area of 0.1611 to the left of z = −0.99. area right of 820 = total area − area left of 820 = 1 − 0.1611 = 0.8389 ≈ 84%. Note: The actual data may contain students who scored exactly 820 on the SAT. However, the proportion of scores exactly equal to 820 is 0 for a normal distribution, which is a consequence of the idealized smoothing of density curves.

32 The NCAA defines a “partial qualifier” eligible to practice and receive an athletic scholarship, but not to compete, with a combined SAT score of at least 720. What proportion of all students who take the SAT would be partial qualifiers? That is, what proportion have scores between 720 and 820? z = (720 − 1026) / 209 ≈ −1.46, and Table A gives an area of 0.0721 to the left of z = −1.46. area between 720 and 820 = area left of 820 − area left of 720 = 0.1611 − 0.0721 = 0.0890 ≈ 9%. About 9% of all students who take the SAT have scores between 720 and 820.
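
Both NCAA proportions can be recomputed without Table A. This sketch assumes the N(1026, 209) model from the slides and uses `statistics.NormalDist`; because it does not round z to two decimals first, the unrounded areas differ slightly from the table values while the rounded percentages agree.

```python
from statistics import NormalDist

# SAT model from the slides: approximately N(1026, 209).
sat = NormalDist(mu=1026, sigma=209)

# Qualifiers: area to the right of 820.
qualifiers = 1 - sat.cdf(820)

# Partial qualifiers: area between 720 and 820.
partial = sat.cdf(820) - sat.cdf(720)

print(f"qualifiers: {qualifiers:.0%}")
print(f"partial qualifiers: {partial:.0%}")
```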

33 IQ scores: population vs. sample In a large population of adults, the mean IQ is 112 with standard deviation 20. Suppose 200 adults are randomly selected for a market research campaign. The distribution of the sample mean IQ is: A) Exactly normal, mean 112, standard deviation 20. B) Approximately normal, mean 112, standard deviation 20. C) Approximately normal, mean 112, standard deviation 1.414. D) Approximately normal, mean 112, standard deviation 0.1. Answer: C) Approximately normal, mean 112, standard deviation 1.414. Population distribution: N(µ = 112; σ = 20). Sampling distribution for n = 200 is N(µ = 112; σ/√n = 1.414).
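
The standard deviation in answer C is just σ/√n. A two-line check, with variable names chosen for illustration:

```python
from math import sqrt

sigma, n = 20.0, 200              # population s.d. and sample size from the slide
standard_error = sigma / sqrt(n)  # s.d. of the sampling distribution of the mean

print(round(standard_error, 3))   # 1.414
```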

34 Application Hypokalemia is diagnosed when blood potassium levels are below 3.5mEq/dl. Let’s assume that we know a patient whose measured potassium levels vary daily according to a normal distribution N(µ = 3.8, σ = 0.2). If only one measurement is made, what is the probability that this patient will be misdiagnosed with Hypokalemia? z = (3.5 − 3.8) / 0.2 = −1.5, P(z < −1.5) = 0.0668 ≈ 7%. Instead, if measurements are taken on 4 separate days and averaged, what is the probability of a misdiagnosis? z = (3.5 − 3.8) / (0.2 / √4) = −3, P(z < −3) = 0.0013 ≈ 0.1%. Note: Make sure to standardize (z) using the standard deviation for the sampling distribution.
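
The two misdiagnosis probabilities follow from one formula with different n. The sketch below assumes the diagnosis for several days is based on the average of the n measurements; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, cutoff = 3.8, 0.2, 3.5  # values from the slide

def misdiagnosis_probability(n):
    """P(average of n measurements < cutoff), using s.d. sigma / sqrt(n)."""
    return NormalDist(mu, sigma / sqrt(n)).cdf(cutoff)

print(round(misdiagnosis_probability(1), 4))  # 0.0668
print(round(misdiagnosis_probability(4), 4))  # 0.0013
```

Averaging four measurements halves the standard deviation, which moves the cutoff from 1.5 to 3 standard deviations below the mean.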

35 Practical note Large samples are not always attainable. – Sometimes the cost, difficulty, or preciousness of what is studied drastically limits any possible sample size. – Blood samples/biopsies: No more than a handful of repetitions are acceptable. Oftentimes, we even make do with just one. – Opinion polls have a limited sample size due to time and cost of operation. During election times, though, sample sizes are increased for better accuracy. Not all variables are normally distributed. – Income, for example, is typically strongly skewed. – Is x̄ still a good estimator of µ then?

