Chapter 1 Review. Presentation transcript:
Analyzing Categorical Data Categorical Variables place individuals into one of several groups or categories. – The values of a categorical variable are labels for the different categories. – The distribution of a categorical variable lists the count or percent of individuals who fall into each category. Example, page 8: the variable is radio station format, and its values are the format names.
Frequency Table (Count) and Relative Frequency Table (Percent)
Format                Count of Stations   Percent of Stations
Adult Contemporary    1556                11.2
Adult Standards       1196                8.6
Contemporary Hit      569                 4.1
Country               2066                14.9
News/Talk             2179                15.7
Oldies                1060                7.7
Religious             2014                14.6
Rock                  869                 6.3
Spanish Language      750                 5.4
Other Formats         1579                11.4
Total                 13838               99.9
The percents sum to 99.9 rather than 100 because of rounding.
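The step from the frequency table to the relative frequency table can be sketched in a few lines of Python. This is only an illustration: the function name `relative_frequencies` is made up for this sketch, and the counts are copied from the table above.

```python
# Station counts copied from the frequency table above.
counts = {
    "Adult Contemporary": 1556,
    "Adult Standards": 1196,
    "Contemporary Hit": 569,
    "Country": 2066,
    "News/Talk": 2179,
    "Oldies": 1060,
    "Religious": 2014,
    "Rock": 869,
    "Spanish Language": 750,
    "Other Formats": 1579,
}


def relative_frequencies(counts):
    """Return each category's share of the total, in percent (1 decimal)."""
    total = sum(counts.values())
    return {cat: round(100 * n / total, 1) for cat, n in counts.items()}


percents = relative_frequencies(counts)
for category, pct in percents.items():
    print(f"{category:20s}{pct:5.1f}")
```

Rounding each percent separately is why the printed column does not add up to exactly 100.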
Analyzing Categorical Data Two-Way Tables and Marginal Distributions When a dataset involves two categorical variables, we begin by examining the counts or percents in various categories for one of the variables. Definition: Two-way Table – describes two categorical variables, organizing counts according to a row variable and a column variable. Example, p. 12
Young adults by gender and chance of getting rich
Opinion                          Female   Male   Total
Almost no chance                 96       98     194
Some chance, but probably not    426      286    712
A 50-50 chance                   696      720    1416
A good chance                    663      758    1421
Almost certain                   486      597    1083
Total                            2367     2459   4826
What are the variables described by this two-way table? How many young adults were surveyed?
Analyzing Categorical Data Two-Way Tables and Marginal Distributions Definition: The Marginal Distribution of one of the categorical variables in a two-way table of counts is the distribution of values of that variable among all individuals described by the table. Note: Percents are often more informative than counts, especially when comparing groups of different sizes. To examine a marginal distribution, 1)Use the data in the table to calculate the marginal distribution (in percents) of the row or column totals. 2)Make a graph to display the marginal distribution.
Analyzing Categorical Data Relationships Between Categorical Variables Marginal distributions tell us nothing about the relationship between two variables. Definition: A Conditional Distribution of a variable describes the values of that variable among individuals who have a specific value of another variable. To examine or compare conditional distributions, 1) Select the row(s) or column(s) of interest. 2) Use the data in the table to calculate the conditional distribution (in percents) of the row(s) or column(s). 3) Make a graph to display the conditional distribution. Use a side-by-side bar graph or segmented bar graph to compare distributions.
Analyzing Categorical Data Two-Way Tables and Conditional Distributions Example, p. 15 Calculate the conditional distribution of opinion among males. Examine the relationship between gender and opinion.
Young adults by gender and chance of getting rich
Opinion                          Female   Male   Total
Almost no chance                 96       98     194
Some chance, but probably not    426      286    712
A 50-50 chance                   696      720    1416
A good chance                    663      758    1421
Almost certain                   486      597    1083
Total                            2367     2459   4826
Conditional distributions of opinion by gender
Response           Male                 Female
Almost no chance   98/2459 = 4.0%       96/2367 = 4.1%
Some chance        286/2459 = 11.6%     426/2367 = 18.0%
A 50-50 chance     720/2459 = 29.3%     696/2367 = 29.4%
A good chance      758/2459 = 30.8%     663/2367 = 28.0%
Almost certain     597/2459 = 24.3%     486/2367 = 20.5%
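The marginal and conditional calculations described above can be sketched as follows. The helper names `marginal_distribution` and `conditional_distribution` are invented for this illustration; the counts come from the two-way table.

```python
# Two-way table of counts: opinion (rows) by gender (columns).
table = {
    "Almost no chance": {"Female": 96, "Male": 98},
    "Some chance, but probably not": {"Female": 426, "Male": 286},
    "A 50-50 chance": {"Female": 696, "Male": 720},
    "A good chance": {"Female": 663, "Male": 758},
    "Almost certain": {"Female": 486, "Male": 597},
}


def marginal_distribution(table):
    """Percent of ALL individuals giving each response (row totals / grand total)."""
    grand_total = sum(sum(row.values()) for row in table.values())
    return {resp: round(100 * sum(row.values()) / grand_total, 1)
            for resp, row in table.items()}


def conditional_distribution(table, gender):
    """Percent of individuals of one gender giving each response."""
    column_total = sum(row[gender] for row in table.values())
    return {resp: round(100 * row[gender] / column_total, 1)
            for resp, row in table.items()}


print(marginal_distribution(table))
print(conditional_distribution(table, "Male"))
print(conditional_distribution(table, "Female"))
```

Comparing the two conditional distributions side by side is what reveals the relationship: the marginal distribution alone cannot show that males are more likely than females to answer "Almost certain".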
Displaying Quantitative Data Dotplots – One of the simplest graphs to construct and interpret is a dotplot. Each data value is shown as a dot above its location on a number line. How to Make a Dotplot 1) Draw a horizontal axis (a number line) and label it with the variable name. 2) Scale the axis from the minimum to the maximum value. 3) Mark a dot above the location on the horizontal axis corresponding to each data value.
Number of Goals Scored Per Game by the 2004 US Women’s Soccer Team: 3 0 2 7 8 2 4 3 5 1 1 4 5 3 1 1 3 3 3 2 1 2 2 2 4 3 5 6 1 5 5 1 1 5
Displaying Quantitative Data Examining the Distribution of a Quantitative Variable The purpose of a graph is to help us understand the data. After you make a graph, always ask, “What do I see?” How to Examine the Distribution of a Quantitative Variable In any graph, look for the overall pattern and for striking departures from that pattern. Describe the overall pattern of a distribution by its shape, center, and spread. Note individual values that fall outside the overall pattern. These departures are called outliers. Don’t forget your SOCS (Shape, Outliers, Center, Spread)!
Displaying Quantitative Data Describing Shape – When you describe a distribution’s shape, concentrate on the main features. Look for rough symmetry or clear skewness. Definitions: A distribution is roughly symmetric if the right and left sides of the graph are approximately mirror images of each other. A distribution is skewed to the right (right-skewed) if the right side of the graph (containing the half of the observations with larger values) is much longer than the left side. It is skewed to the left (left-skewed) if the left side of the graph is much longer than the right side. [Figure: three example distributions labeled Symmetric, Skewed-left, and Skewed-right.]
Displaying Quantitative Data Stemplots (Stem-and-Leaf Plots) – Another simple graphical display for small data sets is a stemplot. Stemplots give us a quick picture of the distribution while including the actual numerical values. How to Make a Stemplot 1) Separate each observation into a stem (all but the final digit) and a leaf (the final digit). 2) Write all possible stems from the smallest to the largest in a vertical column and draw a vertical line to the right of the column. 3) Write each leaf in the row to the right of its stem. 4) Arrange the leaves in increasing order out from the stem. 5) Provide a key that explains in context what the stems and leaves represent.
Displaying Quantitative Data Splitting Stems and Back-to-Back Stemplots – When data values are “bunched up”, we can get a better picture of the distribution by splitting stems. – Two distributions of the same quantitative variable can be compared using a back-to-back stemplot with common stems. [Figure: raw data and a back-to-back stemplot with split stems showing the number of pairs of shoes reported by females and by males.] Key: 4|9 represents a student who reported having 49 pairs of shoes.
Displaying Quantitative Data Histograms – Quantitative variables often take many values. A graph of the distribution may be clearer if nearby values are grouped together. – The most common graph of the distribution of one quantitative variable is a histogram. How to Make a Histogram 1) Divide the range of data into classes of equal width. 2) Find the count (frequency) or percent (relative frequency) of individuals in each class. 3) Label and scale your axes and draw the histogram. The height of the bar equals its frequency. Adjacent bars should touch, unless a class contains no individuals.
Displaying Quantitative Data Making a Histogram Example, page 35 The table on page 35 presents data on the percent of residents from each state who were born outside of the U.S.
Frequency Table
Class        Count
0 to <5      20
5 to <10     13
10 to <15    9
15 to <20    5
20 to <25    2
25 to <30    1
Total        50
[Figure: histogram with horizontal axis “Percent of foreign-born residents” and vertical axis “Number of States”.]
Displaying Quantitative Data Using Histograms Wisely – Here are several cautions based on common mistakes students make when using histograms. Cautions 1) Don’t confuse histograms and bar graphs. 2) Don’t use counts (in a frequency table) or percents (in a relative frequency table) as data. 3) Use percents instead of counts on the vertical axis when comparing distributions with different numbers of observations. 4) Just because a graph looks nice, it’s not necessarily a meaningful display of data.
Describing Quantitative Data Measuring Center: The Mean – The most common measure of center is the ordinary arithmetic average, or mean. Definition: To find the mean $\bar{x}$ (pronounced “x-bar”) of a set of observations, add their values and divide by the number of observations. If the n observations are $x_1, x_2, x_3, \ldots, x_n$, their mean is:
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
In mathematics, the capital Greek letter Σ is short for “add them all up.” Therefore, the formula for the mean can be written in more compact notation:
$$\bar{x} = \frac{\sum x_i}{n}$$
Describing Quantitative Data Measuring Center: The Median – Another common measure of center is the median. In section 1.2, we learned that the median describes the midpoint of a distribution. Definition: The median M is the midpoint of a distribution, the number such that half of the observations are smaller and the other half are larger. To find the median of a distribution: 1) Arrange all observations from smallest to largest. 2) If the number of observations n is odd, the median M is the center observation in the ordered list. 3) If the number of observations n is even, the median M is the average of the two center observations in the ordered list.
Describing Quantitative Data Comparing the Mean and the Median The mean and median measure center in different ways, and both are useful. – Don’t confuse the “average” value of a variable (the mean) with its “typical” value, which we might describe by the median. The mean and median of a roughly symmetric distribution are close together. If the distribution is exactly symmetric, the mean and median are exactly the same. In a skewed distribution, the mean is usually farther out in the long tail than is the median.
Describing Quantitative Data Measuring Spread: The Interquartile Range (IQR) – A measure of center alone can be misleading. – A useful numerical description of a distribution requires both a measure of center and a measure of spread. How to Calculate the Quartiles and the Interquartile Range To calculate the quartiles: 1) Arrange the observations in increasing order and locate the median M. 2) The first quartile Q1 is the median of the observations located to the left of the median in the ordered list. 3) The third quartile Q3 is the median of the observations located to the right of the median in the ordered list. The interquartile range (IQR) is defined as: IQR = Q3 – Q1
Describing Quantitative Data Find and Interpret the IQR Example, page 57 Travel times to work (in minutes) for 20 randomly selected New Yorkers: 10 30 5 25 40 20 10 15 30 20 15 20 85 15 65 15 60 60 40 45
In increasing order: 5 10 10 15 15 15 15 20 20 20 25 30 30 40 40 45 60 60 65 85
Q1 = 15, M = 22.5, Q3 = 42.5 IQR = Q3 – Q1 = 42.5 – 15 = 27.5 minutes Interpretation: The range of the middle half of travel times for the New Yorkers in the sample is 27.5 minutes.
Describing Quantitative Data Identifying Outliers – In addition to serving as a measure of spread, the interquartile range (IQR) is used as part of a rule of thumb for identifying outliers. Definition: The 1.5 x IQR Rule for Outliers Call an observation an outlier if it falls more than 1.5 x IQR above the third quartile or below the first quartile. Example, page 57 In the New York travel time data, we found Q1 = 15 minutes, Q3 = 42.5 minutes, and IQR = 27.5 minutes. For these data, 1.5 x IQR = 1.5(27.5) = 41.25 Q1 - 1.5 x IQR = 15 – 41.25 = -26.25 Q3 + 1.5 x IQR = 42.5 + 41.25 = 83.75 Any travel time shorter than -26.25 minutes or longer than 83.75 minutes is considered an outlier. In the stemplot below, only the 85-minute travel time qualifies.
0 | 5
1 | 005555
2 | 0005
3 | 00
4 | 005
5 |
6 | 005
7 |
8 | 5
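The quartile, IQR, and outlier calculations for the travel time example can be sketched in Python. The function `quartiles` is a hypothetical helper that follows the method on these slides (median of each half of the ordered list); many software packages use slightly different quartile conventions, so their results may differ. The 20 values are the travel times read off the stemplot.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, M, Q3) using the median-of-each-half method."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # With an odd count, the median itself belongs to neither half.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)


travel_times = [10, 30, 5, 25, 40, 20, 10, 15, 30, 20,
                15, 20, 85, 15, 65, 15, 60, 60, 40, 45]

q1, m, q3 = quartiles(travel_times)
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in travel_times if x < low_fence or x > high_fence]

print(q1, m, q3, iqr)         # quartiles and IQR
print(low_fence, high_fence)  # outlier cutoffs
print(outliers)               # values flagged by the 1.5 x IQR rule
```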
Describing Quantitative Data The Five-Number Summary The minimum and maximum values alone tell us little about the distribution as a whole. Likewise, the median and quartiles tell us little about the tails of a distribution. To get a quick summary of both center and spread, combine all five numbers. Definition: The five-number summary of a distribution consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation, written in order from smallest to largest: Minimum, Q1, M, Q3, Maximum
Describing Quantitative Data Boxplots (Box-and-Whisker Plots) The five-number summary divides the distribution roughly into quarters. This leads to a new way to display quantitative data, the boxplot. How to Make a Boxplot Draw and label a number line that includes the range of the distribution. Draw a central box from Q1 to Q3. Note the median M inside the box. Extend lines (whiskers) from the box out to the minimum and maximum values that are not outliers.
Describing Quantitative Data Measuring Spread: The Standard Deviation – The most common measure of spread looks at how far each observation is from the mean. This measure is called the standard deviation. Let’s explore it! – Consider the following data on the number of pets owned by a group of 9 children. 1) Calculate the mean. 2) Calculate each deviation: deviation = observation – mean. Here the mean is 5, so the deviations for the observations 1 and 8 are 1 - 5 = -4 and 8 - 5 = 3.
Describing Quantitative Data Measuring Spread: The Standard Deviation
x_i    (x_i - mean)    (x_i - mean)^2
1      1 - 5 = -4      (-4)^2 = 16
3      3 - 5 = -2      (-2)^2 = 4
4      4 - 5 = -1      (-1)^2 = 1
4      4 - 5 = -1      (-1)^2 = 1
4      4 - 5 = -1      (-1)^2 = 1
5      5 - 5 = 0       (0)^2 = 0
7      7 - 5 = 2       (2)^2 = 4
8      8 - 5 = 3       (3)^2 = 9
9      9 - 5 = 4       (4)^2 = 16
Sum of squared deviations = 52
3) Square each deviation. 4) Find the “average” squared deviation: calculate the sum of the squared deviations divided by (n-1). This is called the variance. 5) Calculate the square root of the variance. This is the standard deviation. “Average” squared deviation = 52/(9-1) = 6.5. This is the variance. Standard deviation = square root of variance = √6.5 ≈ 2.55
Describing Quantitative Data Measuring Spread: The Standard Deviation Definition: The standard deviation $s_x$ measures the average distance of the observations from their mean. It is calculated by finding an average of the squared distances and then taking the square root. This average squared distance is called the variance.
$$\text{variance} = s_x^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \qquad \text{standard deviation} = s_x = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$
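The five steps for the pets data can be sketched directly in Python. The function name `sample_variance` is chosen for this illustration; Python's standard library offers `statistics.variance` and `statistics.stdev`, which use the same n - 1 divisor.

```python
from math import sqrt


def sample_variance(data):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    mean = sum(data) / n
    squared_deviations = [(x - mean) ** 2 for x in data]
    return sum(squared_deviations) / (n - 1)


pets = [1, 3, 4, 4, 4, 5, 7, 8, 9]  # number of pets for the 9 children

variance = sample_variance(pets)
std_dev = sqrt(variance)

print(variance)           # 52 / 8
print(round(std_dev, 2))  # square root of the variance
```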
Describing Quantitative Data Choosing Measures of Center and Spread We now have a choice between two descriptions for center and spread: – Mean and Standard Deviation – Median and Interquartile Range The median and IQR are usually better than the mean and standard deviation for describing a skewed distribution or a distribution with outliers. Use mean and standard deviation only for reasonably symmetric distributions that don’t have outliers. NOTE: Numerical summaries do not fully describe the shape of a distribution. ALWAYS PLOT YOUR DATA!
Describing Location in a Distribution Measuring Position: Percentiles – One way to describe the location of a value in a distribution is to tell what percent of observations are less than it. Definition: The p th percentile of a distribution is the value with p percent of the observations less than it. Example, p. 85 Jenny earned a score of 86 on her test. How did she perform relative to the rest of the class?
6 | 7
7 | 2334
7 | 5777899
8 | 00123334
8 | 569
9 | 03
Her score was greater than 21 of the 25 observations. Since 21 of the 25, or 84%, of the scores are below hers, Jenny is at the 84 th percentile in the class’s test score distribution.
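Jenny's percentile can be checked with a short sketch. The scores are read off the stemplot above, and `percentile_of` is a hypothetical helper that applies the definition on this slide (percent of observations strictly less than the value).

```python
# Test scores read off the class stemplot (25 students).
scores = [67,
          72, 73, 73, 74,
          75, 77, 77, 77, 78, 79, 79,
          80, 80, 81, 82, 83, 83, 83, 84,
          85, 86, 89,
          90, 93]


def percentile_of(value, data):
    """Percent of observations in data that are less than value."""
    below = sum(1 for x in data if x < value)
    return 100 * below / len(data)


print(percentile_of(86, scores))  # Jenny's percentile in the class
```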
Describing Location in a Distribution Interpreting Cumulative Relative Frequency Graphs Use the graph from page 88 to answer the following questions. Was Barack Obama, who was inaugurated at age 47, unusually young? Estimate and interpret the 65 th percentile of the distribution. [Figure: cumulative relative frequency graph with marked readings: age 47 at a cumulative relative frequency of 11%, and 65% at age 58.]
Describing Location in a Distribution Measuring Position: z-Scores – A z-score tells us how many standard deviations from the mean an observation falls, and in what direction. Definition: If x is an observation from a distribution that has known mean and standard deviation, the standardized value of x is:
$$z = \frac{x - \text{mean}}{\text{standard deviation}}$$
A standardized value is often called a z-score. Jenny earned a score of 86 on her test. The class mean is 80 and the standard deviation is 6.07. What is her standardized score? z = (86 – 80)/6.07 ≈ 0.99
Describing Location in a Distribution Using z-Scores for Comparison Example, p. 91 We can use z-scores to compare the position of individuals in different distributions. Jenny earned a score of 86 on her statistics test. The class mean was 80 and the standard deviation was 6.07. She earned a score of 82 on her chemistry test. The chemistry scores had a fairly symmetric distribution with a mean of 76 and a standard deviation of 4. On which test did Jenny perform better relative to the rest of her class?
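The comparison can be worked through with a small sketch. The function name `z_score` is chosen for this illustration; the scores, means, and standard deviations are the ones stated in the example.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


z_statistics = z_score(86, 80, 6.07)
z_chemistry = z_score(82, 76, 4)

print(round(z_statistics, 2))
print(round(z_chemistry, 2))
```

The larger z-score marks the test on which Jenny stood further above her classmates, so under these numbers the chemistry result is the stronger one even though the raw score is lower.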
Describing Location in a Distribution Transforming Data Example, p. 93 Transforming converts the original observations from the original units of measurements to another scale. Transformations can affect the shape, center, and spread of a distribution. Effect of Adding (or Subtracting) a Constant Adding the same number a (either positive, zero, or negative) to each observation: adds a to measures of center and location (mean, median, quartiles, percentiles), but does not change the shape of the distribution or measures of spread (range, IQR, standard deviation).
             n    Mean    s_x    Min   Q1   M    Q3   Max   IQR   Range
Guess (m)    44   16.02   7.14   8     11   15   17   40    6     32
Error (m)    44   3.02    7.14   -5    -2   2    4    27    6     32
Describing Location in a Distribution Transforming Data Example, p. 95 Effect of Multiplying (or Dividing) by a Constant Multiplying (or dividing) each observation by the same number b (positive, negative, or zero): multiplies (divides) measures of center and location by b, multiplies (divides) measures of spread by |b|, but does not change the shape of the distribution.
             n    Mean   s_x     Min     Q1      M      Q3      Max     IQR     Range
Error (ft)   44   9.91   23.43   -16.4   -6.56   6.56   13.12   88.56   19.68   104.96
Error (m)    44   3.02   7.14    -5      -2      2      4       27      6       32
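Both transformation rules can be checked numerically. The sketch below uses a small made-up data set (`errors_m` is hypothetical, not the 44 observations behind the tables); the shift of 13 and the factor 3.28 are the ones implied by the tables (guess = error + 13, and metres to feet).

```python
from statistics import mean, median, stdev


def summary(data):
    """A few measures of center and spread for a list of numbers."""
    return {
        "mean": mean(data),
        "median": median(data),
        "sd": stdev(data),
        "range": max(data) - min(data),
    }


errors_m = [-5, -2, 2, 4, 27]              # hypothetical errors, in metres
guesses_m = [x + 13 for x in errors_m]     # adding a constant a = 13
errors_ft = [x * 3.28 for x in errors_m]   # multiplying by b = 3.28

base, shifted, scaled = summary(errors_m), summary(guesses_m), summary(errors_ft)

# Adding a constant shifts center but leaves spread alone.
print(shifted["mean"] - base["mean"], shifted["sd"] - base["sd"])
# Multiplying by a constant scales both center and spread.
print(scaled["mean"] / base["mean"], scaled["sd"] / base["sd"])
```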
Describing Location in a Distribution Density Curve Definition: A density curve is a curve that is always on or above the horizontal axis, and has area exactly 1 underneath it. A density curve describes the overall pattern of a distribution. The area under the curve and above any interval of values on the horizontal axis is the proportion of all observations that fall in that interval. The overall pattern of this histogram of the scores of all 947 seventh-grade students in Gary, Indiana, on the vocabulary part of the Iowa Test of Basic Skills (ITBS) can be described by a smooth curve drawn through the tops of the bars.
Normal Distributions Definition: A Normal distribution is described by a Normal density curve. Any particular Normal distribution is completely specified by two numbers: its mean µ and standard deviation σ. The mean of a Normal distribution is the center of the symmetric Normal curve. The standard deviation is the distance from the center to the change-of-curvature points on either side. We abbreviate the Normal distribution with mean µ and standard deviation σ as N(µ,σ). Normal distributions are good descriptions for some distributions of real data. Normal distributions are good approximations of the results of many kinds of chance outcomes. Many statistical inference procedures are based on Normal distributions.
Normal Distributions Although there are many Normal curves, they all have properties in common. The 68-95-99.7 Rule Definition: The 68-95-99.7 Rule (“The Empirical Rule”) In the Normal distribution with mean µ and standard deviation σ: Approximately 68% of the observations fall within σ of µ. Approximately 95% of the observations fall within 2σ of µ. Approximately 99.7% of the observations fall within 3σ of µ.
Normal Distributions Example, p. 113 The distribution of Iowa Test of Basic Skills (ITBS) vocabulary scores for 7th grade students in Gary, Indiana, is close to Normal. Suppose the distribution is N(6.84, 1.55). a) Sketch the Normal density curve for this distribution. b) What percent of ITBS vocabulary scores are less than 3.74? c) What percent of the scores are between 5.29 and 9.94?
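Parts b) and c) can be checked with Python's standard-library `statistics.NormalDist`. This is a sketch, not the slide's own method: the slides intend the 68-95-99.7 rule, which gives about 2.5% for b) (3.74 is two standard deviations below the mean) and about 81.5% for c) (from one standard deviation below to two above). The exact areas computed here differ slightly because the rule's percentages are rounded.

```python
from statistics import NormalDist

itbs = NormalDist(mu=6.84, sigma=1.55)

# b) proportion of scores below 3.74 (2 standard deviations below the mean)
below_3_74 = itbs.cdf(3.74)

# c) proportion of scores between 5.29 and 9.94 (from -1 sd to +2 sd)
between = itbs.cdf(9.94) - itbs.cdf(5.29)

print(f"below 3.74: {100 * below_3_74:.1f}%")
print(f"between 5.29 and 9.94: {100 * between:.1f}%")
```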
Normal Distributions The Standard Normal Distribution – All Normal distributions are the same if we measure in units of size σ from the mean µ as center. Definition: The standard Normal distribution is the Normal distribution with mean 0 and standard deviation 1. If a variable x has any Normal distribution N(µ,σ) with mean µ and standard deviation σ, then the standardized variable
$$z = \frac{x - \mu}{\sigma}$$
has the standard Normal distribution, N(0,1).
Normal Distributions The Standard Normal Table Because all Normal distributions are the same when we standardize, we can find areas under any Normal curve from a single table. Definition: The Standard Normal Table Table A is a table of areas under the standard Normal curve. The table entry for each value z is the area under the curve to the left of z. Suppose we want to find the proportion of observations from the standard Normal distribution that are less than 0.81. We can use Table A:
z      .00     .01     .02
0.7    .7580   .7611   .7642
0.8    .7881   .7910   .7939
0.9    .8159   .8186   .8212
P(z < 0.81) = .7910
Normal Distributions Normal Distribution Calculations When Tiger Woods hits his driver, the distance the ball travels can be described by N(304, 8). What percent of Tiger’s drives travel between 305 and 325 yards? Standardize the two boundaries: z = (305 – 304)/8 ≈ 0.13 and z = (325 – 304)/8 ≈ 2.63. Using Table A, the area to the left of z = 2.63 is 0.9957 and the area to the left of z = 0.13 is 0.5517, so the area between them is 0.9957 – 0.5517 = 0.4440. About 44% of Tiger’s drives travel between 305 and 325 yards.
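Software can stand in for Table A. The sketch below uses `statistics.NormalDist` from the Python standard library to reproduce the table lookup P(z < 0.81) and the driving-distance calculation. Because it does not round the z-scores to two decimals first, the driving-distance answer differs from the table-based 0.4440 in the third decimal place.

```python
from statistics import NormalDist

standard = NormalDist()  # mean 0, standard deviation 1

# Table A lookup: area to the left of z = 0.81
table_entry = round(standard.cdf(0.81), 4)

# Drive distances modeled as N(304, 8), as stated in the example
drives = NormalDist(mu=304, sigma=8)
z_low = (305 - 304) / 8
z_high = (325 - 304) / 8
proportion = drives.cdf(325) - drives.cdf(305)

print(table_entry)
print(z_low, z_high)
print(round(proportion, 3))
```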
Normal Distributions Assessing Normality The Normal distributions provide good models for some distributions of real data. Many statistical inference procedures are based on the assumption that the population is approximately Normally distributed. Consequently, we need a strategy for assessing Normality. Plot the data. Make a dotplot, stemplot, or histogram and see if the graph is approximately symmetric and bell-shaped. Check whether the data follow the 68-95-99.7 rule. Count how many observations fall within one, two, and three standard deviations of the mean and check to see if these percents are close to the 68%, 95%, and 99.7% targets for a Normal distribution.
Normal Distributions Normal Probability Plots Most software packages can construct Normal probability plots. These plots are constructed by plotting each observation in a data set against its corresponding percentile’s z-score. Interpreting Normal Probability Plots If the points on a Normal probability plot lie close to a straight line, the plot indicates that the data are approximately Normal. Systematic deviations from a straight line indicate a non-Normal distribution. Outliers appear as points that are far away from the overall pattern of the plot.
Scatterplots and Correlation Explanatory and Response Variables Most statistical studies examine data on more than one variable. In many of these settings, the two variables play different roles. Definition: A response variable measures an outcome of a study. An explanatory variable may help explain or influence changes in a response variable. Note: In many studies, the goal is to show that changes in one or more explanatory variables actually cause changes in a response variable. However, other explanatory-response relationships don’t involve direct causation.
Scatterplots and Correlation Displaying Relationships: Scatterplots The most useful graph for displaying the relationship between two quantitative variables is a scatterplot. Definition: A scatterplot shows the relationship between two quantitative variables measured on the same individuals. The values of one variable appear on the horizontal axis, and the values of the other variable appear on the vertical axis. Each individual in the data appears as a point on the graph. How to Make a Scatterplot 1. Decide which variable should go on each axis. Remember, the eXplanatory variable goes on the X-axis! 2. Label and scale your axes. 3. Plot individual data values.
Scatterplots and Correlation Interpreting Scatterplots To interpret a scatterplot, follow the basic strategy of data analysis from Chapters 1 and 2. Look for patterns and important departures from those patterns. How to Examine a Scatterplot As in any graph of data, look for the overall pattern and for striking departures from that pattern. You can describe the overall pattern of a scatterplot by the direction, form, and strength of the relationship. An important kind of departure is an outlier, an individual value that falls outside the overall pattern of the relationship. Remember DOFS (Direction, Outliers, Form, Strength).
Scatterplots and Correlation Measuring Linear Association: Correlation A scatterplot displays the strength, direction, and form of the relationship between two quantitative variables. Linear relationships are important because a straight line is a simple pattern that is quite common. Unfortunately, our eyes are not good judges of how strong a linear relationship is. Definition: The correlation r measures the direction and strength of the linear relationship between two quantitative variables. r is always a number between -1 and 1. r > 0 indicates a positive association. r < 0 indicates a negative association. Values of r near 0 indicate a very weak linear relationship. The strength of the linear relationship increases as r moves away from 0 towards -1 or 1. The extreme values r = -1 and r = 1 occur only in the case of a perfect linear relationship.
Scatterplots and Correlation Facts about Correlation How correlation behaves is more important than the details of the formula. Here are some important facts about r. 1. Correlation makes no distinction between explanatory and response variables. 2. r does not change when we change the units of measurement of x, y, or both. 3. The correlation r itself has no unit of measurement. Cautions: Correlation requires that both variables be quantitative. Correlation does not describe curved relationships between variables, no matter how strong the relationship is. Correlation is not resistant. r is strongly affected by a few outlying observations. Correlation is not a complete summary of two-variable data.
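Facts 1 and 2 can be demonstrated with a short sketch. The data below are invented for illustration (hypothetical heights in inches and weights in pounds), and `correlation` is a hand-written helper that computes r from sums of products of deviations, which is algebraically equivalent to averaging the products of standardized values.

```python
from math import sqrt


def correlation(xs, ys):
    """Correlation r between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data, for illustration only.
heights_in = [60, 62, 65, 68, 70]
weights_lb = [110, 120, 140, 155, 170]

r = correlation(heights_in, weights_lb)

# Fact 1: swapping explanatory and response variables leaves r unchanged.
r_swapped = correlation(weights_lb, heights_in)

# Fact 2: changing units (inches to cm, pounds to kg) leaves r unchanged.
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [w * 0.4536 for w in weights_lb]
r_metric = correlation(heights_cm, weights_kg)

print(round(r, 4), round(r_swapped, 4), round(r_metric, 4))
```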
Least-Squares Regression Interpreting a Regression Line A regression line is a model for the data, much like density curves. The equation of a regression line gives a compact mathematical description of what this model tells us about the relationship between the response variable y and the explanatory variable x. Definition: Suppose that y is a response variable (plotted on the vertical axis) and x is an explanatory variable (plotted on the horizontal axis). A regression line relating y to x has an equation of the form ŷ = a + bx In this equation, ŷ (read “y hat”) is the predicted value of the response variable y for a given value of the explanatory variable x. b is the slope, the amount by which y is predicted to change when x increases by one unit. a is the y intercept, the predicted value of y when x = 0.
Least-Squares Regression Interpreting a Regression Line Consider the regression line from the example “Does Fidgeting Keep You Slim?”: ŷ = 3.505 – 0.00344x, where ŷ is the predicted fat gain (kg) and x is the change in NEA (calories). Identify the slope and y-intercept and interpret each value in context. The y-intercept a = 3.505 kg is the fat gain estimated by this model if NEA does not change when a person overeats. The slope b = -0.00344 tells us that the amount of fat gained is predicted to go down by 0.00344 kg for each added calorie of NEA.
Least-Squares Regression Prediction We can use a regression line to predict the response ŷ for a specific value of the explanatory variable x. Use the NEA and fat gain regression line to predict the fat gain for a person whose NEA increases by 400 cal when she overeats. With slope -0.00344 and y-intercept 3.505, ŷ = 3.505 – 0.00344(400) = 2.13. We predict a fat gain of 2.13 kg for a person whose NEA increases by 400 calories.
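The prediction is a single substitution into the regression equation, sketched below. The function name `predict_fat_gain` is invented for this illustration; the intercept 3.505 and slope -0.00344 are the values given for the NEA example.

```python
def predict_fat_gain(nea_change_cal):
    """Predicted fat gain (kg) from the NEA regression line y-hat = a + b*x."""
    intercept = 3.505   # a: predicted fat gain (kg) when NEA does not change
    slope = -0.00344    # b: change in predicted fat gain (kg) per added NEA calorie
    return intercept + slope * nea_change_cal


print(round(predict_fat_gain(400), 2))  # prediction for a 400-calorie NEA increase
```

The function will happily return a number for any input, which is exactly why extrapolation is a judgment the analyst has to make: the line itself gives no warning when x is far outside the observed data.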
Least-Squares Regression Extrapolation We can use a regression line to predict the response ŷ for a specific value of the explanatory variable x. The accuracy of the prediction depends on how much the data scatter about the line. While we can substitute any value of x into the equation of the regression line, we must exercise caution in making predictions outside the observed values of x. Definition: Extrapolation is the use of a regression line for prediction far outside the interval of values of the explanatory variable x used to obtain the line. Such predictions are often not accurate. Don’t make predictions using values of x that are much larger or much smaller than those that actually appear in your data.
Least-Squares Regression Residuals In most cases, no line will pass exactly through all the points in a scatterplot. A good regression line makes the vertical distances of the points from the line as small as possible. Definition: A residual is the difference between an observed value of the response variable and the value predicted by the regression line. That is, residual = observed y – predicted y, or residual = y - ŷ. Points above the line have positive residuals; points below the line have negative residuals.
Least-Squares Regression Least-Squares Regression Line We can use technology to find the equation of the least-squares regression line. We can also write it in terms of the means and standard deviations of the two variables and their correlation. Definition: Equation of the least-squares regression line We have data on an explanatory variable x and a response variable y for n individuals. From the data, calculate the means $\bar{x}$ and $\bar{y}$ and the standard deviations $s_x$ and $s_y$ of the two variables, and their correlation r. The least-squares regression line is the line ŷ = a + bx with slope
$$b = r\,\frac{s_y}{s_x}$$
and y intercept
$$a = \bar{y} - b\,\bar{x}$$
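The standard formulas for the least-squares line, slope b = r(s_y/s_x) and intercept a = ȳ - b·x̄, can be tried on a small made-up data set. Everything in this sketch is illustrative: the five (x, y) pairs are hypothetical and `least_squares_line` is a helper written for this example.

```python
from statistics import mean, stdev


def least_squares_line(xs, ys):
    """Return (a, b, r) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    # Correlation: average product of standardized values (divisor n - 1).
    r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
            for x, y in zip(xs, ys)) / (n - 1)
    b = r * s_y / s_x        # slope
    a = y_bar - b * x_bar    # y intercept
    return a, b, r


# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b, r = least_squares_line(xs, ys)
print(f"y-hat = {a:.2f} + {b:.2f}x  (r = {r:.3f})")
```

One consequence of the intercept formula is that the line always passes through the point (x̄, ȳ).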
Least-Squares Regression Interpreting Residual Plots A residual plot magnifies the deviations of the points from the line, making it easier to see unusual observations and patterns. 1) The residual plot should show no obvious patterns. A pattern in the residuals means a linear model is not appropriate. 2) The residuals should be relatively small in size. Definition: If we use a least-squares regression line to predict the values of a response variable y from an explanatory variable x, the standard deviation of the residuals (s) is given by
$$s = \sqrt{\frac{\sum \text{residuals}^2}{n - 2}} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}}$$
Least-Squares Regression Interpreting Computer Regression Output A number of statistical software packages produce similar regression output. Be sure you can locate the slope b, the y intercept a, and the values of s and r².
Correlation and Regression Wisdom 2. Correlation and regression lines describe only linear relationships. 3. Correlation and least-squares regression lines are not resistant. Definition: An outlier is an observation that lies outside the overall pattern of the other observations. Points that are outliers in the y direction but not the x direction of a scatterplot have large residuals. Other outliers may not have large residuals. An observation is influential for a statistical calculation if removing it would markedly change the result of the calculation. Points that are outliers in the x direction of a scatterplot are often influential for the least-squares regression line.
Sampling and Surveys Population and Sample The distinction between population and sample is basic to statistics. To make sense of any sample result, you must know what population the sample represents. Definition: The population in a statistical study is the entire group of individuals about which we want information. A sample is the part of the population from which we actually collect information. We use information from a sample to draw conclusions about the entire population. [Diagram: Population and Sample. Collect data from a representative sample, then make an inference about the population.]
Sampling and Surveys The Idea of a Sample Survey We often draw conclusions about a whole population on the basis of a sample. Choosing a sample from a large, varied population is not that easy. A “sample survey” is a study that uses an organized plan to choose a sample that represents some specific population. Step 1: Define the population we want to describe. Step 2: Say exactly what we want to measure. Step 3: Decide how to choose a sample from the population.
Sampling and Surveys How to Sample Badly How can we choose a sample that we can trust to represent the population? There are a number of different methods to select samples. Definition: Choosing individuals who are easiest to reach results in a convenience sample. Definition: The design of a statistical study shows bias if it systematically favors certain outcomes. Convenience samples often produce unrepresentative data. Why?
Sampling and Surveys How to Sample Badly Convenience samples are almost guaranteed to show bias. So are voluntary response samples, in which people decide whether to join the sample in response to an open invitation. Definition: A voluntary response sample consists of people who choose themselves by responding to a general appeal. Voluntary response samples show bias because people with strong opinions (often in the same direction) are most likely to respond.
Sampling and Surveys How to Sample Well: Random Sampling The statistician’s remedy is to allow impersonal chance to choose the sample. A sample chosen by chance rules out both favoritism by the sampler and self-selection by respondents. Random sampling, the use of chance to select a sample, is the central principle of statistical sampling. Definition: A simple random sample (SRS) of size n consists of n individuals from the population chosen in such a way that every set of n individuals has an equal chance to be the sample actually selected. In practice, people use random numbers generated by a computer or calculator to choose samples. If you don’t have technology handy, you can use a table of random digits.
Sampling and Surveys How to Choose an SRS Using Table D Step 1: Label. Give each member of the population a numerical label of the same length. Step 2: Table. Read consecutive groups of digits of the appropriate length from Table D. Your sample contains the individuals whose labels you find. Definition: A table of random digits is a long string of the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 with these properties: (1) Each entry in the table is equally likely to be any of the 10 digits 0 - 9. (2) The entries are independent of each other. That is, knowledge of one part of the table gives no information about any other part.
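The label-and-table procedure can be sketched in code. This is a minimal sketch under stated assumptions: the digit string below is a made-up stand-in for a line of Table D, the population of 28 hypothetical students is invented, and the sketch skips groups that match no label or repeat an earlier label (a detail the slide does not spell out). The function name `srs_from_digits` is also hypothetical.

```python
import random

def srs_from_digits(population, n, digits):
    """Choose an SRS of size n by reading fixed-width labels from a digit string."""
    width = len(str(len(population)))  # every label has the same length
    # Step 1: Label. Units get labels 01, 02, ... (zero-padded to equal width).
    labels = {str(i + 1).zfill(width): unit for i, unit in enumerate(population)}
    chosen_labels = []
    # Step 2: Table. Read consecutive groups of `width` digits.
    for start in range(0, len(digits) - width + 1, width):
        group = digits[start:start + width]
        # Assumption: ignore groups that are not labels or were already chosen.
        if group in labels and group not in chosen_labels:
            chosen_labels.append(group)
            if len(chosen_labels) == n:
                return [labels[g] for g in chosen_labels]
    raise ValueError("ran out of digits before the sample was complete")

# Hypothetical population and digit string, for illustration only.
population = [f"student_{i}" for i in range(1, 29)]
digits = "6905164817871740951784534064898720197245"
sample = srs_from_digits(population, 4, digits)
print(sample)

# With technology, the standard library does the same job directly.
tech_sample = random.Random(0).sample(population, 4)
print(tech_sample)
```

Reading the digit string two digits at a time gives 69, 05, 16, 48, 17, ..., so the first four usable labels are 05, 16, 17 and 20.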
Sampling and Surveys Other Sampling Methods The basic idea of sampling is straightforward: take an SRS from the population and use your sample results to gain information about the population. Sometimes there are statistical advantages to using more complex sampling methods. One common alternative to an SRS involves sampling important groups (called strata) within the population separately. These “sub-samples” are combined to form one stratified random sample. Definition: To select a stratified random sample, first classify the population into groups of similar individuals, called strata. Then choose a separate SRS in each stratum and combine these SRSs to form the full sample.
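The definition of a stratified random sample translates fairly directly into code. The sketch below is one possible reading: the population, the strata names, the per-stratum sample sizes, and the function name `stratified_sample` are all assumptions made for the example.

```python
import random

def stratified_sample(population, stratum_of, n_per_stratum, rng):
    """Take a separate SRS in each stratum and combine them into one sample."""
    # Classify the population into strata.
    strata = {}
    for unit in population:
        strata.setdefault(stratum_of(unit), []).append(unit)
    # Choose an SRS within each stratum, then combine.
    sample = []
    for name in sorted(strata):
        sample.extend(rng.sample(strata[name], n_per_stratum[name]))
    return sample

# Hypothetical population: (stratum, id) pairs for two seating sections.
population = [("front", i) for i in range(30)] + [("back", i) for i in range(20)]
rng = random.Random(1)  # seeded only so the example is repeatable
sample = stratified_sample(
    population,
    stratum_of=lambda unit: unit[0],
    n_per_stratum={"front": 3, "back": 2},
    rng=rng,
)
print(sample)
```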
Sampling and Surveys Other Sampling Methods Although a stratified random sample can sometimes give more precise information about a population than an SRS, both sampling methods are hard to use when populations are large and spread out over a wide area. In that situation, we’d prefer a method that selects groups of individuals that are “near” one another. Definition: To take a cluster sample, first divide the population into smaller groups. Ideally, these clusters should mirror the characteristics of the population. Then choose an SRS of the clusters. All individuals in the chosen clusters are included in the sample.
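For contrast with stratified sampling, here is a minimal sketch of a cluster sample: the SRS is taken over the clusters themselves, and every individual in a chosen cluster enters the sample. The cluster names, their members, and the function name `cluster_sample` are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, rng):
    """Choose an SRS of clusters and include every individual in those clusters."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = [unit for name in chosen for unit in clusters[name]]
    return chosen, sample

# Hypothetical clusters: seating rows, each holding a few individuals.
clusters = {
    "row_A": ["A1", "A2", "A3"],
    "row_B": ["B1", "B2", "B3", "B4"],
    "row_C": ["C1", "C2"],
    "row_D": ["D1", "D2", "D3"],
}
rng = random.Random(2)  # seeded only so the example is repeatable
chosen, sample = cluster_sample(clusters, 2, rng)
print(chosen, sample)
```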
Experiments Observational Study versus Experiment In contrast to observational studies, experiments don’t just observe individuals or ask them questions. They actively impose some treatment in order to measure the response. Definition: An observational study observes individuals and measures variables of interest but does not attempt to influence the responses. An experiment deliberately imposes some treatment on individuals to measure their responses. When our goal is to understand cause and effect, experiments are the only source of fully convincing data. The distinction between observational study and experiment is one of the most important in statistics.
Experiments The Language of Experiments An experiment is a statistical study in which we actually do something (a treatment) to people, animals, or objects (the experimental units) to observe the response. Here is the basic vocabulary of experiments. Definition: A specific condition applied to the individuals in an experiment is called a treatment. If an experiment has several explanatory variables, a treatment is a combination of specific values of these variables. The experimental units are the smallest collection of individuals to which treatments are applied. When the units are human beings, they often are called subjects. Sometimes, the explanatory variables in an experiment are called factors. Many experiments study the joint effects of several factors. In such an experiment, each treatment is formed by combining a specific value (often called a level) of each of the factors.
Experiments How to Experiment Well: The Randomized Comparative Experiment The remedy for confounding is to perform a comparative experiment in which some units receive one treatment and similar units receive another. Most well-designed experiments compare two or more treatments. Comparison alone isn’t enough: if the treatments are given to groups that differ greatly, bias will result. The solution to the problem of bias is random assignment. Definition: In an experiment, random assignment means that experimental units are assigned to treatments at random, that is, using some sort of chance process.
Experiments The Randomized Comparative Experiment Definition: In a completely randomized design, the treatments are assigned to all the experimental units completely by chance. Some experiments may include a control group that receives an inactive treatment or an existing baseline treatment. Outline of the design: Experimental units → Random assignment → Group 1 (Treatment 1) and Group 2 (Treatment 2) → Compare results.
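Random assignment in a completely randomized design can be sketched as a shuffle followed by dealing units into groups. The sketch assumes equal (or nearly equal) group sizes, which the slide does not require, and the unit names, treatment names, and function name `completely_randomized` are hypothetical.

```python
import random

def completely_randomized(units, treatments, rng):
    """Assign every unit to a treatment purely by chance."""
    shuffled = list(units)
    rng.shuffle(shuffled)  # impersonal chance decides the order
    groups = {t: [] for t in treatments}
    # Assumption: deal units out in turn so group sizes differ by at most one.
    for i, unit in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(unit)
    return groups

# Hypothetical experimental units and treatments.
units = [f"unit_{i}" for i in range(1, 11)]
rng = random.Random(3)  # seeded only so the example is repeatable
groups = completely_randomized(units, ["treatment_1", "treatment_2"], rng)
for name, members in groups.items():
    print(name, members)
```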
Experiments Three Principles of Experimental Design Randomized comparative experiments are designed to give good evidence that differences in the treatments actually cause the differences we see in the response. 1. Control for lurking variables that might affect the response: Use a comparative design and ensure that the only systematic difference between the groups is the treatment administered. 2. Random assignment: Use impersonal chance to assign experimental units to treatments. This helps create roughly equivalent groups of experimental units by balancing the effects of lurking variables that aren’t controlled on the treatment groups. 3. Replication: Use enough experimental units in each group so that any differences in the effects of the treatments can be distinguished from chance differences between the groups.
Experiments Experiments: What Can Go Wrong? The logic of a randomized comparative experiment depends on our ability to treat all the subjects the same in every way except for the actual treatments being compared. Good experiments, therefore, require careful attention to details to ensure that all subjects really are treated identically. A response to a dummy treatment is called a placebo effect. The strength of the placebo effect is a strong argument for randomized comparative experiments. Whenever possible, experiments with human subjects should be double-blind. Definition: In a double-blind experiment, neither the subjects nor those who interact with them and measure the response variable know which treatment a subject received.
Experiments Inference for Experiments In an experiment, researchers usually hope to see a difference in the responses so large that it is unlikely to happen just because of chance variation. We can use the laws of probability, which describe chance behavior, to learn whether the treatment effects are larger than we would expect to see if only chance were operating. If they are, we call them statistically significant. Definition: An observed effect so large that it would rarely occur by chance is called statistically significant. A statistically significant association in data from a well-designed experiment does imply causation.
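One way to see what "would rarely occur by chance" means is to simulate the chance process. The sketch below is an assumption on our part, not a method stated on the slide: it repeatedly re-does the random assignment on the pooled responses and estimates how often a difference in group means at least as large as the observed one arises by chance alone. The response values and the function name `randomization_p_value` are invented.

```python
import random

def mean(values):
    return sum(values) / len(values)

def randomization_p_value(group1, group2, reps, rng):
    """Estimate how often chance alone gives a difference >= the observed one."""
    observed = mean(group1) - mean(group2)
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    count = 0
    for _ in range(reps):
        rng.shuffle(pooled)  # re-do the random assignment
        diff = mean(pooled[:n1]) - mean(pooled[n1:])
        if diff >= observed:
            count += 1
    return count / reps

# Hypothetical responses from two treatment groups.
treated = [12, 14, 15, 16, 18]
control = [5, 6, 7, 8, 9]
rng = random.Random(4)  # seeded only so the example is repeatable
p_large_effect = randomization_p_value(treated, control, 2000, rng)
p_no_effect = randomization_p_value([1, 2, 3], [1, 2, 3], 2000, rng)
print(p_large_effect, p_no_effect)
```

A small estimated proportion for the first comparison suggests the observed difference would rarely occur by chance; the second comparison, with identical groups, gives a large proportion.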
Experiments Blocking Completely randomized designs are the simplest statistical designs for experiments. But just as with sampling, there are times when the simplest method doesn’t yield the most precise results. Definition: A block is a group of experimental units that are known before the experiment to be similar in some way that is expected to affect the response to the treatments. In a randomized block design, the random assignment of experimental units to treatments is carried out separately within each block. Form blocks based on the most important unavoidable sources of variability (lurking variables) among the experimental units. Randomization will average out the effects of the remaining lurking variables and allow an unbiased comparison of the treatments. Control what you can, block on what you can’t control, and randomize to create comparable groups.
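A randomized block design differs from a completely randomized design only in where the shuffling happens: separately inside each block. The following is a minimal sketch under assumptions; the blocking variable (age group), the units, the treatment names, and the function name `randomized_block` are hypothetical.

```python
import random

def randomized_block(units, block_of, treatments, rng):
    """Carry out random assignment separately within each block."""
    # Form the blocks from a variable known before the experiment.
    blocks = {}
    for unit in units:
        blocks.setdefault(block_of(unit), []).append(unit)
    assignment = {}
    for name in sorted(blocks):
        members = list(blocks[name])
        rng.shuffle(members)  # chance operates inside the block only
        for i, unit in enumerate(members):
            assignment[unit] = treatments[i % len(treatments)]
    return assignment

# Hypothetical units: (id, age_group) pairs, blocked on age group.
units = [(i, "younger") for i in range(6)] + [(i, "older") for i in range(6, 10)]
rng = random.Random(5)  # seeded only so the example is repeatable
assignment = randomized_block(units, lambda u: u[1], ["A", "B"], rng)
for unit, treatment in sorted(assignment.items()):
    print(unit, treatment)
```

Because assignment happens within blocks, each age group ends up split evenly between the two treatments, which a completely randomized design does not guarantee.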
Experiments Matched-Pairs Design A common type of randomized block design for comparing two treatments is a matched-pairs design. The idea is to create blocks by matching pairs of similar experimental units. Definition: A matched-pairs design is a randomized block experiment in which each block consists of a matching pair of similar experimental units. Chance is used to determine which unit in each pair gets each treatment. Sometimes, a “pair” in a matched-pairs design consists of a single unit that receives both treatments. Since the order of the treatments can influence the response, chance is used to determine which treatment is applied first for each unit.
Using Studies Wisely Scope of Inference What type of inference can be made from a particular study? The answer depends on the design of the study. Well-designed experiments randomly assign individuals to treatment groups. However, most experiments don’t select experimental units at random from the larger population. That limits such experiments to inference about cause and effect. Observational studies don’t randomly assign individuals to groups, which rules out inference about cause and effect. Observational studies that use random sampling can make inferences about the population.
Using Studies Wisely Data Ethics Complex issues of data ethics arise when we collect data from people. Here are some basic standards of data ethics that must be obeyed by all studies that gather data from human subjects, both observational studies and experiments. Basic Data Ethics: All planned studies must be reviewed in advance by an institutional review board charged with protecting the safety and well-being of the subjects. All individuals who are subjects in a study must give their informed consent before data are collected. All individual data must be kept confidential. Only statistical summaries for groups of subjects may be made public.