2 Analyzing Categorical Data: Categorical Variables
Categorical variables place individuals into one of several groups or categories.
- The values of a categorical variable are labels for the different categories.
- The distribution of a categorical variable lists the count or percent of individuals who fall into each category.

Example, page 8: Frequency table and relative frequency table

Format               Count of Stations   Percent of Stations
Adult Contemporary                1556                  11.2
Adult Standards                   1196                   8.6
Contemporary Hit                   569                   4.1
Country                           2066                  14.9
News/Talk                         2179                  15.7
Oldies                            1060                   7.7
Religious                         2014                  14.6
Rock                               869                   6.3
Spanish Language                   750                   5.4
Other Formats                     1579                  11.4
Total                            13838                  99.9
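The percents in the relative frequency table can be recomputed from the counts. The sketch below is one way to do that in Python; the names `counts` and `relative_frequencies` are illustrative choices and do not come from the slides.

```python
# Station counts by format, copied from the frequency table on this slide.
counts = {
    "Adult Contemporary": 1556, "Adult Standards": 1196, "Contemporary Hit": 569,
    "Country": 2066, "News/Talk": 2179, "Oldies": 1060, "Religious": 2014,
    "Rock": 869, "Spanish Language": 750, "Other Formats": 1579,
}

def relative_frequencies(table):
    """Return each category's percent of the total, rounded to one decimal."""
    total = sum(table.values())
    return {name: round(100 * n / total, 1) for name, n in table.items()}

percents = relative_frequencies(counts)
for name, pct in percents.items():
    print(f"{name:20s}{pct:5.1f}")
```

Because each percent is rounded separately, the rounded values need not add to exactly 100, which is why the table total reads 99.9.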
3 Analyzing Categorical Data: Two-Way Tables and Marginal Distributions
When a dataset involves two categorical variables, we begin by examining the counts or percents in various categories for one of the variables.
Definition: A two-way table describes two categorical variables, organizing counts according to a row variable and a column variable.

Example, p. 12: Young adults by gender and chance of getting rich

Response                         Female   Male   Total
Almost no chance                     96     98     194
Some chance, but probably not       426    286     712
A chance                            696    720    1416
A good chance                       663    758    1421
Almost certain                      486    597    1083
Total                              2367   2459    4826

- What are the variables described by this two-way table?
- How many young adults were surveyed?

Alternate Example: Super Powers
A sample of 200 children from the United Kingdom ages 9-17 was selected from the CensusAtSchool website (www.censusatschool.com). The gender of each student was recorded along with which super power they would most like to have: invisibility, super strength, telepathy (ability to read minds), ability to fly, or ability to freeze time. Here are the results:
[Two-way table of gender by preferred super power]
4 Analyzing Categorical Data: Two-Way Tables and Marginal Distributions
Definition: The marginal distribution of one of the categorical variables in a two-way table of counts is the distribution of values of that variable among all individuals described by the table.
Note: Percents are often more informative than counts, especially when comparing groups of different sizes.
To examine a marginal distribution:
1. Use the data in the table to calculate the marginal distribution (in percents) of the row or column totals.
2. Make a graph to display the marginal distribution.
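As a rough illustration of step 1, the marginal distribution of opinion can be computed from the row totals of the "chance of getting rich" table. This is only a sketch; the variable names are made up for the example.

```python
# (female, male) counts for each response in the two-way table.
table = {
    "Almost no chance": (96, 98),
    "Some chance, but probably not": (426, 286),
    "A chance": (696, 720),
    "A good chance": (663, 758),
    "Almost certain": (486, 597),
}

# The marginal distribution of opinion uses the row totals only.
grand_total = sum(f + m for f, m in table.values())
marginal = {
    response: round(100 * (f + m) / grand_total, 1)
    for response, (f, m) in table.items()
}

for response, pct in marginal.items():
    print(f"{response:32s}{pct:5.1f}%")
```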
5 Analyzing Categorical Data: Relationships Between Categorical Variables
Marginal distributions tell us nothing about the relationship between two variables.
Definition: A conditional distribution of a variable describes the values of that variable among individuals who have a specific value of another variable.
To examine or compare conditional distributions:
1. Select the row(s) or column(s) of interest.
2. Use the data in the table to calculate the conditional distribution (in percents) of the row(s) or column(s).
3. Make a graph to display the conditional distribution. Use a side-by-side bar graph or segmented bar graph to compare distributions.
6 Analyzing Categorical Data: Two-Way Tables and Conditional Distributions
Example, p. 15: Young adults by gender and chance of getting rich

Response                         Female   Male   Total
Almost no chance                     96     98     194
Some chance, but probably not       426    286     712
A chance                            696    720    1416
A good chance                       663    758    1421
Almost certain                      486    597    1083
Total                              2367   2459    4826

- Calculate the conditional distribution of opinion among males.
- Examine the relationship between gender and opinion.

Response            Male                  Female
Almost no chance    98/2459 = 4.0%        96/2367 = 4.1%
Some chance         286/2459 = 11.6%      426/2367 = 18.0%
A chance            720/2459 = 29.3%      696/2367 = 29.4%
A good chance       758/2459 = 30.8%      663/2367 = 28.0%
Almost certain      597/2459 = 24.3%      486/2367 = 20.5%
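The conditional distributions in this example divide each column's counts by that column's total. A minimal sketch, with illustrative names:

```python
responses = ["Almost no chance", "Some chance", "A chance",
             "A good chance", "Almost certain"]
female = [96, 426, 696, 663, 486]   # column of counts for females
male = [98, 286, 720, 758, 597]     # column of counts for males

def conditional_distribution(column):
    """Percent of the column total in each response category."""
    total = sum(column)
    return [round(100 * n / total, 1) for n in column]

male_pct = conditional_distribution(male)
female_pct = conditional_distribution(female)
for response, m, f in zip(responses, male_pct, female_pct):
    print(f"{response:18s} male {m:5.1f}%   female {f:5.1f}%")
```

Comparing the two columns of percents side by side is what lets us examine the relationship between gender and opinion.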
7 Displaying Quantitative Data: Dotplots
One of the simplest graphs to construct and interpret is a dotplot. Each data value is shown as a dot above its location on a number line.
How to Make a Dotplot
1. Draw a horizontal axis (a number line) and label it with the variable name.
2. Scale the axis from the minimum to the maximum value.
3. Mark a dot above the location on the horizontal axis corresponding to each data value.
[Dotplot: Number of Goals Scored Per Game by the 2004 US Women’s Soccer Team]
8 Displaying Quantitative Data: Examining the Distribution of a Quantitative Variable
The purpose of a graph is to help us understand the data. After you make a graph, always ask, “What do I see?”
How to Examine the Distribution of a Quantitative Variable
In any graph, look for the overall pattern and for striking departures from that pattern.
- Describe the overall pattern of a distribution by its shape, center, and spread.
- Note individual values that fall outside the overall pattern. These departures are called outliers.
Don’t forget your SOCS (Shape, Outliers, Center, Spread)!
9 Displaying Quantitative Data: Describing Shape
When you describe a distribution’s shape, concentrate on the main features. Look for rough symmetry or clear skewness.
Definitions:
- A distribution is roughly symmetric if the right and left sides of the graph are approximately mirror images of each other.
- A distribution is skewed to the right (right-skewed) if the right side of the graph (containing the half of the observations with larger values) is much longer than the left side.
- It is skewed to the left (left-skewed) if the left side of the graph is much longer than the right side.
[Figure: example graphs labeled Symmetric, Skewed-left, and Skewed-right]
10 Displaying Quantitative Data: Stemplots (Stem-and-Leaf Plots)
Another simple graphical display for small data sets is a stemplot. Stemplots give us a quick picture of the distribution while including the actual numerical values.
How to Make a Stemplot
1. Separate each observation into a stem (all but the final digit) and a leaf (the final digit).
2. Write all possible stems from the smallest to the largest in a vertical column and draw a vertical line to the right of the column.
3. Write each leaf in the row to the right of its stem.
4. Arrange the leaves in increasing order out from the stem.
5. Provide a key that explains in context what the stems and leaves represent.
11 Displaying Quantitative Data: Splitting Stems and Back-to-Back Stemplots
When data values are “bunched up”, we can get a better picture of the distribution by splitting stems.
Two distributions of the same quantitative variable can be compared using a back-to-back stemplot with common stems.
[Figure: raw data for females and males, shown as a back-to-back stemplot with split stems]
Key: 4|9 represents a student who reported having 49 pairs of shoes.
12 Displaying Quantitative Data: Histograms
Quantitative variables often take many values. A graph of the distribution may be clearer if nearby values are grouped together.
The most common graph of the distribution of one quantitative variable is a histogram.
How to Make a Histogram
1. Divide the range of data into classes of equal width.
2. Find the count (frequency) or percent (relative frequency) of individuals in each class.
3. Label and scale your axes and draw the histogram. The height of the bar equals its frequency. Adjacent bars should touch, unless a class contains no individuals.
13 Displaying Quantitative Data: Making a Histogram
Example, page 35
The table on page 35 presents data on the percent of residents from each state who were born outside of the U.S.

Frequency Table
Class         Count
0 to <5          20
5 to <10         13
10 to <15         9
15 to <20         5
20 to <25         2
25 to <30         1
Total            50

[Histogram: horizontal axis "Percent of foreign-born residents", vertical axis "Number of States"]
14 Displaying Quantitative Data: Using Histograms Wisely
Here are several cautions based on common mistakes students make when using histograms.
Cautions
- Don’t confuse histograms and bar graphs.
- Don’t use counts (in a frequency table) or percents (in a relative frequency table) as data.
- Use percents instead of counts on the vertical axis when comparing distributions with different numbers of observations.
- Just because a graph looks nice, it’s not necessarily a meaningful display of data.
15 Describing Quantitative Data: Measuring Center: The Mean
The most common measure of center is the ordinary arithmetic average, or mean.
Definition: To find the mean $\bar{x}$ (pronounced “x-bar”) of a set of observations, add their values and divide by the number of observations. If the n observations are x1, x2, x3, …, xn, their mean is:
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
In mathematics, the capital Greek letter Σ is short for “add them all up.” Therefore, the formula for the mean can be written in more compact notation:
$$\bar{x} = \frac{\sum x_i}{n}$$
16 Describing Quantitative Data: Measuring Center: The Median
Another common measure of center is the median. In section 1.2, we learned that the median describes the midpoint of a distribution.
Definition: The median M is the midpoint of a distribution, the number such that half of the observations are smaller and the other half are larger.
To find the median of a distribution:
1. Arrange all observations from smallest to largest.
2. If the number of observations n is odd, the median M is the center observation in the ordered list.
3. If the number of observations n is even, the median M is the average of the two center observations in the ordered list.
17 Describing Quantitative Data: Comparing the Mean and the Median
The mean and median measure center in different ways, and both are useful.
Don’t confuse the “average” value of a variable (the mean) with its “typical” value, which we might describe by the median.
Comparing the Mean and the Median
- The mean and median of a roughly symmetric distribution are close together.
- If the distribution is exactly symmetric, the mean and median are exactly the same.
- In a skewed distribution, the mean is usually farther out in the long tail than is the median.
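The definitions of the mean and median translate directly into a few lines of Python. The sketch below uses two small made-up data sets (not from the slides) to show the mean and median agreeing for a symmetric distribution and separating for a right-skewed one.

```python
def mean(values):
    """Add the observations and divide by how many there are."""
    return sum(values) / len(values)

def median(values):
    """Middle observation of the ordered list, or the average of the two middle ones."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

symmetric = [2, 4, 6, 8, 10]       # made-up, exactly symmetric
right_skewed = [2, 4, 4, 7, 40]    # made-up, long right tail

print(mean(symmetric), median(symmetric))         # mean and median coincide
print(mean(right_skewed), median(right_skewed))   # mean is pulled toward the tail
```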
18 Describing Quantitative Data: Measuring Spread: The Interquartile Range (IQR)
A measure of center alone can be misleading. A useful numerical description of a distribution requires both a measure of center and a measure of spread.
How to Calculate the Quartiles and the Interquartile Range
To calculate the quartiles:
1. Arrange the observations in increasing order and locate the median M.
2. The first quartile Q1 is the median of the observations located to the left of the median in the ordered list.
3. The third quartile Q3 is the median of the observations located to the right of the median in the ordered list.
The interquartile range (IQR) is defined as: IQR = Q3 – Q1
19 Describing Quantitative Data: Find and Interpret the IQR
Example, page 57: Travel times to work for 20 randomly selected New Yorkers
[Figure: the 20 travel times listed in increasing order and split at the median]
Q1 = 15, M = 22.5, Q3 = 42.5
IQR = Q3 – Q1 = 42.5 – 15 = 27.5 minutes
Interpretation: The range of the middle half of travel times for the New Yorkers in the sample is 27.5 minutes.
20 Describing Quantitative Data: Identifying Outliers
In addition to serving as a measure of spread, the interquartile range (IQR) is used as part of a rule of thumb for identifying outliers.
Definition: The 1.5 x IQR Rule for Outliers
Call an observation an outlier if it falls more than 1.5 x IQR above the third quartile or below the first quartile.
Example, page 57
In the New York travel time data, we found Q1 = 15 minutes, Q3 = 42.5 minutes, and IQR = 27.5 minutes.
For these data, 1.5 x IQR = 1.5(27.5) = 41.25
Q1 – 1.5 x IQR = 15 – 41.25 = -26.25
Q3 + 1.5 x IQR = 42.5 + 41.25 = 83.75
Any travel time shorter than -26.25 minutes or longer than 83.75 minutes is considered an outlier.
[Stemplot of the travel times]
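The quartile and outlier calculations can be checked with a short script. This is a sketch, not the textbook's code. The 20 travel times listed below are an assumption: they are chosen to agree with the summary values on the slides (Q1 = 15, M = 22.5, Q3 = 42.5), since the raw data are not legible in this transcript.

```python
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Q1 and Q3 are the medians of the halves left and right of the median."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]               # observations left of the median
    upper = ordered[half + n % 2:]       # observations right of the median
    return median(lower), median(ordered), median(upper)

# ASSUMED data: 20 travel times (minutes) consistent with the slide's summary.
times = [10, 30, 5, 25, 40, 20, 10, 15, 30, 20,
         15, 20, 85, 15, 65, 15, 60, 60, 40, 45]

q1, m, q3 = quartiles(times)
iqr = q3 - q1
low_cutoff = q1 - 1.5 * iqr
high_cutoff = q3 + 1.5 * iqr
outliers = [t for t in times if t < low_cutoff or t > high_cutoff]
print(q1, m, q3, iqr, low_cutoff, high_cutoff, outliers)
```

With these assumed values, only the 85-minute travel time lies beyond the upper cutoff of 83.75.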
21 Describing Quantitative Data: The Five-Number Summary
The minimum and maximum values alone tell us little about the distribution as a whole. Likewise, the median and quartiles tell us little about the tails of a distribution.
To get a quick summary of both center and spread, combine all five numbers.
Definition: The five-number summary of a distribution consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation, written in order from smallest to largest:
Minimum, Q1, M, Q3, Maximum
22 Describing Quantitative Data: Boxplots (Box-and-Whisker Plots)
The five-number summary divides the distribution roughly into quarters. This leads to a new way to display quantitative data, the boxplot.
How to Make a Boxplot
1. Draw and label a number line that includes the range of the distribution.
2. Draw a central box from Q1 to Q3.
3. Mark the median M inside the box.
4. Extend lines (whiskers) from the box out to the minimum and maximum values that are not outliers.
23 Describing Quantitative Data: Measuring Spread: The Standard Deviation
The most common measure of spread looks at how far each observation is from the mean. This measure is called the standard deviation. Let’s explore it!
Consider the following data on the number of pets owned by a group of 9 children.
[Dotplot: number of pets, with the mean marked at 5]
1) Calculate the mean: mean = 5.
2) Calculate each deviation: deviation = observation – mean. For example, 1 – 5 = -4 and 8 – 5 = 3.
24 Describing Quantitative Data: Measuring Spread: The Standard Deviation

xi    (xi – mean)     (xi – mean)^2
1     1 – 5 = -4      (-4)^2 = 16
3     3 – 5 = -2      (-2)^2 = 4
4     4 – 5 = -1      (-1)^2 = 1
4     4 – 5 = -1      (-1)^2 = 1
4     4 – 5 = -1      (-1)^2 = 1
5     5 – 5 = 0       (0)^2 = 0
7     7 – 5 = 2       (2)^2 = 4
8     8 – 5 = 3       (3)^2 = 9
9     9 – 5 = 4       (4)^2 = 16
                      Sum = 52

3) Square each deviation.
4) Find the “average” squared deviation. Calculate the sum of the squared deviations divided by (n – 1). This is called the variance.
   “average” squared deviation = 52/(9 – 1) = 6.5. This is the variance.
5) Calculate the square root of the variance. This is the standard deviation.
   Standard deviation = square root of variance = √6.5 ≈ 2.55
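25 Describing Quantitative Data: Measuring Spread: The Standard Deviation
Definition: The standard deviation sx measures the average distance of the observations from their mean. It is calculated by finding an average of the squared distances and then taking the square root. This average squared distance is called the variance.
$$\text{variance} = s_x^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1}\sum (x_i - \bar{x})^2$$
$$\text{standard deviation} = s_x = \sqrt{\frac{1}{n - 1}\sum (x_i - \bar{x})^2}$$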
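The five steps for the standard deviation can be traced in code. The sketch below assumes the nine pet counts are 1, 3, 4, 4, 4, 5, 7, 8, 9; that list is an assumption chosen to agree with the mean of 5 and the sum of squared deviations of 52 used in the worked example.

```python
from math import sqrt

# ASSUMED data: number of pets for 9 children (mean 5, squared deviations sum to 52).
pets = [1, 3, 4, 4, 4, 5, 7, 8, 9]

n = len(pets)
mean = sum(pets) / n                               # step 1
deviations = [x - mean for x in pets]              # step 2
squared = [d ** 2 for d in deviations]             # step 3
variance = sum(squared) / (n - 1)                  # step 4: divide by n - 1
std_dev = sqrt(variance)                           # step 5

print(f"mean={mean}, sum of squares={sum(squared)}, "
      f"variance={variance}, sd={std_dev:.2f}")
```

Note the divisor is n – 1, not n, which is why the word “average” appears in quotation marks on the slide.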
26 Describing Quantitative Data: Choosing Measures of Center and Spread
We now have a choice between two descriptions for center and spread:
- Mean and Standard Deviation
- Median and Interquartile Range
Choosing Measures of Center and Spread
- The median and IQR are usually better than the mean and standard deviation for describing a skewed distribution or a distribution with outliers.
- Use mean and standard deviation only for reasonably symmetric distributions that don’t have outliers.
NOTE: Numerical summaries do not fully describe the shape of a distribution. ALWAYS PLOT YOUR DATA!
28 Describing Location in a Distribution: Measuring Position: Percentiles
One way to describe the location of a value in a distribution is to tell what percent of observations are less than it.
Definition: The pth percentile of a distribution is the value with p percent of the observations less than it.
Example, p. 85
[Stemplot of the 25 test scores in Jenny’s class]
Jenny earned a score of 86 on her test. How did she perform relative to the rest of the class?
Her score was greater than 21 of the 25 observations. Since 21 of the 25, or 84%, of the scores are below hers, Jenny is at the 84th percentile in the class’s test score distribution.
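Under the definition used here, a value's percentile is the percent of observations strictly below it. The sketch below applies that definition. The list of 25 class scores is an assumption (the stemplot is not legible in this transcript), chosen so that 21 scores fall below 86 as the example states.

```python
def percentile_of(value, data):
    """Percent of observations in data that are strictly less than value."""
    below = sum(1 for x in data if x < value)
    return 100 * below / len(data)

# ASSUMED class scores: 25 values with 21 of them below Jenny's 86.
scores = [79, 81, 80, 77, 73, 83, 74, 93, 78, 80, 75, 67, 73,
          77, 83, 86, 90, 79, 85, 83, 89, 84, 82, 77, 72]

jenny = 86
print(f"Jenny is at the {percentile_of(jenny, scores):.0f}th percentile")
```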
29 Describing Location in a Distribution: Interpreting Cumulative Relative Frequency Graphs
Use the graph from page 88 to answer the following questions.
- Was Barack Obama, who was inaugurated at age 47, unusually young?
- Estimate and interpret the 65th percentile of the distribution.
[Cumulative relative frequency graph of age at inauguration]
30 Describing Location in a Distribution: Measuring Position: z-Scores
A z-score tells us how many standard deviations from the mean an observation falls, and in what direction.
Definition: If x is an observation from a distribution that has known mean and standard deviation, the standardized value of x is:
$$z = \frac{x - \text{mean}}{\text{standard deviation}}$$
A standardized value is often called a z-score.
Jenny earned a score of 86 on her test. The class mean is 80 and the standard deviation is 6.07. What is her standardized score?
31 Describing Location in a Distribution: Using z-scores for Comparison
We can use z-scores to compare the position of individuals in different distributions.
Example, p. 91
Jenny earned a score of 86 on her statistics test. The class mean was 80 and the standard deviation was 6.07. She earned a score of 82 on her chemistry test. The chemistry scores had a fairly symmetric distribution with a mean of 76 and a standard deviation of 4. On which test did Jenny perform better relative to the rest of her class?
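One way to answer the comparison question is to standardize both scores and compare the z-scores. In the sketch below, the statistics standard deviation of 6.07 is treated as an assumption; the other values come from the example.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

z_statistics = z_score(86, mean=80, sd=6.07)   # sd = 6.07 is an ASSUMED value
z_chemistry = z_score(82, mean=76, sd=4)

print(f"statistics z = {z_statistics:.2f}, chemistry z = {z_chemistry:.2f}")
better = "chemistry" if z_chemistry > z_statistics else "statistics"
print(f"Relative to her class, Jenny did better on the {better} test.")
```

Although the raw chemistry score is lower, it sits farther above its class mean in standard deviation units.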
32 Describing Location in a Distribution: Transforming Data
Transforming converts the original observations from the original units of measurement to another scale. Transformations can affect the shape, center, and spread of a distribution.
Effect of Adding (or Subtracting) a Constant
Adding the same number a (either positive, zero, or negative) to each observation:
- adds a to measures of center and location (mean, median, quartiles, percentiles), but
- does not change the shape of the distribution or measures of spread (range, IQR, standard deviation).
Example, p. 93

             n    Mean    sx     Min   Q1   M    Q3   Max   IQR   Range
Guess (m)    44   16.02   7.14   8     11   15   17   40    6     32
Error (m)    44   3.02    7.14   -5    -2   2    4    27    6     32
33 Describing Location in a Distribution: Transforming Data
Effect of Multiplying (or Dividing) by a Constant
Multiplying (or dividing) each observation by the same nonzero number b (positive or negative):
- multiplies (divides) measures of center and location by b
- multiplies (divides) measures of spread by |b|, but
- does not change the shape of the distribution
Example, p. 95

             n    Mean    sx      Min     Q1      M      Q3      Max     IQR     Range
Error (ft)   44   9.91    23.43   -16.4   -6.56   6.56   13.12   88.56   19.68   104.96
Error (m)    44   3.02    7.14    -5      -2      2      4       27      6       32
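Both transformation rules can be checked numerically. The sketch below uses five made-up guesses in meters, not the 44 guesses from the example. The constants 13 (subtracted) and 3.28 (meters to feet) are inferred from the summary tables and should be read as assumptions.

```python
from statistics import mean, median, stdev

guesses_m = [8, 11, 15, 17, 40]              # made-up guesses, in meters
errors_m = [g - 13 for g in guesses_m]       # subtract a constant (assumed true width 13 m)
errors_ft = [3.28 * e for e in errors_m]     # multiply by a constant (assumed 3.28 ft per m)

for label, data in [("guess (m)", guesses_m),
                    ("error (m)", errors_m),
                    ("error (ft)", errors_ft)]:
    print(f"{label:11s} mean={mean(data):7.2f} median={median(data):6.2f} "
          f"sd={stdev(data):6.2f} range={max(data) - min(data):6.2f}")
```

Subtracting 13 shifts the mean and median by 13 but leaves the standard deviation and range alone; multiplying by 3.28 scales all four.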
34 Describing Location in a Distribution: Density Curve
Definition: A density curve is a curve that
- is always on or above the horizontal axis, and
- has area exactly 1 underneath it.
A density curve describes the overall pattern of a distribution. The area under the curve and above any interval of values on the horizontal axis is the proportion of all observations that fall in that interval.
The overall pattern of this histogram of the scores of all 947 seventh-grade students in Gary, Indiana, on the vocabulary part of the Iowa Test of Basic Skills (ITBS) can be described by a smooth curve drawn through the tops of the bars.
35 Normal Distributions: Normal Distributions
Definition: A Normal distribution is described by a Normal density curve. Any particular Normal distribution is completely specified by two numbers: its mean µ and standard deviation σ.
- The mean of a Normal distribution is the center of the symmetric Normal curve.
- The standard deviation is the distance from the center to the change-of-curvature points on either side.
- We abbreviate the Normal distribution with mean µ and standard deviation σ as N(µ,σ).
Why Normal distributions matter:
- Normal distributions are good descriptions for some distributions of real data.
- Normal distributions are good approximations of the results of many kinds of chance outcomes.
- Many statistical inference procedures are based on Normal distributions.
36 Normal Distributions: The 68-95-99.7 Rule
Although there are many Normal curves, they all have properties in common.
Definition: The 68-95-99.7 Rule (“The Empirical Rule”)
In the Normal distribution with mean µ and standard deviation σ:
- Approximately 68% of the observations fall within σ of µ.
- Approximately 95% of the observations fall within 2σ of µ.
- Approximately 99.7% of the observations fall within 3σ of µ.
37 Normal Distributions: Example, p. 113
The distribution of Iowa Test of Basic Skills (ITBS) vocabulary scores for 7th grade students in Gary, Indiana, is close to Normal. Suppose the distribution is N(6.84, 1.55).
1. Sketch the Normal density curve for this distribution.
2. What percent of ITBS vocabulary scores are less than 3.74?
3. What percent of the scores are between 5.29 and 9.94?
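For questions 2 and 3, note that 3.74 is two standard deviations below the mean, 5.29 is one below, and 9.94 is two above. The sketch below works the answers with the 68-95-99.7 rule and, as a cross-check that is not part of the slides, compares them with areas from Python's `statistics.NormalDist`.

```python
from statistics import NormalDist

mu, sigma = 6.84, 1.55

# Rule-based answers.
below_two_sd = (100 - 95) / 2          # half of the 5% outside 2 sigma: 2.5%
between = 68 / 2 + 95 / 2              # 34% (mu - sigma to mu) + 47.5% (mu to mu + 2 sigma)

# Cross-check with exact Normal areas.
dist = NormalDist(mu, sigma)
exact_below = 100 * dist.cdf(3.74)
exact_between = 100 * (dist.cdf(9.94) - dist.cdf(5.29))

print(f"less than 3.74: rule {below_two_sd}%, exact {exact_below:.2f}%")
print(f"5.29 to 9.94:   rule {between}%, exact {exact_between:.2f}%")
```

The rule gives round-number approximations (2.5% and 81.5%); the exact areas differ slightly because 68, 95, and 99.7 are themselves rounded.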
38 Normal Distributions: The Standard Normal Distribution
All Normal distributions are the same if we measure in units of size σ from the mean µ as center.
Definition: The standard Normal distribution is the Normal distribution with mean 0 and standard deviation 1.
If a variable x has any Normal distribution N(µ,σ) with mean µ and standard deviation σ, then the standardized variable
$$z = \frac{x - \mu}{\sigma}$$
has the standard Normal distribution, N(0,1).
39 Normal Distributions: The Standard Normal Table
Because all Normal distributions are the same when we standardize, we can find areas under any Normal curve from a single table.
Definition: The Standard Normal Table
Table A is a table of areas under the standard Normal curve. The table entry for each value z is the area under the curve to the left of z.
Suppose we want to find the proportion of observations from the standard Normal distribution that are less than 0.81. We can use Table A:

Z      .00      .01      .02
0.7    .7580    .7611    .7642
0.8    .7881    .7910    .7939
0.9    .8159    .8186    .8212

P(z < 0.81) = .7910
40 Normal Distributions: Normal Distribution Calculations
When Tiger Woods hits his driver, the distance the ball travels can be described by N(304, 8). What percent of Tiger’s drives travel between 305 and 325 yards?
Standardize both boundaries: z = (305 – 304)/8 = 0.13 and z = (325 – 304)/8 = 2.63.
Using Table A, we can find the area to the left of z = 2.63 and the area to the left of z = 0.13.
0.9957 – 0.5517 = 0.4440
About 44% of Tiger’s drives travel between 305 and 325 yards.
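The same calculation can be done without a table. This sketch uses Python's `statistics.NormalDist` in place of Table A (a substitution, not the method on the slide) and shows both the exact area and the area obtained from z-values rounded to two decimals, as a table lookup requires.

```python
from statistics import NormalDist

drives = NormalDist(mu=304, sigma=8)     # N(304, 8) from the example

# Exact area between 305 and 325 yards.
exact = drives.cdf(325) - drives.cdf(305)

# Table A style: z-values rounded by hand to 2.63 and 0.13, then looked up.
standard = NormalDist()                  # N(0, 1)
table_style = standard.cdf(2.63) - standard.cdf(0.13)

print(f"exact: {exact:.4f}   table style: {table_style:.4f}")
```

The two answers differ a little in the third decimal place because the table method rounds each z-score before looking up the area.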
41 Normal Distributions: Assessing Normality
The Normal distributions provide good models for some distributions of real data. Many statistical inference procedures are based on the assumption that the population is approximately Normally distributed. Consequently, we need a strategy for assessing Normality.
- Plot the data. Make a dotplot, stemplot, or histogram and see if the graph is approximately symmetric and bell-shaped.
- Check whether the data follow the 68-95-99.7 rule. Count how many observations fall within one, two, and three standard deviations of the mean and check to see if these percents are close to the 68%, 95%, and 99.7% targets for a Normal distribution.
42 Normal Distributions: Normal Probability Plots
Most software packages can construct Normal probability plots. These plots are constructed by plotting each observation in a data set against its corresponding percentile’s z-score.
Interpreting Normal Probability Plots
If the points on a Normal probability plot lie close to a straight line, the plot indicates that the data are approximately Normal. Systematic deviations from a straight line indicate a non-Normal distribution. Outliers appear as points that are far away from the overall pattern of the plot.
44 Scatterplots and Correlation: Explanatory and Response Variables
Most statistical studies examine data on more than one variable. In many of these settings, the two variables play different roles.
Definition: A response variable measures an outcome of a study. An explanatory variable may help explain or influence changes in a response variable.
Note: In many studies, the goal is to show that changes in one or more explanatory variables actually cause changes in a response variable. However, other explanatory-response relationships don’t involve direct causation.
45 Scatterplots and Correlation: Displaying Relationships: Scatterplots
The most useful graph for displaying the relationship between two quantitative variables is a scatterplot.
Definition: A scatterplot shows the relationship between two quantitative variables measured on the same individuals. The values of one variable appear on the horizontal axis, and the values of the other variable appear on the vertical axis. Each individual in the data appears as a point on the graph.
How to Make a Scatterplot
1. Decide which variable should go on each axis. Remember, the eXplanatory variable goes on the X-axis!
2. Label and scale your axes.
3. Plot individual data values.
46 Scatterplots and Correlation: Interpreting Scatterplots
To interpret a scatterplot, follow the basic strategy of data analysis from Chapters 1 and 2. Look for patterns and important departures from those patterns.
How to Examine a Scatterplot
As in any graph of data, look for the overall pattern and for striking departures from that pattern.
- You can describe the overall pattern of a scatterplot by the direction, form, and strength of the relationship.
- An important kind of departure is an outlier, an individual value that falls outside the overall pattern of the relationship.
Remember DOFS (Direction, Outliers, Form, Strength).
47 Scatterplots and Correlation: Measuring Linear Association: Correlation
A scatterplot displays the strength, direction, and form of the relationship between two quantitative variables.
Linear relationships are important because a straight line is a simple pattern that is quite common. Unfortunately, our eyes are not good judges of how strong a linear relationship is.
Definition: The correlation r measures the strength of the linear relationship between two quantitative variables.
- r is always a number between -1 and 1.
- r > 0 indicates a positive association.
- r < 0 indicates a negative association.
- Values of r near 0 indicate a very weak linear relationship.
- The strength of the linear relationship increases as r moves away from 0 towards -1 or 1.
- The extreme values r = -1 and r = 1 occur only in the case of a perfect linear relationship.
48 Scatterplots and Correlation: Facts about Correlation
How correlation behaves is more important than the details of the formula. Here are some important facts about r.
- Correlation makes no distinction between explanatory and response variables.
- r does not change when we change the units of measurement of x, y, or both.
- The correlation r itself has no unit of measurement.
Cautions:
- Correlation requires that both variables be quantitative.
- Correlation does not describe curved relationships between variables, no matter how strong the relationship is.
- Correlation is not resistant. r is strongly affected by a few outlying observations.
- Correlation is not a complete summary of two-variable data.
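Several of these facts can be verified numerically. The slides do not show the formula for r, so the sketch below assumes the standard one: the sum of the products of the standardized x and y values, divided by n – 1. The data are made up for illustration.

```python
from math import sqrt

def correlation(xs, ys):
    """r = sum of products of z-scores, divided by n - 1 (standard formula, assumed here)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sx = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sy = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return sum(((x - mean_x) / sx) * ((y - mean_y) / sy)
               for x, y in zip(xs, ys)) / (n - 1)

x = [1, 2, 3, 4, 5]          # made-up explanatory values
y = [2, 4, 5, 4, 5]          # made-up response values

r = correlation(x, y)
r_swapped = correlation(y, x)                                          # roles reversed
r_rescaled = correlation([100 * v for v in x], [v / 2.2 for v in y])   # units changed
r_perfect = correlation(x, [3 + 2 * v for v in x])                     # points exactly on a line

print(f"r={r:.4f} swapped={r_swapped:.4f} rescaled={r_rescaled:.4f} perfect={r_perfect:.4f}")
```

Swapping x and y or changing units leaves r unchanged, and r reaches 1 only when the points fall exactly on a line with positive slope.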
49 Least-Squares Regression: Interpreting a Regression Line
A regression line is a model for the data, much like density curves. The equation of a regression line gives a compact mathematical description of what this model tells us about the relationship between the response variable y and the explanatory variable x.
Definition: Suppose that y is a response variable (plotted on the vertical axis) and x is an explanatory variable (plotted on the horizontal axis). A regression line relating y to x has an equation of the form
ŷ = a + bx
In this equation,
- ŷ (read “y hat”) is the predicted value of the response variable y for a given value of the explanatory variable x.
- b is the slope, the amount by which y is predicted to change when x increases by one unit.
- a is the y intercept, the predicted value of y when x = 0.
50 Least-Squares Regression: Interpreting a Regression Line
Consider the regression line from the example “Does Fidgeting Keep You Slim?” Identify the slope and y-intercept and interpret each value in context.
fat gain = 3.505 – 0.00344(NEA change)
The y-intercept a = 3.505 kg is the fat gain estimated by this model if NEA does not change when a person overeats.
The slope b = -0.00344 tells us that the amount of fat gained is predicted to go down by 0.00344 kg for each added calorie of NEA.
51 Least-Squares Regression: Prediction
We can use a regression line to predict the response ŷ for a specific value of the explanatory variable x.
Use the NEA and fat gain regression line to predict the fat gain for a person whose NEA increases by 400 cal when she overeats.
ŷ = 3.505 – 0.00344(400) = 2.13
We predict a fat gain of 2.13 kg for a person whose NEA increases by 400 calories.
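Prediction is just substitution into the line's equation. In the sketch below, the intercept 3.505 and slope -0.00344 are assumptions: they are consistent with the predicted fat gain of 2.13 kg at 400 calories, but the slide text in this transcript does not show them legibly.

```python
INTERCEPT_KG = 3.505        # ASSUMED y intercept a, in kg
SLOPE_KG_PER_CAL = -0.00344  # ASSUMED slope b, in kg per calorie of NEA change

def predict_fat_gain(nea_change_cal):
    """Predicted fat gain (kg) from the line y-hat = a + b * x."""
    return INTERCEPT_KG + SLOPE_KG_PER_CAL * nea_change_cal

prediction = predict_fat_gain(400)
print(f"Predicted fat gain: {prediction:.2f} kg")
```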
52 Least-Squares Regression: Extrapolation
We can use a regression line to predict the response ŷ for a specific value of the explanatory variable x. The accuracy of the prediction depends on how much the data scatter about the line.
While we can substitute any value of x into the equation of the regression line, we must exercise caution in making predictions outside the observed values of x.
Definition: Extrapolation is the use of a regression line for prediction far outside the interval of values of the explanatory variable x used to obtain the line. Such predictions are often not accurate.
Don’t make predictions using values of x that are much larger or much smaller than those that actually appear in your data.
53 Least-Squares Regression: Residuals
In most cases, no line will pass exactly through all the points in a scatterplot. A good regression line makes the vertical distances of the points from the line as small as possible.
Definition: A residual is the difference between an observed value of the response variable and the value predicted by the regression line. That is,
residual = observed y – predicted y
residual = y – ŷ
[Figure: scatterplot with regression line; positive residuals lie above the line, negative residuals below it]
54 Least-Squares Regression: Least-Squares Regression Line
We can use technology to find the equation of the least-squares regression line. We can also write it in terms of the means and standard deviations of the two variables and their correlation.
Definition: Equation of the least-squares regression line
We have data on an explanatory variable x and a response variable y for n individuals. From the data, calculate the means and standard deviations of the two variables and their correlation. The least-squares regression line is the line ŷ = a + bx with slope
$$b = r\,\frac{s_y}{s_x}$$
and y intercept
$$a = \bar{y} - b\,\bar{x}$$
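The slope and intercept formulas can be applied directly to the summary statistics. This is a minimal sketch with made-up data and illustrative function names; it computes r with the standard z-score formula, which is an assumption since the slides do not show it.

```python
from math import sqrt

def least_squares_line(xs, ys):
    """Return (a, b) for y-hat = a + b*x using b = r*sy/sx and a = ybar - b*xbar."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sx = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sy = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    r = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)
    b = r * sy / sx              # slope
    a = mean_y - b * mean_x      # y intercept
    return a, b

x = [1, 2, 3, 4, 5]   # made-up explanatory values
y = [2, 4, 5, 4, 5]   # made-up response values
a, b = least_squares_line(x, y)
print(f"y-hat = {a:.2f} + {b:.2f}x")
```

Because a is defined as the mean of y minus b times the mean of x, the fitted line always passes through the point of means.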
55 Least-Squares Regression: Interpreting Residual Plots
A residual plot magnifies the deviations of the points from the line, making it easier to see unusual observations and patterns.
- The residual plot should show no obvious patterns.
- The residuals should be relatively small in size.
[Figure: residual plot with a clear pattern, indicating that a linear model is not appropriate]
Definition: If we use a least-squares regression line to predict the values of a response variable y from an explanatory variable x, the standard deviation of the residuals (s) is given by
$$s = \sqrt{\frac{\sum \text{residuals}^2}{n - 2}} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}}$$
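To make the definition concrete, the sketch below computes residuals and their standard deviation s for a small made-up data set. The line ŷ = 2.2 + 0.6x is the least-squares line for these particular values; the data and names are illustrative only.

```python
from math import sqrt

x = [1, 2, 3, 4, 5]      # made-up explanatory values
y = [2, 4, 5, 4, 5]      # made-up response values
a, b = 2.2, 0.6          # least-squares intercept and slope for this made-up data

predicted = [a + b * xi for xi in x]
residuals = [yi - yhat for yi, yhat in zip(y, predicted)]   # observed y - predicted y

# Standard deviation of the residuals: divide by n - 2, then take the square root.
s = sqrt(sum(e ** 2 for e in residuals) / (len(x) - 2))

for xi, e in zip(x, residuals):
    print(f"x={xi}  residual={e:+.1f}")
print(f"s = {s:.4f}")
```

Plotting these residuals against x gives the residual plot; for a least-squares line the residuals always sum to zero.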
56 Least-Squares Regression: Interpreting Computer Regression Output
A number of statistical software packages produce similar regression output. Be sure you can locate
- the slope b,
- the y intercept a,
- and the values of s and r^2.
57 Least-Squares Regression: Correlation and Regression Wisdom
2. Correlation and regression lines describe only linear relationships.
3. Correlation and least-squares regression lines are not resistant.
Definition: An outlier is an observation that lies outside the overall pattern of the other observations. Points that are outliers in the y direction but not the x direction of a scatterplot have large residuals. Other outliers may not have large residuals.
An observation is influential for a statistical calculation if removing it would markedly change the result of the calculation. Points that are outliers in the x direction of a scatterplot are often influential for the least-squares regression line.
59 Sampling and Surveys: Population and Sample
The distinction between population and sample is basic to statistics. To make sense of any sample result, you must know what population the sample represents.
Definition: The population in a statistical study is the entire group of individuals about which we want information.
A sample is the part of the population from which we actually collect information. We use information from a sample to draw conclusions about the entire population.
[Diagram: Population, collect data from a representative sample, Sample, make an inference about the population]
60 Sampling and Surveys: The Idea of a Sample Survey
We often draw conclusions about a whole population on the basis of a sample. Choosing a sample from a large, varied population is not that easy.
A “sample survey” is a study that uses an organized plan to choose a sample that represents some specific population.
Step 1: Define the population we want to describe.
Step 2: Say exactly what we want to measure.
Step 3: Decide how to choose a sample from the population.
61 How to Sample Badly. How can we choose a sample that we can trust to represent the population? There are a number of different methods to select samples. Definition: Choosing individuals who are easiest to reach results in a convenience sample. Convenience samples often produce unrepresentative data…why? Definition: The design of a statistical study shows bias if it systematically favors certain outcomes.
62 How to Sample Badly. Convenience samples are almost guaranteed to show bias. So are voluntary response samples, in which people decide whether to join the sample in response to an open invitation. Definition: A voluntary response sample consists of people who choose themselves by responding to a general appeal. Voluntary response samples show bias because people with strong opinions (often in the same direction) are most likely to respond.
63 How to Sample Well: Random Sampling. The statistician’s remedy is to allow impersonal chance to choose the sample. A sample chosen by chance rules out both favoritism by the sampler and self-selection by respondents. Random sampling, the use of chance to select a sample, is the central principle of statistical sampling. Definition: A simple random sample (SRS) of size n consists of n individuals from the population chosen in such a way that every set of n individuals has an equal chance to be the sample actually selected. In practice, people use random numbers generated by a computer or calculator to choose samples. If you don’t have technology handy, you can use a table of random digits.
64 How to Choose an SRS Using Table D. Definition: A table of random digits is a long string of the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 with these properties: • Each entry in the table is equally likely to be any of the 10 digits. • The entries are independent of each other. That is, knowledge of one part of the table gives no information about any other part. Step 1: Label. Give each member of the population a numerical label of the same length. Step 2: Table. Read consecutive groups of digits of the appropriate length from Table D. Your sample contains the individuals whose labels you find.
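The label-and-read procedure above can be sketched in a few lines. This is only an illustration: the function name srs_from_digits is hypothetical, the digit string stands in for a line of Table D, and the rule of skipping groups that match no label or repeat an earlier label is an assumption based on common practice rather than something stated on the slide.

```python
def srs_from_digits(population, n, digits):
    """Choose n individuals by reading fixed-width labels from a string of random digits."""
    width = len(str(len(population)))  # Step 1: every label has the same length
    labels = {str(i + 1).zfill(width): person for i, person in enumerate(population)}
    used = set()
    chosen = []
    # Step 2: read consecutive groups of `width` digits.
    for start in range(0, len(digits) - width + 1, width):
        group = digits[start:start + width]
        # Assumption: skip groups that match no label or repeat a label already used.
        if group in labels and group not in used:
            used.add(group)
            chosen.append(labels[group])
        if len(chosen) == n:
            return chosen
    raise ValueError("ran out of digits before the sample was complete")

# Hypothetical population and digit string, for illustration only.
students = ["Ana", "Ben", "Cal", "Dee", "Eli"]
print(srs_from_digits(students, 3, "90721355"))
```

With technology handy, Python's random.sample(students, 3) plays the same role as the table.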
65 Other Sampling Methods. The basic idea of sampling is straightforward: take an SRS from the population and use your sample results to gain information about the population. Sometimes there are statistical advantages to using more complex sampling methods. One common alternative to an SRS involves sampling important groups (called strata) within the population separately. These “sub-samples” are combined to form one stratified random sample. Definition: To select a stratified random sample, first classify the population into groups of similar individuals, called strata. Then choose a separate SRS in each stratum and combine these SRSs to form the full sample.
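A minimal sketch of the stratified procedure, assuming each individual can be mapped to a stratum by a function and that the sample size for each stratum is given. The names stratified_random_sample, stratum_of, and sizes, along with the data, are hypothetical.

```python
import random
from collections import defaultdict

def stratified_random_sample(individuals, stratum_of, sizes, seed=None):
    """Take a separate SRS in each stratum and combine them into one sample."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for person in individuals:
        strata[stratum_of(person)].append(person)   # classify into strata
    sample = []
    for name, members in strata.items():
        sample.extend(rng.sample(members, sizes[name]))  # SRS within this stratum
    return sample

# Hypothetical population: (name, region) pairs, for illustration only.
people = [("s%d" % i, "urban" if i < 6 else "rural") for i in range(10)]
sample = stratified_random_sample(people, lambda p: p[1], {"urban": 3, "rural": 2}, seed=1)
print(sample)
```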
66 Other Sampling Methods. Although a stratified random sample can sometimes give more precise information about a population than an SRS, both sampling methods are hard to use when populations are large and spread out over a wide area. In that situation, we’d prefer a method that selects groups of individuals that are “near” one another. Definition: To take a cluster sample, first divide the population into smaller groups. Ideally, these clusters should mirror the characteristics of the population. Then choose an SRS of the clusters. All individuals in the chosen clusters are included in the sample.
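Cluster sampling differs from stratified sampling in what gets chosen at random: whole clusters rather than individuals. A minimal sketch, assuming the clusters are already given as a dictionary; the function name cluster_sample and the classroom data are hypothetical.

```python
import random

def cluster_sample(clusters, k, seed=None):
    """Choose an SRS of k clusters and include every individual in the chosen clusters."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), k)   # SRS of cluster names
    sample = []
    for name in chosen:
        sample.extend(clusters[name])          # all individuals in the cluster
    return chosen, sample

# Hypothetical clusters (classrooms and their students), for illustration only.
classrooms = {
    "Room 101": ["Ana", "Ben"],
    "Room 102": ["Cal", "Dee", "Eli"],
    "Room 103": ["Fay"],
    "Room 104": ["Gus", "Hal"],
}
print(cluster_sample(classrooms, 2, seed=3))
```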
67 Experiments: Observational Study versus Experiment. In contrast to observational studies, experiments don’t just observe individuals or ask them questions. They actively impose some treatment in order to measure the response. Definition: An observational study observes individuals and measures variables of interest but does not attempt to influence the responses. An experiment deliberately imposes some treatment on individuals to measure their responses. When our goal is to understand cause and effect, experiments are the only source of fully convincing data. The distinction between observational study and experiment is one of the most important in statistics.
68 The Language of Experiments. An experiment is a statistical study in which we actually do something (a treatment) to people, animals, or objects (the experimental units) to observe the response. Here is the basic vocabulary of experiments. Definition: A specific condition applied to the individuals in an experiment is called a treatment. If an experiment has several explanatory variables, a treatment is a combination of specific values of these variables. The experimental units are the smallest collection of individuals to which treatments are applied. When the units are human beings, they often are called subjects. Sometimes, the explanatory variables in an experiment are called factors. Many experiments study the joint effects of several factors. In such an experiment, each treatment is formed by combining a specific value (often called a level) of each of the factors.
69 How to Experiment Well: The Randomized Comparative Experiment. The remedy for confounding is to perform a comparative experiment in which some units receive one treatment and similar units receive another. Most well-designed experiments compare two or more treatments. Comparison alone isn’t enough: if the treatments are given to groups that differ greatly, bias will result. The solution to the problem of bias is random assignment. Definition: In an experiment, random assignment means that experimental units are assigned to treatments at random, that is, using some sort of chance process.
70 The Randomized Comparative Experiment. Definition: In a completely randomized design, the treatments are assigned to all the experimental units completely by chance. Some experiments may include a control group that receives an inactive treatment or an existing baseline treatment. Diagram: Experimental units → random assignment → Group 1 (Treatment 1) and Group 2 (Treatment 2) → compare results.
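A completely randomized design can be sketched by shuffling all the experimental units and dealing them out to the treatments. This is one simple way to do it, assuming groups of (nearly) equal size are wanted; the function name completely_randomized_design and the unit labels are hypothetical.

```python
import random

def completely_randomized_design(units, treatments, seed=None):
    """Assign every unit to a treatment completely by chance, in near-equal groups."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)                      # chance decides the order
    groups = {t: [] for t in treatments}
    for i, unit in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(unit)  # deal units out in turn
    return groups

# Hypothetical units and treatments, for illustration only.
subjects = ["subject %d" % i for i in range(1, 11)]
print(completely_randomized_design(subjects, ["Treatment 1", "Treatment 2"], seed=5))
```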
71 Principles of Experimental Design. Randomized comparative experiments are designed to give good evidence that differences in the treatments actually cause the differences we see in the response. Three Principles of Experimental Design: 1. Control for lurking variables that might affect the response: Use a comparative design and ensure that the only systematic difference between the groups is the treatment administered. 2. Random assignment: Use impersonal chance to assign experimental units to treatments. This helps create roughly equivalent groups of experimental units by balancing the effects of lurking variables that aren’t controlled on the treatment groups. 3. Replication: Use enough experimental units in each group so that any differences in the effects of the treatments can be distinguished from chance differences between the groups.
72 Experiments: What Can Go Wrong? The logic of a randomized comparative experiment depends on our ability to treat all the subjects the same in every way except for the actual treatments being compared. Good experiments, therefore, require careful attention to details to ensure that all subjects really are treated identically. A response to a dummy treatment is called a placebo effect. The strength of the placebo effect is a strong argument for randomized comparative experiments. Whenever possible, experiments with human subjects should be double-blind. Definition: In a double-blind experiment, neither the subjects nor those who interact with them and measure the response variable know which treatment a subject received.
73 Inference for Experiments. In an experiment, researchers usually hope to see a difference in the responses so large that it is unlikely to happen just because of chance variation. We can use the laws of probability, which describe chance behavior, to learn whether the treatment effects are larger than we would expect to see if only chance were operating. If they are, we call them statistically significant. Definition: An observed effect so large that it would rarely occur by chance is called statistically significant. A statistically significant association in data from a well-designed experiment does imply causation.
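One way to ask whether an effect is larger than chance alone would produce is to simulate the random assignment many times. The sketch below is a simulation-based randomization test for a difference in means, which is an assumed choice of method (the slide does not name one); the function name randomization_p_value and the response data are hypothetical.

```python
import random
from statistics import mean

def randomization_p_value(group1, group2, reps=5000, seed=None):
    """Estimate how often chance alone gives a difference in means at least as large as observed."""
    rng = random.Random(seed)
    observed = mean(group1) - mean(group2)
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    count = 0
    for _ in range(reps):
        rng.shuffle(pooled)                    # redo the random assignment
        diff = mean(pooled[:n1]) - mean(pooled[n1:])
        if abs(diff) >= abs(observed):
            count += 1
    return count / reps

# Hypothetical responses for two treatment groups, for illustration only.
treatment = [10, 11, 12, 13, 14]
control = [1, 2, 3, 4, 5]
print(randomization_p_value(treatment, control, seed=1))
```

A small result means the observed difference would rarely occur by chance, which is what the definition above calls statistically significant.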
74 Blocking. Completely randomized designs are the simplest statistical designs for experiments. But just as with sampling, there are times when the simplest method doesn’t yield the most precise results. Definition: A block is a group of experimental units that are known before the experiment to be similar in some way that is expected to affect the response to the treatments. In a randomized block design, the random assignment of experimental units to treatments is carried out separately within each block. Form blocks based on the most important unavoidable sources of variability (lurking variables) among the experimental units. Randomization will average out the effects of the remaining lurking variables and allow an unbiased comparison of the treatments. Control what you can, block on what you can’t control, and randomize to create comparable groups.
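A randomized block design repeats the random assignment separately inside each block. A minimal sketch, assuming each unit can be mapped to its block by a function; the names randomized_block_design and block_of and the data are hypothetical.

```python
import random
from collections import defaultdict

def randomized_block_design(units, block_of, treatments, seed=None):
    """Randomly assign treatments to units separately within each block."""
    rng = random.Random(seed)
    blocks = defaultdict(list)
    for unit in units:
        blocks[block_of(unit)].append(unit)    # form the blocks first
    assignment = {}
    for members in blocks.values():
        shuffled = list(members)
        rng.shuffle(shuffled)                  # randomize within this block only
        for i, unit in enumerate(shuffled):
            assignment[unit] = treatments[i % len(treatments)]
    return assignment

# Hypothetical units: (name, block) pairs, for illustration only.
plots = [("plot %d" % i, "wet" if i <= 4 else "dry") for i in range(1, 9)]
print(randomized_block_design(plots, lambda u: u[1], ["Treatment 1", "Treatment 2"], seed=4))
```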
75 Matched-Pairs Design. A common type of randomized block design for comparing two treatments is a matched-pairs design. The idea is to create blocks by matching pairs of similar experimental units. Definition: A matched-pairs design is a randomized blocked experiment in which each block consists of a matching pair of similar experimental units. Chance is used to determine which unit in each pair gets each treatment. Sometimes, a “pair” in a matched-pairs design consists of a single unit that receives both treatments. Since the order of the treatments can influence the response, chance is used to determine which treatment is applied first for each unit.
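In a matched-pairs design the chance step is a coin flip within each pair. A minimal sketch, assuming the pairs have already been formed; the function name matched_pairs_assignment and the pair labels are hypothetical.

```python
import random

def matched_pairs_assignment(pairs, treatments=("Treatment 1", "Treatment 2"), seed=None):
    """For each pair, let chance decide which unit gets which of the two treatments."""
    rng = random.Random(seed)
    assignment = []
    for first, second in pairs:
        if rng.random() < 0.5:                 # the "coin flip" for this pair
            assignment.append({first: treatments[0], second: treatments[1]})
        else:
            assignment.append({first: treatments[1], second: treatments[0]})
    return assignment

# Hypothetical matched pairs, for illustration only.
pairs = [("twin A1", "twin A2"), ("twin B1", "twin B2"), ("twin C1", "twin C2")]
for row in matched_pairs_assignment(pairs, seed=6):
    print(row)
```

When a single unit receives both treatments, the same coin flip can instead decide which treatment that unit receives first.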
76 Using Studies Wisely: Scope of Inference. What type of inference can be made from a particular study? The answer depends on the design of the study. Well-designed experiments randomly assign individuals to treatment groups. However, most experiments don’t select experimental units at random from the larger population. That limits such experiments to inference about cause and effect. Observational studies don’t randomly assign individuals to groups, which rules out inference about cause and effect. Observational studies that use random sampling can make inferences about the population.
77 Data Ethics. Complex issues of data ethics arise when we collect data from people. Here are some basic standards of data ethics that must be obeyed by all studies that gather data from human subjects, both observational studies and experiments. Basic Data Ethics: All planned studies must be reviewed in advance by an institutional review board charged with protecting the safety and well-being of the subjects. All individuals who are subjects in a study must give their informed consent before data are collected. All individual data must be kept confidential. Only statistical summaries for groups of subjects may be made public.