Presentation is loading. Please wait.

Presentation is loading. Please wait.

Copyright © 2013, 2009, and 2007, Pearson Education, Inc. Chapter 2 Exploring Data with Graphs and Numerical Summaries Section 2.1 Different Types of.

Similar presentations


Presentation on theme: "Copyright © 2013, 2009, and 2007, Pearson Education, Inc. Chapter 2 Exploring Data with Graphs and Numerical Summaries Section 2.1 Different Types of."— Presentation transcript:

1

2 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. Chapter 2 Exploring Data with Graphs and Numerical Summaries Section 2.1 Different Types of Data

3 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 3 A variable is any characteristic observed in a study. Examples: Marital status, Height, Weight, IQ A variable can be classified as either  Categorical (in Categories), or  Quantitative (Numerical) Variable

4 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 4 A variable can be classified as categorical if each observation belongs to one of a set of categories: Examples:  Gender (Male or Female)  Religious Affiliation (Catholic, Jewish, …)  Type of Residence (Apartment, Condo, …)  Belief in Life After Death (Yes or No) Categorical Variable

5 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 5 A variable is called quantitative if observations on it take numerical values that represent different magnitudes of the variable. Examples:  Age  Number of Siblings  Annual Income Quantitative Variable

6 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 6 For Quantitative variables: key features are the center and spread (variability) of the data. Example: What’s a typical annual amount of precipitation? Is there much variation from year to year? For Categorical variables: a key feature is the percentage of observations in each of the categories. Example: What percentage of students at a certain college are Democrats? Main Features of Quantitative and Categorical Variables

7 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 7 A quantitative variable is discrete if its possible values form a set of separate numbers, such as 0,1,2,3,…. Discrete variables have a finite number of possible values. Examples:  Number of pets in a household  Number of children in a family  Number of foreign languages spoken by an individual Discrete Quantitative Variable

8 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 8 A quantitative variable is continuous if its possible values form an interval. Continuous variables have an infinite number of possible values. Examples:  Height/Weight  Age  Blood pressure Continuous Quantitative Variable

9 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 9 Identify the variable type as either categorical or quantitative.  Number of siblings in a family  County of residence  Distance (in miles) of commute to school  Marital status Class Problem #1

10 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 10 Identify each of the following variables as continuous or discrete.  Length of time to take a test  Number of people waiting in line  Number of speeding tickets received last year  Your dog’s weight Class Problem #2

11 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 11 The proportion of the observations that fall in a certain category is the frequency (count) of observations in that category divided by the total number of observations. Frequency of that class Sum of all frequencies The percentage is the proportion multiplied by 100. Proportions and percentages are also called relative frequencies. Proportion & Percentage (Relative Frequencies)

12 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 12 There were 268 reported shark attacks in Florida between 2000 and 2010. Table 2.1 classifies 715 shark attacks reported from 2000 through 2010, so, for Florida,  268 is the frequency.  0.375 =268/715 is the proportion and relative frequency.  37.5 is the percentage 0.375 *100= 37.5%. Frequency, Proportion, & Percentage Example

13 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 13 A frequency table is a listing of possible values for a variable, together with the number of observations and/or relative frequencies for each value. Frequency Table Table 2.1 Frequency of Shark Attacks in Various Regions for 2000–2010

14 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 14 A stock broker has been following different stocks over the last month and has recorded whether a stock is up, the same, or down in value. The results were: 1.What is the variable of interest? 2.What type of variable is it? 3.Add proportions to this frequency table. Class Problem #3

15 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. Chapter 2 Exploring Data with Graphs and Numerical Summaries Section 2.2 Graphical Summaries of Data

16 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 16 A graph or frequency table describes a distribution. A distribution tells us the possible values a variable takes as well as the occurrence of those values (frequency or relative frequency). Distribution

17 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 17 The two primary graphical displays for summarizing a categorical variable are the pie chart and the bar graph. Pie Chart: A circle having a “slice of pie” for each category. Bar Graph: A graph that displays a vertical bar for each category. Graphs for Categorical Variables

18 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 18 Pie Charts:  Used for summarizing a categorical variable.  Drawn as a circle where each category is represented as a “slice of the pie”.  The size of each pie slice is proportional to the percentage of observations falling in that category. Pie Charts

19 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 19 Example: Renewable Electricity Table 2.2 Sources of Electricity in the United States and Canada, 2009

20 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 20 Figure 2.1 Pie Chart of Electricity Sources in the United States. The label for each slice of the pie gives the category and the percentage of electricity generated from that source. The slice that represents the percentage generated by coal is 45% of the total area of the pie. Question: Why is it beneficial to label the pie wedges with the percent? Example: Renewable Electricity

21 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 21 Bar graphs are used for summarizing a categorical variable.  Bar Graphs display a vertical bar for each category.  The height of each bar represents either counts (“frequencies”) or percentages (“relative frequencies”) for that category.  It is usually easier to compare categories with a bar graph rather than with a pie chart.  Bar Graphs are called Pareto Charts when the categories are ordered by their frequency, from the tallest bar to the shortest bar. Bar Graphs

22 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 22 Figure 2.2 Bar Graph of Electricity Sources in the United States. The bars are ordered from largest to smallest based on the percentage use. (Pareto Chart). Example: Renewable Electricity

23 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 23 Dot Plot: shows a dot for each observation placed above its value on a number line. Stem-and-Leaf Plot: portrays the individual observations. Histogram: uses bars to portray the data. Graphs for Quantitative Variables

24 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 24 How do we decide which to use? Here are some guidelines: Dot-plot and stem-and-leaf plot  More useful for small data sets  Data values are retained Histogram  More useful for large data sets  Most compact display  More flexibility in defining intervals Choosing a Graph Type

25 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 25 Dot Plots are used for summarizing a quantitative variable. To construct a dot plot 1.Draw a horizontal line. 2.Label it with the name of the variable. 3.Mark regular values of the variable on it. 4.For each observation, place a dot above its value on the number line. Dot Plots

26 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 26 1.Divide the range of the data into intervals of equal width. 2.Count the number of observations in each interval, creating a frequency table. 3.On the horizontal axis, label the values or the endpoints of the intervals. 4.Draw a bar over each value or interval with height equal to its frequency (or percentage), values of which are marked on the vertical axis. 5.Label and title appropriately. Steps for Constructing a Histogram

27 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 27 Example: Sodium in Cereals Table 2.4 Frequency Table for Sodium in 20 Breakfast Cereals. The table summarizes the sodium values using eight intervals and lists the number of observations in each, as well as the proportions and percentages.

28 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 28 Figure 2.6 Histogram of Breakfast Cereal Sodium Values. The rectangular bar over an interval has height equal to the number of observations in the interval. Histogram for Sodium in Cereals

29 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 29 Overall pattern consists of center, spread, and shape.  Assess where a distribution is centered by finding the median (50% of data below median 50% of data above).  Assess the spread of a distribution.  Shape of a distribution: roughly symmetric, skewed to the right, or skewed to the left. Interpreting Histograms

30 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 30 A distribution is skewed to the left if the left tail is longer than the right tail. Shape A distribution is skewed to the right if the right tail is longer than the left tail. Symmetric Distributions: if both left and right sides of the histogram are mirror images of each other.

31 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 31 Examples of Skewness

32 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 32 An outlier falls far from the rest of the data. Outlier

33 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 33 Used for displaying a time series, a data set collected over time.  Plots each observation on the vertical scale against the time it was measured on the horizontal scale. Points are usually connected.  Common patterns in the data over time, known as trends, should be noted.  To see a trend more clearly, it is beneficial to connect the data points in their time sequence. Time Plots

34 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 34 The number of people in the United Kingdom between 2006 and 2010 who used the Internet (in millions). Adults using the Internet everyday Time Plots Example

35 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. Chapter 2 Exploring Data with Graphs and Numerical Summaries Section 2.3 Measuring the Center of Quantitative Data

36 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 36 The mean is the sum of the observations divided by the number of observations. Mean

37 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 37 Example: Center of the Cereal Sodium Data We find the mean by adding all the observations and then dividing this sum by the number of observations, which is 20: 0, 340, 70, 140, 200, 180, 210, 150, 100, 130, 140, 180, 190, 160, 290, 50, 220, 180, 200, 210 Mean = (0 + 340 + 70 +... +210)/20 = 3340/20 =167

38 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 38 The median is the midpoint of the observations when they are ordered from the smallest to the largest (or from the largest to smallest). How to Determine the Median: Put the n observations in order of their size. If the number of observations, n, is:  odd, then the median is the middle observation.  even, then the median is the average of the two middle observations. Median

39 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 39 CO 2 pollution levels in 8 largest nations measured in metric tons per person: China 4.9 Brazil 1.9 India 1.4 Pakistan 0.9 United States 18.9 Russia 10.8 Indonesia 1.8 Bangladesh 0.3 The CO 2 values have n = 8 observations. The ordered values are: 0.3, 0.9, 1.4, 1.8, 1.9, 4.9, 10.8, 18.9 Example: CO 2 Pollution

40 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 40 Since n is even, two observations are in the middle, the fourth and fifth ones in the ordered sample. These are 1.8 and 1.9. The median is their average, 1.85. The relatively high value of 18.9 falls well above the rest of the data. It is an outlier. The size of the outlier affects the calculation of the mean but not the median. Example: CO 2 Pollution

41 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 41 The shape of a distribution influences whether the mean is larger or smaller than the median.  Perfectly symmetric, the mean equals the median.  Skewed to the right, the mean is larger than the median.  Skewed to the left, the mean is smaller than the median. Comparing the Mean and Median

42 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 42 In a skewed distribution, the mean is farther out in the long tail than is the median.  For skewed distributions the median is preferred because it is better representative of a typical observation. Figure 2.9 Relationship Between the Mean and Median. Question: For skewed distributions, what causes the mean and median to differ? Comparing the Mean and Median

43 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 43 A numerical summary measure is resistant if extreme observations (outliers) have little, if any, influence on its value.  The Median is resistant to outliers.  The Mean is not resistant to outliers. Resistant Measures

44 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 44 Mode  Value that occurs most often.  Highest bar in the histogram.  The mode is most often used with categorical data. The Mode

45 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. Chapter 2 Exploring Data with Graphs and Numerical Summaries Section 2.4 Measuring the Variability of Quantitative Data

46 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 46 One way to measure the spread is to calculate the range. The range is the difference between the largest and smallest values in the data set: Range = max  min The range is simple to compute and easy to understand, but it uses only the extreme values and ignores the other values. Therefore, it’s affected severely by outliers. Range

47 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 47 The deviation of an observation from the mean is, the difference between the observation and the sample mean.  Each data value has an associated deviation from the mean.  A deviation is positive if the value falls above the mean and negative if the value falls below the mean.  The sum of the deviations for all the values in a data set is always zero. Standard Deviation

48 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 48 For the cereal sodium values, the mean is = 167. The observation of 210 for Honeycomb has a deviation of 210 - 167 = 43. The observation of 50 for Honey Smacks has a deviation of 50 - 167 = -117. Figure 2.11 shows these deviations Figure 2.9 Dot Plot for Cereal Sodium Data, Showing Deviations for Two Observations. Question: When is a deviation positive and when is it negative? Standard Deviation

49 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 49 Gives a measure of variation by summarizing the deviations of each observation from the mean and calculating an adjusted average of these deviations. The Standard Deviation s of n Observations

50 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 50 1.Find the mean. 2.Find the deviation of each value from the mean. 3.Square the deviations. 4.Sum the squared deviations. 5.Divide the sum by n-1 and take the square root of that value. Standard Deviation

51 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 51 Metabolic rates of 7 men (cal./24hr) : Standard Deviation

52 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 52 Standard Deviation

53 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 53 The most basic property of the standard deviation is this: The larger the standard deviation, the greater the variability of the data.  measures the spread of the data.  only when all observations have the same value, otherwise. As the spread of the data increases, gets larger.  has the same units of measurement as the original observations. The variance = has units that are squared.  is not resistant. Strong skewness or a few outliers can greatly increase. Properties of the Standard Deviation

54 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 54 If a distribution of data is bell shaped, then approximately:  68% of the observations fall within 1 standard deviation of the mean, that is, between the values of and (denoted ).  95% of the observations fall within 2 standard deviations of the mean.  All or nearly all observations fall within 3 standard deviations of the mean. Magnitude of s: The Empirical Rule

55 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 55 Figure 2.12 The Empirical Rule. For bell-shaped distributions, this tells us approximately how much of the data fall within 1, 2, and 3 standard deviations of the mean. Question: About what percentage would fall more than 2 standard deviations from the mean? Magnitude of s: The Empirical Rule

56 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. Chapter 2 Exploring Data with Graphs and Numerical Summaries Section 2.5 Using Measures of Position to Describe Variability

57 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 57 The th percentile is a value such that percent of the observations fall below or at that value. Percentile

58 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 58 Quartiles Figure 2.14 The Quartiles Split the Distribution Into Four Parts. 25% is below the first quartile (Q1), 25% is between the first quartile and the second quartile (the median, Q2), 25% is between the second quartile and the third quartile (Q3), and 25% is above the third quartile. Question: Why is the second quartile also the median?

59 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 59 SUMMARY: Finding Quartiles  Arrange the data in order.  Consider the median. This is the second quartile, Q2.  Consider the lower half of the observations (excluding the median itself if n is odd). The median of these observations is the first quartile, Q1.  Consider the upper half of the observations (excluding the median itself if n is odd).  Their median is the third quartile, Q3.

60 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 60 Consider the sodium values for the 20 breakfast cereals. What are the quartiles for the 20 cereal sodium values? From Table 2.3, the sodium values, in ascending order, are: The median of the 20 values is the average of the 10th and 11 th observations, 180 and 180, which is Q2 = 180 mg. The first quartile Q1 is the median of the 10 smallest observations (in the top row), which is the average of 130 and 140, Q1 = 135mg. The third quartile Q3 is the median of the 10 largest observations (in the bottom row), which is the average of 200 and 210, Q3 = 205 mg. Example: Cereal Sodium Data

61 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 61 The interquartile range is the distance between the third quartile and first quartile: IQR = Q3  Q1 IQR gives spread of middle 50% of the data The Interquartile Range (IQR) In Words: If the interquartile range of U.S. music teacher salaries equals $16,000, this means that for the middle 50% of the distribution of salaries, $16,000 is the distance between the largest and smallest salaries.

62 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 62 Examining the data for unusual observations, such as outliers, is important in any statistical analysis. Is there a formula for flagging an observation as potentially being an outlier? The 1.5 x IQR Criterion for Identifying Potential Outliers An observation is a potential outlier if it falls more than 1.5 x IQR below the first quartile or more than 1.5 x IQR above the third quartile. Detecting Potential Outliers

63 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 63 The 5-number summary is the basis of a graphical display called the box plot, and consists of  Minimum value  First Quartile  Median  Third Quartile  Maximum value 5-Number Summary of Positions

64 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 64  A box goes from Q1 to Q3.  A line is drawn inside the box at the median.  A line goes from the lower end of the box to the smallest observation that is not a potential outlier and from the upper end of the box to the largest observation that is not a potential outlier.  The potential outliers are shown separately. SUMMARY: Constructing a Box Plot

65 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 65 Figure 2.15 shows a box plot for the sodium values. Labels are also given for the five-number summary of positions. Example: Boxplot for Cereal Sodium Data Figure 2.15 Box Plot and Five-Number Summary for 20 Breakfast Cereal Sodium Values. The central box contains the middle 50% of the data. The line in the box marks the median. Whiskers extend from the box to the smallest and largest observations, which are not identified as potential outliers. Potential outliers are marked separately. Question: Why is the left whisker drawn down only to 50 rather than to 0?

66 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 66 Comparing Distributions A box plot does not portray certain features of a distribution, such as distinct mounds and possible gaps, as clearly as does a histogram. Box plots are useful for identifying potential outliers. Figure 2.16 Box Plots of Male and Female College Student Heights. The box plots use the same scale for height. Question: What are approximate values of the quartiles for the two groups?

67 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 67 The z-score also identifies position and potential outliers. The z-score for an observation is the number of standard deviations that it falls from the mean. A positive z-score indicates the observation is above the mean. A negative z- score indicates the observation is below the mean. For sample data, the z -score is calculated as: An observation from a bell-shaped distribution is a potential outlier if its z-score +3 (3 standard deviation criterion). Z-Score

68 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. Chapter 2 Exploring Data with Graphs and Numerical Summaries Section 2.6 Recognizing and Avoiding Misuses of Graphical Summaries

69 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 69  Label both axes and provide proper headings.  To better compare relative size, the vertical axis should start at 0.  Be cautious in using anything other than bars, lines, or points.  It can be difficult to portray more than one group on a single graph when the variable values differ greatly. Consider instead using separate graphs, or plotting relative sizes such as ratios or percentages. Summary: Guidelines for Constructing Effective Graphs

70 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 70 Example: Heading in Different Directions Figure 2.18 An Example of a Poor Graph. Question: What’s misleading about the way the data are presented?

71 Copyright © 2013, 2009, and 2007, Pearson Education, Inc. 71 Figure 2.19 A Better Graph for the Data in Figure 2.18. Question: What trends do you see in the enrollments from 1996 to 2004? Example: Heading in Different Directions


Download ppt "Copyright © 2013, 2009, and 2007, Pearson Education, Inc. Chapter 2 Exploring Data with Graphs and Numerical Summaries Section 2.1 Different Types of."

Similar presentations


Ads by Google