Presentation is loading. Please wait.

Presentation is loading. Please wait.

Chapter 2: Exploring Data with Graphs and Numerical Summaries

Similar presentations


Presentation on theme: "Chapter 2: Exploring Data with Graphs and Numerical Summaries"— Presentation transcript:

1 Chapter 2: Exploring Data with Graphs and Numerical Summaries

2 Introduction to descriptive Statistics
we will learn: Types of data using graphs to describe data ways of summarizing the data numerically.

3 Section 2.1: Types of Data Definition: A variable is any characteristic that is recorded for subjects in a study. Examples: the height, weight, major, GPA…of the students

4 Definitions Definition: The data values that we observe for a variable are referred to as the observations. Example: GPA of the students in our class 3.0 3.4 2.9 ……

5 Categorical vs Quantitative
Each observation can be numerical, such as number of centimeters of precipitation in a day. or category, such as "yes" or "no" for whether it rained.

6 Categorical vs Quantitative
Definition: A variable is called categorical if each observation belongs to one of a set of categories. Examples: Gender (Male or Female), Religious affiliation (Catholic, Jewish, etc.), Place of residence (house, apartment…) belief in life after death (Yes or No)

7 Categorical vs Quantitative
Definition: A variable is called quantitative if observations on it take numerical values that represent different magnitudes of the variable. Examples: age annual income number of years of education completed

8 True or False: Q: The variables represented by numbers are quantitative variables. True or False? A:

9 Practice: Are the following variables categorical or quantitative?
choice of diet (vegetarian, nonvegetarian). number of people you have known who have been elected to a political office. the weight of a dog.

10 Discrete vs Continuous
Definition: A quantitative variable is discrete if its possible values form a set of separate numbers such as 0, 1, 2, 3… Examples: Number of pets in a household, number of children in a family, number of foreign languages spoken

11 Discrete vs Continuous
Definition: A quantitative variable is continuous if its possible values form an interval. Note: We only talk about discrete or continuous for quantitative variables. Examples: height, weight, age, amount of time it takes to complete an assignment.

12 Practice: Are these quantitative variables discrete or continuous?
the number of people in line at a box office to purchase theater tickets. the length of time to run marathon. the number of people you have dated in the past month. the weight of a dog..

13 Section 2.2: Using graphic summaries to describe data
Frequency table: Frequency table is not a graphic summary. But usually, it well summarizes the information from the data sets and helps to build a graphic summary. If we want to draw a graph by hand, not from software, we often do frequency table first.

14 Example: 735 shark attacks reported from 1990 through 2002
Region Counts Proportion Percentage Florida 289 Hawaii 44 California 34 Australia Brazil 55 South Africa 64 Others 205 Total 735

15 Questions: What is the variable? Is it categorical or quantitative?
How to find the proportion and percentage?

16 Categorical data For Categorical variables, the key feature is the percentage in each of the categories. The two commonly used graphic tools are pie chart and bar graph.

17 Pie chart Pie chart is a circle having a "slice of the pie" for each category. The size of a slice corresponds to the percentage of observations in the category. Example: Shark attacks regions.

18 Bar chart Bar graph is a graph that displays a vertical bar for each category. The height of the bar is the percentage of observations in the category. Example: Shark attacks regions.

19 Pareto Chart Pareto chart is a special type of bar chart where the bars being plotted are arranged in descending order. Example: Shark attacks regions.

20 Questions What information do you get from the plots.

21 Quantitative data We need to describe the whole shape of the distribution. The two commonly used graphic tools are Stem-and-leaf Plot and histogram.

22 Example: A demo data There are 20 data. Their values are 0, 21, 26,13, 22, 29, 21, 14, 22, 20, 13, 17, 25, 15, 17, 7, 23, 20, 29, 18.

23 Stem-and-Leaf Plot In stem-and-leaf plot,
each observation is represented by a stem and a leaf. Usually the stem consist of all the digits except for the final one, which is the leaf. The decimal point is 2 digit(s) to the right of the | 0 | 07 1 | 2 |

24 Steps: Order the number in increasing order.
Split each number to stem and plot: stem consist of all the digits except the final digit, leaf is the final digit. Sometime you need to round it. place the stem in a column, starting with the smallest, and place a vertical line to their right. On the right side of the vertical line, indicate each leaf (final digit) that has a particular stem, and list the leaves in increasing order.

25 Example: A demo data There are 20 data. Their values are 0, 21, 26,13, 22, 29, 21, 14, 22, 20, 13, 17, 25, 15, 17, 7, 23, 20, 29, 18.

26 Variation of plots Stem-and leaf plot is not unique, it depends on how you choose the stem and leaves You can split it by put the same stem twice and separate the leaves to 0~4 and 5~9. The decimal point is 2 digit(s) to the right of the | 0 | 0 0 | 7 1 | 334 1 | 5778 2 | 2 | 5699

27 Histogram A histogram is a graph that uses bars to portray the counts or the percentages of the outcomes for a quantitative variable. Example: Sodium in Cereals

28 Steps Divide the range of the data into intervals of equal width. You don't want the number of the intervals to be too many or too few. Count the number of observations in each interval, forming a frequency table. On the horizontal axis, label the values or the endpoints of the intervals. Draw a bar over each interval with height equal to its frequency (or percentage), values of which are marked on the vertical axis.

29 Example: Sodium in Cereals
Data: 0, 210, 260,125, 220, 290, 210, 140, 220, 200, 125, 170, 250, 50, 170, 70, 230, 200, 290, 180

30 Variation The number of bars for the histogram is not fixed, it depends on how you choose the breakpoints You don't want the number of the bars to be too many (like this one) or too few.

31 Which graph to use Stem-and-leaf plot: Histogram
More useful for small data sets Data values are retained Histogram More useful for large data sets Most compact display More flexibility in defining intervals

32 The shape of the distribution
Overall pattern: Any Clusters? Outliers? Outlier: an observation that falls well above or below the overall set of data. Single mound, bimodal or multiple mound? Mode: the value that occurs most frequently. The shape of a distribution: Is it symmetric or skewed? If skewed, it is skewed to the left or skewed to the right?

33 Skewness A distribution is
Symmetric: if the side of the distribution below a central value is a mirror image of the side above that central value. Skewed: if one side of the distribution stretches out longer than the other side. (1). skewed to the left : if the left tail is longer than the right tail (2). skewed to the right: if the right tail is longer than the left tail

34 Skewness

35 Questions: Describe the shape of distributions in sodium data based on the first histogram.

36 Practice: Is the following variables likely to be symmetric, or skewed? If skewed, to the left or to the right? IQ scores for the general public. Scores of students on a very easy exam in which most score very well but a few score very poorly. Assessed value of houses in a large city, say the New York City.

37 Section 2.3 Numerical Summaries (for quantitative data)
Two key features for Quantitative Data How Can We describe the Center of Quantitative Data? How Can We describe the Spread (variability) of Quantitative Data?

38 2.3.1 Measure the center Mean: The sum of the observations divided by the number of observations It is just average of the data

39 Example: Sodium in Cereals
Q: There are 20 popular cereals. Their sodium values are 0, 210, 260,125, 220, 290, 210, 140, 220, 200, 125, 170, 250, 150, 170, 70, 230, 200, 290, 180. What is the mean? A: There are totally 20 observations: n=20. Mean=( …+290)/20=185.5

40 Median Median is the midpoint of the observations when they are ordered from the smallest to the largest (or from the largest to the smallest) Steps for calculation: 1. Put the n observations in order from the smallest to the largest. 2. If the number of observations n is odd, the median is the middle observation in the ordered sample. 3. When the number of observation n is even, two observations from the ordered sample fall in the middle, and the median is their average.

41 Example: Sodium in Cereals
First, order the numbers from the smallest to the largest: 0, 70, 125, 125, 140, 150, 170, 170, 180, 200, 200, 210, 210, 220, 220, 230, 250, 260, 290, 290. Sample size n is 20: even. So we find out the two numbers in the middle, which is 200 and 200. So median is the average of the two: Median=( )/2=200

42 Compare mean and median
When a distribution is close to symmetric, the median and mean are similar. For skewed distributions, the mean lies toward the direction of skew (the longer tail) relative to the median. The more highly skewed the distribution, the more the mean and median differ.

43 Compare mean and median (continued)
Question: Who is larger under the following situation, mean or median? perfectly symmetric: skewed to the right: skewed to the left:

44 Example:CO2 emission CO2 Pollution levels in 8 largest nations measured in metric tons per person for eight largest (in population size) countries: China: 2.3, India: 1.1, United Sates: 19.7, Indonesia: 1.2, Brazil: 1.8, Russia: 9.8, Pakistan: 0.7, Bangladesh: 0.2. What is the mean and median?

45 Example: Continued There are two outliers: United states (19.7) and Russia (9.8). If we delete the outlier United states (19.7). What is the new mean and median? What did you learn from this?

46 Resistant Definition: A numerical summary of the observations is called resistant if extreme observations have little, if any, influence on its value.

47 Compare mean and median (Continued)
Mean is not resistant: The mean can be highly influenced by an outlier. Median is resistant: The extreme values have little influence on the values of median. Conclusion: When distribution is highly skewed, you’d better use median instead of mean to describe the center of the data

48 2.3.2 Measure of Spread Range: difference between the largest and smallest observations Eg: sodium (mg) smallest observations: 0 mg largest observations: 290mg range = 290mg -0mg =290 mg

49 Standard deviation Variance is an average of the squares of the deviations from their mean standard deviation is the square root of the variance Misprint in handout

50 Standard deviation s The greater the spread of the data, the larger is the value of s. s is always non-negative. s=0 only when all observations take the same value. s can be strongly influenced by outliers. It is not resistant.

51 Example: Calculate the standard deviation of the following data set by hand: -4, 2, 10, 8, -1.

52 Interpreting the magnitude of s
The Empirical rule: If a distribution of data is bell-shaped, then approximately 68% of the observations fall within 1 standard deviation of the mean, 95% of the observations fall within 2 standard deviations of the mean All or nearly all observations fall within 3 standard deviations of the mean

53 Example: Height data of 117 male students at the University of Georgia
mean=70.93, standard deviation =2.86 Can you find the interval where approximately 68% data fall into that interval. How about for approximately 95%, 100%?

54 Example: Height data of 117 male students at the University of Georgia

55 Example: First Midterm scores of 61 students in stat1001 Spring 2007
mean=78, standard deviation =19. Does the empirical rule still hold?

56 Example: First Midterm scores of 61 students in stat1001 Spring 2007

57 Rule of change of scale If we add the same number c to all of the original observations, how the following statistics will change? mean median standard deviation variance

58 Rule of change of scale If we multiply all the observations in the data list by a positive number c, how the following statistics will change? mean median standard deviation variance

59 Example: change of scale for temperature
For the daily temperature of the first 10 days in this January, we know mean=28F, median=29F, standard deviation=2.5F. What is the mean, median, standard deviation in Celsius?

60 Measure of position The pth Percentile is a value such that p percent of the observations fall below it and (100-p) percent of the observations above it. Note: The 50th percentile is just the median.

61 quartiles Quartiles split the data into four parts with equal number of observations. To get quartiles: Arrange the data in order. Find the median. This is the second quartile, Q2. Consider the lower half of the observations. The median of these observations is the first quartile, Q1. Consider the upper half of the observations. Their median is the third quartile, Q3.

62 Example: Sodium in Cereals.
0, 70, 125, 125, 140, 150, 170, 170, 180, 200, 200, 210, 210, 220, 220, 230, 250, 260, 290, 290 Find out the quartiles Q2 Q1 Q3

63 interquartile range The interquartile range is the distance between the third quartile and first quartile: IQR=Q3-Q1 Example: Sodium in Cereals.

64 1.5 IQR Criterion for Detecting Potential Outliers:
An observation is a potential outlier if it falls more than 1.5 x IQR below the first quartile or more than 1.5 x IQR above the third quartile

65 Example: Sodium in Cereals.
0, 70, 125, 125, 140, 150, 170, 170, 180, 200, 200, 210, 210, 220, 220, 230, 250, 260, 290, 290 Find out the potential outliers.

66 five number summary The five number summary of a dataset is
the minimum value, first quartile, median, third quartile, the maximum value. Q: What is the five number summary for the example of Sodium in Cereals?

67 Box plot Box plot is based on the five-number summary.
Steps to construct: A box is constructed from Q1 to Q3. A line is drawn inside the box at the median. A line extends outward from the lower end of the box to the smallest observation that is not the potential outlier. A line extends outward from the upper end of the box to the largest observation that is not the potential outlier. The potential outliers are shown separately by points.

68 Example: Sodium in Cereals
Draw the box plot.

69 Side-by-side box plots
used to compare the distribution different groups. Example: height data of female and male

70 Compare range, standard deviation and five number summaries
Standard deviation s, IQR and five number summaries usually are usually preferred over the range. When the distribution is highly skewed, usually, IQR and five number summaries are preferred to s, because IQR and five number summaries are resistant. When the distribution is close to symmetric, usually, standard deviation s is preferred over IQR, because by the empirical rule, we could get a good sense about the spread.

71 Z-Score The z-score for an observation measures how far an observation is from the mean in standard deviation units An observation in a bell-shaped distribution is a potential outlier if its z-score < -3 or > +3

72 Homework problems: Homework is due at lab on Jun 22 ,Tue.
Read: Chapter 2. Homework problems: 2.3, 2.80, 2.82, 2.86, 2.90, 2.96, 2.104, 2.111, 2.128 (ignore the “dot plot” in the questions, since we didn’t cover it in class)


Download ppt "Chapter 2: Exploring Data with Graphs and Numerical Summaries"

Similar presentations


Ads by Google