Chapter 2: Exploring Data with Graphs and Numerical Summaries

Slides:



Advertisements
Similar presentations
Very simple to create with each dot representing a data value. Best for non continuous data but can be made for and quantitative data 2004 US Womens Soccer.
Advertisements

Describing Quantitative Variables
Chapter 2 Exploring Data with Graphs and Numerical Summaries
Descriptive Measures MARE 250 Dr. Jason Turner.
Copyright © 2013, 2009, and 2007, Pearson Education, Inc. Chapter 2 Exploring Data with Graphs and Numerical Summaries Section 2.2 Graphical Summaries.
Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Lecture Slides Elementary Statistics Eleventh Edition and the Triola.
EXPLORING DATA WITH GRAPHS AND NUMERICAL SUMMARIES
1 Chapter 1: Sampling and Descriptive Statistics.
Displaying & Summarizing Quantitative Data
ISE 261 PROBABILISTIC SYSTEMS. Chapter One Descriptive Statistics.
Copyright (c) 2004 Brooks/Cole, a division of Thomson Learning, Inc. Chapter Two Treatment of Data.
B a c kn e x t h o m e Classification of Variables Discrete Numerical Variable A variable that produces a response that comes from a counting process.
CHAPTER 1: Picturing Distributions with Graphs
CHAPTER 2: Describing Distributions with Numbers
Descriptive Statistics  Summarizing, Simplifying  Useful for comprehending data, and thus making meaningful interpretations, particularly in medium to.
Agresti/Franklin Statistics, 1 of 63 Chapter 2 Exploring Data with Graphs and Numerical Summaries Learn …. The Different Types of Data The Use of Graphs.
Describing distributions with numbers
Descriptive Statistics  Summarizing, Simplifying  Useful for comprehending data, and thus making meaningful interpretations, particularly in medium to.
Chapter 1 Exploring Data
Let’s Review for… AP Statistics!!! Chapter 1 Review Frank Cerros Xinlei Du Claire Dubois Ryan Hoshi.
Chapter 1 – Exploring Data YMS Displaying Distributions with Graphs xii-7.
2011 Summer ERIE/REU Program Descriptive Statistics Igor Jankovic Department of Civil, Structural, and Environmental Engineering University at Buffalo,
Chapter 1: Exploring Data
1 Laugh, and the world laughs with you. Weep and you weep alone.~Shakespeare~
STAT 280: Elementary Applied Statistics Describing Data Using Numerical Measures.
Chapter 2 Describing Data.
Describing distributions with numbers
Categorical vs. Quantitative…
Copyright © 2010, 2007, 2004 Pearson Education, Inc. Section 2-2 Frequency Distributions.
To be given to you next time: Short Project, What do students drive? AP Problems.
Agresti/Franklin Statistics, 1 of 63 Chapter 2 Exploring Data with Graphs and Numerical Summaries Learn …. The Different Types of Data The Use of Graphs.
Lecture PowerPoint Slides Basic Practice of Statistics 7 th Edition.
Chapter 3 Looking at Data: Distributions Chapter Three
1 Chapter 2: Exploring Data with Graphs and Numerical Summaries Section 2.1: What Are the Types of Data?
Copyright © 2013, 2009, and 2007, Pearson Education, Inc. Chapter 2 Exploring Data with Graphs and Numerical Summaries Section 2.3 Measuring the Center.
Copyright © 2011 Pearson Education, Inc. Describing Numerical Data Chapter 4.
+ Chapter 1: Exploring Data Section 1.2 Displaying Quantitative Data with Graphs The Practice of Statistics, 4 th edition - For AP* STARNES, YATES, MOORE.
+ Chapter 1: Exploring Data Section 1.3 Describing Quantitative Data with Numbers The Practice of Statistics, 4 th edition - For AP* STARNES, YATES, MOORE.
Descriptive Statistics Unit 6. Variable Any characteristic (data) recorded for the subjects of a study ex. blood pressure, nesting orientation, phytoplankton.
Class Two Before Class Two Chapter 8: 34, 36, 38, 44, 46 Chapter 9: 28, 48 Chapter 10: 32, 36 Read Chapters 1 & 2 For Class Three: Chapter 1: 24, 30, 32,
Describing Data Week 1 The W’s (Where do the Numbers come from?) Who: Who was measured? By Whom: Who did the measuring What: What was measured? Where:
Copyright © 2013, 2009, and 2007, Pearson Education, Inc. Chapter 2 Exploring Data with Graphs and Numerical Summaries Section 2.1 Different Types of.
1.2 Displaying Quantitative Data with Graphs.  Each data value is shown as a dot above its location on the number line 1.Draw a horizontal axis (a number.
Slide 1 Copyright © 2004 Pearson Education, Inc.  Descriptive Statistics summarize or describe the important characteristics of a known set of population.
Graphing options for Quantitative Data
Exploratory Data Analysis
Exploring Data Descriptive Data
Chapter 2: Methods for Describing Data Sets
Chapter 1: Exploring Data
Warm Up.
Laugh, and the world laughs with you. Weep and you weep alone
Descriptive Statistics
CHAPTER 1: Picturing Distributions with Graphs
Chapter 1 Data Analysis Section 1.2
DAY 3 Sections 1.2 and 1.3.
Describing Distributions of Data
Chapter 1: Exploring Data
Chapter 1: Exploring Data
Honors Statistics Review Chapters 4 - 5
CHAPTER 1 Exploring Data
Chapter 1: Exploring Data
Chapter 1: Exploring Data
Chapter 1: Exploring Data
Chapter 1: Exploring Data
Chapter 1: Exploring Data
Chapter 1: Exploring Data
Chapter 1: Exploring Data
Chapter 1: Exploring Data
Chapter 1: Exploring Data
Presentation transcript:

Chapter 2: Exploring Data with Graphs and Numerical Summaries

Introduction to descriptive Statistics we will learn: Types of data using graphs to describe data ways of summarizing the data numerically.

Section 2.1: Types of Data Definition: A variable is any characteristic that is recorded for subjects in a study. Examples: the height, weight, major, GPA…of the students

Definitions Definition: The data values that we observe for a variable are referred to as the observations. Example: GPA of the students in our class 3.0 3.4 2.9 ……

Categorical vs Quantitative Each observation can be numerical, such as number of centimeters of precipitation in a day. or category, such as "yes" or "no" for whether it rained.

Categorical vs Quantitative Definition: A variable is called categorical if each observation belongs to one of a set of categories. Examples: Gender (Male or Female), Religious affiliation (Catholic, Jewish, etc.), Place of residence (house, apartment…) belief in life after death (Yes or No)

Categorical vs Quantitative Definition: A variable is called quantitative if observations on it take numerical values that represent different magnitudes of the variable. Examples: age annual income number of years of education completed

True or False: Q: The variables represented by numbers are quantitative variables. True or False? A:

Practice: Are the following variables categorical or quantitative? choice of diet (vegetarian, nonvegetarian). number of people you have known who have been elected to a political office. the weight of a dog.

Discrete vs Continuous Definition: A quantitative variable is discrete if its possible values form a set of separate numbers such as 0, 1, 2, 3… Examples: Number of pets in a household, number of children in a family, number of foreign languages spoken

Discrete vs Continuous Definition: A quantitative variable is continuous if its possible values form an interval. Note: We only talk about discrete or continuous for quantitative variables. Examples: height, weight, age, amount of time it takes to complete an assignment.

Practice: Are these quantitative variables discrete or continuous? the number of people in line at a box office to purchase theater tickets. the length of time to run marathon. the number of people you have dated in the past month. the weight of a dog..

Section 2.2: Using graphic summaries to describe data Frequency table: Frequency table is not a graphic summary. But usually, it well summarizes the information from the data sets and helps to build a graphic summary. If we want to draw a graph by hand, not from software, we often do frequency table first.

Example: 735 shark attacks reported from 1990 through 2002 Region Counts Proportion Percentage Florida 289 Hawaii 44 California 34 Australia Brazil 55 South Africa 64 Others 205 Total 735

Questions: What is the variable? Is it categorical or quantitative? How to find the proportion and percentage?

Categorical data For Categorical variables, the key feature is the percentage in each of the categories. The two commonly used graphic tools are pie chart and bar graph.

Pie chart Pie chart is a circle having a "slice of the pie" for each category. The size of a slice corresponds to the percentage of observations in the category. Example: Shark attacks regions.

Bar chart Bar graph is a graph that displays a vertical bar for each category. The height of the bar is the percentage of observations in the category. Example: Shark attacks regions.

Pareto Chart Pareto chart is a special type of bar chart where the bars being plotted are arranged in descending order. Example: Shark attacks regions.

Questions What information do you get from the plots.

Quantitative data We need to describe the whole shape of the distribution. The two commonly used graphic tools are Stem-and-leaf Plot and histogram.

Example: A demo data There are 20 data. Their values are 0, 21, 26,13, 22, 29, 21, 14, 22, 20, 13, 17, 25, 15, 17, 7, 23, 20, 29, 18.

Stem-and-Leaf Plot In stem-and-leaf plot, each observation is represented by a stem and a leaf. Usually the stem consist of all the digits except for the final one, which is the leaf. The decimal point is 2 digit(s) to the right of the | 0 | 07 1 | 3345778 2 | 00112235699

Steps: Order the number in increasing order. Split each number to stem and plot: stem consist of all the digits except the final digit, leaf is the final digit. Sometime you need to round it. place the stem in a column, starting with the smallest, and place a vertical line to their right. On the right side of the vertical line, indicate each leaf (final digit) that has a particular stem, and list the leaves in increasing order.

Example: A demo data There are 20 data. Their values are 0, 21, 26,13, 22, 29, 21, 14, 22, 20, 13, 17, 25, 15, 17, 7, 23, 20, 29, 18.

Variation of plots Stem-and leaf plot is not unique, it depends on how you choose the stem and leaves You can split it by put the same stem twice and separate the leaves to 0~4 and 5~9. The decimal point is 2 digit(s) to the right of the | 0 | 0 0 | 7 1 | 334 1 | 5778 2 | 0011223 2 | 5699

Histogram A histogram is a graph that uses bars to portray the counts or the percentages of the outcomes for a quantitative variable. Example: Sodium in Cereals

Steps Divide the range of the data into intervals of equal width. You don't want the number of the intervals to be too many or too few. Count the number of observations in each interval, forming a frequency table. On the horizontal axis, label the values or the endpoints of the intervals. Draw a bar over each interval with height equal to its frequency (or percentage), values of which are marked on the vertical axis.

Example: Sodium in Cereals Data: 0, 210, 260,125, 220, 290, 210, 140, 220, 200, 125, 170, 250, 50, 170, 70, 230, 200, 290, 180

Variation The number of bars for the histogram is not fixed, it depends on how you choose the breakpoints You don't want the number of the bars to be too many (like this one) or too few.

Which graph to use Stem-and-leaf plot: Histogram More useful for small data sets Data values are retained Histogram More useful for large data sets Most compact display More flexibility in defining intervals

The shape of the distribution Overall pattern: Any Clusters? Outliers? Outlier: an observation that falls well above or below the overall set of data. Single mound, bimodal or multiple mound? Mode: the value that occurs most frequently. The shape of a distribution: Is it symmetric or skewed? If skewed, it is skewed to the left or skewed to the right?

Skewness A distribution is Symmetric: if the side of the distribution below a central value is a mirror image of the side above that central value. Skewed: if one side of the distribution stretches out longer than the other side. (1). skewed to the left : if the left tail is longer than the right tail (2). skewed to the right: if the right tail is longer than the left tail

Skewness

Questions: Describe the shape of distributions in sodium data based on the first histogram.

Practice: Is the following variables likely to be symmetric, or skewed? If skewed, to the left or to the right? IQ scores for the general public. Scores of students on a very easy exam in which most score very well but a few score very poorly. Assessed value of houses in a large city, say the New York City.

Section 2.3 Numerical Summaries (for quantitative data) Two key features for Quantitative Data How Can We describe the Center of Quantitative Data? How Can We describe the Spread (variability) of Quantitative Data?

2.3.1 Measure the center Mean: The sum of the observations divided by the number of observations It is just average of the data

Example: Sodium in Cereals Q: There are 20 popular cereals. Their sodium values are 0, 210, 260,125, 220, 290, 210, 140, 220, 200, 125, 170, 250, 150, 170, 70, 230, 200, 290, 180. What is the mean? A: There are totally 20 observations: n=20. Mean=(0+ 210+260+…+290)/20=185.5

Median Median is the midpoint of the observations when they are ordered from the smallest to the largest (or from the largest to the smallest) Steps for calculation: 1. Put the n observations in order from the smallest to the largest. 2. If the number of observations n is odd, the median is the middle observation in the ordered sample. 3. When the number of observation n is even, two observations from the ordered sample fall in the middle, and the median is their average.

Example: Sodium in Cereals First, order the numbers from the smallest to the largest: 0, 70, 125, 125, 140, 150, 170, 170, 180, 200, 200, 210, 210, 220, 220, 230, 250, 260, 290, 290. Sample size n is 20: even. So we find out the two numbers in the middle, which is 200 and 200. So median is the average of the two: Median=(200+200)/2=200

Compare mean and median When a distribution is close to symmetric, the median and mean are similar. For skewed distributions, the mean lies toward the direction of skew (the longer tail) relative to the median. The more highly skewed the distribution, the more the mean and median differ.

Compare mean and median (continued) Question: Who is larger under the following situation, mean or median? perfectly symmetric: skewed to the right: skewed to the left:

Example:CO2 emission CO2 Pollution levels in 8 largest nations measured in metric tons per person for eight largest (in population size) countries: China: 2.3, India: 1.1, United Sates: 19.7, Indonesia: 1.2, Brazil: 1.8, Russia: 9.8, Pakistan: 0.7, Bangladesh: 0.2. What is the mean and median?

Example: Continued There are two outliers: United states (19.7) and Russia (9.8). If we delete the outlier United states (19.7). What is the new mean and median? What did you learn from this?

Resistant Definition: A numerical summary of the observations is called resistant if extreme observations have little, if any, influence on its value.

Compare mean and median (Continued) Mean is not resistant: The mean can be highly influenced by an outlier. Median is resistant: The extreme values have little influence on the values of median. Conclusion: When distribution is highly skewed, you’d better use median instead of mean to describe the center of the data

2.3.2 Measure of Spread Range: difference between the largest and smallest observations Eg: sodium (mg) 0 210 260 125 220 290 210 140 220 200 125 170 250 150 170 70 230 200 290 180 smallest observations: 0 mg largest observations: 290mg range = 290mg -0mg =290 mg

Standard deviation Variance is an average of the squares of the deviations from their mean standard deviation is the square root of the variance Misprint in handout

Standard deviation s The greater the spread of the data, the larger is the value of s. s is always non-negative. s=0 only when all observations take the same value. s can be strongly influenced by outliers. It is not resistant.

Example: Calculate the standard deviation of the following data set by hand: -4, 2, 10, 8, -1.

Interpreting the magnitude of s The Empirical rule: If a distribution of data is bell-shaped, then approximately 68% of the observations fall within 1 standard deviation of the mean, 95% of the observations fall within 2 standard deviations of the mean All or nearly all observations fall within 3 standard deviations of the mean

Example: Height data of 117 male students at the University of Georgia mean=70.93, standard deviation =2.86 Can you find the interval where approximately 68% data fall into that interval. How about for approximately 95%, 100%?

Example: Height data of 117 male students at the University of Georgia

Example: First Midterm scores of 61 students in stat1001 Spring 2007 mean=78, standard deviation =19. Does the empirical rule still hold?

Example: First Midterm scores of 61 students in stat1001 Spring 2007

Rule of change of scale If we add the same number c to all of the original observations, how the following statistics will change? mean median standard deviation variance

Rule of change of scale If we multiply all the observations in the data list by a positive number c, how the following statistics will change? mean median standard deviation variance

Example: change of scale for temperature For the daily temperature of the first 10 days in this January, we know mean=28F, median=29F, standard deviation=2.5F. What is the mean, median, standard deviation in Celsius?

Measure of position The pth Percentile is a value such that p percent of the observations fall below it and (100-p) percent of the observations above it. Note: The 50th percentile is just the median.

quartiles Quartiles split the data into four parts with equal number of observations. To get quartiles: Arrange the data in order. Find the median. This is the second quartile, Q2. Consider the lower half of the observations. The median of these observations is the first quartile, Q1. Consider the upper half of the observations. Their median is the third quartile, Q3.

Example: Sodium in Cereals. 0, 70, 125, 125, 140, 150, 170, 170, 180, 200, 200, 210, 210, 220, 220, 230, 250, 260, 290, 290 Find out the quartiles Q2 Q1 Q3

interquartile range The interquartile range is the distance between the third quartile and first quartile: IQR=Q3-Q1 Example: Sodium in Cereals.

1.5 IQR Criterion for Detecting Potential Outliers: An observation is a potential outlier if it falls more than 1.5 x IQR below the first quartile or more than 1.5 x IQR above the third quartile

Example: Sodium in Cereals. 0, 70, 125, 125, 140, 150, 170, 170, 180, 200, 200, 210, 210, 220, 220, 230, 250, 260, 290, 290 Find out the potential outliers.

five number summary The five number summary of a dataset is the minimum value, first quartile, median, third quartile, the maximum value. Q: What is the five number summary for the example of Sodium in Cereals?

Box plot Box plot is based on the five-number summary. Steps to construct: A box is constructed from Q1 to Q3. A line is drawn inside the box at the median. A line extends outward from the lower end of the box to the smallest observation that is not the potential outlier. A line extends outward from the upper end of the box to the largest observation that is not the potential outlier. The potential outliers are shown separately by points.

Example: Sodium in Cereals Draw the box plot.

Side-by-side box plots used to compare the distribution different groups. Example: height data of female and male

Compare range, standard deviation and five number summaries Standard deviation s, IQR and five number summaries usually are usually preferred over the range. When the distribution is highly skewed, usually, IQR and five number summaries are preferred to s, because IQR and five number summaries are resistant. When the distribution is close to symmetric, usually, standard deviation s is preferred over IQR, because by the empirical rule, we could get a good sense about the spread.

Z-Score The z-score for an observation measures how far an observation is from the mean in standard deviation units An observation in a bell-shaped distribution is a potential outlier if its z-score < -3 or > +3

Homework problems: Homework is due at lab on Jun 22 ,Tue. Read: Chapter 2. Homework problems: 2.3, 2.80, 2.82, 2.86, 2.90, 2.96, 2.104, 2.111, 2.128 (ignore the “dot plot” in the questions, since we didn’t cover it in class)