Topic 5: Exploring Quantitative data

Presentation on theme: "Topic 5: Exploring Quantitative data"— Presentation transcript:

1 Topic 5: Exploring Quantitative data

2 Dot plot, mean, and standard deviation

3 Data matrix for emails
Rows 1, 2, 3, and 3921 of a data matrix are displayed below. It contains data collected on 3,921 emails that were received.
        Spam   Num_char   Line_breaks   Format   Number
1       no     21706      551           html     small
2              7011       183                    big
3       yes    631        28            text     none
...
3921           2225       65
Variable      Description
Spam          Specifies whether the email is spam
Num_char      Number of characters in the email
Line_breaks   Number of line breaks in the email
Format        Specifies whether the email was in html or text format
Number        Indicates if the email contained no number, a small number (under 1,000,000), or a big number

4 Data matrix for emails
Rows 1, 2, 3, and 3921 of a data matrix are displayed below. It contains data collected on 3,921 emails that were received.
        Spam   Num_char   Line_breaks   Format   Number
1       no     21706      551           html     small
2              7011       183                    big
3       yes    631        28            text     none
...
3921           2225       65
Quantitative variables: Num_char and Line_breaks.

5 Sample of data
Let’s consider a random sample of 50 emails from the data. Rows 1, 2, 3, and 50 of a data matrix are displayed below. It contains data from the 50 randomly selected emails.
        Spam   Num_char   Line_breaks   Format   Number
1       no     2454       61            text     small
2              41623      1088          html
3              57         5
...
50             15829      242

6 Dot plot
A dot plot provides a case-by-case view of data for one quantitative variable.
        Num_char
1       2454
2       41623
3       57
...
50      15829

8 Dot plot and the mean The “placement” of data, as seen in a dot plot or some other representation, is called the distribution of the data. The mean (also called the average) is a common way to measure the center of the distribution. Mean for data below is

9 The mean
The sample mean, denoted by $\bar{x}$, can be calculated as
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
where $x_1, x_2, \ldots, x_n$ represent the $n$ observed values.
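As a minimal sketch of the formula above, the mean can be computed by summing the observed values and dividing by their count. The list below holds only the four Num_char values visible in the sample table (rows 1, 2, 3, and 50), so the result is not the mean of the full 50-email sample; the function name `sample_mean` is an illustrative choice.

```python
# Only the four Num_char values displayed on the slides (rows 1, 2, 3, and 50).
# The other 46 rows are not shown, so this is NOT the mean of the full sample.
num_char = [2454, 41623, 57, 15829]


def sample_mean(values):
    """Return the arithmetic mean: the sum of the values divided by their count."""
    return sum(values) / len(values)


x_bar = sample_mean(num_char)
print(f"mean of the four displayed values: {x_bar}")
```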

10 Population mean and estimation
The population mean is also computed the same way, but denoted by μ (the Greek letter mu). It is often not possible to compute μ because data on the entire population is not available. The sample mean is a sample statistic, and serves as a point estimate of the population mean. This estimate is probably not perfect, but if the sample is representative of the population, it is usually a good estimate.
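The idea of a point estimate can be illustrated with a small simulation. This is a sketch under stated assumptions: the population (the integers 1 through 1000), the sample size of 50, and the fixed seed are all hypothetical choices, not data from the slides.

```python
import random
import statistics

# Hypothetical population: the integers 1 through 1000 (not data from the slides).
population = list(range(1, 1001))
mu = statistics.mean(population)  # population mean; usually unknown in practice

# Draw a simple random sample of 50 cases. The seed is fixed only so that the
# example is reproducible.
rng = random.Random(0)
sample = rng.sample(population, 50)
x_bar = statistics.mean(sample)  # sample statistic used as a point estimate of mu

print(f"population mean mu = {mu}")
print(f"sample mean x_bar  = {x_bar}")
print(f"estimation error   = {x_bar - mu}")
```

Changing the seed gives a different sample and a slightly different estimate, which is why the estimate is described as probably not perfect.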

11 Distributions with the same mean
Each dot plot displays 124 observations and the distributions all have a mean of 6. What makes them different?

12 Distributions with the same mean
Order these distributions from the least spread out to the most spread out. A. B. C.

13 Standard Deviation
The standard deviation is the typical distance of an observation from the mean. The mean of the distribution is $\bar{x} = 6$ and the sample size is $n = 124$. The standard deviation is computed as follows:
$$s = \sqrt{\frac{(x_1 - 6)^2 + (x_2 - 6)^2 + \cdots + (x_{124} - 6)^2}{124 - 1}}$$
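The computation can be sketched step by step in Python. The eight data values are hypothetical (the 124 observations behind the dot plots are not available), and the function name `sample_std_dev` is an illustrative choice. The result is checked against the standard library's `statistics.stdev`, which also divides by n - 1.

```python
import math
import statistics

# Hypothetical observations; the 124 values from the dot plots are not shown.
data = [2, 4, 4, 4, 5, 5, 7, 9]


def sample_std_dev(values):
    """Sample standard deviation: square root of the summed squared
    deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return math.sqrt(sum(squared_deviations) / (n - 1))


s = sample_std_dev(data)
print(f"s by hand:          {s:.4f}")
print(f"statistics.stdev(): {statistics.stdev(data):.4f}")
```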

14 Standard deviation measures spread
The standard deviations of the three distributions are given.
A. Std. dev. = 1.361
B. Std. dev. = 2.550
C. Std. dev. = 1.482

15 The standard deviation
The standard deviation of a sample is denoted by s and can be calculated using the formula given on the previous slide. The standard deviation of the population is computed in a similar way, except we divide by n instead of n-1. The standard deviation of the population is denoted by σ (the Greek letter sigma).
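The difference between the two divisors can be seen with the Python standard library, where `statistics.stdev` divides by n - 1 and `statistics.pstdev` divides by n. The data values are hypothetical and serve only to show that the sample formula gives the larger result.

```python
import statistics

# Hypothetical observations with a mean of 6.
data = [4, 5, 6, 6, 7, 8]

s = statistics.stdev(data)       # sample standard deviation: divides by n - 1
sigma = statistics.pstdev(data)  # population standard deviation: divides by n

print(f"divide by n - 1: {s:.4f}")
print(f"divide by n:     {sigma:.4f}")
```

With only six values the gap is noticeable; as n grows, the two results move closer together.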

16 Histograms and the shape of a distribution

17 Histogram
A histogram plots binned counts as bars.
Characters (in thousands)   Count
0-5                         19
5-10                        12
10-15                       6
15-20                       3
20-25
25-30                       5
30-35
35-40
40-45                       2

18 Histograms
A histogram is another way to display the distribution of a quantitative variable. It is better than a stem-and-leaf plot for larger data sets, but it doesn’t retain the actual numerical values.
Basic Steps for Creating a Histogram
  1. Divide the range of the data (smallest to largest) into classes of equal width. The classes should not overlap.
  2. Count the number of observations that fall into each class. Recall that the counts are also called frequencies.
  3. Draw a horizontal axis and mark off the classes along this axis. The vertical axis can be the count, the proportion, or the percentage.
  4. Draw a rectangle (a vertical bar) above each class with the height equal to the count, the proportion, or the percentage.
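The binning and counting steps can be sketched as follows. The ten values (character counts in thousands) are hypothetical, the function name `histogram_counts` is an illustrative choice, and the convention that each class includes its left endpoint but not its right endpoint is an assumption made here so that classes do not overlap.

```python
from collections import Counter

# Hypothetical character counts, in thousands (not the 50 sampled emails).
chars_thousands = [2.5, 41.6, 0.1, 15.8, 7.0, 0.6, 2.2, 12.3, 4.9, 27.4]


def histogram_counts(values, bin_width, start=0.0):
    """Count the values in each equal-width class [left, left + bin_width)."""
    counts = Counter()
    for v in values:
        index = int((v - start) // bin_width)  # which class the value falls in
        counts[index] += 1
    # Include empty classes between the lowest and highest occupied class.
    return {
        (start + i * bin_width, start + (i + 1) * bin_width): counts[i]
        for i in range(min(counts), max(counts) + 1)
    }


table = histogram_counts(chars_thousands, bin_width=5)
for (left, right), count in table.items():
    # Draw each bar as a row of asterisks whose length equals the count.
    print(f"{left:>4.0f}-{right:<4.0f} {'*' * count}")
```

Calling `histogram_counts` with a different `bin_width` shows how the choice of class width changes the picture.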

19 Bin width: height of MAT 117 students
Bin width can alter the story we get from the histogram.
[Figure: four histograms of the same data, with ½ in. bins, 1 in. bins, 6 in. bins, and 33 in. bins]

20 Shape of a Distribution: Modality
Does the histogram have a single prominent peak (unimodal), several prominent peaks (bimodal/multimodal), or no apparent peaks (uniform)? Note: To determine modality, step back and imagine a smooth curve over the histogram – imagine the bars are wooden blocks and you drop a limp spaghetti noodle over them, the shape the spaghetti would take could be viewed as a smooth curve.

21 Modality: height of MAT 117 students
Which bin width most accurately presents the modality?
[Figure: four histograms of the same data, with ½ in. bins, 1 in. bins, 6 in. bins, and 33 in. bins]

22 Shape of a Distribution: Skewness
Is the histogram right skewed, left skewed, or symmetric? Note: Histograms are said to be skewed to the side of the long tail.

23 Shape of a Distribution: Unusual Observations
Are there any unusual observations or potential outliers?

24 Sample of data
How would you describe the shape of the distribution of the number of characters contained in the emails?

25 Sample of data
How would you describe the shape of the distribution of the number of characters contained in the emails?
Unimodal and right skewed, with a potentially unusual observation at 40,000 characters.

26 Box plot and the five number summary

27 Percentiles, quartiles, and the median
The p-th percentile is a value such that p percent of observations fall at or below that value.
Data: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
0-th percentile: Minimum
50-th percentile: Median
100-th percentile: Maximum

28 Percentiles, quartiles, and the median
The p-th percentile is a value such that p percent of observations fall at or below that value.
Data: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
0-th percentile: Minimum
25-th percentile: First quartile Q1
50-th percentile: Median
75-th percentile: Third quartile Q3
100-th percentile: Maximum

29 Percentiles, quartiles, and the median
The p-th percentile is a value such that p percent of observations fall at or below that value.
Data: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
0-th percentile: Minimum
25-th percentile: First quartile Q1
50-th percentile: Median, second quartile Q2
75-th percentile: Third quartile Q3
100-th percentile: Maximum
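For the eleven values 1 through 11 used on this slide, the five landmark percentiles can be computed with the Python standard library. This is a sketch, not the method the slides necessarily use: `statistics.quantiles` with its default "exclusive" method is one of several quartile conventions, and other software may report slightly different Q1 and Q3 for the same data.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

minimum = min(data)   # 0-th percentile
maximum = max(data)   # 100-th percentile
# n=4 splits the data into four parts and returns the three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4)

print(f"Min = {minimum}, Q1 = {q1}, Median = {q2}, Q3 = {q3}, Max = {maximum}")

# Check the definition for the median: the share of observations at or below it.
share_at_or_below = sum(x <= q2 for x in data) / len(data)
print(f"share of observations at or below the median: {share_at_or_below:.0%}")
```

With an odd number of observations the share at or below the median is slightly more than 50%, because the median itself is one of the observations.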

31 Height of female MAT 117 students

32 Height of female MAT 117 students

33 Height of female MAT 117 students
[Figure labels: Min., Q1, Median, Q3, Max.]
We want to graphically represent these five numbers, called the five-number summary. This graph is called a box plot. As you can see, there is a bit more to it than just these five numbers.

34 Box plot: height of female MAT 117 students

35 Anatomy of the box plot
[Figure: box plot with labeled parts: lower whisker, Q1, median, Q3, upper whisker, and potential outliers]

36 IQR, whisker, and outliers
Between Q1 and Q3 is the middle 50% of the data. The range these data span is called the interquartile range (IQR).
IQR = Q3 – Q1
Whiskers of a box plot can extend up to 1.5 x IQR away from the quartiles:
Max upper whisker reach = Q3 + 1.5 x IQR
Max lower whisker reach = Q1 – 1.5 x IQR
A potential outlier is an observation beyond the maximum reach of the whiskers. It is an observation that appears to be extreme relative to the rest of the data.
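The whisker rule can be sketched in a few lines of Python. The eleven values are hypothetical, and the quartiles come from `statistics.quantiles` with its default method, which is an assumption; a different quartile convention could move the limits slightly.

```python
import statistics

# Hypothetical measurements (not the MAT 117 height data).
data = [50, 60, 61, 62, 63, 64, 65, 66, 67, 68, 80]

q1, median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Maximum reach of each whisker: 1.5 x IQR beyond the quartiles.
lower_reach = q1 - 1.5 * iqr
upper_reach = q3 + 1.5 * iqr

# Potential outliers lie beyond the maximum reach of the whiskers.
outliers = [x for x in data if x < lower_reach or x > upper_reach]

# Each whisker stops at the most extreme observation still within reach.
within_reach = [x for x in data if lower_reach <= x <= upper_reach]
lower_whisker, upper_whisker = min(within_reach), max(within_reach)

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"whisker reach: {lower_reach} to {upper_reach}")
print(f"whiskers drawn at: {lower_whisker} and {upper_whisker}")
print(f"potential outliers: {outliers}")
```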

37 Outliers Why is it important to look for outliers?
Identify extreme skew in the distribution. Identify data collection and entry errors. Provide insight into interesting features of the data.

38 Resistant statistics

39 Extreme Observations: 2006 US household income
                  Mean     Std. Dev.   Median   IQR
Actual Data       69,319   90,662      39,718   71,267

41 Extreme Observations: 2006 US household income
                  Mean     Std. Dev.   Median   IQR
Actual Data       69,319   90,662      39,718   71,267
700K to 1,400K    72,760   122,603

44 Extreme Observations: 2006 US household income
                  Mean     Std. Dev.   Median   IQR
Actual Data       69,319   90,662      39,718   71,267
700K to 1,400K    72,760   122,603
3K to 1,400K      76,300   130,564     40,262   71,888
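The same effect can be reproduced on a small scale. The seven incomes below (in thousands) are hypothetical and are not the 2006 household data; the sketch doubles the largest value and compares the four statistics before and after. Quartiles come from `statistics.quantiles` with its default method, which is an assumption.

```python
import statistics

# Hypothetical incomes in thousands (not the 2006 US household data).
incomes = [20, 30, 40, 50, 60, 70, 700]
# Same data with the largest observation doubled, 700 -> 1400.
modified = [20, 30, 40, 50, 60, 70, 1400]


def summarize(values):
    """Return mean, standard deviation, median, and IQR of the values."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "std_dev": statistics.stdev(values),
        "median": statistics.median(values),
        "iqr": q3 - q1,
    }


before = summarize(incomes)
after = summarize(modified)

for name in before:
    print(f"{name:>8}: {before[name]:10.2f} -> {after[name]:10.2f}")
```

The mean and standard deviation change substantially, while the median and IQR stay the same, which is what it means for the median and IQR to be resistant to extreme observations.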

45 Quantitative data pairs: scatterplots

46 Scatterplot
A scatterplot provides a case-by-case view of data for two quantitative variables.
        Num_char   Line_breaks
1       2454       61
2       41623      1088
3       57         5
...
50      15829      242

48 Scatterplots: trends
[Figure panels: linear trend; nonlinear trend]

49 Scatterplots: trends (continued)
[Figure panels: cluster trend; no apparent trend]

50 Categorical-quantitative data pairs: comparing groups

51 A categorical-quantitative data pair
Typically the categorical variable is the explanatory variable, and the quantitative variable is the response variable: Explanatory: categorical variable Response: quantitative variable We want to compare the quantitative variable (its mean, median, etc.) for the different groups formed by the categorical variable.

52 Sample of data
Let’s consider a random sample of 50 emails from the data. Rows 1, 2, 3, and 50 of a data matrix are displayed below. It contains data from the 50 randomly selected emails.
        Spam   Num_char   Line_breaks   Format   Number
1       no     2454       61            text     small
2              41623      1088          html
3              57         5
...
50             15829      242
Quantitative: Num_char, Line_breaks. Categorical: Spam, Format, Number.

53 Number of characters and the format of emails
The table below shows the mean and standard deviation for the number of characters in emails formatted as text or html.
Number of Characters (in thousands)
        Mean     Standard Deviation
Text    2.308    3.626
HTML    14.862   13.711
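A table like this can be produced by splitting the quantitative variable into groups defined by the categorical variable and summarizing each group. The six rows below are hypothetical (format, characters in thousands) and do not reproduce the values in the table; the function name `group_summary` is an illustrative choice.

```python
import statistics
from collections import defaultdict

# Hypothetical (format, number of characters in thousands) pairs.
emails = [
    ("text", 2.0), ("html", 10.0), ("text", 3.0),
    ("html", 20.0), ("text", 1.0), ("html", 15.0),
]


def group_summary(rows):
    """Return {group: (mean, sample standard deviation)} for each category."""
    groups = defaultdict(list)
    for category, value in rows:
        groups[category].append(value)  # collect the response values by group
    return {
        category: (statistics.mean(values), statistics.stdev(values))
        for category, values in groups.items()
    }


summary = group_summary(emails)
for category, (mean, std_dev) in summary.items():
    print(f"{category:<5} mean = {mean:.3f}  std. dev. = {std_dev:.3f}")
```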

54 Comparing box plots

