Data Analysis and Statistical Software I Quarter: Spring 2003

Presentation on theme: "Data Analysis and Statistical Software I Quarter: Spring 2003"— Presentation transcript:

1 Data Analysis and Statistical Software I Quarter: Spring 2003
Daniela Stan Raicu School of CTI, DePaul University 1/18/2019 Daniela Stan - CSC323

2 Outline
- Describing distributions with numbers (continuation from the previous lecture)
- The 1.5 X IQR criterion for suspected outliers
- Measuring spread: the standard deviation
- Normal Distribution
- Standard Normal Distribution

3 Describing Distributions (cont.)
Measuring spread: the quartiles
The pth percentile of a distribution is the value such that p percent of the observations fall at or below it.
- The 50th percentile = median, M
- The 25th percentile = first quartile, Q1
- The 75th percentile = third quartile, Q3

4 Describing Distributions (cont.)
To calculate the quartiles:
1. Arrange the observations in increasing order and locate the median M in the list of observations.
2. The first quartile Q1 is the median of the observations whose position in the ordered list is to the left of the location of the overall median.
3. The third quartile Q3 is the median of the observations whose position in the ordered list is to the right of the location of the overall median.
Example: 1.13 M=?, Q1=?, Q3=?
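The three steps above can be sketched in Python. This is a minimal illustration, not the course's software: the function names `median` and `quartiles` and the sample data are assumptions made for the example, and statistical packages often use slightly different quartile rules.

```python
def median(ordered):
    """Median of a list that is already sorted in increasing order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, M, Q3) using the median-of-halves rule from the slide."""
    ordered = sorted(values)          # step 1: arrange in increasing order
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]            # observations left of the median's location
    upper = ordered[half + n % 2:]    # observations right of the median's location
    return median(lower), median(ordered), median(upper)


# Hypothetical data, chosen only to illustrate the rule.
print(quartiles([7, 1, 3, 9, 5, 11, 13]))    # (3, 7, 11)
print(quartiles([1, 2, 3, 4, 5, 6, 7, 8]))   # (2.5, 4.5, 6.5)
```

When the number of observations is odd, the median itself is left out of both halves; when it is even, the list splits cleanly in two.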

5 Describing Distributions (cont.)
The Five-Number Summary of a set of observations consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation, written in order from the smallest to the largest.
In symbols, the five-number summary is: Minimum Q1 M Q3 Maximum
A boxplot is a graph of the five-number summary:
- A central box spans the quartiles Q1 and Q3
- A line in the box marks the median M
- Lines extend from the box out to the smallest and largest observations

6 Weight Data: Sorted

7 Weight Data: Quartiles
[Stemplot of the sorted weight data, with the first quartile, the median (second quartile), and the third quartile marked]
Q1 = 127.5, Q2 (median) = 165, Q3 = 185

8 Five-Number Summary
- minimum = 100
- first quartile = 127.5
- second quartile = 165
- third quartile = 185
- maximum = 260
interquartile range = Q3 - Q1 = 57.5
range = max - min = 160
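The two derived quantities on this slide are simple differences of the five-number summary. A small sketch, assuming a hypothetical `FiveNumberSummary` container (the class and property names are not from the course material):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FiveNumberSummary:
    minimum: float
    q1: float
    median: float
    q3: float
    maximum: float

    @property
    def iqr(self):
        """Interquartile range: Q3 - Q1."""
        return self.q3 - self.q1

    @property
    def value_range(self):
        """Range: max - min."""
        return self.maximum - self.minimum


# Five-number summary of the weight data, as given on the slide.
weights = FiveNumberSummary(100, 127.5, 165, 185, 260)
print(weights.iqr)          # 57.5
print(weights.value_range)  # 160
```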

9 Five-Number Summary: Boxplot
[Boxplot of Weight, labeled with min, Q1, M, Q3, and max]

10 Recommended Problems Chapter 1: Section 1.1
IPS web site:

11 The 1.5 X IQR criterion
The interquartile range IQR is the distance between the first and third quartiles: IQR = Q3 - Q1
The 1.5 X IQR criterion for outliers: An observation is a suspected outlier if it falls more than 1.5 X IQR above the third quartile or below the first quartile.
Modified boxplot:
- the lines extend out from the central box only to the smallest and largest observations that are not suspected outliers.
- the suspected outliers are plotted as individual points.
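The criterion can be expressed as two cutoffs ("fences") around the quartiles. The sketch below is only an illustration: the function names `iqr_fences` and `suspected_outliers`, and the sample data, are assumptions made for the example.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) cutoffs Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def suspected_outliers(values, q1, q3):
    """Observations that fall outside the 1.5 X IQR fences."""
    low, high = iqr_fences(q1, q3)
    return [v for v in values if v < low or v > high]


# Hypothetical data; Q1 = 2.5 and Q3 = 6.5 are its median-of-halves quartiles.
data = [1, 2, 3, 4, 5, 6, 7, 50]
print(iqr_fences(2.5, 6.5))                # (-3.5, 12.5)
print(suspected_outliers(data, 2.5, 6.5))  # [50]
```

In a modified boxplot, the whiskers would stop at the smallest and largest values that lie inside the fences, and the values returned by `suspected_outliers` would be drawn as individual points.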

12 The 1.5 X IQR criterion (cont.)
Examples 1.9/page 14 & 1.17/page 46

13 The 1.5 X IQR criterion (cont.)
Histogram of the percent of Hispanics in the adult population
Shape? Skewed to the right with a single peak at the left
Outliers? The one state that stands out is New Mexico with 38.7%

14 The 1.5 X IQR criterion (cont.)
The five-number summary is:
Minimum = 0.6, Q1 = 2.0, M = 4.1, Q3 = 7.0, Maximum = 38.7
The 1.5 X IQR criterion for outliers:
IQR = Q3 - Q1 = 7.0 - 2.0 = 5.0
1.5 X IQR = 7.5
Suspected outlier: any value below Q1 - 1.5 X IQR or above Q3 + 1.5 X IQR
Q1 - 1.5 X IQR = 2.0 - 7.5 = -5.5
Q3 + 1.5 X IQR = 7.0 + 7.5 = 14.5
There are 7 suspected outliers

15 The 1.5 X IQR criterion (cont.)
Modified boxplot: The points represent the suspected outliers.

16 Measuring the spread: Variance and Standard Deviation
If all values are the same, what is the variation in the data?
Variation exists when some values are above or below the mean.
Each data value has an associated deviation from the mean.

17 Deviations and Variance
A deviation: what is a typical deviation from the mean?
- small values of this typical deviation indicate small variation in the data;
- large values of this typical deviation indicate large variation in the data
Variance:
1. Find the mean
2. Find the deviation of each value from the mean
3. Square the deviations
4. Sum the squared deviations
5. Divide the sum by n-1
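The five steps for the variance translate directly into code. A minimal sketch, assuming a hypothetical function name `sample_variance` and made-up data:

```python
def sample_variance(values):
    """Variance computed with the five steps listed on the slide."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n                    # 1. find the mean
    deviations = [x - mean for x in values]   # 2. deviation of each value
    squared = [d * d for d in deviations]     # 3. square the deviations
    total = sum(squared)                      # 4. sum the squared deviations
    return total / (n - 1)                    # 5. divide the sum by n-1


# Hypothetical data: mean 3, squared deviations 4, 1, 0, 1, 4.
print(sample_variance([1, 2, 3, 4, 5]))  # 2.5
```

Python's standard library offers `statistics.variance`, which also divides by n-1 and can be used to check the result.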

18 Measuring Spread: The standard deviation
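The variance s^2 of a set of observations x1, x2, …, xn is the average of the squares of the deviations of the observations from their mean:
$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$
or, in more compact notation,
$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$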

19 Measuring Spread: The standard deviation
The standard deviation s is the square root of the variance s^2:
$$s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$
The number n-1 is called the degrees of freedom of the variance or standard deviation.
When is the standard deviation s equal to zero?
Is the standard deviation s a resistant measure?
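One way to explore the two questions on this slide is to compute s for a few small data sets. The sketch below uses `statistics.stdev` from the Python standard library (which divides by n-1); all three data sets are made up for illustration.

```python
import statistics

all_equal = [5, 5, 5, 5]          # no observation deviates from the mean
clean = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 50]   # same data with one extreme value

print(statistics.stdev(all_equal))      # 0.0
print(statistics.stdev(clean))          # about 1.58
print(statistics.stdev(with_outlier))   # about 21.27

# The median is unchanged by the extreme value, unlike s.
print(statistics.median(clean), statistics.median(with_outlier))  # 3 3
```

A single extreme observation inflates s by more than a factor of ten here, while the median does not move at all.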

20 The standard deviation (cont.)
Example: Problem 1.59
Choosing measures for center and spread:
- if the distribution is skewed, choose the five-number summary
- if the distribution is symmetric and free of outliers, choose the mean and the standard deviation

21 Density curves
Sometimes the overall pattern of a large number of observations is so regular that we can describe it by a smooth curve. The curve is the mathematical model for the distribution.
A density curve is a curve that is always on or above the horizontal axis and has area exactly 1 underneath it.
[Figure: histogram of the scores of all 947 seventh-grade students in Gary, Indiana, on the vocabulary part of the Iowa test]
[Figure: a symmetric density curve]

22 The normal distributions
Normal curves are density curves that are:
- Symmetric
- Unimodal
- Bell-shaped

23 The normal distributions (cont.)
A normal distribution is specified by:
- Mean μ
- Standard deviation σ
Notation: N(μ, σ)
The equation of the normal curve (f(x) gives the height of the curve at x):
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$$
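The height of a normal curve can be evaluated directly from the standard normal density formula. A minimal sketch; the function name `normal_density` is an assumption made for this example.

```python
import math


def normal_density(x, mu, sigma):
    """Height of the N(mu, sigma) curve at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


# The curve is highest at the mean and symmetric around it.
print(normal_density(0.0, 0.0, 1.0))   # about 0.3989
print(normal_density(-1.0, 0.0, 1.0) == normal_density(1.0, 0.0, 1.0))  # True
```

A larger σ lowers the peak, since the height at the mean is 1/(σ√(2π)) and the total area under the curve must stay equal to 1.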

24 ?

25 The normal distributions (cont.)
Example of two normal curves specified by their mean and standard deviation
[Figure: two normal curves, vertical axis f(x)]
Can we locate the standard deviation with the eye?

26 The 68-95-99.7 rule
In the normal distribution N(μ, σ):
- Approximately 68% of the observations are between μ - σ and μ + σ
- Approximately 95% of the observations are between μ - 2σ and μ + 2σ
- Approximately 99.7% of the observations are between μ - 3σ and μ + 3σ
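The three percentages can be checked numerically. For any normal distribution, the proportion within k standard deviations of the mean equals erf(k/√2), where erf is the error function available as `math.erf`. The helper name `proportion_within` is an assumption made for this sketch.

```python
import math


def proportion_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {proportion_within(k):.4f}")
# within 1 standard deviation(s): 0.6827
# within 2 standard deviation(s): 0.9545
# within 3 standard deviation(s): 0.9973
```

The result does not depend on μ or σ, which is why the rule holds for every normal curve.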

27 Empirical Rule for Any Normal Curve
[Figure: normal curve with the central areas marked]
- μ ± 1σ: 68%
- μ ± 2σ: 95%
- μ ± 3σ: 99.7%

28 Health and Nutrition Examination Study of 1976-1980 (HANES)
Heights of adults, aged 18-24
women
- mean: 65.0 inches
- standard deviation: 2.5 inches
men
- mean: 70.0 inches
- standard deviation: 2.8 inches

29 Health and Nutrition Examination Study of 1976-1980 (HANES)
Empirical Rule
women
- 68% are between 62.5 and 67.5 inches [mean ± 1 std dev = 65.0 ± 2.5]
- 95% are between 60.0 and 70.0 inches
- 99.7% are between 57.5 and 72.5 inches
men
- 68% are between 67.2 and 72.8 inches
- 95% are between 64.4 and 75.6 inches
- 99.7% are between 61.6 and 78.4 inches
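The intervals above are the mean plus or minus 1, 2, and 3 standard deviations. A short sketch that reproduces them from the HANES means and standard deviations; the function name `empirical_rule_intervals` is an assumption made for this example.

```python
def empirical_rule_intervals(mean, std_dev):
    """Map each empirical-rule percentage to its (low, high) interval."""
    coverage = {1: "68%", 2: "95%", 3: "99.7%"}
    return {
        label: (mean - k * std_dev, mean + k * std_dev)
        for k, label in coverage.items()
    }


# Means and standard deviations from the HANES slide (inches).
groups = {"women": (65.0, 2.5), "men": (70.0, 2.8)}
for group, (mean, sd) in groups.items():
    for label, (low, high) in empirical_rule_intervals(mean, sd).items():
        print(f"{group}: {label} are between {low:.1f} and {high:.1f} inches")
```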

30 With the Mean and Standard Deviation of the Normal Distribution We Can Determine:
- What proportion of individuals fall into any range of values. Example: What proportion of men are less than 68 inches tall?
- At what percentile a given individual falls, if you know their value
- What value corresponds to a given percentile (height values)
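All three questions can be answered with the cumulative distribution of the normal curve. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) with the HANES values for men; the 72-inch height and the 90th percentile are arbitrary illustrations, not from the slides.

```python
from statistics import NormalDist

# Heights of men aged 18-24 from the HANES slide: N(70.0, 2.8), in inches.
men = NormalDist(mu=70.0, sigma=2.8)

# Proportion of men less than 68 inches tall (roughly 0.24).
print(round(men.cdf(68.0), 4))

# Percentile of a man who is 72 inches tall (hypothetical individual).
print(round(100 * men.cdf(72.0), 1))

# Height at the 90th percentile (hypothetical choice of percentile).
print(round(men.inv_cdf(0.90), 2))
```

The proportion in a range follows by subtraction, for example `men.cdf(72.0) - men.cdf(68.0)` for heights between 68 and 72 inches.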

