Chapter 16: Exploratory data analysis: numerical summaries


1 Chapter 16: Exploratory data analysis: numerical summaries
CIS 3033

2 16.1 The center of a dataset
Sample mean: $\bar{x}_n = \frac{x_1 + x_2 + \cdots + x_n}{n}$.
Sample median: $\mathrm{Med}_n$ is the middle element (if n is odd), or the average of the two middle elements (if n is even), after the elements are sorted. The sample mean and sample median of a dataset correspond to the expectation and median of a probability distribution, respectively. The sample mean is more sensitive to outliers than the sample median.
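To illustrate the sensitivity claim, here is a minimal sketch using Python's standard `statistics` module. The dataset and the added value 100 are made up for illustration; they are not from the slides.

```python
from statistics import mean, median

# Hypothetical dataset, plus the same data with one extreme value appended.
data = [2, 5, 7, 8]
with_outlier = data + [100]

# Sample mean and sample median of the clean data: 5.5 and 6.
clean_mean, clean_median = mean(data), median(data)

# With the outlier, the mean jumps to 24.4 while the median only moves to 7.
outlier_mean, outlier_median = mean(with_outlier), median(with_outlier)

print(clean_mean, clean_median)
print(outlier_mean, outlier_median)
```

A single extreme observation shifts the mean by almost 19 units but the median by only 1.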

3 16.2 The amount of variability
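Sample variance: $s_n^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x}_n)^2$. Sample standard deviation: $s_n$, the square root of the sample variance. When the center of a dataset is represented by its sample median, the variability is often indicated by the median of absolute deviations, or MAD: $\mathrm{MAD} = \mathrm{Med}(|x_1 - \mathrm{Med}_n|, \ldots, |x_n - \mathrm{Med}_n|)$. The MAD is less sensitive to outliers than $s_n$.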
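The following sketch computes these spread measures with Python's standard library. It assumes the sample variance uses the n-1 divisor (which is what `statistics.variance` computes); the helper `mad` and the data are hypothetical illustrations.

```python
from statistics import median, stdev, variance

def mad(xs):
    """Median of absolute deviations from the sample median."""
    m = median(xs)
    return median([abs(x - m) for x in xs])

# Hypothetical dataset, plus the same data with one extreme value appended.
data = [2, 5, 7, 8]
with_outlier = data + [100]

# Sample variance (n-1 divisor, an assumption) and standard deviation.
var_clean = variance(data)   # 7
sd_clean = stdev(data)       # square root of 7

# MAD of the clean data is 1.5; with the outlier it only rises to 2,
# whereas the standard deviation grows to more than 40.
mad_clean = mad(data)
mad_outlier = mad(with_outlier)
sd_outlier = stdev(with_outlier)
```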

4 16.3 Empirical quantiles, quartiles, IQR
The pth empirical quantile, $q_n(p)$, also called the 100p-th empirical percentile, is a number that is greater than or equal to a proportion p of the elements in the dataset, and less than or equal to a proportion 1−p of the elements. A quantile is not necessarily an element of the dataset. The order statistics of a dataset $x_1, x_2, \ldots, x_n$ consist of the same elements as the original dataset, but ordered as $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$.

5 16.3 Empirical quantiles, quartiles, IQR
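The ith order statistic $x_{(i)}$ is greater than or equal to i elements in the dataset, and less than or equal to n−i+1 elements (the element itself is counted on both sides), so it is taken as the i/(n+1) quantile. For other values of p, $q_n(p)$ is obtained by linear interpolation between adjacent order statistics: $q_n(p) = x_{(k)} + \alpha\,(x_{(k+1)} - x_{(k)})$, where $k = \lfloor p(n+1) \rfloor$ and $\alpha = p(n+1) - k$. Example: for the dataset 2, 5, 7, 8 (n = 4) and p = 0.3, we have p(n+1) = 1.5, so k = 1 and α = 0.5, and $q_4(0.3) = 2 + (0.3 \cdot 5 - 1)(5 - 2) = 3.5$.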
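The worked example can be checked with a short sketch. It assumes the interpolation rule implied by the example: with k the integer part of p(n+1), the quantile lies a fraction p(n+1) − k of the way from x(k) to x(k+1). The function name `empirical_quantile` and the clamping at the ends of the data are illustrative choices, not something the slides specify.

```python
import math

def empirical_quantile(xs, p):
    """Empirical quantile by linear interpolation between order statistics."""
    s = sorted(xs)            # order statistics x(1) <= ... <= x(n)
    n = len(s)
    h = p * (n + 1)
    k = math.floor(h)         # index of the lower order statistic (1-based)
    alpha = h - k             # fractional distance toward x(k+1)
    if k < 1:                 # assumption: clamp to the minimum
        return float(s[0])
    if k >= n:                # assumption: clamp to the maximum
        return float(s[-1])
    return s[k - 1] + alpha * (s[k] - s[k - 1])

# The example from the slide: 2 + (0.3*5 - 1) * (5 - 2) = 3.5
q = empirical_quantile([2, 5, 7, 8], 0.3)
print(q)
```

With p = 0.5 the same function gives 6 for this dataset, which agrees with the sample median.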

6 16.3 Empirical quantiles, quartiles, IQR
The quantile $q_n(0.25)$ is called the lower quartile and $q_n(0.75)$ is called the upper quartile. The distance between the upper and lower quartiles is called the interquartile range, or IQR: $\mathrm{IQR} = q_n(0.75) - q_n(0.25)$. Five-number summary of a dataset: minimum, lower (or first) quartile, median, upper (or third) quartile, maximum.
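As a sketch of how the five-number summary and IQR could be computed, the code below uses the p(n+1) interpolation rule for quantiles. The function names and the clamping at the extremes are assumptions made for illustration; other software may use different quantile conventions and return slightly different quartiles.

```python
import math

def empirical_quantile(xs, p):
    """Quantile by linear interpolation at position p*(n+1) (assumed rule)."""
    s = sorted(xs)
    n = len(s)
    h = p * (n + 1)
    k = math.floor(h)
    if k < 1:                 # assumption: clamp to the minimum
        return float(s[0])
    if k >= n:                # assumption: clamp to the maximum
        return float(s[-1])
    return s[k - 1] + (h - k) * (s[k] - s[k - 1])

def five_number_summary(xs):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    return (
        min(xs),
        empirical_quantile(xs, 0.25),
        empirical_quantile(xs, 0.50),
        empirical_quantile(xs, 0.75),
        max(xs),
    )

data = [2, 5, 7, 8]
summary = five_number_summary(data)   # (2, 2.75, 6.0, 7.75, 8)
iqr = summary[3] - summary[1]         # 7.75 - 2.75 = 5.0
print(summary, iqr)
```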

7 16.3 Empirical quantiles, quartiles, IQR

8 16.4 The box-and-whisker plot
Box-and-whisker plot (boxplot for short): a visualization of the five-number summary. The horizontal width of the box is irrelevant. The height of the box is precisely the IQR. The horizontal line inside the box corresponds to the sample median. The two whiskers extend from the box to the smallest and largest data elements that lie within 1.5 IQR of the box. All observations beyond the whiskers are called outliers.
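The whisker and outlier rule can be sketched numerically, without drawing the plot. The code assumes quartiles are computed with the p(n+1) interpolation rule; the dataset, the function names, and the dictionary keys are hypothetical.

```python
import math

def empirical_quantile(xs, p):
    """Quantile by linear interpolation at position p*(n+1) (assumed rule)."""
    s = sorted(xs)
    n = len(s)
    h = p * (n + 1)
    k = math.floor(h)
    if k < 1:                 # assumption: clamp to the minimum
        return float(s[0])
    if k >= n:                # assumption: clamp to the maximum
        return float(s[-1])
    return s[k - 1] + (h - k) * (s[k] - s[k - 1])

def boxplot_stats(xs):
    """Box edges, whisker ends, and outliers under the 1.5 IQR rule."""
    q1 = empirical_quantile(xs, 0.25)
    q3 = empirical_quantile(xs, 0.75)
    iqr = q3 - q1
    low_limit = q1 - 1.5 * iqr    # furthest a lower whisker may reach
    high_limit = q3 + 1.5 * iqr   # furthest an upper whisker may reach
    inside = [x for x in xs if low_limit <= x <= high_limit]
    return {
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "lower_whisker": min(inside),   # smallest element within range
        "upper_whisker": max(inside),   # largest element within range
        "outliers": sorted(x for x in xs if x < low_limit or x > high_limit),
    }

# Hypothetical data: ten regular values and one extreme observation.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50]
stats = boxplot_stats(data)
print(stats)
```

For this data the quartiles are 3 and 9, so the whiskers may reach at most from -6 to 18; they end at the data values 1 and 10, and 50 is flagged as an outlier.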

9 16.4 The box-and-whisker plot

10 16.4 The box-and-whisker plot
Boxplots are especially useful when we want to compare several datasets in a single, simple graphical display.

