Chapter 15: Exploratory data analysis: graphical summaries CIS 3033.


1 Chapter 15: Exploratory data analysis: graphical summaries CIS 3033

2 15.1 Example: the Old Faithful data Statistics: the collection, analysis, and interpretation of data. The set of observations is called a dataset. Assumption: the randomness in a dataset roughly follows a probability model. From data to model (the reverse of simulation). It is often necessary to condense the data for easy visual comprehension of general characteristics.

3 15.1 Example: the Old Faithful data

4 The durations (in seconds) of 272 eruptions of the Old Faithful geyser were collected. The variety in the lengths of the eruptions indicates that randomness is involved, but what can be said about the distribution? The mean of the data is 209.3. Putting the elements in order shows that they all lie in [96, 306], with 240 as the median. Such numerical summaries are covered in detail in the next chapter.

5 15.2 Histograms Graphical summary: group similar data and show their distribution visually.

6 15.2 Histograms One version of the histogram: the total area under the histogram is equal to 1, so the histogram can be seen as an approximation of the density function. Steps:
1. Divide the range of the data into bins (intervals), which usually (though not necessarily) have the same width.
2. The height of the histogram on a bin is (the number of elements in the bin) / [(the number of all elements) × (the width of the bin)].
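The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the function name `histogram_heights`, the sample values, and the choice of bins that are open on the left and closed on the right are all assumptions made for the example.

```python
def histogram_heights(data, edges):
    """Return density-scaled heights for the bins (edges[i], edges[i+1]]."""
    n = len(data)
    heights = []
    for left, right in zip(edges[:-1], edges[1:]):
        # Step 2: count / (number of all elements * bin width)
        count = sum(1 for x in data if left < x <= right)
        heights.append(count / (n * (right - left)))
    return heights


# Hypothetical sample (not the Old Faithful data) and equal-width bins.
sample = [1.2, 1.9, 2.5, 2.7, 3.1, 3.3, 3.8, 4.6]
edges = [1.0, 2.0, 3.0, 4.0, 5.0]

heights = histogram_heights(sample, edges)

# The bar areas (height * width) add up to 1, as for a density.
total_area = sum(h * (r - l) for h, l, r in zip(heights, edges[:-1], edges[1:]))
```

Dividing by the bin width is what makes the total area equal to 1 even when the bins do not all have the same width.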

7 15.2 Histograms Let r be a reference point smaller than the minimum of the dataset, and b the bin width; then the bins are $B_i = (r + (i-1)b,\ r + ib]$ for $i = 1, 2, \ldots, m$. We may let $m = 1 + 3.3 \log_{10}(n)$ or $b = 3.49\, s\, n^{-1/3}$, where n is the number of elements in the dataset and s is the sample standard deviation.
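As a rough sketch, the bin formulas on this slide translate directly into Python. The function names are hypothetical, and two details are assumptions not stated on the slide: the bin-count rule is returned unrounded (how to round it is left to the reader), and the number of bins in `make_bins` is chosen as the smallest m that covers the maximum of the data.

```python
import math
import statistics


def suggested_bin_count(n):
    """m = 1 + 3.3 * log10(n), returned unrounded."""
    return 1 + 3.3 * math.log10(n)


def suggested_bin_width(data):
    """b = 3.49 * s * n**(-1/3), with s the sample standard deviation."""
    s = statistics.stdev(data)
    return 3.49 * s * len(data) ** (-1 / 3)


def make_bins(data, r, b):
    """Bins B_i = (r + (i-1)b, r + ib] for i = 1..m, with r < min(data)."""
    if r >= min(data):
        raise ValueError("reference point r must be below the minimum")
    # Assumption: use the smallest m whose last bin still contains max(data).
    m = math.ceil((max(data) - r) / b)
    return [(r + (i - 1) * b, r + i * b) for i in range(1, m + 1)]


# Hypothetical data, reference point, and bin width.
bins = make_bins([1, 2, 3, 4, 5], r=0.5, b=2)
```

Because each bin is closed on the right, an observation that falls exactly on a bin boundary is counted in the bin to its left.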

8 15.3 Kernel density estimates Idea: “put a pile of sand” around each data element, so as to contribute to its neighborhood continuously.

9 15.3 Kernel density estimates The plot is constructed by choosing a kernel K and a bandwidth h. The kernel reflects the shape of the "piles of sand", whereas the bandwidth determines how wide the piles of sand will be. A kernel K typically satisfies the following conditions: (K1) K is a probability density function; (K2) K is symmetric around zero, i.e., K(u) = K(−u); (K3) K(u) = 0 for |u| > 1. Roughly, histograms can be seen as formed with uniform kernels on bins.
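To make the roles of K and h concrete, here is a small sketch. It assumes the usual kernel density estimate $f_{n,h}(t) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{t - x_i}{h}\right)$ and uses the Epanechnikov kernel as one example of a kernel satisfying (K1) to (K3). The slide text does not name a particular kernel, so that choice, the function names, and the sample values are all illustrative.

```python
def epanechnikov(u):
    """K(u) = 3/4 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1 - u * u) if abs(u) <= 1 else 0.0


def kde(t, data, h, kernel=epanechnikov):
    """Kernel density estimate at t: (1 / (n h)) * sum of K((t - x_i) / h)."""
    n = len(data)
    return sum(kernel((t - x) / h) for x in data) / (n * h)


def midpoint_integral(f, a, b, steps=10000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (k + 0.5) * width) for k in range(steps)) * width


# (K1): the kernel integrates to 1; (K2): symmetry; (K3): zero outside [-1, 1].
kernel_area = midpoint_integral(epanechnikov, -1.0, 1.0)

# Hypothetical data: each point contributes a "pile of sand" of width 2h.
points = [2.0, 3.0, 5.0]
estimate_area = midpoint_integral(lambda t: kde(t, points, 1.0), 0.0, 7.0)
```

Since each scaled kernel has area 1/n, the estimate itself integrates to 1, which is what allows it to be read as a density.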

10 15.3 Kernel density estimates

11 15.3 Kernel density estimates Three steps to construct a kernel density estimate: [figure: steps 1, 2, and 3]

12 15.3 Kernel density estimates Choice of the bandwidth: a bandwidth that is too small and one that is too large are both bad. A good choice: $h = 1.06\, s\, n^{-1/5}$, where n is the number of elements in the dataset and s is the sample standard deviation.
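The suggested bandwidth is simple to compute. The sketch below uses only the Python standard library; the function name `suggested_bandwidth` and the sample values are hypothetical.

```python
import statistics


def suggested_bandwidth(data):
    """h = 1.06 * s * n**(-1/5), with s the sample standard deviation."""
    s = statistics.stdev(data)  # sample standard deviation (n - 1 divisor)
    return 1.06 * s * len(data) ** (-1 / 5)


# Hypothetical data: 32 evenly spaced values, so n**(-1/5) is about 0.5.
values = [float(i) for i in range(32)]
h = suggested_bandwidth(values)
```

The factor n^(-1/5) shrinks slowly as n grows: with more data the piles of sand can be made narrower without the estimate becoming too rough.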

13 15.3 Kernel density estimates The choice of the kernel is less important, since different kernels may produce similar results. When a symmetric kernel is inappropriate, a boundary kernel can be used.

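14 15.4 The empirical distribution function The empirical cumulative distribution function of the data: $F_n(x) = \dfrac{\text{number of elements in the dataset} \le x}{n}$. For example, if the data are 4 3 9 1 7, then $F_n(x)$ is 0 for x < 1, 0.2 for 1 ≤ x < 3, 0.4 for 3 ≤ x < 4, 0.6 for 4 ≤ x < 7, 0.8 for 7 ≤ x < 9, and 1 for x ≥ 9.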
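The five-element example can be checked with a short sketch. It assumes the usual definition of the empirical distribution function, F_n(x) = (number of elements in the dataset that are at most x) / n; the function name `ecdf` and the evaluation points are chosen for illustration.

```python
def ecdf(data):
    """Return the empirical distribution function F_n of the dataset."""
    n = len(data)

    def F(x):
        # Fraction of observations less than or equal to x.
        return sum(1 for v in data if v <= x) / n

    return F


F = ecdf([4, 3, 9, 1, 7])

# F_n jumps by 1/5 at each of the ordered data points 1, 3, 4, 7, 9.
values_at = [F(x) for x in (0, 1, 3.5, 4, 8, 9)]
# values_at is [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
```

The function is a right-continuous step function: it equals 0 below the smallest observation and 1 from the largest observation onward.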

15 15.4 The empirical distribution function

16 15.5 Scatterplot In the case of two variables x and y, the dataset consists of pairs of observations: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. In a scatterplot, each pair is shown as a point.

17 15.5 Scatterplot

