# CIS 2033, based on the textbook: F.M. Dekking, C. Kraaikamp, H.P. Lopuhaä, L.E. Meester. A Modern Introduction to Probability and Statistics: Understanding Why and How


Instructor: Dr. Longin Jan Latecki

Chapter 15 Exploratory data analysis: graphical summaries

- The set of observations is called a dataset.
- By exploring the dataset we can gain insight into what probability model suits the phenomenon.
- To graphically represent univariate datasets, consisting of repeated measurements of one particular quantity, we discuss the classical histogram, the more recently introduced kernel density estimates, and the empirical distribution function.
- To represent a bivariate dataset, which consists of repeated measurements of two quantities, we use the scatterplot.

15.2 Histograms

The term histogram appears to have been used first by Karl Pearson.

Histogram construction and pdf

Denote a generic (univariate) dataset of size $n$ by $x_1, x_2, \ldots, x_n$. First we divide the range of the data into intervals. These intervals are called bins and are denoted by $B_1, B_2, \ldots, B_m$. The length of an interval $B_i$ is denoted by $|B_i|$ and is called the bin width. We want the area under the histogram on each bin $B_i$ to reflect the number of elements in $B_i$. Since the total area 1 under the histogram then corresponds to the total number of elements $n$ in the dataset, the area under the histogram on a bin $B_i$ is equal to the proportion of elements in $B_i$:

$$\frac{\text{number of } x_j \text{ in } B_i}{n}.$$

The height of the histogram on bin $B_i$ must therefore be equal to

$$\frac{\text{number of } x_j \text{ in } B_i}{n\,|B_i|}.$$

As we know from Ch. 13.4, the histogram approximates the pdf $f$. In particular, for a bin centered at point $a$, $B_a = (a-h, a+h]$, we have

$$f(a) \approx \frac{\text{number of } x_j \text{ in } (a-h, a+h]}{2hn}.$$
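The height rule above can be sketched in a few lines of Python (the slides themselves use Matlab). This is a minimal illustration, not the course's script: the function name `histogram_heights` and the small dataset are made up for the example, and bins are taken to be half-open intervals $(a, b]$.

```python
def histogram_heights(data, edges):
    """Return the histogram height on each bin (edges[i], edges[i+1]].

    Height = (number of data points in the bin) / (n * bin width),
    so that the total area under the histogram is 1 when every
    data point falls into some bin.
    """
    n = len(data)
    heights = []
    for left, right in zip(edges, edges[1:]):
        count = sum(1 for x in data if left < x <= right)
        heights.append(count / (n * (right - left)))
    return heights


# Hypothetical sample, chosen only for illustration.
data = [1.2, 1.9, 2.4, 3.1, 3.3, 3.8, 4.6, 5.0]
edges = [1.0, 2.0, 3.0, 4.0, 5.0]

heights = histogram_heights(data, edges)
# Area on each bin is height * width; the areas should sum to 1.
area = sum(h * (b - a) for h, a, b in zip(heights, edges, edges[1:]))
```

Because all eight points lie inside the four bins, `area` comes out as 1, which matches the requirement that the total area under the histogram corresponds to the whole dataset.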

The function g (blue) is a mixture of two Gaussians. We draw 200 samples from it, which are shown as blue dots. We use the samples to generate the histogram (yellow) and its kernel density estimate f (red). The Matlab script is twoGaussKernelDensity1.m. In Matlab: `binwidth=0.5; bincenters=[0.5:binwidth:9.5]; hx=hist(x,bincenters)/(200*binwidth);`

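Choice of the bin width

Consider a histogram with bins of equal width. In that case the bins are of the form

$$B_i = (r + (i-1)b,\; r + ib] \quad \text{for } i = 1, 2, \ldots, m,$$

where $r$ is some reference point smaller than the minimum of the dataset and $b$ denotes the bin width. There is no single best choice of $b$ or of the number of bins $m$. Mathematical research, however, has provided some guidelines for a data-based choice of $b$ or $m$, where $s$ is the sample standard deviation:

$$m = 1 + 3.3 \log_{10}(n) \quad \text{or} \quad b = 3.49\, s\, n^{-1/3}.$$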
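As a rough sketch of how such guidelines might be applied, the Python snippet below computes a data-based bin width and bin count. It assumes the two commonly quoted rules $b = 3.49\, s\, n^{-1/3}$ and $m = 1 + 3.3 \log_{10}(n)$. The function names are hypothetical, and `statistics.stdev` is used for the sample standard deviation $s$.

```python
import math
import statistics


def rule_bin_width(data):
    """Bin width b = 3.49 * s * n^(-1/3), with s the sample standard deviation."""
    s = statistics.stdev(data)  # divides by n - 1
    return 3.49 * s * len(data) ** (-1 / 3)


def rule_bin_count(n):
    """Number of bins m = 1 + 3.3 * log10(n); round it as needed in practice."""
    return 1 + 3.3 * math.log10(n)


def equal_width_bins(r, b, m):
    """Bins (r + (i-1)b, r + ib] for i = 1..m, returned as (left, right) pairs."""
    return [(r + (i - 1) * b, r + i * b) for i in range(1, m + 1)]


# Hypothetical data, for illustration only.
sample = [1, 2, 3, 4, 5, 6, 7, 8]
b = rule_bin_width(sample)
m = rule_bin_count(1000)
bins = equal_width_bins(0.0, 0.5, 3)
```

Both rules give a starting point only. It is usually worth plotting histograms for a few nearby values of `b` to see whether the picture changes.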

15.3 Kernel density estimates

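A kernel $K$ is a function $K: \mathbb{R} \to \mathbb{R}$. A kernel $K$ typically satisfies the following conditions:

- (K1) $K$ is a probability density, i.e., $K(u) \ge 0$ and $\int_{-\infty}^{\infty} K(u)\,du = 1$;
- (K2) $K$ is symmetric around zero, i.e., $K(u) = K(-u)$;
- (K3) $K(u) = 0$ for $|u| > 1$.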

Examples of Kernel Construction

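Scaling the kernel K

Scale the kernel $K$ into the function

$$t \mapsto \frac{1}{h} K\!\left(\frac{t}{h}\right),$$

where $h > 0$ is the bandwidth. Then put a scaled kernel around each element $x_i$ in the dataset:

$$t \mapsto \frac{1}{h} K\!\left(\frac{t - x_i}{h}\right).$$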
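Averaging the scaled kernels placed around the data points gives the usual kernel density estimate $f_{n,h}(t) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{t - x_i}{h}\right)$. The following Python sketch illustrates this. It assumes the Epanechnikov kernel $K(u) = \tfrac{3}{4}(1 - u^2)$ on $[-1, 1]$ as one possible choice, and the function names and sample values are made up for the example.

```python
def epanechnikov(u):
    """Epanechnikov kernel: 3/4 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1 - u * u) if abs(u) <= 1 else 0.0


def kernel_density_estimate(data, h, kernel=epanechnikov):
    """Return the function t -> (1 / (n h)) * sum_i K((t - x_i) / h)."""
    n = len(data)

    def f(t):
        return sum(kernel((t - x) / h) for x in data) / (n * h)

    return f


# Hypothetical dataset and bandwidth, for illustration only.
points = [1.0, 2.5, 4.0]
f = kernel_density_estimate(points, h=1.0)

# With a single point at 0 and h = 2, the estimate at 0 is K(0) / 2.
single = kernel_density_estimate([0.0], h=2.0)
value_at_zero = single(0.0)
```

A small `h` produces a spiky estimate with one bump per data point, while a large `h` smooths the bumps together. The bandwidth plays the same role here as the bin width does for the histogram.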

Figure panels: "The bandwidth is too small" and "The bandwidth is too big".


15.4 The empirical distribution function

Another way to graphically represent a dataset is to plot the data in a cumulative manner. This can be done by using the empirical cumulative distribution function.
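The empirical distribution function is usually defined as $F_n(x) = \frac{\text{number of elements in the dataset} \le x}{n}$. A minimal Python sketch of that definition follows. The name `empirical_cdf` and the sample values are illustrative assumptions, not taken from the slides.

```python
import bisect


def empirical_cdf(data):
    """Return F_n, where F_n(x) = (number of data points <= x) / n."""
    sorted_data = sorted(data)
    n = len(sorted_data)

    def F(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect.bisect_right(sorted_data, x) / n

    return F


# Hypothetical dataset, for illustration only.
F = empirical_cdf([3, 1, 4, 1, 5])
values = [F(0), F(1), F(3.5), F(5)]
```

Plotted against `x`, `F` is a step function that jumps by $1/n$ at each data point (by $2/n$ at the repeated value 1 in this sample).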


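Example 15.6. Given the following information about a histogram, compute the value of the empirical distribution function at the point t = 7. (By: Wanwisa Smith)

| Bin | Height |
| --- | --- |
| (0, 2] | 0.245 |
| (2, 4] | 0.130 |
| (4, 7] | 0.050 |
| (7, 11] | 0.020 |
| (11, 15] | 0.005 |

Because (2 - 0) * 0.245 + (4 - 2) * 0.130 + (7 - 4) * 0.050 + (11 - 7) * 0.020 + (15 - 11) * 0.005 = 1, there are no data points outside the listed bins. Hence $F_n(7)$ equals the total area of the bins up to 7:

$$F_n(7) = (2 - 0) \cdot 0.245 + (4 - 2) \cdot 0.130 + (7 - 4) \cdot 0.050 = 0.49 + 0.26 + 0.15 = 0.90.$$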
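The arithmetic in this example can be checked with a short Python sketch. The bin edges and heights are the ones appearing in the computation above. The helper name `ecdf_at_bin_edge` is hypothetical, and it is only valid when `t` is a right endpoint of one of the bins.

```python
# (left, right, height) for each bin (left, right] of the histogram.
bins = [
    (0, 2, 0.245),
    (2, 4, 0.130),
    (4, 7, 0.050),
    (7, 11, 0.020),
    (11, 15, 0.005),
]


def ecdf_at_bin_edge(bins, t):
    """F_n(t) for t at a bin edge: total area of all bins lying at or below t."""
    return sum((right - left) * height for left, right, height in bins if right <= t)


total_area = sum((right - left) * height for left, right, height in bins)
F_at_7 = ecdf_at_bin_edge(bins, 7)
```

Here `total_area` is 1 up to floating-point rounding, confirming that no data lie outside the listed bins, and `F_at_7` is approximately 0.90.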

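Relation between histogram and empirical cdf

Exercise 15.11. Given is a histogram and the empirical distribution function $F_n$ of the same dataset. Show that the height of the histogram on a bin $(a, b]$ is equal to

$$\frac{F_n(b) - F_n(a)}{b - a}.$$

(By: Wanwisa Smith)

The height of the histogram on a bin $B_i = (a, b]$ is

$$\frac{\text{number of } x_j \text{ in } (a, b]}{n\,(b - a)}.$$

The number of $x_j$ in $(a, b]$ is the number of $x_j \le b$ minus the number of $x_j \le a$, that is, $n F_n(b) - n F_n(a)$. Hence the height equals

$$\frac{n F_n(b) - n F_n(a)}{n\,(b - a)} = \frac{F_n(b) - F_n(a)}{b - a}.$$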
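The identity can also be checked numerically. The Python sketch below computes the height on a bin $(a, b]$ in two ways, once by counting points and once from the empirical distribution function. The function names and the dataset are assumptions made for the illustration.

```python
import bisect


def ecdf(data):
    """Return F_n with F_n(x) = (number of data points <= x) / n."""
    sorted_data = sorted(data)
    n = len(sorted_data)
    return lambda x: bisect.bisect_right(sorted_data, x) / n


def height_from_counts(data, a, b):
    """Histogram height on (a, b]: count / (n * (b - a))."""
    count = sum(1 for x in data if a < x <= b)
    return count / (len(data) * (b - a))


def height_from_ecdf(data, a, b):
    """Same height expressed as (F_n(b) - F_n(a)) / (b - a)."""
    F = ecdf(data)
    return (F(b) - F(a)) / (b - a)


# Hypothetical dataset, for illustration only.
data = [0.5, 1.5, 1.7, 2.2, 2.9, 3.0, 4.1, 4.8]
h_counts = height_from_counts(data, 1, 3)
h_ecdf = height_from_ecdf(data, 1, 3)
```

Five of the eight points fall in $(1, 3]$, so both expressions give $5 / (8 \cdot 2) = 0.3125$.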

15.5 Scatterplot

In some situations we might want to investigate the relationship between two or more variables. In the case of two variables $x$ and $y$, the dataset consists of pairs of observations: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. We call such a dataset a bivariate dataset, in contrast to a univariate dataset. The plot of the points $(x_i, y_i)$ for $i = 1, 2, \ldots, n$ is called a scatterplot.
