Descriptive and exploratory statistics Garib Murshudov
Contents
1) Introduction
2) Location
3) Spread
4) Various plots: plots depend on data
5) Histograms, cumulative distributions and all that
Purpose of descriptive statistics and various plots
- maximize insight into a data set
- uncover underlying structure
- extract important variables
- detect outliers and anomalies
- test underlying assumptions
In general: build intuition about the data and the problem.
Descriptive statistics
There are two types of simple numerical descriptors of a data set:
1) Values describing location: mean, median, mode
2) Values describing spread: variance, interquartile range
Descriptive statistics
Additional descriptive numerical values:
Skewness is a measure of the asymmetry of the distribution. Positive skewness: the right tail is fatter. Negative skewness: the left tail is fatter.
Kurtosis is a measure of tail weight. Positive kurtosis: heavier tails than the normal distribution. Negative kurtosis: lighter tails than the normal distribution.
Skewness and kurtosis are both zero for the normal distribution (kurtosis here is excess kurtosis, i.e. measured relative to the normal distribution).
[Figure: two example distributions, one with skewness = 1.18 and kurtosis = 2.33, the other with skewness = -1.18 and kurtosis = 2.33.]
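The two quantities above can be sketched in a few lines of Python. This is a minimal illustration that assumes the simple moment-based definitions (third and fourth central moments scaled by powers of the second); the slides do not say which estimator was used for the plotted values, and the function names are hypothetical.

```python
from statistics import fmean


def central_moment(data, k):
    """k-th central moment of the sample (divides by n, not n - 1)."""
    mu = fmean(data)
    return fmean([(x - mu) ** k for x in data])


def skewness(data):
    """Moment-based skewness: m3 / m2^(3/2). Zero for symmetric data."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def excess_kurtosis(data):
    """Moment-based excess kurtosis: m4 / m2^2 - 3. Zero for a normal distribution."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3.0


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 1, 10]          # one large value: fat right tail
left_tailed = [-v for v in right_tailed]  # mirror image: fat left tail

print(skewness(symmetric), skewness(right_tailed), skewness(left_tailed))
print(excess_kurtosis(symmetric))
```

Mirroring the data flips the sign of the skewness but leaves the kurtosis unchanged, which matches the two panels described on the slide.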
Location
The simplest information about a data set is its location. There are three different location parameters: average, median and mode.
1) Average = sum(data)/Ndata
2) Median: the value for which the proportion of data above it equals the proportion below it
3) Mode: the most frequently occurring data value
Example: mean = 9.91, median = 9.92, mode = 9.98
Location
The average is very sensitive to a few outliers. If we change one data value arbitrarily, we can change the average substantially. The median, however, is hardly affected.
Example:
13.2 8.2 10.9 14.3 10.7 6.6 9.5 10.8 8.8 13.3 - Av = 10.63, median = 10.75
13.2 8.2 10.9 74.3 10.7 6.6 9.5 10.8 8.8 13.3 - Av = 16.63, median = 10.75
The breakdown point of the average is 0; the breakdown point of the median is 0.5, i.e. you have to change 50% of the data dramatically to affect the median.
The median is the most robust estimator of location.
The average is the most convenient estimator, with nice properties.
If the sample is small, it may be impossible to estimate the mode.
Breakdown point (Wiktionary): The number or proportion of arbitrarily large or small extreme values that must be introduced into a batch or sample to cause the estimator to yield an arbitrarily large result.
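The example above can be reproduced with Python's standard `statistics` module. This is only a sketch of the comparison on the slide; the variable names are illustrative.

```python
from statistics import mean, median

clean = [13.2, 8.2, 10.9, 14.3, 10.7, 6.6, 9.5, 10.8, 8.8, 13.3]
# Replace the single value 14.3 by the outlier 74.3, as on the slide.
corrupted = [74.3 if v == 14.3 else v for v in clean]

for label, sample in (("clean", clean), ("one outlier", corrupted)):
    print(f"{label}: mean = {mean(sample):.2f}, median = {median(sample):.2f}")
```

One corrupted value out of ten moves the mean by six units while the median stays where it was.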
Simpson’s paradox: batting averages
One should be careful in dealing with averages. The most famous paradox related to averages is Simpson’s paradox.

                 Runs   Outs   Average
1st Ashes   MW    270      6     45.00
            SW    500     10     50.00
2nd Ashes   MW    700     10     70.00
            SW    320      4     80.00
Total       MW    970     16     60.63
            SW    820     14     58.57

MW: Mark Waugh, SW: Steve Waugh
SW has the higher average in each series, yet MW has the higher average over both series combined.
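The reversal can be checked directly from the runs and outs in the table. The sketch below uses only those numbers; the dictionary layout and names are illustrative.

```python
# (runs, outs) per player and series, taken from the table above.
scores = {
    "MW": {"1st Ashes": (270, 6), "2nd Ashes": (700, 10)},
    "SW": {"1st Ashes": (500, 10), "2nd Ashes": (320, 4)},
}


def batting_average(runs, outs):
    return runs / outs


def series_average(player, series):
    return batting_average(*scores[player][series])


def total_average(player):
    runs = sum(r for r, _ in scores[player].values())
    outs = sum(o for _, o in scores[player].values())
    return batting_average(runs, outs)


for series in ("1st Ashes", "2nd Ashes"):
    print(series, series_average("MW", series), series_average("SW", series))
print("Total", total_average("MW"), total_average("SW"))
```

The paradox arises because the pooled average weights each series by its number of outs, and the two players have very different weights: MW's total is dominated by his strong second series, SW's by his weaker first one.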
Spread
There are two main indicators of the spread of a data set:
1) Standard deviation, sd = sqrt(var). It is the usual indicator of spread and is very easy to calculate, but it is not robust to outliers. One outlier is sufficient to corrupt the standard deviation.
2) Interquartile range (IQR): 50% of the data lie between the first and third quartiles. This indicator is more robust. You need to corrupt 25% of the data to corrupt the IQR.
[Figure: black vertical lines mark the quartiles; blue vertical lines mark mean + sd and mean - sd.]
Spread: robustness
The standard deviation is very sensitive to a few outliers. If we change one data value arbitrarily, we can change the standard deviation substantially. The IQR, however, is hardly affected.
Example:
13.2 8.2 10.9 14.3 10.7 6.6 9.5 10.8 8.8 13.3 - sd = 2.45, IQR = 3.65
13.2 8.2 10.9 74.3 10.7 6.6 9.5 10.8 8.8 13.3 - sd = 20.37, IQR = 3.65
The breakdown point of sd is 0; the breakdown point of IQR is 0.25, i.e. you have to change at least 25% of the data dramatically to affect the IQR.
The IQR is much more robust than sd.
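The sd and IQR values above can be reproduced with the standard library. The sketch assumes the sample standard deviation (divisor n - 1) and quartiles computed by linear interpolation between order statistics, which is what `statistics.quantiles(..., method="inclusive")` does; these choices reproduce the numbers on the slide. The helper name `iqr` is illustrative.

```python
from statistics import quantiles, stdev

clean = [13.2, 8.2, 10.9, 14.3, 10.7, 6.6, 9.5, 10.8, 8.8, 13.3]
corrupted = [74.3 if v == 14.3 else v for v in clean]


def iqr(data):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    return q3 - q1


for label, sample in (("clean", clean), ("one outlier", corrupted)):
    print(f"{label}: sd = {stdev(sample):.2f}, IQR = {iqr(sample):.2f}")
```

The outlier sits above the third quartile both before and after the change, so the quartiles, and hence the IQR, do not move at all.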
Tukey’s five number summaries
One of the important books on statistical data analysis is:
Tukey, JW. (1977) Exploratory data analysis
After this book there was an explosion of exploratory data analysis, i.e. visualisation of data sets and modelling based on visual analysis. One of the suggestions in this book is the five number summary of a data set. Essentially these numbers are (although slightly different numbers are suggested in Tukey’s book): minimum, 1st quartile, median, 3rd quartile, maximum.
These numbers (together with the mean) are calculated by R with the command summary. For example:
A = 13.2 8.2 10.9 14.3 10.7 6.6 9.5 10.8 8.8 13.3
summary(A)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  6.600   8.975  10.750  10.630  12.620  14.300
Various plots
In general, data visualisation depends on the type of data and the system it comes from. For some data sets a few general plots can be suggested. These include:
1) Box and whisker plot (boxplot)
2) Histograms
3) Cumulative distribution plots
4) QQ plots
Boxplots
Boxplots are a convenient way to visualise one-dimensional data. A boxplot shows the minimum, maximum, first quartile, median and third quartile: a visual representation of the five number summary. The plot can indicate whether the distribution of the data is symmetric. It may also indicate outliers, i.e. points that are too different from the others, e.g. points outside the interval median ± 2*IQR. (The convention used by default in R's boxplot is to flag points more than 1.5*IQR beyond the quartiles.)
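The outlier rule mentioned above can be written down explicitly. The sketch below implements the slide's example rule (points outside median ± 2*IQR); the function name `flag_outliers` and the parameter `k` are hypothetical, and the quartiles are again computed by linear interpolation.

```python
from statistics import median, quantiles


def flag_outliers(data, k=2.0):
    """Return the points lying outside the interval median ± k * IQR."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    spread = q3 - q1
    centre = median(data)
    low, high = centre - k * spread, centre + k * spread
    return [x for x in data if x < low or x > high]


clean = [13.2, 8.2, 10.9, 14.3, 10.7, 6.6, 9.5, 10.8, 8.8, 13.3]
corrupted = [74.3 if v == 14.3 else v for v in clean]

print(flag_outliers(clean))
print(flag_outliers(corrupted))
```

Because both the median and the IQR are robust, the interval is the same for the clean and the corrupted sample, so the corrupted value is the only one flagged.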
Side by side boxplot
Boxplots can be used for visual comparison of data, e.g. the effects of different treatments.
[Figure: side by side boxplots showing the effect of different insecticides.]
Boxplots
Boxplots are just schematic plots. Sometimes they may mask some of the features of the data. The classic example is Lord Rayleigh’s data on measurements of the density of nitrogen derived from different sources, which led to the discovery of argon.
Rayleigh was led into the investigation by small anomalies he found in measurements of the density of nitrogen purified by different methods. Those different methods led to different quantities of nitrogen, and thus to different proportions of nitrogen and a hitherto unsuspected atmospheric gas. Argon was the first noble gas isolated. Ramsay's subsequent work isolated helium and discovered neon, krypton, and xenon by the end of the century. Ramsay and Rayleigh were awarded Nobel Prizes in 1904. Rayleigh was awarded the physics prize for argon, while Ramsay was awarded the chemistry prize for argon and the family of noble gases.
Boxplots
Looking at the boxplot alone, we do not see any peculiarity in the data. However, one can notice that the whiskers are very close to the edges of the box, i.e. the minimum and maximum are close to the first and third quartiles respectively. When you see that, you should be suspicious about the data.
If we put a scatter (dot) plot side by side with the boxplot, we see peculiar behaviour: there seem to be two classes.
Let us draw separate boxplots for the different sources of nitrogen. There are definitely two classes: one derived from air and another from other sources.
InsectSprays revisited
If we put a scatter plot side by side with the boxplot of the InsectSprays data, we see some peculiarity for spray F. I do not know the reason, but it may be interesting to investigate if you see something like that in your data.
Histograms
Histograms are a good way of visualising 1D data (there are higher dimensional versions too). If there are enough data points, a histogram may indicate the potential distribution, multimodality and skewness. The number of bins matters: too many bins give a very noisy histogram, too few bins can mask important features.
[Figure: three histograms with Nbin = 5, Nbin = 500 and Nbin = 50.]
Reference: Scott DW, Multivariate Density Estimation
Cumulative frequency (probability) plot
Histograms represent the density of a probability distribution. To plot a histogram we divide the range of the data into bins and count the number of data points in each bin: for bin number i we count the observations y obeying x_i ≤ y < x_{i+1}, where x_i and x_{i+1} are the bin boundaries. Each bin may contain only a small number of data points, so the counts can vary a lot, resulting in noisy histograms.
Cumulative frequency (probability) plots are another way of representing data. In this case we count the number of data points below a given point (all y for which y < x_i). The number of data points counted becomes larger and larger as x_i approaches the maximum value of the data.
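The two counting rules can be sketched as follows. The bin boundaries below are chosen purely for illustration, and the function name `histogram_counts` is hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def histogram_counts(data, edges):
    """Count observations y with edges[i] <= y < edges[i + 1] for each bin i."""
    counts = [0] * (len(edges) - 1)
    for y in data:
        i = bisect_right(edges, y) - 1  # index of the last edge that is <= y
        if 0 <= i < len(counts):        # ignore values outside the bin range
            counts[i] += 1
    return counts


data = [13.2, 8.2, 10.9, 14.3, 10.7, 6.6, 9.5, 10.8, 8.8, 13.3]
edges = [6, 8, 10, 12, 14, 16]  # illustrative bin boundaries x_0 .. x_5

counts = histogram_counts(data, edges)
# Running total: number of observations below each upper bin boundary.
cumulative = list(accumulate(counts))

print(counts)
print(cumulative)
```

Each cumulative value pools all the bins to its left, so it is based on more data points than any single bin count, which is why cumulative plots look smoother than histograms.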
Cumulative frequency plots
Cumulative distributions may indicate whether the data points have a normal distribution, a heavy tail or some other peculiarity. These plots can also help to select an appropriate distribution. However, they are hard to interpret on their own.
One way of comparing two distributions is to draw them on the same plot. To do this we need at least to standardise the data. Even after standardisation the range of the data can be very different.
Data standardisation: y = (x - mean(x))/sd(x)
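A minimal sketch of the standardisation formula above, assuming sd is the sample standard deviation; the function name `standardise` is illustrative.

```python
from statistics import fmean, stdev


def standardise(x):
    """Return y = (x - mean(x)) / sd(x) for every element of x."""
    mu, sd = fmean(x), stdev(x)
    return [(v - mu) / sd for v in x]


data = [13.2, 8.2, 10.9, 14.3, 10.7, 6.6, 9.5, 10.8, 8.8, 13.3]
z = standardise(data)

print(fmean(z), stdev(z))
```

After standardisation every data set has mean 0 and standard deviation 1, so two cumulative distributions can be drawn on a common scale.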
QQ plots
Quantile-quantile plots are useful for testing distributional assumptions. They can indicate whether two data sets come from the same distribution and, if so, help to find the linear transformation from one to the other.
Mathematically: let us say that X has the cumulative distribution function (CDF) F(x) and Y has the CDF G(y). Then by solving
G(y) = F(x)
y = G^{-1}(F(x))
we find the relationship between y and x. In other words, QQ plots show how random variables can be converted from one to another.
For example, if x is from the exponential distribution, F(x) = 1 - exp(-lambda x), and y is from the uniform distribution on the interval (a,b), G(y) = (y-a)/(b-a), then we need to solve:
(y-a)/(b-a) = 1 - exp(-lambda x)
y = b - (b-a) exp(-lambda x)
If the QQ plot looks like an exponential function, the two variables may have this particular relationship.
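The exponential-to-uniform example can be checked numerically. The sketch below composes the exponential CDF with the inverse uniform CDF and compares the result with the closed form derived above; the parameter values and function names are chosen only for illustration.

```python
import math


def exp_cdf(x, lam):
    """F(x) = 1 - exp(-lambda x), the exponential CDF."""
    return 1.0 - math.exp(-lam * x)


def unif_quantile(p, a, b):
    """G^{-1}(p) = a + (b - a) p, the inverse CDF of the uniform distribution on (a, b)."""
    return a + (b - a) * p


def exp_to_unif(x, lam, a, b):
    """y = G^{-1}(F(x)): the uniform quantile matching the exponential value x."""
    return unif_quantile(exp_cdf(x, lam), a, b)


lam, a, b = 0.5, 2.0, 5.0  # illustrative parameters
for x in (0.0, 1.0, 4.0):
    closed_form = b - (b - a) * math.exp(-lam * x)
    print(x, exp_to_unif(x, lam, a, b), closed_form)
```

The mapping starts at a for x = 0 and approaches b as x grows, which is the saturating exponential curve one would see in the QQ plot.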
QQ plots
Example: uniform and exponential distributions
[Figure: empirical and theoretical QQ plots.]
QQ norm
QQnorm is a special case of the QQ plot: a quantile-quantile plot against the normal distribution. QQnorm can already indicate some properties of the data.
1) Outliers
[Figure panels: Normal; Too small value; Too large value.]
QQ norm
2) Bimodality, skewness (note that a small curvature for small and large values can be expected)
[Figure panels: Normal; Bimodal (curvature); Skewed to left (convex).]
QQ norm
3) Heavier tail
[Figure panels: Normal; t distribution, df = 3 (heavier tails).]
QQ norm
If the distributions of two random variables have the same form, we can derive the linear transformation from one data set to the other.
QQ plot: one data set against the other. The slope and intercept of the line give the linear transformation needed: y = a + bx, here with a = 3, b = 10.
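The idea can be sketched by pairing the sorted values of two samples (the points of a QQ plot) and fitting a straight line through them. The sketch assumes both samples have the same size, so the sorted values can be paired directly, and uses ordinary least squares for the line; the slides do not specify a fitting method, and `qq_line` is a hypothetical name.

```python
from statistics import fmean


def qq_line(x, y):
    """Fit y = a + b x through the QQ points (sorted x against sorted y).

    Assumes len(x) == len(y). Returns (a, b).
    """
    xs, ys = sorted(x), sorted(y)
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((u - mx) ** 2 for u in xs)
    sxy = sum((u - mx) * (v - my) for u, v in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return a, b


x = [13.2, 8.2, 10.9, 14.3, 10.7, 6.6, 9.5, 10.8, 8.8, 13.3]
# Second sample: the same distribution shifted and scaled (a = 3, b = 10),
# listed in a different order to show that only the quantiles matter.
y = [3 + 10 * v for v in reversed(x)]

a, b = qq_line(x, y)
print(a, b)
```

With real data the QQ points only lie approximately on a line, and the fitted intercept and slope are then estimates of the transformation.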
Conclusions
- Average and variance are the usual measures of location and spread of the data. However, they are not robust. Median and IQR are more robust.
- A boxplot is a good way of summarising data; however, it might mask features of the data.
- A QQ plot can be used to check distributional assumptions.
References
Tukey, JW. (1977) Exploratory data analysis
Scott, DW. Multivariate Density Estimation