# Plotting Multivariate Data

Harry R. Erwin, PhD, School of Computing and Technology, University of Sunderland


Resources

- Everitt, B. S., and G. Dunn (2001), Applied Multivariate Data Analysis, London: Arnold.
- Everitt, B. S. (2005), An R and S-PLUS® Companion to Multivariate Analysis, London: Springer.

Edward Tufte's Recommendations

- Show the data.
- Induce the viewer to think about the substance of the data.
- Avoid distorting what the data have to say.
- Present many numbers in a small space.
- Make large data sets coherent.
- Encourage comparison.
- Reveal the data at several levels of detail.
- Serve a clear purpose.
- Be closely integrated with the statistical and verbal descriptions of the data.

Source: Tufte, E. R. (2001), The Visual Display of Quantitative Information, Graphics Press.

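Tufte's Points

- Graphics reveal data.
- Graphics can be more precise and revealing than conventional statistics.
- Anscombe's data: Anscombe, F. J. (1973), "Graphs in Statistical Analysis", American Statistician, 27:17-21.
- All four data sets are described by the same linear model.

| I x | I y | II x | II y | III x | III y | IV x | IV y |
|----:|------:|----:|-----:|----:|------:|----:|------:|
| 10 | 8.04 | 10 | 9.14 | 10 | 7.46 | 8 | 6.58 |
| 8 | 6.95 | 8 | 8.14 | 8 | 6.77 | 8 | 5.76 |
| 13 | 7.58 | 13 | 8.74 | 13 | 12.74 | 8 | 7.71 |
| 9 | 8.81 | 9 | 8.77 | 9 | 7.11 | 8 | 8.84 |
| 11 | 8.33 | 11 | 9.26 | 11 | 7.81 | 8 | 8.47 |
| 14 | 9.96 | 14 | 8.10 | 14 | 8.84 | 8 | 7.04 |
| 6 | 7.24 | 6 | 6.13 | 6 | 6.08 | 8 | 5.25 |
| 4 | 4.26 | 4 | 3.10 | 4 | 5.39 | 19 | 12.50 |
| 12 | 10.84 | 12 | 9.13 | 12 | 8.15 | 8 | 5.56 |
| 7 | 4.82 | 7 | 7.26 | 7 | 6.42 | 8 | 7.91 |
| 5 | 5.68 | 5 | 4.74 | 5 | 5.73 | 8 | 6.89 |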

The Anscombe Graphics
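The figure from this slide is not reproduced in the transcript. As a minimal sketch, the four panels can be redrawn from the `anscombe` data frame in R's `datasets` package. Using that built-in copy is an assumption; the slides may have taken the data from elsewhere. A separate least-squares line is fitted to each data set so the near-identical coefficients can be compared with the very different scatterplots.

```r
# Anscombe's quartet ships with R as the data frame `anscombe`
# (columns x1..x4 and y1..y4). Using it here is an assumption.
fits <- lapply(1:4, function(i) {
  lm(anscombe[[paste0("y", i)]] ~ anscombe[[paste0("x", i)]])
})

# Collect the intercept and slope of each fit for comparison.
coefs <- t(sapply(fits, coef))
colnames(coefs) <- c("intercept", "slope")
print(round(coefs, 2))

# Draw the four scatterplots with their fitted lines in a 2 x 2 grid.
labels <- c("I", "II", "III", "IV")
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = "x", ylab = "y", main = paste("Data set", labels[i]),
       xlim = c(3, 20), ylim = c(2, 14))
  abline(fits[[i]], lwd = 2)
}
par(op)
```

The shared axis limits are a deliberate choice: they keep the four panels directly comparable.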

Ways of Looking at Data

- Scatterplots (demonstration)
- "The convex hull of bivariate data" (demonstration)
- Chiplot (demonstration)
- Bivariate boxplot (demonstration)

And More Multivariate Graphics

- Bivariate densities (demonstration)
- Other variables in a scatterplot (demonstration)
- Scatterplot matrix (demonstration of `pairs`)
- 3-D plots (demonstration)
- Conditioning plots (demonstration)

Demonstration

1. Launch R.
2. Set the working directory to Statistics/RSPCMA/Data.
3. Load the data: `airpoll <- source("chap2airpoll.dat")$value`
4. Review the exercises on pages 19-22.

Convex Hull of Bivariate Data

- Scatterplots often accompany the calculation of the correlation coefficient of two variables, where they are used to detect outliers.
- Convex hull trimming (dropping the points that lie on the convex hull before recomputing) gives a robust estimate of the correlation coefficient.
- Demonstration:
  - `attach(airpoll)`
  - `cor(SO2, Mortality)`

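Robust Estimation of the Correlation

- `hull <- chull(SO2, Mortality)` finds the indices of the points on the convex hull.
- `plot(SO2, Mortality, pch = 1)` draws the scatterplot.
- `polygon(SO2[hull], Mortality[hull], density = 15, angle = 30)` shades the hull.
- `cor(SO2[-hull], Mortality[-hull])` recomputes the correlation without the hull points.

The trimmed and untrimmed correlations are almost identical here, which is unusual.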
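The `airpoll` data file is not bundled with R, so the following is a self-contained sketch of the same convex hull trimming steps on stand-in data. It assumes the complete cases of the built-in `airquality` data frame as a substitute, and the variable choice (`Temp`, `Ozone`) is illustrative only.

```r
# Stand-in data (assumption): complete cases of the built-in `airquality`
# data frame, used only because `airpoll` is not bundled with R.
aq <- na.omit(airquality)
x <- aq$Temp
y <- aq$Ozone

# Correlation using every observation.
r_all <- cor(x, y)

# chull() returns the indices of the points lying on the convex hull.
hull <- chull(x, y)

# Show the hull on the scatterplot.
plot(x, y, pch = 1, xlab = "Temp", ylab = "Ozone")
polygon(x[hull], y[hull], density = 15, angle = 30)

# Correlation after removing the hull points (one round of trimming).
r_trimmed <- cor(x[-hull], y[-hull])

print(c(all = r_all, trimmed = r_trimmed))
```

A large gap between the two values would suggest that a few extreme points on the hull are driving the ordinary correlation.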

Chiplot

- A way of augmenting the scatterplot to spot dependence or independence.
- See Statistics/RSPCMA/functions.txt for the function definition.
- `chiplot(SO2, Mortality, vlabs = c("SO2", "Mortality"))`
- For independent data, the points will be scattered in a horizontal band centered on 0.
- Departure from independence is shown here by the absence of points from the band (-0.25, 0.25).
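To make the idea concrete, here is a minimal sketch of the quantities behind a chi-plot, assuming the usual Fisher and Switzer construction. It is not the `chiplot` function from functions.txt; `chi_plot_values` is a hypothetical name, and the stand-in `airquality` data are an assumption. For each point, the data are split at that point's coordinates and a signed, correlation-like measure `chi` is plotted against `lambda`, a measure of the point's distance from the centre of the data.

```r
# Hypothetical helper: chi and lambda values for a chi-plot.
chi_plot_values <- function(x, y) {
  n <- length(x)
  Fx <- Gy <- Hxy <- numeric(n)
  for (i in seq_len(n)) {
    # Proportions are taken over the other n - 1 points.
    xo <- x[-i]
    yo <- y[-i]
    Fx[i]  <- mean(xo <= x[i])
    Gy[i]  <- mean(yo <= y[i])
    Hxy[i] <- mean(xo <= x[i] & yo <= y[i])
  }
  S <- sign((Fx - 0.5) * (Gy - 0.5))
  chi <- (Hxy - Fx * Gy) / sqrt(Fx * (1 - Fx) * Gy * (1 - Gy))
  lambda <- 4 * S * pmax((Fx - 0.5)^2, (Gy - 0.5)^2)
  # Points at the extremes give 0/0 and are dropped.
  keep <- is.finite(chi)
  data.frame(lambda = lambda[keep], chi = chi[keep])
}

aq <- na.omit(airquality)  # stand-in data (assumption)
cp <- chi_plot_values(aq$Temp, aq$Ozone)

plot(cp$lambda, cp$chi, xlim = c(-1, 1), ylim = c(-1, 1),
     xlab = "lambda", ylab = "chi")
abline(h = 0, lty = 2)  # independent data scatter around this line
```

This sketch omits the control limits and the trimming of extreme `lambda` values that a full implementation would normally add.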

Bivariate Boxplot

- The two-dimensional analogue of the boxplot.
- A pair of concentric ellipses: the inner ellipse (the "hinge") holds half the data, and the outer ellipse (the "fence") identifies outliers.
- Regression lines of x on y and y on x are shown.
- `bvbox(cbind(SO2, Mortality), xlab = "SO2", ylab = "Mortality")`
- Cleaned up (more robust): `bvbox(cbind(SO2, Mortality), xlab = "SO2", ylab = "Mortality", method = "O")`

Bivariate Densities

- The goal of examining a scatterplot is to identify clusters and outliers.
- Humans are not particularly good at this, so graphical aids help.
- Superimposing a bivariate density estimate on the scatterplot is one such aid.
- A two-dimensional histogram is too rough for this purpose, so a smoother estimate is preferred.

Demo of Bivariate Density

- `den1 <- bivden(SO2, Mortality)`
- `persp(den1$seqx, den1$seqy, den1$den, xlab = "SO2", ylab = "Mortality", zlab = "Density", lwd = 2)`
- `plot(SO2, Mortality)`
- `contour(den1$seqx, den1$seqy, den1$den, lwd = 2, nlevels = 20, add = TRUE)`
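`bivden` comes from the course's function file and is not part of base R. As a hedged alternative, the same two views can be sketched with `kde2d` from the MASS package, a different kernel density estimator that ships with standard R installations. The stand-in `airquality` data and the 50 x 50 grid are assumptions for illustration.

```r
library(MASS)  # provides kde2d(); swapped in here for the slides' bivden()

aq <- na.omit(airquality)  # stand-in data (assumption), not the slides' airpoll

# Kernel density estimate on a 50 x 50 grid. The result is a list with
# components x and y (grid coordinates) and z (density values).
den <- kde2d(aq$Temp, aq$Ozone, n = 50)

# Perspective view of the estimated density surface.
persp(den$x, den$y, den$z, xlab = "Temp", ylab = "Ozone", zlab = "Density",
      theta = 30, phi = 30)

# Scatterplot with the density contours superimposed.
plot(aq$Temp, aq$Ozone, xlab = "Temp", ylab = "Ozone")
contour(den$x, den$y, den$z, nlevels = 20, lwd = 2, add = TRUE)
```

Contours over the raw points usually make clusters and isolated observations easier to see than the perspective surface does.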

Adding a Third Variable to the Scatterplot

The bubbleplot:

- Basic scatterplot: `plot(SO2, Mortality, pch = 1, lwd = 2, ylim = c(700, 1200), xlim = c(-5, 300))`
- Adding Rainfall to each point: `symbols(SO2, Mortality, circles = Rainfall, inches = 0.4, add = TRUE, lwd = 2)`
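A self-contained version of the bubble plot, sketched on the built-in `airquality` data as a stand-in (an assumption, since `airpoll` is not bundled with R). `symbols` treats the `circles` argument as radii, so this variant passes the square root of the third variable to make bubble area, not radius, proportional to it. That scaling is a design choice of this sketch and does not appear on the slide.

```r
aq <- na.omit(airquality)  # stand-in data (assumption)

# symbols() interprets `circles` as radii. Taking the square root makes
# the bubble AREA proportional to Wind (a choice made in this sketch).
radius <- sqrt(aq$Wind / pi)

# Basic scatterplot, then one circle per point encoding the third variable.
plot(aq$Temp, aq$Ozone, pch = 1, lwd = 2, xlab = "Temp", ylab = "Ozone")
symbols(aq$Temp, aq$Ozone, circles = radius, inches = 0.3,
        add = TRUE, lwd = 2)
```

The `inches` argument sets the size of the largest circle, and the others are scaled relative to it.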

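Scatterplot Matrix

- Basic matrix: `pairs(airpoll)`
- To add regression lines (least-squares fit solid, lowess smooth dashed): `pairs(airpoll, panel = function(x, y) { abline(lsfit(x, y)$coef, lwd = 2); lines(lowess(x, y), lty = 2, lwd = 2); points(x, y) })`
- For 3-D graphics, use `cloud` from the lattice package: `cloud(Mortality ~ SO2 + Rainfall)`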
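The same two displays as a runnable sketch on stand-in data. The `airquality` columns and the helper name `panel_fit` are assumptions made for illustration. In a script, a lattice plot object is drawn only when it is printed, which is why `print` is called explicitly.

```r
library(lattice)  # provides cloud()

aq <- na.omit(airquality)  # stand-in data (assumption)
vars <- aq[, c("Ozone", "Solar.R", "Wind", "Temp")]

# Hypothetical panel function: points plus a least-squares line (solid)
# and a lowess smooth (dashed) in every off-diagonal panel.
panel_fit <- function(x, y, ...) {
  points(x, y)
  abline(lsfit(x, y)$coef, lwd = 2)
  lines(lowess(x, y), lty = 2, lwd = 2)
}

pairs(vars, panel = panel_fit)

# 3-D scatterplot of one response against two explanatory variables.
p3d <- cloud(Ozone ~ Temp * Wind, data = aq)
print(p3d)
```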

Conditioning Plots

- Basic plot: `coplot(Mortality ~ SO2 | Popden)`
- To add a local regression fit: `coplot(Mortality ~ SO2 | Popden, panel = function(x, y, col, pch) panel.smooth(x, y, span = 1))`
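A conditioning plot draws one scatterplot per overlapping interval of the conditioning variable. The sketch below uses the built-in `airquality` data as a stand-in (an assumption) and also prints the intervals that `coplot` uses by default, as computed by `co.intervals`.

```r
aq <- na.omit(airquality)  # stand-in data (assumption)

# Overlapping intervals of the conditioning variable. The defaults
# (number = 6, overlap = 0.5) match those used by coplot().
iv <- co.intervals(aq$Wind)
print(iv)

# One Ozone-versus-Temp panel per Wind interval, each with a smooth fit.
coplot(Ozone ~ Temp | Wind, data = aq,
       panel = function(x, y, ...) panel.smooth(x, y, span = 1, ...))
```

If the relationship between the two plotted variables changes shape from panel to panel, that points to an interaction with the conditioning variable.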

Conclusions

- The purpose of graphics is to aid your intuition.
- Explore them: the appropriate graphics reflect your questions and the structure of the data.
- Next week: graphic presentations to avoid, because they mislead you and your audience.
- Look at the books by Edward Tufte in the library.
