Published by Gavin Ball. Modified about 1 year ago.

1
Multivariate data visualization

2
multivariate data
Graphical display
Data graphics visually display measured quantities by means of the combined use of points, lines, a coordinate system, numbers, symbols, words, shading and color.
“Humans are good at discerning subtle patterns that are really there, but equally so at imagining them when they are altogether absent.” (Carl Sagan)
“There is no statistical tool that is as powerful as a well-chosen graph.” (Chambers, Cleveland, Kleiner, and Tukey)

3
Graphical display
Advantages of graphical methods:
- In comparison with other types of presentation, well-designed charts are more effective in creating interest and in appealing to the attention of the reader.
- Visual relationships as portrayed by charts and graphs are more easily grasped and more easily remembered.
- The use of charts and graphs saves time, since the essential meaning of large amounts of statistical data can be visualized at a glance.
- Charts and graphs provide a comprehensive picture of a problem that makes for a more complete and better balanced understanding than could be derived from tabular or textual forms of presentation.
- Charts and graphs can bring out hidden facts and relationships and can stimulate, as well as aid, analytical thinking and investigation.

4
Graphical display
Goals for graphical displays of data:
- To provide an overview
- To tell a story
- To suggest hypotheses
- To criticize a model

5
Air pollution data
Air pollution in 41 cities in the USA. R data “USairpollution”.
Variables:
- SO2: SO2 content of air in micrograms per cubic meter
- temp: average annual temperature in degrees Fahrenheit
- manu: number of manufacturing enterprises employing 20 or more workers
- popul: population size (1970 census) in thousands
- wind: average annual wind speed in miles per hour
- precip: average annual precipitation in inches
- predays: average number of days with precipitation per year
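The code on the following slides reads the data from a local CSV file and refers to the city names as the first column (`USairpollution[,1]`, `City`). A copy of these data is also distributed with the HSAUR2 package, where, as far as we know, the city names are stored as row names rather than as a column. The sketch below is one possible way to bring the packaged copy into the layout the slides assume. The helper `with_city_column` is hypothetical and not part of the slides or of any package.

```r
# Hypothetical helper (not from the slides): move row names into a leading
# "City" column, matching the layout the slides' CSV file appears to use.
with_city_column <- function(df) {
  data.frame(City = rownames(df), df, row.names = NULL,
             stringsAsFactors = FALSE)
}

# Use the packaged copy of the data only if HSAUR2 happens to be installed.
if (requireNamespace("HSAUR2", quietly = TRUE)) {
  data("USairpollution", package = "HSAUR2")
  USairpollution <- with_city_column(USairpollution)
  str(USairpollution)
}
```

With the data in this shape, `USairpollution[,-1]` drops the city names and leaves the seven measured variables.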

6
Scatterplot
The relational graphic is the greatest of all graphical designs. It links at least two variables, encouraging the viewer to assess the possible causal relationship between the plotted variables.
The scatterplot is the standard for representing continuous bivariate data, but it can be enhanced in a variety of ways to accommodate information about other variables.
Build a basic scatterplot of the two variables “manu” and “popul” in the USairpollution data.

7
Scatterplot
# read the data USairpollution:
> USairpollution=read.csv("E:/Multivariate_analysis/Data/USairpollution.csv",header=T)
# set up the labels for the two variables:
> mlab="Manufacturing enterprises with 20 or more workers"
> plab="Population size (1970 census) in thousands"
# plot popul vs. manu:
> plot(popul~manu,data=USairpollution,xlab=mlab,ylab=plab)

8
Scatterplot
[Figure: scatterplot of popul against manu]

9
Scatterplot
We construct a scatterplot that shows the marginal distributions of manu and popul in two ways:
1. Marginal distributions as rug plots on each axis:
> plot(popul~manu,data=USairpollution,xlab=mlab,ylab=plab)
> rug(USairpollution$manu,side=1)
> rug(USairpollution$popul,side=2)

10
Scatterplot
[Figure: scatterplot of popul against manu, with the rugs marked on the axes]

11
Scatterplot
2. Marginal distribution of manu as a histogram and that of popul as a boxplot:
# set the plot margins
> par(mar=c(2,2,2,2))
# set the limits of the x axis
> xlim=with(USairpollution,range(manu))*1.1
# arrange the three plots: scatterplot top left, histogram below it, boxplot to its right
> layout(matrix(c(1,3,2,0),nrow=2,byrow=TRUE),widths=c(2,1),heights=c(2,1),respect=TRUE)

12
Scatterplot
# plot popul vs manu
> plot(popul~manu,data=USairpollution,cex.lab=0.7,xlab=mlab,ylab=plab,xlim=xlim)
# add labels for the points on the scatterplot
> with(USairpollution,text(manu,popul,cex=0.6,labels=abbreviate(City)))
# add histogram for manu along the x axis
> with(USairpollution,hist(manu,xlim=xlim))
# add a boxplot for popul along the y axis
> with(USairpollution,boxplot(popul))

14
Bivariate boxplot
Based on calculating “robust” measures of location, scale, and correlation.
Consists of a pair of concentric ellipses:
- Hinge (includes 50% of the data)
- Fence (delimits potentially troublesome outliers)
Resistant regression lines of both y on x and x on y are shown, with their intersection showing the bivariate location estimator.
The acute angle between the regression lines will be small for a large absolute value of the correlation and large for a small one.
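The link between the angle and the correlation can be checked numerically. The sketch below is an illustration under stated assumptions, not the bivariate boxplot's own computation: it uses ordinary least-squares lines (`lm`) as a stand-in for the resistant regression lines, and simulated data rather than USairpollution. The function name `regression_angle` is made up for this example.

```r
# Acute angle (in degrees) between the least-squares line of y on x and
# that of x on y, both drawn in the (x, y) plane.
regression_angle <- function(x, y) {
  slope_yx <- coef(lm(y ~ x))[2]        # y on x: dy/dx
  slope_xy <- 1 / coef(lm(x ~ y))[2]    # x on y, re-expressed as dy/dx
  angle <- abs(atan(slope_xy) - atan(slope_yx)) * 180 / pi
  unname(min(angle, 180 - angle))
}

set.seed(1)                              # simulated data, not from the slides
x <- rnorm(200)
strong <- x + rnorm(200, sd = 0.2)       # strongly correlated with x
weak   <- x + rnorm(200, sd = 3)         # weakly correlated with x

regression_angle(x, strong)              # small angle
regression_angle(x, weak)                # much larger angle
```

For least-squares lines the two slopes are r·s_y/s_x and s_y/(r·s_x), so they coincide when |r| = 1 and move apart as |r| shrinks.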

15
Bivariate boxplot
# select row numbers for the marginal cities in USairpollution
> lab=c("Houston","Chicago","Detroit","Cleveland","Philadelphia")
> outcity=match(lab,USairpollution[,1])
# subset data for manu and popul variables
> x=USairpollution[,c("manu","popul")]
# draw the bivariate boxplot
# need packages “MVA” and “HSAUR2” for bvbox
> library(MVA)
> bvbox(x,main="USairpollution",xlab=mlab,ylab=plab)
# add text for the outcity
> text(x$manu[outcity],x$popul[outcity],labels=lab,cex=0.7,pos=c(2,2,4,2,2))

16
Bivariate boxplot
[Figure: bivariate boxplot of manu and popul, annotated with the hinge, the fence, and the regression lines]

17
Bivariate boxplot
Scatterplots should always be consulted when calculating correlation coefficients, because the presence of outliers can on occasion considerably distort the value of a correlation coefficient.
The observations identified as outliers may then be excluded from the calculation of the correlation coefficients.
The bivariate boxplot identifies Chicago, Philadelphia, Detroit, and Cleveland as outliers in the scatterplot of manu and popul.
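How much a single point can distort a correlation coefficient is easy to see on simulated data. The sketch below uses made-up values, not the USairpollution data: two variables are generated independently, and one hypothetical extreme observation is then appended.

```r
set.seed(42)                      # simulated data, not from the slides
x <- rnorm(30)
y <- rnorm(30)                    # generated independently of x

r_clean <- cor(x, y)

# Append one hypothetical extreme observation far from the main cloud.
x_out <- c(x, 10)
y_out <- c(y, 10)
r_outlier <- cor(x_out, y_out)

c(clean = r_clean, with_outlier = r_outlier)
```

The single added point is enough to produce a large positive correlation between two otherwise unrelated variables, which is why the scatterplot should be inspected before the coefficient is trusted.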

18
R output
# correlation between manu and popul variables
> with(USairpollution,cor(manu,popul))
[1] 0.9552693   # correlation coefficient
# remove the outliers
> outcity=match(c("Chicago","Detroit","Cleveland","Philadelphia"),USairpollution[,1])
# correlation between manu and popul with the outliers removed
> with(USairpollution,cor(manu[-outcity],popul[-outcity]))
[1] 0.7955549   # correlation coefficient

19
The convex hull of bivariate data
An alternative approach to using the scatterplot combined with the bivariate boxplot. It gives a robust estimate of the correlation.
The convex hull of a set of bivariate observations consists of the vertices of the smallest convex polyhedron in variable space within which or on which all data points lie.
Removal of the points lying on the convex hull can eliminate isolated outliers without disturbing the general shape of the bivariate distribution.
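The effect of removing the hull can be sketched on simulated data. The values below are made up for illustration: a positively correlated cloud plus one hypothetical isolated outlier. Only base R functions (`chull`, `cor`) are used.

```r
set.seed(7)                              # simulated data, not from the slides
x <- rnorm(50)
y <- 0.8 * x + rnorm(50, sd = 0.6)
x[50] <- 6; y[50] <- -6                  # one hypothetical isolated outlier

hull <- chull(x, y)                      # indices of the points on the hull
50 %in% hull                             # the isolated point lies on the hull

cor(x, y)                                # correlation using all points
cor(x[-hull], y[-hull])                  # correlation with the hull peeled off
```

Peeling the hull discards the isolated point along with the other extreme observations, so the remaining correlation reflects the main body of the data.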

20
R output
# scatterplot of manu and popul in USairpollution data
> plot(popul~manu,data=USairpollution,cex.lab=0.7,xlab=mlab,ylab=plab,xlim=xlim)
# extract the hull from manu and popul data
> hull=with(USairpollution,chull(manu,popul))
> hull
[1] 9 15 41 6 2 18 16 14 7
# plot the convex hull
> with(USairpollution,polygon(manu[hull],popul[hull],density=15,angle=30))
# correlation of manu and popul with the hull removed
> with(USairpollution,cor(manu[-hull],popul[-hull]))
[1] 0.9225267

22
The chi-plot
Suggested by Fisher and Switzer (1985, 2001).
The chi-plot is designed to address the problem of independence.
It transforms the measurements $(x_{11}, \ldots, x_{n1})$ and $(x_{12}, \ldots, x_{n2})$ into values $(\chi_1, \ldots, \chi_n)$ and $(\lambda_1, \ldots, \lambda_n)$, which, plotted in a scatterplot, can be used to detect deviations from independence.
Under independence the $\chi_i$ values should show a non-systematic random fluctuation around zero; the $\lambda_i$ values measure the distance of unit i from the “center” of the bivariate distribution.
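The transformation can be sketched in a few lines of base R. This is a minimal illustration that assumes the usual Fisher and Switzer definitions based on leave-one-out empirical distribution functions; it is not the implementation used by any package, and the function name `chi_lambda` is made up here. Units at the edge of the data, where the denominator is zero, come out as NaN and are dropped before plotting.

```r
# For each unit i, compute chi_i and lambda_i from the empirical
# distribution functions evaluated on the other n - 1 units.
chi_lambda <- function(x, y) {
  n <- length(x)
  res <- t(sapply(seq_len(n), function(i) {
    Hi <- sum(x[-i] <= x[i] & y[-i] <= y[i]) / (n - 1)  # joint
    Fi <- sum(x[-i] <= x[i]) / (n - 1)                  # marginal of x
    Gi <- sum(y[-i] <= y[i]) / (n - 1)                  # marginal of y
    Si <- sign((Fi - 0.5) * (Gi - 0.5))
    chi <- (Hi - Fi * Gi) / sqrt(Fi * (1 - Fi) * Gi * (1 - Gi))
    lambda <- 4 * Si * max((Fi - 0.5)^2, (Gi - 0.5)^2)
    c(chi = chi, lambda = lambda)
  }))
  as.data.frame(res)
}

set.seed(3)                                   # simulated data, not from the slides
x <- rnorm(100)
indep <- chi_lambda(x, rnorm(100))            # independent pair
dep   <- chi_lambda(x, x + rnorm(100, sd = 0.5))  # dependent pair

mean(abs(indep$chi), na.rm = TRUE)            # close to zero
mean(abs(dep$chi), na.rm = TRUE)              # clearly away from zero

keep <- is.finite(dep$chi)
plot(dep$lambda[keep], dep$chi[keep], xlab = "lambda", ylab = "chi")
abline(h = 0, lty = 2)
```

For the dependent pair the points sit well above the zero line, whereas for the independent pair they scatter around it.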

23
The chi-plot
# scatterplot for manu and popul
> with(USairpollution,plot(manu,popul,xlab=mlab,ylab=plab,cex.lab=0.9))
# chi-plot; needs package “asbio”
> library(asbio)
> with(USairpollution,chi.plot(manu,popul))

24
The chi-plot
Departure from independence is indicated in the second plot by a lack of points in the horizontal band indicated on the plot.

25
The bubble and other glyph plots
Bubble plot: three variables are displayed. Two are used to form the scatterplot itself, and the values of the third variable are represented by circles with radii proportional to these values and centered on the appropriate point in the scatterplot.
Make a bubble plot for temp, wind, and SO2 in the “USairpollution” data.
The plot seems to suggest that cities with moderate annual temperatures and moderate annual wind speeds tend to suffer the greatest air pollution, but it does not tell us anything about the rest of the variables.

26
The bubble and other glyph plots
Bubble plot
# set ylim for wind
> ylim=with(USairpollution,range(wind)*c(0.95,1))
# scatterplot of wind and temp
> plot(wind~temp,data=USairpollution,xlab="Average annual temperature (Fahrenheit)",ylab="Average annual wind speed (m.p.h.)",pch=10,ylim=ylim)
# plot SO2 as circles
> with(USairpollution,symbols(temp,wind,circles=SO2,inches=0.5,add=TRUE))

27
The bubble and other glyph plots
[Figure: bubble plot of wind against temp, with circle size showing SO2]

28
The bubble and other glyph plots
Star plot
Make a star plot for all variables in “USairpollution” as a seven-sided “star”, with the length of each side representing one of the seven variables.
# star plot for all variables in USairpollution
# (the City column is dropped so that only the seven measured variables are drawn)
> stars(USairpollution[,-1],labels=as.character(USairpollution$City),cex=0.45)
The plot does not communicate much useful information about the data, but it shows which units have similar shapes.

29
The bubble and other glyph plots
[Figure: star plot of the USairpollution data, one star per city]

30
The scatterplot matrix
A scatterplot matrix is a square, symmetric grid of bivariate scatterplots.
The grid has q rows and columns, each one corresponding to a different variable.
Each of the grid’s cells shows a scatterplot of two variables.
Variable j is plotted against variable i in the ij cell, and the same variables appear in the cell ji, with the x- and y-axes of the scatterplots interchanged.

31
The scatterplot matrix
# plot the grid
> pairs(USairpollution[,-1],cex.labels=NULL,font.labels=0.1,gap=0.1,pch=".",cex=0.00001)
# plot the grid and the regression lines
> pairs(USairpollution[,-1],pch=".",cex=0.1,gap=0,font.labels=0.001,
    panel=function(x,y,...){
      points(x,y,...)
      abline(lm(y~x),col="grey")
    })

32
The scatterplot matrix
[Figure: scatterplot matrix of the USairpollution variables]

33
The scatterplot matrix
[Figure: scatterplot matrix of the USairpollution variables with a regression line in each panel]

34
Three-dimensional plots
# three-dimensional plot for chest, waist, and hips in Measure data
> library(scatterplot3d)
> with(Measure,scatterplot3d(chest,waist,hips,pch=(1:2)[gender],type="h",angle=55))
The three-dimensional plot of chest, waist, and hips in the “Measure” data suggests the presence of two separate groups of points corresponding to the males and females in the data.

35
Three-dimensional plots
# three-dimensional plot for temp, wind, and SO2
> library(scatterplot3d)
> with(USairpollution,scatterplot3d(temp,wind,SO2,type="h",angle=55))
The plot of temp, wind, and SO2 from the “USairpollution” data shows that two observations with high SO2 levels stand out.

36
Trellis graphics
Trellis graphics is an approach to examining high-dimensional structure in data by means of one-, two-, and three-dimensional graphs.
The problem addressed is how observations of one or more variables depend on the observations of the other variables.
It relies on multiple conditioning, which allows some type of plot to be displayed for different values of a given variable (or variables).

37
Trellis graphics
# scatterplot of SO2 and temp for light and high winds
> library(lattice)
> plot(xyplot(SO2~temp|cut(wind,2),data=USairpollution))
Scatterplots of SO2 and temp conditioned on wind, with the range of wind divided into two intervals of equal width.
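What `cut(wind,2)` does to the conditioning variable can be checked on a small vector. The sketch below uses made-up wind speeds, not the USairpollution values. It shows that `cut` with a single number splits the observed range into intervals of equal width, so the two panels need not contain the same number of cities; cutting at the median is shown as one possible alternative.

```r
# Made-up wind speeds, for illustration only.
wind <- c(6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5, 9.8, 14)

# cut() with a single number splits the observed range into that many
# intervals of equal width, so the groups can be very unequal in size.
by_width <- cut(wind, 2)
table(by_width)

# Cutting at the median instead gives two groups of equal size here.
by_count <- cut(wind, breaks = quantile(wind, c(0, 0.5, 1)),
                include.lowest = TRUE)
table(by_count)
```

With these values the equal-width split puts nine observations in the lower interval and one in the upper, while the median split gives five and five.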

38
Trellis graphics
# three-dimensional plots of temp, wind, and precip conditioned on four levels of SO2
> pollution=with(USairpollution,equal.count(SO2,4))
> plot(cloud(precip~temp*wind|pollution,panel.aspect=0.9,data=USairpollution))
Not very useful for small data sets.
