Univariate EDA (Exploratory Data Analysis). EDA John Tukey (1970s) data –two components: smooth + rough patterned behaviour + random variation resistant.

Univariate EDA (Exploratory Data Analysis)

EDA John Tukey (1970s) data –two components: smooth + rough patterned behaviour + random variation resistant measures/displays –little influenced by changes in a small proportion of the total number of cases –resistant to the effects of outliers –emphasizes smooth over rough components concepts apply to statistics and to graphical methods

Tree Ring dates (AD) 1255 1239 1162 1239 1240 1243 1241 1241 1271 9 dendrochronology dates what do they mean???? usually helps to sort the data…

Stem-and-Leaf Diagram 1162 1239 1239 1240 1241 1241 1243 1255 1271 11|62 12|39,39,40,41,41,43,55,71 original values preserved no rounding, no loss of information…

can simplify in various ways… 11|6 12|44444467 –‘leaves’ rounded to nearest decade –‘stem’ based on centuries

1162 1239 1239 1240 1241 1241 1243 1255 1271 116|2 117| 118| 119| 120| 121| 122| 123|99 124|0113 125|5 126| 127|1 ‘stem’ based on decades…

Stem-and-Leaf Diagram 11|6 12|44444467 –‘leaves’ rounded to nearest decade –‘stem’ based on centuries 1162 1239 1239 1240 1241 1241 1243 1255 1271

Stem-and-Leaf Diagram 11|62 12|39,39,40,41,41,43,55,71 original values preserved no rounding, no loss of information…

1162 1239 1239 1240 1241 1241 1243 1255 1271 116|2 117| 118| 119| 120| 121| 122| 123|99 124|0113 125|5 126| 127|1 ‘stem’ based on decades…

1162 1239 1239 1240 1241 1241 1243 1255 1271 116|2 117| 118| 119| 120| 121| 122| 123|99 124|0113 125|5 126| 127|1 highlights existence of gaps in the distribution of dates, groups of dates…

R stem() vu  round(runif(25, 0, 50),0); stem(vu) vn  round(rnorm(25, 25, 10),0); stem(vn) stem(vn, scale=2)

unit 1unit 2 12.616.2 11.616.4 16.313.8 13.113.2 12.111.3 26.914 9.79 11.512.5 14.815.6 13.511.2 12.412.2 13.615.5 11.7 926 25 24 23 22 21 20 19 18 17 31624 1556 8140 6511328 6411225 6511237 10 790 unit 1unit 2  Back-to-back stem-and-leaf plot rim diameter data (cm)

percentiles useful for constructing various kinds of EDA graphics don’t confuse percentile with percent or proportion Note: frequency = count relative frequency = percent or proportion

percentiles “the pth percentile of a distribution:  number such that approximately p percent of the values in the distribution are equal or less than that number…” can be calculated for numbers that actually exist in the distribution, and interpolated for numbers than don’t…

percentiles sort the data so that x 1 is the smallest value, and x n is the largest (where n=total number of cases) x i is the p i th percentile of a dataset of n members where:

p 1 = 100(1 - 0.5) / 7 = 7.1 p 2 = 100(2 - 0.5) / 7 = 21.4 p 3 = 100(3 - 0.5) / 7 = 35.7 p 4 = 100(4 - 0.5) / 7 = 50 etc… [1]

 25 ? 85 ? 50 50 th percentile: i=(7*50)/100 +.5 i=4, x i =7 25 th percentile: i=(7*25)/100 +.5 i=2.25, 3<x i <5

? if i integer, then… k = integer part of i; f = fractional part of i x int = interpolated value of x x int = (1-f)x k + fx k+1 x int = (1-.25)*3+.25*5 x int = 3.5 25 th percentile: i=(7*25)/100 +.5 i=2.25, 3<x i <5 25

use R!! test<-c(1,3,5,7,9,9,14) quantile(test,.25, type=5)

75 th 25 th 50 th percentiles: interquartile range (midspread) upper hingelower hinge inner fence “boxplot” (1.5 x midspread)

Figure 6.25: Internal diversity of neighbourhoods used to define N-clusters, measured by the 'evenness' statistic H/Hmax on the basis of counts of various A-clusters, and broken down by N-cluster and phase. [Boxes encompass the midspread; lines inside boxes indicate the median, while whiskers show the range of cases that fall within 1.5-times the midspread, above or below the limits of the box.]

Cleveland, W. S. (1985) The Elements of Graphing Data.

Histograms divide a continuous variable into intervals called ‘bins’ count the number of cases within each bin use bars to reflect counts intervals on the horizontal axis counts on the vertical axis

“bins” Histogram counts percent

useful for illustrating the shape of the distribution of a batch of numbers may be helpful for identifying modes and modal behaviour Histograms

mode mode? mode! the distribution is clearly bimodal may be multimodal…

important variables in histogram constuction: bin width bin starting point

boundaries of ‘bins’… bins: 1-2; 2-3; 3-4... George Cowgill: construct ‘bins’ of whole multiples of “minimum meaningful measurement units” (“mmmus”) where to count a value like ‘2.0’? Shennan: really means 1.0-1.9; 2.0-2.9; 3.0-3.9... is this OK?? 2.0 = 1.95> <2.05 mmmu=1.95-2.05; 2.05-2.15; 2.15-2.25; 2.25-2.35…

1.80 - 1.992.00 - 2.19 1.8 - 2.02.0 - 2.2 1.952.05 observed value 2.0… 1.75 - 1.951.95 - 2.15

smoothing histograms may want to accentuate the ‘smooth’ in a data distribution… calculate “running averages” on bin counts level of smoothing is arbitrary…

histogram / barchart variations 3d stacked dual frequency polygon kernel density methods

dual barchart

Site 1 Site 2

‘mirror’ barchart

stacked barchart

3d barchart

frequency polygon

kernel density model

controlling kernel density plots… hd <- density(XX) hh <- hist(XX, plot=F) maxD <- max(hd$y) maxH <- max(hh$density) Y <- c(0, max(c(maxD, maxH))) hist(XX, freq=F, ylim=Y) lines(density(XX))

Dot Plot [R: dotchart()]

Dot Histogram [R: stripchart()] 12345678910 VAR00003 12345678910 VAR00003 12345678910 VAR00003 method = “stack”

cooking/serviceserviceritual line plot

cooking/serviceserviceritual

20% 19% 18% 21% 22% pie chart

10 20 30 40 50 60 70 80 90 100 percent 10 20 30 40 50 60 70 80 90 100 cumulative percent Cumulative Percent Graph

10 20 30 40 50 60 70 80 90 100 cumulative percent some useful statistical measures (ordinal or ratio scale) can be misleading when used with nominal data good for comparing data sets Cumulative Percent Graph

Univariate EDA (Exploratory Data Analysis). EDA John Tukey (1970s) data –two components: smooth + rough patterned behaviour + random variation resistant.

Similar presentations

Presentation on theme: "Univariate EDA (Exploratory Data Analysis). EDA John Tukey (1970s) data –two components: smooth + rough patterned behaviour + random variation resistant."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Univariate EDA (Exploratory Data Analysis). EDA John Tukey (1970s) data –two components: smooth + rough patterned behaviour + random variation resistant.

Similar presentations

Presentation on theme: "Univariate EDA (Exploratory Data Analysis). EDA John Tukey (1970s) data –two components: smooth + rough patterned behaviour + random variation resistant."— Presentation transcript:

Similar presentations

About project

Feedback