Published by Aubrey Spires. Modified over 2 years ago.

1
**Exploratory Data Analysis and Data Visualization**

Chapter 2

Credits:

- Hand, Mannila and Smyth
- Cook and Swayne
- ggobi
- Lecture notes: Padhraic Smyth’s UCI lecture notes
- R Graphics book

Data Mining - Massey University

2
**Outline**

- EDA
- Visualization
  - One variable
  - Two variables
  - More than two variables
  - Other types of data
- Dimension reduction

3
**EDA and Visualization**

Exploratory Data Analysis (EDA) and Visualization are important (necessary?) steps in any analysis task.

- can be thought of as hypothesis generation
- get to know your data!
  - distributions (symmetric, normal, skewed)
  - data quality problems
  - outliers
  - correlations and inter-relationships
  - subsets of interest
  - suggest functional relationships
- Sometimes EDA or viz might be the goal!
  - but be careful of multiple comparisons

5
**EDA**

- Good data analysis practice
- You should always look at every variable - you will learn something!
  - Deveaux example histogram?
- Look at descriptive statistics
  - Use means, medians, quantiles, boxplots
  - R functions: `summary()`, `hist()`, `table()`
- Visualization as part of EDA
  - Humans are the best pattern recognition software
  - Limitations: many dimensions, large data sets
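
As a minimal sketch of the three R functions named on this slide, the snippet below applies them to the built-in `mtcars` data set. The choice of data set and of the `mpg` and `cyl` columns is an assumption made purely for illustration; it is not from the original deck.

```r
# Built-in data set, used here only as a stand-in for "your data".
data(mtcars)

# summary(): minimum, quartiles, median, mean and maximum of one variable
overview <- summary(mtcars$mpg)
print(overview)

# table(): frequency of each distinct value (useful for discrete variables)
cyl_counts <- table(mtcars$cyl)
print(cyl_counts)

# hist(): quick look at the shape of the distribution
hist(mtcars$mpg, main = "Miles per gallon", xlab = "mpg")
```

Running the same three calls over every column is a reasonable first pass before any modelling.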

6
**Exploratory Data Analysis (EDA)**

- get a general sense of the data
- interactive and visual
  - (cleverly/creatively) exploit human visual power to see patterns
  - 1 to 5 dimensions (e.g. spatial, color, time, sound)
  - e.g. plot raw data/statistics, reduce dimensions as needed
- data-driven (model-free)
- especially useful in early stages of data mining
  - detect outliers (e.g. assess data quality)
  - test assumptions (e.g. normal distributions or skewed?)
  - identify useful raw data & transforms (e.g. log(x))
- Bottom line: it is always well worth looking at your data!

7
**Summary Statistics**

- not visual
- sample statistics of data $X$
  - mean: $\mu = \sum_i X_i / n$ (minimizes $\sum_i (X_i - \mu)^2$)
  - mode: most common value in $X$
  - median: with $X$ sorted, median $= X_{n/2}$ (half below, half above)
  - quartiles of sorted $X$: Q1 value $= X_{0.25n}$, Q3 value $= X_{0.75n}$
    - interquartile range: value(Q3) - value(Q1)
    - range: $\max(X) - \min(X) = X_n - X_1$
  - variance: $\sigma^2 = \sum_i (X_i - \mu)^2 / n$
  - skewness: $\sum_i (X_i - \mu)^3 / \left[ \left( \sum_i (X_i - \mu)^2 \right)^{3/2} \right]$
    - zero if symmetric; right-skewed more common (e.g. you v. Bill Gates)
  - number of distinct values for a variable (see `unique()` in R)
- `summary()` is very useful.
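
The statistics on this slide can be computed directly in R. The sketch below uses a small made-up sample (an assumption, chosen so that one large value produces right skew) and follows the slide's formulas. Note that the slide's variance divides by n, whereas R's `var()` divides by n - 1, and that R's `quantile()` interpolates rather than picking a single order statistic.

```r
# Made-up sample with one large value, for illustration only.
x <- c(2, 4, 4, 5, 7, 9, 40)
n <- length(x)

mu       <- sum(x) / n                              # mean
mode_val <- as.numeric(names(which.max(table(x))))  # most common value
med      <- median(x)                               # half below, half above
q        <- quantile(x, c(0.25, 0.75))              # Q1 and Q3 (interpolated)
iqr      <- unname(q[2] - q[1])                     # interquartile range
rng      <- max(x) - min(x)                         # range
sigma2   <- sum((x - mu)^2) / n                     # variance, dividing by n
# Skewness in the slide's form; the usual moment-based skewness
# is this value multiplied by sqrt(n).
skew     <- sum((x - mu)^3) / sum((x - mu)^2)^(3 / 2)
n_distinct <- length(unique(x))                     # number of distinct values

print(c(mean = mu, mode = mode_val, median = med, IQR = iqr,
        range = rng, variance = sigma2, skewness = skew))
```

The mean is pulled well above the median by the single large value, which is the "you v. Bill Gates" effect the slide mentions.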

8
**Single Variable Visualization**

- Histogram: shows center, variability, skewness, modality, outliers, or strange patterns.
- Bins matter; use the `nclass` option of `hist()`.
- Beware of real zeros.
- Example: `hist(DiastolicBP, col = 'orange', nclass = 20)`
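
To illustrate why bins matter, the sketch below draws the same variable with a coarse and a fine binning. It uses the built-in `faithful` data as a stand-in (an assumption; the slide's `DiastolicBP` variable is not available here) and the `breaks` argument, which for a single number behaves like `nclass` and is treated by R as a suggestion rather than an exact bin count.

```r
# Stand-in data: eruption durations from the built-in faithful data set.
x <- faithful$eruptions

# Compute both histograms without drawing, so the bin counts can be inspected.
h_coarse <- hist(x, breaks = 5,  plot = FALSE)
h_fine   <- hist(x, breaks = 20, plot = FALSE)

# Draw them side by side.
op <- par(mfrow = c(1, 2))
plot(h_coarse, main = "about 5 bins",  xlab = "eruption time (min)")
plot(h_fine,   main = "about 20 bins", xlab = "eruption time (min)")
par(op)
```

Comparing the two panels shows how much of the visible shape depends on the binning rather than on the data.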

9
**Histograms**

Example: number of weeks a credit card was used in a given year.

10
**Histograms**

A small change to the “anchor point” (where the bin boundaries start) can make a big difference.
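
The anchor-point effect can be sketched by keeping the bin width fixed and shifting every boundary by half a bin. The data set (`faithful$eruptions`), the bin width of 0.5 and the two starting points are assumptions chosen for illustration.

```r
x <- faithful$eruptions
width <- 0.5

# Same bin width, two different anchor points (shifted by half a bin).
breaks_a <- seq(1.5,  5.5,  by = width)
breaks_b <- seq(1.25, 5.25, by = width)

counts_a <- hist(x, breaks = breaks_a, plot = FALSE)$counts
counts_b <- hist(x, breaks = breaks_b, plot = FALSE)$counts

# Print the per-bin counts for each anchor so they can be compared.
print(rbind(anchor_1.50 = counts_a, anchor_1.25 = counts_b))

op <- par(mfrow = c(1, 2))
hist(x, breaks = breaks_a, main = "anchor at 1.50", xlab = "eruption time (min)")
hist(x, breaks = breaks_b, main = "anchor at 1.25", xlab = "eruption time (min)")
par(op)
```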

11
**Issues with Histograms**

- For small data sets, histograms can be misleading. Small changes in the data or to the bucket boundaries can result in very different histograms.
- For large data sets, histograms can be quite effective at illustrating general properties of the distribution.
- Histograms effectively only work with 1 variable at a time.
  - Difficult to extend to 2 dimensions, not possible for >2
  - So histograms tell us nothing about the relationships among variables

12
**Smoothed Histograms - Density Estimates**

- Kernel estimates smooth out the contribution of each data point over a local neighborhood of that point.
- $h$ is the kernel width.
- A Gaussian kernel is common.
- Formal procedures exist for optimal bandwidth choice.
- R includes many options (see `?density`).
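
A common way to write the kernel estimate is $\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$, with $K(t) = \frac{1}{\sqrt{2\pi}} e^{-t^2/2}$ for the Gaussian kernel. This is the standard textbook form and is assumed here, since the formula on the original slide did not survive. The sketch below implements it by hand; the helper name `kde_gauss`, the data set and the two bandwidths are illustrative choices.

```r
# Hand-written Gaussian kernel density estimate (hypothetical helper).
# grid: points at which to evaluate; data: observations; h: kernel width.
kde_gauss <- function(grid, data, h) {
  sapply(grid, function(g) mean(dnorm((g - data) / h)) / h)
}

x    <- faithful$eruptions
grid <- seq(min(x) - 2, max(x) + 2, length.out = 400)

f_narrow <- kde_gauss(grid, x, h = 0.1)  # small h: wiggly estimate
f_wide   <- kde_gauss(grid, x, h = 0.5)  # large h: smooth estimate

plot(grid, f_narrow, type = "l", xlab = "eruption time (min)", ylab = "density")
lines(grid, f_wide, lty = 2)
legend("topleft", legend = c("h = 0.1", "h = 0.5"), lty = c(1, 2))
```

In practice the built-in `density(x, bw = h)` does the same job more efficiently and offers automatic bandwidth selectors.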

14
**Boxplots**

- Shows a lot of information about a variable in one plot
  - Median
  - IQR
  - Outliers
  - Range
  - Skewness
- Negatives
  - Overplotting
  - Hard to tell distributional shape
  - no standard implementation in software (many options)
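
The quantities a boxplot displays can be inspected with `boxplot.stats()`. The sample below is made up for illustration. R's default, which is only one of the "many options" the slide refers to, uses hinges for the box and flags points more than 1.5 box lengths beyond the hinges as outliers.

```r
# Made-up sample with one extreme value.
x <- c(2, 4, 4, 5, 7, 9, 40)

bs <- boxplot.stats(x)
print(bs$stats)  # lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$out)    # points drawn individually as outliers

boxplot(x, horizontal = TRUE, main = "Boxplot of a small skewed sample")
```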

15
**Time Series Example 1**

Annotations on the plot:

- steady growth trend
- summer peaks
- summer bifurcations in air travel (favor early/late)
- New Year bumps

16
**Time Series Example 2**

- mean weight vs mean age for 10k control group
- Scotland experiment on effects of milk on better health
- Unexpected “step effect” ???

17
**Time Series Example 3**

- spatio-temporal data
- growth of Wal-Mart in the US

18
**Displaying Two Variables**

For two numeric variables, the scatterplot is the obvious choice. (The example plot marks two regions as “interesting?”.)

19
**2D Scatterplots**

- standard tool to display the relation between 2 variables
  - e.g. y-axis = response, x-axis = suspected indicator
- useful to answer:
  - are x and y related? (no / linearly / nonlinearly)
  - does variance(y) depend on x?
  - are outliers present?
- R: `plot(x, y, pch = ".")`

20
**Scatter Plot: No apparent relationship**

21
**Scatter Plot: Linear relationship**

22
**Scatter Plot: Quadratic relationship**

23
**Scatter plot: Homoscedastic**

Variation of Y does not depend on X.

24
**Scatter plot: Heteroscedastic**

Variation in Y differs depending on the value of X, e.g., Y = annual tax paid, X = income.
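
A heteroscedastic pattern like the tax and income example can be simulated to see what it looks like in a scatterplot. Everything in the sketch below (the 20% rate, the noise level, the income range, the split point of 110) is a made-up assumption for illustration, not real tax data.

```r
set.seed(1)

# Simulated incomes and tax paid; the noise grows in proportion to income.
income <- runif(500, min = 20, max = 200)
tax    <- 0.2 * income + rnorm(500, sd = 0.05 * income)

plot(income, tax, pch = ".", cex = 3,
     xlab = "income (X)", ylab = "annual tax paid (Y)")

# Compare the spread around the trend for low and high incomes.
resid       <- tax - 0.2 * income
spread_low  <- sd(resid[income <  110])
spread_high <- sd(resid[income >= 110])
print(c(low = spread_low, high = spread_high))
```

The fan shape in the plot, and the larger spread in the upper half of the income range, are the signature of heteroscedasticity.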

25
**Two variables - continuous**

Scatterplots, but they can be bad with lots of data.

26
**Transparent plotting**

`plot(rnorm(1000), rnorm(1000), col = "#0000ff22", pch = 16, cex = 3)`

27
**Alpha blending**

courtesy Simon Urbanek

28
**Jittering**

Jittering points helps too:

- `plot(age, TimesPregnant)`
- `plot(jitter(age), jitter(TimesPregnant))`
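
Jittering matters most when both variables are discrete, because many observations then share exactly the same plotting position. The sketch below uses simulated integer data as a stand-in for the slide's `age` and `TimesPregnant` variables (an assumption; the original data are not available here).

```r
set.seed(1)

# Simulated integer-valued stand-ins for the slide's variables.
age   <- sample(21:30, 300, replace = TRUE)
times <- sample(0:5,   300, replace = TRUE)

# Without jitter, 300 points collapse onto at most 10 * 6 = 60 positions.
n_pos_raw <- nrow(unique(data.frame(age, times)))

# jitter() adds a small amount of random noise to each value.
jx <- jitter(age)
jy <- jitter(times)
n_pos_jit <- nrow(unique(data.frame(jx, jy)))

op <- par(mfrow = c(1, 2))
plot(age, times, main = "raw")
plot(jx, jy, main = "jittered", xlab = "age", ylab = "times")
par(op)
```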

29
**Two variables - continuous**

What to do for large data sets: contour plots.
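
One way to produce such a contour plot in R is to estimate a two-dimensional density and draw its level curves. The sketch below assumes the `MASS` package is installed (it ships with standard R distributions) and uses simulated data; the sample size, correlation and grid size are illustrative choices.

```r
library(MASS)  # provides kde2d(); assumed to be installed

set.seed(1)
x <- rnorm(10000)
y <- 0.6 * x + rnorm(10000, sd = 0.8)

# Two-dimensional kernel density estimate on a 50 x 50 grid.
dens <- kde2d(x, y, n = 50)

# Contours summarize where the points concentrate without overplotting.
contour(dens, xlab = "x", ylab = "y")
```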

30
**Displaying Two Variables**

If one variable is categorical, use variations on single-dimensional methods. In R, trellis graphics are provided by the lattice package:

- `library(lattice)`
- `histogram(~ DiastolicBP | TimesPregnant == 0)`

31
**Two Variables - one categorical**

Side-by-side boxplots are very effective in showing differences in a quantitative variable across factor levels.

- tips data: do men or women tip better?
- orchard sprays: measuring potency of various orchard sprays in repelling honeybees
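
The orchard sprays example can be sketched with R's built-in `OrchardSprays` data set, which is assumed here to be the data the slide refers to. The formula `decrease ~ treatment` asks for one box per treatment level.

```r
# One box per spray treatment; decrease is the measured response.
bp <- boxplot(decrease ~ treatment, data = OrchardSprays,
              xlab = "treatment", ylab = "decrease",
              main = "Orchard sprays")

# boxplot() also returns the plotted statistics invisibly.
print(bp$names)  # factor levels, one per box
print(bp$n)      # number of observations behind each box
```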

32
**Barcharts and Spineplots**

- Stacked barcharts or histograms are useful but should be used with caution.
- Spineplots are nice, but can be hard to interpret.

33
**More than two variables**

- Scatterplot matrices: `pairs(x)`
- somewhat ineffective for categorical data

34
**More than two variables**

- Get creative!
- Conditioning on variables: trellis or lattice plots
  - Cleveland models on human perception, all based on conditioning
  - all use the R formula model
  - a lot of control over the output
  - alternate versions of standard R plot functions:
    - `plot` => `xyplot`
    - `barplot` => `barchart`
    - `boxplot` => `bwplot`
- Earthquake data: locations of 1000 seismic events of MB > 4.0. The events occurred in a cube near Fiji since 1964.
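
A conditioned plot of the earthquake data can be sketched with the lattice package and R's built-in `quakes` data set, which matches the description on the slide. Splitting depth into three equal-width bands is an illustrative assumption, as is the column name `depth_band`.

```r
library(lattice)  # ships with standard R distributions

# Work on a copy and add a categorical depth band to condition on.
q <- quakes
q$depth_band <- cut(q$depth, breaks = 3)

# Formula interface: y ~ x | conditioning variable, one panel per band.
p <- xyplot(lat ~ long | depth_band, data = q, pch = ".",
            layout = c(3, 1), xlab = "longitude", ylab = "latitude")
print(p)  # lattice plots are objects and must be printed to be drawn
```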

37
**Starplots**

38
**Using Icons to Encode Information, e.g., Star Plots**

- Each star represents a single observation. Star plots are used to examine the relative values for a single data point.
- The star plot consists of a sequence of equi-angular spokes, called radii, with each spoke representing one of the variables.
- Useful for small data sets with up to 10 or so variables.
- Limitations?
  - Small data sets, small dimensions
  - Ordering of variables may affect perception

Variables in the example:

1. Price
2. Mileage (MPG)
3. Repair Record (1 = Worst, 5 = Best)
4. Repair Record (1 = Worst, 5 = Best)
5. Headroom
6. Rear Seat Room
7. Trunk Space
8. Weight
9. Length
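
Base R draws star plots with `stars()`. The sketch below uses the first nine cars and first seven variables of the built-in `mtcars` data as a stand-in (an assumption; it is not the automobile data listed on the slide).

```r
# Small subset: 9 observations, 7 variables, as star plots suit small data.
cars <- mtcars[1:9, 1:7]

# One star per row; each spoke is one variable, rescaled to a common range.
loc <- stars(cars, main = "Star plots for nine cars")

# Reordering the columns changes the shapes, which can affect perception.
stars(cars[, 7:1], main = "Same data, variables in reverse order")
```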

39
**Chernoff’s Faces**

- described by ten facial characteristic parameters: head eccentricity, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, eye spacing, eye size, mouth length and degree of mouth opening
- Chernoff faces applet
- more icon plots

40
**Chernoff faces**

41
**Mosaic plots for categorical data**

42
**Mosaic Plots**

- Good for plotting many categorical variables
- Sensitive to the order in which the variables are applied
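
The order sensitivity can be seen by drawing the same two-way table with the variables in either order. The sketch below uses R's built-in `Titanic` table, collapsed to sex and survival; the choice of data set and variables is an assumption for illustration.

```r
# Collapse the 4-way Titanic table to two variables, in both orders.
tab_sex_first  <- margin.table(Titanic, c(2, 4))  # Sex, then Survived
tab_surv_first <- margin.table(Titanic, c(4, 2))  # Survived, then Sex

op <- par(mfrow = c(1, 2))
# Splitting on Sex first shows the survival share within each sex.
mosaicplot(tab_sex_first, main = "Sex, then Survived", color = TRUE)
# Splitting on Survived first shows the sex share within each outcome.
mosaicplot(tab_surv_first, main = "Survived, then Sex", color = TRUE)
par(op)
```

Both plots contain the same counts, but each makes a different conditional comparison easy to read.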

43
**Networks and Graphs**

creating networks where they might not obviously exist

44
**Interactive Visualization**

- Multi-dimensional viz is easiest using a tool that allows for variable selection; ggobi is such a tool.
- Brushing and linking of different plots
- demo

45
**What’s missing?**

- pie charts
  - very popular
  - good for showing simple relations of proportions
  - hard to get a real sense of what is going on
  - barplots, histograms usually better (but less pretty)
- 3D
  - nice to be able to show three dimensions
  - hard to do well
  - often done poorly
  - 3D best shown through “spinning” in 2D
    - uses various types of projecting into 2D
    - see video

47
**Dimension Reduction**

One way to visualize high-dimensional data is to reduce it to 2 or 3 dimensions.

- Variable selection
  - e.g. stepwise
- Principal components
  - find linear projection onto p-space with maximal variance
- Multi-dimensional scaling
  - takes a matrix of (dis)similarities and embeds the points in p-dimensional space to retain those similarities
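
Both projection methods are available in base R as `prcomp()` and `cmdscale()`. The sketch below applies them to the built-in `USArrests` data, reducing four variables to two; standardizing the variables first is an assumption made here because the columns are on different scales.

```r
# Principal components on standardized variables; keep the first two scores.
pca    <- prcomp(USArrests, scale. = TRUE)
scores <- pca$x[, 1:2]
print(summary(pca))  # proportion of variance explained by each component

# Classical multi-dimensional scaling from a matrix of Euclidean distances.
d   <- dist(scale(USArrests))
mds <- cmdscale(d, k = 2)

op <- par(mfrow = c(1, 2))
plot(scores, main = "PCA", xlab = "PC1", ylab = "PC2")
plot(mds, main = "Classical MDS", xlab = "coordinate 1", ylab = "coordinate 2")
par(op)
```

With Euclidean distances, classical MDS recovers the principal component scores up to a sign flip of each axis, so the two panels show the same configuration; the methods differ when other dissimilarities are used.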

48
**Lab #2**

- Explore graphics with `demo(graphics)`.
- Download Di Cook’s music data set and create some simple graphics.
- Use the USArrests data to plot scatterplots and do rudimentary interactive viz with `identify()`.
