Data Mining 2011 - Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for.

Slides:



Advertisements
Similar presentations
Exploratory Data Analysis and Data Visualization
Advertisements

Lesson Describing Distributions with Numbers parts from Mr. Molesky’s Statmonkey website.
Lecture 2 Summarizing the Sample. WARNING: Today’s lecture may bore some of you… It’s (sort of) not my fault…I’m required to teach you about what we’re.
Ethics of data representation v2.0. Collect Raw Data Process and Filter Data Clean Dataset Exploratory Analysis Generate Conclusion Generate Visualisation.
Statistics 100 Lecture Set 6. Re-cap Last day, looked at a variety of plots For categorical variables, most useful plots were bar charts and pie charts.
Statistics 100 Lecture Set 7. Chapters 13 and 14 in this lecture set Please read these, you are responsible for all material Will be doing chapters
Graphical Examination of Data Jaakko Leppänen
Beginning the Visualization of Data
Reading Graphs and Charts are more attractive and easy to understand than tables enable the reader to ‘see’ patterns in the data are easy to use for comparisons.
Statistics 101 & Exploratory Data Analysis (EDA)
Multidimensional data processing. Multivariate data consist of several variables for each observation. Actually, serious data is always multivariate.
Visualizing and Exploring Data Summary statistics for data (mean, median, mode, quartile, variance, skewnes) Distribution of values for single variables.
Business Statistics: A Decision-Making Approach, 7e © 2008 Prentice-Hall, Inc. Chap 3-1 Business Statistics: A Decision-Making Approach 7 th Edition Chapter.
Business Statistics: A Decision-Making Approach, 7e © 2008 Prentice-Hall, Inc. Chap 3-1 Business Statistics: A Decision-Making Approach 7 th Edition Chapter.
© 2002 Prentice-Hall, Inc.Chap 3-1 Basic Business Statistics (8 th Edition) Chapter 3 Numerical Descriptive Measures.
Descriptive Statistics A.A. Elimam College of Business San Francisco State University.
Types of Data Displays Based on the 2008 AZ State Mathematics Standard.
Visualization and Data Mining. 2 Outline  Graphical excellence and lie factor  Representing data in 1,2, and 3-D  Representing data in 4+ dimensions.
© 2003 Prentice-Hall, Inc.Chap 3-1 Business Statistics: A First Course (3 rd Edition) Chapter 3 Numerical Descriptive Measures.
Presenting information
1 How to interpret scientific & statistical graphs Theresa A Scott, MS Department of Biostatistics
Charts and Graphs V
How to Analyze Data? Aravinda Guntupalli. SPSS windows process Data window Variable view window Output window Chart editor window.
© 2005 The McGraw-Hill Companies, Inc., All Rights Reserved. Chapter 12 Describing Data.
Numerical Descriptive Measures
Exploratory Data Analysis. Computing Science, University of Aberdeen2 Introduction Applying data mining (InfoVis as well) techniques requires gaining.
Let’s Review for… AP Statistics!!! Chapter 1 Review Frank Cerros Xinlei Du Claire Dubois Ryan Hoshi.
Association between 2 variables We've described the distribution of 1 variable in Chapter 1 - but what if 2 variables are measured on the same individual?
ITEC6310 Research Methods in Information Technology Instructor: Prof. Z. Yang Course Website: c6310.htm Office:
Variable  An item of data  Examples: –gender –test scores –weight  Value varies from one observation to another.
Quantitative Skills 1: Graphing
Copyright © 2011 by The McGraw-Hill Companies, Inc. All rights reserved. McGraw-Hill/Irwin Describing Data.
CHAPTER 7: Exploring Data: Part I Review
STAT 280: Elementary Applied Statistics Describing Data Using Numerical Measures.
Chapter 11 Graphical Methods. Introduction “A picture is often better than several numerical analyses” Stand-alone procedure, or used in conjunction with.
Data Mining Lectures Lecture 3: EDA and Visualization Padhraic Smyth, UC Irvine ICS 278: Data Mining Lecture 3: Exploratory Data Analysis and Visualization.
McGraw-Hill/Irwin Copyright © 2010 by The McGraw-Hill Companies, Inc. All rights reserved. Chapter 3 Descriptive Statistics: Numerical Methods.
Lecture 3 Describing Data Using Numerical Measures.
Tan,Steinbach, Kumar: Exploratory Data Analysis (with modifications by Ch. Eick) Data Mining: “New” Teaching Road Map 1. Introduction to Data Mining and.
Categorical vs. Quantitative…
Unit 4 Statistical Analysis Data Representations.
Statistics Chapter 1: Exploring Data. 1.1 Displaying Distributions with Graphs Individuals Objects that are described by a set of data Variables Any characteristic.
Chapter 8 Making Sense of Data in Six Sigma and Lean
Descriptive statistics Petter Mostad Goal: Reduce data amount, keep ”information” Two uses: Data exploration: What you do for yourself when.
June 11, 2008Stat Lecture 10 - Review1 Midterm review Chapters 1-5 Statistics Lecture 10.
Lecture PowerPoint Slides Basic Practice of Statistics 7 th Edition.
Displaying Distributions with Graphs. the science of collecting, analyzing, and drawing conclusions from data.
Exploratory Data Analysis and Data Visualization Credits: ChrisVolinsky - Columbia University 1.
Copyright © 2011 Pearson Education, Inc. Describing Numerical Data Chapter 4.
© Tan,Steinbach, Kumar Introduction to Data Mining 8/05/ Data Mining: Exploring Data Lecture Notes for Chapter 3 Introduction to Data Mining by Tan,
Visualizing and Exploring Data 1. Outline 1.Introduction 2.Summarizing Data: Some Simple Examples 3.Tools for Displaying Single Variable 4.Tools for Displaying.
Exploratory Spatial Data Analysis (ESDA) Analysis through Visualization.
Numerical descriptions of distributions
Data Visualization.
Bell Ringer You will need a new bell ringer sheet – write your answers in the Monday box. 3. Airport administrators take a sample of airline baggage and.
3/13/2016 Data Mining 1 Lecture 2-1 Data Exploration: Understanding Data Phayung Meesad, Ph.D. King Mongkut’s University of Technology North Bangkok (KMUTNB)
IPS Chapter 1 © 2012 W.H. Freeman and Company  1.1: Displaying distributions with graphs  1.2: Describing distributions with numbers  1.3: Density Curves.
Graphs with SPSS Aravinda Guntupalli. Bar charts  Bar Charts are used for graphical representation of Nominal and Ordinal data  Height of the bar is.
Describing Data Week 1 The W’s (Where do the Numbers come from?) Who: Who was measured? By Whom: Who did the measuring What: What was measured? Where:
Week 2 Normal Distributions, Scatter Plots, Regression and Random.
Chapter 15: Exploratory data analysis: graphical summaries CIS 3033.
Midterm Review IN CLASS. Chapter 1: The Art and Science of Data 1.Recognize individuals and variables in a statistical study. 2.Distinguish between categorical.
Exploratory Data Analysis
Exploring Data: Summary Statistics and Visualizations
EXPLORATORY DATA ANALYSIS and DESCRIPTIVE STATISTICS
CHAPTER 1 Exploring Data
Unit 4 Statistical Analysis Data Representations
Data Mining: Exploring Data
Data Mining: “New” Teaching Road Map
Data exploration and visualization
Presentation transcript:

Data Mining Volinsky - Columbia University Exploratory Data Analysis and Data Visualization Chapter 2 credits: Interactive and Dyamic Graphics for Data Analysis: Cook and Swayne Padhraic Smyth’s UCI lecture notes R Graphics: Paul Murrell Graphics of Large Datasets: Visualizing a Milion: Unwin, Theus and Hofmann 1

Data Mining Volinsky - Columbia University Outline EDA Visualization –One variable –Two variables –More than two variables –Other types of data –Dimension reduction 2

Data Mining Volinsky - Columbia University EDA and Visualization Exploratory Data Analysis (EDA) and Visualization are important (necessary?) steps in any analysis task. get to know your data! –distributions (symmetric, normal, skewed) –data quality problems –outliers –correlations and inter-relationships –subsets of interest –suggest functional relationships Sometimes EDA or viz might be the goal! 3

Data Mining Volinsky - Columbia University 4 flowingdata.com 9/9/11

Data Mining Volinsky - Columbia University 5 NYTimes 7/26/11

Data Mining Volinsky - Columbia University Exploratory Data Analysis (EDA) Goal: get a general sense of the data –means, medians, quantiles, histograms, boxplots You should always look at every variable - you will learn something! data-driven (model-free) Think interactive and visual –Humans are the best pattern recognizers – You can use more than 2 dimensions! x,y,z, space, color, time…. especially useful in early stages of data mining –detect outliers (e.g. assess data quality) –test assumptions (e.g. normal distributions or skewed?) –identify useful raw data & transforms (e.g. log(x)) Bottom line: it is always well worth looking at your data! 6

Data Mining Volinsky - Columbia University Summary Statistics not visual sample statistics of data X –mean:  =  i X i / n –mode: most common value in X –median: X=sort(X), median = X n/2 (half below, half above) –quartiles of sorted X: Q1 value = X 0.25n, Q3 value = X 0.75 n interquartile range: value(Q3) - value(Q1) range: max(X) - min(X) = X n - X 1 –variance:  2 =  i (X i -  ) 2 / n –skewness:  i (X i -  ) 3 / [ (  i (X i -  ) 2 ) 3/2 ] zero if symmetric; right-skewed more common (what kind of data is right skewed?) –number of distinct values for a variable (see unique() in R) –Don’t need to report all of thses: Bottom line…do these numbers make sense??? 7

Data Mining Volinsky - Columbia University Single Variable Visualization Histogram: –Shows center, variability, skewness, modality, –outliers, or strange patterns. –Bins matter –Beware of real zeros 8

Data Mining Volinsky - Columbia University Issues with Histograms For small data sets, histograms can be misleading. –Small changes in the data, bins, or anchor can deceive For large data sets, histograms can be quite effective at illustrating general properties of the distribution. Histograms effectively only work with 1 variable at a time –But ‘small multiples’ can be effective 9

Data Mining Volinsky - Columbia University 10 But be careful with axes and scales!

Data Mining Volinsky - Columbia University Smoothed Histograms - Density Estimates Kernel estimates smooth out the contribution of each datapoint over a local neighborhood of that point. h is the kernel width Gaussian kernel is common: 11

Data Mining Volinsky - Columbia University 12 Bandwidth choice is an art Usually want to try several

Data Mining Volinsky - Columbia University Boxplots Shows a lot of information about a variable in one plot –Median –IQR –Outliers –Range –Skewness Negatives –Overplotting –Hard to tell distributional shape –no standard implementation in software (many options for whiskers, outliers) 13

Time Series If your data has a temporal component, be sure to exploit it Data Mining Volinsky - Columbia University 14 steady growth trend New Year bumps summer peaks summer bifurcations in air travel (favor early/late)

Spatial Data If your data has a geographic component, be sure to exploit it Data from cities/states/zip cods – easy to get lat/long Can plot as scatterplot Data Mining Volinsky - Columbia University 15

Data Mining Volinsky - Columbia University Spatio-temporal data spatio-temporal data – (Nathan Yau) –But, fancy tools not needed! Just do successive scatterplots to (almost) the same effect 16

Spatial data: choropleth Maps Maps using color shadings to represent numerical values are called chloropleth maps Data Mining Volinsky - Columbia University 17

Data Mining Volinsky - Columbia University Two Continuous Variables For two numeric variables, the scatterplot is the obvious choice interesting? 18

interesting? 2D Scatterplots standard tool to display relation between 2 variables –e.g. y-axis = response, x-axis = suspected indicator useful to answer: –x,y related? linear quadratic other –variance(y) depend on x? –outliers present? Data Mining Volinsky - Columbia University 19

Data Mining Volinsky - Columbia University Scatter Plot: No apparent relationship 20

Data Mining Volinsky - Columbia University Scatter Plot: Linear relationship 21

Data Mining Volinsky - Columbia University Scatter Plot: Quadratic relationship 22

Data Mining Volinsky - Columbia University Scatter plot: Homoscedastic Why is this important in classical statistical modelling? 23

Data Mining Volinsky - Columbia University Scatter plot: Heteroscedastic variation in Y differs depending on the value of X e.g., Y = annual tax paid, X = income 24

Data Mining Volinsky - Columbia University Two variables - continuous Scatterplots –But can be bad with lots of data 25

Data Mining Volinsky - Columbia University What to do for large data sets –Contour plots Two variables - continuous 26

Data Mining Volinsky - Columbia University Transparent plotting Alpha-blending: plot( rnorm(1000), rnorm(1000), col="#0000ff22", pch=16,cex=3) 27

Data Mining Volinsky - Columbia University Alpha blending courtesy Simon Urbanek 28

Data Mining Volinsky - Columbia University Jittering Jittering points helps too plot(age, TimesPregnant) plot(jitter(age),jitter(TimesPregnant) 29

Data Mining Volinsky - Columbia University Displaying Two Variables If one variable is categorical, use small multiples Many software packages have this implemented as ‘lattice’ or ‘trellis’ packages 30 library(‘lattice’) histogram(~DiastolicBP | TimesPregnant==0) library(‘lattice’) histogram(~DiastolicBP | TimesPregnant==0)

Data Mining Volinsky - Columbia University Two Variables - one categorical Side by side boxplots are very effective in showing differences in a quantitative variable across factor levels –tips data do men or women tip better –orchard sprays measuring potency of various orchard sprays in repelling honeybees 31

Data Mining Volinsky - Columbia University Barcharts and Spineplots stacked barcharts can be used to compare continuous values across two or more categorical ones. spineplots show proportions well, but can be hard to interpret 32 orange=M blue=F

Data Mining Volinsky - Columbia University More than two variables Pairwise scatterplots Can be somewhat ineffective for categorical data 33

Data Mining Volinsky - Columbia University 34

Data Mining Volinsky - Columbia University Multivariate: More than two variables Get creative! Conditioning on variables –trellis or lattice plots –Cleveland models on human perception, all based on conditioning –Infinite possibilities Earthquake data: –locations of 1000 seismic events of MB > 4.0. The events occurred in a cube near Fiji since 1964 –Data collected on the severity of the earthquake 35

Data Mining Volinsky - Columbia University 36

Data Mining Volinsky - Columbia University 37

Data Mining Volinsky - Columbia University 38 How many dimensions are represented here? Andrew Gelman blog 7/15/2009

39 Multivariate Vis: Parallel Coordinates Petal, a non-reproductive part of the flower Sepal, a non-reproductive part of the flower The famous iris data! Data Mining Volinsky - Columbia University

40 Parallel Coordinates Sepal Length 5.1 Data Mining Volinsky - Columbia University

41 Parallel Coordinates: 2 D Sepal Length 5.1 Sepal Width 3.5 Data Mining Volinsky - Columbia University

42 Parallel Coordinates: 4 D Sepal Length 5.1 Sepal Width Petal length Petal Width Data Mining Volinsky - Columbia University

Parallel Visualization of Iris data Data Mining Volinsky - Columbia University

Multivariate: Parallel coordinates Data Mining Volinsky - Columbia University Courtesy Unwin, Theus, Hofmann Alpha blending can be effective 44

Parallel coordinates Useful in an interactive setting Data Mining Volinsky - Columbia University 45

Data Mining Volinsky - Columbia University Starplots 46

Data Mining Volinsky - Columbia University Using Icons to Encode Information, e.g., Star Plots Each star represents a single observation. Star plots are used to examine the relative values for a single data point The star plot consists of a sequence of equi-angular spokes, called radii, with each spoke representing one of the variables. Useful for small data sets with up to 10 or so variables Limitations? –Small data sets, small dimensions –Ordering of variables may affect perception 1 Price 2 Mileage (MPG) Repair Record (1 = Worst, 5 = Best) Repair Record (1 = Worst, 5 = Best) 5 Headroom 6 Rear Seat Room 7 Trunk Space 8 Weight 9 Length 47

Data Mining Volinsky - Columbia University Chernoff’s Faces described by ten facial characteristic parameters: head eccentricity, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, eye spacing, eye size, mouth length and degree of mouth opening Much derided in statistical circles 48

Data Mining Volinsky - Columbia University Chernoff faces 49

Data Mining Volinsky - Columbia University Mosaic Plots generalization of spine plots for many categorical variables sensitive to the order which they are applied Titanic Data: 50

Mosaic plots Data Mining Volinsky - Columbia University Can be effective, but can get out of hand: 51

Data Mining Volinsky - Columbia University Networks and Graphs Visualizing networks is helpful, even if is not obvious that a network exists 52

Network Visualization Graphviz (open source software) is a nice layout tool for big and small graphs Data Mining Volinsky - Columbia University 53

Data Mining Volinsky - Columbia University What’s missing? pie charts –very popular –good for showing simple relations of proportions –Human perception not good at comparing arcs –barplots, histograms usually better (but less pretty) 3D –nice to be able to show three dimensions –hard to do well –often done poorly –3d best shown through “spinning” in 2D uses various types of projecting into 2D 54

Data Mining Volinsky - Columbia University 55 Worst graphic in the world?

Data Mining Volinsky - Columbia University Dimension Reduction One way to visualize high dimensional data is to reduce it to 2 or 3 dimensions –Variable selection e.g. stepwise –Principle Components find linear projection onto p-space with maximal variance –Multi-dimensional scaling takes a matrix of (dis)similarities and embeds the points in p- dimensional space to retain those similarities More on this in next Topic 56

Visualization done right Hans TED Data Mining Volinsky - Columbia University 57