Published by Lara Pulsipher

1
Chocolate Cake Seminar Series on Statistical Applications
Today's Talk: Be an Explorer with Exploratory Data Analysis!
By David Ramirez

2
Outline of Presentation
- Exploratory v. Confirmatory Data Analyses
- Exploratory Data Analysis Techniques
- Examples of Graphical Techniques
- Examples of Non-graphical Techniques

3
What is Exploratory Data Analysis (EDA)?
John Tukey, American statistician: "It is important to understand what you CAN DO before you learn to measure how WELL you seem to have DONE it."
Definition: EDA consists of methods for discovering unanticipated patterns and relationships in a data set by summarizing the data quantitatively or presenting them visually.

4
Exploratory v. Confirmatory
Exploratory Data Analysis: Descriptive Statistics, Inductive Approach
- Look for flexible ways to examine data without preconceptions
- Heavy reliance on graphical displays
- Let data suggest questions
Advantages
- Flexible ways to generate hypotheses
- Does not require more than data can support
- Promotes deeper understanding of processes
Disadvantages
- Usually does not provide definitive answers
- Requires judgment; cannot be reduced to a cookbook procedure

5
Exploratory v. Confirmatory
Confirmatory Data Analysis: Inferential Statistics, Deductive Approach
- Hypothesis tests and formal confidence interval estimation
- Hypotheses determined at outset
- Heavy reliance on probability models
- Look for definite answers to specific questions
- Emphasis on numerical calculations
Advantages
- Provides precise information in the right circumstances
- Well-established theory and methods
Disadvantages
- Misleading impression of precision in less than ideal circumstances
- Analysis driven by preconceived ideas
- Difficult to notice unexpected results

6
EDA Techniques
Graphical presentation of distribution
- Continuous variables (stem-and-leaf plot, box plot, histogram, bivariate scatterplot)
- Categorical variables (bar graph, pie chart)
Non-graphical summary of distribution
- Continuous variables (mean, median, mode, variance, standard deviation, range, correlation coefficient, linear regression)
- Categorical variables (frequency table, cross-tabulation)

7
Stem-and-Leaf Plot
What is it?
- A plot where each data value is split into a "leaf" (usually the last digit) and a "stem" (the other digits).
Useful for describing distributions in terms of
- Symmetry or skewness (right-skewed = long right tail, left-skewed = long left tail)
- Unimodality, bimodality or multimodality (one, two, or more peaks)
- Presence of outliers (a few very large or very small observations)
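The stem/leaf split described above can be sketched in a few lines of Python. This is an illustration only, not the SPSS procedure used in the talk: the `rain` values are made up, and the helper names `stem_and_leaf` and `render` are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group non-negative integers by stem; the leaf is the last digit."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # e.g. 43 -> stem 4, leaf 3
        plot[stem].append(leaf)
    return dict(plot)

def render(plot):
    """Draw one text row per stem, including empty stems, so gaps stay visible."""
    rows = []
    for stem in range(min(plot), max(plot) + 1):
        leaves = "".join(str(d) for d in plot.get(stem, []))
        rows.append(f"{stem:>2} | {leaves}")
    return "\n".join(rows)

# Hypothetical rainfall values (not the data from the talk).
rain = [12, 31, 35, 36, 40, 42, 43, 43, 47, 52, 60]
plot = stem_and_leaf(rain)
print(render(plot))
```

Keeping the empty stem rows is a deliberate choice: a gap between 12 and the 30s is exactly the kind of feature (a possible low outlier) the plot is meant to expose.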

8
How To Create Stem-and-Leaf Plot
Syntax
    EXAMINE VARIABLES=Rain
      /PLOT BOXPLOT STEMLEAF.
By Mouse
- Analyze -> Descriptive Statistics -> Explore -> Plots -> Stem-and-leaf

9
Example: Stem-and-Leaf Plot
We use SPSS to construct a stem-and-leaf plot for rainfall in US metropolitan areas.
[SPSS output: stem-and-leaf plot (columns: Frequency, Stem & Leaf). Surviving rows: 4.00 Extremes (=<15); Extremes (>=60).]

10
Box Plot
What is it?
- A way of graphically depicting groups of numerical data through their five-number summaries: the smallest observation (sample minimum), lower quartile (Q1), median (Q2), upper quartile (Q3), and largest observation (sample maximum). A box plot may also indicate which observations, if any, might be considered outliers.
Useful in visualizing the following:
- Location
- Spread
- Skewness
- Outliers
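To make the five-number summary concrete, here is a minimal Python sketch. Two things are assumptions not stated in the slides: the quartiles use the "inclusive" interpolation of Python's `statistics.quantiles` (SPSS may compute hinges slightly differently), and outliers are flagged with the common 1.5 × IQR fence rule. The data are hypothetical.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    data = sorted(values)
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, q2, q3, data[-1]

def flag_outliers(values, k=1.5):
    """Assumed rule: points beyond k * IQR from the quartiles are outliers."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical rainfall values (not the data from the talk).
rain = [10, 30, 33, 36, 38, 40, 43, 45, 65]
print(five_number_summary(rain))
print(flag_outliers(rain))
```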

11
How To Create Box Plot
Syntax
    EXAMINE VARIABLES=Rain
      /PLOT=BOXPLOT.
By Mouse
- Graphs -> Legacy Dialogs -> Boxplot -> Summaries of separate variables -> scale variable -> optional: Label Cases -> OK

12
Example: Box Plot
Using the previous precipitation data, we want to understand the distribution of rainfall and check for outliers.

13
Example: Multiple Box Plots Side-by-side box plots below display the population distribution of large cities in

14
How To Create Box Plots
Syntax
    EXAMINE VARIABLES=Population BY Country
      /PLOT=BOXPLOT
      /ID=City.
By Mouse
- Graphs -> Legacy Dialogs -> Boxplot -> Summaries for groups of cases -> Define -> Variable (scale) -> Category Axis (the grouping variable) -> Label Cases by (ID or name, optional)

15
Histogram
What is it?
- A diagram of rectangles whose area is proportional to the frequency of a continuous variable and whose width equals the class interval (bin).
Useful for describing distributions in terms of
- Symmetry or skewness
- Unimodality, bimodality or multimodality
- Presence of outliers

16
How To Create Histogram: Automatically Chosen Bins
Syntax
    GRAPH
      /HISTOGRAM(NORMAL)=Population.
By Mouse
- Graphs -> Histogram -> Variable (scale) -> OK

17
How To Create Histogram: User-Selected Number of Bins
Syntax
    GGRAPH
      /GRAPHDATASET NAME="graphdataset" VARIABLES=Population MISSING=LISTWISE REPORTMISSING=NO
      /GRAPHSPEC SOURCE=INLINE.
    BEGIN GPL
      SOURCE: s=userSource(id("graphdataset"))
      DATA: Population=col(source(s), name("Population"))
      GUIDE: axis(dim(1), label("Population"))
      GUIDE: axis(dim(2), label("Frequency"))
      ELEMENT: interval(position(summary.count(bin.rect(Population, binCount(5)))), shape.interior(shape.square))
    END GPL.
By Mouse
- Graphs -> Chart Builder -> Histogram -> drag the scale variable to the x-axis -> Set Parameters -> Custom -> Number of intervals -> Continue -> OK

18
How To Create Histogram: User-Selected Bin Width
Syntax
    * Chart Builder.
    GGRAPH
      /GRAPHDATASET NAME="graphdataset" VARIABLES=Population MISSING=LISTWISE REPORTMISSING=NO
      /GRAPHSPEC SOURCE=INLINE.
    BEGIN GPL
      SOURCE: s=userSource(id("graphdataset"))
      DATA: Population=col(source(s), name("Population"))
      GUIDE: axis(dim(1), label("Population"))
      GUIDE: axis(dim(2), label("Frequency"))
      ELEMENT: interval(position(summary.count(bin.rect(Population, binWidth(1)))), shape.interior(shape.square))
    END GPL.
By Mouse
- Graphs -> Chart Builder -> Histogram -> drag the scale variable to the x-axis -> Set Parameters -> Custom -> Interval width -> Continue -> OK

19
Example: Histogram
A researcher may need to choose the bins by hand to see the shape of the distribution more clearly and to judge which type of distribution the data follow.
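How much the bin choice matters can be seen by counting the same values under two bin widths. The sketch below is plain Python rather than SPSS; the population figures (in millions) are invented, and `histogram_counts` is a hypothetical helper.

```python
import math
from collections import Counter

def histogram_counts(values, bin_width, start=0.0):
    """Count values per bin [left, right); bins with no values are omitted."""
    counts = Counter(math.floor((v - start) / bin_width) for v in values)
    return {
        (start + k * bin_width, start + (k + 1) * bin_width): counts[k]
        for k in sorted(counts)
    }

# Hypothetical city populations in millions (not the data from the talk).
population = [1.2, 1.8, 2.5, 2.9, 3.1, 7.4]

print(histogram_counts(population, bin_width=1))  # narrow bins: the 7.4 stands apart
print(histogram_counts(population, bin_width=5))  # wide bins: shape is mostly hidden
```

With a width of 1 the large city is visibly isolated from the rest; with a width of 5 the same data collapse into two bars and that detail is lost.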

20
Scatterplot
What is it?
- A plot of data points in the xy-plane that displays the strength, direction and shape of the relationship between two variables.
Used for
- Analyzing relationships between two variables
- Checking whether there are any outliers in the data

21
How To Create Scatterplot
Syntax
    GRAPH
      /SCATTERPLOT(BIVAR)=Height WITH Wieght
      /MISSING=LISTWISE.
By Mouse
- Graphs -> Legacy Dialogs -> Scatter/Dot -> Simple Scatter -> Y Axis (outcome) -> X Axis (predictor) -> OK

22
Example: Scatterplot
Researchers wanted to see if there is a link between Height and Weight.

23
Bar Graph
What is it?
- A diagram of rectangles whose area is proportional to the frequency of each level of a categorical variable.
- A bar graph is similar to a histogram, but for categorical variables.
Used for
- Comparison of frequencies for different levels

24
How To Create Bar Graph
Syntax
    GRAPH
      /BAR(SIMPLE)=COUNT BY Gender.
By Mouse
- Graphs -> Legacy Dialogs -> Bar -> categorical variable -> Category Axis -> OK

25
Example: Bar Graph
Experimenters wanted to make sure they had a close-to-equal number of males and females in a study.

26
Pie Chart
What is it?
- A graph in which a circle is divided into sectors, one for each level of a categorical variable, with each sector showing the proportion of observations at that level.
Used for
- Comparison of proportions for different levels

27
How To Create Pie Chart
Syntax
    GRAPH
      /PIE=COUNT BY Bindedage.
By Mouse
- Graphs -> Legacy Dialogs -> Pie -> Summaries for groups of cases -> Define -> categorical variable -> Define Slices by -> OK

28
Example: Pie Chart
A researcher wants to partition the age variable into a categorical variable in terms of mental development (College Age, Older Young Adult, Young Middle Age, Middle Middle Age and up).

29
Non-Graphical Techniques
Measures of Central Tendency
Central tendency is the location of the middle value.
- Mean = sum of all data values divided by the number of values (arithmetic average).

30
Measures of Central Tendency
- Median = the middle value after all the values are put in an ordered list (50% of observations lie below the median and 50% above).
- If there are two middle observations, the median is the average of the two.
- Mode = the most frequently occurring value.

31
Measures of Spread
Spread is how far observations lie from each other.
- Variance = average of the squared distances from the mean.
- Standard deviation = square root of the variance.
- Range = maximum - minimum.
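The definitions of central tendency and spread can be checked by hand on a tiny data set. The sketch below uses Python's standard `statistics` module with made-up values. One caveat worth flagging: a literal "average of squared distances" divides by n (`pvariance`), whereas the sample variance that statistical packages usually report divides by n - 1 (`variance`); both are shown.

```python
import statistics

# Hypothetical mortality-rate values (not the data from the talk).
mort = [8, 9, 9, 10, 14]

summary = {
    "mean": statistics.mean(mort),            # sum of values / number of values
    "median": statistics.median(mort),        # middle value of the ordered list
    "mode": statistics.mode(mort),            # most frequent value
    "pvariance": statistics.pvariance(mort),  # squared distances averaged over n
    "variance": statistics.variance(mort),    # sample variance, divides by n - 1
    "stdev": statistics.stdev(mort),          # square root of the sample variance
    "range": max(mort) - min(mort),           # maximum - minimum
}

for name, value in summary.items():
    print(f"{name:>9}: {value}")
```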

32
How to Compute Measures of Central Tendency and Spread
Syntax
    FREQUENCIES VARIABLES=MORT
      /STATISTICS=STDDEV VARIANCE RANGE MEAN MEDIAN MODE
      /ORDER=ANALYSIS.
By Mouse
- Analyze -> Descriptive Statistics -> Frequencies -> select a scale variable -> Statistics -> select Mean, Median, Mode, Std. deviation, Variance and Range

33
Example: Central Tendency and Spread
We use SPSS to compute the central tendency and spread of mortality rates in the 1960s.

34
Correlation Coefficient
What is it?
- A numeric measure of the linear relationship between two continuous variables.
Properties of the correlation coefficient:
- Ranges between -1 and 1
- The closer it is to -1 or 1, the stronger the linear relationship
- If r = 0, the two variables are not linearly correlated
- If r is positive, the relationship is described as positive (larger values of one variable tend to accompany larger values of the other variable)
- If r is negative, the relationship is described as negative (larger values of one variable tend to accompany smaller values of the other variable)
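The properties listed above follow from how r is computed. The sketch below implements the usual Pearson formula (the sum of cross-products divided by the square root of the product of the sums of squares) in plain Python; the height and weight values are invented for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical heights (cm) and weights (kg), not the data from the talk.
height = [150, 160, 170, 180, 190]
weight = [50, 58, 69, 77, 91]

r = pearson_r(height, weight)
print(round(r, 3))  # positive: taller people tend to be heavier in this sample

# An exactly decreasing straight line gives r = -1 (up to rounding error).
r_neg = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
print(r_neg)
```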

35
Correlation
A word of caution:
- Correlation measures only linear relationships; two variables can still be strongly related along a curve.
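A small constructed example makes the warning concrete: below, y is completely determined by x (y = x²), yet the correlation coefficient is zero because the relationship is a symmetric curve, not a line. The data are artificial and the helper is a plain-Python Pearson formula.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [-2, -1, 0, 1, 2]
y = [v ** 2 for v in x]  # perfect, but curved, relationship

r_curve = pearson_r(x, y)
print(r_curve)  # zero linear correlation despite exact dependence
```

This is one reason to look at the scatterplot before trusting r on its own.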

36
Linear Regression
What is it?
- A statistical technique of fitting a linear function to data points in an attempt to describe the relationship between two variables.
Used for
- Prediction
- Interpretation of coefficients (change in y for a unit increase in x)
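For a single predictor, the least-squares line has a closed form: the slope is the sum of cross-products divided by the sum of squares of x, and the intercept makes the line pass through the point of means. The following Python sketch assumes ordinary least squares and uses invented height and weight values; `fit_line` and `predict` are hypothetical helper names.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict(intercept, slope, x_new):
    """Use the fitted line for prediction."""
    return intercept + slope * x_new

# Hypothetical heights (cm) and weights (kg), not the data from the talk.
height = [150, 160, 170, 180, 190]
weight = [50, 58, 69, 77, 91]

intercept, slope = fit_line(height, weight)
print(f"Weight = {intercept:.2f} + {slope:.2f} * Height")
print(predict(intercept, slope, 175))
```

The slope is read as the expected change in weight for a one-unit increase in height, which is the interpretation of coefficients mentioned above.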

37
How To Find Correlation and Fitted Regression Line
Syntax
    REGRESSION
      /DESCRIPTIVES MEAN STDDEV CORR SIG N
      /MISSING LISTWISE
      /STATISTICS COEFF OUTS R ANOVA
      /CRITERIA=PIN(.05) POUT(.10)
      /NOORIGIN
      /DEPENDENT Wieght
      /METHOD=ENTER Height.
By Mouse
- Analyze -> Regression -> Linear -> move Y (the variable we want to predict) to Dependent -> move X (the variable we use to predict Y) to Independent(s) -> OK

38
Example: Correlation
Referring to our weight and height scatterplot, the researchers want to check how strongly these two variables are related.

39
Example: Regression
Researchers want to create a linear model using height as the independent variable (predictor) and weight as the dependent variable (outcome or response). The fitted line can be written as Weight= (Height)
[SPSS output table: Coefficients. Columns: Model; Unstandardized Coefficients (B, Std. Error); Standardized Coefficients (Beta); t; Sig. Rows: (Constant), Hieght.]

40
Frequency Table
What is it?
- A table that shows the frequency (count) for each level of a categorical variable.
Used for
- Comparison of frequencies for different levels
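A frequency table is just a count per level, usually shown alongside the percent and cumulative percent. The Python sketch below builds those three columns for a coded categorical variable; the codes are made up and `frequency_table` is a hypothetical helper, not an SPSS function.

```python
from collections import Counter

def frequency_table(levels):
    """Return rows of (level, count, percent, cumulative percent)."""
    counts = Counter(levels)
    total = sum(counts.values())
    rows, cumulative = [], 0.0
    for level in sorted(counts):
        percent = 100 * counts[level] / total
        cumulative += percent
        rows.append((level, counts[level], percent, cumulative))
    return rows

# Hypothetical education codes: 1 = 9th grade ... 4 = 12th grade and up.
edu_codes = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]

table = frequency_table(edu_codes)
for level, count, percent, cumulative in table:
    print(f"{level}  {count:>3}  {percent:5.1f}  {cumulative:5.1f}")
```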

41
How To Find Frequency Table
Syntax
    FREQUENCIES VARIABLES=EDUbins
      /ORDER=ANALYSIS.
By Mouse
- Analyze -> Descriptive Statistics -> Frequencies -> variable -> Display frequency tables -> OK

42
Example: Frequency Table
We want to know the frequencies of the different educational levels in US metropolitan areas in the 1960s. We first use visual binning to identify the bins. Using the range, we create bins for 9th, 10th, 11th, and 12th grade and up.
Syntax
    * Visual Binning.
    *EDU.
    RECODE EDU (MISSING=COPY) (12 THRU HI=4) (11 THRU HI=3) (10 THRU HI=2) (LO THRU HI=1) (ELSE=SYSMIS) INTO EDUbins.
    VARIABLE LABELS EDUbins 'EDU (Binned)'.
    FORMATS EDUbins (F5.0).
    VALUE LABELS EDUbins 1 '9th Grade' 2 '10th Grade' 3 '11th Grade' 4 '12th grade and up'.
    VARIABLE LEVEL EDUbins (ORDINAL).
By Mouse
- Transform -> Visual Binning -> select the variable to turn into an ordinal variable -> OK -> Make Cutpoints -> enter the number of cutpoints and the width -> Apply -> OK
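The RECODE rules work because the first matching range wins, so a value of 11.4 is caught by "11 THRU HI" before it can reach the catch-all "LO THRU HI". The same cascade can be written as ordered checks in Python. This is only a sketch of the logic: `bin_education` is a hypothetical name, and `None` stands in for a missing value.

```python
# Labels taken from the VALUE LABELS statement in the syntax above.
LABELS = {1: "9th Grade", 2: "10th Grade", 3: "11th Grade", 4: "12th grade and up"}

def bin_education(edu):
    """Mirror the RECODE cascade: test the highest range first."""
    if edu is None:   # missing values stay missing
        return None
    if edu >= 12:
        return 4
    if edu >= 11:
        return 3
    if edu >= 10:
        return 2
    return 1          # everything below 10

# Hypothetical median years of schooling (not the data from the talk).
edu = [9.5, 10.0, 11.4, 12.3, None]
binned = [bin_education(v) for v in edu]
print(binned)
print([LABELS.get(code) for code in binned])
```

Reversing the order of the checks would send every value to bin 1, which is why the order of the ranges in the RECODE command matters.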

43
Example: Frequency Table
[SPSS output table: EDU (Binned). Columns: Frequency, Percent, Valid Percent, Cumulative Percent. Rows: 9th Grade, 10th Grade, 11th Grade, 12th grade and up, Total.]

44
Cross-tabulation
What is it?
- A two-way table containing frequencies (counts) for different levels of the column and row variables.
Used for
- Comparison of frequencies for different levels of the variables (chi-squared test)
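A cross-tabulation and the Pearson chi-squared statistic computed from it can be sketched in plain Python. The statistic compares each observed count with the count expected if rows and columns were independent (row total × column total / grand total). The education and region values below are invented, the helper names are hypothetical, and turning the statistic into a p-value needs a chi-squared distribution function, which is left out here.

```python
from collections import Counter

def crosstab(row_values, col_values):
    """Count how often each (row level, column level) pair occurs."""
    return Counter(zip(row_values, col_values))

def chi_square(table):
    """Return (Pearson chi-squared statistic, degrees of freedom)."""
    row_totals, col_totals = Counter(), Counter()
    for (r, c), n in table.items():
        row_totals[r] += n
        col_totals[c] += n
    total = sum(table.values())
    stat = 0.0
    for r in row_totals:
        for c in col_totals:
            expected = row_totals[r] * col_totals[c] / total
            observed = table.get((r, c), 0)
            stat += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df

# Hypothetical data: education level by region (not the SMSA data).
edu = ["low"] * 10 + ["high"] * 10
region = ["East"] * 8 + ["West"] * 2 + ["East"] * 2 + ["West"] * 8

table = crosstab(edu, region)
stat, df = chi_square(table)
print(dict(table))
print(stat, df)
```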

45
How To Find Cross-tabulation
Syntax
    CROSSTABS
      /TABLES=EDUbins BY US
      /FORMAT=AVALUE TABLES
      /STATISTICS=CHISQ
      /CELLS=COUNT
      /COUNT ROUND CELL.
By Mouse
- Analyze -> Descriptive Statistics -> Crosstabs -> select variable for row -> select variable for column -> Statistics -> Chi-square -> Continue -> OK

46
Example: Cross-tabulation
Researchers wish to understand whether the educational levels in the SMSA data are equally distributed across the US. Judging by the p-value, the educational levels differ among the regions of the US.
[SPSS output table: Chi-Square Tests. Columns: Value, df, Asymp. Sig. (2-sided). Rows: Pearson Chi-Square, Likelihood Ratio, Linear-by-Linear Association, N of Valid Cases (60).]

48
Recommended Readings/Citations
- Hartwig, F., & Dearing, B. E. (1979). Exploratory Data Analysis. Beverly Hills: Sage Publications.
- Hoaglin, D. C., Mosteller, F., & Tukey, J. W. (1983). Understanding Robust and Exploratory Data Analysis. New York: John Wiley & Sons Inc.
- Pampel, F. C. (2004). Exploratory Data Analysis. In M. S. Lewis-Beck, A. Bryman, & T. Futing Liao, The SAGE Encyclopedia of Social Science Research Methods (pp ). Thousand Oaks, California: Sage Publications.
- Vogt, W. P. (1999). Exploratory Data Analysis. In W. P. Vogt, Dictionary of Statistics & Methodology: A Nontechnical Guide for the Social Sciences (pp ). Thousand Oaks, California: SAGE Publications, Inc.
