Presentation on theme: "Chocolate Cake Seminar Series on Statistical Applications"— Presentation transcript:
1 Chocolate Cake Seminar Series on Statistical Applications
Today’s Talk: Be an Explorer with Exploratory Data Analysis!
By David Ramirez
2 Outline of Presentation
-- Exploratory v. Confirmatory Data Analyses
-- Exploratory Data Analysis Techniques
-- Examples of Graphical Techniques
-- Examples of Non-graphical Techniques
3 What is Exploratory Data Analysis (EDA)?
John Tukey ( ), American statistician: "It is important to understand what you CAN DO before you learn to measure how WELL you seem to have DONE it."
Definition: EDA consists of methods of discovering unanticipated patterns and relationships in a data set, by summarizing data quantitatively or presenting them visually.
4 Exploratory v. Confirmatory: Exploratory Data Analysis
Descriptive Statistics - Inductive Approach
-- Look for flexible ways to examine data without preconceptions
-- Heavy reliance on graphical displays
-- Let data suggest questions
Advantages
-- Flexible ways to generate hypotheses
-- Does not require more than data can support
-- Promotes deeper understanding of processes
Disadvantages
-- Usually does not provide definitive answers
-- Requires judgment - cannot be cookbooked
5 Exploratory v. Confirmatory: Confirmatory Data Analysis
Inferential Statistics - Deductive Approach
-- Hypothesis tests and formal confidence interval estimation
-- Hypotheses determined at outset
-- Heavy reliance on probability models
-- Look for definite answers to specific questions
-- Emphasis on numerical calculations
Advantages
-- Provide precise information in the right circumstances
-- Well-established theory and methods
Disadvantages
-- Misleading impression of precision in less than ideal circumstances
-- Analysis driven by preconceived ideas
-- Difficult to notice unexpected results
7 Stem-and-Leaf Plot
What is it? A plot where each data value is split into a "leaf" (usually the last digit) and a "stem" (the other digits).
Useful for describing distributions in terms of
-- Symmetry or skewness (right-skewed = long right tail, left-skewed = long left tail)
-- Unimodality, bimodality or multimodality (one, two, or more peaks)
-- Presence of outliers (a few very large or very small observations)
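The stem/leaf split described above can be sketched in a few lines of Python. This is only an illustration, not the SPSS procedure used in the talk: the function names `stem_and_leaf` and `render` and the rainfall values are hypothetical, and the sketch assumes non-negative whole numbers with the last digit as the leaf.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group non-negative integers by stem (all digits but the last)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # leaf = last digit, stem = the rest
        plot[stem].append(leaf)
    return dict(plot)


def render(plot):
    """Draw one row per stem, including empty stems so gaps stay visible."""
    lines = []
    for stem in range(min(plot), max(plot) + 1):
        leaves = "".join(str(d) for d in plot.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return "\n".join(lines)


# Hypothetical rainfall values (inches), made up for illustration only
rain = [12, 18, 25, 27, 31, 33, 33, 36, 38, 41, 42, 45, 47, 54]
print(render(stem_and_leaf(rain)))
```

Reading the rows sideways gives the shape of the distribution: long rows mark peaks, and isolated rows far from the rest suggest outliers.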
8 How To Create Stem-and-Leaf Plot
Syntax
EXAMINE VARIABLES=Rain
  /PLOT BOXPLOT STEMLEAF.
By Mouse
Analyze -> Descriptive Statistics -> Explore -> Plots -> Stem-and-leaf
9 Example: Stem-and-leaf Plot
We use SPSS to construct a stem-and-leaf plot for rainfall in US metropolitan areas.
[SPSS stem-and-leaf output; only the extreme-value rows survive in this transcript]
Frequency    Stem & Leaf
4.00 Extremes (=<15)
1.00 Extremes (>=60)
10 Box Plot
What is it? A way of graphically depicting groups of numerical data through their five-number summaries: the smallest observation (sample minimum), lower quartile (Q1), median (Q2), upper quartile (Q3), and largest observation (sample maximum). A box plot may also indicate which observations, if any, might be considered outliers.
Useful in visualizing the following:
-- Location
-- Spread
-- Skewness
-- Outliers
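The five-number summary behind a box plot can be sketched with Python's standard `statistics` module. Two assumptions go beyond the slide. First, quartile conventions differ between packages, so the "inclusive" method used here may not match SPSS output exactly. Second, the 1.5 x IQR rule for flagging outliers is a common convention, not something the slide states. The function names and data are hypothetical.

```python
import statistics


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, q2, q3, max(data)


def flag_outliers(data, k=1.5):
    """Flag points more than k * IQR beyond the quartiles (assumed rule)."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]


# Hypothetical values, made up for illustration only
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 40]
print(five_number_summary(values))
print(flag_outliers(values))
```

The box spans Q1 to Q3, the line inside it marks the median, and flagged points are drawn individually beyond the whiskers.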
11 How To Create Box Plot
Syntax
EXAMINE VARIABLES=Rain
  /PLOT=BOXPLOT.
By mouse
Graphs -> Legacy Dialogs -> Box Plots -> click Summaries of separate variables -> Scale Variable -> Optional: Label Cases -> OK
12 Example: Box Plot
Using the previous precipitation data, we would like to understand the distribution of rainfall and check for any outliers.
13 Example: Multiple Box Plots Side-by-side box plots below display the population distribution of large cities in 1960.
14 How To Create Box Plots
Syntax
EXAMINE VARIABLES=Population BY Country
  /PLOT=BOXPLOT
  /ID=City.
By mouse
Graphs -> Legacy Dialogs -> Box Plots -> click Summaries for groups of cases -> Define -> Variable (scale) -> Category Axis (the variable that defines the groups) -> Label Cases by (ID or name, optional)
15 Histogram
What is it? A diagram consisting of rectangles whose area is proportional to the frequency of a continuous variable and whose width is equal to the class interval (bin).
Useful for describing distributions in terms of
-- Symmetry or skewness
-- Unimodality, bimodality or multimodality
-- Presence of outliers
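To make the idea of class intervals concrete, here is a minimal Python sketch that counts observations in equal-width bins. It is an illustration under stated assumptions, not what SPSS does internally: the function name `histogram_counts`, the choice to put the maximum into the last bin, and the sample values are all hypothetical.

```python
def histogram_counts(values, bin_count):
    """Split [min, max] into bin_count equal-width bins and count values."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("all values are identical; bin width would be zero")
    width = (high - low) / bin_count
    counts = [0] * bin_count
    for v in values:
        # The maximum is placed in the last bin instead of opening a new one
        index = min(int((v - low) / width), bin_count - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(bin_count + 1)]
    return edges, counts


# Hypothetical values, made up for illustration only
edges, counts = histogram_counts([1, 2, 2, 3, 5, 8, 9, 10], bin_count=3)
for left, right, count in zip(edges, edges[1:], counts):
    print(f"[{left:4.1f}, {right:4.1f})  {'#' * count}")
```

Changing `bin_count` changes the apparent shape, which is why the number or width of bins is worth choosing deliberately.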
16 How To Create Histogram
Automatically chosen bins
Syntax
GRAPH
  /HISTOGRAM(NORMAL)=Population.
By Mouse
Graphs -> Histogram -> Variable (scale) -> OK
17 How To Create Histogram
User-selected number of bins
Syntax
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=Population MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: Population=col(source(s), name("Population"))
  GUIDE: axis(dim(1), label("Population"))
  GUIDE: axis(dim(2), label("Frequency"))
  ELEMENT: interval(position(summary.count(bin.rect(Population, binCount(5)))), shape.interior(shape.square))
END GPL.
By Mouse
Graphs -> Chart Builder -> Histogram -> drag Variable (scale) to the x-axis -> Set Parameters -> Custom -> Number of intervals -> Continue -> OK
18 How To Create Histogram
User-selected bin width
Syntax
* Chart Builder.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=Population MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: Population=col(source(s), name("Population"))
  GUIDE: axis(dim(1), label("Population"))
  GUIDE: axis(dim(2), label("Frequency"))
  ELEMENT: interval(position(summary.count(bin.rect(Population, binWidth(1)))), shape.interior(shape.square))
END GPL.
By Mouse
Graphs -> Chart Builder -> Histogram -> drag Variable (scale) to the x-axis -> Set Parameters -> Custom -> Interval width -> Continue -> OK
19 Example: Histogram
A researcher might need to choose the bins by hand to get a clearer picture of the distribution and to judge what type of distribution the data follow.
20 Scatterplot
What is it? A scatterplot is a plot of data points in the xy-plane that displays the strength, direction and shape of the relationship between the two variables.
Used for
-- Analyzing relationships between two variables
-- Looking to see if there are any outliers in the data
21 How To Create Scatterplot
Syntax
GRAPH
  /SCATTERPLOT(BIVAR)=Height WITH Wieght
  /MISSING=LISTWISE.
By Mouse
Graphs -> Legacy Dialogs -> Scatter/Dot -> Simple Scatter -> Y Axis (outcome) -> X Axis (predictor) -> OK
22 Example: Scatterplot
Researchers wanted to see if there is a link between Height and Weight.
23 Bar Graph
What is it?
-- A diagram consisting of rectangles whose area is proportional to the frequency of each level of a categorical variable.
-- A bar graph is similar to a histogram, but for categorical variables.
Used for
-- Comparison of frequencies for different levels
24 How To Create Bar Graph
Syntax
GRAPH
  /BAR(SIMPLE)=COUNT BY Gender.
By Mouse
Graphs -> Legacy Dialogs -> Bar -> Categorical Variable -> Category Axis -> OK
25 Example: Bar Graph
Experimenters wanted to make sure they had a roughly equal number of males and females in a study.
26 Pie Chart
What is it? A type of graph in which a circle is divided into sectors, one for each level of a categorical variable, with each sector illustrating the numerical proportion for that level.
Used for
-- Comparison of proportions for different levels
27 How To Create Pie Chart
Syntax
GRAPH
  /PIE=COUNT BY Bindedage.
By Mouse
Graphs -> Legacy Dialogs -> Pie -> Summaries for groups of cases -> Define -> categorical variable -> Define Slices by -> OK
28 Example: Pie Chart
A researcher wants to partition the age variable into a categorical variable in terms of mental development (College Age, Older Young Adult, Young Middle Age, Middle Middle Age and up).
29 Measures of Central Tendency
Non-Graphical Techniques
Central tendency is the location of the middle of the distribution.
-- Mean = sum of all data values divided by the number of values (arithmetic average).
30 Measures of Central Tendency
-- Median = the middle value after all the values are put in an ordered list (50% of observations lie below and 50% above the median). If there are two middle observations, the median is the average of the two.
-- Mode = the most frequently occurring value.
31 Measures of Spread
Spread is how far observations lie from each other.
-- Variance = average of the squared distances from the mean.
-- Standard deviation = square root of the variance.
-- Range = maximum - minimum.
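These measures can be checked by hand with Python's standard `statistics` module. The sketch below is illustrative only: `summarize` and the data are hypothetical. One point worth noting is that the slide's definition of variance (the average squared distance) is the population variance, whereas statistical packages such as SPSS typically report the sample variance, which divides by n - 1. Both are shown.

```python
import statistics


def summarize(data):
    """Central tendency and spread for a list of numbers."""
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "mode": statistics.mode(data),
        # Divides by n: matches "average of the squared distances"
        "variance_population": statistics.pvariance(data),
        # Divides by n - 1: what most packages print by default
        "variance_sample": statistics.variance(data),
        "std_dev_sample": statistics.stdev(data),
        "range": max(data) - min(data),
    }


# Hypothetical values, made up for illustration only
for name, value in summarize([2, 4, 4, 5, 7, 8]).items():
    print(f"{name}: {value}")
```

For these six values the squared distances from the mean of 5 sum to 24, so the population variance is 24 / 6 = 4 and the sample variance is 24 / 5 = 4.8.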
32 How to Compute Measures of Central Tendency and Spread
Syntax
FREQUENCIES VARIABLES=MORT
  /STATISTICS=STDDEV VARIANCE RANGE MEAN MEDIAN MODE
  /ORDER=ANALYSIS.
By Mouse
Analyze -> Descriptive Statistics -> Frequencies -> select a scale variable -> click Statistics -> select Mean, Median, Mode, Range, Maximum and Minimum
33 Example: Central Tendency and Spread We use SPSS to figure out the Central Tendency and Spread of the Mortality rates in the 1960s.
34 Correlation Coefficient
What is it?
-- A numeric measure of the linear relationship between two continuous variables.
Properties of the correlation coefficient:
-- Ranges between -1 and 1
-- The closer it is to -1 or 1, the stronger the linear relationship is
-- If r=0, the two variables are not linearly correlated
-- If r is positive, the relationship is described as positive (larger values of one variable tend to accompany larger values of the other variable)
-- If r is negative, the relationship is described as negative (larger values of one variable tend to accompany smaller values of the other variable)
35 Correlation
A word of caution: the correlation coefficient measures only linear relationships. Two variables can be strongly related along a curve and still have a correlation near zero.
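Both the properties and the warning can be seen in a small Python sketch of the Pearson correlation coefficient. The function name `pearson_r` and the data are hypothetical, chosen so that one relationship is perfectly linear and the other is a perfect curve.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: covariation divided by the product of spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data, made up for illustration only
x = [-3, -2, -1, 0, 1, 2, 3]
line = [2 * a + 1 for a in x]   # exact straight line
curve = [a * a for a in x]      # exact parabola, symmetric about zero

print(pearson_r(x, line))   # perfectly linear, so r is 1
print(pearson_r(x, curve))  # perfectly related, yet r is 0
```

The parabola case is why a scatterplot should accompany any correlation: y is completely determined by x, but the positive and negative halves cancel out in the linear measure.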
36 Linear Regression
What is it?
-- Statistical technique of fitting a linear function to data points in an attempt to describe a relationship between two variables.
Used for
-- Prediction
-- Interpretation of coefficients (change in y for a unit increase in x)
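For a single predictor, the least-squares line has a closed form, sketched below in Python. This is an illustration of the technique rather than the SPSS REGRESSION procedure: `fit_line` and the data points are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    # The fitted line always passes through (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical points lying exactly on y = 1 + 2x, for illustration only
intercept, slope = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(f"y = {intercept} + {slope} * x")
print("prediction at x = 5:", intercept + slope * 5)
```

The slope is the interpretation the slide refers to: the estimated change in y for a one-unit increase in x.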
37 How To Find Correlation and Fitted Regression Line
Syntax
REGRESSION
  /DESCRIPTIVES MEAN STDDEV CORR SIG N
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT Wieght
  /METHOD=ENTER Height.
By mouse
Analyze -> Regression -> Linear -> move Y (the variable we want to predict) to Dependent -> move X (the variable we are using to predict Y) to Independent -> OK
38 Example: Correlation
Referring to our weight and height scatterplot, the researchers want to check how strongly related these two variables are.
39 Example: Regression
Researchers want to create a linear model using height as the independent variable (predictor) and weight as the dependent variable (outcome or response).
The fitted line can be written as Weight= (Height)
Coefficients
Columns: Model | Unstandardized Coefficients (B, Std. Error) | Standardized Coefficients (Beta) | t | Sig.
1 (Constant): 7.539, .000
  Hieght: 1.018, .044, .717, 23.135
40 Frequency Table
What is it?
-- A table that shows the frequency (count) for each level of a categorical variable.
Used for
-- Comparison of frequencies for different levels
41 How To Find Frequency Table
Syntax
FREQUENCIES VARIABLES=EDUbins
  /ORDER=ANALYSIS.
By mouse
Analyze -> Descriptive Statistics -> Frequencies -> Variable -> Display frequency tables -> OK
42 Example: Frequency Table
We want to know the frequencies of different educational levels in US metropolitan areas in the 1960s. We first have to use visual binning to define the bins. Using the range, we create bins for 9th, 10th, 11th, and 12th grade and up.
Syntax
* Visual Binning.
*EDU.
RECODE EDU (MISSING=COPY) (12 THRU HI=4) (11 THRU HI=3) (10 THRU HI=2) (LO THRU HI=1) (ELSE=SYSMIS) INTO EDUbins.
VARIABLE LABELS EDUbins 'EDU (Binned)'.
FORMATS EDUbins (F5.0).
VALUE LABELS EDUbins 1 '9th Grade' 2 '10th Grade' 3 '11th Grade' 4 '12th grade and up'.
VARIABLE LEVEL EDUbins (ORDINAL).
By Mouse
Transform -> Visual Binning -> select the variable to convert into an ordinal variable -> OK -> Make Cutpoints -> enter number of cutpoints and width -> Apply -> OK
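The RECODE line works because each value takes the first rule it matches, so the overlapping ranges are checked from the top down. A minimal Python sketch of the same logic follows. The function `bin_edu` and the sample values are hypothetical, and `None` is assumed here to stand in for a missing value.

```python
LABELS = {1: "9th Grade", 2: "10th Grade", 3: "11th Grade", 4: "12th grade and up"}


def bin_edu(edu):
    """Assign the first matching bin, checking the highest cutoff first."""
    if edu is None:   # mirrors (MISSING=COPY): missing stays missing
        return None
    if edu >= 12:     # (12 THRU HI=4)
        return 4
    if edu >= 11:     # (11 THRU HI=3)
        return 3
    if edu >= 10:     # (10 THRU HI=2)
        return 2
    return 1          # (LO THRU HI=1) catches everything that is left


# Hypothetical years of schooling, made up for illustration only
for value in [9.6, 10.0, 11.4, 12.3, None]:
    code = bin_edu(value)
    print(value, "->", code, LABELS.get(code))
```

Reordering the checks would change the result: testing the catch-all rule first would put every value in bin 1.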
43 Example: Frequency Table
EDU (Binned)
                           Frequency   Percent   Valid Percent   Cumulative Percent
Valid  9th Grade                   9      15.0            15.0                 15.0
       10th Grade                 19      31.7            31.7                 46.7
       11th Grade                 20      33.3            33.3                 80.0
       12th grade and up          12      20.0            20.0                100.0
       Total                      60     100.0           100.0
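The percent and cumulative percent columns follow directly from the counts. The Python sketch below rebuilds them from the four frequencies shown in the table (9, 19, 20, 12). The function name `frequency_table` is hypothetical, and the individual observations are reconstructed from the counts, not taken from the original data file.

```python
from collections import Counter


def frequency_table(values, order):
    """Return (level, count, percent, cumulative percent) for each level."""
    counts = Counter(values)
    n = len(values)
    rows, running = [], 0
    for level in order:
        count = counts[level]
        running += count
        rows.append((level, count,
                     round(100 * count / n, 1),
                     round(100 * running / n, 1)))
    return rows


levels = ["9th Grade", "10th Grade", "11th Grade", "12th grade and up"]
# Observations rebuilt from the counts in the table above
observations = ([levels[0]] * 9 + [levels[1]] * 19
                + [levels[2]] * 20 + [levels[3]] * 12)

for level, count, pct, cum in frequency_table(observations, levels):
    print(f"{level:<18} {count:>3} {pct:>6} {cum:>6}")
```

With no missing values, the valid percent equals the percent, which is why those two columns agree in the table.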
44 Cross-tabulation
What is it? A two-way table containing frequencies (counts) for different levels of the column and row variables.
Used for
-- Comparison of frequencies for different levels of the variables (chi-squared test)
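The Pearson chi-squared statistic for a two-way table compares each observed count with the count expected if the row and column variables were independent. A minimal Python sketch follows. The function name `chi_square` and the 2 x 2 counts are hypothetical, and the sketch stops at the statistic and degrees of freedom; converting them to a p-value needs a chi-squared distribution table or a statistics package.

```python
def chi_square(table):
    """Pearson chi-squared statistic and degrees of freedom for a count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df


# Hypothetical counts, made up for illustration only
print(chi_square([[10, 20], [20, 10]]))  # rows differ, statistic is large
print(chi_square([[10, 20], [20, 40]]))  # same proportions, statistic is 0
```

A 4 x 4 table gives (4 - 1) x (4 - 1) = 9 degrees of freedom, which is the df SPSS reports for a cross-tabulation of two four-level variables.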
45 How To Find Cross-tabulation
Syntax
CROSSTABS
  /TABLES=EDUbins BY US
  /FORMAT=AVALUE TABLES
  /STATISTICS=CHISQ
  /CELLS=COUNT
  /COUNT ROUND CELL.
By Mouse
Analyze -> Descriptive Statistics -> Crosstabs -> select variable for row -> select variable for column -> Statistics -> Chi-square -> Continue -> OK
46 Example: Cross-tabulation
Researchers wish to understand whether the educational levels in the SMSA data are distributed equally across the regions of the US.
The small p-value of the Pearson chi-square test (.002) suggests that the distribution of educational levels differs among the regions of the US.
Chi-Square Tests
                                Value    df   Asymp. Sig. (2-sided)
Pearson Chi-Square             26.078     9   .002
Likelihood Ratio               25.377     9   .003
Linear-by-Linear Association    9.893     1
N of Valid Cases                   60
48 Recommended Readings/Citations
Hartwig, F., & Dearing, B. E. (1979). Exploratory Data Analysis. Beverly Hills: Sage Publications.
Hoaglin, D. C., Mosteller, F., & Tukey, J. W. (1983). Understanding Robust and Exploratory Data Analysis. New York: John Wiley & Sons Inc.
Pampel, F. C. (2004). Exploratory Data Analysis. In M. S. Lewis-Beck, A. Bryman, & T. Futing Liao, The SAGE Encyclopedia of Social Science Research Methods (pp ). Thousand Oaks, California: Sage Publications.
Vogt, W. P. (1999). Exploratory Data Analysis. In W. P. Vogt, Dictionary of Statistics & Methodology: A Nontechnical Guide for the Social Sciences (pp ). Thousand Oaks, California: SAGE Publications, Inc.