Eco 6380 Predictive Analytics For Economists Spring 2016 Professor Tom Fomby Department of Economics SMU.

Slides:



Advertisements
Similar presentations
Chapter 3, Numerical Descriptive Measures
Advertisements

Describing Quantitative Variables
Descriptive Measures MARE 250 Dr. Jason Turner.
Chapter 3 – Data Exploration and Dimension Reduction © Galit Shmueli and Peter Bruce 2008 Data Mining for Business Intelligence Shmueli, Patel & Bruce.
Measures of Dispersion
1 Chapter 1: Sampling and Descriptive Statistics.
Chapter 3 Describing Data Using Numerical Measures
Business Statistics: A Decision-Making Approach, 7e © 2008 Prentice-Hall, Inc. Chap 3-1 Business Statistics: A Decision-Making Approach 7 th Edition Chapter.
Examining Relationship of Variables  Response (dependent) variable - measures the outcome of a study.  Explanatory (Independent) variable - explains.
Copyright (c) 2004 Brooks/Cole, a division of Thomson Learning, Inc. Chapter Two Treatment of Data.
Measures of Central Tendency
Basic Business Statistics 10th Edition
B a c kn e x t h o m e Classification of Variables Discrete Numerical Variable A variable that produces a response that comes from a counting process.
Regression Chapter 10 Understandable Statistics Ninth Edition By Brase and Brase Prepared by Yixun Shi Bloomsburg University of Pennsylvania.
Chap 3-1 Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chapter 3 Describing Data: Numerical Statistics for Business and Economics.
The Five-Number Summary And Boxplots. Chapter 3 – Section 5 ●Learning objectives  Compute the five-number summary  Draw and interpret boxplots 1 2.
Hydrologic Statistics
Descriptive Methods in Regression and Correlation
Overview DM for Business Intelligence.
© 2005 The McGraw-Hill Companies, Inc., All Rights Reserved. Chapter 12 Describing Data.
Numerical Descriptive Measures
Statistics for Managers Using Microsoft Excel, 4e © 2004 Prentice-Hall, Inc. Chap 3-1 Chapter 3 Numerical Descriptive Measures Statistics for Managers.
Exploratory Data Analysis. Computing Science, University of Aberdeen2 Introduction Applying data mining (InfoVis as well) techniques requires gaining.
Let’s Review for… AP Statistics!!! Chapter 1 Review Frank Cerros Xinlei Du Claire Dubois Ryan Hoshi.
Chapter 12 Examining Relationships in Quantitative Research Copyright © 2013 by The McGraw-Hill Companies, Inc. All rights reserved. McGraw-Hill/Irwin.
Copyright © 2011 by The McGraw-Hill Companies, Inc. All rights reserved. McGraw-Hill/Irwin Describing Data.
STAT 280: Elementary Applied Statistics Describing Data Using Numerical Measures.
Chapter 2 Describing Data.
Lecture 3 Describing Data Using Numerical Measures.
Examining Relationships in Quantitative Research
DESCRIPTIVE STATISTICS © LOUIS COHEN, LAWRENCE MANION & KEITH MORRISON.
Categorical vs. Quantitative…
Chap 3-1 A Course In Business Statistics, 4th © 2006 Prentice-Hall, Inc. A Course In Business Statistics 4 th Edition Chapter 3 Describing Data Using Numerical.
Business Statistics, A First Course (4e) © 2006 Prentice-Hall, Inc. Chap 3-1 Chapter 3 Numerical Descriptive Measures Business Statistics, A First Course.
Statistics Chapter 1: Exploring Data. 1.1 Displaying Distributions with Graphs Individuals Objects that are described by a set of data Variables Any characteristic.
Chapter 8 Making Sense of Data in Six Sigma and Lean
Descriptive statistics Petter Mostad Goal: Reduce data amount, keep ”information” Two uses: Data exploration: What you do for yourself when.
To be given to you next time: Short Project, What do students drive? AP Problems.
MMSI – SATURDAY SESSION with Mr. Flynn. Describing patterns and departures from patterns (20%–30% of exam) Exploratory analysis of data makes use of graphical.
Unit 4: Describing Data After 8 long weeks, we have finally finished Unit 3: Linear & Exponential Functions. Now on to Unit 4 which will last 3 weeks.
Sullivan – Fundamentals of Statistics – 2 nd Edition – Chapter 3 Section 5 – Slide 1 of 21 Chapter 3 Section 5 The Five-Number Summary And Boxplots.
Eco 6380 Predictive Analytics For Economists Spring 2016 Professor Tom Fomby Department of Economics SMU.
Eco 6380 Predictive Analytics For Economists Spring 2016 Professor Tom Fomby Department of Economics SMU.
Statistical Methods © 2004 Prentice-Hall, Inc. Week 3-1 Week 3 Numerical Descriptive Measures Statistical Methods.
Lab 4 Multiple Linear Regression. Meaning  An extension of simple linear regression  It models the mean of a response variable as a linear function.
(Unit 6) Formulas and Definitions:. Association. A connection between data values.
Yandell – Econ 216 Chap 3-1 Chapter 3 Numerical Descriptive Measures.
Descriptive Statistics ( )
Exploratory Data Analysis
Data Transformation: Normalization
MATH-138 Elementary Statistics
Analyzing One-Variable Data
Correlation, Bivariate Regression, and Multiple Regression
Exploring, Displaying, and Examining Data
Data Mining: Concepts and Techniques
Exploratory Data Analysis (EDA)
Laugh, and the world laughs with you. Weep and you weep alone
Unit 4 Statistics Review
Eco 6380 Predictive Analytics For Economists Spring 2014
DAY 3 Sections 1.2 and 1.3.
Topic 5: Exploring Quantitative data
Lecture 2 Chapter 3. Displaying and Summarizing Quantitative Data
Quartile Measures DCOVA
Displaying and Summarizing Quantitative Data
Organizing, Summarizing, &Describing Data UNIT SELF-TEST QUESTIONS
Boxplots.
Unit 4: Describing Data After 10 long weeks, we have finally finished Unit 3: Linear & Exponential Functions. Now on to Unit 4 which will last 5 weeks.
Lesson – Teacher Notes Standard:
Probability and Statistics
Higher National Certificate in Engineering
Presentation transcript:

Eco 6380 Predictive Analytics For Economists Spring 2016 Professor Tom Fomby Department of Economics SMU

Presentation 3 Treatment of Missing Observations and Exploratory Data Analysis Chapter 2 in SPB

Treatment of Missing Observations Delete any record (case) that has a missing observation for any variable with respect to the case. (Can sometimes be a very “severe” rule. You might not have any observations left!) Impute values for the missing observations. Alternative methods include (i) using a chosen fixed value selected by the user, (ii) using the mean or median of the observations that are available for the variable with the missing observation, (iii) model the variable that has the missing observation as a target variable and, using all of the other variables as inputs, utilize a multiple linear regression or CART model to fill in the missing observations. See the SPSS modeler “Data Audit” Node for handling Missing Observations.

Exploratory Data Analysis (EDA) Coined by Prof. John Tukey ( ) a chemist-turned-topologist-turned statistician Seminal Work: Exploratory Data Analysis (1977, Addison-Wesley)

Some Exploratory Data Analysis Tools I. Charts (a) Box-Plot: a vertical plot that summarizes the variation in a given variable. The “box” in the plot represents the data ranging from the first quartile to the third quartile. The notch in the box represents the mean while the horizontal line across the box represents the median. Other “non-extreme” observations are between the “whiskers” of the plot and the box. Any observation outside of the whiskers is considered an “outlier.” Such observations should be examine further for possible coding errors or consideration for deletion from any analysis as the observation might be from another “population” and not of the population representing the rest of the observations on the variable. (b) Histogram: A plot of the data of a variable by intervals such that the height of the histogram represents the number of observations in a given interval. (c) Matrix Plots: Pairs of Bivariate Scatter plots that summarize the associations between variables. (d) Summary Statistics: Mean, median, standard deviation, kurtosis, and skewness of individual variables. (e) QQ and PP plots: Allows you to determine to what degree the variation in a variable does or does not conform to a specified distribution, e.g., the normal distribution.

The Box Plot: Let us take a look at the typical Box-Plot that XLMiner TM generates: + Outlier Max ○○○○ Q3 Q2(Median) Q1 ○ Min Mean Notch + Outlier Max ○○○○ Q3 Q2(Median) Q1 ○ Min Mean Notch Copyright ® XLMINER

The Box extends from Q1 to Q3, includes Q2. The other non- extreme points are included in its “whiskers”. This means the box includes the middle one-half of the data. The mean value is shown with a plus sign. The distance “d” between the values that the notch specifies is --> d = Confidence interval around the mean. The significance of the Box-plot is that it is not strongly influenced by the outliers. XLMiner TM constructs the Box-plot by extending its “whiskers” only to the most extreme “non- outliers”. It uses two explicit rules to define the outliers and marks them with dots. Copyright ® XLMINER

The rules used are Cutoff1 = Q1-1.5*(Q3-Q1) Cutoff2 = Q3+1.5*(Q3-Q1) On the Box-plot, Min = The actual minimum value in the dataset which is just above the Cutoff1. Max = The actual maximum value in the dataset that is just less than Cutoff2. All the values that are less than the Min value, and the ones that are great than the Max value are treated as the Outliers. Copyright ® XLMINER

Two Good EDA Tools in SPSS Modeler “Graphboard” node in the Graphs Group “Data Audit” node in the Output Group See the stream “Car_sales_EDA.str” for a demo on the use of the Graphboard node. See the stream “telco_dataaudit.str” for a demo on the use of the Data Audit node and the treatment of missing observations and outliers.

Variable Importance Variable Importance is measured by the bivariate association between specified target variable and input variables of interest This measure of association will depend on the nature of the target variable (interval or categorical) as well as the nature of the input variable being considered. You have the 4 cases: (target variable, input variable) = (interval, interval); (interval, categorical); (categorical, interval), and (categorical, categorical). For the measures of association of these various cases see the IBM/SPSS Modeler manual Modeler 15 AlgorithmsGuide.pdf, Chapter 17 “Feature Selection Algorithm.”

XLMINER Correlation Matrix Plot to Uncover Variable Importance

Other Variable Importance Methods MLR Subset Selection Methods: Forward, Backward, and Stepwise Selection. MLR with Lasso Constraint: Minimize sum of squared errors subject to the constraint that the sum of the absolute values of the coefficients not exceed a given constant, c. Regression and Classification Trees and the variables that make up a cross-validated tree Could take an Intersection of the Importance Variable Sets determined by the various methods

Other Helpful Dimension Reduction Methods Reducing the Number of Categories in Categorical Variables Converting a Categorical Variable of Many Groups into a “Rank” Variable that represents the ordering of the groups. You go from k indicators variables, say, to one variable representing the k groups. However, it should be noted that a Rank Variable implies, in some cases (as in MLR), that going from group one to group two has the same effect on the target variable as going from group two to group three, etc. Applying Principal Components Analysis to continuous variables. We will discuss this technique separately later. Note: Binning a Continuous Variable into a Categorical variable does not reduce the dimension of the input space but it sometimes is a convenient way to “break down” the multi- collinearity that may exist between continuous variables that are not binned.

An Application Showing How Useful Pre-screening Input Variables Can Be In SPSS Modeler, the crucial node is the “Feature Selection” node in the Modeling Group. For a good example of the importance of Pre- screening inputs, see the SPSS Modeler stream “featureselection.str” in the Demos subdirectory of Modeler.

Classroom Exercise: Exercise 1