Data Science and Big Data Analytics Chap 3: Data Analytics Using R

Slides:



Advertisements
Similar presentations
1 1 Chapter 5: Multiple Regression 5.1 Fitting a Multiple Regression Model 5.2 Fitting a Multiple Regression Model with Interactions 5.3 Generating and.
Advertisements

Inference for Regression
Quantitative Data Analysis: Hypothesis Testing
BHS Methods in Behavioral Sciences I April 25, 2003 Chapter 6 (Ray) The Logic of Hypothesis Testing.
Multiple regression analysis
Chapter 13 Introduction to Linear Regression and Correlation Analysis
Lecture 19: Tues., Nov. 11th R-squared (8.6.1) Review
Fall 2006 – Fundamentals of Business Statistics 1 Chapter 13 Introduction to Linear Regression and Correlation Analysis.
Statistics for Managers Using Microsoft Excel, 5e © 2008 Prentice-Hall, Inc.Chap 13-1 Statistics for Managers Using Microsoft® Excel 5th Edition Chapter.
Pengujian Parameter Koefisien Korelasi Pertemuan 04 Matakuliah: I0174 – Analisis Regresi Tahun: Ganjil 2007/2008.
Linear Regression and Correlation Analysis
Chapter 13 Introduction to Linear Regression and Correlation Analysis
Educational Research by John W. Creswell. Copyright © 2002 by Pearson Education. All rights reserved. Slide 1 Chapter 8 Analyzing and Interpreting Quantitative.
PSY 307 – Statistics for the Behavioral Sciences Chapter 19 – Chi-Square Test for Qualitative Data Chapter 21 – Deciding Which Test to Use.
Chapter 14 Introduction to Linear Regression and Correlation Analysis
Summary of Quantitative Analysis Neuman and Robson Ch. 11
Chapter 7 Forecasting with Simple Regression
Introduction to Regression Analysis, Chapter 13,
Statistics for Managers Using Microsoft Excel, 4e © 2004 Prentice-Hall, Inc. Chap 13-1 Chapter 13 Introduction to Multiple Regression Statistics for Managers.
Regression and Correlation Methods Judy Zhong Ph.D.
Statistics for the Social Sciences Psychology 340 Fall 2013 Thursday, November 21 Review for Exam #4.
Inference for regression - Simple linear regression
STA291 Statistical Methods Lecture 27. Inference for Regression.
Introductory Statistical Concepts. Disclaimer – I am not an expert SAS programmer. – Nothing that I say is confirmed or denied by Texas A&M University.
Hands-on Introduction to R. Outline R : A powerful Platform for Statistical Analysis Why bother learning R ? Data, data, data, I cannot make bricks without.
Statistics & Biology Shelly’s Super Happy Fun Times February 7, 2012 Will Herrick.
Quantitative Skills 1: Graphing
OPIM 303-Lecture #8 Jose M. Cruz Assistant Professor.
© 2003 Prentice-Hall, Inc.Chap 13-1 Basic Business Statistics (9 th Edition) Chapter 13 Simple Linear Regression.
+ Chapter 12: Inference for Regression Inference for Linear Regression.
Chap 12-1 A Course In Business Statistics, 4th © 2006 Prentice-Hall, Inc. A Course In Business Statistics 4 th Edition Chapter 12 Introduction to Linear.
Educational Research: Competencies for Analysis and Application, 9 th edition. Gay, Mills, & Airasian © 2009 Pearson Education, Inc. All rights reserved.
EQT 373 Chapter 3 Simple Linear Regression. EQT 373 Learning Objectives In this chapter, you learn: How to use regression analysis to predict the value.
Examining Relationships in Quantitative Research
Inference and Inferential Statistics Methods of Educational Research EDU 660.
Lecture 8 Simple Linear Regression (cont.). Section Objectives: Statistical model for linear regression Data for simple linear regression Estimation.
Inference for Regression Simple Linear Regression IPS Chapter 10.1 © 2009 W.H. Freeman and Company.
+ Chapter 12: More About Regression Section 12.1 Inference for Linear Regression.
Regression Chapter 16. Regression >Builds on Correlation >The difference is a question of prediction versus relation Regression predicts, correlation.
11 Chapter 12 Quantitative Data Analysis: Hypothesis Testing © 2009 John Wiley & Sons Ltd.
4 Hypothesis & Testing. CHAPTER OUTLINE 4-1 STATISTICAL INFERENCE 4-2 POINT ESTIMATION 4-3 HYPOTHESIS TESTING Statistical Hypotheses Testing.
Inference for regression - More details about simple linear regression IPS chapter 10.2 © 2006 W.H. Freeman and Company.
Educational Research Chapter 13 Inferential Statistics Gay, Mills, and Airasian 10 th Edition.
Lesson 15 - R Chapter 15 Review. Objectives Summarize the chapter Define the vocabulary used Complete all objectives Successfully answer any of the review.
1 Chapter 8 Introduction to Hypothesis Testing. 2 Name of the game… Hypothesis testing Statistical method that uses sample data to evaluate a hypothesis.
Data Analysis.
Chapter 6: Analyzing and Interpreting Quantitative Data
Introducing Communication Research 2e © 2014 SAGE Publications Chapter Seven Generalizing From Research Results: Inferential Statistics.
Analyzing Statistical Inferences July 30, Inferential Statistics? When? When you infer from a sample to a population Generalize sample results to.
Chapter 10 Inference for Regression
The Practice of Statistics, 5th Edition Starnes, Tabor, Yates, Moore Bedford Freeman Worth Publishers CHAPTER 12 More About Regression 12.1 Inference for.
Beginning Statistics Table of Contents HAWKES LEARNING SYSTEMS math courseware specialists Copyright © 2008 by Hawkes Learning Systems/Quant Systems, Inc.
Inference for regression - More details about simple linear regression IPS chapter 10.2 © 2006 W.H. Freeman and Company.
Lecturer: Ing. Martina Hanová, PhD.. Regression analysis Regression analysis is a tool for analyzing relationships between financial variables:  Identify.
The Practice of Statistics, 5th Edition Starnes, Tabor, Yates, Moore Bedford Freeman Worth Publishers CHAPTER 12 More About Regression 12.1 Inference for.
McGraw-Hill/Irwin © 2003 The McGraw-Hill Companies, Inc.,All Rights Reserved. Part Four ANALYSIS AND PRESENTATION OF DATA.
Appendix I A Refresher on some Statistical Terms and Tests.
Predicting Energy Consumption in Buildings using Multiple Linear Regression Introduction Linear regression is used to model energy consumption in buildings.
CHAPTER 12 More About Regression
Chapter 13 Simple Linear Regression
CHAPTER 12 More About Regression
Statistics for Psychology
Part Three. Data Analysis
CHAPTER 29: Multiple Regression*
Ass. Prof. Dr. Mogeeb Mosleh
CHAPTER 12 More About Regression
CHAPTER 12 More About Regression
Analyzing and Interpreting Quantitative Data
Introductory Statistics
Presentation transcript:

Data Science and Big Data Analytics Chap 3: Data Analytics Using R Charles Tappert Seidenberg School of CSIS, Pace University

Chap 3 Data Analytics Using R This chapter has three sections An overview of R Using R to perform exploratory data analysis tasks using visualization A brief review of statistical inference Hypothesis testing and analysis of variance

3.1 Introduction to R Generic R functions are functions that share the same name but behave differently depending on the type of arguments they receive (polymorphism) Some important functions used in chapter (most are generic) head() displays first six records of a file summary() generates descriptive statistics plot() can generate a scatter plot of one variable against another lm() applies a linear regression model between two variables hist() generates a histogram help() provides details of a function

3.1 Introduction to R Example: number of orders vs sales lm(formula = (sales$sales_total ~ sales$num_of_orders) intercept = -154.1 slope = 166.2

3.1 Introduction to R 3.1.1 R Graphical User Interfaces Getting R and RStudio 3.1.2 Data Import and Export Necessary for project work 3.1.3 Attributes and Data Types Vectors, matrices, data frames 3.1.4 Descriptive Statistics summary(), mean(), median(), sd()

3.1.1 Getting R and RStudio Download R and install (32-bit and 64-bit) https://www.r-project.org/ Download RStudio and install https://www.rstudio.com/products/RStudio/#Desk

3.1.1 RStudio GUI

3.2 Exploratory Data Analysis Scatterplots show possible relationships x <- rnorm(50) # default is mean=0, sd=1 y <- x + rnorm(50, mean=0, sd=0.5) plot(y,x)

3.2 Exploratory Data Analysis 3.2.1 Visualization before Analysis 3.2.2 Dirty Data 3.2.3 Visualizing a Single Variable 3.2.4 Examining Multiple Variables 3.2.5 Data Exploration versus Presentation

3.2.1 Visualization before Analysis Anscombe’s quartet – 4 datasets, same statistics should be x

3.2.1 Visualization before Analysis Anscombe’s quartet – visualized

3.2.1 Visualization before Analysis Anscombe’s quartet – Rstudio exercise ) Enter and plot Anscombe’s dataset #3 and obtain the linear regression line More regression coming in chapter 6 x <- 4:14 x y <- c(5.39,5.73,6.08,6.42,6.77,7.11,7.46,7.81,8.15,12.74,8.84) y summary(x) var(x) summary(y) var(y) plot(y~x) lm(y~x)

3.2.2 Dirty Data Age Distribution of bank account holders What is wrong here?

3.2.2 Dirty Data Age of Mortgage What is wrong here?

3.2.3 Visualizing a Single Variable Example Visualization Functions

3.2.3 Visualizing a Single Variable Dotchart – MPG of Car Models

3.2.3 Visualizing a Single Variable Barplot – Distribution of Car Cylinder Counts

3.2.3 Visualizing a Single Variable Histogram – Income

3.2.3 Visualizing a Single Variable Density – Income (log10 scale)

3.2.3 Visualizing a Single Variable Density – Income (log10 scale) In this case, the log density plot emphasizes the log nature of the distribution The rug() function at the bottom creates a one-dimensional density plot to emphasize the distribution

3.2.3 Visualizing a Single Variable Density plots – Diamond prices, log of same

3.2.4 Examining Multiple Variables Examining two variables with regression Red line = linear regression Blue line = LOESS curve fit x

3.2.4 Examining Multiple Variables Dotchart: MPG of car models grouped by cylinder

3.2.4 Examining Multiple Variables Barplot: visualize multiple variables

3.2.4 Examining Multiple Variables Box-and-whisker plot: income versus region Box contains central 50% of data Line inside box is median value Shows data quartiles

3.2.4 Examining Multiple Variables Scatterplot (a) & Hexbinplot – income vs education The hexbinplot combines the ideas of scatterplot and histogram For high volume data hexbinplot may be better than scatterplot

3.2.4 Examining Multiple Variables Matrix of Scatterplots

3.2.4 Examining Multiple Variables Variable over time – airline passenger counts

3.2.5 Exploration vs Presentation Data visualization for data exploration is different from presenting results to stakeholders Data scientists prefer graphs that are technical in nature Nontechnical stakeholders prefer simple, clear graphics that focus on the message rather than the data

3.2.5 Exploration vs Presentation Density plots better for data scientists

3.2.5 Exploration vs Presentation Histograms better to show stakeholders

3.3 Statistical Methods for Evaluation Statistics helps answer data analytics questions Model Building What are the best input variables for the model? Can the model predict the outcome given the input? Model Evaluation Is the model accurate? Does the model perform better than an obvious guess? Does the model perform better than other models? Model Deployment Is the prediction sound? Does model have the desired effect (e.g., reducing cost)?

3.3 Statistical Methods for Evaluation Subsections 3.3.1 Hypothesis Testing 3.3.2 Difference of Means 3.3.3 Wilcoxon Rank-Sum Test 3.3.4 Type I and Type II Errors 3.3.5 Power and Sample Size 3.3.6 ANOVA (Analysis of Variance)

3.3.1 Hypothesis Testing Basic concept is to form an assertion and test it with data Common assumption is that there is no difference between samples (default assumption) Statisticians refer to this as the null hypothesis (H0) The alternative hypothesis (HA) is that there is a difference between samples

3.3.1 Hypothesis Testing Example Null and Alternative Hypotheses

3.3.2 Difference of Means Two populations – same or different?

3.3.2 Difference of Means Two Parametric Methods Student’s t-test Assumes two normally distributed populations, and that they have equal variance Welch’s t-test Assumes two normally distributed populations, and they don’t necessarily have equal variance

3.3.3 Wilcoxon Rank-Sum Test A Nonparametric Method Makes no assumptions about the underlying probability distributions

3.3.4 Type I and Type II Errors An hypothesis test may result in two types of errors Type I error – rejection of the null hypothesis when the null hypothesis is TRUE Type II error – acceptance of the null hypothesis when the null hypothesis is FALSE

3.3.4 Type I and Type II Errors

3.3.5 Power and Sample Size The power of a test is the probability of correctly rejecting the null hypothesis The power of a test increases as the sample size increases Effect size d = difference between the means It is important to consider an appropriate effect size for the problem at hand

3.3.5 Power and Sample Size

3.3.6 ANOVA (Analysis of Variance) A generalization of the hypothesis testing of the difference of two population means Good for analyzing more than two populations ANOVA tests if any of the population means differ from the other population means