Data Science and Big Data Analytics Chap 3: Data Analytics Using R

Charles Tappert Seidenberg School of CSIS, Pace University

Chap 3 Data Analytics Using R
This chapter has three sections:
- An overview of R
- Using R to perform exploratory data analysis tasks using visualization
- A brief review of statistical inference: hypothesis testing and analysis of variance

3.1 Introduction to R
Generic R functions share the same name but behave differently depending on the type of arguments they receive (polymorphism).
Some important functions used in this chapter (most are generic):
- head() displays the first six rows of a data set
- summary() generates descriptive statistics
- plot() can generate a scatter plot of one variable against another
- lm() fits a linear regression model between two variables
- hist() generates a histogram
- help() provides details of a function

3.1 Introduction to R
Example: number of orders vs sales
lm(formula = sales$sales_total ~ sales$num_of_orders)
Fitted coefficients: intercept = (not shown), slope = 166.2
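A minimal sketch of this kind of fit, assuming a hypothetical `sales` data frame filled with simulated values. The chapter's actual data is not reproduced here; only the column names follow the lm() call above, and the simulated slope of 150 is arbitrary.

```r
# Hypothetical stand-in for the chapter's sales data (simulated values)
set.seed(1)
num_of_orders <- sample(1:20, 200, replace = TRUE)
sales <- data.frame(
  num_of_orders = num_of_orders,
  sales_total   = 150 * num_of_orders + rnorm(200, mean = 0, sd = 100)
)

head(sales)      # first six rows
summary(sales)   # descriptive statistics for each column

# Fit sales_total as a linear function of num_of_orders
fit <- lm(sales_total ~ num_of_orders, data = sales)
coefs <- coef(fit)   # named vector: "(Intercept)" and "num_of_orders"
print(coefs)
```

Passing `data = sales` with bare column names is equivalent to the `sales$...` form and keeps the coefficient names short.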

3.1 Introduction to R
- 3.1.1 R Graphical User Interfaces: getting R and RStudio
- 3.1.2 Data Import and Export: necessary for project work
- 3.1.3 Attributes and Data Types: vectors, matrices, data frames
- 3.1.4 Descriptive Statistics: summary(), mean(), median(), sd()

3.1.1 Getting R and RStudio Download R and install (32-bit and 64-bit)
Download RStudio and install

3.1.1 RStudio GUI

Data frame A data.frame object in R has similar dimensional properties to a matrix but it may contain categorical data, as well as numeric.

Data Frame Restrictions
When can a list be made into a data.frame?
- Components must be vectors (numeric, character, logical) or factors.
- All vectors and factors must have the same length.
- Matrices and even other data frames can be combined with vectors to form a data frame if the dimensions match up.
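The restrictions above can be checked with a short sketch. The list `components` and its column names are made up for illustration.

```r
# Equal-length vectors and a factor: a valid data frame
components <- list(
  id     = 1:3,
  name   = c("a", "b", "c"),
  passed = c(TRUE, FALSE, TRUE),
  group  = factor(c("x", "y", "x"))
)
df <- as.data.frame(components)
str(df)

# Lengths 3 and 2 cannot be recycled to a common length, so this fails
bad <- tryCatch(
  data.frame(id = 1:3, score = c(10, 20)),
  error = function(e) "error"
)
print(bad)
```

Note that R recycles shorter vectors when the longer length is an exact multiple of the shorter one, so only incompatible lengths raise an error.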

Creating Data Frames

Data Frame Attribute

Components as vector

Expanding Data Frame

Reading and Writing .CSV

Select Rows based variable values

Sort a Data Frame by Selected Column

3.2 Exploratory Data Analysis
- 3.2.1 Visualization before Analysis
- 3.2.2 Dirty Data
- 3.2.3 Visualizing a Single Variable
- 3.2.4 Examining Multiple Variables
- 3.2.5 Data Exploration versus Presentation

Exploratory Data Analysis
- summary() provides descriptive statistics.
- Descriptive statistics alone do not reveal other aspects of the data, such as linear relationships and distributions.
- Exploratory data analysis is a useful way to detect patterns and anomalies in the data.
- Visualization gives a holistic view of the data that may be difficult to grasp from the numbers and summaries alone.

Visualization before Analysis
Visualization assesses data cleanliness and suggests potential important relationships in the data prior to the model planning and building process

Example1
x=rnorm(50)
x
y=x+rnorm(50,mean=0,sd=0.5)
data=as.data.frame(cbind(x,y))
summary(data)
library(ggplot2)
ggplot(data,aes(x=x,y=y))+
  geom_point(size=2)+
  ggtitle("Scatter plot of X and Y")+
  theme(axis.text=element_text(size=12),
        axis.title=element_text(size=14),
        plot.title=element_text(size=28,face="bold"))

3.2 Exploratory Data Analysis
Scatterplots show possible relationships
x <- rnorm(50)                        # default is mean=0, sd=1
y <- x + rnorm(50, mean=0, sd=0.5)
plot(y,x)

Example2 - Anscombe’s quartet
 #   x1-x3   x4     y1     y2     y3     y4
 1    10      8    8.04   9.14   7.46   6.58
 2     8      8    6.95   8.14   6.77   5.76
 3    13      8    7.58   8.74  12.74   7.71
 4     9      8    8.81   8.77   7.11   8.84
 5    11      8    8.33   9.26   7.81   8.47
 6    14      8    9.96   8.10   8.84   7.04
 7     6      8    7.24   6.13   6.08   5.25
 8     4     19    4.26   3.10   5.39  12.50
 9    12      8   10.84   9.13   8.15   5.56
10     7      8    4.82   7.26   6.42   7.91
11     5      8    5.68   4.74   5.73   6.89

3.2.1 Visualization before Analysis Anscombe’s quartet – 4 datasets, same statistics

data()                 # list the available data sets
data("anscombe")
anscombe
levels=gl(4,nrow(anscombe))
levels
mydata=with(anscombe, data.frame(x=c(x1,x2,x3,x4),y=c(y1,y2,y3,y4),mygroup=levels))
mydata
library(ggplot2)
theme_set(theme_bw())
ggplot(mydata,aes(x,y))+
  geom_point(size=4)+
  geom_smooth(method="lm",fill=NA,fullrange=TRUE)+
  facet_wrap(~mygroup)
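The claim that the four data sets share the same statistics can be checked numerically. The sketch below uses only the built-in `anscombe` data; the helper variable names are illustrative.

```r
data("anscombe")

# Mean and variance of each y column
y_cols  <- c("y1", "y2", "y3", "y4")
y_means <- sapply(anscombe[, y_cols], mean)
y_vars  <- sapply(anscombe[, y_cols], var)

# Correlation and fitted regression slope for each x/y pair
cors <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
slopes <- sapply(1:4, function(i) {
  xi <- anscombe[[paste0("x", i)]]
  yi <- anscombe[[paste0("y", i)]]
  unname(coef(lm(yi ~ xi))[2])
})

print(round(rbind(mean = y_means, var = y_vars, cor = cors, slope = slopes), 2))
```

The rows of the printed table agree closely across the four columns, even though the scatterplots look very different.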

3.2.1 Visualization before Analysis Anscombe’s quartet – visualized

3.2.1 Visualization before Analysis Anscombe’s quartet – RStudio exercise
Enter and plot Anscombe’s dataset #3 and obtain the linear regression line
x <- 4:14
x
y <- c(5.39,5.73,6.08,6.42,6.77,7.11,7.46,7.81,8.15,12.74,8.84)
y
summary(x)
var(x)
summary(y)
var(y)
plot(y~x)
lm(y~x)

3.2.2 Dirty Data Age Distribution of bank account holders
What is wrong here?
This section addresses how dirty data can be detected in the data exploration phase with visualizations. In general, analysts look for anomalies, verify the data with domain knowledge, and decide on the most appropriate approach to clean the data.
Consider a scenario in which a bank is analyzing its account holders to gauge customer retention. The figure shows that the median age of the account holders is around 40. A few accounts with an account holder age of less than 10 are unusual but plausible: they could be custodial accounts or college savings accounts set up by the parents of young children. These accounts should be retained for future analyses.
However, the left side of the graph shows a huge spike of customers who are zero years old or have negative ages. This is likely to be evidence of missing data. One possible explanation is that a null age value was replaced by 0 or a negative value during data input. It might also be caused by transferring data among several systems that have different definitions for null values (NULL, NA, 0, -1 or -2). Therefore, data cleaning needs to be performed on the accounts with abnormal age values.
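One way to handle such placeholder values is to recode them as NA before analysis. The sketch below uses a small made-up `age` vector, and it assumes that every age of 0 or below is a placeholder, which would need to be confirmed with domain knowledge.

```r
# Hypothetical account-holder ages; 0, -1 and -2 stand in for missing values
age <- c(34, 0, 45, -1, 8, 52, 0, 39, -2, 41)

# Flag the suspicious values and count them
invalid <- age <= 0
print(table(invalid))

# Recode the placeholders as NA so they no longer distort summaries
age_clean <- ifelse(invalid, NA, age)
med_raw   <- median(age)
med_clean <- median(age_clean, na.rm = TRUE)
print(c(raw = med_raw, clean = med_clean))
```

The age of 8 is kept, in line with the point above that young account holders are unusual but plausible.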

Missing Value
x=c(1,2,3,NA,4)
is.na(x)
[1] FALSE FALSE FALSE TRUE FALSE
mean(x)
[1] NA
mean(x, na.rm=TRUE)
[1] 2.5

Missing Value
The na.exclude() function returns the object with incomplete cases removed.
DF=data.frame(x=c(1,2,3),y=c(10,20,NA))
DF
  x  y
1 1 10
2 2 20
3 3 NA
DF1=na.exclude(DF)
DF1

3.2.2 Dirty Data Age of Mortgage
What is wrong here?

ggplot: The Grammar of Graphics
The basic idea: independently specify plot building blocks and combine them to create just about any kind of graphical display you want. Building blocks of a graph include:
- data
- aesthetic mapping
- geometric object
- statistical transformations
- scales
- coordinate system
- position adjustments
- faceting

Aesthetic Mapping
In ggplot land, aesthetic means “something you can see”. Examples include:
- position (i.e., on the x and y axes)
- color (“outside” color)
- fill (“inside” color)
- shape (of points)
- linetype
- size

Geometric Objects (geom)
Geometric objects are the actual marks we put on a plot. Examples include:
- points (geom_point, for scatter plots, dot plots, etc.)
- lines (geom_line, for time series, trend lines, etc.)
- boxplot (geom_boxplot, for, well, boxplots!)

Faceting
Faceting is ggplot2 parlance for small multiples. The idea is to create separate graphs for subsets of data. ggplot2 offers two functions for creating small multiples:
- facet_wrap(): define subsets as the levels of a single grouping variable
- facet_grid(): define subsets as the crossing of two grouping variables
Faceting facilitates comparison among plots, not just of geoms within a plot.

3.2.3 Visualizing a Single Variable Example Visualization Functions

Dotchart
Dotchart and barplot portray continuous values with labels from a discrete variable.
dotchart(x, labels = ...)
- x is a numeric vector
- labels is a vector of categorical labels for x
A dot chart or dot plot is a statistical chart consisting of data points plotted on a fairly simple scale, typically using filled-in circles. There are two common, yet very different, versions of the dot chart.

Dotchart The dot plot as a representation of a distribution consists of group of data points plotted on a simple scale. Dot plots are used for continuous, quantitative, univariate data. Data points may be labelled if there are few of them.

Dot chart USArrests dataset, which gives arrest rates per 100,000 population for serious crimes in each of the US states in 1973

Dot chart dotchart(USArrests$Murder)
Are the arrest rates nearly the same or very different? Are they clustered together or spread out? What would you have expected?

Dotchart dotchart(USArrests$Murder, labels = row.names(USArrests), cex = .5)

dotchart(data2$Murder, labels = row.names(data2), cex = .5, main = "Murder arrests by state, 1973", xlab = "Murder arrests per 100,000 population")
A more interesting view of this data might be to see the murder arrest rates arranged by size. To do that, the data must first be sorted by Murder. This means that the dataset’s rows will be rearranged in order of their murder arrest rates.
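The sorting step can be sketched as follows, assuming that `data2` is simply a copy of the built-in USArrests data (the slides do not show how `data2` was created).

```r
# Assumption: data2 is a copy of the built-in USArrests data set
data2 <- USArrests

# Reorder the rows by murder arrest rate (ascending)
sorted <- data2[order(data2$Murder), ]

# Same dot chart as before, now drawn from the sorted rows
dotchart(sorted$Murder, labels = row.names(sorted), cex = .5,
         main = "Murder arrests by state, 1973",
         xlab = "Murder arrests per 100,000 population")
```

Because order() returns row positions, indexing with `[order(...), ]` rearranges whole rows, so the state names stay attached to their values.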

Barplot
A bar chart represents data as rectangular bars, with the length of each bar proportional to the value of the variable. R uses the function barplot() to create bar charts. R can draw both vertical and horizontal bars, and each bar can be given a different color.

Barchart
The basic syntax to create a bar chart in R is
barplot(H, xlab, ylab, main, names.arg, col)
- H is a vector or matrix containing the numeric values used in the bar chart.
- xlab is the label for the x axis.
- ylab is the label for the y axis.
- main is the title of the bar chart.
- names.arg is a vector of names appearing under each bar.
- col is used to give colors to the bars in the graph.

barplot(H, names.arg = M, xlab = "Month", ylab = "Revenue",
        col = "blue", main = "Revenue Chart", border = "red")

Exercise
Using barplot(), visualize the weekly temperatures with the necessary details.

# barchart with added parameters
max.temp <- c(22, 27, 26, 24, 23, 26, 28)
barplot(max.temp,
        main = "Maximum Temperatures in a Week",
        xlab = "Degree Celsius",
        ylab = "Day",
        names.arg = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"),
        col = "darkred",
        horiz = TRUE)

Histogram
A histogram represents the frequencies of values of a variable bucketed into ranges. A histogram is similar to a bar chart, but it groups the values into continuous ranges. The height of each bar represents the number of values present in that range. R creates histograms using the hist() function, which takes a vector as input.

Histogram
hist(v, main, xlab, xlim, ylim, breaks, col, border)
- v is a vector containing the numeric values used in the histogram.
- main indicates the title of the chart.
- col is used to set the color of the bars.
- border is used to set the border color of each bar.
- xlab is used to give a description of the x-axis.
- xlim is used to specify the range of values on the x-axis.
- ylim is used to specify the range of values on the y-axis.
- breaks sets the break points between bars, or suggests the number of bars.

v <- c(9,13,21,8,36,22,12,41,31,33,19)
# xlim must cover the largest value (41)
hist(v, xlab = "Weight", col = "green", border = "red", xlim = c(0,50), ylim = c(0,5), breaks = 5)
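When `breaks` is a single number, hist() treats it only as a suggestion. A sketch of the alternative, passing explicit break points, is shown here; `plot = FALSE` returns the computed bins without drawing, and the choice of bins of width 10 is an illustration, not part of the original example.

```r
v <- c(9, 13, 21, 8, 36, 22, 12, 41, 31, 33, 19)

# Explicit break points: bins (0,10], (10,20], ..., (40,50]
h <- hist(v, breaks = seq(0, 50, by = 10), plot = FALSE)

print(h$breaks)   # the bin boundaries actually used
print(h$counts)   # number of values falling into each bin
```

Inspecting `h$counts` is a quick way to confirm that every value landed in some bin before tuning the appearance of the plot.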

Comparison

Density Plot
v <- c(9,13,21,8,36,22,12,41,31,33,19)
d=density(v)
plot(d, main="density graph")

Density Plot
The generic function density() computes kernel density estimates. Its default method does so with the given kernel and bandwidth for univariate observations. A kernel density estimate approximates the probability density function of a random variable.

Density plot
Just remember that the density at a value is proportional to the chance that any observation in your data is approximately equal to that value. In fact, for a histogram, the density is calculated from the counts, so the only difference between a histogram with frequencies and one with densities is the scale of the y-axis. An advantage density plots have over histograms is that they are better at showing the shape of the distribution, because they are not affected by the number of bins used.
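The relationship between counts and densities can be verified directly. This sketch reuses the vector `v` from the histogram example; the bin width of 10 is an arbitrary choice.

```r
v <- c(9, 13, 21, 8, 36, 22, 12, 41, 31, 33, 19)
h <- hist(v, breaks = seq(0, 50, by = 10), plot = FALSE)

# Density of a bin = count / (number of observations * bin width)
bin_width   <- diff(h$breaks)
dens_manual <- h$counts / (length(v) * bin_width)
same_scale  <- isTRUE(all.equal(dens_manual, h$density))

# The bars of a density histogram cover a total area of 1
total_area <- sum(h$density * bin_width)

print(same_scale)
print(total_area)

# Kernel density estimate of the same data (Gaussian kernel by default)
d <- density(v)
```

Calling `hist(v, freq = FALSE)` draws the density version of the histogram, which can then be overlaid with `lines(d)`.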

Multiple Variables A Scatterplot is a simple and widely used visualization for finding the relationship among multiple variables A scatterplot can represent data with up to five variables using x-axis, y-axis, size, color and shape Usually only two to four variables are portrayed in a scatterplot to minimize confusion

Scatterplot
plot(x, y, main, xlab, ylab, xlim, ylim, axes)
- x is the data set whose values are the horizontal coordinates.
- y is the data set whose values are the vertical coordinates.
- main is the title of the graph.
- xlab is the label on the horizontal axis.
- ylab is the label on the vertical axis.
- xlim gives the limits of the values of x used for plotting.
- ylim gives the limits of the values of y used for plotting.
- axes indicates whether both axes should be drawn on the plot.

Dataset -mtcars

plot(mtcars$wt, mtcars$mpg, main="Scatterplot Example", xlab="Car Weight ", ylab="Miles Per Gallon ", pch=19)

Scatterplot

Dotchart and Barchart
Dotchart and barchart can be used to visualize multiple variables. Both of them use color as an additional dimension for visualizing the data.

Example
cars=mtcars[order(mtcars$mpg), ]   # sort rows by mpg (note the comma)
cars$cyl=factor(cars$cyl)
cars$color[cars$cyl==4]="red"
cars$color[cars$cyl==6]="blue"
cars$color[cars$cyl==8]="darkgreen"
dotchart(cars$mpg, labels=row.names(cars), cex=0.7, groups=cars$cyl,
         main="Miles Per Gallon (MPG) of car models",
         xlab="Miles per Gallon", color=cars$color)
- groups: an optional factor indicating how the elements of x are grouped. If x is a matrix, groups will default to the columns of x.
- color: the color(s) to be used for points and labels.
- gcolor: the single color to be used for group labels and values.

3.2.4 Examining Multiple Variables Dotchart: MPG of car models grouped by cylinder

Barplot to visualize multiple variables
counts=table(mtcars$gear,mtcars$cyl)
barplot(counts, main="Distribution of Car Cylinder Counts and Gears",
        xlab="Number of cylinders", ylab="Counts",
        col=c("#0000FFFF","#0080FFFF","#00FFFFFF"), beside=TRUE)
Distribution of car cylinder counts and number of gears

3.2.4 Examining Multiple Variables Barplot: visualize multiple variables

Box-and-Whisker plot
A box-and-whisker plot (also called a box plot) is a graphical method of displaying variation in a set of data. It depicts groups of numerical data through their quartiles by displaying the five-number summary: the minimum, first quartile, median, third quartile, and maximum. In a box plot, we draw a box from the first quartile to the third quartile, and a line goes through the box at the median. The whiskers extend at most (Q3 - Q1) × 1.5 beyond the box.
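The numbers behind a box plot can be computed without drawing anything. The sketch below uses `mtcars$mpg` as example data and applies the 1.5 × (Q3 - Q1) rule, which is the default used by boxplot(); note that fivenum() uses hinges, which can differ slightly from quantile() quartiles.

```r
mpg <- mtcars$mpg

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(mpg)
names(five) <- c("min", "Q1", "median", "Q3", "max")
print(five)

# Whisker limits: 1.5 times the box length beyond each end of the box
box_length  <- five[["Q3"]] - five[["Q1"]]
lower_fence <- five[["Q1"]] - 1.5 * box_length
upper_fence <- five[["Q3"]] + 1.5 * box_length

# Points outside the fences would be drawn individually as outliers
outliers <- mpg[mpg < lower_fence | mpg > upper_fence]
print(outliers)

# boxplot.stats() returns the same quantities that boxplot() draws
stats <- boxplot.stats(mpg)
```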

Box-and-Whisker plot Box and whisker plots are very effective and easy to read, as they can summarize data from multiple sources and display the results in a single graph Box and whisker plots allow for comparison of data from different categories for easier, more effective decision-making.

When to use Box-and-Whisker plot
Use box-and-whisker plots when you have multiple data sets from independent sources that are related to each other in some way. Examples:
- Test scores between schools or classrooms
- Data from duplicate machines manufacturing the same products

Example boxplot(mtcars$mpg,mtcars$qsec)

boxplot(mpg~cyl, data=mtcars, main="Car Mileage Data", xlab="Number of Cylinders", ylab="Miles Per Gallon")
The formula is y~group, where a separate boxplot for the numeric variable y is generated for each value of group.

3.2.4 Examining Multiple Variables Box-and-whisker plot: income versus region
Visualizes mean household incomes as a function of region in the United States

Hexbinplot for Large Datasets
Drawbacks of Scatterplot
With high-volume data, the structure of the data may become difficult to see in a scatterplot. This is a Big Data type of problem: millions or billions of data points require different approaches for exploration, visualization and analysis.

Hexbinplot for Large Datasets
A hexbinplot combines the ideas of scatterplot and histogram. Similar to a scatterplot, a hexbinplot visualizes data on the x-axis and y-axis. Data is placed into hexbins, and the third dimension uses shading to represent the concentration of data in each hexbin.

3.2.4 Examining Multiple Variables Scatterplot (a) & Hexbinplot – income vs education
The hexbinplot combines the ideas of scatterplot and histogram For high volume data hexbinplot may be better than scatterplot

Scatterplot Matrix
A scatterplot matrix shows many scatterplots in a compact, side-by-side fashion; in effect, it is a table of scatterplots. The scatterplot matrix can visually represent multiple attributes of a dataset to explore their relationships, magnify differences and disclose hidden patterns.

3.2.4 Examining Multiple Variables Matrix of Scatterplots

formula An expression of the form y ~ model is interpreted as a specification that the response y is modelled by a linear predictor specified symbolically by model Such a model consists of a series of terms separated by + operators. The terms themselves consist of variable and factor names

Scatterplot matrix
The pairs() function returns a plot matrix consisting of scatterplots for each variable combination of a data frame.
pairs(~mpg+disp+drat+wt, data=mtcars, main="Simple Scatterplot Matrix")

Analyzing a Variable over Time
Visualizing a variable over time is the same as visualizing any pair of variables, but in this case the goal is to identify time-specific patterns

3.2.4 Examining Multiple Variables Variable over time – airline passenger counts

3.2.5 Exploration vs Presentation
- Data visualization for data exploration is different from presenting results to stakeholders.
- Data scientists prefer graphs that are technical in nature.
- Nontechnical stakeholders prefer simple, clear graphics that focus on the message rather than the data.

3.2.5 Exploration vs Presentation Density plots better for data scientists

3.2.5 Exploration vs Presentation Histograms better to show stakeholders

Statistical Methods for Evaluation
Statistics
A branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data.
A statistic is a characteristic of a sample. Generally, a statistic is used to estimate the value of a population parameter.
Example: suppose we selected a random sample of 100 students from a school with 1000 students. The average height of the sampled students would be an example of a statistic.
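The school example can be simulated as a quick sketch. The heights below are generated from an assumed normal distribution (mean 165 cm, standard deviation 8 cm); these numbers are illustrative, not data from the chapter.

```r
set.seed(42)

# Hypothetical population: heights (cm) of all 1000 students
population <- rnorm(1000, mean = 165, sd = 8)
mu <- mean(population)            # population parameter

# Random sample of 100 students, drawn without replacement
smp   <- sample(population, 100)
x_bar <- mean(smp)                # sample statistic that estimates mu

print(c(parameter = mu, statistic = x_bar))
```

Rerunning with a different seed gives a different sample mean, while the population mean of a fixed population stays the same.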

Importance of Statistics Statistical knowledge helps you use the proper methods to collect the data, employ the correct analyses, and effectively present the results Types of statistics Descriptive Statistics Inferential Statistics

Descriptive Statistics
Statistical methods used to summarize or describe a collection of data are called descriptive statistics. They are useful in research when communicating the results of experiments.

Inferential Statistics
Inferential statistics is a statistical method that deduces the characteristics of a bigger population from a small but representative sample. In other words, it allows the researcher to draw conclusions about a wider group, using a smaller portion of that group as a guideline.
Examples: regression models, normal distributions and R-squared analysis.
One of the most common places to find this method is in forecasting models.

Inferential Statistics
These statistical models study a small portion of data to predict the future behavior of the variables, making inferences based on historical data.

3.3 Statistical Methods for Evaluation
Statistics helps answer data analytics questions
- Model Building
  - What are the best input variables for the model?
  - Can the model predict the outcome given the input?
- Model Evaluation
  - Is the model accurate?
  - Does the model perform better than an obvious guess?
  - Does the model perform better than other models?
- Model Deployment
  - Is the prediction sound?
  - Does the model have the desired effect (e.g., reducing cost)?
Visualization is useful for data exploration and presentation, but statistics is crucial because it is used throughout the entire data analytics lifecycle.

Statistical tools
Some useful statistical tools that answer these questions:
- 3.3.1 Hypothesis Testing
- 3.3.2 Difference of Means
- 3.3.3 Wilcoxon Rank-Sum Test
- 3.3.4 Type I and Type II Errors
- 3.3.5 Power and Sample Size
- 3.3.6 ANOVA (Analysis of Variance)

Hypothesis A supposition or proposed explanation made on the basis of limited evidence as a starting point for further investigation Supposition – a belief held without proof

Hypothesis testing When comparing populations, such as testing or evaluating the difference of the means from two samples of data, a common technique to assess the difference or the significance of the difference is hypothesis testing

3.3.1 Hypothesis Testing
- The basic concept is to form an assertion and test it with data.
- The common (default) assumption is that there is no difference between samples. Statisticians refer to this as the null hypothesis (H0).
- The alternative hypothesis (HA) is that there is a difference between samples.

Hypothesis test There are two types of hypotheses:
The null hypothesis, H0, is the current belief. The alternative hypothesis, Ha, is your belief; it is what you want to show.

Example
If the task is to identify the effect of drug A compared to drug B on patients, the null and alternative hypotheses would be:
- H0: Drug A and Drug B have the same effect on patients
- Ha: Drug A has a greater effect than Drug B on patients

Purpose The purpose of hypothesis testing is to determine whether there is enough statistical evidence in favor of certain belief or hypothesis about a parameter The basic concept of hypothesis testing is to form an assertion and test it with data

3.3.1 Hypothesis Testing Example Null and Alternative Hypotheses

Example
It is believed that a candy machine makes chocolate bars that weigh 5 g on average. A worker claims that after maintenance the machine no longer makes 5 g bars. Write H0 and HA.
- H0: µ = 5 g
- HA: µ ≠ 5 g
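A two-sided test of these hypotheses could look like the following sketch. The ten bar weights are invented for illustration, and the 0.05 significance level is an assumed convention rather than something stated in the example.

```r
# Hypothetical weights (grams) of 10 bars sampled after maintenance
weights <- c(5.1, 4.8, 5.3, 5.2, 4.9, 5.4, 5.2, 5.0, 5.3, 5.1)

# One-sample t-test of H0: mu = 5 against HA: mu != 5
result <- t.test(weights, mu = 5, alternative = "two.sided")
print(result$statistic)   # t-value
print(result$p.value)     # p-value

# Reject H0 only if the p-value is below the chosen significance level
alpha     <- 0.05
reject_h0 <- result$p.value < alpha
print(reject_h0)
```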

Hypothesis Testing Steps
- Null and alternative hypotheses
- Test statistic
- P-value and interpretation
- Significance level (optional)

Null and Alternative Hypotheses
- Convert the research question to null and alternative hypotheses.
- The null hypothesis (H0) is a claim of “no difference in the population”.
- The alternative hypothesis (Ha) claims “H0 is false”.
- Collect data and seek evidence against H0 as a way of bolstering Ha (deduction).
The first step in the procedure is to state the hypotheses in null and alternative forms. The null hypothesis (read “H naught”) is a statement of no difference. The alternative hypothesis (“H sub a”) is a statement of difference. The body-weight example that follows illustrates how to set up the hypotheses.

Illustrative Example: “Body Weight”
The problem: In the 1970s, 20–29 year old men in the U.S. had a mean body weight μ of 170 pounds. The standard deviation σ was 40 pounds. We test whether mean body weight in the population now differs.
- Null hypothesis H0: μ = 170 (“no difference”)
- The alternative hypothesis can be either Ha: μ > 170 (one-sided test) or Ha: μ ≠ 170 (two-sided test)
In the late 1970s, the weight of U.S. men between 20 and 29 years of age had a log-normal distribution with a mean of 170 pounds and a standard deviation of 40 pounds. Overweight and obesity seem to be more prevalent today, constituting a major public health problem. To illustrate the hypothesis testing procedure, we ask whether body weight in this group has increased since then. Under the null hypothesis there is no difference in mean body weight between then and now, in which case μ would still equal 170 pounds. Under the alternative hypothesis, the mean weight has increased, so Ha: μ > 170. This statement of the alternative hypothesis is one-sided: it looks only for values larger than stated under the null hypothesis.
There is another way to state the alternative hypothesis. We could state it in a “two-sided” manner, looking for values that are either higher or lower than expected. For the current example, the two-sided alternative is Ha: μ ≠ 170. Although this may seem unnecessary here, two-sided alternatives offer several advantages and are much more common in practice.

3.3.2 Difference of Means Two populations – same or different?
The basic testing approach is to compare the observed sample means x1 and x2 corresponding to each population. If the values of x1 and x2 are approximately equal to each other, the distributions of x1 and x2 overlap substantially and the null hypothesis is supported. A large observed difference between the sample means indicates that the null hypothesis should be rejected. Formally, the difference in means can be tested with Student’s t-test or Welch’s t-test.

3.3.2 Difference of Means Two Parametric Methods
- Student’s t-test: assumes that the two populations have normal distributions with equal variances
- Welch’s t-test: designed for unequal variances, but the assumption of normality is maintained
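Both tests are available through t.test(); the `var.equal` argument selects between them. The two samples below are simulated with deliberately different variances, and all of the numbers are illustrative assumptions.

```r
set.seed(7)

# Simulated samples from two normal populations with unequal variances
x <- rnorm(30, mean = 100, sd = 5)
y <- rnorm(40, mean = 105, sd = 15)

# Student's t-test: pooled variance, assumes equal population variances
student <- t.test(x, y, var.equal = TRUE)

# Welch's t-test: the default in R (var.equal = FALSE)
welch <- t.test(x, y)

print(student$parameter)   # degrees of freedom: 30 + 40 - 2
print(welch$parameter)     # adjusted degrees of freedom, usually not an integer
print(c(student = student$p.value, welch = welch$p.value))
```

When the variances really differ, as they do here by construction, the Welch result is the safer one to report.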

Normal Distribution The normal distribution is a probability function that describes how the values of a variable are distributed. It is a symmetric distribution where most of the observations cluster around the central peak and the probabilities for values further away from the mean taper off equally in both directions.

Standard Deviation The standard deviation is a measure of variability. It defines the width of the normal distribution. The standard deviation determines how far away from the mean the values tend to fall. It represents the typical distance between the observations and the average

Standard Deviation On a graph, changing the standard deviation either tightens or spreads out the width of the distribution along the X-axis. Larger standard deviations produce distributions that are more spread out

T distribution The t-distribution is symmetric and bell-shaped, like the normal distribution, but has heavier tails, meaning that it is more prone to producing values that fall far from its mean

Student’s t test A method of testing hypotheses about the mean of a small sample drawn from a normally distributed population when the population standard deviation is unknown

t-value

Student’s t-test

Welch’s t test It is designed for unequal variance but the assumption of normality is maintained

Welch’s t-test

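P-value
The p-value is the probability, computed under the assumption that the null hypothesis is true, of obtaining a result at least as extreme as the one observed. It is compared with the chosen significance level to decide whether to reject the null hypothesis.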

Example

t-value

3.3.3 Wilcoxon Rank-Sum Test A Nonparametric Method
Makes no assumptions about the underlying probability distributions

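Wilcoxon Rank Sum Test Computation Table
Tied values receive the average of their ranks (the two rates of 82 hold ranks 3 and 4, so each gets 3.5).
           Factory 1         Factory 2
           Rate    Rank      Rate    Rank
           71      1         85      5
           82      3.5       82      3.5
           77      2         94      8
           92      7         97      9
           88      6         ...     ...
Rank Sum           19.5              25.5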
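The ranks in the table can be reproduced with rank(), which averages tied ranks by default. This sketch assumes that the rates visible in the table are the complete data, which is consistent with the stated rank sums.

```r
# Rates from the computation table
factory1 <- c(71, 82, 77, 92, 88)
factory2 <- c(85, 82, 94, 97)

# Rank all observations together; ties receive the average rank
rates <- c(factory1, factory2)
group <- rep(c("Factory 1", "Factory 2"), times = c(5, 4))
ranks <- rank(rates)

# Sum of ranks within each factory
rank_sums <- tapply(ranks, group, sum)
print(rank_sums)

# The same test via wilcox.test(); exact = FALSE because of the tie
wt <- wilcox.test(factory1, factory2, exact = FALSE)
print(wt$statistic)   # W = rank sum of factory1 minus n1 * (n1 + 1) / 2
print(wt$p.value)
```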

3.3.4 Type I and Type II Errors
A hypothesis test may result in two types of errors:
- Type I error: rejection of the null hypothesis when the null hypothesis is TRUE
- Type II error: acceptance of the null hypothesis when the null hypothesis is FALSE

3.3.4 Type I and Type II Errors

3.3.5 Power and Sample Size
- The power of a test is the probability of correctly rejecting the null hypothesis.
- The power of a test increases as the sample size increases.
- Effect size d = difference between the means.
- It is important to consider an appropriate effect size for the problem at hand.
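The link between sample size and power can be explored with power.t.test() from the stats package. The difference in means of 0.5, standard deviation of 1, and significance level of 0.05 are assumed values for illustration.

```r
# Power of a two-sample t-test for several per-group sample sizes
sizes  <- c(10, 20, 50, 100)
powers <- sapply(sizes, function(n) {
  power.t.test(n = n, delta = 0.5, sd = 1, sig.level = 0.05)$power
})
print(round(setNames(powers, sizes), 3))

# Per-group sample size needed to reach 80% power for the same effect
needed <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.8)$n
print(ceiling(needed))
```

Leaving exactly one of `n`, `delta`, `sd`, `sig.level`, or `power` unspecified tells power.t.test() which quantity to solve for.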

3.3.5 Power and Sample Size

3.3.6 ANOVA (Analysis of Variance)
- A generalization of the hypothesis testing of the difference of two population means
- Good for analyzing more than two populations
- ANOVA tests if any of the population means differ from the other population means
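A one-way ANOVA can be sketched with aov(). The example below compares mean mpg across the three cylinder groups of the built-in mtcars data; this choice of data is an illustration and not taken from the section.

```r
# One-way ANOVA: does mean mpg differ between 4-, 6- and 8-cylinder cars?
fit <- aov(mpg ~ factor(cyl), data = mtcars)
tab <- summary(fit)[[1]]   # ANOVA table as a data frame
print(tab)

# F statistic and p-value for the cylinder factor (first row of the table)
f_value <- tab[["F value"]][1]
p_value <- tab[["Pr(>F)"]][1]
print(c(F = f_value, p = p_value))
```

A small p-value indicates that at least one group mean differs, but not which one; a follow-up such as TukeyHSD(fit) compares the groups pairwise.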
