Introduction to Advanced Analytics in R Language
Timothy Wong, Data Scientist
What is R Language?
Offers modern & sophisticated statistical algorithms
Used by over 2 million data scientists, statisticians and analysts
Has a thriving open-source community
Big Data analytics via 'Microsoft R Server'
RStudio
Packages
Packages are distributed via CRAN.

# Install a new package
install.packages('dplyr')
# Load a package (either one below)
require(dplyr)
library(dplyr)
R Basics
Variable creation
Subsetting your data
Missing values
Vectorised operations
Writing your own functions
Data frames
(See the sketch below.)
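A minimal sketch of these basics, using made-up values:

# Variable creation: a numeric vector containing a missing value
x <- c(4, 8, 15, NA, 23, 42)
# Subsetting: by position and by logical condition
x[2:3]
x[!is.na(x) & x > 10]
# Missing values
is.na(x)
mean(x, na.rm = TRUE)
# Vectorised operation: applies to every element at once
x * 2
# Writing your own function
addTen <- function(v) v + 10
addTen(x)
# Data frame
myDf <- data.frame(id = 1:6, value = x)
head(myDf)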
Easy to Use
Familiar SAS procedures map onto R functions:
PROC REG → lm(), glm()
PROC SQL → dplyr pipelines (%>%)
PROC SORT → order()
PROC MEANS → mean(), sd()
PROC GPLOT → plot(), ggplot()
…(the list goes on)
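Since the slide equates PROC SQL with %>%, here is a hedged sketch of that pipeline style in practice (the data frame myData and its columns group and y are hypothetical):

library(dplyr)
# Group, aggregate, and sort in one pipeline
myData %>%
  group_by(group) %>%
  summarise(meanY = mean(y, na.rm = TRUE)) %>%
  arrange(desc(meanY))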
Linear Regression
Univariate: $Y = \beta_0 + \beta_1 x + \epsilon$, where $\epsilon$ is the residual
Bivariate / multivariate: $Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$
$K$th-order polynomial function: $Y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_K x^K$, i.e.
$\underbrace{Y}_{\text{prediction}} = \underbrace{\beta_0}_{\text{intercept}} + \sum_{k=1}^{K} \underbrace{\beta_k x^k}_{\text{polynomial}}$

# Univariate linear model
myModel <- lm(y ~ x, myData)
summary(myModel)
Linear Regression: Reading the summary() Output
Residuals: the unexplained part of the model, defined as observed minus fitted value ($\epsilon_i = y_i - \hat{y}_i$). If the parametric assumption is correct, the mean and median should be very close to zero.
Estimate: coefficient of the corresponding independent variable (i.e. the $\beta$ values).
Std. Error: standard error of the coefficient estimate.
t value: the number of standard errors the estimate lies away from zero (i.e. from the null hypothesis).
Pr(>|t|): $p$-value of the estimate. In general, consider dropping any variable with a $p$-value above 0.05.
Multiple R²: squared Pearson correlation between observed and fitted values, indicating the strength of the fit.
Adjusted R²: Multiple R² adjusted for the number of predictors.
F-statistic: global hypothesis test for the model as a whole.
Linear Regression in R

# Load internal dataset
data(USArrests)
# Read top 10 rows
head(USArrests, 10)
# Check the dimensions of this data frame
dim(USArrests)
# Univariate linear model
arrestModel1 <- lm(Murder ~ UrbanPop, USArrests)
summary(arrestModel1)
# Multivariate linear model
arrestModel2 <- lm(Murder ~ UrbanPop + Assault + Rape, USArrests)
summary(arrestModel2)
# Polynomial terms
arrestModel3 <- lm(Murder ~ poly(UrbanPop, 2) + poly(Assault, 2) + poly(Rape, 2), USArrests)
summary(arrestModel3)
Regression Diagnostics in R
# Partial regression plots (check variable influence)
require(car)
avPlots(arrestModel2)
# Standardised regression coefficients (check variable influence)
require(QuantPsyc)
lm.beta(arrestModel2)
# Quantile-quantile plot (check normality assumption)
qqnorm(arrestModel2$residuals)
qqline(arrestModel2$residuals)
# Regression residual plot (check heteroscedasticity)
plot(arrestModel2$fitted.values, rstandard(arrestModel2))
# Compare nested models using a Chi-squared test
anova(arrestModel1, arrestModel2, arrestModel3, test = 'Chisq')
Regression Diagnostics: Residual Plot (Homoscedasticity vs. Heteroscedasticity)
Source: StackExchange
Regression Diagnostics: Quantile-Quantile Plot
Checks normality assumption
Regression Diagnostics: Pearson’s Correlation
Source: Wikipedia
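Tying this back to the USArrests model above: for a least-squares fit with an intercept, the squared Pearson correlation between observed and fitted values equals the Multiple R² reported by summary().

# Squared Pearson correlation between observed and fitted values
cor(USArrests$Murder, arrestModel2$fitted.values)^2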
Regression Diagnostics: Model Overfitting
A lower-order model:
$Y = \beta_0 + \sum_{j=1}^{3} \beta_{wt,j}\, x_{wt}^{j} + \sum_{k=1}^{2} \beta_{hp,k}\, x_{hp}^{k}$
A higher-order model, prone to overfitting:
$Y = \beta_0 + \sum_{j=1}^{8} \beta_{wt,j}\, x_{wt}^{j} + \sum_{k=1}^{5} \beta_{hp,k}\, x_{hp}^{k}$
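A minimal sketch of this comparison, assuming the wt and hp predictors refer to the built-in mtcars dataset (an assumption; the slide does not name its data):

# Lower-order vs. higher-order polynomial fits
lowOrder  <- lm(mpg ~ poly(wt, 3) + poly(hp, 2), mtcars)
highOrder <- lm(mpg ~ poly(wt, 8) + poly(hp, 5), mtcars)
# The higher-order model scores a better in-sample R^2 but likely overfits
summary(lowOrder)$r.squared
summary(highOrder)$r.squared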
Poisson Regression
Models counts of discrete events, for example:
Total number of inbound calls from each customer over a fixed period
Number of children in each household
Number of tea refills each employee has during office hours
[Figure: Poisson probability mass functions for λ = 1, 2, 3, 4, 5]

# Poisson regression
myModel <- glm(y ~ x1 + x2, family = "poisson", data = myData)
summary(myModel)
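A small sketch reproducing the slide's figure with base graphics:

# Poisson probability mass functions for lambda = 1 to 5
counts <- 0:15
matplot(counts, sapply(1:5, function(l) dpois(counts, l)),
        type = "b", pch = 1, xlab = "event count", ylab = "probability")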
Logistic Regression
Models a binomial outcome, for example:
Toss a coin: head / tail
Examination: pass / fail
Product: sold / unsold
Logistic function: $\Pr(Y) = \dfrac{1}{1 + e^{-(\beta_0 + \beta_1 x_1)}}$
Odds ratio ($e^{\beta_1}$): the multiplicative change in the odds of $Y$ when $x_1$ increases by one unit

# Logistic regression
myModel <- glm(y ~ x1 + x2, family = "binomial", data = myData)
summary(myModel)
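A one-line follow-up sketch to read the odds ratios off the fitted model:

# Exponentiate the coefficients to obtain odds ratios (e^beta)
exp(coef(myModel))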
Recursive Partitioning
Divide data into regions recursively
[Figure: the $(x_1, x_2)$ plane split at point $s$ into regions $\mathcal{R}_1, \dots, \mathcal{R}_4$ over successive binary splits]
Decision Tree
Data is divided recursively into regions (a.k.a. 'leaves')
Tree pruning removes weaker leaves and hence avoids overfitting (see the pruning sketch after the code)
[Figure: tree diagram distinguishing stronger nodes from the weaker nodes that pruning removes; leaves correspond to regions]

require(rpart)
# Grow a simple tree
myTree <- rpart(y ~ x1 + x2, myData)
summary(myTree)
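The slide names pruning but shows no code for it; a minimal sketch using rpart's complexity table:

# Pick the complexity parameter with the lowest cross-validated error, then prune
bestCp <- myTree$cptable[which.min(myTree$cptable[, "xerror"]), "CP"]
prunedTree <- prune(myTree, cp = bestCp)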
Random Forest
Consists of many decision trees
A randomly selected subset of variables is used in each tree
Usually no need to prune (i.e. all trees are allowed to grow big)
$M$ trees in a forest produce $M$ predictions
For regression, the final prediction is the mean value
For classification, it is the most common label (i.e. majority voting)

library(randomForest)
# Grow a large forest with 1000 trees
myForest <- randomForest(y ~ x1 + x2, ntree = 1000, data = myData)
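A short follow-up sketch (newData is a hypothetical data frame of unseen observations):

# Inspect variable importance and predict on new data
importance(myForest)
predict(myForest, newData)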
Time Series Analysis: Correlograms
Applies to regularly-spaced time series
Explores variable relationships across time:
Autocorrelation Function (ACF) of the observed data
Partial Autocorrelation Function (PACF) of the observed data
Cross-correlation Function (CCF) against other time series variables
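A minimal sketch using the correlogram functions in base R's stats package (myTs and otherTs are hypothetical ts objects):

acf(myTs)             # autocorrelation function
pacf(myTs)            # partial autocorrelation function
ccf(myTs, otherTs)    # cross-correlation against another series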
Time Series Analysis: Decomposition
Observed data is decomposed over time $t$ into trend, seasonality, and noise; these components (plus other time series variables) feed into the forecast.
[Figure: decomposition panels showing observed data, trend, seasonality, noise, and the resulting forecast over $t$]
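A sketch producing the panels above from the built-in co2 monthly series:

# STL decomposition: trend, seasonal, and remainder panels
plot(stl(co2, s.window = "periodic"))
# Classical decomposition as an alternative
plot(decompose(co2))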
Autoregressive Moving Average (ARMA)

$ARMA(p, q)$:
$\underbrace{X_t}_{\text{observation}} = \underbrace{\sum_{i=1}^{p} \phi_i X_{t-i}}_{AR(p)} + \underbrace{\sum_{i=1}^{q} \theta_i \epsilon_{t-i}}_{MA(q)} + \underbrace{\epsilon_t}_{\text{error}}$

$ARIMA(p, d, q)$: Autoregressive Integrated Moving Average
A $d$th-order integration term can be added; 'integration' simply refers to differencing against the previous time step
First-order differencing ($d = 1$): $X'_t = X_t - X_{t-1}$
Differencing helps satisfy the stationarity requirement
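A minimal sketch of differencing and a manual ARIMA fit in base R (myTs is a hypothetical ts object):

# First-order differencing: X'_t = X_t - X_{t-1}
myTsDiff <- diff(myTs)
# Fit, for example, an ARIMA(1,1,1)
myArima111 <- arima(myTs, order = c(1, 1, 1))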
ARIMA Forecasting with Seasonality
Seasonal model: $ARIMA(p, d, q)(P, D, Q)_m$
All parameter values can be identified automatically in R
Simple models are preferred, so we aim to keep $p + q + P + Q$ small

library(forecast)
# Automatically search for p, d, q, P, D, Q values
myArima <- auto.arima(myTs, xreg = cbind(x1, x2))
summary(myArima)
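Once fitted, forecasting requires future values of the external regressors, since the model uses xreg; a hedged sketch (futureX1 and futureX2 are hypothetical):

# Forecast ahead, supplying future regressor values
futureX <- cbind(x1 = futureX1, x2 = futureX2)
myForecast <- forecast(myArima, xreg = futureX)
plot(myForecast)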
Neural Network: Multilayer Perceptron
Input layer, hidden layers 1 and 2, output layer: fully interconnected
Non-linear activation functions capture subtle non-linear relationships
Gradient descent: an iterative optimisation algorithm
Starts from a random initiation and reduces the error bit by bit
Converges at a local minimum of the loss over the parameter space
[Figure: network diagram and a loss surface descending to a local minimum]
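As a sketch only: base R's nnet package fits a single hidden layer, so this approximates the two-hidden-layer diagram above (the hidden-layer size and regression setting are my assumptions):

library(nnet)
# Single-hidden-layer perceptron with 5 hidden units; linout = TRUE for regression
myNet <- nnet(y ~ x1 + x2, data = myData, size = 5, linout = TRUE)
predict(myNet, myData)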
K-means Clustering
Clustering is subjective: how many clusters are there?
[Figure: the same data partitioned with K = 3, K = 4, and K = 5]
K-means Clustering
Points iteratively move towards cluster centroids
Terminates when cluster assignments stop changing
[Figure: iterations from random initiation to convergence]

# Run the K-means clustering algorithm
K <- 3
myCluster <- kmeans(myData, K)
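To address the "how many clusters?" question from the previous slide, a common elbow-method sketch:

# Total within-cluster sum of squares for K = 1 to 8; look for the 'elbow'
wss <- sapply(1:8, function(k) kmeans(myData, k)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "K", ylab = "total within-cluster SS")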
Hierarchical Clustering
Agglomerative hierarchical clustering
Starts from $N$ clusters (one per observation)
Merges clusters one by one according to Euclidean distance

# Calculate Euclidean distances
myDistance <- dist(myData)
# Run the hierarchical clustering algorithm
myDendrogram <- hclust(myDistance)
# Draw the dendrogram
plot(myDendrogram)
# Cut the tree into K clusters
K <- 3
myClusters <- cutree(myDendrogram, K)
Hierarchical Clustering
[Figure: agglomerative merging shown step by step over iterations 1–8]
User Communities (1)
LondonR: http://www.londonr.org
User Communities (2)
R User Conference (useR!)
Effective Applications of the R Language (EARL)
European R Users Meeting (eRum)
Learning Resources
Data Analysis Examples (UCLA)
Regression Models in R (Harvard)
The R Project (NYU)
Choosing a Statistical Test
Statistical Computing (Oxford)
Forecasting: Principles and Practice (Monash)
Time Series Analysis and Its Applications (Pittsburgh)
R in Action
Quantitative Financial Modelling & Trading Framework for R
Econometrics in R (Northwestern)
Data Analysis with R (Facebook)
Rstatistics