Big Data, Bigger Data & Big R Data. Birmingham R Users Meeting, 23rd April 2013. Andy Pryke


1 Big Data, Bigger Data & Big R Data. Birmingham R Users Meeting, 23rd April 2013. Andy Pryke / @AndyPryke

2 My Bias…
- I work in commercial data mining, data analysis and data visualisation
- Background in computing and artificial intelligence
- Use R to write programs which analyse data

3 What is Big Data?
Depends who you ask. Answers are often "too big to…"
- …load into memory
- …store on a hard drive
- …fit in a standard database
Plus:
- Fast changing
- Not just relational

4 My Big Data Definition
"Data collections big enough to require you to change the way you store and process them." - Andy Pryke

5 Data Size Limits in R
- Standard R packages use a single thread, with data held in memory (RAM); see help("Memory-limits")
- Vectors limited to 2^31 - 1 (about 2 billion) items; R 3.0.0 and later support longer vectors on 64-bit builds
- Memory limit of ~128 TB
- Servers with 1 TB+ memory are available
- Also, Amazon EC2 servers with up to 244 GB
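
The vector-length figure on this slide can be checked from any R session. A minimal sketch using base R only; the variable names are illustrative, and the 8 bytes per element is the usual storage size of an R double:

```r
# Largest classic vector length: 2^31 - 1
max_len <- .Machine$integer.max

# RAM needed for a double vector of that length, at 8 bytes per element
gib_needed <- max_len * 8 / 2^30

# Measured size of a much smaller vector, for comparison
small <- numeric(1e6)
small_bytes <- as.numeric(object.size(small))

cat("Max classic vector length:", max_len, "\n")
cat("GiB for a full-length double vector:", round(gib_needed, 1), "\n")
cat("Bytes for 1e6 doubles:", small_bytes, "\n")
```

In other words, a single full-length numeric vector already needs about 16 GiB of RAM, so on most machines memory runs out long before the length limit is reached.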

6 Overview
- Problems using R with Big Data
- Processing data on disk
- Hadoop for parallel computation and Big Data storage / access
- In-database analysis
- What next for Birmingham R User Group?

7 Background: R matrix class
matrix
- Built in (package base)
- Stored in RAM
- Dense (takes up memory to store zero values)
Can be replaced by…

8 Sparse / Disk Based Matrices
- Matrix – Package Matrix. Sparse. In RAM
- big.matrix – Package bigmemory / bigmemoryExtras & VAM. On disk. VAM allows access from parallel R sessions
- Analysis – Packages irlba, bigalgebra, biganalytics etc. (R-Forge list)
- More details? "Large-Scale Linear Algebra with R", Bryan W. Lewis, Boston R Users Meetup
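
To make the dense versus sparse difference concrete, here is a minimal sketch comparing the in-memory size of a base matrix with a sparse one from the Matrix package. It assumes Matrix is installed (it is a recommended package and normally ships with R); the 1000 x 1000 identity matrix is just an illustrative choice.

```r
library(Matrix)  # recommended package, normally installed with R

n <- 1000

# Dense base matrix: every cell, including the zeros, takes 8 bytes
dense <- matrix(0, nrow = n, ncol = n)
diag(dense) <- 1

# Sparse equivalent: only the n non-zero cells are stored
sparse <- sparseMatrix(i = seq_len(n), j = seq_len(n),
                       x = rep(1, n), dims = c(n, n))

dense_bytes  <- as.numeric(object.size(dense))
sparse_bytes <- as.numeric(object.size(sparse))
cat("dense: ", dense_bytes, "bytes\n")
cat("sparse:", sparse_bytes, "bytes\n")

# Both objects hold the same values
same_values <- all(as.matrix(sparse) == dense)
```

The saving grows with the proportion of zeros; for a matrix that is mostly non-zero, the sparse format can end up larger than the dense one.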

9 Commercial Versions of R
- Revolution Analytics have specialised versions of R for parallel execution & big data
- I believe many if not most components are also available under Free Open Source licences, including the RHadoop set of packages
- Plenty more info here

10 Background: Hadoop
- Parallel data processing environment based on Google's MapReduce model
- Map – divide up the data and send it to multiple nodes for processing
- Reduce – combine the results
Plus:
- Hadoop Distributed File System (HDFS)
- HBase – distributed database like Google's BigTable
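
The map and reduce steps can be imitated in plain R on one machine. This is only a sketch of the idea, not Hadoop: there is no cluster or HDFS here, each line of text stands in for a chunk of data sent to a node, and the function and variable names are made up for illustration.

```r
lines <- c("in the beginning was the word",
           "and the word was good")

# "Map": each chunk (here, one line) is processed independently,
# emitting a (word, 1) pair for every word it contains
map_chunk <- function(chunk) {
  words <- unlist(strsplit(chunk, " ", fixed = TRUE))
  data.frame(key = words, value = 1L, stringsAsFactors = FALSE)
}
mapped <- do.call(rbind, lapply(lines, map_chunk))

# "Reduce": combine the emitted values, grouped by key
reduced <- tapply(mapped$value, mapped$key, sum)
print(reduced)
```

Because each call to map_chunk() depends only on its own chunk, the lapply() step is the part a cluster can spread across nodes.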

11 RHadoop – Revolution Analytics
- Packages: rmr2, rhbase, rhdfs
- Example code using RMR (R Map-Reduce)
- R and Hadoop – Step by Step Tutorials
- Install and Demo RHadoop (Google for more of these online)
- Data Hacking with RHadoop

12 RHadoop
library(rmr2)

wc.map <- function(., lines) {
  ## split "lines" of text into a vector of individual "words"
  words <- unlist(strsplit(x = lines, split = " "))
  keyval(words, 1) ## each word occurs once
}

wc.reduce <- function(word, counts) {
  ## Add up the counts, grouping them by word
  keyval(word, sum(counts))
}

wordcount <- function(input, output = NULL) {
  mapreduce(input = input, output = output, input.format = "text",
            map = wc.map, reduce = wc.reduce, combine = TRUE)
}

E.g.
Function   Output
wc.map     ## In, 1
           ## the, 1
           ## beginning, 1
           ## ...
wc.reduce  ## the, 2345
           ## word, 987
           ## beginning, 123
           ## ...

13 Other Hadoop libraries for R
Other packages: hive, segue, RHIPE…
segue – easy way to distribute CPU intensive work
- Uses Amazon's Elastic MapReduce service, which costs money
- Not designed for big data, but easy and fun
Example follows…

14 segue example
# first, let's generate a 10-element list of
# 999 random numbers + 1 NA:
> myList <- getMyTestList()
# Take the mean of each set of 999 numbers
> outputLocal <- lapply(myList, mean, na.rm=TRUE)
> outputEmr <- emrlapply(myCluster, myList, mean, na.rm=TRUE)
RUNNING - 2011-01-04 15:16:57
RUNNING - 2011-01-04 15:17:27
RUNNING - 2011-01-04 15:17:58
WAITING - 2011-01-04 15:18:29
## Check local and cluster results match
> all.equal(outputEmr, outputLocal)
[1] TRUE
# The key is the emrlapply() function. It works just like lapply(),
# but automagically spreads its work across the specified cluster
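
The same lapply()-style pattern can be tried without an Amazon account. The sketch below swaps in mclapply() from the base parallel package, which spreads work over local cores rather than an Elastic MapReduce cluster, so it is a stand-in for emrlapply() and not segue itself. The list construction is an assumed equivalent of getMyTestList(), which the slide does not define.

```r
library(parallel)  # ships with base R

# Assumed stand-in for the slide's getMyTestList():
# a 10-element list, each element 999 random numbers plus one NA
set.seed(1)
myList <- lapply(1:10, function(i) c(rnorm(999), NA))

# Serial version
outputLocal <- lapply(myList, mean, na.rm = TRUE)

# Parallel version on local cores (forking is unavailable on Windows,
# so fall back to a single core there)
cores <- if (.Platform$OS.type == "windows") 1L else 2L
outputPar <- mclapply(myList, mean, na.rm = TRUE, mc.cores = cores)

# Check serial and parallel results match
resultsMatch <- isTRUE(all.equal(outputPar, outputLocal))
print(resultsMatch)
```

For a task this small the parallel version is unlikely to be faster; the overhead of starting workers only pays off for CPU intensive functions.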

15 Oracle R Connector for Hadoop
- Integrates with Oracle Database, Oracle Big Data Appliance (sounds expensive!) & HDFS
- Map-Reduce is very similar to the rmr example
- Documentation lists examples for linear regression, k-means and working with graphs, amongst others
- Introduction to Oracle R Connector for Hadoop
- Oracle also offer some in-database algorithms for R via Oracle R Enterprise (overview)

16 Teradata Integration
- Package: teradataR
- Teradata offer in-database analytics, accessible through R
- These include k-means clustering, descriptive statistics and the ability to create and call in-database user defined functions

17 What Next? I propose an informal big data Special Interest Group, where we collaborate to explore big data options within R, producing example code etc. R you interested?
