The R language and its Dynamic Runtime


1 The R language and its Dynamic Runtime
Carlos Ordonez

2 Acknowledgments
AT&T Labs
Simon Urbanek (AT&T Labs, R core team)
Mike Stonebraker (MIT)
Hadley Wickham (formerly at Rice U)
Bryan Lewis (SciDB team)
Divesh Srivastava (my "boss" at AT&T)

3 Outline
History
R features
R runtime
R programming
Research: analyzing streams

4 History
Originally the S language, invented at AT&T Bell Labs (Chambers received the 1998 ACM Software System Award for S)
The core runtime subsystem is still based on S expressions
First solid version in 1979: ported to Unix and programmed in C
Two branches:
  commercial = S-PLUS
  open source = R (NZ)

5 Other analytic systems
SAS: more a script language, but well-tested libraries and external tools
Matlab: numerical analysis, optimization, mathematical modeling
DBMSs interacting with math libraries: SQL #1 to write queries
Spark: new generation of MapReduce
Pure C or C++; Java; Python growing (flat files)

6 Features
Interpreted
Functional; recursion
Object-oriented
Lists, vectors and matrices
Goal: statistical computing, but also numerical analysis and data pre-processing
Garbage collector

7 Pros
Robust core interpreter system; portable
More RAM makes things easier: 64-bit memory addressing (but still 32-bit ints)
Growing user population: expected to surpass SAS in 2015; already passed S-PLUS
Machine learning now often uses R instead of Matlab, but Julia (MIT) is growing
Scalable systems and libraries exist:
  Revolution (bought by Microsoft)
  pbdR
  snow, biglm

8 Drawbacks
Syntax OK, but run-time R semantics are not formally specified: GNU R is the current de facto standard
Can be slow, especially because there are many ways to program the same task
Difficult to integrate data structures (e.g. trees, hash tables, binary files)
String manipulation acceptable, but sometimes cumbersome
Dynamically typed: unexpected errors
Highly variable quality of libraries in CRAN
Does not scale well for large n; block-based processing is feasible, but needs to be reprogrammed per library (I/O tools)

9 R runtime
Single threaded
Text file I/O
Garbage collector
Environments; variable generations
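
The environment and garbage-collector bullets above can be illustrated with a short R sketch. It is only an illustration, not code from the original slides: the variable names (`env`, `make_frame`, `big`) are made up for this example, and it uses only base R calls (`new.env()`, `assign()`, `get()`, `ls()`, `gc()`).

```r
# An environment is a mutable table of name -> value bindings.
env <- new.env()
assign("count", 0, envir = env)
env$count <- env$count + 1        # update the binding in place
names_in_env <- ls(env)           # names currently bound in env

# Every function call gets its own environment holding its locals.
make_frame <- function() {
  y <- 1
  environment()                   # return the call's own environment
}
frame <- make_frame()
y_value <- get("y", envir = frame)

# Memory is reclaimed automatically, but a collection can be requested.
big <- numeric(1e6)               # allocate a large vector (hypothetical workload)
rm(big)                           # drop the only reference to it
gc_info <- gc()                   # force a collection; returns a usage summary matrix
```

Calling `gc()` explicitly is rarely required in interactive use; it is shown here only because the slides mention the collector as part of the runtime.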

10 R internals
S expressions
Data types: integer (32 bit), real, string, POSIX timestamp
Memory allocation: lists, vectors, matrices, data frames (most general)
Memory deallocation: automatic, but can force calls to the garbage collector in embedded code
Bash script-based interpreter: easy integration into diverse Unix environments
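
As a rough illustration of the data types listed above, the following R sketch inspects their storage types and shows the 32-bit integer limit. The variable names are invented for the example; the calls (`typeof()`, `.Machine$integer.max`, `as.POSIXct()`, `data.frame()`) are base R.

```r
# Basic storage types named on the slide
x_int  <- 42L                      # integer (32 bit)
x_real <- 3.14                     # real (double precision)
x_str  <- "hello"                  # string (character vector of length 1)
x_time <- as.POSIXct("2015-01-01 00:00:00", tz = "UTC")  # POSIX timestamp

# A POSIXct timestamp is stored as a double (seconds since 1970-01-01)
types <- c(typeof(x_int), typeof(x_real), typeof(x_str), typeof(x_time))
print(types)

# Integers stay 32-bit even with 64-bit memory addressing
int_max   <- .Machine$integer.max              # 2147483647
overflow  <- suppressWarnings(int_max + 1L)    # integer overflow yields NA
as_double <- int_max + 1                       # a double operand avoids the overflow

# A data frame is a list of equal-length column vectors
df <- data.frame(id = 1:3, value = c(0.5, 1.5, 2.5))
is_list <- is.list(df)
```

The overflow case is one concrete reason the "still 32-bit ints" caveat matters when counting rows or bytes in large data sets.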

11 Programming in R
Examples
Interactive debugging
Reusable and maintainable code
Faster processing
Extending R

12 Examples

13 Debugging
Tracking variable contents
List, vector, matrix sizes
Ranges
Environments

14 Tracking variable content
Initialization commonly not needed
Data type can change at any time with a new assignment
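
A minimal sketch of what this slide describes, assuming nothing beyond base R: the same name is rebound to values of different types, and `class()` and `str()` are used to track what it currently holds. The `tryCatch()` call shows the kind of run-time error that dynamic typing can produce.

```r
x <- 10L
class_1 <- class(x)      # "integer"

x <- "ten"               # same name, new type; no declaration needed
class_2 <- class(x)      # "character"

# Arithmetic on the re-typed variable now fails at run time
err <- tryCatch(x + 1, error = function(e) conditionMessage(e))

x <- c(1.5, 2.5)
class_3 <- class(x)      # "numeric"
str(x)                   # compact summary of type, length and first values
```

Printing `class()` or `str()` at suspicious points is a cheap way to catch a variable whose type changed earlier than expected.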

15 Sizes
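
The slide content is not in the transcript; the sketch below is an assumed illustration of checking sizes and ranges during debugging, using base R functions only. The objects `v`, `m` and `l` are made up for the example.

```r
v <- c(4, 8, 15, 16, 23, 42)              # numeric vector
m <- matrix(1:6, nrow = 2, ncol = 3)      # 2 x 3 integer matrix
l <- list(a = 1, b = "two", c = v)        # heterogeneous list

n_v    <- length(v)            # number of elements in the vector
dims_m <- dim(m)               # rows and columns of the matrix
n_l    <- length(l)            # number of top-level list entries
len_l  <- sapply(l, length)    # length of each list entry, by name
rng_v  <- range(v)             # minimum and maximum of the vector
```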

16 Reusable and maintainable code
Functions
Closures
Functionals
Named arguments, defaults
Libraries
R embedded
R embedded C

17 Functional
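
The original slide body is missing from the transcript. As a hedged sketch of the ideas named on the previous slide (closures, functionals, named arguments with defaults), the following uses base R only; `make_scaler` and the other names are hypothetical.

```r
# A closure: the returned function remembers `factor` from its defining call.
# `factor = 2` is a default for a named argument.
make_scaler <- function(factor = 2) {
  function(x) x * factor
}
double_it <- make_scaler()             # uses the default
triple_it <- make_scaler(factor = 3)   # argument passed by name

# Functionals: functions that take another function as an argument
squares <- sapply(1:5, function(i) i^2)                 # apply to each element
total   <- Reduce(`+`, 1:5)                             # fold a vector into one value
evens   <- Filter(function(i) i %% 2 == 0, 1:10)        # keep elements passing a test
```

For example, `double_it(5)` evaluates to 10 and `triple_it(5)` to 15, because each closure carries its own `factor`.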

18 Faster processing
Profiling code
Direct calls to C math library
Vectorized code
Avoid type casting
Chunk-based processing
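
To make the "vectorized code" and "profiling" bullets concrete, here is a small sketch comparing an interpreted loop with its vectorized equivalent. The task (sum of squares) and function names are assumptions chosen for illustration; `system.time()` is used as a simple timer, while `Rprof()` is the built-in sampling profiler for larger programs.

```r
n <- 100000
x <- runif(n)

# Interpreted loop: one element per iteration
sum_sq_loop <- function(x) {
  total <- 0
  for (i in seq_along(x)) {
    total <- total + x[i]^2
  }
  total
}

# Vectorized: the loop runs inside R's compiled internals
sum_sq_vec <- function(x) sum(x^2)

# Time both versions; the results should agree up to rounding
t_loop <- system.time(r_loop <- sum_sq_loop(x))
t_vec  <- system.time(r_vec  <- sum_sq_vec(x))
print(rbind(loop = t_loop, vectorized = t_vec))
```

Actual timings depend on the machine and R version, so none are quoted here.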

19 Faster processing
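
The slide body is not in the transcript. One of the techniques listed earlier, chunk-based processing, might look like the sketch below: a text file is read a fixed number of lines at a time so that only one chunk is in RAM at once. The temporary file, the chunk size of 10 lines and the running sum are all assumptions made for the example.

```r
# Create a small input file standing in for a large data set
path <- tempfile(fileext = ".txt")
writeLines(as.character(1:25), path)

con <- file(path, open = "r")     # open once, read incrementally
total <- 0
chunks_read <- 0
repeat {
  lines <- readLines(con, n = 10)         # next chunk of at most 10 lines
  if (length(lines) == 0) break           # end of file
  total <- total + sum(as.numeric(lines)) # combine per-chunk partial results
  chunks_read <- chunks_read + 1
}
close(con)
unlink(path)                      # remove the temporary file
```

This works directly for statistics that combine across chunks (sums, counts, min/max); as the drawbacks slide notes, other algorithms need to be reprogrammed to work block by block.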

20 Extending R
New functions
Libraries
Embedded code

21 Research goal: analyzing network data streams
Stream data warehouse, constantly refreshed every 1-5 minutes from multiple streams
Time windows
Intermittent feeds
Enable complex analytics for network monitoring
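
As a rough sketch of the time-window idea, the following R code groups stream records into fixed 5-minute windows and aggregates each window, including a window that received no data (an intermittent feed). The record layout (`arrival_sec`, `bytes`) and the window length are assumptions for illustration, not the author's system.

```r
# Hypothetical stream records: arrival time in seconds and a measured value
arrival_sec <- c(10, 70, 130, 310, 320, 915)
bytes       <- c(100, 200, 50, 400, 25, 75)

window_sec <- 300                           # 5-minute windows
window_id  <- arrival_sec %/% window_sec    # 0, 0, 0, 1, 1, 3

# Keep every window as a factor level, so empty windows remain visible
all_windows <- factor(window_id, levels = 0:max(window_id))
totals <- tapply(bytes, all_windows, sum)   # NA where no data arrived

missing_windows <- names(totals)[is.na(totals)]  # windows with no feed
totals[is.na(totals)] <- 0                       # treat missing as zero traffic
```

Here window 2 (seconds 600 to 899) is reported as missing before being filled with zero; whether to fill, interpolate or flag such gaps is a design decision for the monitoring application.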

22 Embedded code
Main motivation: bypass ODBC, JDBC, JSON
Embedding R code inside C code
  Vectors and matrices
  Exploit existing R functions
  May be faster than host language
Embedding C code inside R code
  better performance
  more flexibility
  algorithm already programmed in C or C++

23 Embedded R inside C
Set up libraries
Set up Unix environment
Convert external data to list, vector or data frame: memcpy() when possible
Retrieve results:
  transformed data set (most common)
  model (harder)
  associated statistical metrics (model-specific)

24 Embedded R inside C: main guidelines
Avoid reprogramming an existing R function
Consider tradeoffs between data set size and RAM
Two subsystems will compete for RAM
Single threaded, but feasible to call R multiple times as different Unix processes

25 Embedded R

26 Embedded R generate time series

27 Embedded R create data frame

28 Embedded R final: call R from C

29 Embedded R direct binding to DBMS

30 Embedded R main

31 Embedded C code guidelines
Identify bottlenecks
Substitute nested interpreted loops
Eliminate or reduce dynamic type checking

32 Embedded C code programming
Understand data type manipulation, especially C arrays and ** pointers
Memory management
Function argument binding
Linker

33 Improve efficiency of R Alternative 1: built-in matrix ops
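
The transcript does not say which operator this slide optimized. As an assumed stand-in, the sketch below computes the Gram matrix X'X two ways: with nested interpreted loops and with the built-in `crossprod()`, which dispatches to compiled linear algebra code. The matrix size and the function name `gram_loop` are made up for the example.

```r
set.seed(1)
X <- matrix(rnorm(200 * 5), nrow = 200, ncol = 5)   # hypothetical data: 200 points, 5 dimensions

# Nested interpreted loops over all pairs of columns
gram_loop <- function(X) {
  d <- ncol(X)
  G <- matrix(0, d, d)
  for (a in 1:d) {
    for (b in 1:d) {
      G[a, b] <- sum(X[, a] * X[, b])
    }
  }
  G
}

G_loop    <- gram_loop(X)
G_builtin <- crossprod(X)      # same result as t(X) %*% X, computed in compiled code
```

Both versions produce the same 5 x 5 matrix; the built-in form is shorter and avoids the interpreted inner loops.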

34 Improve R efficiency Alternative 2: C code for the operator: 10X faster

