Computing for Data Analysis R statistics programming environment Ming Ni 11/14/2014.


1 Computing for Data Analysis R statistics programming environment Ming Ni mingni@buffalo.edu 11/14/2014

2 http://tinyurl.com/ise-r-talk

3 Outline 1. Overview and History of R 2. Data types in R 3. Reading and Writing Data 4. Plotting Data

4 What is S? R is a dialect of the S language. S is a language that was developed by John Chambers and others at Bell Labs. S was initiated in 1976 as an internal statistical analysis environment, originally implemented as Fortran libraries. Version 4 of the S language was released in 1998 and is the version we use today.

5 What is R? 1991: Created in New Zealand by Ross Ihaka and Robert Gentleman. 1993: First announcement of R to the public. 1995: R becomes free software under the GNU General Public License. 1997: The R Core Group is formed; the core group controls the source code for R. 2000: R version 1.0.0 is released. 2014: R version 3.1.2 is the most recent release.

6 Features of R 1. It is free! 2. The syntax and semantics are very similar to S. 3. R is case sensitive. 4. Commands are separated either by ; or by a newline. 5. Runs on almost any standard computing platform/OS (Windows, Mac, Linux, even on the PlayStation 4).

7 Features of R, cont'd 6. Frequent releases (annual + bugfix releases); active development. 7. Core software is quite lean; functionality is divided into modular packages. 8. Graphics capabilities are very sophisticated. 9. Useful for interactive work, but contains a powerful programming language for developing new tools. 10. Very active and vibrant user community (mailing lists and Stack Overflow).

8 Drawbacks of R 1. Essentially based on 40-year-old technology. 2. Little built-in support for dynamic or 3-D graphics. 3. No help line you can call for support or for explanations of features. 4. Objects must generally be stored in the physical memory of the computer, which is a limitation in the age of big data. 5. Not ideal for every possible situation. R cannot do everything!

9 Other Data Analysis Software The number of analytics jobs for the more popular software (250 jobs or more, 2/2014).

10 Other Data Analysis Software. Number of scholarly articles found for each software (2/2014).

11 Other Data Analysis Software. Honorable mentions: Python with the packages NumPy, pandas, and SciPy; SPSS Modeler, which offers easy drag-and-drop nodes for access to advanced data analytics.

12 Downloading and Installing R: http://cran.us.r-project.org/

13 Design of the R System. The R system is divided into 2 conceptual parts: the “base” R system that you download from CRAN, and everything else. R functionality is divided into a number of packages. There are 4000+ packages on CRAN; they are contributed by users and not controlled by R Core. There are also a large number of R packages outside of CRAN.

14 Getting started with R [Figure: the R Console and an R Script]

15 Getting started with R. You can work directly in R, but most users prefer a graphical interface. Integrated Development Environments (IDEs): RStudio, Tinn-R, Deducer, Revolution R (leverages R in Hadoop environments). Text editors with plugins: Vim, Eclipse + StatET. RStudio Server in a web browser.

16 Getting started with R. User: an interactive environment, where people do not consciously think of themselves as programming (reading tables, data analysis). Programmer: as sophistication increases and a clear need arises, people are able to slide gradually into programming (data processing, developing their own tools).

17 Outline 1. Overview and History of R 2. Data types in R 3. Reading and Writing Data 4. Plotting Data

18 Basic classes: numeric, integer, character, logical (TRUE/FALSE), complex. Vector, matrix, list. Factor. Missing value. Data frame.

19 Entering Input. At the R prompt we type expressions. The <- symbol is the assignment operator. Expression: x <- 1. Object: x. Value: 1. Class of x: numeric. Figure labels: hash symbol (# starts a comment), assignment operator.
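
The slide's screenshot is not in the transcript, so here is a minimal sketch of the same idea as it might be typed at the R prompt. The second variable, `msg`, is only an illustration and does not come from the slides.

```r
# Everything after a hash symbol is a comment and is ignored by R.
x <- 1            # assign the value 1 to the object x
class(x)          # "numeric"

msg <- "hello"    # character values are assigned the same way
class(msg)        # "character"
```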

20 Printing. When a complete expression is entered at the prompt, it is evaluated and the result of the evaluated expression is returned. The result may be auto-printed. The [1] indicates that x is a vector and that the first element of x is the value 1.

21 Printing. The : operator is used to create integer sequences.
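
A short sketch covering the two Printing slides, assuming a fresh R session; the variable names are illustrative.

```r
x <- 1
x            # auto-printing: shows [1] 1
print(x)     # explicit printing gives the same output

y <- 1:20    # the : operator builds an integer sequence
class(y)     # "integer"
length(y)    # 20
```

When a long vector is printed, each output line starts with the index of its first element in square brackets, which is what the [1] above refers to.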

22 Create Vectors. The c() function can be used to create vectors of objects. When different objects are mixed in a vector, coercion occurs so that every element in the vector is of the same class. class(object) # class or type of an object
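
The coercion rule can be seen in a small sketch like the following; the vectors are made up for illustration.

```r
num <- c(0.5, 0.6)         # numeric
lgl <- c(TRUE, FALSE)      # logical
chr <- c("a", "b", "c")    # character

# Mixing classes triggers implicit coercion to a common class.
mix1 <- c(1.7, "a")    # numeric + character -> character
mix2 <- c(TRUE, 2)     # logical + numeric   -> numeric (TRUE becomes 1)
mix3 <- c("a", TRUE)   # character + logical -> character

class(mix1)
class(mix2)
class(mix3)
```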

23 Explicit Coercion. Objects can be explicitly coerced from one class to another using as.* functions, if available.
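
A minimal sketch of explicit coercion with a few of the as.* functions from base R. The warning is suppressed here only to keep the output quiet.

```r
x <- 0:6
as.numeric(x)      # the same values stored as doubles
as.logical(x)      # FALSE for 0, TRUE for every non-zero value
as.character(x)    # "0" "1" ... "6"

# Coercion that makes no sense yields NA (R also emits a warning).
bad <- suppressWarnings(as.numeric(c("a", "b", "c")))
bad
```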

24 vector, matrix, list 1. vector: A vector can only contain objects of the same class. 2. matrix: Matrices are vectors with a dimension attribute. The dimension attribute is an integer vector of length 2 (nrow, ncol). 3. list: Lists are a special type of vector that can contain elements of different classes. A list can also have multiple dimensions.
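
The three structures can be sketched as follows; the values are illustrative only.

```r
v <- vector("numeric", length = 3)     # a numeric vector of three zeros
m <- matrix(1:6, nrow = 2, ncol = 3)   # a vector with a dim attribute, filled column by column
dim(m)                                 # 2 3
attributes(m)

l <- list(1, "a", TRUE, 1 + 4i)        # elements of different classes
sapply(l, class)
```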

25 cbind-ing and rbind-ing. Matrices can be created by column-binding or row-binding with cbind() and rbind(). Both functions can also be used with data frames.
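
A brief sketch of both uses, with made-up values. The data frame columns `id`, `score`, and `passed` are hypothetical names chosen for the example.

```r
x <- 1:3
y <- 10:12
by_col <- cbind(x, y)   # 3 rows, 2 columns
by_row <- rbind(x, y)   # 2 rows, 3 columns

# The same functions work on data frames.
df  <- data.frame(id = 1:2, score = c(90, 85))
df2 <- rbind(df, data.frame(id = 3L, score = 70))   # append a row
df3 <- cbind(df2, passed = df2$score >= 80)         # append a column
df3
```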

26 Basic classes: numeric, integer, character, logical (TRUE/FALSE), complex. Vector, matrix, list. Factor. Missing value. Data frame.

27 Factor. A factor is a special type of vector. Factors are used to represent categorical data. Factors can be unordered or ordered. Each element of a factor has a label. Factors are treated specially by modelling functions like lm() and glm().
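
A small sketch of unordered and ordered factors; the category labels are invented for illustration.

```r
f <- factor(c("yes", "yes", "no", "yes", "no"))
levels(f)     # "no" "yes" (alphabetical by default)
unclass(f)    # the underlying integer codes plus the levels attribute

# The levels argument sets the order explicitly;
# ordered = TRUE makes comparisons between elements meaningful.
size <- factor(c("small", "large", "medium"),
               levels = c("small", "medium", "large"),
               ordered = TRUE)
size[1] < size[2]   # TRUE: "small" comes before "large"
```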

28 Factor. Frequency tables are generated with the table() function.
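
For example, counting the levels of a small made-up factor:

```r
f <- factor(c("yes", "yes", "no", "yes", "no"))
counts <- table(f)   # one count per level
counts
```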

29 Missing Values. Missing values are denoted by NA, or by NaN for undefined mathematical operations. NaN stands for Not a Number (for example, 0/0). NA is generally interpreted as a missing value. NA values have a class also, so there are integer NA, character NA, logical NA, etc. A NaN value is also NA, but the converse is not true.

30 Missing Values Functions
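
The slide's own examples are not in the transcript. As an assumption, the sketch below uses the base R functions is.na() and is.nan(), plus the na.rm argument of mean(), to illustrate the rules from the previous slide.

```r
x <- c(1, 2, NA, 10, 3)
is.na(x)     # TRUE only at the third position
is.nan(x)    # all FALSE: NA is not NaN

y <- c(1, 2, NaN, NA, 4)
is.na(y)     # TRUE for both the NaN and the NA
is.nan(y)    # TRUE only for the NaN

mean(x)                 # NA, because one value is missing
mean(x, na.rm = TRUE)   # 4, computed from the four remaining values
```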

31 Summary. Basic classes: numeric, integer, character, logical (TRUE/FALSE), complex. Vector, matrix, list. Factor. Missing value. Data frame.

32 Outline 1. Overview and History of R 2. Data types in R 3. Reading and Writing Data 4. Plotting Data

33 Principal functions for reading data into R: read.table and read.csv, for reading tabular data (.csv, .txt); readLines, for reading lines of a text file; source, for reading in R code files (.r); load, for reading in saved workspaces (.rdata). Analogous functions for writing data to files: write.table (.txt, .csv), writeLines, dump, save.
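
A rough sketch of how the reading and writing functions pair up, using temporary files so nothing is left behind. The data frame and file contents are invented for illustration.

```r
df <- data.frame(id = 1:3, name = c("a", "b", "c"))

# Tabular data: write.table / read.table (read.csv is the comma-separated variant).
tab_file <- tempfile(fileext = ".txt")
write.table(df, tab_file, sep = "\t", row.names = FALSE)
df_back <- read.table(tab_file, header = TRUE, sep = "\t")

# Lines of text: writeLines / readLines.
txt_file <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), txt_file)
lines_back <- readLines(txt_file)

# R code: dump writes code that recreates an object, source runs it.
code_file <- tempfile(fileext = ".R")
dump("df", file = code_file)
rm(df)
source(code_file, local = TRUE)   # df exists again

# Saved objects: save / load.
rdata_file <- tempfile(fileext = ".RData")
save(df, file = rdata_file)
rm(df)
load(rdata_file)                  # df is restored
```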

34 The read.table function is one of the most commonly used functions for reading data. It has a few important arguments: read.table(file, header, sep, colClasses, nrows, skip, stringsAsFactors). file: the name of a file, or a connection. header: logical indicating if the file has a header line. sep: a string indicating how the columns are separated. colClasses: a character vector indicating the class of each column in the dataset. nrows: the number of rows in the dataset. skip: the number of lines to skip from the beginning. stringsAsFactors: should character variables be coded as factors?
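
To make the arguments concrete, here is a sketch that writes a tiny comma-separated file to a temporary location and reads part of it back. The file contents and column names are hypothetical.

```r
csv_file <- tempfile(fileext = ".csv")
writeLines(c("measurements exported for illustration",
             "id,species,length",
             "1,setosa,5.1",
             "2,virginica,6.3",
             "3,versicolor,5.9"),
           csv_file)

dat <- read.table(csv_file,
                  header = TRUE,        # the first line read holds column names
                  sep = ",",            # columns are separated by commas
                  colClasses = c("integer", "character", "numeric"),
                  nrows = 2,            # read only the first two data rows
                  skip = 1,             # skip the descriptive first line
                  stringsAsFactors = FALSE)
str(dat)
```

Supplying colClasses and nrows is optional, but it spares R from guessing column types and can speed up reading large files.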

35 read.table(file, header, sep) The other arguments of the function use their default values. How to check them? The help file for the read.table function from R Documentation: read.table(file, header = FALSE, sep = "", quote = "\"'", dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"), row.names, col.names, as.is = !stringsAsFactors, na.strings = "NA", colClasses = NA, nrows = -1, skip = 0, check.names = TRUE, fill = !blank.lines.skip, strip.white = FALSE, blank.lines.skip = TRUE, comment.char = "#", allowEscapes = FALSE, flush = FALSE, stringsAsFactors = default.stringsAsFactors(), fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)

36 Check with the R help documentation 1. ?read.table: precede the name of the function with ? 2. ??keyword: searches the R documentation for keyword 3. Google "read.table r". If you cannot follow the help documentation, look at the examples first; they are at the end of the help page.
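
A sketch of these help lookups. The search keyword "regression" is an arbitrary example, and the calls that open a help viewer are guarded so the snippet also runs in a non-interactive session.

```r
# Interactive help calls open a viewer, so they are guarded here.
if (interactive()) {
  ?read.table                  # help page for a function
  help.search("regression")    # the long form of ??regression
  example(read.table)          # run the examples from the end of the help page
}

# These work in any session and show a function's arguments and defaults.
args(read.table)
arg_names <- names(formals(read.table))
arg_names[1:3]    # "file" "header" "sep"
```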

37 Data frames are used to store tabular data (the key data type used in R). 1. They are represented as a special type of list where every element of the list has to have the same length. 2. Unlike matrices, data frames can store different classes of objects in each column (just like lists). 3. Data frames also have a special attribute called row.names, used to annotate the data. 4. Data frames are usually created by calling read.table() or read.csv(). 5. They can be converted to a matrix by calling data.matrix().
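
A small sketch of these properties, building a data frame directly with data.frame() rather than reading a file. The column names are made up.

```r
df <- data.frame(id = 1:3,
                 name = c("a", "b", "c"),
                 passed = c(TRUE, FALSE, TRUE))
nrow(df)             # 3
ncol(df)             # 3
sapply(df, class)    # each column keeps its own class
row.names(df)        # "1" "2" "3"

# Convert selected columns to a numeric matrix.
m <- data.matrix(df[, c("id", "passed")])
dim(m)               # 3 2
```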

38 Demo The Iris Data Set consists of 50 samples from each of three species of Iris flowers (Iris setosa, Iris virginica and Iris versicolor). 4 attributes were measured from each sample.
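
The demo itself is not part of the transcript. A first look at the data could go something like this, using the iris data frame that ships with R:

```r
data(iris)                   # built-in dataset
dim(iris)                    # 150 rows, 5 columns (4 measurements plus Species)
str(iris)                    # column names and classes
head(iris, 3)                # first three rows
table(iris$Species)          # 50 samples per species
summary(iris$Sepal.Length)   # quick numeric summary of one attribute
```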

39 Outline 1. Overview and History of R 2. Data types in R 3. Reading and Writing Data 4. Plotting Data

40 The plotting and graphics engine in R lives in a few base and recommended packages: graphics: contains plotting functions for the “base” graphing systems, including plot, hist, boxplot, etc.; lattice; grid; grDevices.

41 Common questions about R plotting. Where to plot: R graphics devices. How to plot: functions with parameters. Need to resize: export and format selection. The process of making an R base plot: base graphics are usually constructed piece by piece, with each aspect of the plot handled separately through a series of function calls. This mirrors the thought process. Base plotting is the most commonly used system and is very powerful for creating 2-D graphics.
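
The piece-by-piece process can be sketched as below. The choice of the built-in iris data, the PDF device, and the fitted line are assumptions made for illustration, not the presenter's demo.

```r
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file, width = 6, height = 4)   # where to plot: open a PDF graphics device

# How to plot: start with the points, then add each piece with its own call.
plot(iris$Sepal.Length, iris$Petal.Length, xlab = "", ylab = "")
title(main = "Iris: petal length vs. sepal length",
      xlab = "Sepal length (cm)", ylab = "Petal length (cm)")
abline(lm(Petal.Length ~ Sepal.Length, data = iris), lty = 2)   # add a fitted line
legend("topleft", legend = "linear fit", lty = 2)

dev.off()   # close the device so the file is written
```

Choosing the device and its width and height up front is one way to handle resizing and export format, since the plot is drawn at its final size.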

42 [Figure: anatomy of a base plot, showing the plot title, X label, Y label, and margins 1, 2, 3, 4]

43 Some Important Base Graphics Parameters. The par() function is used to specify global graphics parameters that affect all plots in an R session. pch: the plotting symbol (default is open circle); lty: the line type (solid, dashed, dotted); lwd: the line width; col: the plotting color; las: the orientation of the axis labels; bg: the background color; mar: the margin size; mfrow: number of plots per row, column (plots are filled row-wise); mfcol: number of plots per row, column (plots are filled column-wise).
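
A hedged sketch that sets a few of these parameters, some globally through par() and some per plot call. The data, colors, and output file are illustrative choices.

```r
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)

old_par <- par(mfrow = c(1, 2),       # 1 row, 2 columns, filled row-wise
               mar = c(4, 4, 2, 1),   # margins: bottom, left, top, right
               las = 1)               # axis labels always horizontal
layout_used <- par("mfrow")

# pch and col can also be given per call instead of globally.
plot(iris$Sepal.Length, iris$Sepal.Width,
     pch = 19, col = "blue",
     xlab = "Sepal length", ylab = "Sepal width")

# lty and lwd control the line type and width.
plot(iris$Petal.Length, type = "l", lty = 2, lwd = 2,
     xlab = "Index", ylab = "Petal length")

par(old_par)   # restore the previous settings
dev.off()
```

Saving the value returned by par() and restoring it afterwards keeps the global settings from leaking into later plots.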

44 Demo R base plotting

45 Ming Ni Student of Industrial and Systems Engineering, State University of New York at Buffalo Email: mingni@buffalo.edu Advisor: Qing He, Ph.D.

