CSCI N317 Computation for Scientific Applications Unit 2 - 2 R Data processing
Create Data Use R commands to enter data. Good for small amounts of data. About data frames: http://www.r-tutor.com/r-introduction/data-frame Note: "." has no special meaning in variable or function names; it can be read like an underscore.
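As a minimal sketch of entering a small data set by hand, the following builds a data frame with data.frame(). The column names and values are made up for illustration.

```r
# Hypothetical example data: three stock symbols and prices
stocks <- data.frame(
  symbol = c("MMM", "AA", "IBM"),
  price = c(150.2, 38.5, 180.1),
  stringsAsFactors = FALSE
)

# "." in a name is an ordinary character, used here like an underscore
avg.price <- mean(stocks$price)
print(stocks)
```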
Create Data Edit data Use the edit() function; its output must be assigned to a variable to keep the result. Use the fix() function, which assigns the result back to the same variable. Or use the "Data editor" feature in the GUI, which calls fix() on the object. There are no undo, redo or save options.
Import Data Import data from external files Delimited text files
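A small sketch of reading a delimited text file with read.table(). To keep it self-contained, it first writes a hypothetical tab-delimited file to a temporary location; with a real file you would pass its path instead.

```r
# Create a hypothetical tab-delimited file so the example can run anywhere
tmp <- tempfile(fileext = ".txt")
writeLines(c("symbol\tprice", "MMM\t150.2", "AA\t38.5"), tmp)

# header = TRUE: first line holds column names; sep = "\t": tab delimiter
quotes <- read.table(tmp, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
print(quotes)
```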
Import Data Import data from external files Other options csv files From a url, e.g. http://download.finance.yahoo.com/d/quotes.csv?s=MMM,AA&f=aboyk Use data retrieval packages, e.g. “quantmod” package for finance data See file dowGetData.R, get.multiple.quotes.R
Export Data Export data (usually data frames and matrices) as text files write.table(), write.csv(), write.csv2(), …
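A minimal sketch of exporting a data frame with write.csv() and reading it back to check the result. The data and the temporary file path are placeholders.

```r
# Hypothetical data frame to export
quotes_out <- data.frame(symbol = c("MMM", "AA"), price = c(150.2, 38.5))

out_file <- tempfile(fileext = ".csv")
# row.names = FALSE avoids writing an extra column of row numbers
write.csv(quotes_out, out_file, row.names = FALSE)

# Read the file back to confirm what was written
back <- read.csv(out_file, stringsAsFactors = FALSE)
print(back)
```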
Combine Data “I’d estimate that 80% of the effort on a typical project is spent on finding, cleaning and preparing data for analysis. Less than 5% of the effort is devoted to analysis. (The rest of the time is spent on writing up what you did.)” - Joseph Adler, “R in a Nutshell” Combining Data Sets Data files are often stored at different locations. paste(): concatenate multiple vectors, element by element, into a single character vector
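A short sketch of paste() on made-up vectors. It shows element-wise concatenation, a custom separator, and collapsing a vector into one string.

```r
first <- c("John", "Jane")
last <- c("Smith", "Doe")

# Element-wise concatenation, separated by a space by default
full <- paste(first, last)

# sep controls the separator; shorter arguments are recycled
ids <- paste("id", 1:3, sep = "_")

# collapse joins all elements into a single string
one <- paste(first, collapse = ", ")
```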
Combine Data cbind(): combine objects by adding columns
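A minimal sketch of cbind() placing two hypothetical data frames side by side. Both must have the same number of rows.

```r
symbols <- data.frame(symbol = c("MMM", "AA"))
prices <- data.frame(price = c(150.2, 38.5))

# Columns of 'prices' are appended to the right of 'symbols'
combined <- cbind(symbols, prices)
print(combined)
```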
Combine Data rbind(): combine objects by adding rows
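A minimal sketch of rbind() stacking two hypothetical data frames. The column names must match.

```r
a <- data.frame(symbol = "MMM", price = 150.2)
b <- data.frame(symbol = "AA", price = 38.5)

# Rows of 'b' are appended below the rows of 'a'
both <- rbind(a, b)
print(both)
```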
Combine Data merge(): merge.R
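The course file merge.R is not reproduced here. As an illustrative sketch with made-up data, merge() joins two data frames on a shared key column, much like a database join.

```r
prices <- data.frame(symbol = c("MMM", "AA", "IBM"),
                     price = c(150.2, 38.5, 180.1))
sectors <- data.frame(symbol = c("AA", "MMM"),
                      sector = c("Materials", "Industrials"))

# Default: keep only keys present in both data frames (IBM is dropped)
merged <- merge(prices, sectors, by = "symbol")

# all.x = TRUE keeps every row of the first data frame; missing values become NA
merged_all <- merge(prices, sectors, by = "symbol", all.x = TRUE)
```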
Transformation Reassign variables and generate new columns Note: dow30_2.csv is one of the output files of the “quantmod” example on slide 5, with adjusted file name and column names Create a new field
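A small sketch of reassigning a column and creating a new field with the $ operator. The data frame below is a hypothetical stand-in, not the contents of dow30_2.csv.

```r
# Hypothetical stand-in data; column names are assumptions
dow <- data.frame(symbol = c("MMM", "AA"),
                  open = c(150, 38),
                  close = c(153, 37.05))

# Create a new field from existing columns
dow$change <- dow$close - dow$open

# Reassign an existing column
dow$symbol <- tolower(dow$symbol)
```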
Transformation Use the “transform” function Specify a data frame and a set of expressions that use variables within the data frame
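A minimal sketch of transform() on made-up data. Each expression can refer to the data frame's columns by name, without the dow$ prefix.

```r
dow <- data.frame(open = c(150, 40), close = c(153, 38))

# Both new columns are computed from the original columns
dow <- transform(dow,
                 change = close - open,
                 pct = 100 * (close - open) / open)
print(dow)
```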
Transformation Applying a Function to Each Element of an Object When transforming data, one common operation is to apply a function to a set of objects (or each part of a composite object) and return a new set of objects (or a new composite object). The base R library includes a set of different functions for doing this. Applying a function to an array The apply() function accepts three arguments: X is the array to which the function is applied, MARGIN specifies the dimension over which to apply it (1 for rows, 2 for columns), and FUN specifies the function. FUN can also be a function you define yourself.
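A short sketch of apply() on a small matrix, using MARGIN = 1 for rows, MARGIN = 2 for columns, and a user-defined function.

```r
# 2 x 3 matrix filled column by column: rows are (1, 3, 5) and (2, 4, 6)
m <- matrix(1:6, nrow = 2)

row_sums <- apply(m, 1, sum)   # one value per row
col_max <- apply(m, 2, max)    # one value per column

# A user-defined function: the spread (max - min) of each column
col_spread <- apply(m, 2, function(x) max(x) - min(x))
```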
Transformation Applying a Function to a List or Vector - lapply() list data type (http://www.r-tutor.com/r-introduction/list) Apply to a list and return a list Apply to a vector and also return a list (use sapply() to get a vector back)
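A minimal sketch of lapply() on a list and on a vector. Note that lapply() returns a list in both cases; sapply() is shown as the variant that simplifies the result to a vector where possible.

```r
lst <- list(a = 1:3, b = c(10, 20))

# Applied to a list: returns a list with the same names
means <- lapply(lst, mean)

# Applied to a vector: still returns a list
squares <- lapply(1:3, function(x) x^2)

# sapply() simplifies the result to a vector
sq_vec <- sapply(1:3, function(x) x^2)
```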
Subsets Bracket Notation Use a simple expression describing the set of rows to select from a data frame as an index Subset function as an alternative to bracket notation subset(dataset, rowexpression, columnexpression)
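A short sketch comparing bracket notation with subset() on made-up data. Both select the rows with a price above 100 and keep two columns.

```r
dow <- data.frame(symbol = c("MMM", "AA", "IBM"),
                  price = c(150.2, 38.5, 180.1),
                  volume = c(100, 200, 300))

# Bracket notation: logical row index, then column names
high_b <- dow[dow$price > 100, c("symbol", "price")]

# subset(): row expression and column selection use bare column names
high_s <- subset(dow, price > 100, select = c(symbol, price))
```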
Binning Data cut()
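A minimal sketch of cut(), which assigns each numeric value to an interval (bin) and returns a factor. The break points and labels below are arbitrary examples.

```r
ages <- c(5, 17, 30, 64, 80)

# Intervals are (0,18], (18,65], (65,Inf] by default (right-closed)
groups <- cut(ages,
              breaks = c(0, 18, 65, Inf),
              labels = c("child", "adult", "senior"))
print(table(groups))
```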
Sampling Data Combine a set of vectors or data frames
Sampling Data Random Sampling Use the sample() function, specifying the values to draw from and the sample size
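A small sketch of sample() on made-up data. set.seed() is used only to make the draws repeatable; the seed value is arbitrary.

```r
set.seed(42)  # arbitrary seed so the example is reproducible

# Draw 5 distinct values from 1..100 (without replacement by default)
picks <- sample(1:100, size = 5)

# Draw with replacement, e.g. 10 coin flips
flips <- sample(c("H", "T"), size = 10, replace = TRUE)

# Randomly sample rows of a data frame by sampling row indices
dow <- data.frame(symbol = c("MMM", "AA", "IBM"),
                  price = c(150.2, 38.5, 180.1))
some_rows <- dow[sample(nrow(dow), 2), ]
```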
Summarizing Functions tapply(X=…, INDEX=…, FUN=, …) Summarizes X by applying FUN to each subset of X specified by INDEX
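A minimal sketch of tapply() on made-up vectors, computing the mean price within each group.

```r
price <- c(10, 20, 30, 40)
sector <- c("A", "B", "A", "B")

# Split 'price' by 'sector' and apply mean() to each group
avg <- tapply(X = price, INDEX = sector, FUN = mean)
print(avg)
```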
Summarizing Functions aggregate(x=…, by=…, FUN=, …) Similar to tapply(), but works on data frames rowsum(x, group=…) Similar, but applies only the sum function
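A short sketch of aggregate() and rowsum() on a hypothetical sales table. Both compute per-region totals here; aggregate() accepts any summary function, while rowsum() only sums.

```r
sales <- data.frame(region = c("E", "W", "E", "W"),
                    units = c(1, 2, 3, 4),
                    revenue = c(10, 20, 30, 40))

# 'by' must be a list; naming its element names the grouping column
agg <- aggregate(x = sales[, c("units", "revenue")],
                 by = list(region = sales$region),
                 FUN = sum)

# rowsum(): group totals, with the groups as row names
rs <- rowsum(sales[, c("units", "revenue")], group = sales$region)
```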
Counting Values tabulate()
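A minimal sketch of tabulate(), which counts how often each positive integer 1, 2, 3, … occurs in a vector. The input values are arbitrary.

```r
x <- c(1, 2, 2, 5)

# Position i of the result holds the count of value i
counts <- tabulate(x)
print(counts)
```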
Counting Values table() function for categorical values
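A small sketch of table() counting categorical values, first for one made-up variable and then as a cross-tabulation of two.

```r
sector <- c("Tech", "Energy", "Tech", "Tech")
rating <- c("buy", "buy", "sell", "buy")

# Counts per category
one_way <- table(sector)

# Cross-tabulation: rows are sectors, columns are ratings
two_way <- table(sector, rating)
print(two_way)
```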
Reshaping Data transpose
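A minimal sketch of transposing with t(), which swaps the rows and columns of a matrix (a data frame is converted to a matrix first).

```r
m <- matrix(1:6, nrow = 2)  # 2 rows, 3 columns

tm <- t(m)                  # 3 rows, 2 columns
print(tm)
```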
Reshaping Data
Reshaping Data unstack() Change the format of a data frame from a stacked form to an unstacked form The “form” argument specifies a formula. The left side of ~ represents the vector to be unstacked. The right side of ~ indicates the groups to create
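A short sketch of unstack() on a made-up stacked data frame. In the formula values ~ ind, the column to be unstacked (values) is on the left of ~ and the grouping column (ind) is on the right.

```r
stacked <- data.frame(values = c(1, 2, 3, 4),
                      ind = factor(c("a", "a", "b", "b")))

# One output column per group in 'ind'
wide <- unstack(stacked, form = values ~ ind)
print(wide)
```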
Reshaping Data reshape() Specify row IDs and expand values to columns
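A minimal sketch of reshape() going from long to wide format on made-up data. idvar names the row ID column and timevar names the column whose values become new columns.

```r
long <- data.frame(id = c(1, 1, 2, 2),
                   year = c(2020, 2021, 2020, 2021),
                   value = c(10, 11, 20, 21))

# One row per id; one 'value' column per year
wide <- reshape(long, idvar = "id", timevar = "year", direction = "wide")
print(wide)
```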
Sorting Sort a single vector Order a data frame
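A small sketch of sort() for a single vector and order() for a data frame, using made-up data. order() returns the row indices in sorted order, which are then used to index the data frame.

```r
x <- c(3, 1, 2)
asc <- sort(x)
desc <- sort(x, decreasing = TRUE)

dow <- data.frame(symbol = c("MMM", "AA", "IBM"),
                  price = c(150.2, 38.5, 180.1))

# Reorder the rows by price, ascending and descending
by_price <- dow[order(dow$price), ]
by_price_desc <- dow[order(dow$price, decreasing = TRUE), ]
```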
Data Cleaning Identifying problems caused by data collection, processing and storage, and modifying the data so that these problems don’t interfere with analysis, e.g. duplicate patient records, incorrect credit scores (outside of the 340 – 840 range), null values Can be achieved through built-in functions or programming methods
Data Cleaning Finding and Removing Duplicates
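A minimal sketch of finding and removing duplicate rows with duplicated() and unique(). The patient records are invented for illustration.

```r
patients <- data.frame(id = c(1, 2, 2, 3),
                       name = c("Ann", "Bob", "Bob", "Cy"))

# TRUE for each row that repeats an earlier row
dup_flags <- duplicated(patients)

# Two equivalent ways to drop the repeats
deduped <- patients[!dup_flags, ]
same <- unique(patients)
```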
Data Cleaning Using programming methods to remove rows that contain invalid or null values E.g. using NationalSalaries.xlsx, write a program to remove rows that have null values and rows that hold summarized data, e.g. major groups, all occupations.
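One possible approach, sketched on a hypothetical stand-in for the spreadsheet: the column names (occupation, group, mean_wage), the group codes and all values below are assumptions, not the actual layout of NationalSalaries.xlsx. Rows with missing values are found with complete.cases(), and summary rows are filtered out by their group code.

```r
# Hypothetical stand-in data; real column names and codes may differ
salaries <- data.frame(
  occupation = c("All Occupations", "Management Occupations",
                 "Chief Executives", "Actuaries", "Surveyors"),
  group = c("total", "major", "detailed", "detailed", "detailed"),
  mean_wage = c(50000, 110000, 180000, NA, 60000),
  stringsAsFactors = FALSE
)

# Keep rows with no missing values that are not summary rows
keep <- complete.cases(salaries) & salaries$group == "detailed"
clean <- salaries[keep, ]
print(clean)
```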