CSCI N317 Computation for Scientific Applications Unit R


1 CSCI N317 Computation for Scientific Applications Unit 2 - 2 R
Data processing

2 Create Data Use R commands to enter data directly. Good for small amounts of data.
About data frames. Note: “.” has no special meaning in variable or function names (e.g. data.frame) and can be read as an underscore.
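As a minimal sketch of entering a small amount of data by hand, the vectors and column names below are made up for illustration:

```r
# Hypothetical example data: three students and their scores.
student.name <- c("Ann", "Bob", "Cy")
student.score <- c(90, 85, 77)

# Combine the vectors into a data frame; each vector becomes a column.
scores <- data.frame(name = student.name, score = student.score)
print(scores)
```

Note that the "." in student.name is just part of the name, as the slide points out.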

3 Create Data Edit data
Use the edit() function; you must assign its output to a variable to keep the result.
Use the fix() function, which assigns the result back to the same variable.
Or use the “Data editor” feature in the GUI, which calls the fix() function on the object. There are no undo, redo or save options.

4 Import Data Import data from external files Delimited text files
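A possible way to read a delimited text file uses read.table(). The sketch below first writes a small tab-delimited file to a temporary path so that it runs on its own; the file contents and column names are hypothetical.

```r
# Write a small tab-delimited file to a temporary location (made-up data).
path <- tempfile(fileext = ".txt")
writeLines(c("symbol\tclose", "AAA\t10.5", "BBB\t20.25"), path)

# header = TRUE uses the first line as column names; sep sets the delimiter.
quotes <- read.table(path, header = TRUE, sep = "\t",
                     stringsAsFactors = FALSE)
print(quotes)
```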

5 Import Data Import data from external files Other options
csv files
From a URL
Use data retrieval packages, e.g. the “quantmod” package for finance data. See files dowGetData.R, get.multiple.quotes.R

6 Export Data Export data (usually data frames and matrices) as text files write.table(), write.csv(), write.csv2(), …
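For example, a data frame could be written out with write.csv() and read back to check the result. The data frame and the temporary file path here are assumptions made for illustration.

```r
# Made-up data frame to export.
prices <- data.frame(symbol = c("AAA", "BBB"), close = c(10.5, 20.25))

# row.names = FALSE keeps the automatic row numbers out of the file.
out <- tempfile(fileext = ".csv")
write.csv(prices, out, row.names = FALSE)

# Read the file back to confirm what was written.
back <- read.csv(out)
print(back)
```

write.csv2() works the same way but uses a semicolon separator and a decimal comma.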

7 Combine Data “I’d estimate that 80% of the effort on a typical project is spent on finding, cleaning and preparing data for analysis. Less than 5% of the effort is devoted to analysis. (The rest of the time is spent on writing up what you did.)” - Joseph Adler, “R in a Nutshell”
Combining Data Sets: data files are often stored at different locations and must be combined before analysis.
paste(): convert vectors to character and concatenate them element by element into a single character vector
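A short sketch of paste() on two made-up vectors shows the element-wise behaviour and the collapse argument:

```r
first <- c("a", "b", "c")
second <- 1:3

# Element-wise concatenation; the numbers are converted to character.
labels <- paste(first, second, sep = "-")   # "a-1" "b-2" "c-3"

# collapse joins all elements of the result into one string.
joined <- paste(first, collapse = "+")      # "a+b+c"
```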

8 Combine Data cbind(): combine objects by adding columns
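As an illustration of cbind(), assuming two hypothetical data frames with the same number of rows:

```r
top <- data.frame(id = 1:2, x = c(10, 20))
extra <- data.frame(y = c("a", "b"))

# The columns of 'extra' are appended to the right of 'top'.
wide <- cbind(top, extra)
print(wide)
```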

9 Combine Data rbind(): combine objects by adding rows
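As an illustration of rbind(), assuming two hypothetical data frames with the same column names:

```r
first.batch <- data.frame(id = 1:2, x = c(10, 20))
second.batch <- data.frame(id = 3:4, x = c(30, 40))

# The rows of 'second.batch' are appended below 'first.batch'.
all.rows <- rbind(first.batch, second.batch)
print(all.rows)
```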

10 Combine Data merge(): merge.R
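The file merge.R is not reproduced in this transcript. As a stand-in, here is a minimal sketch of merge() on two made-up data frames that share a "symbol" column:

```r
prices <- data.frame(symbol = c("AAA", "BBB", "CCC"),
                     close = c(10, 20, 30))
companies <- data.frame(symbol = c("AAA", "BBB", "DDD"),
                        company = c("Alpha", "Beta", "Delta"))

# Default: keep only symbols present in both data frames.
inner <- merge(prices, companies, by = "symbol")

# all.x = TRUE keeps every row of 'prices'; unmatched fields become NA.
left <- merge(prices, companies, by = "symbol", all.x = TRUE)
```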

11 Transformation Reassign variables and generate new columns
Note: dow30_2.csv is one of the output files of the “quantmod” example on slide 5, with adjusted file name and column names Create a new field
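The contents of dow30_2.csv are not shown in the transcript, so the sketch below uses a small made-up data frame with assumed column names (Open, Close) to show how a new field might be created by assignment:

```r
# Stand-in for the real file; column names are assumptions.
dow <- data.frame(symbol = c("AAA", "BBB"),
                  Open = c(10, 20),
                  Close = c(11, 19))

# Assigning to a column that does not exist yet creates it.
dow$change <- dow$Close - dow$Open
print(dow)
```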

12 Transformation Use the “transform” function
Specify a data frame and a set of expressions that use variables within the data frame
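A minimal sketch of transform(), again on a made-up data frame with assumed column names (High, Low):

```r
dow <- data.frame(symbol = c("AAA", "BBB"),
                  High = c(12, 22),
                  Low = c(10, 18))

# Expressions refer to columns directly, without the dow$ prefix.
dow <- transform(dow, mid = (High + Low) / 2, range = High - Low)
print(dow)
```

transform() returns a new data frame, so the result has to be assigned to keep it.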

13 Transformation Applying a Function to Each Element of an Object
When transforming data, one common operation is to apply a function to a set of objects (or each part of a composite object) and return a new set of objects (or a new composite object). The base R library includes a set of different functions for doing this.
Applying a function to an array: the apply() function accepts three arguments. X is the array to which a function is applied, MARGIN specifies the dimensions over which to apply the function (1 for rows, 2 for columns), and FUN specifies the function. You can also define your own function.
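The three arguments of apply() can be sketched on a small matrix; the helper spread() is a hypothetical user-defined function added for illustration:

```r
# matrix() fills by column, so the rows are (1, 3, 5) and (2, 4, 6).
m <- matrix(1:6, nrow = 2)

row.totals <- apply(m, 1, sum)   # MARGIN = 1: one result per row
col.max <- apply(m, 2, max)      # MARGIN = 2: one result per column

# A user-defined function works the same way as a built-in one.
spread <- function(v) max(v) - min(v)
row.spread <- apply(m, 1, spread)
```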

14 Transformation Applying a Function to a List or Vector - lapply()
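A list is a generic vector whose elements can have different types.
lapply() applied to a list returns a list.
lapply() applied to a vector also returns a list; use sapply() to simplify the result to a vector.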
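A short sketch with made-up inputs shows what lapply() returns for a list and for a vector, with sapply() added for comparison:

```r
values <- list(a = 1:3, b = c(10, 20))

# Applied to a list: the result is a list with the same names.
means <- lapply(values, mean)

# Applied to a vector: the result is still a list, one element per input.
squares.list <- lapply(1:3, function(x) x^2)

# sapply() simplifies the same result to a plain vector.
squares.vec <- sapply(1:3, function(x) x^2)
```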

15 Subsets Bracket Notation
Use a logical expression describing the set of rows to select from a data frame as an index, e.g. dataset[rowexpression, ].
The subset() function is an alternative to bracket notation: subset(dataset, rowexpression, columnexpression)
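Both notations can be compared on a made-up data frame; the column names are assumptions:

```r
dow <- data.frame(symbol = c("AAA", "BBB", "CCC"),
                  close = c(10, 25, 40),
                  volume = c(100, 200, 300))

# Bracket notation: a logical row index, then the columns to keep.
by.bracket <- dow[dow$close > 20, c("symbol", "close")]

# subset(): column names can be used without the dow$ prefix.
by.subset <- subset(dow, close > 20, select = c(symbol, close))
```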

16 Binning Data cut()
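As a rough sketch, cut() can turn a numeric vector into labeled bins. The ages, break points and labels below are made up:

```r
ages <- c(5, 17, 30, 64, 80)

# By default each bin is open on the left and closed on the right,
# so the bins are (0,18], (18,65] and (65,Inf].
groups <- cut(ages, breaks = c(0, 18, 65, Inf),
              labels = c("child", "adult", "senior"))
print(table(groups))
```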

17 Sampling Data Combine a set of vectors or data frames

18 Sampling Data Random Sampling
Use the sample() function, specifying the values to draw from and the sample size
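A minimal sketch of random sampling, both from a vector of values and from the rows of a made-up data frame:

```r
set.seed(42)  # makes the random draws repeatable

# Five values from 1..100, without replacement by default.
draw <- sample(1:100, size = 5)

# Sampling rows: draw row numbers, then use them as a row index.
records <- data.frame(id = 1:10, value = (1:10) * 2)
picked <- records[sample(nrow(records), size = 3), ]
```

Passing replace = TRUE allows the same value to be drawn more than once.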

19 Summarizing Functions
tapply(X=…, INDEX=…, FUN=, …) Summarizes X: splits X into the subsets specified by INDEX and applies the function FUN to each subset
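For instance, with made-up scores grouped by team:

```r
score <- c(90, 80, 70, 60)
team <- c("red", "blue", "red", "blue")

# One mean per distinct value of 'team'.
team.mean <- tapply(X = score, INDEX = team, FUN = mean)
print(team.mean)
```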

20 Summarizing Functions
aggregate(x=…, by=…, FUN=, …) Similar to tapply(), but works on data frames
rowsum(x, group=…) Similar, but applies only the sum function
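Both functions can be sketched on the same made-up sales data; the column names are assumptions:

```r
sales <- data.frame(region = c("east", "west", "east", "west"),
                    units = c(1, 2, 3, 4),
                    revenue = c(10, 20, 30, 40))

# aggregate(): 'by' must be a list; its name becomes the group column.
totals <- aggregate(x = sales[, c("units", "revenue")],
                    by = list(region = sales$region),
                    FUN = sum)

# rowsum(): sums only; the groups appear as row names of the result.
sums <- rowsum(sales[, c("units", "revenue")], group = sales$region)
```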

21 Counting Values tabulate()

22 Counting Values table() function for categorical values
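A small sketch of the two counting functions on made-up values: table() counts each distinct category, while tabulate() counts how often each positive integer 1, 2, ..., nbins occurs.

```r
colors <- c("red", "blue", "red", "red")

# table(): one count per distinct value, labeled by the value.
counts <- table(colors)

# tabulate(): position i of the result holds the number of i's.
bins <- tabulate(c(1, 3, 3, 5), nbins = 5)
```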

23 Reshaping Data transpose
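Transposing is done with t() in base R. A minimal sketch on a small matrix:

```r
m <- matrix(1:6, nrow = 2)   # 2 rows, 3 columns

# t() swaps rows and columns, giving 3 rows and 2 columns.
tm <- t(m)
print(tm)
```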

24 Reshaping Data

25 Reshaping Data unstack()
Change the format of a data frame from a stacked form to an unstacked form
The “form” argument specifies a two-sided formula. The left side of ~ is the vector to be unstacked. The right side of ~ indicates the groups to create
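A minimal sketch of unstack() on a made-up stacked data frame; the column names values and ind are assumptions chosen to match what stack() produces:

```r
long <- data.frame(values = c(1, 2, 3, 4),
                   ind = factor(c("a", "a", "b", "b")))

# Left of ~ : the vector to unstack. Right of ~ : the grouping column.
wide <- unstack(long, form = values ~ ind)
print(wide)
```

Because both groups have the same number of values, the result is a data frame with one column per group.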

26 Reshaping Data reshape() Specify row IDs and expand values to columns
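One way to use reshape() for this is sketched below on made-up data, where "id" is the row ID and each value of "year" becomes its own column:

```r
long <- data.frame(id = c(1, 1, 2, 2),
                   year = c(2020, 2021, 2020, 2021),
                   value = c(5, 6, 7, 8))

# idvar identifies a row in the wide result; timevar supplies the
# suffix for the new columns (value.2020, value.2021).
wide <- reshape(long, idvar = "id", timevar = "year",
                direction = "wide")
print(wide)
```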

27 Sorting Sort a single vector Order a data frame
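A short sketch of both cases, using sort() for a vector and order() for a made-up data frame:

```r
x <- c(3, 1, 2)
ascending <- sort(x)
descending <- sort(x, decreasing = TRUE)

# order() returns row positions, which are then used as a row index.
results <- data.frame(name = c("c", "a", "b"), score = c(70, 90, 80))
by.score <- results[order(results$score, decreasing = TRUE), ]
```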

28 Data Cleaning Identifying problems caused by data collection, processing and storage processes and modifying the data so that these problems don’t interfere with analysis, e.g. duplicate patient records, incorrect credit scores (outside of the 340 – 840 range), null values.
Can be achieved through built-in functions or programming methods

29 Data Cleaning Finding and Removing Duplicates
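In base R, duplicated() and unique() are the usual tools for this. A minimal sketch on made-up patient records:

```r
patients <- data.frame(id = c(1, 2, 2, 3),
                       name = c("Ann", "Bob", "Bob", "Cy"))

# TRUE marks a row that repeats an earlier row.
dup.flags <- duplicated(patients)

# Keep only the first occurrence of each row.
clean <- patients[!dup.flags, ]

# unique() removes the duplicate rows in one step.
same <- unique(patients)
```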

30 Data Cleaning Use programming methods to remove rows that contain invalid or null values.
E.g. using NationalSalaries.xlsx, write a program to remove rows that have null values and rows that are summarized data, e.g. major groups, all occupations.
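NationalSalaries.xlsx is not included in this transcript, so the sketch below uses a small made-up data frame. The column names (occupation, group, mean.wage) and the group labels are assumptions, not the real layout of the file.

```r
# Hypothetical stand-in for the spreadsheet contents.
salaries <- data.frame(
  occupation = c("All Occupations", "Management Occupations",
                 "Chief Executives", "Financial Managers"),
  group = c("total", "major", "detailed", "detailed"),
  mean.wage = c(50000, 110000, NA, 130000),
  stringsAsFactors = FALSE
)

# Step 1: drop rows that contain any missing (NA) value.
no.missing <- salaries[complete.cases(salaries), ]

# Step 2: drop summary rows, keeping only detailed occupations.
detail.only <- no.missing[no.missing$group == "detailed", ]
print(detail.only)
```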

