CSCI N317 Computation for Scientific Applications Unit R

Presentation transcript:

CSCI N317 Computation for Scientific Applications Unit 2 - 2 R Data processing

Create Data
Use R commands to enter data directly. This works well for small amounts of data.
About data frames: http://www.r-tutor.com/r-introduction/data-frame
Note: a “.” in a variable or function name can be read like an underscore; it is simply part of the name.
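
A minimal sketch of entering a small data set with R commands. The vectors and column names are made up for illustration and are not taken from the slides.

```r
# Hypothetical data: three short vectors, one per column.
student.id   <- c(1, 2, 3)
student.name <- c("Ann", "Bob", "Cy")
passed       <- c(TRUE, FALSE, TRUE)

# data.frame() combines equal-length vectors into columns.
students <- data.frame(id = student.id, name = student.name, passed = passed)

# str() shows the column names and types.
str(students)
```

Note that student.id and student.name use "." inside the name, as described above.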

Create Data
Edit data:
Use the edit() function. It returns the edited copy, so you must assign its output to a variable to keep the result.
Use the fix() function. It assigns the result back to the same variable.
Or use the “Data editor” feature in the GUI, which calls fix() on the object. It has no undo, redo or save options.

Import Data
Import data from external files: delimited text files.
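
A minimal sketch of reading a delimited text file with read.table(). The file is created in a temporary location so the example is self-contained; its contents are made up for illustration.

```r
# Write a small tab-delimited file to a temporary path (illustrative data).
path <- tempfile(fileext = ".txt")
writeLines(c("id\tname\tscore", "1\tAnn\t90", "2\tBob\t85"), path)

# header = TRUE takes column names from the first line;
# sep = "\t" declares the tab delimiter.
scores <- read.table(path, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
print(scores)
```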

Import Data
Import data from external files: other options.
CSV files.
From a URL, e.g. http://download.finance.yahoo.com/d/quotes.csv?s=MMM,AA&f=aboyk
Data retrieval packages, e.g. the “quantmod” package for finance data.
See the files dowGetData.R and get.multiple.quotes.R.
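
A minimal sketch of reading CSV data with read.csv(). To stay self-contained it parses an in-memory string through the text argument; the column names and values are assumptions for illustration and do not reproduce the output of the URL above. read.csv() also accepts a file path or a URL as its first argument.

```r
# Illustrative CSV content (made-up columns and values).
csv.text <- "symbol,ask,bid\nMMM,101.6,101.4\nAA,9.9,9.7"

# text = ... parses a character string instead of a file.
quotes <- read.csv(text = csv.text, stringsAsFactors = FALSE)
print(quotes)
```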

Export Data
Export data (usually data frames and matrices) as text files with write.table(), write.csv(), write.csv2(), …
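
A minimal sketch of exporting a data frame with write.csv() and reading it back to check the round trip. The data frame is made up for illustration. write.csv2() works the same way but uses a semicolon separator and a comma as the decimal mark.

```r
# Hypothetical data frame to export.
df <- data.frame(ticker = c("MMM", "AA"), close = c(101.5, 9.8))

# row.names = FALSE avoids writing the row numbers as an extra column.
out <- tempfile(fileext = ".csv")
write.csv(df, out, row.names = FALSE)

# Read the file back to confirm what was written.
back <- read.csv(out, stringsAsFactors = FALSE)
print(back)
```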

Combine Data
“I’d estimate that 80% of the effort on a typical project is spent on finding, cleaning and preparing data for analysis. Less than 5% of the effort is devoted to analysis. (The rest of the time is spent on writing up what you did.)” - Joseph Adler, “R in a Nutshell”
Combining Data Sets
Data files are stored at different locations.
paste(): concatenate multiple vectors, element by element, into a single character vector.
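
A minimal sketch of paste() on two made-up vectors. The sep argument controls what goes between paired elements; collapse joins the elements of one vector into a single string.

```r
first  <- c("A", "B", "C")
second <- c(1, 2, 3)

# Element-wise concatenation; numbers are converted to character.
labels <- paste(first, second, sep = "-")   # "A-1" "B-2" "C-3"

# collapse = "" glues all elements of one vector together.
joined <- paste(first, collapse = "")       # "ABC"
```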

Combine Data
cbind(): combine objects by adding columns.
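
A minimal sketch of cbind() with two made-up data frames that have the same number of rows.

```r
left  <- data.frame(id = 1:3, x = c(10, 20, 30))
right <- data.frame(y = c("a", "b", "c"))

# Columns of 'right' are appended to the columns of 'left'.
wide <- cbind(left, right)
print(wide)
```

cbind() pairs rows by position only; it does not match on an ID column.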

Combine Data
rbind(): combine objects by adding rows.
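
A minimal sketch of rbind() with two made-up data frames that share the same column names.

```r
top    <- data.frame(id = 1:2, x = c(10, 20))
bottom <- data.frame(id = 3:4, x = c(30, 40))

# Rows of 'bottom' are appended below the rows of 'top'.
tall <- rbind(top, bottom)
print(tall)
```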

Combine Data
merge(): combine data frames by matching values in common columns. See merge.R.
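
A minimal sketch of merge() on a shared key column. This is not the content of merge.R; the data frames and the company names are made up for illustration.

```r
prices <- data.frame(symbol = c("MMM", "AA", "IBM"),
                     close  = c(101.5, 9.8, 130.2))
firms  <- data.frame(symbol  = c("AA", "MMM"),
                     company = c("Alcoa", "3M"))

# Default behaviour: keep only keys present in both data frames,
# sorted by the key column.
merged <- merge(prices, firms, by = "symbol")

# all.x = TRUE keeps every row of 'prices'; unmatched rows get NA.
merged.left <- merge(prices, firms, by = "symbol", all.x = TRUE)
```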

Transformation
Reassign variables and generate new columns.
Note: dow30_2.csv is one of the output files of the “quantmod” example on slide 5, with adjusted file name and column names.
Create a new field.
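
A minimal sketch of creating a new field by assignment. It does not read dow30_2.csv; the data frame and its column names (High, Low) are assumptions for illustration.

```r
# Hypothetical stand-in for the stock data.
dow <- data.frame(symbol = c("MMM", "AA"),
                  High   = c(102, 10.5),
                  Low    = c(99, 9.5))

# Assigning to a column name that does not exist yet creates the column.
dow$mid <- (dow$High + dow$Low) / 2
print(dow)
```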

Transformation
Use the transform() function: specify a data frame and a set of expressions that use variables within the data frame.
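
A minimal sketch of transform() on a made-up data frame; the column names Open and Close are assumptions for illustration.

```r
dow <- data.frame(symbol = c("MMM", "AA"),
                  Open   = c(100, 10),
                  Close  = c(102, 9.5))

# Each named expression becomes a new column; the expressions can
# refer to existing columns by name without the dow$ prefix.
dow2 <- transform(dow,
                  change = Close - Open,
                  pct    = 100 * (Close - Open) / Open)
print(dow2)
```

transform() returns a modified copy, so the original dow is unchanged.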

Transformation
Applying a Function to Each Element of an Object
When transforming data, one common operation is to apply a function to a set of objects (or each part of a composite object) and return a new set of objects (or a new composite object). Base R includes a set of functions for doing this.
Applying a function to an array
The apply() function accepts three arguments: X is the array to which a function is applied, MARGIN specifies the dimensions over which to apply the function (1 for rows, 2 for columns), and FUN specifies the function. FUN can also be a function you define yourself.
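
A minimal sketch of apply() on a small made-up matrix, using a built-in function and a user-defined one.

```r
m <- matrix(1:6, nrow = 2)   # rows are (1, 3, 5) and (2, 4, 6)

row.max <- apply(m, 1, max)  # MARGIN = 1: one result per row
col.sum <- apply(m, 2, sum)  # MARGIN = 2: one result per column

# FUN can be your own function; here, the spread of each column.
col.spread <- apply(m, 2, function(v) max(v) - min(v))
```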

Transformation
Applying a Function to a List or Vector - lapply()
The list data type: http://www.r-tutor.com/r-introduction/list
Applied to a list, lapply() returns a list.
Applied to a vector, lapply() also returns a list; use sapply() to get the result simplified to a vector.
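
A minimal sketch contrasting lapply() and sapply() on made-up inputs. lapply() always returns a list; sapply() simplifies to a vector when every result has length one.

```r
nums <- list(a = 1:3, b = c(10, 20))

means.list <- lapply(nums, mean)   # list with elements a and b
means.vec  <- sapply(nums, mean)   # named numeric vector

# Applied to a plain vector, lapply() still returns a list.
squares <- lapply(1:3, function(x) x^2)
```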

Subsets
Bracket notation: use a simple expression describing the set of rows to select from a data frame as an index.
The subset() function is an alternative to bracket notation: subset(dataset, rowexpression, columnexpression)
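
A minimal sketch showing the same selection written with bracket notation and with subset(). The data frame is made up for illustration.

```r
dow <- data.frame(symbol = c("MMM", "AA", "IBM"),
                  Close  = c(101.5, 9.8, 130.2),
                  Volume = c(300, 900, 500))

# Bracket notation: logical row index, then a vector of column names.
big.bracket <- dow[dow$Close > 100, c("symbol", "Close")]

# subset(): the row expression and select can use bare column names.
big.subset <- subset(dow, Close > 100, select = c(symbol, Close))
```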

Binning Data
cut(): divide a numeric vector into intervals (bins) and return a factor.
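
A minimal sketch of cut() with made-up ages and break points. By default each interval is open on the left and closed on the right, so 18 would fall in the first bin.

```r
ages <- c(5, 17, 30, 64, 80)

# Three bins: (0,18], (18,65], (65,Inf], each given a label.
groups <- cut(ages,
              breaks = c(0, 18, 65, Inf),
              labels = c("child", "adult", "senior"))

# Count how many values fall in each bin.
print(table(groups))
```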

Sampling Data Combine a set of vectors or data frames

Sampling Data
Random sampling: use the sample() function, specifying the values to draw from and the sample size.
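
A minimal sketch of sample() on a vector and on the rows of a made-up data frame. set.seed() is used only to make the run repeatable.

```r
set.seed(1)  # fix the random number generator for repeatability

# Draw 5 distinct values from 1..100 (sampling without replacement).
picked <- sample(1:100, size = 5)

# Sample rows of a data frame by sampling row indices.
dow <- data.frame(symbol = c("MMM", "AA", "IBM"),
                  Close  = c(101.5, 9.8, 130.2))
some.rows <- dow[sample(nrow(dow), 2), ]
```

Pass replace = TRUE to allow the same value to be drawn more than once.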

Summarizing Functions
tapply(X=…, INDEX=…, FUN=, …)
Summarizes X by applying the function FUN to each subset of X specified by INDEX.
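
A minimal sketch of tapply() computing a mean per group. The values and group labels are made up for illustration.

```r
close  <- c(10, 20, 30, 40)
sector <- c("tech", "tech", "energy", "energy")

# One mean per distinct value of 'sector'.
avg <- tapply(X = close, INDEX = sector, FUN = mean)
print(avg)
```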

Summarizing Functions
aggregate(x=…, by=…, FUN=, …): similar to tapply(), but works on data frames.
rowsum(x, group=…): similar, but only applies the sum function.
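
A minimal sketch of aggregate() and rowsum() on the same made-up data frame, grouping by a sector column.

```r
df <- data.frame(sector = c("tech", "tech", "energy", "energy"),
                 close  = c(10, 20, 30, 40),
                 volume = c(1, 2, 3, 4))

# aggregate(): 'by' must be a list; naming it sets the group column name.
agg <- aggregate(x = df[, c("close", "volume")],
                 by = list(sector = df$sector),
                 FUN = mean)

# rowsum(): per-group sums; group labels become the row names.
sums <- rowsum(df[, c("close", "volume")], group = df$sector)
```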

Counting Values
tabulate(): count how many times each positive integer occurs in a vector.
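
A minimal sketch of tabulate() on a made-up integer vector. Position i of the result holds the count of the value i, so values that never occur get a zero.

```r
x <- c(1, 2, 2, 5)

# Counts for the values 1, 2, 3, 4, 5 in that order.
counts <- tabulate(x)
print(counts)
```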

Counting Values
table(): count occurrences of categorical values.
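
A minimal sketch of table() on a made-up character vector.

```r
color <- c("red", "blue", "red", "green", "red")

# One count per distinct value, labelled with the value.
counts <- table(color)
print(counts)
```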

Reshaping Data
Transpose with t(): swap the rows and columns of a matrix or data frame.
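
A minimal sketch of t() on a small made-up matrix. When t() is given a data frame, the result is a matrix.

```r
m  <- matrix(1:6, nrow = 2)  # 2 rows, 3 columns
tm <- t(m)                   # 3 rows, 2 columns

print(dim(tm))
```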

Reshaping Data

Reshaping Data
unstack(): change the format of a data frame from a stacked form to an unstacked form.
The “form” argument specifies a formula. The left side of ~ is the vector to be unstacked. The right side of ~ indicates the groups to create.
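
A minimal sketch of unstack() on a made-up stacked data frame. In the formula values ~ ind, values is the column being unstacked and ind names the groups that become columns.

```r
stacked <- data.frame(values = c(1, 2, 3, 4),
                      ind    = factor(c("a", "a", "b", "b")))

# One column per level of 'ind'.
wide <- unstack(stacked, form = values ~ ind)
print(wide)
```

stack(wide) goes the other way and returns the stacked form.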

Reshaping Data
reshape(): specify row IDs and expand values to columns.
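
A minimal sketch of reshape() going from long to wide format. The data frame and its column names are made up for illustration.

```r
long <- data.frame(id    = c(1, 1, 2, 2),
                   year  = c(2019, 2020, 2019, 2020),
                   close = c(10, 11, 20, 22))

# idvar identifies a row in the result; each value of timevar
# becomes its own column of 'close' values.
wide <- reshape(long, idvar = "id", timevar = "year", direction = "wide")
print(wide)
```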

Sorting
Sort a single vector with sort().
Order a data frame with order().
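
A minimal sketch of sort() on a vector and order() on a made-up data frame. order() returns row positions, which are then used as a row index.

```r
v <- c(3, 1, 2)
ascending  <- sort(v)
descending <- sort(v, decreasing = TRUE)

dow <- data.frame(symbol = c("MMM", "AA", "IBM"),
                  Close  = c(101.5, 9.8, 130.2))

# Reorder the rows by Close, lowest first, then highest first.
by.close      <- dow[order(dow$Close), ]
by.close.desc <- dow[order(-dow$Close), ]
```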

Data Cleaning
Data cleaning means identifying problems caused by data collection, processing and storage, and modifying the data so that these problems don’t interfere with analysis. Examples: duplicate patient records, incorrect credit scores (outside the 340 – 840 range), null values.
It can be achieved through functions or programming methods.

Data Cleaning
Finding and removing duplicates.
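
A minimal sketch of finding and removing duplicate rows with duplicated() and unique(). The patient records are made up for illustration.

```r
patients <- data.frame(id   = c(1, 2, 2, 3),
                       name = c("Ann", "Bob", "Bob", "Cy"))

# TRUE marks a row that repeats an earlier row.
dup <- duplicated(patients)

# Keep only the first occurrence of each row.
clean <- patients[!dup, ]

# unique() does the same in one step.
same <- unique(patients)
```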

Data Cleaning
Use programming methods to remove rows that contain invalid or null values.
E.g. using NationalSalaries.xlsx, write a program to remove rows that have null values and rows that hold summarized data, e.g. major groups, all occupations.
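
A minimal sketch of the kind of program described above. It does not read NationalSalaries.xlsx; the data frame, its column names (occupation, group, salary) and the group labels are assumptions standing in for that file.

```r
# Hypothetical stand-in for the salary data.
salaries <- data.frame(
  occupation = c("All Occupations", "Chemists", "Physicists", "Major group total"),
  group      = c("total", "detailed", "detailed", "major"),
  salary     = c(50000, 80000, NA, 60000),
  stringsAsFactors = FALSE
)

# Loop version: mark each row to keep if it has a salary
# and is not a summary row.
keep <- logical(nrow(salaries))
for (i in seq_len(nrow(salaries))) {
  keep[i] <- !is.na(salaries$salary[i]) && salaries$group[i] == "detailed"
}
cleaned <- salaries[keep, ]

# Vectorized version of the same filter.
cleaned2 <- salaries[complete.cases(salaries) & salaries$group == "detailed", ]
```

complete.cases() returns TRUE for rows with no missing values, so it covers the null check for every column at once.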