Getting your data into R

Getting your data into R
Yesterday you learned how to:
* Start RStudio and set up a project
* Prepare and run R script files
* Explain what an R package is, and install and load packages
* Load a simple data file
* Understand how (& why) R uses dataframes & vectors
* Prepare simple tables
* Produce good quality graphs, and save them
* Locate and perform some statistical tests on your data
* Prepare a simple function

Getting your data into R
Download -> CSV -> data.frame -> data cleaning
Later: EDA
At the end of this session, you will be able to:
* Upload, download, create or import data into R
* Manipulate large datasets as tables in R
* Explore datasets using multiple approaches
* Test for missing, partial or inconsistent data
* Summarise and compare datasets with R

CSV data
First line = names of variables, separated by commas
Variables = proper numbers or plain text; no spaces, funny characters or punctuation
Data = mix of numbers and grouping variables; a time sequence is OK, dates and times are not
Missing data = represented by the two letters NA and nothing else; no dashes, no 999, 77, 88 or anything else
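A minimal sketch of reading a file that follows these rules, assuming base R's read.csv. The file contents and column names are made up for illustration. na.strings is set explicitly so that only the two letters NA are treated as missing.

```r
# Write a small, made-up CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "id,group,weight_kg",
  "1,control,61.5",
  "2,treated,NA",
  "3,treated,70.2"
), csv_path)

# First line becomes the variable names; only the literal NA counts as missing
dat <- read.csv(csv_path, header = TRUE, na.strings = "NA",
                stringsAsFactors = FALSE)

str(dat)              # check that each column has the type you expect
colSums(is.na(dat))   # count missing values per column
```

If a file codes missing values some other way (for example 999), it is usually safer to fix the file or list those codes in na.strings than to clean them up afterwards.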


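Data input and output
R comes with several pre-packaged datasets
You can access these datasets with the data function
e.g. Davis 1990 (PMID), "Body image and weight preoccupation: A comparison between exercising and non-exercising women"
    library(carData)   # Davis is supplied by the carData package (older versions of car also included it)
    library(dplyr)     # provides glimpse()
    data(Davis)
    View(Davis)        # open the data in a spreadsheet-style viewer
    head(Davis)        # first six rows
    str(Davis)         # structure: variable names and types
    glimpse(Davis)     # compact alternative to str()
    summary(Davis)     # summary statistics for each variable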
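The same inspection steps work on any bundled dataset. A short sketch using airquality, which ships with base R in the datasets package (chosen here only because it needs no extra install and contains missing values; it is not part of the original slides):

```r
# airquality is part of the 'datasets' package that is loaded with R
data(airquality)

dim(airquality)                   # number of rows and columns
head(airquality)                  # first six rows
colSums(is.na(airquality))        # missing values per column
sum(complete.cases(airquality))   # rows with no missing values at all
```

Counting NA values per column right after loading is a quick first test for missing data.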

Data input and output
    Davis                                            # print the whole dataset
    table(Davis$weight)                              # frequency table of weights
    table(Davis$height)                              # frequency table of heights
    data.frame(Davis$height, Davis$weight)           # create a dataframe from heights and weights only
    data.frame(Davis$height, Davis$weight)[1:5, ]    # look at dataframe for heights and weights - samples 1-5
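Building a dataframe with data.frame(Davis$height, Davis$weight) renames the columns to Davis.height and Davis.weight. An alternative sketch, selecting columns by name so the original names are kept. It uses the built-in women dataset as a stand-in so that it runs without extra packages:

```r
data(women)   # built-in dataset with columns 'height' and 'weight'

# Select columns by name; the original column names are preserved
hw <- women[, c("height", "weight")]

hw[1:5, ]     # rows 1-5 of the selected columns
names(hw)     # confirm the column names
```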

dplyr -> Data cleaning + EDA
=> manipulating data:
    library(dplyr)
    # each %>% must end its line, otherwise R stops reading the pipeline early
    Davis %>%
      filter(height < quantile(height, 0.5)) %>%   # subset rows of smallest 50%
      arrange(desc(height)) %>%                    # sort rows of smallest 50%
      select(sex, weight, repwt) %>%               # select columns we want
      mutate(weight_diff = weight - repwt) %>%     # create new variables, eg "weight_diff"
      group_by(sex) %>%
      summarise(mean = mean(weight))               # summarise details of interest
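Reported weights can be missing, and any arithmetic on NA gives NA. A small sketch of carrying that through a pipeline, assuming dplyr is installed. The toy data frame and the names n_missing and mean_diff are hypothetical, invented for illustration:

```r
library(dplyr)

# Hypothetical data: measured and self-reported weights, one report missing
toy <- data.frame(
  sex    = c("F", "F", "M", "M"),
  weight = c(58, 62, 77, 81),
  repwt  = c(57, NA, 76, 80)
)

res <- toy %>%
  mutate(weight_diff = weight - repwt) %>%        # NA wherever repwt is NA
  group_by(sex) %>%
  summarise(
    n_missing = sum(is.na(weight_diff)),          # how many differences are missing
    mean_diff = mean(weight_diff, na.rm = TRUE)   # drop NAs before averaging
  )

res
```

Without na.rm = TRUE, the mean for any group containing a missing value would itself be NA.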

=> manipulating data:
    library(ggplot2)
    # assign Davis heights vs weights using ggplot2
    x <- ggplot(Davis, aes(x = weight, y = height, colour = sex)) +
      geom_jitter() +
      geom_line() +
      stat_smooth(span = 0.5) +          # Higher spans = smoother
      ggtitle('heights v weights') +
      xlab('weight') +                   # axis labels match the aes() mapping
      ylab('height')
    # plot it (this way avoids errors)
    ggsave(filename = 'heights-weights-plot1.png', plot = x, dpi = 1200)

=> manipulating data:
    # assign Davis heights vs weights using ggplot2
    x2 <- ggplot(Davis, aes(x = weight, y = height, colour = sex)) +
      geom_boxplot() +
      coord_flip() +
      facet_wrap(~sex) +
      ggtitle('heights v weights')
    # plot it (NB not all plot types are sensible)
    ggsave(filename = 'heights-weights-plot2.png', plot = x2, dpi = 1200)

=> manipulating data:
    # create function to scale heights as nautical miles
    scale_h <- function(height) {
      return(height / 185200)
    }
    Davis$height_nm <- scale_h(Davis$height)   # assign a new variable with nautical mile heights
    Davis$height_nm[1:5]                       # always check the output
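One way to check the output is to test the function on inputs where the answer is known in advance. A sketch, assuming heights are in centimetres as the divisor implies (1 nautical mile = 1852 m = 185200 cm):

```r
# Same conversion as above: centimetres to nautical miles
scale_h <- function(height) {
  height / 185200
}

# Check against values where the answer is known in advance;
# stopifnot() raises an error if any check fails
stopifnot(scale_h(185200) == 1)
stopifnot(scale_h(0) == 0)

# Vectors are converted element by element; NA stays NA
scale_h(c(160, 175, NA))
```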
