1 A workshop on using R to select a sample for EHES Susie Cooper & Johan Heldal Statistics Norway
2 Overview What is R and why use it? Practical Exercises: 1. Installing and loading R and packages 2. Reading external files 3. Calculating sample sizes 4. Stage 1 - Selecting Primary Sampling Units (PSU) 5. Stage 2 - Selecting Secondary Sampling Units (SSU) Where to get more information
3 Why use R for EHES? Its use has been agreed with the EU because: It is free, and therefore available to all countries involved. It is very flexible. It is a very powerful and fast tool for sampling and analysis. However: There can be a steep learning curve to using the program. There is no user-friendly interface.
4 What is EHESsampling? A tool for planning the sampling design Can be used to find good stratifications Can calculate cost-variance optimal sample sizes within PSUs. Can calculate costs and variances of alternatives. A tool for taking a probability sample from a sampling frame.
5 Using EHESsampling See the EHESsampling manual. Before using EHESsampling you have to prepare some input datasets from the main sampling frame. For sampling at stage 1 you need: a dataset describing the PSUs, and a dataset describing the strata. For stage 2 you need: the main sampling frame describing the individual units.
6 1. Loading Packages Load the EHESsampling package and other necessary packages each time you re-open R: library(EHESsampling)
7 2. Reading External Files Open a new script by selecting File and New script
8 2. Reading External Files Set the working directory where data files are stored by typing into the new script: setwd("X:/120/EHES/R/Data") The path is the location on your computer where the data files are stored. Then press Ctrl + R to send the line to the console.
9 2. Reading External Files Read in the chosen file and save it in the working environment. PSUs.df<-read.table("post1000.csv", sep=";", dec=",", header=T) The file is now stored as PSUs.df for this session.
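The separator and decimal arguments used above can be tried out on a small file first. The following is a minimal sketch in base R; the file contents, column names and the object names toy.df and toy2.df are made up for illustration and are not part of the EHES data.

```r
# Write a toy semicolon-separated file with decimal commas to a temporary
# location (values are invented for illustration).
tmp <- tempfile(fileext = ".csv")
writeLines(c("PSU;name;size;meanX",
             "1;Alpha;1200;12,5",
             "2;Beta;800;9,75"), tmp)

# sep=";" splits the fields on semicolons; dec="," reads 12,5 as 12.5
toy.df <- read.table(tmp, sep = ";", dec = ",", header = TRUE)
str(toy.df)

# read.csv2() uses the same separator and decimal mark by default
toy2.df <- read.csv2(tmp)
```

If meanX comes back as text rather than numbers, the dec argument usually does not match the decimal mark used in the file.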
10 2. Reading External Files To see the start of the data set, type: head(PSUs.df) This prints the first 6 lines of the data frame.
11 2. Reading External Files Rename PSUs.df variables to standard names: names(PSUs.df)[c(1,2,3,4,13,14)]<-c("PSU", "name","strata","size","meanX","varX") The first vector, c(1,2,3,4,13,14), gives the positions of the columns whose names are to be changed. The second vector gives the names we are changing the chosen columns to. Check the result with: head(PSUs.df)
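Renaming by position can be checked on a toy data frame before applying it to the real file. This is a small base R sketch; toy.df and its column names are hypothetical.

```r
# Toy data frame with made-up column names
toy.df <- data.frame(id = 1:3, label = c("a", "b", "c"), n = c(10, 20, 30))

# Rename columns 1 and 3 by position; column 2 is left unchanged
names(toy.df)[c(1, 3)] <- c("PSU", "size")

# Alternative: rename by matching the old name, which does not depend
# on the column order in the input file
names(toy.df)[names(toy.df) == "label"] <- "name"

names(toy.df)
```

Renaming by position is short, but it silently picks the wrong columns if the input file layout changes, so it is worth checking the result with head() each time.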
12 2. Reading External Files Read in the details for each stratum strataDetails.df<-read.table("Norwaystrata.csv", sep=";", dec=",",header=T) Rename the variables to standard ones names(strataDetails.df)[c(1,2)]<-c("strata","size")
13 2. Reading External Files Take a look at the dataset strataDetails.df
14 3. Calculating Sample Sizes Calculate the sample sizes for each PSU: stage1<-sample.sizes(PSUs.df,strataDetails.df, n="n2",columns=5:12) Here PSUs.df is the data frame with one line for each PSU, strataDetails.df is the data frame with one line for each stratum, n="n2" names the column of the strata data set containing the strata sample sizes, and columns=5:12 gives the columns containing the age/gender size information.
15 3. Calculating Sample Sizes The sample.sizes function produces 2 datasets: stage1$pop and stage1$strat, which can be saved separately. stage1.strata.df<-stage1$strat stage1.pop.df<-stage1$pop
16 3. Calculating Sample Sizes Look at stage1.strata.df by typing the name into the console. stage1.strata.df
17 3. Calculating Sample Sizes Look at the top of stage1.pop.df by typing head and the name in brackets into the console. head(stage1.pop.df)
18 4. Stage 1 – Selecting PSUs Choose a sample with the correct number of PSUs (mk) from each stratum: stage1.select.df<-stage1.sample(stage1.pop.df) Here stage1.select.df is the new data frame containing the selected PSUs, stage1.sample is the function we have created to select the PSUs, and stage1.pop.df is the previously saved data frame containing the information for each PSU and age/gender domain.
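To illustrate the general idea of drawing a fixed number of PSUs from each stratum, here is a minimal base R sketch. It uses simple random sampling without replacement within strata, which is an assumption made for illustration and not necessarily the selection method used by stage1.sample. The data frame psu.df and the vector m are hypothetical.

```r
set.seed(1)  # fixed seed so the draw can be reproduced

# Hypothetical PSU frame: 3 strata with 4 PSUs each
psu.df <- data.frame(PSU = 1:12, strata = rep(c("A", "B", "C"), each = 4))

# Hypothetical number of PSUs to select in each stratum
m <- c(A = 2, B = 1, C = 3)

# Within each stratum, draw m[stratum] rows without replacement
picked <- unlist(lapply(names(m), function(s) {
  rows <- which(psu.df$strata == s)
  rows[sample.int(length(rows), m[[s]])]
}))

selected.df <- psu.df[picked, ]
table(selected.df$strata)
```

Setting a seed before sampling makes it possible to document and repeat the selection later.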
19 4. Stage 1 – Selecting PSUs Look at the chosen PSUs PSU.list(stage1.select.df)
20 4. Stage 1 – Selecting PSUs Export the file of chosen PSUs: write.table(stage1.select.df,file="select.csv", sep=";", dec=",", row.names=FALSE)
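A quick way to confirm that the exported file has the expected layout is to write a small data frame and read it back. This is a base R sketch; out.df and its values are made up for illustration.

```r
# Toy data frame standing in for the selected PSUs (values are invented)
out.df <- data.frame(PSU = c(101L, 205L), prob = c(0.25, 0.5))

# Write with semicolon separator and decimal comma, without row names
tmp <- tempfile(fileext = ".csv")
write.table(out.df, file = tmp, sep = ";", dec = ",", row.names = FALSE)

# Inspect the raw lines, then read the file back with matching settings
readLines(tmp)
back.df <- read.table(tmp, sep = ";", dec = ",", header = TRUE)
```

Using the same sep and dec values for writing and reading keeps the file consistent with the input files used earlier.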
21 5. Stage 2 – Selecting SSUs Combine the file of selected PSUs with the file containing individual unit data. This should result in a file of all individual units in all the selected PSUs.
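If the combination is done in R, one possible approach is base R's merge(), which keeps only the rows whose key appears in both data frames. The sketch below assumes both files share a PSU column; selected.df, frame.df and their contents are hypothetical.

```r
# Hypothetical selected PSUs (as produced at stage 1)
selected.df <- data.frame(PSU = c(2, 5), strata = c("A", "B"))

# Hypothetical sampling frame: one row per individual, with its PSU code
frame.df <- data.frame(person = 1:6, PSU = c(1, 2, 2, 3, 5, 5))

# Inner join on PSU: keeps only individuals living in a selected PSU
# and attaches the stratum of that PSU to each individual
PSU.individuals.df <- merge(frame.df, selected.df, by = "PSU")
PSU.individuals.df
```

The merge can equally be done in other software before the file is read into R, as long as the result has one row per individual in the selected PSUs.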
22 5. Stage 2 – Selecting SSUs Read in the merged file: PSU.individuals.df<-read.table("NorwaySelected.csv", sep=";", dec=",", header=T) head(PSU.individuals.df)
23 5. Stage 2 – Selecting SSUs Take a sample of appropriate size in each stratum and PSU: selected.individuals<-stage2.sample(PSU.individuals.df) Look at the top of the selected individual units: head(selected.individuals)
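The idea of sampling individuals within each selected PSU can be sketched in base R as follows. This draws a simple random sample of equal size in every PSU, which is an assumption for illustration only and not necessarily what stage2.sample does. The names ind.df, n.per.psu and chosen.df are hypothetical.

```r
set.seed(2)  # fixed seed so the draw can be reproduced

# Hypothetical individuals: 10 persons in each of two selected PSUs
ind.df <- data.frame(person = 1:20, PSU = rep(c(2, 5), each = 10))

# Hypothetical sample size within each PSU
n.per.psu <- 3

# Split by PSU, sample rows without replacement in each part, recombine
chosen.df <- do.call(rbind, lapply(split(ind.df, ind.df$PSU), function(d) {
  d[sample.int(nrow(d), n.per.psu), ]
}))

table(chosen.df$PSU)
```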
24 Further Sampling Steps 1. Read in the strata dataset 2. Calculate the PSU sample sizes 3. Take a sample of PSUs – stage 1 4. Merge the selected PSUs with the main sampling frame containing individual units 5. Sample individual units – stage 2