Presentation is loading. Please wait.

Presentation is loading. Please wait.

This presentation is made available through a Creative Commons Attribution-Noncommercial license. Details of the license and permitted uses are available.

Similar presentations


Presentation on theme: "This presentation is made available through a Creative Commons Attribution-Noncommercial license. Details of the license and permitted uses are available."— Presentation transcript:

1 This presentation is made available through a Creative Commons Attribution-Noncommercial license. Details of the license and permitted uses are available at http://creativecommons.org/licenses/by-nc/3.0/ © 2010 Dr. Juliet Pulliam and the Clinic on the Meaningful Modeling of Epidemiological Data http://creativecommons.org/licenses/by-nc/3.0/ Title: Introduction to Data Management and Cleaning Attribution: Dr. Juliet Pulliam, Clinic on the Meaningful Modeling of Epidemiological Data Source URL:http://lalashan.mcmaster.ca/theobio/mmed/index.php/Introduction_to_data_management_and_clean inghttp://lalashan.mcmaster.ca/theobio/mmed/index.php/Introduction_to_data_management_and_clean ing For further information please contact Dr. Juliet Pulliam (juliet.mmed.clinic@gmail.com).

2 Introduction to data management and cleaning Clinic on the Meaningful Modeling of Biological Data and BSc Honours Course in Biomathematics African Institute for the Mathematical Sciences Muizenberg, South Africa 28 May 2010 Dr. Juliet Pulliam RAPIDD Program, DIEPS Fogarty International Center, NIH, USA and Department of Ecology & Evolutionary Biology University of California at Los Angeles (UCLA)

3 Levels of data aggregation Aggregated data De-identified data Personally identifying data } Individual-level

4 Read in the dataset # cases.R # JRCP 06 Jan 2010 # # Last modified: 07.01.10 rm(list=ls()) library(foreign) library(date) cases<-read.spss(”cases.sav",to.data.frame=T,trim.factor.names=T) dim(cases) [1] 135 91 names(cases) [1] "idno" … "sex" … "age" "occup" "dor" "dtadm" "dtill" … "fever" "dtfever" … "cough" "dtcough" … "headache" … "dthead" … "outcome" … "dtdeath" … "igm1" "igg1" "pcr1ser" "dtserum1" "Igm2" "Igg2" "dtserum2" … "virusiso” … "CXR" "ARDS" …

5 Look at the dataset - format head(cases$dor) [1] 13266288000 13266288000 13266374400 cases$dor<-as.date(cases$dor/60/60/24-137775) range(cases$dor,na.rm=T) [1] 6Mar2003 15Mar2008

6 Look at the dataset - format head(cases$dor) [1] 13266288000 13266288000 13266374400 cases$dor<-as.date(cases$dor/60/60/24-137775) range(cases$dor,na.rm=T) [1] 6Mar2003 15Mar2008 Categorical data - factor (levels) Continuous data - numeric (range) Dates - date, POSIXct (range) Binary data - T, F

7 Look at the dataset - format Categorical data - factor (levels) Continuous data - numeric (range) Dates - date, POSIXct (range) Binary data - T, F "idno" … "sex" … "age" "occup" "dor" "dtadm" "dtill" … "fever" "dtfever" … "cough" "dtcough" … "headache" … "dthead" … "outcome" … "dtdeath" … "igm1" "igg1" "pcr1ser" "dtserum1" "Igm2" "Igg2" "dtserum2" … "virusiso” … "CXR" "ARDS" …

8 Look at the dataset - logic subset(cases,dor-dtill<0)$idno AT567 range(cases$dor-cases$dtill) [1] NA NA range(cases$dor-cases$dtill,na.rm=T) [1] -7 686 hist(cases$dor-cases$dtill,20,xlab="Days from onset to report",main="",col=1)

9 Inconsistencies IDNO: SF999, Year: 2005 dtill given as 16 Feb 2005 dtfever given as 16 Feb 2003 Correction: Change year of dtfever from 2003 to 2005 IDNO: GW321, Year: 2003 dtill given as 12 Jan 2003 dtcough given as 12 Mar 2003 dthead given as 12 Mar 2003 dtdeath given as 15 Jan 2003 Correction: Change month of dtcough and dthead from 12 Mar to 12 Jan

10 Inconsistencies IDNO: LP309, Year: 2002 dtill given as 12 Aug 2002 dthead given as 11 Aug 2002 Correction: none IDNO: AT567, Year: 2005 dor given as 6 May 2005 dtill given as 14 May 2005 Correction: delete case (sample negative when tested at CDC)

11 Correct the dataset # DATABASE CORRECTIONS # cases$dtfever[cases$idno==”SF999"]<-as.date("16Jan2005") # Onset 2005; dtfever was 2003 cases$dthead[cases$idno==”GW321"]<-as.date("12Jan2003") # Onset/death Jan; dtfever was March cases$dtcough[cases$idno==”GW321"]<-as.date("12Jan2003") # Onset/death Jan; dtcough was March cases<-subset(cases,idno!=”AT567") # sample negative when tested at CDC Save corrected dataset as new file today<-strptime(Sys.time(),"%d%b%Y") fn<-paste("casesClean",today,".Rdata",sep="") fn [1] "casesClean07Jan2010.Rdata" save(cases,today,file=fn)


Download ppt "This presentation is made available through a Creative Commons Attribution-Noncommercial license. Details of the license and permitted uses are available."

Similar presentations


Ads by Google