Generating new variables and manipulating data with STATA Biostatistics 212 Lecture 3.


1 Generating new variables and manipulating data with STATA Biostatistics 212 Lecture 3

2 Housekeeping Lab 1 cleanup Computer and software issues Change final session from 11/29 to 12/1 (Thursday instead of Tuesday) Change schedule – Excel NEXT session

3 Today... What we did last week, and why it was unrealistic What does “data cleaning” mean? How to generate a variable How to manipulate the data in your new variable How to label variables and otherwise document your work Examples

7 Last time… What was unrealistic? –The dataset came as a Stata.dta file –The variables were ready to analyze –Most variables were labeled

8 Last time… I.e. – The data was “clean”

9 How your data will arrive On paper forms In a text file (comma or tab delimited) In Excel In Access In another data format (SAS, etc)

10 Importing into Stata Options: –Cut and Paste –insheet, infile, fdause, other flexible Stata commands –A convenience program like “Stat/Transfer”
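A minimal sketch of the insheet route, assuming a small comma-delimited file. The file name example_raw.csv and its columns are made up for illustration; the example writes the file first only so it can run on its own. Newer Stata versions document import delimited as the replacement for insheet.

```stata
* Hypothetical file and column names, for illustration only.
* Write a tiny comma-delimited file so the example is self-contained.
file open fh using "example_raw.csv", write text replace
file write fh "id,sex,gestage" _n
file write fh "1,m,38" _n
file write fh "2,f,40" _n
file close fh

* Read the delimited text file, replacing any data already in memory.
insheet using "example_raw.csv", comma clear

* Make sure it worked: look at the data.
describe
list
```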

11 Importing into Stata Make sure it worked –Look at the data

12 Importing into Stata Example – neonatal opiate withdrawal data

13 Exploring your data Figure out what all those variables mean Options –Browse, describe, summarize, list in STATA –Refer to a data dictionary –Refer to a data collection form –Guess, or ask the person who gave it to you

15 Exploring your data Example: Neonatal opiate withdrawal data Problems arise… –Sex is m/f, not 1/0 –Gestational age has nonsense values (0, 60) –Breastfeeding has a bunch of weird text values –Drug variables coded y or blank –Many variable names are obscure
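Problems like these usually show up with a few exploration commands. The sketch below uses a small invented dataset (not the actual neonatal opiate withdrawal data); the variable names and the 20 to 45 week plausibility range are assumptions.

```stata
* Invented toy data that mimics the kinds of problems listed above.
clear
input id str1 sex gestage str12 breastfed str1 methadone
1 "m" 38 "yes" "y"
2 "f" 0 "Y - some" ""
3 "f" 60 "no" "y"
4 "m" 40 "n" ""
end

* Variable names, storage types, and labels (if any)
describe

* Minimum and maximum reveal nonsense values such as 0 and 60
summarize gestage

* Frequency tables reveal odd codings and blanks
tabulate sex, missing
tabulate breastfed, missing

* List the suspicious observations (the 20-45 range is an assumed cutoff)
list if (gestage < 20 | gestage > 45) & !missing(gestage)
```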

16 Cleaning your data You must “clean” your data so it is ready to analyze.

17 Cleaning your data Cleaning tasks –Check for consistency and clean up nonsense data and outliers –Deal with missing values –Code all dichotomous variables 1/0 –Categorize variables meaningfully (for Table 1, etc) –Derive new variables –Rename variables (with common sense, or with a consistent scheme) –Label variables –Label the VALUES of coded variables
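A few of these tasks could look like the sketch below. The data are invented, and treating gestational ages of 0 and 60 as missing is an assumption about what those values mean; in real data you would check the source forms first.

```stata
* Invented toy data; variable names are hypothetical.
clear
input id str1 sex gestage str1 methadone
1 "m" 38 "y"
2 "f" 0 ""
3 "f" 60 "y"
4 "m" 40 ""
end

* Clean up nonsense values: set them to missing (assumed to be data-entry errors)
replace gestage = . if gestage == 0 | gestage == 60

* Code a y/blank drug variable as 1/0
generate methadone01 = 0
replace methadone01 = 1 if methadone == "y"

* Rename an obscure variable name to something readable
rename gestage gest_age_wk

list
```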

18 Cleaning your data The importance of documentation –Retracing your steps Document every step using a “do” file
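One possible skeleton for such a do file is sketched here. All file names are placeholders, and the toy data stand in for whatever raw file you were given.

```stata
* cleaning_example.do (placeholder name)
* Close any log left open by a previous run, then start a new one.
capture log close
log using "cleaning_example.log", replace text

* Stand-in for loading the raw data; saved untouched in Stata format.
clear
input id str1 sex
1 "m"
2 "f"
end
save "example_raw.dta", replace

* Every cleaning step lives here, so it can be rerun and retraced.
generate male = 0
replace male = 1 if sex == "m"

* Save the cleaned data under a NEW name; never overwrite the raw file.
save "example_clean.dta", replace
log close
```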

19 Data cleaning Basic skill 1 – make a new variable
Creating new variables
    generate newvar = expression
An expression can be:
–A number (constant): generate allzeros = 0
–A variable: generate ageclone = age
–A function: generate agesqrt = sqrt(age)
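The three kinds of expression can be tried on a toy dataset, as in this sketch (the ages are invented):

```stata
* Invented toy data
clear
input age
25
49
64
end

generate allzeros = 0          // a constant
generate ageclone = age        // a copy of another variable
generate agesqrt = sqrt(age)   // a function of another variable

list
```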

20 Data cleaning Basic skill 1 – make a new variable
Getting rid of a variable
    drop var
Getting rid of observations
    drop if boolean exp
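A short sketch of both forms of drop, on invented data. Dropping observations with a missing age is only an illustration, not a recommendation for any particular analysis.

```stata
* Invented toy data; the second observation has a missing age.
clear
input id age
1 25
2 .
3 64
end

* Drop a variable (a whole column)
generate scratch = 0
drop scratch

* Drop observations (rows) that meet a condition
drop if missing(age)

list
```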

21 Data cleaning Basic skill 2 – manipulating the values
Changing the values of a variable
    replace var = exp [if boolean exp]
A boolean expression evaluates to true or false for each observation

22 Data cleaning Basic skill 2 – manipulating the values
Examples
    generate male = 0
    replace male = 1 if sex=="male"
    generate ageover50 = 0
    replace ageover50 = 1 if age>50
    generate complexvar = age
    replace complexvar = (ln(age)*3) if (age>30 | male==1) & (othervar1>=othervar2)
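The generate-then-replace pattern can be run on toy data, as sketched below. The data are invented and contain no missing values, which keeps the example simple.

```stata
* Invented toy data with no missing values
clear
input str6 sex age
"male" 62
"female" 45
"male" 38
end

* Start everyone at 0, then switch to 1 where the condition is true
generate male = 0
replace male = 1 if sex == "male"

generate ageover50 = 0
replace ageover50 = 1 if age > 50

list
```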

23 Data cleaning Basic skill 2 – manipulating the values
Logical operators for boolean expressions:
    English                   Stata
    Equal to                  ==
    Not equal to              !=, ~=
    Greater than              >
    Greater than/equal to     >=
    Less than                 <
    Less than/equal to        <=
    And                       &
    Or                        |
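One point worth checking when using > or >=: Stata treats a numeric missing value (.) as larger than any number, so age > 50 is true when age is missing. The sketch below, on invented data, contrasts a naive version with one that guards against this.

```stata
* Invented toy data; the third observation has a missing age.
clear
input age male
62 1
45 0
. 1
end

* Naive version: the missing age is counted as "over 50"
generate ageover50_naive = 0
replace ageover50_naive = 1 if age > 50

* Guarded version: combine conditions with & and keep missing as missing
generate ageover50 = 0
replace ageover50 = 1 if age > 50 & !missing(age)
replace ageover50 = . if missing(age)

list
```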

24 Data cleaning Basic skill 2 – manipulating the values
Mathematical operators:
    English            Stata
    Add                +
    Subtract           -
    Multiply           *
    Divide             /
    To the power of    ^
    Natural log of     ln(expression)
    Base 10 log of     log10(expression)
    Etcetera…
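These operators combine in the usual way inside generate. In the sketch below the data and the variable names (weightkg, heightm, bmi) are hypothetical, chosen only to show division and exponentiation together.

```stata
* Invented toy data; variable names are hypothetical.
clear
input age weightkg heightm
30 70 1.75
50 80 1.60
end

* Divide and raise to a power: body mass index = kg / m^2
generate bmi = weightkg / heightm^2

* Natural log and base 10 log
generate lnage = ln(age)
generate log10age = log10(age)

* To the power of
generate agesq = age^2

list
```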

25 Data cleaning Basic skill 2 – manipulating the values
Another way to manipulate data
    recode var oldvalue1=newvalue1 [oldvalue2=newvalue2] [if boolean expression]
A more complicated, but more flexible, command than replace

26 Data cleaning Basic skill 2 – manipulating the values
Examples
    generate male = 0
    recode male 0=1 if sex=="male"
    generate raceethnic = race
    recode raceethnic 1=6 if ethnic=="hispanic"
    (replace raceethnic = 6 if ethnic=="hispanic" & race==1)
    generate tertilescac = cac
    recode tertilescac min/54=1 55/82=2 83/max=3
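The tertile example can be run on toy data, as in this sketch. The cac values are invented, the cutpoints are the ones from the slide, and the rules are written in recode's parenthesized form. Note that with these integer ranges a non-integer value between 54 and 55 would be left unchanged.

```stata
* Invented toy values for cac
clear
input cac
0
54
55
82
83
400
end

* Copy the original first so the raw values are preserved
generate tertilescac = cac
recode tertilescac (min/54 = 1) (55/82 = 2) (83/max = 3)

* Check the result against the original values
tabulate tertilescac
list
```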

27 Data cleaning Basic skill 3 – labeling variables
You can label:
–A dataset
    label data "label"
–A variable
    label var varname "label"
–Values of a variable (2-step process)
    label define labelname value1 "label1" [value2 "label2"…]
    label values varname labelname
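All three kinds of label applied to a toy variable, as a sketch. The label text and the value-label name yesno are arbitrary choices.

```stata
* Invented toy data
clear
input male
1
0
end

* Label the dataset and the variable
label data "Example cleaned dataset"
label variable male "Male sex"

* Step 1: define a named set of value labels
label define yesno 0 "No" 1 "Yes"

* Step 2: attach that set to the variable
label values male yesno

* The table now shows No/Yes instead of 0/1
tabulate male
```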

28 Cleaning your data Cleaning tasks –Check for consistency and clean up nonsense data –Deal with missing values –Code all dichotomous variables 1/0 –Categorize variables meaningfully (for Table 1, etc) –Derive new variables –Rename variables (with common sense, or with a consistent scheme) –Label variables –Label the VALUES of coded variables

29 Data cleaning Example: Neonatal opiate withdrawal data

30 Data cleaning At the end of the day you have: –1 raw data file, original format –1 raw data file, Stata format –1 do file that cleans it up –1 log file that documents the cleaning –1 clean data file, Stata format

31 Summary Data cleaning –ALWAYS necessary to some extent –ALWAYS use a do file, don’t overwrite original data –Check your work –Watch out for missing values –Label as much as you can

32 Lab this week It’s long It’s important It’s hard But this year, we have 2 sessions for it! Email lab to bio212ucsf@yahoo.com Due 10/11 at Midnight

33 Preview of next week… Using Excel –What is it good for? –Formulas –Designing a good spreadsheet –Formatting

