
2 HRP223 2008 Copyright © 1999-2008 Leland Stanford Junior University. All rights reserved. Warning: This presentation is protected by copyright law and international treaties. Unauthorized reproduction of this presentation, or any portion of it, may result in severe civil and criminal penalties and will be prosecuted to maximum extent possible under the law. HRP 223 - 2008 Topic 6 – Relational Data

3 HW 2
• SORRY! I apologize for not getting it posted before yesterday! It is due in two weeks.
  – The new datasets have fewer variables and one variable has been renamed. You want to change newID to have the same name as the old subject ID. You can put a rename= data set option on the statement that does the import:

    proc import out = wide (rename = (dude = subjectID))
        datafile = "C:\Projects\classes\HRP223-2008\day6\wideDx.xls"
        replace;
      mixed = yes;
      sheet = "Sheet1";
    run;

4 Flat Files
• Some people try to store all their data in a single file. This causes lots of extra work because of holes in the tables and repeated information.
• Both problems can be fixed by a relational model.
  – Split the data into many tables.
• You need to use SQL to work with data split across multiple tables.

5 Not Normalized
• I frequently get data from people who are not professional programmers in which the diagnosis data is organized “wide” across the page: the first diagnosis is in the first column, the second is in the second, and so on. The task is to find or fix a diagnosis.

6 Subsetting Based on 5 Variables

7 SQL vs. Datastep
• The GUI generates this code:
• Or you could write a little data step program:
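A minimal sketch of what the two versions might look like. The data set name wide, the variables subjectID and dx1-dx5, and the diagnosis code 22 are assumptions made for illustration; this is not the code from the original slide.

```sas
/* Hypothetical wide file: one row per subject, up to five diagnosis codes */
data wide;
  input subjectID dx1-dx5;
  datalines;
1 22 9 . . .
2 9 . . . .
3 14 22 31 . .
;
run;

/* SQL version: every diagnosis column has to be named in the where clause */
proc sql;
  create table hasDx22_sql as
  select *
  from wide
  where dx1 = 22 or dx2 = 22 or dx3 = 22 or dx4 = 22 or dx5 = 22;
quit;

/* Data step version: a subsetting if keeps the same rows */
data hasDx22_ds;
  set wide;
  if dx1 = 22 or dx2 = 22 or dx3 = 22 or dx4 = 22 or dx5 = 22;
run;
```

With this made-up data, both steps keep subjects 1 and 3. Either way, the condition grows by one term for every extra diagnosis column.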

8 Change All 9s to 999s?
• It is a lot of clicking.

9 Code
• The SQL is a bit complicated.

10 As Data Step
• If it is more than 5 columns, things get unruly. Imagine doing this across 20 possible diagnoses. There is an easy solution in data step code.
• First, the SQL code can be done easily in a data step.
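As a rough illustration of the recoding task, here is one way the SQL and the data step versions could be written. The data set wide and its variables are hypothetical, and the case expressions are only one plausible form of the query a GUI might generate.

```sas
/* Hypothetical wide file with 9 used as a placeholder code */
data wide;
  input subjectID dx1-dx5;
  datalines;
1 22 9 . . .
2 9 . . . .
3 14 22 31 9 .
;
run;

/* SQL: one case expression per column */
proc sql;
  create table fixed_sql as
  select subjectID,
         case when dx1 = 9 then 999 else dx1 end as dx1,
         case when dx2 = 9 then 999 else dx2 end as dx2,
         case when dx3 = 9 then 999 else dx3 end as dx3,
         case when dx4 = 9 then 999 else dx4 end as dx4,
         case when dx5 = 9 then 999 else dx5 end as dx5
  from wide;
quit;

/* Data step: one if statement per column */
data fixed_ds;
  set wide;
  if dx1 = 9 then dx1 = 999;
  if dx2 = 9 then dx2 = 999;
  if dx3 = 9 then dx3 = 999;
  if dx4 = 9 then dx4 = 999;
  if dx5 = 9 then dx5 = 999;
run;
```

The data step is shorter, but it is still the same line typed five times.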

11 A List
• As you can see, there is a list of variables and you are doing the same things over and over.
• You want to make a list called dx and have the 1st element refer to dx1, the 2nd element refer to dx2, etc. The concept of a named list of variables, or an alias to a bunch of variables, is instantiated as an array.

12 Arrays
• A major improvement….. Ummmm.
• You want to process the same one line over and over. You need to count from 1 to 5…. Sounds like a loop.

13 Change Lots of Things
• If you have an array, you can process wide files easily.
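A short sketch of the array-plus-loop idea, assuming the same hypothetical wide data set with dx1-dx5 and the 9-to-999 recode used as the example task:

```sas
/* Hypothetical wide file */
data wide;
  input subjectID dx1-dx5;
  datalines;
1 22 9 . . .
2 9 . . . .
3 14 22 31 9 .
;
run;

data fixed_array;
  set wide;
  /* dx is an alias: dx[1] is dx1, dx[2] is dx2, and so on */
  array dx [5] dx1-dx5;
  /* one line of logic, repeated for every element of the array */
  do i = 1 to dim(dx);
    if dx[i] = 9 then dx[i] = 999;
  end;
  drop i; /* the loop counter is not needed in the output */
run;
```

Going from 5 diagnoses to 20 would only mean changing the array statement; the loop stays the same because dim(dx) returns the number of elements.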

14 Restructuring with Arrays
• You can use similar code to restructure data so that you have only a couple of columns of data.
• Add a new column that is called dxNum and another called theDx. Those two columns plus the subject ID number can contain the same information without all the “holes”.

15 How does that work?
• Go through all five variables, one at a time.
• If the variable is not missing, you need to do three things:
  – Copy the diagnosis counter number into the dxNum variable.
  – Copy the diagnosis code number into the variable called theDx.
  – Write to the new data set.

16 Repeated Ifs
• This is a lot of typing and it obscures the fact that you are doing three things if a condition is true:

17 do end
• You have seen do statements in the context where you do stuff over and over. There is also a do ... end block for when you need to run a block of instructions if a condition is true. You need both do and end.

18 Actual Code
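The three steps described on the earlier slides (set dxNum, set theDx, write the record) can be sketched like this. The input data set wide and the variable subjectID are assumptions; dxNum and theDx are the names used in the slides.

```sas
/* Hypothetical wide file with holes */
data wide;
  input subjectID dx1-dx5;
  datalines;
1 22 9 . . .
2 9 . . . .
3 14 22 31 . .
;
run;

data long (keep = subjectID dxNum theDx);
  set wide;
  array dx [5] dx1-dx5;
  /* the loop counter doubles as the diagnosis number */
  do dxNum = 1 to 5;
    if not missing(dx[dxNum]) then do;
      theDx = dx[dxNum]; /* copy the diagnosis code */
      output;            /* write one record per non-missing diagnosis */
    end;
  end;
run;
```

With this made-up input, the result has six rows (two for subject 1, one for subject 2, three for subject 3) and no holes.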

19 Normalization Part 2
• I got data where I needed to analyze age for people who have a particular diagnosis. The data was a not-normalized mess:

20 Normalization Part 2: The Wrong Way
• If your database is like this, you need code like this:

    data bad2;
      set bad;
      if (dob1 ne . and not missing(dx1)) then do;
        if code1 = 22 then IsCase1 = 1; else IsCase1 = 0;
      end;
      if (dob2 ne . and not missing(dx2)) then do;
        if code2 = 22 then IsCase2 = 1; else IsCase2 = 0;
      end;
      if (dob3 ne . and not missing(dx3)) then do;
        if code3 = 22 then IsCase3 = 1; else IsCase3 = 0;
      end;
      if (dob4 ne . and not missing(dx4)) then do;
        if code4 = 22 then IsCase4 = 1; else IsCase4 = 0;
      end;
      if (dob5 ne . and not missing(dx5)) then do;
        if code5 = 22 then IsCase5 = 1; else IsCase5 = 0;
      end;
    run;

• You will end up with the same code repeated as many times as you have repetitions.

21 Normalization Part 2: The Right Way
• Instead, you should have a record in a table corresponding to each repetition.
• With code like this:

    data good2;
      set good;
      if code = 22 then isCase1 = 1; else isCase1 = 0;
    run;

22
• Your first attempt could go something like this:

    data normal1 (keep = sid mid dob dx code);
      set bad;
      format dob dx mmddyy8.;
      if (dob1 ne . and dx1 ne . and code1 ne .) then do;
        mid = 1; dob = dob1; dx = dx1; code = code1; output;
      end;
      if (dob2 ne . and dx2 ne . and code2 ne .) then do;
        mid = 2; dob = dob2; dx = dx2; code = code2; output;
      end;
      if (dob3 ne . and dx3 ne . and code3 ne .) then do;
        mid = 3; dob = dob3; dx = dx3; code = code3; output;
      end;
      if (dob4 ne . and dx4 ne . and code4 ne .) then do;
        mid = 4; dob = dob4; dx = dx4; code = code4; output;
      end;
      if (dob5 ne . and dx5 ne . and code5 ne .) then do;
        mid = 5; dob = dob5; dx = dx5; code = code5; output;
      end;
    run;

• But you end up with just as many blocks of code.

23 Setting up Aliases (Arrays)
• What you want is a way to repeat this code over the five sets of variables:

    if (dob1 ne . and dx1 ne . and code1 ne .) then do;
      mid = 1; dob = dob1; dx = dx1; code = code1; output;
    end;

• You need:
  – A dob alias (dob_a) to refer to dob1, dob2, dob3, dob4 and dob5
  – A dx alias (dx_a) to refer to dx1, dx2, dx3, dx4 and dx5
  – A code alias (code_a) to refer to code1, code2, code3, code4 and code5

24 Setting up Aliases (Arrays)

    data normal2a;
      set bad;
      array dob_a dob1-dob5;
      array dx_a dx1-dx5;
      array code_a code1-code5;
      if (dob1 ne . and dx1 ne . and code1 ne .) then do;
        mid = 1; dob = dob1; dx = dx1; code = code1; output;
      end;
    run;

• This sets up the arrays but they are not used in this program.

25 Setting up Aliases (Arrays)

    data normal2b;
      set bad;
      array dob_a dob1-dob5;
      array dx_a dx1-dx5;
      array code_a code1-code5;
      /* same logic as before, but written through the aliases */
      if (dob_a[1] ne . and dx_a[1] ne . and code_a[1] ne .) then do;
        mid = 1; dob = dob_a[1]; dx = dx_a[1]; code = code_a[1]; output;
      end;
    run;

26 Setting up Aliases (Arrays)

    data normal2c (keep = sid mid dob dx code);
      set bad;
      array dob_a dob1-dob5;
      array dx_a dx1-dx5;
      array code_a code1-code5;
      do c = 1 to 5 by 1;
        if (dob_a[c] ne . and dx_a[c] ne . and code_a[c] ne .) then do;
          mid = c; dob = dob_a[c]; dx = dx_a[c]; code = code_a[c]; output;
        end; /* closes the then do block */
      end;   /* closes the do c loop */
    run;

27 Arrays
• You can tell SAS that a set of variables are related by putting them into an array statement.
• Arrays in SAS are not like arrays in other languages like BASIC or C. SAS arrays are only aliases to an existing set of variables. They are created using the array statement:

    array times_a [365] time1-time365;

  – times_a: my notation for arrays
  – [365]: an optional size of the array
  – time1-time365: what the array refers to

28 Arrays (2)
• If your array references variables that do not exist, they will be created. Make sure to use the $ if you intend to create character variables.
• If you want to reference all numeric variables between theValue and thingy2, do it like this:

    array x theValue-numeric-thingy2;

  – A double hyphen (theValue--thingy2) means all variables between and including the starting and ending variables, in the order they are stored in the data set.
  – Putting numeric between the hyphens (theValue-numeric-thingy2) limits that range to the numeric variables.
  – A single hyphen (time1-time365) indicates the numbered sequence starting with the first variable and ending with the second.
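A small sketch of both points, assuming made-up variables (theValue, note, thingy2, flag1-flag3) that are created inside the step so it runs on its own:

```sas
data arrayDemo;
  /* three variables created in this order: numeric, character, numeric */
  theValue = 1;
  note = "abc";
  thingy2 = 2;

  /* flag1-flag3 do not exist yet, so the array statement creates them;
     the $ 1 makes them character variables of length 1 */
  array flag_a [3] $ 1 flag1-flag3;

  /* only the numeric variables from theValue through thingy2 (note is skipped) */
  array num_a [*] theValue-numeric-thingy2;

  do i = 1 to dim(flag_a);
    flag_a[i] = "N";
  end;

  nNumeric = dim(num_a); /* how many variables the range list picked up */
  drop i;
run;
```

Leaving the $ off the first array statement would create flag1-flag3 as numeric variables, and assigning "N" to them would then fail to convert.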

29 SQL and Colors
• You may have noticed that the guys who made the enhanced editor don’t know SQL commands because some of the key words were not colorized. There are lots of them, but they can be easily fixed.

30 Fix Color
• Go to Tools > Options… > SAS Programs, then click Editor Options…, then User Defined Keywords.

31 Missing Words
• Add:
  – calculated
  – coalesce
  – corresponding
  – except
  – full
  – group
  – inner
  – intersect
  – join
  – left
  – on
  – or
  – order
  – outer
  – right
  – union

32 Minimal SQL
• Print a report showing the contents of variables from a single data set.
  – Put a comma-delimited list of variables after select.
  – Specify a library.table after from.

33 What variables?
• Use an * to indicate that you want all variables instead of typing them all.
• There is no syntax to specify variables based on position in the source files. That is, you cannot specify that you want to select the 2nd and 7th variables (from left to right) or to select the first 3 variables.

34 Likely Tweaks
• You can rename a variable in the list with the as keyword.
• You can also specify variable formats and labels.

35 More Tweaks
• The from line references tables which are in libraries. Complex queries require you to reference the table name over and over again. Instead of having to type the long library and dataset names repeatedly, you can refer to the files by an alias.
• Print the column called dude from the table reportedCancers which is in the ovCancer library. Here the c. is optional because dude is only in one table (the query only uses one table).
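The pieces from the last few slides (select list, *, as, format, label, table alias) might be combined as in the sketch below. The table is a hypothetical stand-in built in the WORK library so the example runs without the ovCancer library, and dxDate is an invented column.

```sas
/* Hypothetical stand-in for ovCancer.reportedCancers */
data reportedCancers;
  input dude dxDate :mmddyy10.;
  datalines;
1 01/15/2007
2 03/02/2008
;
run;

proc sql;
  /* all variables */
  select *
  from work.reportedCancers;

  /* named variables, with a rename, a label, a format and a table alias */
  select c.dude as subjectID label = "Subject ID",
         c.dxDate format = mmddyy10. label = "Date of diagnosis"
  from work.reportedCancers as c;
quit;
```

The alias c only pays off once a query uses several tables; with a single table, c.dude and dude mean the same thing.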

36 Stacking
• You already know how to use proc append or the Data > Append Table menu item to combine two sets of data on top of one another. How do you “copy/paste” to insert columns from one table into another?

37 The GUI can do easy SQL.
• You could write data step or proc sql code.
• Happily, most of the merges you need are in the graphical user interface.

38 How are tables linked?
• You need to tell it who is matched with whom in the tables. If you have a demographics table and a disease table, you need to specify which column says which disease belongs to which person. In this case you would say match on the subject ID numbers in the two tables using a key column.

39 Inner Join
• If you want records where there is a match in both tables, you want an inner join. When rows are matched on equal key values it is also called an equijoin; a natural join is the special case that matches on all like-named columns.
  – For example, which subjects have demographic and cancer information?

40

41 Alternate Syntax
• This is what I write.
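A hedged sketch of the two common ways to write an inner join in proc sql. The tables demog and cancers, and every column except dude, are made up for illustration; the transcript does not say which of the two forms is the one the author writes.

```sas
/* Hypothetical demographics table */
data demog;
  input dude age;
  datalines;
1 54
2 61
3 47
;
run;

/* Hypothetical cancer table */
data cancers;
  input dude cancer $;
  datalines;
2 ovarian
3 breast
4 colon
;
run;

proc sql;
  /* explicit join syntax */
  create table both1 as
  select d.dude, d.age, c.cancer
  from demog as d inner join cancers as c
    on d.dude = c.dude;

  /* comma-and-where syntax; returns the same rows */
  create table both2 as
  select d.dude, d.age, c.cancer
  from demog as d, cancers as c
  where d.dude = c.dude;
quit;
```

With this data, both tables contain only subjects 2 and 3, the two IDs that appear in both sources.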

42 All Information from the Left Table
• If you want all the demographics, as well as the cancers if they occur:

43 Left Join Code

44 All Information from the Right Table

45
• If you wanted the cancers info plus demographics where there were any:

46 Right Join Code
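One possible version of the left and right joins, using hypothetical demog and cancers tables (only the key name dude comes from the slides):

```sas
data demog;
  input dude age;
  datalines;
1 54
2 61
3 47
;
run;

data cancers;
  input dude cancer $;
  datalines;
2 ovarian
3 breast
4 colon
;
run;

proc sql;
  /* left join: every row of demog, with cancer filled in where it exists */
  create table allDemog as
  select d.dude, d.age, c.cancer
  from demog as d left join cancers as c
    on d.dude = c.dude;

  /* right join: every row of cancers, with age filled in where it exists */
  create table allCancers as
  select c.dude, d.age, c.cancer
  from demog as d right join cancers as c
    on d.dude = c.dude;
quit;
```

Note which table the key is taken from in each select list. In the right join, d.dude would be missing for subject 4, who has no demographics record, so c.dude is used instead.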

47 Full Join
• If you wanted all information:
• It would be nice if you could combine the two dude variables so the first non-missing value was used.

48 Full Join Code

49 Coalesce
• Coalesce says take the first non-missing value from the set of variables.
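A sketch of a full join that uses coalesce to merge the two key columns, again with hypothetical demog and cancers tables:

```sas
data demog;
  input dude age;
  datalines;
1 54
2 61
3 47
;
run;

data cancers;
  input dude cancer $;
  datalines;
2 ovarian
3 breast
4 colon
;
run;

proc sql;
  create table everyone as
  select coalesce(d.dude, c.dude) as dude, /* first non-missing of the two keys */
         d.age,
         c.cancer
  from demog as d full join cancers as c
    on d.dude = c.dude
  order by dude;
quit;
```

Here subject 1 has no cancer record and subject 4 has no demographics, yet all four subjects end up with a non-missing dude.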

50 Checking for ID Numbers with SQL
• A task that I need to do frequently is to build a list of all subject IDs when data is coming from multiple sources.
  – List IDs with duplicates.
  – List unique ID numbers.
  – List who is in both files.
  – List who is in one file but not the second.
  – Make a summary showing all IDs and an indicator for who appears where.

51 PROC SQL - Set Operators (NO GUI)
• Outer Union Corresponding – concatenates
• Union – unique rows from both queries
• Except – rows that are in the first query but not in the second
• Intersect – rows common to both queries

52 outer union corresponding
• You can concatenate data files.
• I rarely use it.

    proc sql;
      create table isOuter as
      select dude from baseline
      outer union corresponding
      select dude from followup;
    quit;

53 union
• You can also concatenate data files and keep unique records:

    proc sql;
      create table isUnion as
      select dude from baseline
      union
      select dude from followup;
    quit;

54 except
• Say you needed everyone who did not come back. Start out with the baseline group and remove the people who came back.

    proc sql;
      select id from baseline
      except
      select id from followup;
    quit;

55 intersect
• Say you wanted to know who came back. In other words, what IDs are in both files?

    proc sql;
      select id from baseline
      intersect
      select id from followup;
    quit;

56 PROC SQL - Set Operators
• When you have tables (with more than one column) with the same structure, you can combine them with these set operators.
  – Be extremely careful because SAS/SQL is forgiving about the structure of the tables and you may not notice problems in the data.
  – For this to work as intended, the two tables must have the same variables, in the same order, and the variables must be of the same type (variables with the same name must both be character or both be numeric). Use the key word corresponding to have it match like-named variables.

57 corresponding
• The columns do not need to have matching names or even the same length and it will still operate on them.
• Use corresponding to help spot this problem.

58 Summary Table
• Say you have two or more files and they are supposed to have the same subject IDs. How do you make a summary table showing who has information in each table?
  – Make a master list that has all people regardless of the source file.
  – Add an indicator column with the value 1 where the subject ID in a table matches the master ID table.
  – Add in a second column with the value 1 where the subject ID in the second file matches the master ID.

59 Make Some Data
• Make a file with 100 random numbers between 1 and 100 (you can get the same number more than once) and an indicator variable holding the value 1. Sort the data and remove the duplicates.

60 In Code
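One way this could be coded, as a sketch only. The names day1, id and inDay1, the seed, and the use of the rand function are assumptions; the original slide may have used a different random number function.

```sas
data day1;
  call streaminit(223); /* arbitrary seed so the numbers are repeatable */
  do i = 1 to 100;
    id = ceil(100 * rand("uniform")); /* random integer from 1 to 100 */
    inDay1 = 1;                       /* indicator variable */
    output;
  end;
  drop i;
run;

/* sort by id and keep one record per id */
proc sort data = day1 nodupkey;
  by id;
run;
```

Because the same number can come up more than once, the sorted file will usually have fewer than 100 records.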

61 Subquery
• In real life the tables that you are comparing will not include a convenient variable that is holding “1”. You can have SQL make a new variable easily enough:
  – Notice there is no column indicating it is inDay1.
  – Add in a column called inDay1 with the value 1 for everyone.

62 Subquery
• You can do this with a single query.

63 Order
• Notice that the data is never put into order. In this case, it ended up ordered correctly because of the union operator. You can explicitly request having the data sorted so you do not need to use the Data > Sort Data… menu. Just add an order by clause.
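The summary table idea could be sketched as a single query like the one below. The tables day1 and day2 and their contents are made up, and the in-line views (subqueries in the from clause) are just one plausible way to structure it.

```sas
/* Hypothetical ID lists from two sources, with no indicator columns */
data day1;
  input id;
  datalines;
1
2
3
5
;
run;

data day2;
  input id;
  datalines;
2
3
4
;
run;

proc sql;
  create table summary as
  select m.id, a.inDay1, b.inDay2
  from (select id from day1
        union
        select id from day2) as m          /* master list of all IDs */
       left join
       (select id, 1 as inDay1 from day1) as a
         on m.id = a.id                    /* indicator for the first file */
       left join
       (select id, 1 as inDay2 from day2) as b
         on m.id = b.id                    /* indicator for the second file */
  order by m.id;
quit;
```

In the result, IDs 2 and 3 have a 1 in both indicator columns, IDs 1 and 5 only in inDay1, and ID 4 only in inDay2. The order by clause makes the sort explicit instead of relying on a side effect of union.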

64 Working with Repeated Keys
• A file tracking diagnoses or treatments will have multiple records for some people.
  – If you want to count the number of records for a person, specify what variable(s) are used to group by.
  – Count records in the group with count(*) or count non-missing values with count(variableName).
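A minimal sketch of the two kinds of counts, assuming a hypothetical diagnosis file called dx with columns id and dx:

```sas
/* Hypothetical diagnosis file: several records per person, some codes missing */
data dx;
  input id dx;
  datalines;
1 22
1 31
2 22
2 .
2 14
3 .
;
run;

proc sql;
  select id,
         count(*)  as nRecords, /* all records in the group */
         count(dx) as nDx       /* only records with a non-missing dx */
  from dx
  group by id;
quit;
```

For this data, nRecords is 2, 3 and 1 for IDs 1, 2 and 3, while nDx is 2, 2 and 0.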

65 Joining on Duplicated Keys
• If you join tables that have duplicated key values, you will end up with lots of records. Specifically, each key value contributes the product of its record counts in the two tables, and the new table holds the sum of those products.
  – 2 in appt * 2 in dx = 4 records
  – 2 in appt * 4 in dx = 8 records

66 distinct
• The word distinct removes duplicates.
• If you want the IDs of people who had any records in both tables, use distinct.
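To make the arithmetic concrete, here is a sketch with two hypothetical tables, appt and dx, set up to match the counts above: ID 1 has 2 appointments and 2 diagnoses, and ID 2 has 2 appointments and 4 diagnoses. The columns visit and dx are invented.

```sas
data appt;
  input id visit;
  datalines;
1 1
1 2
2 1
2 2
;
run;

data dx;
  input id dx;
  datalines;
1 22
1 31
2 22
2 14
2 31
2 40
;
run;

proc sql;
  /* every appt record pairs with every dx record that has the same id */
  create table joined as
  select a.id, a.visit, d.dx
  from appt as a inner join dx as d
    on a.id = d.id;

  select count(*) as nRows from joined;

  /* one row per person who appears in both tables */
  select distinct a.id
  from appt as a inner join dx as d
    on a.id = d.id;
quit;
```

The joined table has 4 + 8 = 12 rows, while the distinct query returns just the two IDs.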

67 Joint Keys
• Sometimes you need to use more than one variable to indicate which records match across tables. For example, if you use both pedigree numbers and family member numbers in tables to identify people, you need to use both of those variables to join the tables.
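A sketch of a join on two key variables. The tables people and diagnoses and the columns pedigreeID, memberID, age and dx are all hypothetical names chosen for the example.

```sas
data people;
  input pedigreeID memberID age;
  datalines;
1 1 54
1 2 29
2 1 61
;
run;

data diagnoses;
  input pedigreeID memberID dx;
  datalines;
1 2 22
2 1 31
2 1 14
;
run;

proc sql;
  create table linked as
  select p.pedigreeID, p.memberID, p.age, d.dx
  from people as p inner join diagnoses as d
    on  p.pedigreeID = d.pedigreeID
    and p.memberID   = d.memberID; /* both parts of the key must match */
quit;
```

Matching on memberID alone would wrongly link member 1 of pedigree 1 to the diagnoses of member 1 of pedigree 2, which is why both conditions go in the on clause.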

