Download presentation
Presentation is loading. Please wait.
1
SQL and SAS Russell.Hendel@Cms.hhs.govRussell.Hendel@Cms.hhs.gov, Ph.d A.S.A Russell.Hendel@Cms.hhs.gov SAS DAY – Oct 31 2007
2
OVERVIEW - GOALS What is SQL? When do you use it? Virtues of SQL over SAS – readability, ease of learning readability, ease of learning Virtues of SAS over SQL – speed,compactness How can SQL help me when thinking about requirements for projects? (For laypeople) NO PREREQUISITES NEEDED
3
METHOD TO PRESENT We will use a THEMATIC presentation We will take one project-the THEME And show how each step can be done in SQL and SAS Along the way we will learn everything we need to know about SQL and its virtues
4
THEMATIC PROJECT START FILE File with Medicare Eligibles Contains state, county, codes (SSA and FIPS), zip, gender, agegroup, Pt A entitlement, Pt A term, Pt B entitlement, Part B term, Eligible count for this record
5
THEMATIC PROJECT END FILE Want Subtotals and aggregation: Official State, County, Code list used in Office of Actuary, along with All Part D eligibles (Entitled to Part A or B) May also want Part A and B eligibles (Part C) Goal of presentation: To get FROM the START TO the END
6
SMALL OVERVIEW OF SQL What is sql? SQL is a computer language designed for databases SQL is a summary of several dozen attempts to create database computer languages It was found that all such languages had 8 items in common. SQL has these 8 items and is a COMPUTER STANDARD for any database language Each database language should be SQL compliant (DB2, SAS, ORACLE, etc)
7
OVERVIEW OF SQL A database FILE is simply a file with ROWS and COLUMNS Each ROW is a RECORD Each COLUMN is a FIELD The most typical example is say the children in your family Each child would have Fields associated with them FirstName, Age, School they attend, Birthday etc This information is STORED in a rectangular table
8
OVERVIEW OF SQL A DATABASE is a collection of DATABASE FILES For example besides your database file for your family you may have a database from the yellow pages which lists for each age group stores carrying clothing, toys and other FIELDS You can use the DATABASE to find out what stores are good for each child to buy what they want
9
OVERVIEW OF SQL OVERVIEW OF SQL We just explained what a DATABASE is We also gave a simple example Now we explain what a DATABASE LANGUAGE is That is we explain what you would want to do with your database files It turns out there are 8 operations that every database should be able to do We will give an alternate version of these 8 basic operations that are used in practice
10
SQL OVERVIEW A database language should enable you to 1)Take a SUBSET OF ROWS of a database file (For example: All children in your family under 5) In sql we call this SELECTING 2) Take a SUBSET OF COLUMNS (e.g. FirstName and School) In SQL we call this PROJECTING 3) Make a new database file by UNIONing two database files with common fields
11
SQL OVERVIEW 4) JOIN two tables by keeping the number of rows the same but adding columns(For example take my family database file and add a column showing the store where to buy clothing for that child) 5) Suppose you take your FAMILY DATABASE FILE and JOIN to it your EXPENSES for each child(Another database file). You might want to SUBTOTAL expenses by GENDER
12
SQL OVERVIEW In other words: You might want a new table with two rows that tells you BOY – SO MANY $; GIRLS – SO MANY $ In SQL we refer to these as AGGREGATE FUNCTIONS. SQL allows 5 types of AGGREGATION: Sum, count, average, min, max
13
SQL OVERVIEW SUMMARY: Every database file should allow you to SELECT certain rows (e.g. children <5) PROJECT columns (e.g. School data) UNION (Add rows) JOIN (Add columns (e.g. add cloth stores AGGREGATE FUNCTIONS (sum,…) The good news: Only need to learn 5 pieces of code – Great advantage of SQL-ease of learning
14
HOW DO YOU DO A PROJECT To do a project you START with certain FIELDS Decide what FIELDS you want to END Decide how to GO from START to END At each step you can do 1 of 5 things SELECT,PROJECT,JOIN, UNION, AGGREGATE
15
QUERY OPTIMIZATION In general each project will allow SEVERAL ways to go from START to END Database experts have identified rules that OPTIMIZE projects The basic rule is Do SELECT,PROJECT early Do JOIN, AGGREGATE later
16
OUR PROJECT: STEP 1 Step 1 in ANY project is data cleaning I have all this data on eligibles. But is the data all OK One way of checking is listing the STATES involved. I will then DELETE those states that are BAD This is a SELECTION (reduction of ROWS) and uses the OPTIMIZATION RULE (SELECT early)
17
OUR PROJECT STEP 1 STEP 1 is done the same whether in SAS or SQL. Here is the code. We will explain the underlined keywords proc sql; create table d.state as select distinct substr(sc,1,2) as st from d.Start; ;quit; Run;
18
OUR PROJECT: Step 2 The output of the code gives me a list of all states codes in my eligible file I find two codes I don’t recognize 00 99 These are codes for badly coded records About 1% of the records are coded this way This is actually quite normal I will now get rid of them using a SELECT
19
PROJET: STEP 2 SELECT PROGRAM to delete bad states proc sql; create table d.Start2 as select distinct * from d.Start where substr(sc,1,2) not in ('99','00'); quit;
20
PROJECT: Step 2 SAS Here is the SAS code accomplishing the same function – Notice how the SQL and SAS are about the same in readability and compactness data d.Start2; set d.Start; if substr(sc,1,2) in ('99','00') then delete; run; SAS in general is usually more compact; SQL is more readable (EXERCISE: Why / How ?)
21
SQL JOINS - INTERLUDE PROBLEM: Many versions of SQL Each one slightly different Differences lie in JOIN operation Best not to rely on documentation Here are some simple programs We test the 5 types of JOINS JOIN, INNER JOIN, OUT JOIN, LEFT JOIN and RIGHT JOIN
22
SQL JOINS We need two datasets TEMP and TEMP2 each with two fields. Here is TEMP which has 2 columns A,B data d.temp; input A B;
23
SQL JOINS Here is the CARD statement for TEMP cards; 1 1 1 2 1 3 Etc. FIELD / COL A can be 1,2,3 FIELD colum B can be 1,2,3 We get 9 records
24
SQL JOIN Data set TEMP2 data d.temp2; input A C; cards; 1 7 2 8 4 9 ;
25
SQL JOIN What should you notice TEMP.A has 3 values 1,2,3 TEMP2.A has 3 values 1,2,4 TEMP.A has values not in TEMP2.A TEMP2.A has values not in TEMP There are also common values Can you the audience name the above
26
SQL JOIN What else should you notice TEMP has COLUMNS A,B TEMP2 has COLUMNS A,C The two files have a COMMON COLUMN The two files also have their own columns The common column allows a JOIN Think back to CLOTH stores for KIDS What was the field we JOINED ON (ANS=)
27
SQL JOIN Here is code for an SQL JOIN proc sql; create table d.temp3 as select * from d.temp join d.temp2 on temp.A = temp2.A; quit;
28
SQL JOIN The OUTPUT of the JOIN query is a table A B C Joined 1 1 7 1 2 8 Temp Temp2 1 3 9 (1,3) join (1,9) The table only has A=1,2. A cannot equal 3,4 The JOIN in SQL SAS takes only A values common to BOTH tables and leaves out A values in only one table
29
SQL JOIN In other languages the JOIN behaves differently When you want only values common to both files you use the INNER JOIN The INNER JOIN query has identical language to the JOIN query except that the word JOIN is replaced by INNER JOIN. In SAS the output is the same
30
SQL JOIN Suppose I wanted all values from my TEMP table whether or not they are linked in my temp2 table Is this reasonable? Sure it is. Think back to our example You want all your children listed WHETHER OR NOT there is store to buy clothing in
31
SQL JOIN So you want the A column in the JOIN table to have all A values in the TEMP table whether or not they have values in the TEMP2 table We call this a LEFT JOIN because in describing the JOIN OF TEMP WITH TEMP2, “TEMP” is on the LEFT and “TEMP2” is on the right. Neat!?!
32
SQL JOIN You make a LEFT JOIN query with identical language to a JOIN query except that the word JOIN is replaced by LEFT JOIN A typical record in the LEFT JOIN query with values from TEMP but not from TEMP2 could look like this A=3 B=1 C=. Here the “.” in the C column indicates a MISSING value. If the field was not numeric the C col is BLANK
33
SQL JOIN The RIGHT JOIN query has all A values from the TEMP2 table even if they don’t occur in the TEMP1 table The language is identical except you use the word RIGHT JOIN The output has the PERIOD (or blank) for missing values.
34
PROJECT: Step 3 So far I removed badly coded states (00,99) I want my final table to only list states and counties found in the office of the actuary tables for which we have rates So I have to do some more ROW reduction (SELECTION) How do I accomplish this
35
PROJECT STEP 3 Well I take my START2 table I will JOIN START2 (add columns) from the Office of Actuary Table The JOIN column will be _________? The STATE field (or its code) What type of JOIN? Well I don’t want anything not in the ACTUARY table and I don’t want anything not in my START2 So I want an INNER JOIN (or JOIN in SAS)
36
PROJECT: STEP 3 The SQL codes is as follows proc sql; create table d.START3 as SELECT * FROM d.Start2 Inner Join d.Actuary ON Start2.sc = Actuary.sc; quit;
37
PROJECT STEP 3 I now have in my file All records with eligible counts Where the state and county codes are also recognized in the actuary file I am now in a position to do some more subsetting and get my totals
38
PROJECT STEP 3 But am I done with step 3? Shouldn’t I look at the data deleted That is: Shouldn’t I look at the records whose county codes arent in the actuary file I can do this doing some queries as follows
39
PROJECT: STEP 3 proc sql; CREATE table d.Start3LeftOut as SELECT * FROM d.Start2 Left Join d.Actuary ON Start2.sc = Actuary.sc ; CREATE table d.Start3LeftOut as CREATE table d.Start3LeftOut as select * from d.Start3LeftOut where d.Start3LeftOut.state = ' ‘; quit;
40
PROJECT: STEP 3 IN WORDS I Left Join my original file with the actuary I then inspect those records with a blank or period Those are precisely the records that could not be matched I now review these codes or subtotal them to find out how bad the data is
41
PROJECT STEP 3 Although SQL is READABLE and carries across many platforms, SAS is much more compact. Here is the SAS code to accomplish all the preceding; SAS however requires sorting the data first SAS also requires using the IN variable This is technical sas code which should only be used by people who are familiar with it
42
PROJECT STEP 3 proc sort data=d.Actuary;by sc; run; proc sort data=d.Start2; by sc; run; data d.Start3 d.Start3a d.Start3b; merge d.Start3(in=t1) d.Actuary(in=t2); by sc; if t1 and t2 then output d.Start3; if t1 and not t2 then output d.Start3A; if not t1 and t2 then output d.Start3B; run;
43
PROJECT: STEP 4 What next? Remember the OPTIMIZE golden rule SELECT / PROJECT early GET RID Of unwanted ROWS/COL We got rid of the unwanted ROWS So next we get rid of unwanted COLS
44
PROJECT STEP 4 SAS CODE (Compact) data d.Start4; set d.Start3; keep sc st county elig fips fg fa fat fb fbt;run; SQL CODE (Slightly more readable?); PROC SQL; CREATE TABLE d.Start3 AS SELECT sc,st,county,elig,fips,fg,fa,fat,fb,fbt FROM d.Start3; QUIT;
45
PROJECT STEP 5 Next step is given as an exercise I explained the importance of LOOKING at the data vs. EXPECTING it TO LOOK a certain way So I should CHECK on the various possibliities of A,B Entitlement Term; This is similar to checking on the STATES It allows me to see all the cases I am revuing QUIZ: What are the key words to be used?
46
PROJECT STEP 6 I have done all my row/col reductions I am now ready to SUBTOTAL by STATE COUNTY (I don’t want to subtotal by zip) The sas code is very compact /technical (Fa, fat=Pt A entitlement flag, Part A “T”erm;) PROC MEANS Data=d.Start4 noprint; var elig; by sc st county fips fa fat fb fbt; output out=d.Start6 SUM; run; data d.Start6;set d.Start6 ; drop _Type_ _Freq_;
47
PROJECT STEP 6 Here is the more readable SQL code proc sql; create table d.Start6 as select sc,st,county,fips,FA,FAT,FB, FBT,,SUM(Elig) as Elig from d.Start4 group by sc,st,county,fips,FA,FAT,FB,FBT order by elig desc; Note New keywords (SUM, GROUP BY, ORDER)
48
PROJECT STEP 7 What next? Well we have ENTITLEMENT STATUSES and TERM STATUSES But I want PART D status Part D = Entitled to A or B = Pt A status=Yes(1) and Part A term=0 OR Pt B status=Yes and Part B term=0 Need code to make a new field Will only give SQL version since late Might also want (to add) A and B for which we use AND vs OR
49
PROJECT: STEP 7 PROC SQL;create table d.Start7 as select *, CASE WHEN (fa='1' and fat='0') and (fb='1' and fbt='0') and (fb='1' and fbt='0') THEN 'A and B' WHEN (fa='1' and fat='0') or (fb='1' and fbt='0') or (fb='1' and fbt='0') THEN 'A or B' ELSE ' ' END as STATUS from d.Start6; QUIT; RUN;
50
PROJECT COMPLETION There are two steps left First: We have to ONLY select rows where the status is A and B or A or B (depending what we want). Exercise: What is the KEYWORD? What is a previous example? LASTLY: We have to AGGREGATE AGAIN (Why again???) What do we GROUP BY? What do we AGGREGATE?
51
SUMMARY Databases consist of collections of files Each file is row/record x column/field You only have to know 5 things SELECT(rows);PROJECT(Columns); UNION(Rows);JOIN(columns); AGGREGATE Project=Sequence Queries (SQL!) PROJECT/SELECT early;JOIN/AGG late Identify START and END files at Project begin Be sure to check data as you proceed Sometimes you DEFINE new fields.
52
Advanced Reference SQL Processing with the SAS System; Book Code 58231 SAS INSTITUTE 13MAY92, CARY NC Excellent REFERENCE TIP To review/see ex any PARTICULAR SQL keyword like CASE, JOIN, simply GOOGLE SQL To review/see ex any PARTICULAR SQL keyword like CASE, JOIN, simply GOOGLE SQL Note: Overview given here (optimazation theory and 5 basic operations is not on web It should guide you in all you do
Similar presentations
© 2024 SlidePlayer.com Inc.
All rights reserved.