Introduction to SAS Essentials Mastering SAS for Data Analytics


Introduction to SAS Essentials Mastering SAS for Data Analytics Alan Elliott and Wayne Woodward

Intro to SAS, Chapter 2. Class 1, Session 2. Alan Elliott

LEARNING OBJECTIVES
To enter data using freeform list input
To enter data using the compact method
To enter data using column input
To enter data using formatted input
To enter data using the INFILE technique
To enter multiple-line data

2.1 Using SAS Data Steps The DATA step is used to define a SAS data set, and to manipulate it and prepare it for analysis. In the SAS language, the DATA statement signals the creation of a new data set. For example, a DATA statement in SAS code may look like this: DATA MYDATA;

DATA Statement Signals the beginning of the DATA step. Assigns a name (of your choice) to the data set created in the DATA Step. The general form of the DATA statement is: DATA datasetname;

Temporary Data Set Names The SAS “datasetname” in the DATA statement can have several forms: The SAS datasetname can be a single name (used for temporary data sets, kept only during current SAS session). For example:   DATA EXAMPLE; DATA OCT2007;

Permanent Data Set Names Or, the SAS datasetname can be a two-part name. The two-part name tells SAS that this permanent data set will be stored on disk beyond the current SAS session in a SAS “library” indicated by the prefix name. For example: DATA RESEARCH.JAN2016; DATA CLASS.EX1; DATA JONES.EXP1;
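
A two-part name only works once the library prefix has been tied to a folder, which is normally done with a LIBNAME statement. The following is a minimal sketch, not taken from the slides: the library name MYLIB, the data set name EX1, and the folder C:\SASDATA are assumptions chosen for illustration, and the folder must already exist.

```sas
* Associate the library name MYLIB with a folder (assumed path);
LIBNAME MYLIB 'C:\SASDATA';

* The two-part name stores the data set permanently in that folder;
DATA MYLIB.EX1;
   INPUT ID $ AGE;
   DATALINES;
001 15
002 25
;
RUN;
```

After this runs, the data set remains on disk when the SAS session ends, unlike a data set created with a one-part name.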

Windows Data Set Names A SAS data set name can also refer directly to the Windows name of a file on your hard disk. For example: DATA "C:\SASDATA\SOMEDATA"; DATA "C:\MYFILES\DECEMBER\MEASLES"; (Note that these refer to SAS data set files with .sas7bdat extensions, that is, SOMEDATA.SAS7BDAT and MEASLES.SAS7BDAT.)

Tasks done within the DATA Step
DATA datasetname;
<code that defines the variables in the data set>;
<code to enter data>;
<code to create new variables>;
<code to assign missing values>;
<code to output data>;
<code to assign labels to variables>;
<and other data tasks>;

2.2 Understanding SAS data set structure
          ID   SBP  DBP  GENDER  AGE  WT
obs 1     001  120  80   M       15   115
obs 2     002  130  70   F       25   180
...
obs 100   100  125  80   F       20   110
Columns are data variables (each named) and rows are subjects, observations, or records.

Columns and Rows Each column represents a variable and is designated with a variable name (ID, SBP, etc.). Every SAS variable (column) must have a name, and the names must follow certain naming rules. Each row, marked here as obs 1, obs 2, etc., indicates an observation or record. An observation consists of data observed from one subject or entity.

2.3 Rules for SAS variable names
SAS variable names:
must be 1-32 characters long and must not include any blanks.
must start with one of the letters A through Z or the underscore (_).
may include numbers (but not as the first character in the name).
may include upper and lower case characters (variable names are case insensitive).
should be descriptive of the variable (optional but recommended).

Correct & Incorrect variable names
Correct SAS variable names are:
GENDER
AGE_IN_1999
AGEin1999
_OUTCOME_
HEIGHT_IN_CM
WT_IN_LBS
Incorrect SAS variable names are (WHY?):
AGE IN 2000
2000MeaslesCount
S S Number
Question 5
WEIGHT IN KG
AGE-In-2000

2.4 Understanding SAS Variable Types Numeric Variables (Default): A numeric variable is used to designate values that could be used in arithmetic calculations or are grouping codes for categorical variables. For example, the variables SBP (systolic blood pressure), AGE, and WEIGHT are numeric. However, an ID number, phone number, or Social Security number should not be designated as a numeric variable. For one thing, you typically would not want to use them in a calculation. Moreover, for ID numbers, if ID = 00012 were stored as a number, it would lose the zeros and become 12.

Character (Text, String) Variables: Character variables are used for values that are not used in arithmetic calculations. For example, a variable that uses M and F as codes for gender would be a character variable. For character variables, case matters, because to the computer a lowercase f is a different character from an uppercase F. It is important to note that a character variable may contain numerical digits. As mentioned previously, a Social Security number (e.g., 450-67-7823) or an ID number (e.g., 143212) should be designated as a character variable because their values should never be used in mathematical calculations. When designating a character variable in SAS, you must indicate to SAS that it is of character type. This is shown in upcoming examples.
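
The effect of the variable type on an ID value can be seen in a short sketch. The data set name IDDEMO and the variable names IDNUM and IDCHAR are made up for this illustration; the same value, 00012, is read once as a number and once as a character string.

```sas
DATA IDDEMO;
   * IDNUM is read as numeric and IDCHAR as character because of the $;
   INPUT IDNUM IDCHAR $;
   DATALINES;
00012 00012
;
RUN;

PROC PRINT DATA=IDDEMO;
RUN;
```

In the listing, IDNUM appears as 12 while IDCHAR keeps its leading zeros as 00012.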

Date Variables: A date value may be entered into SAS using a variety of formats, such as 10/15/09, 01/05/2010, JAN052010, and so on. As you will see in upcoming examples, dates are handled in SAS using format specifications that tell SAS how to read or display the date values. For more information about dates in SAS, see Appendix B. Technically, dates are integers. We'll learn more about dates, and how to manipulate them, later.
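
As a hedged illustration of dates being stored as integers, the sketch below reads one date with the MMDDYY10. specification and displays it with DATE9. The data set name DATEDEMO and the variables VISIT and RAWVALUE are assumptions for this example only.

```sas
DATA DATEDEMO;
   * MMDDYY10. is used here as an informat to read the date;
   INPUT @1 VISIT MMDDYY10.;
   * Keep an unformatted copy to show the underlying integer;
   RAWVALUE = VISIT;
   * DATE9. is used here as a format to display the date;
   FORMAT VISIT DATE9.;
   DATALINES;
01/05/2010
;
RUN;

PROC PRINT DATA=DATEDEMO;
RUN;
```

VISIT prints as 05JAN2010, while RAWVALUE prints as a plain integer, the number of days since January 1, 1960.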

2.5 Methods of reading data into SAS Reading data using freeform list input Reading data using the compact method Reading data using column input Reading data using formatted input.

Freeform Data Entry Open the file DFREEFORM.SAS:
DATA MYDATA;
INPUT ID $ SBP DBP GENDER $ AGE WT;
DATALINES;
001 120 80 M 15 115
002 130 70 F 25 180
003 140 100 M 89 170
004 120 80 F 30 150
005 125 80 F 20 110
;
PROC PRINT;
RUN;
The INPUT statement defines the variables (character variables are designated by $ after the name). DATALINES indicates that data are listed next. The data must match the INPUT statement: the same number of values per line, separated with blanks. The data end with a semicolon on a line of its own.

Data set created from code
Obs  ID   SBP  DBP  GENDER  AGE  WT
1    001  120  80   M       15   115
2    002  130  70   F       25   180
3    003  140  100  M       89   170
4    004  120  80   F       30   150
5    005  125  80   F       20   110

Advantages of freeform list input Easy, very little to specify. No rigid column positions which makes data entry easy. If you have a data set where the data are separated by blanks, this is the quickest way to get your data into SAS.

Restrictions for freeform list input
Every variable on each data line must be in the order specified by the INPUT statement.
Fields must be separated by at least one blank.
Blank spaces representing missing variables are not allowed. Having a blank space in the data causes values to be out of sync. If there are missing values in the data, a dot (.) should be placed in the position of that variable in the data line. For example, a data line with AGE missing might read: 4 120 80 F . 150
No embedded blanks are allowed within the data value for a character field, like MR ED.
A character field has a default maximum length of 8 characters in freeform input.
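
The missing-value rule can be tried with a small variation on the freeform example. This is a sketch only: the data set name MISSDEMO is hypothetical, and the second data line uses a dot in the AGE position.

```sas
DATA MISSDEMO;
   INPUT ID $ SBP DBP GENDER $ AGE WT;
   DATALINES;
003 140 100 M 89 170
004 120 80 F . 150
;
RUN;

PROC PRINT DATA=MISSDEMO;
RUN;
```

Because the dot holds the place of AGE, the value 150 is still assigned to WT rather than shifting into the AGE column.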

Compact Data Format The @@ symbol in the INPUT statement tells SAS to keep reading from the same data line, so several observations can be placed on each line. You must be careful that the data match the input definition.
DATA WEIGHT;
INPUT TREATMENT LOSS @@;
DATALINES;
1 1.0 1 3.0 1 -1.0 1 1.5 1 0.5 1 3.5
2 4.5 2 6.0 2 3.5 2 7.5 2 7.0 2 6.0 2 5.5
3 1.5 3 -2.5 3 -0.5 3 1.0 3 .5
;
PROC PRINT;
RUN;

Hands-On Example p 27 1. Open the program file DFREEFORM.SAS. (The code was shown above.) Run the program. 2. Observe that the output listing (shown here) illustrates the same information as in Table 2.4. Etc…

Column Input (This is DCOLUMN.SAS)
DATA MYDATA;
INPUT ID $ 1-3 SBP 4-6 DBP 7-9 GENDER $ 10 AGE 11-12 WT 13-15;
DATALINES;
001120 80M15115
002130 70F25180
003140100M89170
004120 80F30150
005125 80F20110
;
RUN;
Note how data are in specific columns. In the INPUT statement, the columns are specified by the ranges following a variable name.
INPUT variable startcol-endcol ...;

Advantages of column input Data fields can be defined and read in any order in the INPUT statement, and unneeded columns of data can be skipped. This is handy when your data set (perhaps downloaded from a large database) contains variables you're not interested in using. Blanks are not needed to separate fields. Character values can be much longer than the 8-character default of list input. For example: INPUT DIAGNOSE $ 1-200; For character data values, embedded blanks are no problem, e.g., John Smith.

Rules and restrictions for column Input Data values must be in fixed column positions. Blank fields are read as missing. A character value may sit anywhere within its columns; leading blanks are dropped when it is read. Column input has more specifications than list input: you must specify the column ranges for each variable.
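
The points about reading order, skipped columns, and embedded blanks can be illustrated together. In this sketch the data set name COLDEMO, the variable names, and the data lines are all invented for illustration; columns 16-18 hold a code that is deliberately not read.

```sas
DATA COLDEMO;
   * AGE is read before NAME and columns 16-18 are skipped;
   INPUT AGE 19-20 NAME $ 1-15;
   DATALINES;
John Smith     XYZ34
Mary Ann Jones ABC 7
;
RUN;

PROC PRINT DATA=COLDEMO;
RUN;
```

NAME keeps its embedded blanks because the column range, not a blank, marks where the value ends. The data lines must start in column 1, since indenting them would shift every field.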

How SAS interprets column data INPUT GENDER $ 1-3; An M placed in column 1, column 2, or column 3 is read as M in every case.

Numbers in Columns INPUT X 1-6; The value may sit anywhere within columns 1-6.
Characters in columns 1-6    Read as
230                          230
23.0                         23.0
2.3E1                        2.3E1 or 23
23                           23
-23                          -23

Hands-On Exercise p 31 (DCOLUMN.SAS)
DATA MYDATA;
INPUT ID $ 1-3 SBP 4-6 DBP 7-9 GENDER $ 10 AGE 11-12 WT 13-15;
DATALINES;
001120 80M15115
002130 70F25180
003140100M89170
004120 80F30150
005125 80F20110
;
RUN;
PROC PRINT DATA=MYDATA;
RUN;

Reading Data Using Formatted Input
DATA MYDATA;
INPUT @col variable1 format. @col variable2 format. ...;
Example:
DATA MYDATA;
INPUT @1 SBP 3. @4 DBP 3. @7 GENDER $1. @8 WT 3. @12 OWE COMMA9.;
Note the difference in the INPUT statement.

Three Components In Formatted Input @1 SBP 3. The @ is the starting column “pointer” so @1 means start in column 1. The variable name. The “informat” defines what kind of data to input. In this case “3.” indicates an integer with at most 3 digits.

Input Formats (Table 2.6, p 33)
Informat     Meaning
5.           Five columns of data as numeric data.
$5.          Character variable with width 5, removing leading blanks.
$CHAR5.      Character variable with width 5, preserving leading blanks.
COMMA7.      Seven columns of numeric data; strips out any commas or dollar signs (i.e., $40,000 is read as 40000).
COMMA10.2    Reads 10 columns of numeric data with 2 decimal places (strips commas and dollar signs). $19,020.22 is read as 19020.22.
MMDDYY8.     Date as 01/12/16. (Watch out for Y2K issue.)
MMDDYY10.    Date as 04/07/2016.
DATE7.       Date as 20JUL16.
DATE9.       Date as 12JAN2016. (No Y2K issue.)

Informats & Formats In SAS, INFORMATS are used to read data into a SAS data set. FORMATS are used to specify how to output (write) data values. Most format specifications (such as MMDDYY10.) can be used as EITHER an informat or a format.

More about formats Formats must end with a dot (.) or a dot followed by a number.
5. : A number up to five columns wide with no decimals, so it could take on values from -9999 to 99999.
5.2 : A number up to five columns wide (counting the decimal point and any minus sign) with 2 decimals, so it could take on values from -9.99 to 99.99.
$5. : A character value of up to five characters, such as ABCDE, abcde, 12345 or (*&6%

Advantages & restrictions for using formatted input Advantages and restrictions are similar to those for column input. The primary difference is the ability to read in data using INFORMAT specifications, which is particularly handy for reading dates and dollar values.

Hands-On Exercise, p 36 (DINFORMAT.SAS)
DATA MYDATA;
INPUT @1 SBP 3. @4 DBP 3. @7 GENDER $1. @8 WT 3. @12 OWE COMMA9.;
DATALINES;
120 80M115 $5,431.00
130 70F180 $12,122
140100M170 7550
120 80F150 4,523.2
125 80F110 $1000.99
;
PROC PRINT DATA=MYDATA;
RUN;

Some Common Output Formats Compare these to the INPUT formats listed earlier.
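
To see an input format and an output format side by side, consider the following sketch. It is an illustration under assumptions: the data set name OWEDEMO and its two data lines are invented, COMMA9. is the informat from the earlier slides, and DOLLAR10.2 is a standard SAS output format chosen here for display.

```sas
DATA OWEDEMO;
   * COMMA9. strips dollar signs and commas while reading;
   INPUT @1 NAME $5. @7 OWE COMMA9.;
   DATALINES;
SMITH $5,431.00
JONES 12,122
;
RUN;

PROC PRINT DATA=OWEDEMO;
   * DOLLAR10.2 writes a dollar sign, commas, and 2 decimals;
   FORMAT OWE DOLLAR10.2;
RUN;
```

The values are stored as the plain numbers 5431 and 12122; the FORMAT statement changes only how they are displayed in the listing.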

Using the SAS INFORMAT Statement There is a SAS statement named INFORMAT that you could use in the freeform data entry case. For example (see p 37):
DATA PEOPLE;
INFORMAT LASTNAME FIRSTNAME $12. AGE 3. SCORE 4.2;
INPUT LASTNAME FIRSTNAME AGE SCORE;
DATALINES;
Lincoln George 35 3.45
Ryan Lacy 33 5.5
;
PROC PRINT DATA=PEOPLE;
RUN;
In this case, the INFORMAT statement can specify that a freeform text value is longer than the default eight characters.

Reading External Data Using INFILE Suppose you have text data in a file in this format:
101 A 12 22.3 25.3 28.2 30.6 5 0
102 A 11 22.8 27.5 33.3 35.8 5 0
104 B 12 22.8 30.0 32.8 31.0 4 0
110 A 12 18.5 26.0 29.0 27.9 5 1
How can you read this data into SAS?

INFILE Statement Instead of the DATALINES statement (followed by data), you use the INFILE statement:
DATA MYDATA;
INFILE 'C:\SASDATA\EXAMPLE.DAT';
INPUT ID $ 1-3 GP $ 5 AGE 6-9 TIME1 10-14 TIME2 15-19 TIME3 20-24;
RUN;
PROC MEANS;
RUN;
INFILE replaces DATALINES and defines where the data are located on disk. NOTE: There is no DATALINES statement. The first RUN statement is where the DATA step ends and the PROC step begins.

DATALINES vs INFILE When data are in the program code: Use DATALINES statement When data are read from external source: Use INFILE statement. ALSO NOTE: Do not confuse INFILE with the INPUT statement.

Hands-On Example p 38 (DINFILE1.SAS)
DATA MYDATA;
INFILE 'C:\SASDATA\EXAMPLE.TXT';
INPUT ID $ 1-3 GP $ 5 AGE 6-9 TIME1 10-14 TIME2 15-19 TIME3 20-24;
PROC MEANS DATA=MYDATA;
RUN;

2.6 Going Deeper – More Techniques Reading Multiple Records per Observation If the data for each record extend over multiple lines, you can use more than one INPUT statement:
INPUT ID $ SEX $ AGE WT;
INPUT SBP DBP BLDCHL;
INPUT OBS1 OBS2 OBS3;
Or you can read the data using the / (advance) indicator:
INPUT ID $ SEX $ AGE WT/ SBP DBP BLDCHL/ OBS1 OBS2 OBS3;
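
The first approach, more than one INPUT statement, might look like the sketch below. The data set name TWOLINE is hypothetical, and only two lines per observation are used to keep the example short.

```sas
DATA TWOLINE;
   * Each INPUT statement reads the next line of the same observation;
   INPUT ID $ SEX $ AGE WT;
   INPUT SBP DBP BLDCHL;
   DATALINES;
10011 M 15 115
120 80 254
10012 F 25 180
130 70 240
;
RUN;

PROC PRINT DATA=TWOLINE;
RUN;
```

The four data lines produce two observations, each with seven variables.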

Hands-On Example p 40 (DMULTILINE.SAS)
DATA MYDATA;
INPUT ID $ SEX $ AGE WT/ SBP DBP BLDCHL/ OBS1 OBS2 OBS3;
DATALINES;
10011 M 15 115
120 80 254
15 65 102
10012 F 25 180
130 70 240
34 120 132
10013 M 89 170
140 100 279
19 89 111
;
PROC PRINT DATA=MYDATA;
RUN;

Input Pointer Controls

Using Advanced INFILE Options
DLM: This option allows you to define a delimiter to be something other than a blank. For example, if data are separated by commas, include the option DLM=',' in the INFILE statement.
FIRSTOBS=n: Tells SAS on what line you want it to start reading your raw data file. This is handy if your data file contains one or more header lines or if you want to skip the first portion of the data lines.
OBS=n: Indicates which line in your raw data file should be treated as the last record to be read by SAS.
See Table 2.13, p 41, for more options.
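
These options can be tried without an external file by pointing INFILE at the in-program data. The following is a sketch under assumptions: the data set name DLMDEMO and its three lines of comma-separated data are invented, and the first line stands in for a header row that FIRSTOBS=2 skips.

```sas
DATA DLMDEMO;
   * INFILE DATALINES applies INFILE options to in-program data;
   INFILE DATALINES DLM=',' FIRSTOBS=2;
   INPUT GROUP $ AGE TIME1;
   DATALINES;
GROUP,AGE,TIME1
A,12,22.3
B,11,22.8
;
RUN;

PROC PRINT DATA=DLMDEMO;
RUN;
```

Only the two lines after the header are read, giving two observations.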

Hands-On Example, p 42 (DINFILE2.SAS)
DATA MYDATA;
INFILE 'C:\SASDATA\EXAMPLE.CSV' DLM=',' FIRSTOBS=2 OBS=26;
INPUT GROUP $ AGE TIME1 TIME2 TIME3 TIME4 SOCIO;
PROC MEANS;
RUN;

Hands-On Exercise p 43 (DINFILE3.SAS) Note the strange delimiter. Open the program file DINFILE3.SAS.
DATA PLACES;
INFILE DATALINES DLMSTR='!~!';
INPUT CITY $ STATE $ ZIP;
DATALINES;
DALLAS!~!TEXAS!~!75208
LIHUE!~!HI!~!96766
MALIBU!~!CA!~!90265
;
PROC PRINT;
Etc…

Input a data set where there are blanks This is a ruler.

Do Hands-On Example p 44 (DINFILE4.SAS) Note TRUNCOVER.
DATA TEST;
INFILE "C:\SASDATA\DINFILEDAT.TXT" TRUNCOVER;
INPUT LAST $ 1-21 FIRST $ 22-30 ID $ 31-36 ROLE $ 37-44;
RUN;
PROC PRINT DATA=TEST;
RUN;

Summary One of the most powerful features of SAS, and a reason it is used in many research and corporate environments, is that it is very adaptable for reading in data from a number of sources. This chapter showed only the tip of the iceberg. More information about getting data into SAS is provided in the following chapter. For more advanced techniques, see the SAS documentation.

These slides are based on the book: Introduction to SAS Essentials Mastering SAS for Data Analytics, 2nd Edition By Alan C. Elliott and Wayne A. Woodward Paperback: 512 pages Publisher: Wiley; 2 edition (August 3, 2015) Language: English ISBN-10: 111904216X ISBN-13: 978-1119042167 These slides are provided for you to use to teach SAS using this book. Feel free to modify them for your own needs. Please send comments about errors in the slides (or suggestions for improvements) to acelliott@smu.edu. Thanks.