Chapter 15: Combining Data Horizontally
STAT 541 © Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South Carolina.

2 Terminology
Table lookup
Base table
Lookup tables
Lookup values

3 Working with Lookup Values Outside of SAS Data Sets
Lookup tables are not necessarily SAS data sets.
The following techniques can be used to hard-code lookup values into programs:
–IF-THEN/ELSE statements
–SAS arrays
–User-defined SAS formats
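Of the three techniques listed, the user-defined format is the only one without an example in this chapter. The following is a minimal sketch, assuming the same ID-to-X mapping used in the IF-THEN/ELSE example and a data set OLD that contains a numeric ID. The format name IDX is hypothetical.

```sas
/* Hypothetical format IDX: stores the lookup values as format labels */
proc format;
   value idx
      1     = '4'
      2     = '5'
      3     = '6'
      other = '.';    /* unmatched IDs become a missing value */
run;

data new;
   set old;                        /* OLD is assumed to contain a numeric ID */
   x = input(put(id, idx.), 8.);   /* look up the label, then convert it to a number */
run;
```

The lookup values live in one place (the PROC FORMAT step), so a change to the mapping does not require editing the DATA step logic.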

4 IF-THEN/ELSE Statement
Advantages: easy to use and to understand, versatile.
Disadvantages: Code requires maintenance. Lookup values might change. The number of statements might be very large, making the code inefficient both to execute and to maintain.

5 IF-THEN/ELSE Statement Example
   data new;
      set old;
      if id=1 then x=4;
      else if id=2 then x=5;
      else if id=3 then x=6;
   run;

   ID  X
   1   4
   2   5
   3   6

6 SAS Arrays
Lookup values can be hard-coded into the program or read into the array from a data set.
Array elements are referenced positionally.
Potential disadvantages: system memory requirements, only returns a single value per lookup operation, dimensions of the array must be known at compile time.
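The slides show lookup values hard-coded into an array; reading them from a data set might look like the sketch below. The data set KEY and its variables ITEM (1 to 3) and CORRECT are hypothetical, and ONE is assumed to hold the responses r1-r3 from the scoring example in this chapter.

```sas
/* Hypothetical data set KEY: one observation per item, with ITEM (1-3) and CORRECT ($1) */
data scored;
   array answer{3} $1 _temporary_;      /* dimension is fixed at compile time */

   /* On the first iteration only, copy the answer key into the array */
   if _n_ = 1 then do until (last);
      set key end=last;
      answer{item} = correct;
   end;

   set one;                             /* responses r1-r3 */
   array response{3} $ r1-r3;

   score = 0;
   do i = 1 to 3;
      if answer{i} = response{i} then score + 1;
   end;
   drop i item correct;
run;
```

Elements of a _TEMPORARY_ array are retained across iterations, which is why the key only needs to be loaded once.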

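7 Scoring Example with 1-Dimensional SAS Array

                      Item 1  Item 2  Item 3
   Response Variable  r1      r2      r3
   Answer Key         B       D       C

   data one;
      input name $4. +1 (r1-r3) ($1.);
      /* answer key, held in a temporary array */
      array answer {3} $1 _temporary_ ('B','D','C');
      /* responses read from the input record */
      array response r1-r3;
      score=0;
      /* add one point for each response that matches the key */
      do _i_=1 to 3;
         if answer{_i_}=response{_i_} then score+1;
      end;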

8 DATA Step match-merge
Familiar technique from STAT 540
Typically introduced as
–a one-to-one outer join
–a many-to-one match-merge of summary data
Not appropriate for a many-to-many match

9 DATA Step match-merge
BY variables must have the same name in every data set, but a variable can be renamed during execution (with the RENAME= data set option) so that the names match.
   proc sort data=a;
      by student;
   run;
   proc sort data=b;
      by name;
   run;
   data gradebook;
      /* rename NAME to STUDENT so both data sets share the BY variable */
      merge a(in=in_a) b(in=in_b rename=(name=student));
      by student;
      /* keep only students found in both data sets */
      if in_a and in_b;
   run;

10 DATA Step match-merge vs. PROC SQL
Match-merge
–Unlimited number of data sets
–More complex data management
PROC SQL
–No pre-sorting required
–No common variable names required

11 DATA Step match-merge vs. PROC SQL
Match-merge
–Program Data Vector (PDV) used to hold information while the DATA step executes
–Outputs first observation from each data set for each level of the BY group variable
PROC SQL
–Creates Cartesian product
–Eliminates ineligible cases in WHERE clause
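For comparison with the gradebook match-merge shown earlier, here is a sketch of the same inner join in PROC SQL. It assumes the same data sets A (with STUDENT) and B (with NAME); any other columns are unknown, so the SELECT list simply takes everything from both tables.

```sas
proc sql;
   create table gradebook as
   select a.*, b.*
   from a, b                      /* conceptually, every row of A paired with every row of B */
   where a.student = b.name;      /* keep only the pairs whose keys match */
quit;
```

No PROC SORT steps are needed, and the key columns do not have to share a name.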

12 DATA Step match-merge
The DATA step can be used for many-to-one match merges
–By exporting calculation of summary measures
–By computing summary measures within the DATA step itself
–STAT 540 example
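One way to read "exporting calculation of summary measures" is to compute the summary in a procedure and merge it back onto the detail records. The sketch below assumes a hypothetical data set SCORES with variables SECTION and GRADE; it is not the STAT 540 example itself.

```sas
/* Hypothetical data set SCORES: one observation per student, with SECTION and GRADE */
proc means data=scores noprint nway;
   class section;
   var grade;
   output out=section_means(drop=_type_ _freq_) mean=section_mean;
run;

proc sort data=scores;
   by section;
run;

/* Many-to-one merge: each section mean is attached to every student in that section */
data scores_centered;
   merge scores section_means;
   by section;
   deviation = grade - section_mean;
run;
```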

13 DATA Step match-merge
The DATA step tends to over-match on many-to-many match merges.
The text introduces a fix, but it’s cumbersome.

14 Using an Index to Combine Data
Useful when
–One of the data sets is much larger than the other
–The smaller data set contains all the cases of interest (e.g., a left/right join)
Appropriate for one-to-one matches only

15 Using an Index to Combine Data
Example
–For each observation read from Fall10ms, SAS uses the noobs index on Fall08 to find the Fall08 observation with the matching value of noobs.
–The smaller data set has to be read first so that its key value is available in the PDV for use by the index.
–_IORC_ (Input/Output Return Code) indicates whether a match for each record in the smaller data set was found.

16 Using an Index to Combine Data
Example
–Full Fall08 data set
–Fall10 Marine Science majors
   proc sql;
      create index noobs on fall08(noobs);
   quit;
   data msretro;
      set fall10ms;
      set fall08 key=noobs;
   run;
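The example above does not test _IORC_. A possible variant that does is sketched here; dropping the unmatched records is an assumption made for illustration, not something the slides prescribe.

```sas
data msretro;
   set fall10ms;                  /* smaller data set supplies the key value */
   set fall08 key=noobs;          /* indexed lookup in the larger data set */
   if _iorc_ = 0 then output;     /* zero means a match was found */
   else _error_ = 0;              /* no match: clear the error flag and skip the record */
run;
```

When no match is found, the Fall08 variables in the PDV still hold values from an earlier lookup, so unmatched records should be either skipped, as here, or have those variables reset before output.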

17 Using a Transactional Data Set
The base data set can be updated from a lookup table.
Both data sets have to be sorted.
The lookup table can have missing values for variables that are unchanged.
Be careful about “mixed” information (see example).
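In the DATA step, this kind of update is typically written with the UPDATE statement. The sketch below uses hypothetical data sets BASE and TRANS, both keyed by a hypothetical variable ID.

```sas
/* Hypothetical data sets: BASE (master) and TRANS (transactions), both keyed by ID */
proc sort data=base;
   by id;
run;
proc sort data=trans;
   by id;
run;

data base_updated;
   update base trans;   /* master first, then the transaction data set */
   by id;
run;
```

By default, a missing value in TRANS leaves the corresponding BASE value unchanged, and UPDATE expects each ID to appear only once in BASE.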