Introduction to Using the Data Step Hash Object with Large Data Sets
Richard Allen, Peak Stat


Introduction:

The DATA step hash object is an in-memory lookup table accessible from the DATA step. It is new in SAS version 9. A hash object is loaded with records and is available only within the DATA step in which it is created.

A hash record consists of two parts:
1. Key part: one or more character or numeric values. The key must be unique.
2. Data part: zero or more character or numeric values.

Lookup occurs by passing a key to the hash object's FIND method. If a record with that key is found, the data part of the hash record is copied into the DATA step record. The dataset does not need to be sorted or indexed when the hash object is loaded from it. The key lookup occurs in memory, which avoids costly disk access and speeds up the search considerably. Besides FIND, the hash object has ADD, REPLACE, REMOVE and OUTPUT methods for adding, replacing, removing and writing out records. The example that follows concentrates on the FIND method only.

Example:

Objective: Create a dataset with all pharmacy claims for all subjects in dataset A for the drugs of interest in dataset B.

Dataset A: Pharmacy claims with 9,008,585 records. File size is 3,352,049 KB (3.35 GB).
Dataset B: NDC dataset with drugs of interest. The full NDC file has 272,961 records and a file size of 85,489 KB (85.5 MB).

Dataset B_1: 61 records, file size 65 KB.

    data Drugs(rename=(NDCnum=NDC));
        set raw.redbook;
        if upcase(ProdNme)='AVONEX'
         | index(upcase(GenNme),'INTERFERON BETA-1A')>0
         | index(upcase(ProdNme),'LUPRON')>0
         | index(upcase(GenNme),'LEUPROLIDE')>0
         | index(upcase(ProdNme),'ZOLADEX')>0
         | index(upcase(GenNme),'GOSERELIN')>0;
    run;

Dataset B_2: 55,786 records, file size 44,641 KB.

    data Drugs(rename=(NDCnum=NDC));
        set raw.redbook(where=(TherCls<50));
    run;

Method 1: Sort and Merge

    proc sort data=Drugs;
        by NDC;
    run;

    proc sort data=in.Pharmacy(rename=(Rx_Prd_ID=NDC)) out=in_Pharmacy;
        by NDC;
    run;

    data Method1;
        merge in_Pharmacy(in=in1)
              Drugs(in=in2 keep=NDC ProdNme GenNme);
        by NDC;
        if in1 & in2;
    run;

B_1 as Drugs: 3,307 observations
B_2 as Drugs: 2,086,842 observations

Method 2: SQL join

    proc sql;
        create table Method2 as
        select d.*, p.*
        from in.Pharmacy as p
        join Drugs(keep=NDC ProdNme GenNme) as d
          on d.NDC=p.Rx_Prd_ID;
    quit;

Method 3: Format and restrict

    data Format;
        set Drugs end=EOF;
        type='I';
        fmtname='NDC';
        start=NDC;
        label=1;
        output;
        if EOF then do;
            hlo='o';
            label=0;
            output;
        end;
    run;

    proc format cntlin=Format;
    run;

    data Method3;
        set in.Pharmacy;
        if input(Rx_Prd_ID,NDC.)=1 then output;
    run;

Note on hlo='o': the range is 'other' for this formatted value (label=0).

Hashing

Beginning in version 9, SAS provides two predefined component objects for use in a DATA step: the hash object and the hash iterator object. With these objects you can quickly and efficiently store, search, and retrieve data based on lookup keys. One uses the DATA step Component Interface and object dot notation, ObjectName.Method(Parameters), to create and manipulate these component objects using statements, attributes, and methods.

Hash object keys and data are DATA step variables. Key and data values can be
1. directly assigned constant values
2. values from a SAS data set.
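The first of these two options, constant values, can be sketched as follows. This is a minimal illustration rather than part of the original slides: the NDC codes and product names are made up, and the variable rc is just a convenient holder for the method return codes.

```sas
data _null_;
    /* Host variables must exist in the PDV before the hash is defined */
    length NDC $11 ProdNme $50;

    declare hash h();            /* empty hash, no DATASET option */
    h.defineKey('NDC');
    h.defineData('ProdNme');
    h.defineDone();

    /* Load two hypothetical records from constants via the host variables */
    NDC = '00000000001'; ProdNme = 'DRUG A'; rc = h.add();
    NDC = '00000000002'; ProdNme = 'DRUG B'; rc = h.add();

    /* FIND returns 0 when the key exists and copies the data part to ProdNme */
    call missing(ProdNme);
    rc = h.find(key: '00000000002');
    if rc = 0 then put 'Found: ' ProdNme=;
    else put 'Key not found';
run;
```

The second option, loading from a SAS data set, is what the DATASET argument of the DECLARE statement does in Method 4.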

Hash Object

Creation
- DECLARE statement: instantiates the hash object definition

Methods
- ADD method
- CHECK method: determines whether a record with the key exists in the hash object
- DEFINEDATA method: specifies the data part of the hash object
- DEFINEDONE method: completes the definition of the hash object
- DEFINEKEY method: specifies the key part of the hash object
- DELETE method
- FIND method
- OUTPUT method
- REMOVE method
- REPLACE method

Attribute
- NUM_ITEMS attribute
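As a small sketch of two entries in this list that the main example does not use, the step below calls CHECK and reads NUM_ITEMS. The key values and the variables rc and n are assumptions made for illustration only.

```sas
data _null_;
    length NDC $11;

    declare hash h();
    h.defineKey('NDC');          /* key-only hash: no data part is defined */
    h.defineDone();

    NDC = '00000000001'; rc = h.add();   /* hypothetical key */

    /* NUM_ITEMS reports how many records the hash currently holds */
    n = h.num_items;
    put 'Records in hash: ' n;

    /* CHECK only tests for the key; unlike FIND it copies no data */
    rc = h.check(key: '00000000009');
    if rc ne 0 then put 'Key is not in the hash';
run;
```

CHECK is the natural choice when only the existence of a key matters, as in a pure filter.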

Hash Iterator Object

Methods
- FIRST method
- LAST method
- NEXT method
- PREV method

The hash iterator object is provided as a companion to the hash object. It cannot be defined without first defining a hash object. The iterator object enables you to step through the hash object records in forward or reverse key order without performing a key lookup.
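A minimal sketch of stepping through a hash with an iterator is shown below. It is not from the original slides; the records are invented, and the names hi and rc are arbitrary. NDC is listed in DEFINEDATA as well as DEFINEKEY so that the iterator can return the key value to the DATA step.

```sas
data _null_;
    length NDC $11 ProdNme $50;

    declare hash h(ordered: 'a');        /* keep records in ascending key order */
    h.defineKey('NDC');
    h.defineData('NDC', 'ProdNme');
    h.defineDone();
    declare hiter hi('h');               /* iterator tied to hash h */

    /* Hypothetical records, added out of key order on purpose */
    NDC = '00000000002'; ProdNme = 'DRUG B'; rc = h.add();
    NDC = '00000000001'; ProdNme = 'DRUG A'; rc = h.add();

    /* FIRST and NEXT return 0 while a record is available */
    rc = hi.first();
    do while (rc = 0);
        put NDC= ProdNme=;
        rc = hi.next();
    end;
run;
```

Replacing FIRST and NEXT with LAST and PREV would walk the same records in reverse key order.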

Creating a Hash object

1. Instantiate the hash object: Use the DECLARE statement with the keyword HASH followed by the hash name to create the hash object. Options are specified within the parentheses and include DATASET, to load the hash object from an existing SAS dataset, and ORDERED, to specify how the data is returned in key-value order.
2. Define the key part: Use the DEFINEKEY method to create a key of one or more variables. The key must be unique.
3. Define the data part: Use the DEFINEDATA method to specify zero or more data variables to be associated with the key variables.
4. Complete the definition of the hash object: Use the DEFINEDONE method to conclude the definition of the hash object.

Method 4: Hash object

    data Method4;
        length NDC $11 ProdNme GenNme $50;      /* creates the variables in the PDV */
        if _n_=1 then do;
            declare hash h(dataset:"Drugs");    /* 1. Create hash h using dataset Drugs */
            h.defineKey('NDC');                 /* 2. Define key part of h as the NDC variable */
            h.defineData('ProdNme','GenNme');   /* 3. Define data part of h */
            h.defineDone();                     /* 4. Complete the definition of hash h */
            call missing(ProdNme,GenNme);       /* suppresses uninitialized messages */
        end;
        set in.Pharmacy(rename=(Rx_Prd_ID=NDC));
        if h.find()=0 then output;
    run;

The LENGTH statement corrects errors: DefineKey and DefineData do not add variables to the PDV, so the LENGTH statement creates them there. If the LENGTH statement is not used, the following errors appear in the log:

    ERROR: Type mismatch for data variable ProdNme at line 28 column 5.
    ERROR: Hash data set load failed at line 28 column 5.
    ERROR: DATA STEP Component Object failure. Aborted during the EXECUTION phase.

CALL MISSING suppresses the uninitialized messages. If CALL MISSING is not used, the following notes appear in the log:

    NOTE: Variable ProdNme is uninitialized.
    NOTE: Variable GenNme is uninitialized.

If neither statement is used, the following errors appear in the log:

    ERROR: Undeclared data symbol ProdNme for hash object at line 28 column 5.
    ERROR: DATA STEP Component Object failure. Aborted during the EXECUTION phase.
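A commonly used alternative to hard-coding the lengths, not shown in the original slides, is to let the compiler read the variable attributes from Drugs with a SET statement that never executes. The sketch below assumes NDC has the same type and length in Drugs and in in.Pharmacy; the output name Method4_alt is hypothetical.

```sas
data Method4_alt;
    /* Never executed at run time, but at compile time it adds NDC, ProdNme
       and GenNme to the PDV with the types and lengths they have in Drugs */
    if 0 then set Drugs(keep=NDC ProdNme GenNme);

    if _n_=1 then do;
        declare hash h(dataset:"Drugs");
        h.defineKey('NDC');
        h.defineData('ProdNme','GenNme');
        h.defineDone();
    end;

    set in.Pharmacy(rename=(Rx_Prd_ID=NDC));
    if h.find()=0 then output;   /* keep only claims whose NDC is in Drugs */
run;
```

Because the variables now come from a SET statement, the uninitialized notes should not appear and CALL MISSING is not needed in this variant.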

Comparisons

Each of these 4 methods was run to extract all of the Rx claims records in dataset A that had the NDC codes in datasets B_1 and B_2. The option FULLSTIMER was turned on to collect the times and memory amounts used for comparing the methods.

Real time represents the clock time it took to execute a job or step. It is heavily dependent on the capacity of the system and the current load.

CPU time represents the actual processing time required by the CPU to execute the job, exclusive of capacity and load factors.
- User CPU is the CPU time spent to execute your SAS code.
- System CPU is the CPU time spent to perform system overhead tasks on behalf of the SAS process.
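Turning the option on takes one statement placed before the steps to be measured. The sketch below wraps one of the sort steps from Method 1 as an illustration; switching back to NOFULLSTIMER afterwards is an assumption about the preferred session setting, not something the slides specify.

```sas
options fullstimer;      /* log real time, user and system CPU time, and memory per step */

proc sort data=Drugs;
    by NDC;
run;

options nofullstimer;    /* return to the shorter timing summary */
```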

Results for Dataset B_1: 61 records, file size 65 KB.

Step             Real time    User CPU time    System CPU time    Memory (k)
Sort Drugs
Sort Pharmacy    11:…
Merge            1:…
TOTAL            13:…
SQL join         2:…
Cntlin dataset
Proc format
Data step        1:…
TOTAL            1:…
Hash             1:…

Results for Dataset B_2: 55,786 records, file size 44,641 KB.

Step             Real time    User CPU time    System CPU time    Memory (k)
Sort Drugs
Sort Pharmacy    11:…
Merge            5:…
TOTAL            17:…
SQL join         17:23.04     1:06.39          1:…
Cntlin dataset
Proc format
Data step        2:…
TOTAL            2:…
Hash             2:…

Conclusions:

For looking up a value from one dataset in another, the DATA step hash object can be used to improve the performance of the lookup and significantly reduce the time needed to complete the task.

The hash object and the hash iterator object allow SAS programmers using version 9 to solve problems within the DATA step that were difficult, if not impossible, to code in prior versions of SAS.

References:
- Getting Started with the DATA Step Hash Object, Jason Secosky & Janice Bloom, PharmaSUG 2007 Proceedings
- Getting Started with the DATA Step Hash Iterator, Janice Bloom & Jason Secosky, support.sas.com
- Hashing: Generations, Paul Dorfman & Gregg Snell, SUGI 28
- DATA Step Hash Objects as Programming Tools, Paul Dorfman & Koen Vyverman, SUGI 30
- Think FAST! Use Memory Tables (Hashing) for Faster Merging, Gregg Snell, SUGI 31
- A Hash Alternative to the PROC SQL Left Join, Ken Borowiak, NESUG 2006