SUGI 28, PAPER 4 SUGI 25 Paper 129: Paul M. Dorfman Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing SUGI 26 Paper 8: Paul.


2 SUGI 28, PAPER 4

3 GENERATION I: Hand-Coded Direct-Addressing Routines
Presented at:
SUGI 25
   Paper 129: Paul M. Dorfman, Private Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing
SUGI 26
   Paper 8: Paul M. Dorfman, Table Look-Up by Direct Addressing: Key-Indexing — Bitmapping — Hashing
   Paper 128: Paul M. Dorfman, Quick Disk Table Look-up via Hybrid Indexing into a Directly Addressed SAS Data Set
SUGI 27
   Paper 12: Paul M. Dorfman and Gregg P. Snell, Hashing Rehashed

4 GENERATION II: SAS® Version 9 DATA Step Object
Associative Array == Hash
SUGI 27 Presentation: Jason Secosky, The DATA step in Version 9: What’s New?

5 PROPAEDEUTICS
Main Entry: pro·pae·deu·tic
Pronunciation: "prO-pi-'dü-tik, -'dyü-
Function: noun
Etymology: Greek propaideuein to teach beforehand, from pro- before + paideuein to teach, from paid-, pais child
: preparatory study or instruction
(Merriam-Webster online)

6 DIRECT ADDRESSING
Main Entry: di·rect ad·dress·ing
Pronunciation: "d&-'rekt, &-'dres-ing
Function: speedy table lookup
1: accessing key values “directly” by their location (address, node) in a table, as opposed to searching for them by comparing the search key to all or some table values
2: key-indexed search
(Hashing Rehashed. SUGI 27, Orlando, FL)

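7 KEY-INDEXING
WORK.SMALL columns: KEY, S_SAT
data match ;
   /* DEFINE: one array slot per possible key value */
   array hkey (0:999) _temporary_ ;
   /* LOAD: store each satellite at the slot addressed by its key */
   do until ( eof1 ) ;
      set small end = eof1 ;
      if missing (hkey(key)) then hkey(key) = s_sat ;
   end ;
   /* SEARCH: the key of each LARGE record is the slot to inspect */
   do until ( eof2 ) ;
      set large end = eof2 ;
      s_sat = hkey(key) ;
      if s_sat > . then output ;
   end ;
   stop ;
run ;
ARRAY HKEY(0:999) before loading: HKEY(000)=. … HKEY(185)=. HKEY(400)=. HKEY(971)=. … HKEY(999)=.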


13 KEY-INDEXING: LIMITATIONS
Key must be an integer
Key range is limited by memory (a 9-digit SSN would require 8 GB)
So you may be asking yourself... How can one possibly use key-indexing when SSN (or any other large or non-integer value) is the key?
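The 8 GB figure can be checked with simple arithmetic. The sketch below assumes one 8-byte numeric array element for each of the 10^9 possible 9-digit values; the slide itself does not show the calculation.

```latex
10^{9}\ \text{slots} \times 8\ \text{bytes per slot} = 8 \times 10^{9}\ \text{bytes} \approx 8\ \text{GB}
```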

14 HASHING
Main Entry: 1 hash
Pronunciation: 'hash
Function: transitive verb
Etymology: French hacher, from Old French hachier, from hache battle-ax, of Germanic origin; akin to Old High German hAppa sickle; akin to Greek koptein to cut
a : to chop (as meat and potatoes) into small pieces
(Merriam-Webster online)

15 HASHING
Main Entry: 1 hash·ing
Pronunciation: 'hash-ing
Function: make BIG keys smaller
1: converting a long-range key (numeric or character) to a smaller-range integer number with a mathematical algorithm or function that must:
   be rapidly computable
   distribute the resulting keys uniformly
   produce an integer in the [0:HSIZE-1] range
2: H = MOD(KEY,HSIZE)
(Private Detectives in a Data Warehouse. SUGI 25, Indianapolis, IN)
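As a quick check of definition 2, the sketch below applies H = MOD(KEY,HSIZE) to key values that appear in the hash tables on later slides, with HSIZE fixed at 17 (the size reported on slide 17). The data set name hash_demo is hypothetical and not part of the presentation.

```sas
/* Sketch: hash addresses for sample keys, assuming HSIZE = 17 */
data hash_demo ;
   hsize = 17 ;
   do key = 971, 400, 260, 922, 970, 543, 50, 67 ;
      h = mod (key, hsize) ;     /* integer in the range 0 to hsize-1 */
      put key= z3. h= z2. ;      /* write each key/address pair to the log */
      output ;
   end ;
run ;
```

Keys 543, 050, and 067 all hash to address 16, which is why a collision resolution policy is needed on top of the hash function.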

16 GENERATIONS
Our Mission: match unsorted SMALL (integer variable KEY, satellite variable S_SAT) to unsorted LARGE
Assume: LARGE cannot be sorted; memory will hold SMALL
data small ;
   input key s_sat ;
   cards ;
;
run ;
Which generation of hashing is the most efficient for subsetting LARGE based on the values of KEY in SMALL to produce a file MATCH?

17 GENERATION I
Choose a proper function
   h = mod(key,&hsize);
Determine Load Factor
   %let load = 0.625;
Calculate Optimal Array Size
   data _null_;
      do p=ceil(p/&load) by 1 until(j=up+1);
         up = ceil(sqrt(p));
         do j=2 to up until(not mod(p,j)); end;
      end;
      call symput('hsize',left(put(p,best.)));
      stop;
      set small nobs=p;
   run;
   %put hsize=&hsize;
   hash_size=17
Hash (table columns KEY, HASH_ADDR; address shown: 15)
Hash Collision Resolution (Linear Probing)
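The array-size step searches upward from the observation count divided by the load factor until it reaches a prime. As a worked instance, assume SMALL holds N = 10 observations (an assumed count; the slide does not state it, but it is consistent with the reported size of 17):

```latex
\left\lceil \frac{N}{\text{load}} \right\rceil = \left\lceil \frac{10}{0.625} \right\rceil = 16,
\qquad 16 = 2^{4}\ \text{(not prime)},
\qquad 17\ \text{is prime} \;\Rightarrow\; \text{hsize} = 17
```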

18 GENERATION I: DEFINE
Stages: DEFINE (load factor, array size, array definition), LOAD, SEARCH
%let load = 0.625 ;
data _null_ ;
   do p=ceil(p/&load) by 1 until(j=up+1);
      up = ceil(sqrt(p));
      do j=2 to up until(not mod(p,j)); end;
   end;
   call symput('hsize', left(put(p,best.)));
   stop;
   set small nobs=p;
run;
*0 if dupes to be pulled ;
%let nodupes = 1 ;
data match (keep=key s_sat l_sat);
   retain nodupes &nodupes.. ;
   array hkey (0:&hsize) _temporary_;
   array hsat (0:&hsize) _temporary_;
   ...
HKEY and HSAT, slots (00) through (17): all missing (.)

19-28 GENERATION I: LOAD
(DEFINE code as on slide 18, followed by the LOAD loop)
   do until ( eof1 ) ;
      set small end = eof1 ;
      do h=mod (key, &hsize) by +1 ;
         if h = &hsize then h = 0 ;
         if hkey(h)=key and nodupes then leave ;
         if hkey(h) =. then do ;
            hkey(h) = key ;
            hsat(h) = s_sat ;
            leave ;
         end ;
   ...
Table state, one build per slide (slot: HKEY / HSAT):
   Slide 19: all slots missing
   Slide 20: (15) = 185 / 00
   Slide 21: (02) = 971 / 11
   Slide 22: (09) = 400 / 22
   Slide 23: (05) = 260 / 33
   Slide 24: (04) = 922 / 44
   Slide 25: (01) = 970 / 55
   Slide 26: (16) = 543 / 66
   Slide 27: no change
   Slide 28: (00) = 050 / 88
(From slide 21 on, the transcript shows slot (15) as missing again.)

29 GENERATION I: LOAD and SEARCH
(DEFINE code as on slide 18)
   /* LOAD */
   do until ( eof1 ) ;
      set small end = eof1 ;
      do h=mod (key, &hsize) by +1 ;
         if h = &hsize then h = 0 ;
         if hkey(h)=key and nodupes then leave ;
         if hkey(h) =. then do ;
            hkey(h) = key ;
            hsat(h) = s_sat ;
            leave ;
         end ;
   ...
   /* SEARCH */
   do until ( eof2 ) ;
      set large end = eof2 ;
      do h=mod (key, &hsize) by +1 until ( hkey(h) =. ) ;
         if h = &hsize then h = 0 ;
         if hkey(h) = key then do ;
            s_sat = hsat(h) ;
            output ;
            if nodupes then leave ;
         end ;
      end ;
   end ;
   stop ;
run ;
Final table (slot: HKEY / HSAT): (00) = 050 / 88, (01) = 970 / 55, (02) = 971 / 11, (03) = 067 / 99, (04) = 922 / 44, (05) = 260 / 33, (09) = 400 / 22, (16) = 543 / 66; all other slots missing
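To see linear probing at work, the sketch below replays the LOAD logic for the keys shown in the slide tables and reports where each key lands. It is a simplified illustration, not the authors' program: satellites are left out, the keys are listed inline instead of being read from SMALL, and the variables home and probes are hypothetical additions for tracing.

```sas
%let hsize = 17 ;   /* table size reported on slide 17 */

data _null_ ;
   array hkey (0:&hsize) _temporary_ ;   /* all slots start as missing */
   do key = 971, 400, 260, 922, 970, 543, 50, 67 ;
      home   = mod (key, &hsize) ;       /* address computed by the hash function */
      probes = 0 ;
      do h = home by 1 ;
         if h = &hsize then h = 0 ;      /* wrap around to the first slot */
         probes = probes + 1 ;
         if hkey(h) = key then leave ;   /* duplicate key: keep the first one */
         if hkey(h) = . then do ;        /* free slot found: store the key */
            hkey(h) = key ;
            leave ;
         end ;
      end ;
      put key= z3. home= z2. h= z2. probes= ;
   end ;
run ;
```

Keys 050 and 067 both have home address 16, already taken by 543, so they wrap around and settle in slots 00 and 03, matching the final table on the slide.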

30 GENERATION II
Choose a proper function
   h = mod(key,&hsize);
Determine Load Factor
   %let load = 0.625;
Calculate Optimal Array Size
   data _null_;
      do p=ceil(p/&load) by 1 until(j=up+1);
         up = ceil(sqrt(p));
         do j=2 to up until(not mod(p,j)); end;
      end;
      call symput('hsize',left(put(p,best.)));
      stop;
      set small nobs=p;
   run;
   %put hsize=&hsize;
   hash_size=17
Hash (table columns KEY, HASH_ADDR; address shown: 15)
Hash Collision Resolution (Linear Probing)

31 GENERATION II: DEFINE, LOAD, SEARCH
data match ( drop = rc ) ;
   /* DEFINE */
   length key $9 s_sat 8 ;
   declare AssociativeArray hh ();
   rc = hh.DefineKey ( 'key' );
   rc = hh.DefineData ( 's_sat' );
   rc = hh.DefineDone ();
   /* LOAD */
   do until ( eof1 ) ;
      set small end = eof1 ;
      rc = hh.add () ;
   end ;
   /* SEARCH */
   do until ( eof2 ) ;
      set large end = eof2 ;
      rc = hh.find () ;
      if rc = 0 then output ;
   end ;
   stop ;
run ;

32 GENERATION II: DEFINE
General syntax: return code = object.method;
The same step written without capturing the return codes:
data match ;
   length key $9 s_sat 8 ;
   declare AssociativeArray hh ();
   hh.DefineKey ( 'key' );
   hh.DefineData ( 's_sat' );
   hh.DefineDone ();
   do until ( eof1 ) ;
      set small end = eof1 ;
      hh.add () ;
   end ;
   do until ( eof2 ) ;
      set large end = eof2 ;
      if hh.find ()=0 then output ;
   end ;
   stop ;
run ;
Parameter Type Matching: shown as an alternative to the LENGTH statement
   set small(keep=key s_sat) point = _n_ ;
Verbose Syntax: shown as a shorter form of the DECLARE statement
   dcl hash hh ();

33-34 GENERATION II: LOAD and SEARCH
(DEFINE code and annotations as on slide 32)
LOAD: the explicit loop
   do until ( eof1 ) ;
      set small end = eof1 ;
      hh.add () ;
   end ;
is shown next to a declaration that names the data set to load:
   dcl hash hh (dataset:'small') ;
SEARCH:
   do until ( eof2 ) ;
      set large end = eof2 ;
      if hh.find ()=0 then output ;
   end ;
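Putting the Generation II pieces together, a complete run might look like the sketch below. It assumes production SAS 9, where the object keyword is hash (the slides use the experimental AssociativeArray name), and it uses the common "if 0 then set" idiom for parameter type matching instead of the POINT= form on the slide. The rows of SMALL reuse key and satellite values from the slides; the rows of LARGE and the variable l_sat values are invented for illustration.

```sas
/* Hypothetical sample data */
data small ;
   input key s_sat ;
   cards ;
971 11
400 22
260 33
;
run ;

data large ;
   input key l_sat ;
   cards ;
400 1
123 2
971 3
400 4
;
run ;

data match ;
   if 0 then set small (keep = key s_sat) ;   /* defines KEY and S_SAT for the hash object */
   declare hash hh (dataset: 'small') ;       /* loads SMALL when DefineDone runs */
   hh.DefineKey  ('key') ;
   hh.DefineData ('s_sat') ;
   hh.DefineDone () ;
   do until (eof2) ;
      set large end = eof2 ;
      if hh.find () = 0 then output ;         /* 0 means the key was found */
   end ;
   stop ;
run ;
```

With this sample data MATCH receives three rows, for keys 400, 971, and 400 again, each carrying the S_SAT value stored for that key in SMALL.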

35 GENERATIONS: DEFINE, LOAD, SEARCH side by side
GENERATION I
%let load = 0.625 ;
data _null_ ;
   do p=ceil(p/&load) by 1 until(j=up+1);
      up = ceil(sqrt(p));
      do j=2 to up until(not mod(p,j)); end;
   end;
   call symput('hsize', left(put(p,best.)));
   stop;
   set small nobs=p;
run;
*0 if dupes to be pulled ;
%let nodupes = 1 ;
data match (keep = key s_sat l_sat) ;
   retain nodupes &nodupes.. ;
   array hkey (0:&hsize) _temporary_ ;
   array hsat (0:&hsize) _temporary_ ;
   do until ( eof1 ) ;
      set small end = eof1 ;
      do h = mod (key, &hsize) by +1 ;
         if h = &hsize then h = 0 ;
         if hkey(h) = key and nodupes then leave ;
         if hkey(h) =. then do ;
            hkey(h) = key ;
            hsat(h) = s_sat ;
            leave ;
         end ;
      end ;
   end ;
   do until ( eof2 ) ;
      set large end = eof2 ;
      do h = mod (key, &hsize) by +1 until ( hkey(h) =. ) ;
         if h = &hsize then h = 0 ;
         if hkey(h) = key then do ;
            s_sat = hsat(h) ;
            output ;
            if nodupes then leave ;
         end ;
      end ;
   end ;
   stop ;
run ;
GENERATION II
data match ( drop = rc ) ;
   length key $9 s_sat 8 ;
   declare AssociativeArray hh ();
   hh.DefineKey ( 'key' );
   hh.DefineData ( 's_sat' );
   hh.DefineDone ();
   do until ( eof2 ) ;
      set large end = eof2 ;
      if hh.find ()=0 then output ;
   end ;
   stop ;
run ;
Shown as alternatives to the LENGTH and DECLARE statements:
   set small(keep=key s_sat) point = _n_ ;
   dcl hash hh (dataset:'small') ;


38 CHALLENGES: DUPLICATE KEYS
GENERATION I
Modify your hash code
   %let nodupes = 1 ;
   if nodupes then leave ;
GENERATION II
Three options !
   hh.add (); keeps 1st occurrence
   hh.replace (); keeps last
   create additional keys to discriminate further
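The difference between the first two options can be seen in a few lines. The sketch below assumes the production SAS 9 hash keyword and uses one key from the slides (971) with two competing satellite values; the second value, 77, is invented for illustration.

```sas
data _null_ ;
   length key s_sat 8 ;
   declare hash hh () ;
   hh.DefineKey  ('key') ;
   hh.DefineData ('s_sat') ;
   hh.DefineDone () ;

   key = 971 ; s_sat = 11 ; rc = hh.add () ;      /* first occurrence is stored */
   key = 971 ; s_sat = 77 ; rc = hh.add () ;      /* key exists: nonzero rc, nothing stored */
   rc = hh.find () ;
   put 'after add:     ' s_sat= ;                 /* the first value survives */

   key = 971 ; s_sat = 77 ; rc = hh.replace () ;  /* overwrites the stored satellite */
   s_sat = . ;
   rc = hh.find () ;
   put 'after replace: ' s_sat= ;                 /* the last value survives */
run ;
```

Capturing the return code in rc, as here, keeps a failed add from being reported as an error in the log.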

39 CHALLENGES: NON-INTEGER KEYS
Key types: fractional signed SAS numbers; digit strings (char var of digits); arbitrary characters
GENERATION I
Change your hash function
   rescale the key: h = mod (key*1000, &hsize)
   use a numeric informat: h = mod (input(key,16.), &hsize)
Change your array declaration
   array hkey(0:&hsize) $9 _temporary_
GENERATION II
Already taken care of !
   set small(keep=key s_sat) point = _n_ ;
   hh.DefineKey ( 'key' );

40 CHALLENGES: DYNAMIC TABLE PROCESSING
GENERATION I
Simple coding: DO loop through the array
   do h=0 to &hsize - 1;
      if missing(hkey(h)) then continue;
      key = hkey(h);
      s_sat = hsat(h);
   end;
GENERATION II
Use the HASH ITERATOR !
   dcl hash hh (dataset: 'sample', ordered: 1);
   dcl hiter hi ('hh');
   do rc=hi.first() by 0 while (rc=0);
      /* key is automatically populated */
      rc=hi.next();
   end;
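A self-contained version of the iterator loop might look like the sketch below. It assumes production SAS 9 syntax (hash keyword, ordered: 'a' for ascending order) and loads three key/satellite pairs from the slides by hand. One assumption worth flagging: in production releases the iterator returns only data items, so KEY is listed in DefineData as well as DefineKey here to get it back.

```sas
data _null_ ;
   length key s_sat 8 ;
   declare hash hh (ordered: 'a') ;     /* keep entries in ascending key order */
   hh.DefineKey  ('key') ;
   hh.DefineData ('key', 's_sat') ;     /* KEY repeated so the iterator returns it */
   hh.DefineDone () ;
   declare hiter hi ('hh') ;            /* iterator attached to the table HH */

   key = 971 ; s_sat = 11 ; rc = hh.add () ;
   key = 400 ; s_sat = 22 ; rc = hh.add () ;
   key = 260 ; s_sat = 33 ; rc = hh.add () ;

   rc = hi.first () ;                   /* smallest key */
   do while (rc = 0) ;
      put key= s_sat= ;
      rc = hi.next () ;                 /* next key in ascending order */
   end ;
run ;
```

The log lists the entries in key order 260, 400, 971, regardless of the order in which they were added.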

41 CHALLENGES: COMPOSITE KEYS
GENERATION I
Change your hash function
   Possible but VERY difficult
   Complexity grows as the number of components to the key grows
GENERATION II
Just add the key(s) !
   set small(keep=k1 k2 k3 s_sat) point = _n_ ;
   hh.DefineKey ( 'k1', 'k2', 'k3' );
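As a small illustration of the Generation II approach, the sketch below defines a three-part key and looks entries up by the full combination. It assumes the production SAS 9 hash keyword; the key and satellite values are invented for illustration.

```sas
data _null_ ;
   length k1 k2 k3 s_sat 8 ;
   declare hash hh () ;
   hh.DefineKey  ('k1', 'k2', 'k3') ;   /* all three parts form one composite key */
   hh.DefineData ('s_sat') ;
   hh.DefineDone () ;

   k1 = 1 ; k2 = 2 ; k3 = 3 ; s_sat = 11 ; rc = hh.add () ;
   k1 = 1 ; k2 = 2 ; k3 = 4 ; s_sat = 22 ; rc = hh.add () ;

   k1 = 1 ; k2 = 2 ; k3 = 3 ;
   rc = hh.find () ;                    /* found: rc is 0 and S_SAT is restored */
   put 'find  (1,2,3): ' rc= s_sat= ;

   k3 = 9 ;
   rc = hh.check () ;                   /* not in the table: rc is nonzero */
   put 'check (1,2,9): ' rc= ;
run ;
```

No hash function has to be redesigned when a key component is added; only the DefineKey list changes.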

42 PEEK UNDER THE HOOD STARFLEET COMMAND ENGINEERING ACCESSING VERSION 9 SCHEMATICS

43 PEEK UNDER THE HOOD
DATA STEP COMPONENT INTERFACE
HASH ITERATOR
ADELSON-VELSKII & LANDIS TREES

44 PEEK UNDER THE HOOD DATA STEP COMPONENT INTERFACE declare AssociativeArray hh instantiates or creates the object ( dcl hash myhash )

45 PEEK UNDER THE HOOD DATA STEP COMPONENT INTERFACE hh.DefineKey define a set of hash keys

46 PEEK UNDER THE HOOD DATA STEP COMPONENT INTERFACE hh.DefineData define a set of hash table satellites

47 PEEK UNDER THE HOOD DATA STEP COMPONENT INTERFACE hh.DefineDone tell SAS the definitions are done

48 PEEK UNDER THE HOOD DATA STEP COMPONENT INTERFACE hh.Add insert the key and satellites (if the key is not yet in the table)

49 PEEK UNDER THE HOOD DATA STEP COMPONENT INTERFACE hh.Replace insert the key and satellites (overwrites any existing data)

50 PEEK UNDER THE HOOD DATA STEP COMPONENT INTERFACE hh.Find search for the key; if found, extract the satellite and update the host DATA step variables

51 PEEK UNDER THE HOOD DATA STEP COMPONENT INTERFACE hh.Check search for the key; if found, just return rc=0

52 PEEK UNDER THE HOOD DATA STEP COMPONENT INTERFACE hh.Delete delete the hash table from memory

53 PEEK UNDER THE HOOD HASH ITERATOR declare hiter hi ( dcl hiter myhashiterator ) instantiates or creates the object (don’t forget to add “ordered: 1” to the hash declaration)

54 PEEK UNDER THE HOOD HASH ITERATOR hi.First fetch the smallest key into the host variable

55 PEEK UNDER THE HOOD HASH ITERATOR hi.Next fetch the next key in ascending order

56 PEEK UNDER THE HOOD HASH ITERATOR hi.Last fetch the largest key into the host variable

57 PEEK UNDER THE HOOD HASH ITERATOR hi.Prev fetch the previous key in descending order

58 PEEK UNDER THE HOOD: ADELSON-VELSKII & LANDIS (AVL) TREES
Binary trees populated such that they average O(log(N)) search behavior regardless of the distribution of the keys
For example: Insert the value: 05; Insert the value: 02
AVL maintains balance by rotating the values and preserving the search structure
HASHEXP controls the number of trees to create: 2**HASHEXP (HASHEXP=16 gives HSIZE=65,536)
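One rough way to read the HASHEXP slide, assuming the keys spread evenly over the trees (an assumption, not something the slide states): with n keys and HASHEXP = k, each AVL tree holds about n / 2^k keys, so the search cost inside a tree grows with the logarithm of that much smaller number.

```latex
\text{trees} = 2^{k}, \qquad k = 16 \;\Rightarrow\; 2^{16} = 65{,}536,
\qquad \text{cost per search} \approx O\!\left(\log\!\left(1 + \frac{n}{2^{k}}\right)\right)
```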

59 CONCLUSION: A short comparison of I vs. II
Simple numeric key falling in a limited range
SAS date and time values are good examples
This is the area where Generation I key-indexed search completely dominates the competition in both computer and programming efficiency

60 CONCLUSION
Simple numeric key with a range of up to 9 digits; no satellites needed
Bitmapping is king

61 CONCLUSION
Simple numeric key or short (up to 10 bytes) character key
Both generations do well
If ultimate speed is the issue, Generation I wins (barely)
Generation II has the advantage of coding simplicity

62 CONCLUSION
Composite keys
Generation I is better if the keys can be rapidly combined into a short integer
Otherwise, Generation II dominates

63 CONCLUSION
Retrieving data by key from a hash table in order
Generation I can provide such functionality only through array sorting
The Generation II hash iterator object is designed for this purpose and works very fast

64 CONCLUSION
Storing and handling duplicate key entries in a hash table
Generation I is more flexible
Generation II only lets you control which duplicate takes over

65 CONCLUSION
Dynamic DATA Step Processing
Generation II is ideal
Table grows at run-time as new entries are added
No need to allocate giant amounts of memory beforehand

66 CONCLUSION
The Generation II Hash Object (AssociativeArray)
Finally…
Although still experimental in V9…
is the first ever dynamic DATA step structure

67 HASHING: GENERATIONS SUGI 28 GENERATIONS

