Presentation is loading. Please wait.

Presentation is loading. Please wait.

Chapter 23: Selecting Efficient Sorting Strategies 1 STAT 541 ©Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South Carolina.

Similar presentations


Presentation on theme: "Chapter 23: Selecting Efficient Sorting Strategies 1 STAT 541 ©Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South Carolina."— Presentation transcript:

1 Chapter 23: Selecting Efficient Sorting Strategies 1 STAT 541 ©Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South Carolina

2 2 Outline Avoiding Unnecessary Sorts Avoiding Unnecessary Sorts Using a Threaded Sort Using a Threaded Sort Calculating and Allocating Sort Resources (not covered) Calculating and Allocating Sort Resources (not covered) Handling Large Data Sets Handling Large Data Sets Removing Duplicate Observations Efficiently Removing Duplicate Observations Efficiently

3 3 Avoiding Unnecessary Sorts Sorts can be avoided in some situations Sorts can be avoided in some situations BY groups with an Index BY groups with an Index –If a data set includes an index, you can use a BY statement on the indexed variable without having used PROC SORT –The BY statement can be used in a DATA step or PROC step –Processing a data set with an index may be less efficient than PROC SORT –Does not index if DESCENDING or NOTSORTED are used or data is pre-sorted

4 4 Avoiding Unnecessary Sorts The NOTSORTED option groups the data on the BY variable, but doesn’t order groups The NOTSORTED option groups the data on the BY variable, but doesn’t order groups –Useful when sorting on nominal groupings –Results are interesting when data is not pre-grouped proc freq data=stat541.fall2008; by gender notsorted; table major; run;

5 5 Avoiding Unnecessary Sorts You can actually group on formatted values rather than the variable itself You can actually group on formatted values rather than the variable itself GROUPFORMAT option can only be used in the DATA step GROUPFORMAT option can only be used in the DATA step GROUPFORMAT allows you to create groups without creating a new variable GROUPFORMAT allows you to create groups without creating a new variable The CLASS statement is an under-used resource, especially in PROC MEANS and PROC UNIVARIATE The CLASS statement is an under-used resource, especially in PROC MEANS and PROC UNIVARIATE

6 6 Avoiding Unnecessary Sorts PROC CONTENTS can be used to see whether data is already sorted PROC CONTENTS can be used to see whether data is already sorted The SORTED BY option can then be used to include the sort information as a data set attribute The SORTED BY option can then be used to include the sort information as a data set attribute

7 7 Using a Threaded Sort Threaded sorts can distribute sorting across multiple CPUs Threaded sorts can distribute sorting across multiple CPUs PROC SORT dsname THREADS|NOTHREADS; You can modify or query the number of CPUs with CPUCOUNT You can modify or query the number of CPUs with CPUCOUNT

8 8 Handling Large Data Sets If a data set is too large to sort (insufficient space for the multiple copies of the data set needed for a sort), the data set can be split into smaller data sets then reassembled, typically with a SET statement/BY statement combination. If a data set is too large to sort (insufficient space for the multiple copies of the data set needed for a sort), the data set can be split into smaller data sets then reassembled, typically with a SET statement/BY statement combination.

9 9 Handling Large Data Sets Many methods are available Many methods are available  FIRSTOBS= OBS= in DATA step  IF/OUTPUT in DATA step  WHERE in PROC SORT step  WHERE in DATA step A DATA step is better than PROC APPEND for reassembling a large data set A DATA step is better than PROC APPEND for reassembling a large data set

10 10 Handling Large Data Sets The TAGSORT option saves only the BY variables and observation numbers in temporary files The TAGSORT option saves only the BY variables and observation numbers in temporary files This saves on the space set aside for a SORT This saves on the space set aside for a SORT

11 11 Removing Duplicate Observations Efficiently NODUPKEY NODUPKEY NODUPRECS NODUPRECS –Checks the entire record, not just the BY variables –Can be limited to post-DROP and post-KEEP variables FIRST. and LAST. FIRST. and LAST. –I’m surprised the book included this choice


Download ppt "Chapter 23: Selecting Efficient Sorting Strategies 1 STAT 541 ©Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South Carolina."

Similar presentations


Ads by Google