Former Chapter 23: Selecting Efficient Sorting Strategies

STAT 541
This chapter was deleted from later editions.
©Spring 2012 Imelda Go, John Grego, Jennifer Lasecki and the University of South Carolina

Outline
- Avoiding Unnecessary Sorts
- Using a Threaded Sort
- Calculating and Allocating Sort Resources
- Handling Large Data Sets
- Removing Duplicate Observations Efficiently
Note: Take-away message from first bullet: SORT requires a lot of resources, so minimize its use.

Avoiding Unnecessary Sorts
- Sorts can be avoided in some situations
- BY groups with an index
  - If a data set includes an index, you can use a BY statement on the indexed variable without first running PROC SORT
  - The BY statement can be used in a DATA step or a PROC step
  - Processing a data set through an index may be less efficient than using PROC SORT
  - The index is not used if DESCENDING or NOTSORTED is specified, or if the data is already sorted
Note: Cover first two sub-bullets only.
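The index bullets above can be sketched as follows. This is a minimal illustration, not the book's example: the data set work.students and its variables (name, major, gpa) are made up for the sketch.

```sas
/* Hypothetical data, deliberately not sorted by major */
data work.students;
  input name $ major $ gpa;
  datalines;
Ann Stat 3.6
Bob Math 3.1
Cal Stat 3.9
Dee Math 3.4
;
run;

/* Create a simple index on the BY variable instead of sorting */
proc datasets library=work nolist;
  modify students;
  index create major;
quit;

/* BY-group processing without PROC SORT: SAS can use the index
   to retrieve the observations in order of major */
proc means data=work.students mean;
  by major;
  var gpa;
run;
```

Without the index, the PROC MEANS step would stop with an error that the data set is not sorted in ascending sequence.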

- The NOTSORTED option groups the data on the BY variable, but doesn’t order the groups
- Useful when grouping on nominal variables
- If the data is not pre-grouped, each run of consecutive identical values is treated as a separate BY group, so the results can look fragmented

    proc freq data=stat541.fall2008;
      by gender notsorted;
      table major;
    run;

- We can use FIRST. and LAST. with the NOTSORTED option
Note: Great idea for data sorted or grouped outside SAS (e.g., Excel worksheets). Odd results otherwise.
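To make the FIRST. and LAST. remark concrete, here is a small sketch with made-up data (work.grouped and the counter n are hypothetical names). The data is grouped into runs by gender, and the gender values are not in sorted order.

```sas
/* Hypothetical data: runs of F, then M, then F again */
data work.grouped;
  input gender $ major $;
  datalines;
F Math
F Stat
M Stat
M Math
F Stat
;
run;

/* Count the observations in each consecutive run of gender */
data work.runs;
  set work.grouped;
  by gender notsorted;
  if first.gender then n = 0;  /* reset the counter at the start of a run */
  n + 1;                       /* sum statement: n is retained across rows */
  if last.gender then output;  /* one output row per run */
run;
```

Because NOTSORTED defines a BY group as a run of consecutive equal values, this step writes three rows (F, M, F) rather than two. That is the kind of "odd result" to expect when the data is not pre-grouped.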

- You can actually group on formatted values rather than the variable itself
- The GROUPFORMAT option can only be used in the DATA step
- GROUPFORMAT allows you to create groups without creating a new variable
- The CLASS statement is an under-used resource, especially in PROC MEANS and PROC UNIVARIATE
Note: GROUPFORMAT’s big advantage lies in the third bullet. Get in the habit of using CLASS rather than BY.
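A sketch of both ideas, assuming a made-up format (agegrp.) and a made-up data set (work.people with age and income). Neither comes from the course materials.

```sas
/* Hypothetical format that bins age into two groups */
proc format;
  value agegrp low-<30  = 'Under 30'
               30-high  = '30 and over';
run;

/* Hypothetical data, already in ascending order of age */
data work.people;
  input age income;
  datalines;
22 30000
27 35000
31 50000
45 62000
;
run;

/* GROUPFORMAT: FIRST./LAST. are set from the formatted value of age,
   so no separate age-group variable has to be created */
data work.group_totals;
  set work.people;
  format age agegrp.;
  by age groupformat;
  if first.age then total = 0;
  total + income;
  if last.age then output;
run;

/* CLASS: grouped summaries with no sort or BY statement needed */
proc means data=work.people sum;
  class age;
  format age agegrp.;
  var income;
run;
```

The DATA step writes one row per formatted age group. The PROC MEANS step produces the same grouping through CLASS, which also groups on formatted values.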

- PROC CONTENTS can be used to see whether data is already sorted
- The SORTEDBY= data set option can then be used to record the sort order as a data set attribute
- You should always check whether data is sorted
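As an illustration, the following sketch records a sort order with SORTEDBY= and then inspects it. The data set work.scores and its variables are hypothetical.

```sas
/* Hypothetical data that is already in id order;
   SORTEDBY= asserts that order as a data set attribute */
data work.scores(sortedby=id);
  input id score;
  datalines;
1 88
2 92
3 75
;
run;

/* The PROC CONTENTS output reports whether the data set is flagged as sorted
   and, if so, on which variables */
proc contents data=work.scores;
run;
```

SORTEDBY= only asserts the order and does not sort anything, so it should be used only when the data really is in that order.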

Using a Threaded Sort
- Threaded sorts can distribute sorting across multiple CPUs
- PROC SORT DATA=dsname THREADS|NOTHREADS;
- You can modify or query the number of CPUs with the system option CPUCOUNT
Note: Skim.
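A short sketch of these options in use. The data set work.big and the variable key are made up, and whether threading actually helps depends on the hardware and the data.

```sas
/* Hypothetical unsorted data */
data work.big;
  do i = 1 to 10;
    key = mod(i * 7, 10);
    output;
  end;
  drop i;
run;

/* Let SAS use the number of CPUs it detects, then print the current setting */
options cpucount=actual;
proc options option=cpucount;
run;

/* Request a threaded sort; NOTHREADS would turn threading off for this step */
proc sort data=work.big threads;
  by key;
run;
```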

Handling Large Data Sets
If a data set is too large to sort (there is insufficient space for the multiple copies of the data set that a sort needs), it can be split into smaller data sets that are sorted separately and then reassembled, typically with a SET statement/BY statement combination. The BY statement is sometimes unnecessary, for example when the pieces cover non-overlapping ranges of the sort key and are concatenated in order.
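The split, sort and reassemble idea might look like the sketch below. It splits by observation number using the FIRSTOBS= and OBS= data set options. The data set names and the ten-row size are assumptions chosen to keep the example small.

```sas
/* Hypothetical unsorted data: 10 observations with a numeric key */
data work.big;
  do i = 1 to 10;
    key = mod(i * 7, 10);
    output;
  end;
  drop i;
run;

/* Sort each half separately; OBS= is the number of the last row read */
proc sort data=work.big(firstobs=1 obs=5) out=work.part1;
  by key;
run;

proc sort data=work.big(firstobs=6 obs=10) out=work.part2;
  by key;
run;

/* SET with BY interleaves the two sorted pieces into one sorted data set */
data work.big_sorted;
  set work.part1 work.part2;
  by key;
run;
```

The BY statement is needed here because both halves contain keys from the whole range. Without it, SET would simply stack part2 after part1.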

- Many methods are available
  - FIRSTOBS= and OBS= in the DATA step
  - IF/OUTPUT in the DATA step
  - WHERE in the PROC SORT step
  - WHERE in the DATA step
- A DATA step is better than PROC APPEND for reassembling a large data set
Note: Most of the tools here are very familiar to us.
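As one possible illustration of the WHERE-in-PROC-SORT method, the sketch below splits on ranges of the sort key. The data set names and the cut point of 5 are made up.

```sas
/* Hypothetical unsorted data: 10 observations with a numeric key */
data work.big;
  do i = 1 to 10;
    key = mod(i * 7, 10);
    output;
  end;
  drop i;
run;

/* Each PROC SORT reads and sorts only the rows that pass its WHERE */
proc sort data=work.big out=work.low;
  where key < 5;
  by key;
run;

proc sort data=work.big out=work.high;
  where key >= 5;
  by key;
run;

/* Every key in work.low precedes every key in work.high, so a plain
   concatenation in this order is already sorted; no BY is required */
data work.big_sorted;
  set work.low work.high;
run;
```

This is a case where the BY statement can be dropped when reassembling, because the subsets do not overlap on the sort key.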

- The TAGSORT option saves only the BY variables and observation numbers in temporary files
- This reduces the temporary space that a sort needs
Note: “Tags” are a convenient feature; this should remind you of order() in R.
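A minimal sketch of the option, using a made-up data set (work.wide) whose rows are wide relative to the sort key. That is the situation where storing only the tags would be expected to save the most space.

```sas
/* Hypothetical data: a short key and a long character variable */
data work.wide;
  length notes $ 200;
  do key = 5 to 1 by -1;
    notes = 'long descriptive text';
    output;
  end;
run;

/* TAGSORT: only the BY variable and observation numbers are written to the
   temporary files; the full rows are retrieved afterwards in sorted order */
proc sort data=work.wide tagsort;
  by key;
run;
```

The saving in disk space can come at the cost of a longer run time, so it is worth timing on realistic data before adopting it.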

Removing Duplicate Observations Efficiently
- NODUPKEY
- NODUPRECS
  - Checks the entire record, not just the BY variables
  - Can be limited to post-DROP and post-KEEP variables
- FIRST. and LAST.
Note: I’m surprised the book included this choice.
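The three approaches can be compared on a small made-up data set (work.visits with id and visit, both hypothetical). This is a sketch rather than the book's example.

```sas
/* Hypothetical data with repeated ids and one fully repeated record per id */
data work.visits;
  input id visit $;
  datalines;
1 A
1 A
1 B
2 A
2 A
;
run;

/* NODUPKEY: keep one observation per value of the BY variable */
proc sort data=work.visits out=work.one_per_id nodupkey;
  by id;
run;

/* NODUPRECS: drop observations identical on all variables; sorting by every
   variable makes sure identical records end up next to each other */
proc sort data=work.visits out=work.no_dup_recs noduprecs;
  by id visit;
run;

/* FIRST.: the DATA step alternative, which needs sorted (or indexed) input */
proc sort data=work.visits out=work.sorted;
  by id;
run;

data work.first_per_id;
  set work.sorted;
  by id;
  if first.id;  /* subsetting IF keeps only the first row of each id */
run;
```

With this data, NODUPKEY and the FIRST. step each keep two rows (one per id), while NODUPRECS keeps three, because the record with id 1 and visit B is not a duplicate of any other record.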
