Efficient SAS programming with Large Data Aidan McDermott Computing Group, March 2007.


1 Efficient SAS programming with Large Data Aidan McDermott Computing Group, March 2007

2 Axes of Efficiency
processing speed:
– CPU
– real
storage:
– disk
– memory
– …
user:
– functionality
– interface to other systems
– ease of use
– learning
user development:
– methodologies
– reusable code
– facilitate extension, rewriting
– maintenance

3 Dataset / Table

4 Datasets consist of three parts

5 General (and obvious) principles
Avoid doing the job if possible.
Keep only the data you need to perform a particular task (use drop, keep, where and if’s).
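A minimal sketch of the "keep only what you need" principle. The dataset names work.big and work.subset and the cutoff date are hypothetical; the variable names are borrowed from the admits example later in the deck.

```sas
/* Hypothetical datasets: work.big (input) and work.subset (output).   */
/* KEEP= and WHERE= on the SET statement filter the data as it is read, */
/* so unwanted variables and rows never enter the data step.            */
data work.subset;
    set work.big (keep=patientID admit length
                  where=(admit >= '01JAN2003'd));
    discharge = admit + length;
    format discharge date8.;
run;
```

Filtering on the input side generally does less work than reading everything and discarding it with a DROP statement or a subsetting IF later in the step.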

6 Combining datasets -- concatenation
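The slide's content did not survive in this transcript. As one illustration of efficient concatenation, assuming two hypothetical datasets work.all_years and work.year2007 with the same variables, PROC APPEND adds the new observations without rereading the base dataset, whereas a data step with SET rereads and rewrites both.

```sas
/* Hypothetical datasets; both are assumed to have the same variables. */

/* Data step concatenation: reads and rewrites every observation of both inputs. */
data work.all_years;
    set work.all_years work.year2007;
run;

/* PROC APPEND: only work.year2007 is read; its rows are added to the end of the base. */
proc append base=work.all_years data=work.year2007;
run;
```

The two steps are alternatives; running both would append the 2007 data twice.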

7 General (and obvious) principles Often efficient methods were written to perform the required task – use them.

8 General (and obvious) principles
Often efficient methods were written to perform other tasks – use them with caution.
Write data driven code
– it’s easier to maintain data than to update code
Use length statements to limit the size of variables in a dataset to no more than is needed.
– don’t always know what size this should be, don’t always produce your own data.
Use formatted data rather than the data itself
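A sketch of the last two points, under stated assumptions: work.raw is a hypothetical input dataset, region_code is a hypothetical small integer code, and the format name and its labels are invented for illustration.

```sas
/* Hypothetical format: store a short numeric code and display a label. */
proc format;
    value regionf 1 = 'Northeast'
                  2 = 'Midwest'
                  3 = 'South'
                  4 = 'West';
run;

data work.small;
    /* LENGTH before SET so the shorter lengths take effect.           */
    /* A numeric length of 4 is only safe for modest integer values.   */
    length gender $1 region_code 4;
    set work.raw;
    format region_code regionf.;
run;
```

Storing the code with a format attached keeps each observation small while reports still print the full label.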

9 Memory resident datasets
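The slide's detail is missing from the transcript. One way to hold a dataset in memory is the SASFILE statement, sketched below for a hypothetical dataset work.lookup that is read several times; whether this is what the slide showed is an assumption.

```sas
/* Load the hypothetical dataset work.lookup into memory once. */
sasfile work.lookup load;

/* Later steps that read work.lookup use the in-memory copy. */
proc means data=work.lookup;
run;

proc freq data=work.lookup;
run;

/* Release the memory when the dataset is no longer needed. */
sasfile work.lookup close;
```

This only pays off when the dataset fits in available memory and is read more than once.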

10 Compressing Datasets
Compress datasets with a compression utility such as compress, gzip, winzip, or pkzip and decompress before running each SAS job
– delays execution, and you need to keep track of data and program dependencies.
Use a general purpose compression utility and decompress it within SAS for sequential access.
– system dependent (need a named pipe), sequential dataset storage.
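A related sketch, not the named-pipe setup the slide refers to: on a Unix-like system where SAS is allowed to run shell commands, the FILENAME PIPE access method can decompress a gzipped raw text file on the fly. The file path and the column layout below are hypothetical.

```sas
/* Assumes a Unix-like host, gzip on the path, and permission to run  */
/* external commands. The path and the CSV layout are hypothetical.   */
filename rawin pipe 'gzip -dc /path/to/admits.csv.gz';

data work.admits;
    /* Skip the header row; read comma-separated values sequentially. */
    infile rawin dlm=',' dsd firstobs=2 truncover;
    input patientID :$6. gender :$1. admit :date9. length;
    format admit date8.;
run;
```

The data can only be read from start to finish this way, which matches the slide's note about sequential access.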

11 Compressing Datasets

12 SAS internal Compression
Allows random access to data and is very effective under the right circumstances.
In some cases doesn’t reduce the size of the data by much.
“There is a trade-off between data size and CPU time”.
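Internal compression can be requested for a single dataset or for the whole session. A minimal sketch, with hypothetical dataset names:

```sas
/* Compress one output dataset (work.admits is a hypothetical input). */
data work.admits_c (compress=yes);
    set work.admits;
run;

/* Or make compression the default for datasets created afterwards. */
options compress=yes;
```

The log note written when a compressed dataset is created reports how much the size changed, which is a quick way to check whether compression is worthwhile for a given dataset.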

13 indata is a large dataset and you want to produce a version of indata without any observations
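A few ways to do this without reading any observations are sketched below. The output names are hypothetical; indata is the dataset named on the slide.

```sas
/* Option 1: OBS=0 copies the variable definitions but reads no rows. */
data work.empty1;
    set indata (obs=0);
run;

/* Option 2: STOP ends the step before SET ever executes; the SET      */
/* statement is still compiled, so the variables are defined.          */
data work.empty2;
    stop;
    set indata;
run;

/* Option 3: PROC SQL creates a table with the same columns and no rows. */
proc sql;
    create table work.empty3 like indata;
quit;
```

In each case only the descriptor information of indata is used, so the cost does not depend on how many observations it holds.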

14 The data step is a two-stage process
– compile phase
– execute phase

15 Data step logic

16

17

18 data step

19 PDV: compile phase

    data admits;
      set admits;
      discharge = admit + length;
      format discharge date8.;
    run;

    Name       type  size  drop  retain  format  value
    patientID  C     6     n     y
    gender     C     1     n     y
    admit      N     8     n     y       date8.
    length     N     8     n     y
    discharge  N     8     n     n       date8.
    _N_
    _ERROR_                                      0

20 PDV: execute phase

    data admits;
      set admits;
      discharge = admit + length;
      format discharge date8.;
    run;

    Name       type  size  drop  retain  format  value
    patientID  C     6     n     y               321C-4
    gender     C     1     n     y               M
    admit      N     8     n     y       date8.  15736
    length     N     8     n     y               21
    discharge  N     8     n     n       date8.
    _N_                                          1
    _ERROR_                                      0

21 PDV: execute phase

    data admits;
      set admits;
      discharge = admit + length;
      format discharge date8.;
    run;

    Name       type  size  drop  retain  format  value
    patientID  C     6     n     y               321C-4
    gender     C     1     n     y               M
    admit      N     8     n     y       date8.  15736
    length     N     8     n     y               21
    discharge  N     8     n     n       date8.  15757
    _N_                                          1
    _ERROR_                                      0

22 PDV: execute phase

    data admits;
      set admits;
      discharge = admit + length;
      format discharge date8.;
    run; /* implicit output */

    Name       type  size  drop  retain  format  value
    patientID  C     6     n     y               321C-4
    gender     C     1     n     y               M
    admit      N     8     n     y       date8.  15736
    length     N     8     n     y               21
    discharge  N     8     n     n       date8.  15757
    _N_                                          1
    _ERROR_                                      0

23 PDV: execute phase

    data admits;
      set admits;
      discharge = admit + length;
      format discharge date8.;
    run;

    Name       type  size  drop  retain  format  value
    patientID  C     6     n     y               321C-4
    gender     C     1     n     y               M
    admit      N     8     n     y       date8.  15736
    length     N     8     n     y               21
    discharge  N     8     n     n       date8.
    _N_                                          2
    _ERROR_                                      0
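To watch the PDV change during the execute phase, the data step can write its contents to the log with PUT _ALL_. This is a sketch that assumes the admits dataset from the slides exists; the output name work.admits_check is hypothetical and is used so the original is not overwritten.

```sas
data work.admits_check;
    /* At the top of each iteration, discharge (not retained) is missing, */
    /* while variables read by SET still hold the previous row's values.  */
    put 'Top of iteration:';
    put _all_;

    set admits;
    discharge = admit + length;
    format discharge date8.;

    /* After the assignment, discharge holds the computed date value. */
    put 'After assignment:';
    put _all_;
run;
```

On a large dataset the log output grows quickly, so the input is best limited, for example with the OBS= dataset option.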

24 Efficiency: suspend the PDV activities

25 General principles
Use by processing whenever you can.
Given the data below, for each region, siteid, and date, calculate the mean and maximum ozone value.

26 General principles Easy:
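The code from this slide is not in the transcript. One straightforward way to do the calculation with by processing is sketched below; the dataset names work.ozone and work.ozone_stats and the output variable names are hypothetical, while region, siteid, date, and ozone come from the problem statement.

```sas
/* By processing requires the data to be sorted (or indexed) by the BY variables. */
proc sort data=work.ozone;
    by region siteid date;
run;

/* One output row per region, siteid, and date with the mean and maximum. */
proc means data=work.ozone noprint;
    by region siteid date;
    var ozone;
    output out=work.ozone_stats mean=mean_ozone max=max_ozone;
run;
```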

27 General principles
Suppose there are multiple monitors at each site and you still need to calculate the daily mean?
– Combine multiple observations onto one line and then compute the statistics?
Suppose you want the 10% trimmed mean?
Suppose you want the second maximum?
– Use Arrays to sort the data?
– Write your own function?
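For the second maximum, by processing offers an alternative to arrays or a custom function. The sketch below is one possible approach, not necessarily the one the presentation goes on to show; work.ozone and the output names are hypothetical.

```sas
/* Sort so that, within each region, siteid, and date, the largest ozone value comes first. */
proc sort data=work.ozone out=work.sorted;
    by region siteid date descending ozone;
run;

data work.second_max;
    set work.sorted;
    by region siteid date;
    /* k counts observations within the current by group. */
    if first.date then k = 0;
    k + 1;
    /* Keep only the second row of each group. */
    if k = 2;
    rename ozone = second_max_ozone;
    drop k;
run;
```

Groups with a single observation produce no output row, and when the largest value is tied the second row equals the maximum, so the handling of ties would need to be decided for a real analysis.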

28

29

