Operational Weather Forecasting using GPUs. Dr. Shujia Zhou, Lawrence Sebald.

1 Operational Weather Forecasting using GPUs
Dr. Shujia Zhou, Lawrence Sebald

2 NOAA Long Wave Radiation Code
- Production version of the weather forecast model code
- Accounts for about 10-15% of the global weather forecast simulation time
- NOAA is interested in accelerating this code so that it can be called once per hour rather than once every three hours, as is done now
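To make the motivation on this slide concrete, here is a minimal arithmetic sketch. It assumes that the 10-15% figure describes the code at today's three-hourly call rate and that its cost scales linearly with the number of calls; neither assumption is stated on the slide, and the function name is made up for illustration.

```python
def relative_runtime(radiation_fraction, call_multiplier, kernel_speedup=1.0):
    """Total forecast runtime relative to today's runtime (today = 1.0).

    radiation_fraction: share of runtime spent in the radiation code today
    call_multiplier:    how many times more often the code is called
    kernel_speedup:     assumed speedup of the radiation code itself
    """
    other_work = 1.0 - radiation_fraction
    radiation_work = radiation_fraction * call_multiplier / kernel_speedup
    return other_work + radiation_work


# Hourly instead of three-hourly calls means three times as many calls.
for fraction in (0.10, 0.15):
    unaccelerated = relative_runtime(fraction, call_multiplier=3)
    accelerated = relative_runtime(fraction, call_multiplier=3, kernel_speedup=3)
    print(f"fraction={fraction:.2f} "
          f"unaccelerated={unaccelerated:.2f} accelerated={accelerated:.2f}")
```

Under these assumptions, hourly calls without acceleration would lengthen the forecast by roughly 20-30%, and a threefold speedup of the radiation code would be enough to keep the total runtime unchanged.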

3 NOAA Long Wave Radiation Code Structure
- Approximately 4000 lines of Fortran 90 code
- Additionally, approximately 30,000 lines of raw data within the code
- The code is structured around many random accesses into lookup tables in RAM
- Algorithmically, this reduces the run time from O(L^2) to O(L) on a CPU
- Efficient on a CPU, horribly inefficient on a GPU

4 Code Structure and Memory Requirements

Functions in the call diagram: main, lwrad, rlwinit, cldprop, taumol, rtrn or rtrnmr, taugb## (01-16)

Function      Memory usage          Memory usage          Computation
              (single precision)    (double precision)    time
lwrad         87756                 138608
rlwinit       24                    44
cldprop       44                    76
taumol (*)    44064                 79912                 ~60%
rtrn (**)     167320                334608                ~35%
rtrnmr (**)   205512                410480                ~35%
taugb01       2                     8
taugb02       64                    88
taugb03       132                   204
taugb04       104                   160
taugb05       104                   160
taugb06       2                     8
taugb07       100                   152
taugb08       52                    68
taugb09       144                   220
taugb10       2                     4
taugb11       2                     8
taugb12       100                   152
taugb13       100                   152
taugb14       2                     8
taugb15       100                   152
taugb16       100                   152

*: Time stated for taumol includes time used by the taugb## functions
**: Only one of these two functions is used
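The computation-time column suggests how much a port of individual functions can buy overall. The sketch below applies Amdahl's law, treating the ~60% (taumol) and ~35% (rtrn or rtrnmr) figures as fractions of the long wave code's run time; that reading of the column, and the per-function speedup values, are assumptions made for illustration, not measurements from the slides.

```python
def overall_speedup(accelerated_fraction, kernel_speedup):
    """Amdahl's law: whole-code speedup when only part of the work is accelerated."""
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / kernel_speedup)


TAUMOL_FRACTION = 0.60   # from the table, includes the taugb## functions
RTRN_FRACTION = 0.35     # from the table, rtrn or rtrnmr

# Hypothetical speedups of the ported functions.
for kernel_speedup in (2, 10, 100):
    taumol_only = overall_speedup(TAUMOL_FRACTION, kernel_speedup)
    both = overall_speedup(TAUMOL_FRACTION + RTRN_FRACTION, kernel_speedup)
    print(f"kernel x{kernel_speedup}: taumol only x{taumol_only:.2f}, "
          f"taumol + rtrn x{both:.2f}")
```

With these fractions, accelerating taumol alone can never speed up the whole code by more than 2.5 times, however fast the ported version is; the rtrn or rtrnmr step has to be ported as well to go beyond that.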

5 Optimization Differences between CPU and GPU
CPU
- Each core has fairly large caches
- For instance, on Intel Nehalem: 32 KB L1 (data) and 256 KB L2 per core, 4-12 MB L3 shared
- Precomputed lookup tables often provide a decent speedup over brute-force computation
- NASA Goddard solar (short wave) radiation code and NOAA long wave radiation code are optimized in this way
GPU
- Each core has much smaller shared memory (16 cores with 16 KB in Tesla, 32 cores with 64 KB in Fermi)
- Brute-force calculation is more efficient due to the large number of SIMD cores (512 in Fermi)
- Streaming computation with many threads is preferable to lookup-table-centric programming
- Reversing the lookup table approach back to computational functions reduces memory consumption
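The trade-off on this slide can be sketched in a few lines. The example below contrasts a precomputed table with linear interpolation against direct evaluation of the same function. The function exp(-x), the table range, and the table size are hypothetical stand-ins chosen for illustration; they are not taken from the NOAA code.

```python
import math


def transmittance(optical_depth):
    """Direct ("brute-force") evaluation; a hypothetical stand-in function."""
    return math.exp(-optical_depth)


def build_table(func, x_min, x_max, entries):
    """Precompute func on an evenly spaced grid, as a lookup-table code would."""
    step = (x_max - x_min) / (entries - 1)
    return [func(x_min + i * step) for i in range(entries)], step


def lookup(table, step, x_min, x):
    """Linear interpolation between the two nearest entries (x assumed in range)."""
    position = (x - x_min) / step
    index = min(int(position), len(table) - 2)
    weight = position - index
    return table[index] * (1.0 - weight) + table[index + 1] * weight


X_MIN, X_MAX, ENTRIES = 0.0, 10.0, 4096
table, step = build_table(transmittance, X_MIN, X_MAX, ENTRIES)

# Storage the table would need in single precision (4 bytes per entry).
table_bytes_single = ENTRIES * 4

x = 1.2345
approx = lookup(table, step, X_MIN, x)
exact = transmittance(x)
print(f"table: {table_bytes_single} bytes, lookup={approx:.8f}, direct={exact:.8f}")
```

A single 4096-entry single-precision table already occupies 16384 bytes, the same as the 16 KB of shared memory quoted for Tesla above, while the direct version needs no table storage at all and only costs arithmetic.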

6 Translation from Fortran 90 to C
- Used a NOAA tool known as F2C-ACC to translate the Fortran 90 code to C
- C is better supported for GPU programming than Fortran, and will generally be supported first on future chips as well
- Fortran is only recently supported, by a compiler from PGI
  - Little documentation, few examples, potentially less efficient than C code
- F2C-ACC did a relatively good job of translating the raw computation code, but the tool is not perfect
  - Hand editing of the translated code was necessary
  - Hand-tuning the conversion took approximately 3 months
  - Some portions of the code were affected much more badly than others because of features not implemented in F2C-ACC (lookup tables were translated very poorly)

7 NOAA Long Wave Radiation: CUDA Issues
- Due to the memory requirements of the lookup-table-centric code, it is impossible to compile it with CUDA on a GPU, or even with OpenCL on an IBM JS22 (POWER6)
- Each thread requires approximately 1 MB of local storage space (registers/memory), which is too large for CUDA/OpenCL to cope with
- The GPU duplicates the thread memory requirement 32 times to have a full warp, even if fewer than 32 threads are active within the warp
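The per-warp arithmetic behind this slide can be written out as follows. The 1 MB per thread and the factor of 32 come from the slide; the 4 GiB device memory size is a hypothetical figure used only to show the scale, and the function names are made up for illustration.

```python
WARP_SIZE = 32          # threads per warp, as stated on the slide
MIB = 1024 * 1024


def warp_footprint_bytes(per_thread_bytes, active_threads):
    """Storage reserved for one warp.

    Per the slide, the requirement is duplicated for all 32 threads of the
    warp, so active_threads does not reduce the footprint.
    """
    if not 1 <= active_threads <= WARP_SIZE:
        raise ValueError("a warp holds between 1 and 32 active threads")
    return per_thread_bytes * WARP_SIZE


def max_resident_warps(device_bytes, per_thread_bytes):
    """How many full-warp footprints fit into a given amount of device memory."""
    return device_bytes // warp_footprint_bytes(per_thread_bytes, WARP_SIZE)


per_thread = 1 * MIB                      # approximately 1 MB per thread (slide)
hypothetical_device = 4 * 1024 * MIB      # assumed 4 GiB of device memory

print(warp_footprint_bytes(per_thread, active_threads=1) // MIB, "MiB per warp")
print(max_resident_warps(hypothetical_device, per_thread), "warps fit")
```

Even a warp with a single active thread reserves 32 MiB here, and the assumed 4 GiB device would hold only 128 such warps, far short of the many-thread streaming style that the GPU favors.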

8 NOAA Long Wave Radiation: Status
- Successfully ported cldprop() to GPU
- Successfully ported taugb##() to GPU
- Currently optimizing performance with these functions
- We plan to reverse the pre-calculated lookup tables back to brute-force computation
  - Need to try to find the original code and/or re-implement from the AER documentation!
