Operational Weather Forecasting using GPUs
Dr. Shujia Zhou, Lawrence Sebald

NOAA Long Wave Radiation Code

- Production version of the weather forecast model code
- Accounts for about 10-15% of the global weather forecast simulation time
- NOAA is interested in accelerating this code so that it may be called once per hour rather than once every three hours, as is done now

NOAA Long Wave Radiation Code Structure

- Approximately 4,000 lines of Fortran 90 code
- Additionally, approximately 30,000 lines of raw data within the code
- The code is structured around many random accesses into lookup tables in RAM
- Algorithmically, the tables speed the code from O(L²) time to O(L) time on a CPU
- Efficient on a CPU, horribly inefficient on a GPU
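The lookup-table pattern itself is easy to sketch. The C fragment below is a hypothetical illustration, not the NOAA code (the table size, names, and placeholder formula are invented): a table built once at initialization turns each per-layer evaluation into a single indexed read, which is how the O(L²) work drops toward O(L).

    /* Hypothetical sketch of the lookup-table pattern; names, sizes,
     * and the placeholder formula are illustrative, not the NOAA code. */
    #include <math.h>

    #define NTBL 10000
    static double trans_tbl[NTBL];

    /* Built once at initialization. */
    void build_table(void) {
        for (int i = 0; i < NTBL; i++)
            trans_tbl[i] = exp(-(double)i / NTBL);
    }

    /* Per-layer evaluation becomes one indexed read. The index depends
     * on layer state, so successive reads land at effectively random
     * positions in RAM -- cache-friendly enough on a CPU, but a poor
     * fit for a GPU. */
    double layer_transmittance(double optical_depth) {
        int idx = (int)(optical_depth * NTBL);
        if (idx > NTBL - 1) idx = NTBL - 1;
        if (idx < 0) idx = 0;
        return trans_tbl[idx];
    }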

Code Structure and Memory Requirements

Call tree: main → lwrad → rlwinit, cldprop, taumol, rtrn or rtrnmr; taumol → taugb## (01-16)

The slide tabulates, per function, memory usage in single precision, memory usage in double precision, and share of computation time. Timing shares:

- taumol*: ~60%
- rtrn**: ~35%
- rtrnmr**: ~35%

*: Time stated for taumol includes time used by the taugb## functions
**: Only one of these two functions is used

Optimization Differences between CPU and GPU

CPU:
- Each core has fairly large caches (for instance, on Intel Nehalem: 32 KB L1 data and 256 KB L2 per core, 4-12 MB shared L3)
- Often, precomputed lookup tables provide a decent speedup over brute-force computation
- The NASA Goddard solar (short wave) radiation code and the NOAA long wave radiation code are optimized in this way

GPU:
- On-chip shared memory is much smaller (16 KB per multiprocessor on Tesla; 64 KB, shared among 32 cores, per multiprocessor on Fermi)
- Brute-force calculation is more efficient due to the large number of SIMD cores (512 on Fermi)
- Streaming computation with many threads is preferable to lookup-table-centric programming
- Reversing the lookup-table approach back to computational functions also reduces memory consumption
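The contrast can be made concrete with a pair of hedged CUDA kernels; the kernel names and the placeholder formula are illustrative, not taken from the NOAA code. The table version issues a data-dependent global-memory read per thread, while the brute-force version keeps everything in registers and lets the arithmetic units do the work.

    /* Illustrative CUDA contrast; names and the formula are hypothetical.
     * Table-centric: each thread performs a data-dependent read that
     * scatters across global memory and serializes into many memory
     * transactions. */
    __global__ void tau_lookup(const double *tbl, const double *od,
                               double *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            int k = (int)(od[i] * 10000.0);
            if (k > 9999) k = 9999;
            out[i] = tbl[k];            /* effectively random access */
        }
    }

    /* Brute-force: the same quantity recomputed per thread. The extra
     * arithmetic stays in registers, which suits hundreds of SIMD
     * cores far better than scattered loads, and needs no table
     * storage at all. */
    __global__ void tau_compute(const double *od, double *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = exp(-od[i]);
    }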

Translation from Fortran 90 to C

- Utilized a NOAA tool known as F2C-ACC to translate the Fortran 90 code to C
- C is better supported for GPU programming than Fortran, and will generally be supported first on future chips as well
  - Fortran was only recently supported, by a PGI compiler
  - Little documentation, few examples, potentially less efficient than C code
- F2C-ACC did a relatively good job of translating the raw computation code, but the tool is not perfect
  - Hand-tuning the conversion took approximately 3 months; hand editing of the translated code was necessary
  - Some portions of the code were affected much more than others by features not implemented in F2C-ACC (lookup tables were translated very poorly)
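A hedged sketch of what such a translation involves (the fragment is hypothetical, not the actual radiation code): beyond syntax, F2C-ACC-style output has to reconcile Fortran's 1-based arrays with C's 0-based ones, exactly the kind of detail the hand-tuning pass had to verify.

    /* Hypothetical Fortran-to-C translation example, not the NOAA code.
     * Original Fortran 90:
     *
     *   do lay = 1, nlayers
     *      taug(lay) = colh2o(lay) * absa(ind(lay))
     *   end do
     *
     * Fortran arrays are 1-based; the translated C must shift loop
     * bounds and any stored index data consistently. */
    void taugb_sketch(int nlayers, const double *colh2o,
                      const double *absa, const int *ind, double *taug) {
        for (int lay = 0; lay < nlayers; lay++) {
            /* ind[] holds 1-based Fortran indices, hence the -1 */
            taug[lay] = colh2o[lay] * absa[ind[lay] - 1];
        }
    }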

NOAA Long Wave Radiation: CUDA Issues

- Due to the memory requirements of the lookup-table-centric code, it is impossible to compile it with CUDA for the GPU, or even with OpenCL on an IBM JS22 (POWER6)
- Each thread requires approximately 1 MB of local storage (registers/memory), which is too large for CUDA/OpenCL to cope with
- The GPU duplicates the per-thread memory requirement 32 times to provision a full warp, even if fewer than 32 threads are active within the warp
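A minimal sketch of the failure mode, with illustrative sizes (this kernel is an invented example, not the radiation code): a per-thread local array on the order of 1 MB exceeds the per-thread local-memory limit of the hardware of that era, and the warp-level duplication multiplies whatever footprint remains by 32.

    /* Illustrative only -- this kernel is meant to fail to build.
     * A local array of 131072 doubles is ~1 MB per thread, beyond the
     * per-thread local-memory limit, and the hardware provisions local
     * memory per warp: 1 MB x 32 threads = 32 MB before a single block
     * is resident. */
    __global__ void lookup_centric(double *out) {
        double locals[131072];               /* ~1 MB of per-thread state */
        for (int i = 0; i < 131072; i++)
            locals[i] = (double)i;           /* lives in off-chip local memory */
        out[threadIdx.x] = locals[threadIdx.x];
    }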

NOAA Long Wave Radiation: Status

- Successfully ported cldprop() to the GPU
- Successfully ported taugb##() to the GPU
- Currently optimizing the performance of these functions
- We plan to reverse the precomputed lookup tables back to brute-force computation
  - Need to find the original code and/or re-implement it from the AER documentation