Evaluating Coprocessor Effectiveness for DART
Ye Feng, SIParCS, UCAR/NCAR, Boulder, CO; University of Wyoming
Mentors: Helen Kershaw, Nancy Collins

Introduction
- DART: the Data Assimilation Research Testbed, developed and maintained by DAReS at NCAR
- GPU: NVIDIA Tesla K20X
- Language: CUDA Fortran
- Previous work: get_close_obs

Profiling Result
Allinea MAP profile of the wrf_regular_test_case (profiler screenshots in the original slides).

Target: update_from_obs_inc
- Performs a linear regression of a state variable onto an observation.
- Computes the state-variable increments from the observation increments.
- Data flow: state, obs_inc -> update_from_obs_inc -> reg_coef -> state_inc

CPU Implementation 1
For each close state:
    state_mean    = sum(state) / ens_size
    obs_state_cov = sum((state - state_mean) * (obs - obs_prior_mean)) / (ens_size - 1)
    reg_coef      = obs_state_cov / obs_prior_var
    state_inc     = reg_coef * obs_inc
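A minimal Fortran sketch of this baseline (a sketch only; the subroutine name and interface are hypothetical, with the state stored as one ensemble column per close state):

    ! Baseline regression of each close state variable onto the observation.
    subroutine update_cpu1(state, obs, obs_inc, obs_prior_mean, obs_prior_var, &
                           ens_size, num_close, state_inc)
       implicit none
       integer, intent(in)  :: ens_size, num_close
       real,    intent(in)  :: state(ens_size, num_close)
       real,    intent(in)  :: obs(ens_size), obs_inc(ens_size)
       real,    intent(in)  :: obs_prior_mean, obs_prior_var
       real,    intent(out) :: state_inc(ens_size, num_close)
       real    :: state_mean, obs_state_cov, reg_coef
       integer :: i

       do i = 1, num_close
          state_mean    = sum(state(:, i)) / ens_size
          obs_state_cov = sum((state(:, i) - state_mean) * (obs - obs_prior_mean)) &
                          / (ens_size - 1)
          reg_coef      = obs_state_cov / obs_prior_var
          state_inc(:, i) = reg_coef * obs_inc
       end do
    end subroutine update_cpu1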

CPU Implementation 2 (derivation)
Starting from Implementation 1, every quantity that does not depend on the close state can be hoisted out of the loop.

Let A = obs - obs_prior_mean. The covariance numerator expands as
    sum((state - state_mean) * A) = sum(state*A) - state_mean*sum(A)

Let D = sum(A), E = sum(state*A), and B = state_mean*D = (sum(state)/ens_size)*D, so
    reg_coef = (E - B) / ((ens_size - 1) * obs_prior_var)

Because B = sum(state*D/ens_size), the difference collapses into a single reduction:
    E - B = sum(state * (A - D/ens_size))

Let M = A - D/ens_size and K = (ens_size - 1)*obs_prior_var. A, D, M, and K depend only on the observation, so they are computed once, before the loop:

CPU Implementation 2
A = obs - obs_prior_mean
D = sum(A)
M = A - D/ens_size
K = (ens_size - 1)*obs_prior_var
For each close state:
    reg_coef  = sum(state*M) / K
    state_inc = reg_coef * obs_inc
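In equation form, the rewrite is the standard expansion of the sample covariance (a sketch using the symbols defined above, with N = ens_size, x = state, y = obs):

    \begin{aligned}
    A_k &= y_k - \bar{y}, \qquad D = \sum_{k=1}^{N} A_k, \qquad K = (N-1)\,\sigma_y^2, \\
    \widehat{\mathrm{cov}}(x,y) &= \frac{1}{N-1}\sum_{k=1}^{N}(x_k-\bar{x})A_k
      = \frac{1}{N-1}\left(\sum_{k=1}^{N}x_k A_k - \bar{x}\,D\right)
      = \frac{1}{N-1}\sum_{k=1}^{N}x_k\left(A_k - \frac{D}{N}\right), \\
    \texttt{reg\_coef} &= \frac{\widehat{\mathrm{cov}}(x,y)}{\sigma_y^2}
      = \frac{1}{K}\sum_{k=1}^{N}x_k M_k, \qquad M_k = A_k - \frac{D}{N}.
    \end{aligned}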

CPU Results (performance chart in the original slides)

Algorithm
A = obs - obs_prior_mean
D = sum(A)
M = A - D/ens_size
K = (ens_size - 1)*obs_prior_var
For each close state:
    reg_coef  = sum(state*M) / K
    state_inc = reg_coef * obs_inc
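A corresponding Fortran sketch (same assumed shapes and hypothetical names as the Implementation 1 sketch above); the observation-only terms are hoisted out of the loop:

    ! Optimized version: A, D, M, K depend only on the observation,
    ! so they are computed once; the loop body is a single reduction.
    subroutine update_cpu2(state, obs, obs_inc, obs_prior_mean, obs_prior_var, &
                           ens_size, num_close, state_inc)
       implicit none
       integer, intent(in)  :: ens_size, num_close
       real,    intent(in)  :: state(ens_size, num_close)
       real,    intent(in)  :: obs(ens_size), obs_inc(ens_size)
       real,    intent(in)  :: obs_prior_mean, obs_prior_var
       real,    intent(out) :: state_inc(ens_size, num_close)
       real    :: A(ens_size), M(ens_size), D, K, reg_coef
       integer :: i

       A = obs - obs_prior_mean
       D = sum(A)
       M = A - D / ens_size
       K = (ens_size - 1) * obs_prior_var

       do i = 1, num_close
          reg_coef        = sum(state(:, i) * M) / K
          state_inc(:, i) = reg_coef * obs_inc
       end do
    end subroutine update_cpu2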

Algorithm: Not Enough Computation
A parallel tree reduction takes only ceil(log2(n)) steps:
- sum(array[80]): 79 sequential sums, but only 7 parallel steps (after padding to 128)
- sum(array[4*1024*1024]): 4,194,303 sequential sums, but only 22 parallel steps
(Reduction-tree illustration from en.wikipedia.org.)
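The step counts are just the depth of the binary reduction tree:

    \text{steps} = \lceil \log_2 n \rceil, \qquad
    \lceil \log_2 80 \rceil = 7 \;(\text{pad } 80 \to 128 = 2^7), \qquad
    \log_2(4 \cdot 1024 \cdot 1024) = \log_2 2^{22} = 22.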

Algorithm: Low CGMA
CGMA is the Compute to Global Memory Access ratio: the number of floating-point operations performed per access to global memory. A low CGMA means the kernel is memory bound rather than compute bound.
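As a formula:

    \mathrm{CGMA} = \frac{\#\,\text{floating-point operations}}{\#\,\text{global memory accesses}}

The next two slides tabulate this ratio for the two CPU implementations.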

CPU Implementation 1: CGMA accounting
For each close state:
    state_mean    = sum(state) / ens_size
    obs_state_cov = sum((state - state_mean) * (obs - obs_prior_mean)) / (ens_size - 1)
    reg_coef      = obs_state_cov / obs_prior_var
    state_inc     = reg_coef * obs_inc
The slide tabulates floating-point operations, loads, and stores per line (the first line costs 79 adds + 1 divide with ens_size = 80). Overall CGMA = 1.176.

CPU Implementation 2: CGMA accounting
A = obs - obs_prior_mean
D = sum(A)
M = A - D/ens_size
K = (ens_size - 1)*obs_prior_var
For each close state:
    reg_coef  = sum(state*M) / K
    state_inc = reg_coef * obs_inc
Overall CGMA = 0.743: Implementation 2 does less arithmetic in total, so its compute-to-memory ratio is even lower and the loop is even more memory bound.

GPU Implementation 1
One thread per close state: thread i reads the ensemble column state(:, i) (length ens_size, one of num_close_states columns) and computes reg_coef(i) and state_inc(:, i).
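A CUDA Fortran sketch of this one-thread-per-state mapping (kernel name and interface hypothetical; M, Kfac, and obs_inc precomputed as in the algorithm above and resident on the device):

    ! One thread per close state: each thread serially reduces its own column.
    attributes(global) subroutine update_gpu1(state, M, obs_inc, Kfac, &
                                              ens_size, num_close, state_inc)
       implicit none
       integer, value    :: ens_size, num_close
       real,    value    :: Kfac                 ! (ens_size-1)*obs_prior_var
       real, intent(in)  :: state(ens_size, num_close)
       real, intent(in)  :: M(ens_size), obs_inc(ens_size)
       real, intent(out) :: state_inc(ens_size, num_close)
       real    :: reg_coef
       integer :: i, j

       i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
       if (i <= num_close) then
          reg_coef = 0.0
          do j = 1, ens_size
             reg_coef = reg_coef + state(j, i) * M(j)
          end do
          reg_coef = reg_coef / Kfac
          do j = 1, ens_size
             state_inc(j, i) = reg_coef * obs_inc(j)
          end do
       end if
    end subroutine update_gpu1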

GPU Implementation 1: Streams + AsyncMemcpy
The close states are split across three CUDA streams (S1, S2, S3) so that host-to-device copies, kernel execution, and device-to-host copies from different streams can overlap (timeline diagrams in the original slides).

With assumed-shape dummy arrays (:), the timelines show that the expected overlap is lost; with assumed-size dummy arrays (*), the copies and kernels overlap as intended. A plausible explanation is that an assumed-shape argument is not guaranteed contiguous, so the compiler may insert a temporary copy and a synchronization, while assumed-size implies contiguity.
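A sketch of the stream pipeline under these assumptions (hypothetical names; the actual state and state_inc arrays must be allocated with the pinned attribute for cudaMemcpyAsync to overlap, and update_gpu1 is the kernel sketched above):

    ! Overlap H2D copies, kernels, and D2H copies across three CUDA streams.
    subroutine update_streams(state, state_inc, M_d, obs_inc_d, Kfac, &
                              ens_size, num_close)
       use cudafor
       implicit none
       integer, intent(in) :: ens_size, num_close
       real, intent(in)    :: state(ens_size, num_close)      ! pinned at allocation
       real, intent(out)   :: state_inc(ens_size, num_close)  ! pinned at allocation
       real, device, intent(in) :: M_d(ens_size), obs_inc_d(ens_size)
       real, intent(in)    :: Kfac
       integer, parameter  :: nstreams = 3
       integer(kind=cuda_stream_kind) :: streams(nstreams)
       real, device, allocatable :: state_d(:,:), state_inc_d(:,:)
       integer :: s, lo, hi, nloc, chunk, istat

       allocate(state_d(ens_size, num_close), state_inc_d(ens_size, num_close))
       do s = 1, nstreams
          istat = cudaStreamCreate(streams(s))
       end do

       chunk = (num_close + nstreams - 1) / nstreams
       do s = 1, nstreams
          lo = (s - 1) * chunk + 1
          if (lo > num_close) cycle
          hi   = min(s * chunk, num_close)
          nloc = hi - lo + 1
          istat = cudaMemcpyAsync(state_d(:, lo:hi), state(:, lo:hi), &
                                  ens_size * nloc, stream=streams(s))
          call update_gpu1<<<(nloc + 255) / 256, 256, 0, streams(s)>>> &
               (state_d(:, lo:hi), M_d, obs_inc_d, Kfac, ens_size, nloc, &
                state_inc_d(:, lo:hi))
          istat = cudaMemcpyAsync(state_inc(:, lo:hi), state_inc_d(:, lo:hi), &
                                  ens_size * nloc, stream=streams(s))
       end do
       istat = cudaDeviceSynchronize()
    end subroutine update_streams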

GPU Results (performance chart in the original slides)

GPU Implementation 2
One thread block per close state: threads 1:ens_size cooperatively reduce the column state(:, i) to produce reg_coef(i) and state_inc(:, i).

GPU Implementation 2
The per-block reduction runs in shared memory. With a binary tree, the 80 ensemble members must be padded up to 128; with a ternary tree, padding 80 to 81 = 3^4 suffices, giving a shallower tree with less wasted work.
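A sketch of the block-per-state variant with a binary-tree shared-memory reduction (hypothetical names; launched with one block per close state and blockDim%x = 128, the ensemble size padded to a power of two):

    ! One block per close state; blockDim%x = padded ensemble size (e.g. 128).
    attributes(global) subroutine update_gpu2(state, M, obs_inc, Kfac, &
                                              ens_size, num_close, state_inc)
       implicit none
       integer, value    :: ens_size, num_close
       real,    value    :: Kfac
       real, intent(in)  :: state(ens_size, num_close)
       real, intent(in)  :: M(ens_size), obs_inc(ens_size)
       real, intent(out) :: state_inc(ens_size, num_close)
       real, shared :: partial(128)
       real    :: reg_coef
       integer :: i, t, stride

       i = blockIdx%x          ! close-state index
       t = threadIdx%x         ! ensemble-member index (1..blockDim%x)

       ! Load products into shared memory, padding with zeros past ens_size.
       if (t <= ens_size) then
          partial(t) = state(t, i) * M(t)
       else
          partial(t) = 0.0
       end if
       call syncthreads()

       ! Binary-tree reduction: 7 steps for 128 entries.
       stride = blockDim%x / 2
       do while (stride >= 1)
          if (t <= stride) partial(t) = partial(t) + partial(t + stride)
          call syncthreads()
          stride = stride / 2
       end do

       reg_coef = partial(1) / Kfac
       if (t <= ens_size) state_inc(t, i) = reg_coef * obs_inc(t)
    end subroutine update_gpu2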

GPU Implementation 2: Streams + AsyncMemcpy (overlap timeline in the original slides)

GPU Results (performance chart in the original slides)

GPU Implementation 3 (illustration from pixshark.com)

GPU Implementation 3: GPU + CPU, 4-way concurrency
Host-to-device copies, device-to-host copies, kernel execution (computing reg_coef and state_inc), and CPU computation all proceed concurrently across streams S1, S2, S3 (timeline in the original slides).

GPU Results (performance chart in the original slides)

Conclusion
- Reduced redundancy in the CPU version.
- The GPU version achieved a 1.9x speedup.
- Explored ways to implement a memory-bound problem on the GPU.
- Learned the effects of assumed-shape and assumed-size arrays on CUDA Fortran performance.
- Future work: integrate more computation into the GPU device kernel to improve performance.

Acknowledgements
NCAR / UCAR; University of Wyoming
DAReS: Jeff Anderson, Nancy Collins, Helen Kershaw, Tim Hoar, Kevin Raeder, Silvia Gentile
CISL / SIParCS: Rich Loft, Raghu Raj Kumar
Thank You!