Evaluating Coprocessor Effectiveness for DART. Ye Feng, SIParCS, UCAR/NCAR, Boulder, CO; University of Wyoming. Mentors: Helen Kershaw, Nancy Collins
Introduction
    DART: Data Assimilation Research Testbed, developed and maintained by DAReS at NCAR
    GPU: NVIDIA Tesla K20x, CUDA FORTRAN
    Previous work: get_close_obs
Profiling Result: Allinea MAP, wrf_regular_test_case (figure)
Target: update_from_obs_inc
    Linear regression of a state variable onto an observation
    Computes the state variable increments from the observation increments
    Diagram: state, obs_inc -> update_from_obs_inc -> reg_coef, state_inc
CPU Implementation 1
    For each close state:
        state_mean = sum(state) / ens_size
        obs_state_cov = sum( (state - state_mean) * (obs - obs_prior_mean) ) / (ens_size - 1)
        reg_coef = obs_state_cov / obs_prior_var
        state_inc = reg_coef * obs_inc
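The per-state loop body above can be sketched in plain Python as follows. This is only an illustration of the pseudocode, not the DART routine itself: the function name `update_from_obs_inc_v1` and the list-based arguments are assumptions made for the example.

```python
def update_from_obs_inc_v1(state, obs, obs_prior_mean, obs_prior_var, obs_inc):
    """Regress one state variable onto one observation (slide pseudocode, version 1).

    state, obs, obs_inc: lists with one entry per ensemble member.
    The name and signature are hypothetical, chosen for this sketch.
    """
    ens_size = len(state)
    state_mean = sum(state) / ens_size
    # Sample covariance between the state ensemble and the observation ensemble.
    obs_state_cov = sum((s - state_mean) * (o - obs_prior_mean)
                        for s, o in zip(state, obs)) / (ens_size - 1)
    reg_coef = obs_state_cov / obs_prior_var
    # Scale every observation increment by the regression coefficient.
    state_inc = [reg_coef * d for d in obs_inc]
    return reg_coef, state_inc
```

Called once per close state, this recomputes `obs - obs_prior_mean` every time, which is the redundancy the second CPU implementation removes.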
CPU Implementation 2 (derivation)
    Terms that do not depend on the state are taken out of the loop:
        A = obs - obs_prior_mean
        D = sum(A)
    Expand the covariance sum:
        sum( (state - state_mean) * A ) = sum(state*A) - state_mean*sum(A) = E - B
        E = sum(state*A)
        B = (sum(state) / ens_size) * D = sum(state*D/ens_size)
    so state_mean no longer has to be formed separately:
        reg_coef = (E - B) / ((ens_size-1)*obs_prior_var)
                 = sum(state*(A - D/ens_size)) / ((ens_size-1)*obs_prior_var)
    Name the remaining state-independent terms:
        M = A - D/ens_size
        K = (ens_size-1)*obs_prior_var
CPU Implementation 2
    A = obs - obs_prior_mean
    D = sum(A)
    M = A - D/ens_size
    K = (ens_size-1)*obs_prior_var
    For each close state:
        reg_coef = sum(state*M) / K
        state_inc = reg_coef * obs_inc
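A minimal Python sketch of this restructured version, assuming the same list-based inputs as the pseudocode. The helper names `precompute_obs_terms` and `update_from_obs_inc_v2` are hypothetical; the point is only that M and K are built once and reused for every close state.

```python
def precompute_obs_terms(obs, obs_prior_mean, obs_prior_var):
    """Build the state-independent terms M and K once per observation."""
    ens_size = len(obs)
    a = [o - obs_prior_mean for o in obs]      # A = obs - obs_prior_mean
    d = sum(a)                                 # D = sum(A)
    m = [x - d / ens_size for x in a]          # M = A - D/ens_size
    k = (ens_size - 1) * obs_prior_var         # K = (ens_size-1)*obs_prior_var
    return m, k


def update_from_obs_inc_v2(state, m, k, obs_inc):
    """Per-state work: one dot product, one divide, one scaling."""
    reg_coef = sum(s * w for s, w in zip(state, m)) / k
    state_inc = [reg_coef * d for d in obs_inc]
    return reg_coef, state_inc
```

For example, `m, k = precompute_obs_terms(obs, obs_prior_mean, obs_prior_var)` is evaluated once, after which `update_from_obs_inc_v2(state, m, k, obs_inc)` is called for each close state.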
CPU Results
Algorithm
    A = obs - obs_prior_mean
    D = sum(A)
    M = A - D/ens_size
    K = (ens_size-1)*obs_prior_var
    For each close state:
        reg_coef = sum(state*M) / K
        state_inc = reg_coef * obs_inc
Algorithm: Not Enough Computation
    sum(array[80]): 79 sums, 7 steps (after padding)
    sum(array[4*1024*1024]): 4,194,303 sums, 22 steps
    Image: en.wikipedia.org
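The step counts on this slide follow from a pairwise (binary tree) reduction: n elements need n - 1 additions, and the number of parallel steps is the base-2 logarithm of the padded length. A small sketch, with the hypothetical helper name `tree_reduction_cost`, makes the comparison concrete:

```python
def tree_reduction_cost(n):
    """Return (additions, parallel_steps, padded_length) for a binary tree sum.

    additions counts only the useful sums over the n real elements;
    padded_length is the next power of two at or above n.
    """
    additions = n - 1
    padded = 1
    steps = 0
    while padded < n:
        padded *= 2
        steps += 1
    return additions, steps, padded


print(tree_reduction_cost(80))               # small ensemble-sized sum
print(tree_reduction_cost(4 * 1024 * 1024))  # large array for comparison
```

An 80-element sum offers only about 11 additions per parallel step, while the large array offers roughly 190,000, which is the sense in which the ensemble sum has too little work to keep a GPU busy.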
Algorithm: Low CGMA (Compute to Global Memory Access ratio)
CPU Implementation 1: floating-point operations vs. loads/stores
    For each close state:
        state_mean = sum(state) / ens_size
        obs_state_cov = sum( (state - state_mean) * (obs - obs_prior_mean) ) / (ens_size - 1)
        reg_coef = obs_state_cov / obs_prior_var
        state_inc = reg_coef * obs_inc
    Per-statement counts (floating-point operations : loads/stores): first statement 79+1 operations; the remaining counts are missing
    CGMA = 1.176
CPU Implementation 2: floating-point operations vs. loads/stores
    A = obs - obs_prior_mean
    D = sum(A)
    M = A - D/ens_size
    K = (ens_size-1)*obs_prior_var
    For each close state:
        reg_coef = sum(state*M) / K
        state_inc = reg_coef * obs_inc
    Per-statement counts (floating-point operations : loads/stores) are missing
    CGMA = 0.743
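The per-statement counts did not survive on this slide, so the following is only one plausible counting convention, not necessarily the author's. It assumes that every array element and every scalar read or written inside the per-state loop counts as one memory access, and that M and K are already available. The function name `cgma_impl2` is hypothetical. With ens_size = 80 this convention gives a ratio of about 0.743.

```python
def cgma_impl2(ens_size):
    """Count flops and memory accesses for the per-state loop of implementation 2.

    Counting convention (an assumption for this sketch): each element or scalar
    read or written inside the loop counts as one access.
    """
    # reg_coef = sum(state*M)/K
    flops = ens_size + (ens_size - 1) + 1   # multiplies, adds, one divide
    loads = ens_size + ens_size + 1         # state, M, K
    stores = 1                              # reg_coef
    # state_inc = reg_coef*obs_inc
    flops += ens_size                       # multiplies
    loads += 1 + ens_size                   # reg_coef, obs_inc
    stores += ens_size                      # state_inc
    accesses = loads + stores
    return flops, accesses, flops / accesses


flops, accesses, ratio = cgma_impl2(80)
print(flops, accesses, round(ratio, 3))
```

Fewer than one floating-point operation per memory access means the loop is limited by memory bandwidth rather than arithmetic throughput.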
GPU Implementation 1: one thread per close state
    state is an ens_size x num_close_states array
    thread i reads state(:,i) and produces reg_coef(i) and state_inc(:,i)
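The thread mapping on this slide can be emulated on the CPU to check the indexing. In the sketch below, `thread_body` stands in for the device kernel body and the loop in `launch` stands in for the kernel launch. Both names, and the storage of `state` as a list of columns, are assumptions for illustration only.

```python
def thread_body(i, state_cols, m, k, obs_inc, reg_coef, state_inc):
    """Work done by 'thread' i: one whole close state, i.e. column state(:,i)."""
    column = state_cols[i]
    # Serial sum over the ensemble inside a single thread.
    reg_coef[i] = sum(s * w for s, w in zip(column, m)) / k
    state_inc[i] = [reg_coef[i] * d for d in obs_inc]


def launch(state_cols, m, k, obs_inc):
    """Emulate a launch with num_close_states independent threads."""
    num_close_states = len(state_cols)
    reg_coef = [0.0] * num_close_states
    state_inc = [None] * num_close_states
    for i in range(num_close_states):   # on the GPU these iterations run in parallel
        thread_body(i, state_cols, m, k, obs_inc, reg_coef, state_inc)
    return reg_coef, state_inc
```

Each thread writes only to its own `reg_coef(i)` and `state_inc(:,i)`, so no synchronization between threads is needed in this layout.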
GPU Implementation 1: Streams + AsyncMemcpy (timeline figure: streams S1, S2, S3 along a time axis)
GPU Implementation 1: Streams + AsyncMemcpy, assumed-shape array (:) (timeline figure)
GPU Implementation 1: Streams + AsyncMemcpy, assumed-size array (*) (timeline figure)
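The timelines themselves are not recoverable here, but the idea behind streams with asynchronous copies can be illustrated with a toy timing model. The sketch assumes the work is split into equal chunks, and that host-to-device copy, kernel, and device-to-host copy each use their own engine and so can overlap across chunks. The function name `pipeline_times` and all durations are hypothetical, not measurements from the talk.

```python
def pipeline_times(num_chunks, t_h2d, t_kernel, t_d2h):
    """Return (serial_time, overlapped_time) for a three-stage copy/compute pipeline.

    Assumes three independent engines (H2D copy, compute, D2H copy) and that
    chunks pass through each engine in order.
    """
    serial = num_chunks * (t_h2d + t_kernel + t_d2h)
    stage_times = (t_h2d, t_kernel, t_d2h)
    stage_free = [0.0, 0.0, 0.0]    # time at which each engine becomes idle
    finish = 0.0
    for _ in range(num_chunks):
        ready = 0.0                 # time at which this chunk can enter the next stage
        for s, t in enumerate(stage_times):
            start = max(ready, stage_free[s])
            ready = start + t
            stage_free[s] = ready
        finish = ready
    return serial, finish


print(pipeline_times(3, 2, 3, 1))   # hypothetical durations in arbitrary units
```

In this model the overlapped time approaches the number of chunks times the slowest stage, so the benefit of streams is bounded by whichever of copy or compute dominates.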
GPU Results
GPU Implementation 2: ens_size threads per close state
    state is an ens_size x num_close_states array
    threads 1:ens_size work together on state(:,i) and produce reg_coef(i) and state_inc(:,i)
GPU Implementation 2: reduction in shared memory
    Binary tree (80 elements) vs. ternary tree (81 elements)
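One way to read the 80 versus 81 on this slide is that 81 = 3^4, so a ternary tree over an 80-member ensemble needs a single padding element, whereas a binary tree pads to 128. The sketch below emulates a k-ary tree reduction over a shared buffer on the CPU. The function name `tree_sum` and the strided access pattern are assumptions for illustration, not the kernel from the talk.

```python
def tree_sum(values, arity):
    """Sum values with a k-ary tree over a zero-padded 'shared memory' buffer.

    Returns (total, parallel_steps, padded_length).
    """
    size = 1
    while size < len(values):
        size *= arity                       # pad to the next power of arity
    shared = list(values) + [0.0] * (size - len(values))
    steps = 0
    active = size
    while active > 1:
        active //= arity
        # 'Thread' t combines arity elements spaced active apart.
        # Reads at index >= active never collide with writes at index < active.
        for t in range(active):
            shared[t] = sum(shared[t + j * active] for j in range(arity))
        steps += 1                          # on the GPU: one barrier per step
    return shared[0], steps, size


ensemble = [float(x) for x in range(1, 81)]  # 80 hypothetical values
print(tree_sum(ensemble, 2))
print(tree_sum(ensemble, 3))
```

For 80 values the binary tree takes 7 steps over a 128-element buffer and the ternary tree takes 4 steps over an 81-element buffer, at the cost of more additions per thread in each step.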
GPU Implementation 2: Streams + AsyncMemcpy (timeline figure)
GPU Results
GPU Implementation 3 Image: pixshark.com
GPU Implementation 3: GPU + CPU, 4-way concurrency (timeline figure: streams S1, S2, S3 along a time axis; labels "BE" and "reg_coef and state_inc")
GPU Results
Conclusion
    Reduced redundancy in the CPU version
    GPU version achieved a 1.9x speedup
    Explored ways to implement a memory-bound problem on the GPU
    Learned the effects of assumed-shape and assumed-size arrays on CUDA FORTRAN performance
    Integrate more computation into the GPU device kernel to improve performance
Acknowledgement
    NCAR / UCAR
    University of Wyoming
    DAReS: Jeff Anderson, Nancy Collins, Helen Kershaw, Tim Hoar, Kevin Raeder, Silvia Gentile
    CISL / SIParCS: Rich Loft, Raghu Raj Kumar
    Thank You!