Download presentation
Presentation is loading. Please wait.
Published byHubert Jennings Modified over 8 years ago
1
Evaluating Coprocessor Effectiveness for DART Ye Feng SIParCS UCAR/NCAR Boulder, CO University of Wyoming Mentors Helen Kershaw Nancy Collins
2
Introduction DART Data Assimilation Research Testbed Developed and maintained by the DAReS at NCAR GPU NVIDIA Tesla K20x CUDA FORTRAN Previous Work get_close_obs
3
Profiling Result Allinea MAP wrf_regular_test_case
4
Profiling Result Allinea MAP wrf_regular_test_case
5
Target update_from_obs_inc Linear regression of a state variable onto an observation Compute the state variable increments from observation increments State Obs_inc Update_from_obs_inc Reg_coef State_inc
6
CPU Implementation 1 For each close state: state_mean = sum(state) / ens_size obs_state_cov = sum( (state - state_mean) * (obs - obs_prior_mean) ) / (ens_size – 1) reg_coef = obs_state_cov/obs_prior_var state_inc = reg_coef*obs_inc
7
CPU Implementation 2 For each close state: state_mean = sum(state) / ens_size obs_state_cov = sum( (state - state_mean) * (obs - obs_prior_mean) ) / (ens_size – 1) reg_coef = obs_state_cov/obs_prior_var state_inc = reg_coef*obs_inc A A=obs-obs_prior_mean
8
CPU Implementation 2 For each close state: state_mean = sum(state) / ens_size obs_state_cov = sum( (state - state_mean) * (obs - obs_prior_mean) ) / (ens_size – 1) reg_coef = obs_state_cov/obs_prior_var state_inc = reg_coef*obs_inc sum(state*A)-state_mean*sum(A) A=obs-obs_prior_mean
9
CPU Implementation 2 For each close state: (state_mean = sum(state) / ens_size) obs_state_cov = sum( (state - state_mean) * (obs - obs_prior_mean) ) / (ens_size – 1) reg_coef = obs_state_cov/obs_prior_var state_inc = reg_coef*obs_inc sum(state*A)-state_mean*sum(A) *sum(A) A=obs-obs_prior_mean
10
CPU Implementation 2 For each close state: (state_mean = sum(state) / ens_size) obs_state_cov = sum( (state - state_mean) * (obs - obs_prior_mean) ) / (ens_size – 1) reg_coef = obs_state_cov/obs_prior_var state_inc = reg_coef*obs_inc sum(state*A)-state_mean*sum(A) *sum(A) A=obs-obs_prior_mean D D=sum(A)
11
CPU Implementation 2 For each close state: (state_mean = sum(state) / ens_size) obs_state_cov = sum( (state - state_mean) * (obs - obs_prior_mean) ) / (ens_size – 1) reg_coef = obs_state_cov/obs_prior_var state_inc = reg_coef*obs_inc sum(state*A)-state_mean*sum(A) *sum(A) A=obs-obs_prior_mean D=sum(A) BB EE D
12
CPU Implementation 2 For each close state: (state_mean = sum(state) / ens_size) obs_state_cov = sum( (state - state_mean) * (obs - obs_prior_mean) ) / (ens_size – 1) reg_coef = ( E - B )/((ens_size-1)*obs_prior_var) state_inc = reg_coef*obs_inc sum(state*A)-state_mean*sum(A) *sum(A) A=obs-obs_prior_mean D=sum(A) BB EE D
13
CPU Implementation 2 For each close state: B=(sum(state) / ens_size)*D E=sum(state*A) reg_coef = (E-B)/((ens_size-1)*obs_prior_var) state_inc = reg_coef*obs_inc A=obs-obs_prior_mean D=sum(A)
14
CPU Implementation 2 For each close state: state_inc = reg_coef*obs_inc A=obs-obs_prior_mean D=sum(A) reg_coef = (E-B) /((ens_size-1)*obs_prior_var) B=sum(state*D/ens_size) sum(state*(A-D/ens_size)) /((ens_size-1)*obs_prior_var) B=(sum(state) / ens_size)*D E=sum(state*A)
15
sum(state*(A-D/ens_size))/((ens_size-1)*obs_prior_var) CPU Implementation 2 For each close state: state_inc = reg_coef*obs_inc A=obs-obs_prior_mean D=sum(A) reg_coef = (E-B)/((ens_size-1)*obs_prior_var) M M=A-D/ens_size K K=(ens_size-1)*obs_prior_var B=sum(state*D/ens_size) B=(sum(state) / ens_size)*D E=sum(state*A)
16
CPU Implementation 2 For each close state: reg_coef = sum(state*M)/K state_inc = reg_coef*obs_inc A=obs-obs_prior_mean D=sum(A) M=A-D/ens_size K=(ens_size-1)*obs_prior_var
17
CPU Results
18
Algorithm For each close state: reg_coef = sum (state*M)/K state_inc = reg_coef*obs_inc A=obs-obs_prior_mean D=sum(A) M=A-D/ens_size K=(ens_size-1)*obs_prior_var
19
Algorithm Not Enough Computation sum(array[80]) 79 sums 7 steps (After padding) sum(array[4*1024*1024]) 4,194,304 sums 22 steps en.wikipedia.org www.manutritionniste.com
20
Algorithm Low CGMA Compute to Global Memory Access ratio
21
CPU Implementation 1 For each close state: state_mean = sum(state) / ens_size obs_state_cov = sum( (state - state_mean) * (obs - obs_prior_mean) ) / (ens_size – 1) reg_coef = obs_state_cov/obs_prior_var state_inc = reg_coef*obs_inc Float Point Operations LoadStore 79+1:80+11 80+80+79+80+1:80+80+21 1:1+11 80:1+8080 CGMA=1.176
22
CPU Implementation 2 For each close state: reg_coef = sum(state*M)/K state_inc = reg_coef*obs_inc A=obs-obs_prior_mean D=sum(A) M=A-D/ens_size Float Point Operations LoadStore 79+80+1:80+80+11 80:1+8080 CGMA=0.743 K=(ens_size-1)*obs_prior_var
23
GPU Implementation 1 state(:,i) …… … ens_size num_close_states thread i reg_coef (i) state_inc (:,i)
24
GPU Implementation 1 Streams + AsyncMemcpy www.zoombd24.com S1 S2 S3 time
25
GPU Implementation 1 Streams + AsyncMemcpy time
26
GPU Implementation 1 Streams + AsyncMemcpy Assumed-shape Array (:) time
27
GPU Implementation 1 Streams + AsyncMemcpy Assumed-size Array (*) time
28
GPU Results
29
GPU Implementation 2 state(:,i) …… … ens_size num_close_states thread 1:ens_size reg_coef (i) state_inc (:,i)
30
GPU Implementation 2 80 12880 81 Binary TreeTernary Tree Shared Memory
31
GPU Implementation 2 Streams + AsyncMemcpy time
32
GPU Results
33
GPU Implementation 3 Image: pixshark.com
34
GPU Implementation 3 GPU+CPU 4-way concurrency BE reg_coef and state_inc BE S1 S2 S3 time
35
GPU Results
36
Conclusion Reduced redundancy in the CPU version GPU version achieved a 1.9x speedup Explored the ways to implement a memory bound problem on GPU Learned the effects of assumed-shape/size arrays on CUDA FORTRAN performance Integrate more computations into the GPU device kernel to improve the performance
37
Acknowledgement NCAR / UCAR University of Wyoming DAReS: Jeff Anderson Nancy Collins Helen Kershaw Tim Hoar Kevin Raeder Silvia Gentile CISL/ SIParCS Rich Loft Raghu Raj Kumar Thank You!
Similar presentations
© 2024 SlidePlayer.com Inc.
All rights reserved.