Evaluating Coprocessor Effectiveness for DART Ye Feng SIParCS UCAR/NCAR Boulder, CO University of Wyoming Mentors Helen Kershaw Nancy Collins.

Presentation transcript:

1 Evaluating Coprocessor Effectiveness for DART
Ye Feng
SIParCS, UCAR/NCAR, Boulder, CO
University of Wyoming
Mentors: Helen Kershaw, Nancy Collins

2 Introduction
DART: Data Assimilation Research Testbed, developed and maintained by DAReS at NCAR
GPU: NVIDIA Tesla K20x, CUDA Fortran
Previous work: get_close_obs

3 Profiling Result Allinea MAP wrf_regular_test_case

4 Profiling Result Allinea MAP wrf_regular_test_case

5 Target: update_from_obs_inc
Linear regression of a state variable onto an observation.
Computes the state variable increments from the observation increments.
[Diagram: State, Obs_inc -> update_from_obs_inc -> Reg_coef, State_inc]

6 CPU Implementation 1
For each close state:
    state_mean = sum(state) / ens_size
    obs_state_cov = sum((state - state_mean) * (obs - obs_prior_mean)) / (ens_size - 1)
    reg_coef = obs_state_cov / obs_prior_var
    state_inc = reg_coef * obs_inc

7 CPU Implementation 2
A = obs - obs_prior_mean
For each close state:
    state_mean = sum(state) / ens_size
    obs_state_cov = sum((state - state_mean) * A) / (ens_size - 1)
    reg_coef = obs_state_cov / obs_prior_var
    state_inc = reg_coef * obs_inc

8 CPU Implementation 2
A = obs - obs_prior_mean
For each close state:
    state_mean = sum(state) / ens_size
    obs_state_cov = (sum(state * A) - state_mean * sum(A)) / (ens_size - 1)
    reg_coef = obs_state_cov / obs_prior_var
    state_inc = reg_coef * obs_inc

9 CPU Implementation 2
A = obs - obs_prior_mean
For each close state:
    obs_state_cov = (sum(state * A) - (sum(state) / ens_size) * sum(A)) / (ens_size - 1)
    reg_coef = obs_state_cov / obs_prior_var
    state_inc = reg_coef * obs_inc

10 CPU Implementation 2
A = obs - obs_prior_mean
D = sum(A)
For each close state:
    obs_state_cov = (sum(state * A) - (sum(state) / ens_size) * D) / (ens_size - 1)
    reg_coef = obs_state_cov / obs_prior_var
    state_inc = reg_coef * obs_inc

11 CPU Implementation 2
A = obs - obs_prior_mean
D = sum(A)
For each close state:
    E = sum(state * A)
    B = (sum(state) / ens_size) * D
    obs_state_cov = (E - B) / (ens_size - 1)
    reg_coef = obs_state_cov / obs_prior_var
    state_inc = reg_coef * obs_inc

12 CPU Implementation 2
A = obs - obs_prior_mean
D = sum(A)
For each close state:
    E = sum(state * A)
    B = (sum(state) / ens_size) * D
    reg_coef = (E - B) / ((ens_size - 1) * obs_prior_var)
    state_inc = reg_coef * obs_inc

13 CPU Implementation 2
A = obs - obs_prior_mean
D = sum(A)
For each close state:
    B = (sum(state) / ens_size) * D
    E = sum(state * A)
    reg_coef = (E - B) / ((ens_size - 1) * obs_prior_var)
    state_inc = reg_coef * obs_inc

14 CPU Implementation 2
A = obs - obs_prior_mean
D = sum(A)
For each close state:
    B = (sum(state) / ens_size) * D = sum(state * D / ens_size)
    E = sum(state * A)
    E - B = sum(state * (A - D / ens_size))
    reg_coef = sum(state * (A - D / ens_size)) / ((ens_size - 1) * obs_prior_var)
    state_inc = reg_coef * obs_inc

15 CPU Implementation 2
A = obs - obs_prior_mean
D = sum(A)
M = A - D / ens_size
K = (ens_size - 1) * obs_prior_var
For each close state:
    reg_coef = sum(state * (A - D / ens_size)) / ((ens_size - 1) * obs_prior_var) = sum(state * M) / K
    state_inc = reg_coef * obs_inc

16 CPU Implementation 2
A = obs - obs_prior_mean
D = sum(A)
M = A - D / ens_size
K = (ens_size - 1) * obs_prior_var
For each close state:
    reg_coef = sum(state * M) / K
    state_inc = reg_coef * obs_inc
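
The rewrite from CPU Implementation 1 to CPU Implementation 2 can be checked numerically. The following is a minimal Python sketch, not the DART Fortran code: the function names (`reg_coef_v1`, `precompute`, `reg_coef_v2`) and the random test data are assumptions made for illustration. It computes the regression coefficient both ways and records the largest difference.

```python
import random

def reg_coef_v1(state, obs, obs_prior_mean, obs_prior_var):
    """Implementation 1: mean and covariance recomputed for every state."""
    ens_size = len(state)
    state_mean = sum(state) / ens_size
    obs_state_cov = sum(
        (s - state_mean) * (o - obs_prior_mean) for s, o in zip(state, obs)
    ) / (ens_size - 1)
    return obs_state_cov / obs_prior_var

def precompute(obs, obs_prior_mean, obs_prior_var):
    """Implementation 2: terms that depend only on the observation."""
    ens_size = len(obs)
    A = [o - obs_prior_mean for o in obs]
    D = sum(A)
    M = [a - D / ens_size for a in A]
    K = (ens_size - 1) * obs_prior_var
    return M, K

def reg_coef_v2(state, M, K):
    """Implementation 2: one multiply-and-sum per state."""
    return sum(s * m for s, m in zip(state, M)) / K

# Hypothetical test data: 80 ensemble members, 5 close states.
random.seed(0)
ens_size = 80
obs = [random.gauss(0.0, 1.0) for _ in range(ens_size)]
obs_prior_mean = sum(obs) / ens_size
obs_prior_var = sum((o - obs_prior_mean) ** 2 for o in obs) / (ens_size - 1)
states = [[random.gauss(0.0, 1.0) for _ in range(ens_size)] for _ in range(5)]

M, K = precompute(obs, obs_prior_mean, obs_prior_var)
max_diff = max(
    abs(reg_coef_v1(s, obs, obs_prior_mean, obs_prior_var) - reg_coef_v2(s, M, K))
    for s in states
)
```

Because M and K depend only on the observation, they are built once and reused for every close state, which is where the saving over Implementation 1 comes from.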

17 CPU Results

18 Algorithm
A = obs - obs_prior_mean
D = sum(A)
M = A - D / ens_size
K = (ens_size - 1) * obs_prior_var
For each close state:
    reg_coef = sum(state * M) / K
    state_inc = reg_coef * obs_inc

19 Algorithm: Not Enough Computation
sum(array[80]): 79 sums, 7 steps (after padding)
sum(array[4*1024*1024]): 4,194,303 sums, 22 steps
Images: en.wikipedia.org, www.manutritionniste.com
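
The step counts on this slide follow from a binary tree reduction: n elements need n - 1 additions, and the number of parallel steps is the base-2 logarithm of the padded length. A small Python sketch of that arithmetic, where `reduction_cost` is a hypothetical helper name introduced only for this illustration:

```python
def reduction_cost(n, radix=2):
    """Return (additions, padded_size, steps) for a tree sum of n elements."""
    padded, steps = 1, 0
    # Pad up to the next power of the radix; each tree level is one step.
    while padded < n:
        padded *= radix
        steps += 1
    return n - 1, padded, steps

small = reduction_cost(80)               # ensemble-sized array
large = reduction_cost(4 * 1024 * 1024)  # large array for comparison
```

With only 80 elements per sum, each reduction offers very little work per thread, which is the point the slide title makes.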

20 Algorithm: Low CGMA (Compute to Global Memory Access ratio)

21 CPU Implementation 1
For each close state:
    state_mean = sum(state) / ens_size
    obs_state_cov = sum((state - state_mean) * (obs - obs_prior_mean)) / (ens_size - 1)
    reg_coef = obs_state_cov / obs_prior_var
    state_inc = reg_coef * obs_inc

Statement        Floating-point operations   Loads      Stores
state_mean       79+1                        80+1       1
obs_state_cov    80+80+79+80+1               80+80+2    1
reg_coef         1                           1+1        1
state_inc        80                          1+80       80

CGMA = 1.176

22 CPU Implementation 2
A = obs - obs_prior_mean
D = sum(A)
M = A - D / ens_size
K = (ens_size - 1) * obs_prior_var
For each close state:
    reg_coef = sum(state * M) / K
    state_inc = reg_coef * obs_inc

Statement        Floating-point operations   Loads      Stores
reg_coef         79+80+1                     80+80+1    1
state_inc        80                          1+80       80

CGMA = 0.743
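
The two CGMA values can be reproduced from the per-statement counts on slides 21 and 22, taking CGMA as total floating-point operations divided by total loads plus stores. A short Python sketch of that arithmetic; the tuple layout and the name `cgma` are assumptions for illustration:

```python
def cgma(rows):
    """rows: (flops, loads, stores) per statement; returns flops per memory access."""
    flops = sum(r[0] for r in rows)
    accesses = sum(r[1] + r[2] for r in rows)
    return flops / accesses

# CPU Implementation 1: state_mean, obs_state_cov, reg_coef, state_inc
impl1 = [
    (79 + 1, 80 + 1, 1),
    (80 + 80 + 79 + 80 + 1, 80 + 80 + 2, 1),
    (1, 1 + 1, 1),
    (80, 1 + 80, 80),
]

# CPU Implementation 2: reg_coef, state_inc
impl2 = [
    (79 + 80 + 1, 80 + 80 + 1, 1),
    (80, 1 + 80, 80),
]

cgma1 = round(cgma(impl1), 3)  # 481 flops / 409 accesses
cgma2 = round(cgma(impl2), 3)  # 240 flops / 323 accesses
```

Implementation 2 does less arithmetic overall, but the memory traffic drops by a smaller fraction than the operation count, so its ratio is lower.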

23 GPU Implementation 1
[Diagram: state is an ens_size x num_close_states array; thread i takes column state(:,i) and produces reg_coef(i) and state_inc(:,i)]

24 GPU Implementation 1: Streams + AsyncMemcpy
[Timeline diagram: streams S1, S2, S3 over time]
Image: www.zoombd24.com

25 GPU Implementation 1: Streams + AsyncMemcpy
[Timeline diagram]

26 GPU Implementation 1: Streams + AsyncMemcpy, assumed-shape array (:)
[Timeline diagram]

27 GPU Implementation 1: Streams + AsyncMemcpy, assumed-size array (*)
[Timeline diagram]

28 GPU Results

29 GPU Implementation 2
[Diagram: state is an ens_size x num_close_states array; threads 1:ens_size work on column state(:,i) and produce reg_coef(i) and state_inc(:,i)]

30 GPU Implementation 2
Binary tree: 80 -> 128
Ternary tree: 80 -> 81
Shared memory
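
The padding figures on this slide come from rounding the ensemble size up to a power of the tree's radix: 128 for a binary tree and 81 for a ternary tree. The sketch below emulates such a reduction serially in Python to show the padding and the number of levels. It is an illustration under stated assumptions, not the CUDA Fortran kernel: `tree_sum` is a hypothetical name, and a real kernel would keep the partial sums in shared memory and synchronize threads between levels.

```python
def tree_sum(values, radix=2):
    """Sum values with a radix-ary tree; return (total, padded_size, steps)."""
    # Pad with zeros up to the next power of the radix.
    size = 1
    while size < len(values):
        size *= radix
    buf = list(values) + [0] * (size - len(values))

    steps = 0
    while len(buf) > 1:
        # One tree level: each group of `radix` neighbours collapses to one
        # partial sum (on a GPU, one thread would handle each group).
        buf = [sum(buf[j:j + radix]) for j in range(0, len(buf), radix)]
        steps += 1
    return buf[0], size, steps

ensemble = list(range(1, 81))          # 80 hypothetical ensemble values
binary = tree_sum(ensemble, radix=2)   # padded to 128
ternary = tree_sum(ensemble, radix=3)  # padded to 81
```

The ternary tree wastes only one padded slot instead of 48 and needs fewer levels, at the cost of two additions per thread per level.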

31 GPU Implementation 2: Streams + AsyncMemcpy
[Timeline diagram]

32 GPU Results

33 GPU Implementation 3 Image: pixshark.com

34 GPU Implementation 3: GPU+CPU, 4-way concurrency
[Timeline diagram: streams S1, S2, S3 over time; labels: BE, reg_coef and state_inc]

35 GPU Results

36 Conclusion
Reduced redundancy in the CPU version
GPU version achieved a 1.9x speedup
Explored ways to implement a memory-bound problem on the GPU
Learned the effects of assumed-shape and assumed-size arrays on CUDA Fortran performance
Integrate more computations into the GPU device kernel to improve performance

37 Acknowledgement
NCAR / UCAR
University of Wyoming
DAReS: Jeff Anderson, Nancy Collins, Helen Kershaw, Tim Hoar, Kevin Raeder, Silvia Gentile
CISL / SIParCS: Rich Loft, Raghu Raj Kumar
Thank You!

