Performance Evaluation of OhHelp'ed PIC Simulation


1 Performance Evaluation of OhHelp'ed PIC Simulation
  Hiroshi Nakashima (ACCMS, Kyoto U.)
  in cooperation with
  Yohei Miyake (ACCMS, Kyoto U.)
  Hideyuki Usui (Kobe U.)
  Yoshiharu Omura (RISH, Kyoto U.)

2 Contents
  Introduction: what I said last June
  PIC simulation: overview & problems
  OhHelp: load balancer for PIC simulation
    algorithm overview
    some detail of load balancing
    total view of OhHelp'ed PIC simulation
    performance evaluation
  Conclusion

3 Introduction: I Said Last June...
  Why Plasma Simulation?
  Power/money-hungry, large-scale (128 cores, 1 TB, 1.28 TFlops) shared-memory nodes.
  A big user group of plasma simulation insisted that our new system should include this power/money-hungry subsystem for their memory-hungry SM-parallel application.
  I failed to persuade them to build an Open-Supercomputer-only system.
  So I swore revenge on them by coding a much more efficient DM-parallel program to run on the Open Supercomputer.
  Now we are very friendly with each other and are pursuing closely collaborative research work.

4 Introduction: Also I Showed Last June...
  How Efficient on SMP & Small T2K
  [Figure: performance @ 16-128 proc on HPC2500 and on T2K Open Supercomputer, 4 nodes (64 cores); series: balanced, unbalanced, original; speedup labels: x3.20, x11.71, x8.76, x10.7, x1.66, x4.02]
  was → is
    2D benchmark → 3D production
    up to 128 proc → up to 1024 proc
    strong scaling → + weak scaling
    mild imbalance → extreme imbalance

5 PIC Simulation
  What to Do?
  a large number of (e.g. > 1 trillion) charged particles
  a large scale (e.g. 1000x1000x1000 grid) electromagnetic field (e.g. magnetosphere)
  simulate particle movement by

6 PIC Simulation
  What Is the Problem?
  Inherently has much parallelism but is believed to be hardly scalable because...
    Particle decomposition, which copies fields, cannot sustain a large space domain and global operations on it.
    Domain decomposition cannot work when particles are distributed non-uniformly.
    Dynamic domain decomposition also fails when particles populate a small subdomain too densely.
  → Need new idea!!

7 OhHelp: Overview
  OhHelp: One-handed Help
  [Figure: 4x4 uniform block decomposition over nodes 00-33; labels: primary subdomain, secondary subdomain]
  uniform block decomposition
  well-balanced: #particle-in-subdomain ≤ (#p / #nodes) × (1 + α)
    → simulate primary particles
    → neighboring comm. only
  each node helps another node having dense subdomain
    balanced #particles
    balanced subdomain size
  well-balanced → stable subdomain assignment
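The well-balanced test on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `is_well_balanced`, the list-of-counts representation, and the name `alpha` for the tolerance term inside (1 + ...) are hypothetical and are not taken from the OhHelp code.

```python
def is_well_balanced(particles_per_node, alpha):
    """True if no node holds more than (#p / #nodes) * (1 + alpha) particles.

    particles_per_node: hypothetical list with one particle count per node.
    alpha: assumed name for the tolerance in the slide's (1 + ...) factor.
    """
    n_nodes = len(particles_per_node)
    average = sum(particles_per_node) / n_nodes
    return max(particles_per_node) <= average * (1 + alpha)


# Nearly uniform distribution: every node stays within 10% of the average.
uniform = [100, 98, 102, 100]
# Clustered distribution: one dense subdomain holds most particles.
clustered = [370, 10, 10, 10]

uniform_ok = is_well_balanced(uniform, alpha=0.1)
clustered_ok = is_well_balanced(clustered, alpha=0.1)
```

When the check passes, each node can simulate only its primary particles; when it fails, nodes with light subdomains have to help a node with a dense one.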

8 OhHelp: Load Balancing
  Secondary Subdomain Assignment
  move p from heaviest to lightest so that lightest has av. #p
  give p even if becoming less than average → get from somebody afterward
  [Figure: per-node particle counts for nodes 00-33 against av. #p]
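One way to read the rule on this slide is the greedy loop sketched below. It is a simplified illustration, not the OhHelp implementation: the function name `assign_helpers`, the dictionary bookkeeping, and the use of an integer (floor) average are all assumptions. In each round the lightest remaining node becomes the helper of the heaviest one and takes just enough particles to reach the average, after which it is settled.

```python
def assign_helpers(primary_counts):
    """Greedy sketch: the lightest node helps the heaviest until all are average.

    Returns (final_load, transfers), where each transfer is a tuple
    (helper_node, helped_node, n_particles). All names are hypothetical.
    """
    remaining = dict(enumerate(primary_counts))  # node id -> unsettled load
    final_load = {}
    transfers = []
    while remaining:
        # Integer average of the nodes that are not settled yet.
        target = sum(remaining.values()) // len(remaining)
        lightest = min(remaining, key=remaining.get)
        heaviest = max(remaining, key=remaining.get)
        moved = target - remaining[lightest]
        if moved > 0:
            # The heaviest node gives even if it drops below the average;
            # it will get particles from somebody else in a later round.
            remaining[heaviest] -= moved
            transfers.append((lightest, heaviest, moved))
        final_load[lightest] = remaining.pop(lightest) + moved
    return final_load, transfers


final_load, transfers = assign_helpers([9, 7, 0, 0])
print(transfers)  # [(2, 0, 4), (3, 1, 4), (1, 0, 1)]
```

In this run node 1 gives 4 particles to node 3 and falls to 3, below the average of 4, then receives 1 particle of node 0's load afterward. Each node acts as a helper at most once, which matches the "one-handed" idea.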

9 OhHelp: Helper-Tree
  [Figure: helper tree over nodes 00-33]
  Helper-tree is traversed each time-step (i.e. each particle movement)
    bottom-up: does the helper assignment sustain the load variance?
    top-down: how to (re)distribute particles among family members?

10 OhHelp: Balancing Check (2/2)
  [Figure: helper tree over nodes 00-33]

11 OhHelp: Balancing Check (1/2)
  [Figure: helper tree over nodes 00-33]

12 OhHelp: Simulation Flow
  load balance & particle transfer
  particle push
  current scatter
  field solve
  [Figure labels: particle transfer; primary; secondary; + all-reduce; broadcast]
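The "+ all-reduce" label next to the current scatter step can be illustrated with a toy example. The sketch below assumes one plausible reading of the slide: the owner of a subdomain and its helper each scatter current for their own share of the particles, and the partial grids are then summed. The 1D grid, the nearest-grid-point deposit, and the names `scatter_current` and `family_all_reduce` are all hypothetical simplifications, and no real message passing is involved.

```python
def scatter_current(positions, velocities, charge, n_cells):
    """Nearest-grid-point deposit of charge * velocity onto a 1D grid."""
    current = [0.0] * n_cells
    for x, v in zip(positions, velocities):
        current[int(x) % n_cells] += charge * v
    return current


def family_all_reduce(partial_currents):
    """Element-wise sum of per-node grids, standing in for an all-reduce."""
    return [sum(values) for values in zip(*partial_currents)]


positions = [0.2, 1.7, 1.1, 3.9]
velocities = [1.0, -2.0, 0.5, 4.0]

# The owner keeps the first two particles; a helper takes the other two.
owner_part = scatter_current(positions[:2], velocities[:2], 1.0, 4)
helper_part = scatter_current(positions[2:], velocities[2:], 1.0, 4)
combined = family_all_reduce([owner_part, helper_part])

# Reference: one node scattering all four particles by itself.
reference = scatter_current(positions, velocities, 1.0, 4)
```

The combined grid equals the single-node reference, which is why the field solve can proceed on the summed current regardless of how the particles were split.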

13 OhHelp: Evaluation Setup
  [Figure: domain layouts for weak scaling and strong scaling; labels in slide order: #proc=1, #proc=512, #proc=1, #proc=1024, 64, 32x32x32, 256, 512]

14 OhHelp: Performance
  [Figure: performance in 10^6 particle/s vs. # of processes; speedup labels: x293, x253, x721, x626, x35]
  strong scaling: domain size = 64^3, #particles = 32 x 2^20
  weak scaling: domain size = 32^3 x #proc, #particles = 8 x 2^20 x #proc
  uniformly distributed: good scalability
  particles concentrated in a 32^3 area: also well scalable
    almost linear speed-up over 16 proc.
    perf(1024) > perf(16) x 60
  particle decomp.: not scalable
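The problem sizes on this slide translate into particle densities with a little arithmetic. The snippet below uses only the numbers stated on the slide; the variable names are hypothetical, and the per-process figure is an average that assumes particles are spread evenly over processes.

```python
MI = 2 ** 20  # the slide's 2^20

# Strong scaling: fixed problem, 64^3 cells and 32 x 2^20 particles in total.
strong_cells = 64 ** 3
strong_particles = 32 * MI
strong_ppc = strong_particles // strong_cells  # particles per cell

# Weak scaling: 32^3 cells and 8 x 2^20 particles per process.
weak_cells_per_proc = 32 ** 3
weak_particles_per_proc = 8 * MI
weak_ppc = weak_particles_per_proc // weak_cells_per_proc


def strong_particles_per_proc(n_proc):
    """Average particles per process in the strong-scaling setup."""
    return strong_particles / n_proc


print(strong_ppc, weak_ppc)             # 128 256
print(strong_particles_per_proc(1024))  # 32768.0
```

Under strong scaling the average work per process therefore shrinks to about 32 thousand particles at 1024 processes, while under weak scaling it stays at 8 x 2^20 particles per process.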

15 Conclusion
  We confirmed that the OhHelp'ed PIC simulator is scalable.
    600-700x speedup with 1024 processes
  (Very near) future work is to build an OhHelp library.
    lev. 1: load balancing
    lev. 2: particle transfer
    lev. 3: inter-subdomain communication
    lev. 4: semi-automatic PIC code transformation for OhHelp'ing
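The quoted speedup range can be expressed as parallel efficiency, that is, speedup divided by the number of processes. The snippet assumes the 600-700 figure is measured relative to a single process, which the slide does not state explicitly; the function name is hypothetical.

```python
def parallel_efficiency(speedup, n_proc):
    """Fraction of ideal linear speedup that was achieved."""
    return speedup / n_proc


low = round(parallel_efficiency(600, 1024), 3)
high = round(parallel_efficiency(700, 1024), 3)
print(low, high)  # 0.586 0.684
```

Under that assumption the result corresponds to roughly 59-68% of ideal linear scaling at 1024 processes.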

16 Backup

17 Breakdown: Strong Scaling @ 256
  [Figure: exec. time/step and comm. time/step, in ms]

18 Breakdown: Weak Scaling @ 256
  [Figure: exec. time/step and comm. time/step, in ms]

