Modeling a Million-Node Dragonfly Network using Massively Parallel Discrete-Event Simulation Misbah Mubarak, Christopher D. Carothers Rensselaer Polytechnic.


1 Modeling a Million-Node Dragonfly Network using Massively Parallel Discrete-Event Simulation Misbah Mubarak, Christopher D. Carothers Rensselaer Polytechnic Institute Robert Ross, Philip Carns Argonne National Laboratory

2 Outline  Dragonfly Network Topology  Validation of the dragonfly model  Performance Comparison with booksim  Scaling dragonfly model on BG/P and BG/Q  Conclusion & future work

3 The Dragonfly Network Topology  A two-level directly connected topology  Uses high-radix routers  Large number of ports per router  Each port has moderate bandwidth  “p”: Number of compute nodes connected to a router  “a”: Number of routers in a group  “h”: Number of global channels per router  Router radix: k = a + p + h – 1  Recommended configuration: a = 2p = 2h
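The parameter relations on this slide can be checked with a few lines of arithmetic. The sketch below is illustrative only: the class and property names are made up for this example, and the group count g = a*h + 1 (the largest dragonfly in which every pair of groups shares one global channel) is an assumption taken from the general dragonfly literature, not from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DragonflyConfig:
    """Hypothetical helper holding the three parameters named on the slide."""
    p: int  # compute nodes connected to a router
    a: int  # routers in a group
    h: int  # global channels per router

    @classmethod
    def balanced(cls, h: int) -> "DragonflyConfig":
        # Recommended configuration from the slide: a = 2p = 2h
        return cls(p=h, a=2 * h, h=h)

    @property
    def radix(self) -> int:
        # Router radix from the slide: k = a + p + h - 1
        return self.a + self.p + self.h - 1

    @property
    def max_groups(self) -> int:
        # Assumption (not on the slide): one global channel between every pair of groups
        return self.a * self.h + 1

    @property
    def routers(self) -> int:
        return self.a * self.max_groups

    @property
    def nodes(self) -> int:
        return self.p * self.routers


cfg = DragonflyConfig.balanced(h=4)
print(cfg.radix, cfg.max_groups, cfg.routers, cfg.nodes)  # 15 33 264 1056
```

With h = 4 the sketch gives 264 routers, which agrees with the router count quoted for the booksim validation runs later in the deck.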

4 Simulating interconnect networks  Expected size of exascale systems  Millions of compute cores  Up to 1 million compute nodes  Critical to have a low-latency, small-diameter, and low-cost interconnect network  Exascale HPC systems cannot be effectively simulated with small-scale prototypes  We use the Rensselaer Optimistic Simulation System (ROSS) to simulate a dragonfly model with millions of nodes  Our dragonfly model attains a peak event rate of 1.33 billion events/sec  Total committed events: 872 billion

5 Dragonfly Model Configuration  Traffic Patterns  Uniform Random Traffic (UR)  Nearest Neighbor Traffic (or Worst Case traffic WC)  Virtual channels  To avoid deadlocks  Credit based flow control  Upstream nodes/routers keep track of buffer slots  An input-queued virtual channel router  Each router port supports up to ‘v’ virtual channels
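Credit-based flow control, as described on this slide, can be sketched for a single virtual channel as follows. This is a minimal illustration under stated assumptions, not the ROSS model's code: the class name `CreditLink` and its methods are hypothetical, and channel delays and multiple virtual channels are left out.

```python
from collections import deque


class CreditLink:
    """Upstream view of one virtual channel (hypothetical helper).

    The sender holds one credit per free buffer slot downstream and
    may only send while it has at least one credit.
    """

    def __init__(self, buffer_slots: int) -> None:
        self.credits = buffer_slots  # free downstream slots known to the sender
        self.waiting = deque()       # packets stalled until a credit arrives
        self.delivered = []          # packets accepted into the downstream buffer

    def send(self, packet) -> bool:
        if self.credits == 0:
            # Downstream buffer is full: wait for a credit.
            self.waiting.append(packet)
            return False
        self.credits -= 1
        self.delivered.append(packet)
        return True

    def credit_returned(self) -> None:
        # Downstream freed a slot and sent a credit back upstream.
        self.credits += 1
        if self.waiting:
            self.send(self.waiting.popleft())


link = CreditLink(buffer_slots=2)
results = [link.send(p) for p in ("pkt0", "pkt1", "pkt2")]
link.credit_returned()  # pkt2 can now be sent
```

The third packet is held back until a credit returns, which is the "buffer full, wait for credit" behavior the slide refers to.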

6 [Figure: flowchart of packet flow with credit-based flow control. A packet is generated at the sending node and passes through the source router, intermediate router(s), and destination router to the destination node. Each hop consists of a send, a channel delay, and an arrive step. Before each send the buffer is checked ("Buffer full?"): if yes, the sender waits for a credit; if no, the packet is sent. Credits are returned upstream.]

7 Dragonfly Model Routing Algorithms  Minimal Routing (MIN)  Uniform random traffic: High throughput, low latency  Nearest neighbor traffic: causes congestion, high latency, low throughput  Non-minimal routing (VAL)  Half the throughput as MIN under UR traffic  Nearest neighbor traffic: optimal performance (about 50% throughput)  Global Adaptive routing  Chooses between MIN and VAL by sensing the traffic conditions on the global channels

8 Dragonfly Model Minimal Routing  (i) Packet arrives at R0, Destination Router = R7  (ii) Packet traverses to R1 over local channel  (iii) Packet traverses from R1 to R4 over the global channel  (iv) Packet traverses to R7 over local channel  [Figure: routers R0–R7 with global channels G0 and G1, showing the position of packet P at each step]
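The four steps above can be reproduced with a small path function. The sketch assumes that R0–R3 form one group and R4–R7 another, and that the global channel between the two groups connects R1 and R4; the grouping and the `GLOBAL_LINKS` table are assumptions made for illustration, since the figure itself did not survive in the transcript.

```python
from typing import List, Tuple

ROUTERS_PER_GROUP = 4  # assumption: R0-R3 are group 0, R4-R7 are group 1

# Hypothetical global wiring:
# (source group, destination group) -> (exit router, entry router)
GLOBAL_LINKS = {
    (0, 1): (1, 4),
    (1, 0): (4, 1),
}


def group_of(router: int) -> int:
    return router // ROUTERS_PER_GROUP


def minimal_route(src: int, dst: int) -> List[Tuple[str, int, int]]:
    """Return the hops (channel type, from router, to router) of a minimal route."""
    if src == dst:
        return []
    if group_of(src) == group_of(dst):
        # Routers within a group are directly connected by a local channel.
        return [("local", src, dst)]

    exit_router, entry_router = GLOBAL_LINKS[(group_of(src), group_of(dst))]
    hops = []
    if src != exit_router:
        hops.append(("local", src, exit_router))         # reach the router owning the global channel
    hops.append(("global", exit_router, entry_router))   # cross to the destination group
    if entry_router != dst:
        hops.append(("local", entry_router, dst))        # final local hop
    return hops


route = minimal_route(0, 7)
print(route)  # [('local', 0, 1), ('global', 1, 4), ('local', 4, 7)]
```

A minimal route therefore uses at most one local hop, one global hop, and one more local hop.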

9 Outline  Dragonfly Network Topology  Validation of the dragonfly model  Performance Comparison with booksim  Scaling dragonfly model on BG/P and BG/Q  Conclusion & future work

10 Dragonfly Model Validation  Dragonfly network topologies in design  PERCS network topology  Machines from Echelon project  Booksim:  A cycle-accurate simulator with a dragonfly model  Used by Dally et al. to validate the dragonfly topology proposal  Runs in serial mode only  Supports minimal and global adaptive routing  Performance results shown on 1,024 nodes and 264 routers  We validated our ROSS dragonfly model against booksim

11 Global Adaptive Routing: Threshold selection (ROSS vs. Booksim)
Global Adaptive Routing
If min_queue_size < (2 * nonmin_queue_size) + adaptive_threshold then
    route minimally
Else
    route non-minimally
End if
Booksim uses an adaptive threshold to bias the UGAL algorithm towards minimal or non-minimal routing. We incorporated a similar threshold in ROSS. We ran experiments to find the threshold value that biases traffic most strongly towards non-minimal routing. The value that yields the maximum number of non-minimal packets is -180.
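The threshold rule on this slide translates directly into a small function. The sketch below only restates the slide's condition; the function name `choose_route` and the queue sizes in the usage lines are made-up values for illustration, not measurements from ROSS or booksim.

```python
def choose_route(min_queue_size: int,
                 nonmin_queue_size: int,
                 adaptive_threshold: int) -> str:
    """Apply the slide's rule: route minimally if
    min_queue_size < 2 * nonmin_queue_size + adaptive_threshold."""
    if min_queue_size < 2 * nonmin_queue_size + adaptive_threshold:
        return "minimal"
    return "non-minimal"


# Hypothetical queue occupancies, chosen only to show the effect of the threshold.
unbiased = choose_route(min_queue_size=10, nonmin_queue_size=8, adaptive_threshold=0)
biased = choose_route(min_queue_size=10, nonmin_queue_size=8, adaptive_threshold=-180)
print(unbiased, biased)  # minimal non-minimal
```

A large negative threshold makes the right-hand side of the comparison small, so the minimal path is rarely chosen, which is why a value such as -180 pushes traffic towards non-minimal routes.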

12 ROSS vs. booksim – Uniform Random traffic  With minimal routing, ROSS has an average of 4.2% and a maximum of 7% difference from booksim results  With global adaptive routing, ROSS has an average of 3% and a maximum of 7.8% difference from booksim results

13 ROSS vs. booksim – Nearest neighbor traffic  The nearest neighbor traffic yields a very high latency and low throughput with minimal routing  This traffic pattern can be load balanced by either non-minimal or adaptive routing  Non-minimal routing gives slightly under 50% throughput with nearest neighbor traffic

14 Outline  Dragonfly Network Topology  Validation of the dragonfly model  Performance Comparison with booksim  Scaling dragonfly model on BG/P and BG/Q  Conclusion & future work

15 Dragonfly performance: ROSS vs. booksim  ROSS attains the following performance speedup  Minimum of 5x up to a maximum of 11x speedup over booksim with MIN routing  Minimum of 5.3x speedup and a maximum of 12.38x speedup with global adaptive routing

16 Outline  Dragonfly Network Topology  Validation of the dragonfly model  Performance Comparison with booksim  Scaling dragonfly model on BG/P and BG/Q  Conclusion & future work

17 ROSS Dragonfly model on BG/P and BG/Q  We evaluated the strong scaling characteristics of the dragonfly model on  Argonne Leadership Computing Facility (ALCF) IBM Blue Gene/P system (Intrepid)  Computational Center for Nanotechnology Innovations (CCNI) IBM Blue Gene/Q  We scheduled 64 MPI tasks per node on BG/Q and 4 MPI tasks per node on BG/P  Performance was evaluated through the following metrics  Committed event rate  Percentage of remote events  ROSS event efficiency  Simulation run time

18 ROSS Parameters  ROSS employs the Time Warp optimistic synchronization protocol  To reduce state-saving overheads, ROSS employs an event rollback mechanism  ROSS event efficiency determines the amount of useful work performed by the simulation  Global Virtual Time (GVT) imposes a lower bound on the simulation time  GVT is controlled by the batch and gvt-interval parameters  On average, batch * gvt-interval events are processed between consecutive GVT epochs
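The two quantities on this slide can be illustrated with simple arithmetic. In the sketch below, the parameter values are hypothetical, and the efficiency formula (one minus the ratio of rolled-back to committed events) is an assumed definition used for illustration; the slide only says that efficiency measures the amount of useful work.

```python
def events_per_gvt_epoch(batch: int, gvt_interval: int) -> int:
    """Average number of events processed between GVT computations,
    per the slide: batch * gvt-interval."""
    return batch * gvt_interval


def event_efficiency(committed_events: int, rolled_back_events: int) -> float:
    """Assumed definition: fraction of work that was not undone by rollbacks."""
    return 1.0 - rolled_back_events / committed_events


# Hypothetical settings and counts, not values from the experiments.
per_epoch = events_per_gvt_epoch(batch=16, gvt_interval=1024)
efficiency = event_efficiency(committed_events=1000, rolled_back_events=250)
print(per_epoch, efficiency)  # 16384 0.75
```

Under this assumed definition, a run with no rollbacks has an efficiency of 1.0, and the value falls as more optimistically processed events have to be rolled back.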

19 ROSS Dragonfly Performance Results on BG/P vs. BG/Q  Event efficiency drops and total rollbacks increase on BG/P after 16K MPI tasks  Less off-node communication on BG/Q vs. BG/P  Each MPI task has more processing power on BG/P and simulation advances quickly

20 ROSS Dragonfly Performance Results on BG/P vs. BG/Q  The event efficiency stays high on both BG/P and BG/Q as each MPI task has substantial work load  The computation performed at each MPI task dominates the number of rolled back events

21 Outline  Dragonfly Network Topology  Validation of the dragonfly model  Performance Comparison with booksim  Scaling dragonfly model on BG/P and BG/Q  Conclusion & future work

22 Conclusion & Future work  Conclusion  We presented a parallel discrete-event simulation for a dragonfly network topology  We validated our simulator against the cycle-accurate simulator booksim  We demonstrated the ability of our simulator to scale on very large models with up to 50M nodes  Future work  Introduce an improved queue congestion sensing policy for global adaptive routing  Experiment with other variations of nearest neighbor traffic in dragonfly  Compare the dragonfly network model with other candidate topology models for exascale computing

