# Misbah Mubarak, Christopher D. Carothers

Modeling a Million-Node Dragonfly Network using Massively Parallel Discrete-Event Simulation
Misbah Mubarak, Christopher D. Carothers (Rensselaer Polytechnic Institute); Robert Ross, Philip Carns (Argonne National Laboratory)

Outline: Dragonfly Network Topology; Validation of the dragonfly model; Performance Comparison with booksim; Scaling dragonfly model on BG/P and BG/Q; Conclusion & future work

The Dragonfly Network Topology
The dragonfly is a two-level, directly connected topology built from high-radix routers: each router has a large number of ports, and each port has moderate bandwidth. Three parameters define the network: "p" is the number of compute nodes connected to a router, "a" is the number of routers in a group, and "h" is the number of global channels per router. The router radix is k = a + p + h - 1, and the recommended configuration is a = 2p = 2h.

High-radix routers reduce the diameter of the network. Increasing the degree of the router reduces the hop count, which leads to low latency and low network cost. Because global channels can be expensive, high-radix routers also help limit the number of global channels traversed by a packet.

Comments by Chris: The network size grows with the fourth power of p, the number of compute nodes per router. The precise equation is N = 4*p^4 + 2*p^2, obtained by substituting into the N = p*a*g equation. So, to get a billion-node dragonfly (DF) network, p only needs to be about 128. This is interesting because a torus grows as K^D, where K is the arity and D is the number of dimensions. For example, Blue Gene/L was a 32^3 system by design, providing up to 64K cores, but IBM later pushed the design out to allow a much longer Z dimension, so K was no longer the same in each dimension. The billion-node torus networks we ran before were 32^6, which is very close to the p = 128 DF network.
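The size arithmetic in the comment above can be checked with a short sketch. It assumes the balanced configuration a = 2p = 2h and, as an assumption not stated on the slide, the maximum group count g = a*h + 1 (one global channel between every pair of groups); with that assumption N = p*a*g reduces to 4*p^4 + 2*p^2. The function name `dragonfly_size` is illustrative only.

```python
def dragonfly_size(p: int) -> dict:
    """Size of a balanced dragonfly (a = 2p = 2h) for a given p.

    Assumption: the network uses the maximum number of groups,
    g = a*h + 1, so every pair of groups shares one global channel.
    """
    h = p                # global channels per router
    a = 2 * p            # routers per group
    radix = a + p + h - 1  # k = a + p + h - 1
    g = a * h + 1        # groups (assumed maximum)
    nodes = p * a * g    # N = p * a * g
    return {"p": p, "a": a, "h": h, "radix": radix, "groups": g, "nodes": nodes}


if __name__ == "__main__":
    size = dragonfly_size(128)
    # N should match the closed form 4*p^4 + 2*p^2.
    assert size["nodes"] == 4 * 128**4 + 2 * 128**2
    print(size)
```

For p = 128 this gives slightly over 2^30 nodes, the same order as a 32^6 torus (exactly 2^30 nodes).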

Simulating interconnect networks
Exascale systems are expected to have millions of compute cores and up to 1 million compute nodes, so a low-latency, small-diameter, low-cost interconnect network is critical. Exascale HPC systems cannot be effectively simulated with small-scale prototypes. We use the Rensselaer Optimistic Simulation System (ROSS) to simulate a dragonfly model with millions of nodes. Our dragonfly model attains a peak event rate of 1.33 billion events/sec, with 872 billion total committed events.

Dragonfly Model Configuration
Traffic patterns: Uniform Random traffic (UR) and Nearest Neighbor traffic (also called Worst Case traffic, WC). Under uniform random traffic, each packet generated by the model chooses a destination node at random from a uniform distribution. Under nearest-neighbor traffic, each node in a group sends a message to a random node in the neighboring group.

Virtual channels (VCs) are used to avoid deadlocks. Flow control is credit based: the upstream node or router keeps a count of free buffer slots in the downstream VCs. Each router is an input-queued virtual channel router with a specified number of input and output ports, and each port supports up to 'v' virtual channels.
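As a rough illustration of the two traffic patterns, the sketch below picks a destination node for each. It is not the ROSS model code. It assumes nodes are numbered contiguously by group and that the "neighboring group" is the next group index modulo the group count; both are assumptions made for the example, as are the function names.

```python
import random


def uniform_random_dest(num_nodes: int, rng: random.Random) -> int:
    """UR traffic: any node in the network, chosen uniformly."""
    return rng.randrange(num_nodes)


def nearest_neighbor_dest(src: int, nodes_per_group: int,
                          num_groups: int, rng: random.Random) -> int:
    """WC traffic: a random node in the neighboring group.

    Assumption: node IDs are contiguous within a group, and the
    neighbor of group i is group (i + 1) mod num_groups.
    """
    src_group = src // nodes_per_group
    dst_group = (src_group + 1) % num_groups
    return dst_group * nodes_per_group + rng.randrange(nodes_per_group)


if __name__ == "__main__":
    rng = random.Random(42)
    print(uniform_random_dest(32, rng))
    print(nearest_neighbor_dest(5, nodes_per_group=8, num_groups=4, rng=rng))
```

Under the nearest-neighbor pattern every packet from a group targets the same destination group, which is why it stresses the single global channel between the two groups.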

[Figure: flowchart of a packet travelling from the sending node through the source router, intermediate router(s), and destination router to the destination node. At each stage the packet is sent to the outbound buffer if the buffer is not full; otherwise it waits for a credit. Packets are generated at an interval, arrive at the next hop after a channel delay, and a credit is generated and returned upstream.]
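The credit-based flow control used by the model can be sketched as a per-channel credit counter. This is a minimal illustration under assumed names (`CreditCounter`, `try_send`, `on_credit`), not the simulator's actual data structure: the upstream side starts with one credit per downstream buffer slot, spends a credit on every send, and regains one when the downstream side frees a slot.

```python
class CreditCounter:
    """Upstream view of one downstream virtual-channel buffer (hypothetical)."""

    def __init__(self, buffer_slots: int) -> None:
        self.capacity = buffer_slots
        self.credits = buffer_slots  # free slots known to the upstream side

    def try_send(self) -> bool:
        """Spend a credit if the downstream buffer has room.

        Returns False when the buffer is full; the packet must then
        wait for a credit before it can be sent.
        """
        if self.credits == 0:
            return False
        self.credits -= 1
        return True

    def on_credit(self) -> None:
        """Downstream freed a slot and returned a credit."""
        if self.credits >= self.capacity:
            raise ValueError("received more credits than buffer slots")
        self.credits += 1


if __name__ == "__main__":
    vc = CreditCounter(buffer_slots=2)
    print(vc.try_send(), vc.try_send(), vc.try_send())  # third send must wait
    vc.on_credit()
    print(vc.try_send())
```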

Dragonfly Model Routing Algorithms
Minimal routing (MIN): With uniform random traffic, MIN gives the optimal throughput (high throughput, low latency) because the traffic is scattered over the entire network. With nearest neighbor traffic, MIN always prefers the shortest path and therefore congests the single global channel between the two groups, causing high latency and low throughput.

Non-minimal routing (VAL): Traffic is first diverted to a randomly selected intermediate group and then sent on to the destination group. Under UR traffic, VAL gives half the throughput of MIN. With nearest neighbor traffic, VAL gives the best possible throughput (about 50%) because it diverts the traffic through intermediate groups.

Global adaptive routing: Chooses between MIN and VAL by sensing the traffic conditions on the global channels.
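At the group level, the difference between MIN and VAL can be sketched as follows. This ignores the local hops inside each group and is only an illustration. The function names are hypothetical, and excluding the source and destination groups from the choice of intermediate group is an assumption of this sketch.

```python
import random


def min_group_path(src_group: int, dst_group: int) -> list:
    """MIN: go straight to the destination group (at most one global hop)."""
    if src_group == dst_group:
        return [src_group]
    return [src_group, dst_group]


def val_group_path(src_group: int, dst_group: int,
                   num_groups: int, rng: random.Random) -> list:
    """VAL: detour through a randomly selected intermediate group.

    Assumption: the intermediate group differs from both endpoints.
    """
    candidates = [g for g in range(num_groups)
                  if g not in (src_group, dst_group)]
    if not candidates:
        return min_group_path(src_group, dst_group)
    return [src_group, rng.choice(candidates), dst_group]


if __name__ == "__main__":
    rng = random.Random(1)
    print(min_group_path(0, 1))         # one global hop
    print(val_group_path(0, 1, 9, rng))  # two global hops
```

The VAL path uses two global hops instead of one, which is consistent with it reaching half the throughput of MIN under uniform random traffic while avoiding the single congested channel under nearest neighbor traffic.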

Dragonfly Model Minimal Routing
Example (figure: two groups, G0 and G1, with routers R0 to R7): (i) Packet arrives at R0, destination router = R7. (ii) Packet traverses to R1 over a local channel. (iii) Packet traverses from R1 to R4 over the global channel. (iv) Packet traverses to R7 over a local channel.

Dragonfly Model Validation
Dragonfly network topologies in design include the PERCS network topology and machines from the Echelon project. The IBM PERCS network, which was intended to be part of the Blue Waters system, has a topology similar to the dragonfly.

Booksim is an open-source, cycle-accurate simulator with a dragonfly model, used by Kim, Dally et al. to validate the dragonfly topology proposal. It runs in serial mode only and supports minimal and global adaptive routing; its performance results are shown on 1,024 nodes and 264 routers. We validated the correctness of our ROSS dragonfly model against booksim.

Similarities between ROSS and booksim: both support virtual channels, credit-based flow control, and finite buffers; both support uniform random and nearest neighbor traffic patterns; and both use single-flit packets. The router arbitration policy in ROSS is first-come, first-served (FCFS), because it is simple to implement in a discrete-event simulation and, as the results show, changing the router arbitration policy does not significantly affect the results.

Global Adaptive Routing: Threshold selection (ROSS vs. booksim)
To load balance nearest-neighbor traffic, we use the UGAL routing algorithm. Booksim uses an adaptive threshold to bias UGAL towards minimal or non-minimal routing; it is set to a positive value (currently 30) to bias the algorithm towards minimal routing under uniform random traffic. We incorporated a similar adaptive threshold in ROSS to bias the routing decision.

Because booksim does not specify the optimal threshold value for global adaptive routing, we sought the value for ROSS and booksim that best biases traffic towards non-minimal routing. We ran experiments with different negative threshold values and recorded the number of minimal and non-minimal packets for each. We selected a threshold of -180 for both ROSS and booksim, as it biases the maximum number of packets towards the non-minimal path and gives the minimum latency.

Global adaptive routing rule:

If min_queue_size < (2 * nonmin_queue_size) + adaptive_threshold then route minimally

Else route non-minimally

End if
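The decision rule from this slide translates directly into code. The sketch below is a literal transcription of the inequality with a hypothetical function name; how the queue sizes are measured is left to the caller and is not specified here.

```python
def route_minimally(min_queue_size: int, nonmin_queue_size: int,
                    adaptive_threshold: int) -> bool:
    """Return True to take the minimal path, False for the non-minimal path.

    Implements: min_queue_size < 2 * nonmin_queue_size + adaptive_threshold
    """
    return min_queue_size < 2 * nonmin_queue_size + adaptive_threshold


if __name__ == "__main__":
    # Positive threshold (30): lightly loaded minimal queues stay minimal.
    print(route_minimally(10, 0, adaptive_threshold=30))
    # Negative threshold (-180): the same queue state is pushed non-minimal.
    print(route_minimally(10, 5, adaptive_threshold=-180))
```

A positive threshold makes the inequality easier to satisfy, which favors minimal routing; a large negative threshold such as -180 makes it hard to satisfy unless the non-minimal queue is much longer, which favors non-minimal routing.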

ROSS vs. booksim: Uniform Random traffic
With minimal routing (MIN), ROSS differs from the booksim results by 4.2% on average and 7% at most. With global adaptive routing (UGAL), ROSS differs from the booksim results by 3% on average and 7.8% at most.

ROSS vs. booksim: Nearest neighbor traffic
Nearest neighbor traffic yields very high latency and low throughput with minimal routing. This traffic pattern can be load balanced by either non-minimal or adaptive routing. Non-minimal routing gives slightly under 50% throughput with nearest neighbor traffic, which is the maximum throughput that can be achieved under this kind of traffic.

Summary of the comparison: Both ROSS and booksim give low latency and high throughput for MIN routing under UR traffic. Both give high latency and low throughput for MIN routing under WC traffic. In both simulators, UGAL resembles MIN routing under UR traffic. Both yield slightly under 50% throughput for UGAL under WC traffic.

The small differences may have two causes: we approximated the internal speedup for booksim, and the two simulators use different random number generators.

Dragonfly performance: ROSS vs. booksim
We compared the performance of ROSS and booksim by measuring the simulation execution time. As booksim runs serially, we configured ROSS in its serial mode. Both simulators ran for a warm-up phase of 30,000 cycles and a measurement phase of 30,000 cycles. Tests were carried out on dual core Intel X5650s running at 2.67GHz. ROSS attains a speedup over booksim of 5x to 11x with MIN routing and of 5.3x to 12.38x with global adaptive routing.

ROSS Dragonfly model on BG/P and BG/Q
We evaluated the strong scaling characteristics of the dragonfly model on two systems: the Argonne Leadership Computing Facility (ALCF) IBM Blue Gene/P system (Intrepid) and the Computational Center for Nanotechnology Innovations (CCNI) IBM Blue Gene/Q. We scheduled 64 MPI tasks per node on BG/Q and 4 MPI tasks per node on BG/P. Performance was evaluated through four metrics: committed event rate, percentage of remote events, ROSS event efficiency, and simulation run time.

Intrepid's BG/P has 40 racks, each with 1,024 nodes, and is equipped with a 3D torus network used for point-to-point communication and collective operations. The CCNI BG/Q has 1 rack with 1,024 nodes and is equipped with a 5D torus network. Each BG/Q node has 16 cores, with each core supporting 4 threads.

ROSS Parameters
ROSS employs the Time Warp optimistic synchronization protocol. To reduce state-saving overheads, ROSS employs an event rollback mechanism. ROSS event efficiency measures the amount of useful work performed by the simulation. Global Virtual Time (GVT) imposes a lower bound on the simulation time and is controlled by the batch and gvt-interval parameters. Batch is the number of events that ROSS processes before returning to the top of the scheduling loop and checking for the arrival of remote events and messages. The gvt-interval is the number of iterations of the main event scheduling loop that ROSS goes through before initiating a GVT computation. On average, batch * gvt-interval events are processed between GVT epochs.
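How batch and gvt-interval interact can be illustrated with a toy scheduling loop. This is not ROSS code and does not use the ROSS API: events are just integers in a queue, there are no remote events or rollbacks, and the function name `run_epochs` is hypothetical. It only shows why roughly batch * gvt-interval events are processed between GVT computations.

```python
from collections import deque


def run_epochs(pending_events: int, batch: int, gvt_interval: int) -> list:
    """Return the number of events processed in each GVT epoch (toy model)."""
    queue = deque(range(pending_events))
    epochs = []
    while queue:
        processed = 0
        # One GVT epoch: gvt_interval passes through the main loop.
        for _ in range(gvt_interval):
            # One pass: process up to `batch` events.
            for _ in range(batch):
                if not queue:
                    break
                queue.popleft()
                processed += 1
            # A real simulator would poll for remote events here.
        # A real simulator would initiate a GVT computation here.
        epochs.append(processed)
    return epochs


if __name__ == "__main__":
    # batch=4 and gvt_interval=8 give up to 32 events per epoch.
    print(run_epochs(100, batch=4, gvt_interval=8))
```

Larger values of either parameter mean fewer GVT computations but more events processed optimistically between them.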

ROSS Dragonfly Performance Results on BG/P vs. BG/Q
Event efficiency drops and total rollbacks increase on BG/P after 16K MPI tasks. Two factors explain the difference.

There is less off-node communication on BG/Q than on BG/P. As BG/Q has 64 MPI tasks per node versus 4 on BG/P, more of the communication on BG/P goes off-node. We conjecture that sending a message through memory on BG/Q takes less time than sending an off-node message on BG/P, so the probability of a message arriving late is lower on BG/Q.

Each MPI task has more processing power on BG/P, so the simulation advances quickly. Each BG/Q core runs at 1.6 GHz divided among 4 threads, so each thread gets 400 MHz of processing power versus 850 MHz on BG/P. This causes MPI ranks on BG/Q to advance more slowly through simulated time, which also lowers the potential for rollbacks relative to BG/P.

ROSS Dragonfly Performance Results on BG/P vs. BG/Q
The event efficiency stays high on both BG/P and BG/Q because each MPI task has a substantial workload: the computation performed at each MPI task dominates the number of events being rolled back.

Conclusion & Future work
We presented a parallel discrete-event simulation of a dragonfly network topology. We validated our simulator against the cycle-accurate simulator booksim. We demonstrated the ability of our simulator to scale to very large models with up to 50M nodes.

Future work: introduce an improved queue congestion sensing policy for global adaptive routing; experiment with other variations of nearest neighbor traffic in the dragonfly; and compare the dragonfly network model with other candidate topology models for exascale computing.
