Presentation on theme: "Misbah Mubarak, Christopher D. Carothers"— Presentation transcript:
1 Modeling a Million-Node Dragonfly Network using Massively Parallel Discrete-Event Simulation
Misbah Mubarak, Christopher D. Carothers, Rensselaer Polytechnic Institute
Robert Ross, Philip Carns, Argonne National Laboratory
2 Outline
Dragonfly Network Topology
Validation of the dragonfly model
Performance Comparison with booksim
Scaling dragonfly model on BG/P and BG/Q
Conclusion & future work
3 The Dragonfly Network Topology
A two-level, directly connected topology.
Uses high-radix routers:
  A large number of ports per router.
  Each port has moderate bandwidth.
"p": number of compute nodes connected to a router.
"a": number of routers in a group.
"h": number of global channels per router.
Router radix: k = a + p + h - 1.
Recommended configuration: a = 2p = 2h.
High-radix routers reduce the diameter of the network. Increasing the degree of the router reduces the hop count, which leads to low latency and low network cost.
As global channels can be expensive, high-radix routers also help limit the number of global channels traversed by a packet.
Comments by Chris: The network size grows as the fourth power of p, the number of compute nodes per router. The precise equation is N = 4p^4 + 2p^2, obtained by substituting the recommended configuration (with g = a*h + 1 groups) into N = p*a*g. So, to get a billion-node dragonfly network, p only needs to be about 128. This is interesting because a torus grows as K^D, where K is the arity and D is the number of dimensions. For example, Blue Gene/L was a 32^3 system by design, providing up to 64K cores. IBM later pushed the design out to allow a much longer Z dimension, so K was no longer the same in each dimension. The billion-node torus networks we ran before were 32^6, which is very close to the p = 128 dragonfly network.
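The sizing arithmetic in the comments can be checked with a short sketch. It assumes the maximum-size dragonfly, in which the group count is g = a*h + 1 (one global channel between every pair of groups). The function name `dragonfly_size` and the returned keys are illustrative and do not come from the presentation.

```python
def dragonfly_size(p: int) -> dict:
    """Size of a dragonfly built with the recommended balance a = 2p = 2h."""
    a = 2 * p            # routers per group
    h = p                # global channels per router
    g = a * h + 1        # assumption: maximum number of groups
    k = a + p + h - 1    # router radix, as defined on the slide
    n = p * a * g        # total compute nodes, N = p * a * g
    return {"p": p, "a": a, "h": h, "g": g, "radix": k, "nodes": n}


for p in (4, 16, 128):
    size = dragonfly_size(p)
    # The closed form quoted in the comments: N = 4p^4 + 2p^2
    assert size["nodes"] == 4 * p**4 + 2 * p**2
    print(size)

# p = 128 gives 1,073,774,592 nodes; a 32^6 torus has 1,073,741,824.
print(dragonfly_size(128)["nodes"], 32**6)
```

The two billion-node figures differ only by the 2p^2 term, which is why the comment calls the 32^6 torus "very close" to the p = 128 dragonfly.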
4 Simulating interconnect networks
Expected size of exascale systems:
  Millions of compute cores.
  Up to 1 million compute nodes.
A low-latency, small-diameter, low-cost interconnect network is critical.
Exascale HPC systems cannot be effectively simulated with small-scale prototypes.
We use the Rensselaer Optimistic Simulation System (ROSS) to simulate a dragonfly model with millions of nodes.
Our dragonfly model attains a peak event rate of 1.33 billion events/sec.
Total committed events: 872 billion.
5 Dragonfly Model Configuration
Traffic patterns:
  Uniform random traffic (UR): each packet generated by the model chooses its destination node at random from a uniform distribution.
  Nearest-neighbor traffic (also called worst-case traffic, WC): each node in a group sends a message to a random node in the neighboring group.
Virtual channels are used to avoid deadlocks.
Credit-based flow control: the upstream node or router keeps a count of free buffer slots in the downstream virtual channels (VCs).
Input-queued virtual channel router with a specified number of input and output ports, each port supporting up to 'v' virtual channels.
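The two traffic patterns can be sketched as destination-selection functions. This is a minimal illustration, not the ROSS model code. It assumes nodes are numbered consecutively by group, that the "neighboring group" is the next group index with wraparound, and that a uniform random destination excludes the sender. All names are hypothetical.

```python
import random


def uniform_random_dest(src: int, num_nodes: int, rng: random.Random) -> int:
    """UR traffic: pick any node except the source with equal probability."""
    dest = rng.randrange(num_nodes - 1)
    # Shift past the source so it is never chosen (assumption).
    return dest if dest < src else dest + 1


def nearest_neighbor_dest(src: int, nodes_per_group: int, num_groups: int,
                          rng: random.Random) -> int:
    """WC traffic: pick a random node in the neighboring group."""
    group = src // nodes_per_group
    neighbor = (group + 1) % num_groups   # assumption: neighbor is group + 1
    return neighbor * nodes_per_group + rng.randrange(nodes_per_group)


rng = random.Random(42)
print(uniform_random_dest(5, 32, rng))
print(nearest_neighbor_dest(5, 8, 4, rng))
```

Under the nearest-neighbor pattern every node in a group targets the same destination group, which is what concentrates load on the few global channels between the two groups.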
6 [Figure: packet event flow in the dragonfly model. Stages: sending node, source router, intermediate router(s), destination router, destination node. Labels: generate packet, send to outbound buffer, "Buffer full?" (Y/N), wait for credit, channel delay, arrive.]
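The "Buffer full?" and "wait for credit" steps of the event flow correspond to credit-based flow control. The sketch below is a simplified illustration of the upstream bookkeeping for one downstream virtual channel. The class `CreditedLink` and its methods are hypothetical and are not taken from the ROSS dragonfly model.

```python
from collections import deque


class CreditedLink:
    """Upstream view of one downstream virtual-channel buffer."""

    def __init__(self, buffer_slots: int):
        self.credits = buffer_slots   # free downstream slots known to the sender
        self.waiting = deque()        # packets stalled until a credit returns

    def try_send(self, packet) -> bool:
        """Send if a slot is free; otherwise queue the packet and wait."""
        if self.credits == 0:
            self.waiting.append(packet)
            return False
        self.credits -= 1
        return True

    def on_credit(self):
        """Downstream freed a slot. Returns the waiting packet sent, if any."""
        self.credits += 1
        if self.waiting:
            self.credits -= 1
            return self.waiting.popleft()
        return None


link = CreditedLink(buffer_slots=2)
print(link.try_send("a"), link.try_send("b"), link.try_send("c"))
print(link.on_credit())
```

Because the sender never transmits without a credit, the downstream buffer cannot overflow, which is what allows the model to use finite buffers.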
7 Dragonfly Model Routing Algorithms
Minimal routing (MIN):
  Uniform random traffic: high throughput and low latency. MIN gives the optimal throughput because the traffic is scattered over the entire network.
  Nearest-neighbor traffic: congestion, high latency, and low throughput. MIN always prefers the shortest path, so it congests the single global channel between two groups.
Non-minimal routing (VAL):
  Diverts traffic to a randomly selected intermediate group first and then to the destination group.
  Uniform random traffic: half the throughput of MIN.
  Nearest-neighbor traffic: the best possible throughput (about 50%), because traffic is diverted through an intermediate group.
Global adaptive routing:
  Chooses between MIN and VAL by sensing the traffic conditions on the global channels.
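At the group level, the difference between MIN and VAL can be sketched as follows. The sketch assumes at least three groups, treats intra-group traffic as always minimal, and uses hypothetical function names. It only models which groups a packet visits, not the routers inside them.

```python
import random


def minimal_group_path(src_group: int, dst_group: int) -> list:
    """MIN: go directly from the source group to the destination group."""
    return [src_group] if src_group == dst_group else [src_group, dst_group]


def valiant_group_path(src_group: int, dst_group: int, num_groups: int,
                       rng: random.Random) -> list:
    """VAL: detour through a randomly selected intermediate group."""
    if src_group == dst_group:
        return [src_group]            # assumption: no detour inside a group
    candidates = [g for g in range(num_groups)
                  if g not in (src_group, dst_group)]
    return [src_group, rng.choice(candidates), dst_group]


def global_hops(path: list) -> int:
    """Each step between groups uses one global channel."""
    return len(path) - 1


rng = random.Random(7)
print(global_hops(minimal_group_path(0, 1)))
print(global_hops(valiant_group_path(0, 1, 33, rng)))
```

A VAL packet uses two global channels where a MIN packet uses one, which is consistent with VAL reaching half the throughput of MIN under uniform random traffic.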
8 Dragonfly Model Minimal Routing
[Figure: minimal route of packet P from router R0 to destination router R7, across groups G0 and G1 (routers R0 to R7).]
(i) Packet arrives at R0; destination router = R7.
(ii) Packet traverses to R1 over a local channel.
(iii) Packet traverses from R1 to R4 over the global channel.
(iv) Packet traverses to R7 over a local channel.
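The four steps of the example can be reproduced with a small router-level sketch. It assumes four routers per group (R0 to R3 in G0, R4 to R7 in G1), full local connectivity within a group, and a single global channel between R1 and R4. The grouping is inferred from the figure labels, and the `global_links` table is a hypothetical stand-in for the real wiring.

```python
def minimal_route(src: int, dst: int, routers_per_group: int,
                  global_links: dict) -> list:
    """Return the routers visited on a minimal path from src to dst.

    global_links maps (src_group, dst_group) to (exit_router, entry_router).
    """
    path = [src]
    src_group = src // routers_per_group
    dst_group = dst // routers_per_group
    if src_group != dst_group:
        exit_router, entry_router = global_links[(src_group, dst_group)]
        if exit_router != path[-1]:
            path.append(exit_router)   # local hop to the router owning the global channel
        path.append(entry_router)      # global hop into the destination group
    if dst != path[-1]:
        path.append(dst)               # local hop to the destination router
    return path


# Assumed wiring for the slide's example: R1 <-> R4 is the global channel.
links = {(0, 1): (1, 4), (1, 0): (4, 1)}
print(minimal_route(0, 7, 4, links))   # [0, 1, 4, 7]
```

The path never exceeds one local hop, one global hop, and one more local hop, which is the property that keeps the dragonfly diameter small.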
10 Dragonfly Model Validation
Dragonfly network topologies in design:
  PERCS network topology. The IBM PERCS has a topology similar to the dragonfly and was intended to be part of the Blue Waters system.
  Machines from the Echelon project.
Booksim:
  An open-source, cycle-accurate simulator with a dragonfly model.
  Used by Kim, Dally et al. to validate the dragonfly topology proposal.
  Runs in serial mode only.
  Supports minimal and global adaptive routing.
  Performance results shown on 1,024 nodes and 264 routers.
We validated the correctness of our ROSS dragonfly model against booksim.
Similarities between ROSS and booksim:
  Both support virtual channels, credit-based flow control, and finite buffers.
  Both support uniform random and nearest-neighbor traffic patterns.
  Both use single-flit packets.
The router arbitration policy in ROSS is FCFS, because it was simple to implement for discrete-event simulation and, as the results show, changing the router arbitration policy does not significantly affect the results.
11 Global Adaptive Routing: Threshold selection (ROSS vs. booksim)
To load balance nearest-neighbor traffic, we use the UGAL routing algorithm, which is based on the following rule:
  If min_queue_size < (2 * nonmin_queue_size) + adaptive_threshold then
    route minimally
  Else
    route non-minimally
  End if
Booksim uses an adaptive threshold to bias the UGAL algorithm towards minimal or non-minimal routing. It is set to a positive value (currently 30) to bias the algorithm towards minimal routing under uniform random traffic.
We incorporated a similar adaptive threshold in ROSS to bias the routing decision.
As booksim does not specify the optimal threshold value for global adaptive routing, we ran experiments with different negative threshold values, recording the number of minimal and non-minimal packets, to find the value that best biases traffic towards non-minimal routing.
We selected a threshold of -180 for both ROSS and booksim, as it sends the maximum number of packets along the non-minimal path and gives the minimum latency.
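The routing rule on this slide translates directly into a small function. The sketch below mirrors the slide's pseudocode and variable names; the function name `ugal_choose` and the sample queue sizes are illustrative assumptions.

```python
def ugal_choose(min_queue_size: int, nonmin_queue_size: int,
                adaptive_threshold: int) -> str:
    """Return "MIN" or "VAL" following the slide's threshold rule."""
    if min_queue_size < 2 * nonmin_queue_size + adaptive_threshold:
        return "MIN"
    return "VAL"


# Same queue state, two thresholds (queue sizes are made up for illustration).
print(ugal_choose(10, 0, 30))     # positive threshold favors the minimal path
print(ugal_choose(10, 0, -180))   # negative threshold favors the non-minimal path
```

With a threshold of -180 the minimal path is chosen only when the non-minimal queue is much longer than the minimal one (here, when 2 * nonmin_queue_size exceeds min_queue_size + 180), which is how the threshold pushes nearest-neighbor traffic onto non-minimal paths.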
12 ROSS vs. booksim: Uniform Random traffic
With minimal routing (MIN), ROSS differs from the booksim results by 4.2% on average and by 7% at most.
With global adaptive routing (UGAL), ROSS differs from the booksim results by 3% on average and by 7.8% at most.
13 ROSS vs. booksim: Nearest-neighbor traffic
With minimal routing, nearest-neighbor traffic yields very high latency and low throughput.
This traffic pattern can be load balanced by either non-minimal or adaptive routing.
Non-minimal routing gives slightly under 50% throughput with nearest-neighbor traffic, which is the maximum throughput achievable under this kind of traffic.
Both ROSS and booksim give low latency and high throughput for MIN routing under UR traffic.
Both simulators give high latency and low throughput for MIN routing under WC traffic.
In both simulators, the UGAL algorithm resembles MIN routing under UR traffic.
Both simulators yield slightly under 50% throughput for UGAL under WC traffic.
The small differences can be due to:
  We approximated the internal speedup for booksim.
  The two simulators use different random number generators.
15 Dragonfly performance: ROSS vs. booksim
We compared the performance of ROSS and booksim by measuring the simulation execution time.
As booksim runs serially, we configured ROSS in its serial mode.
Both simulators ran a warm-up phase of 30,000 cycles and a measurement phase of 30,000 cycles.
Tests were carried out on dual core Intel X5650s running at 2.67 GHz.
ROSS attains the following speedups over booksim:
  With MIN routing: a minimum of 5x and a maximum of 11x.
  With global adaptive routing: a minimum of 5.3x and a maximum of 12.38x.
17 ROSS Dragonfly model on BG/P and BG/Q
We evaluated the strong scaling characteristics of the dragonfly model on:
  Argonne Leadership Computing Facility (ALCF) IBM Blue Gene/P system (Intrepid). Intrepid's BG/P has 40 racks, each with 1,024 nodes. It is equipped with a 3D torus network used for point-to-point communication and collective operations.
  Computational Center for Nanotechnology Innovations (CCNI) IBM Blue Gene/Q. The CCNI BG/Q has 1 rack with 1,024 nodes and is equipped with a 5D torus network. Each node has 16 cores, and each core supports 4 threads.
We scheduled 64 MPI tasks per node on BG/Q and 4 MPI tasks per node on BG/P.
Performance was evaluated through the following metrics:
  Committed event rate.
  Percentage of remote events.
  ROSS event efficiency.
  Simulation run time.
18 ROSS Parameters
ROSS employs the Time Warp optimistic synchronization protocol.
To reduce state-saving overheads, ROSS employs an event rollback mechanism.
ROSS event efficiency measures the amount of useful work performed by the simulation.
Global Virtual Time (GVT) imposes a lower bound on the simulation time.
GVT is controlled by the batch and gvt-interval parameters:
  batch is the number of events that ROSS processes before returning to the top of the scheduling loop and checking for the arrival of remote events and messages.
  gvt-interval is the number of iterations of the main event scheduling loop that ROSS goes through before initiating a GVT computation.
On average, batch * gvt-interval events are processed between GVT epochs.
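The interaction of the batch and gvt-interval parameters can be pictured with a toy loop. This is not ROSS code and uses no ROSS API: it ignores rollbacks and remote messages, and only counts how many events are processed before each GVT computation. The function name `run_scheduler` is hypothetical.

```python
def run_scheduler(pending_events: int, batch: int, gvt_interval: int) -> list:
    """Return the number of events processed in each GVT epoch."""
    per_epoch = []
    remaining = pending_events
    while remaining > 0:
        processed = 0
        for _ in range(gvt_interval):
            # Process up to `batch` events, then return to the top of the loop,
            # where a real simulator would check for remote events.
            step = min(batch, remaining)
            remaining -= step
            processed += step
            if remaining == 0:
                break
        # A GVT computation would be initiated here.
        per_epoch.append(processed)
    return per_epoch


print(run_scheduler(100, batch=8, gvt_interval=4))   # [32, 32, 32, 4]
```

Each full epoch processes batch * gvt-interval = 32 events, matching the average stated on the slide. Larger values mean fewer GVT computations but more events at risk between them.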
19 ROSS Dragonfly Performance Results on BG/P vs. BG/Q
Event efficiency drops and total rollbacks increase on BG/P after 16K MPI tasks.
Less off-node communication on BG/Q than on BG/P: as BG/Q has 64 MPI tasks per node vs. 4 MPI tasks per node on BG/P, there is more off-node communication on BG/P. We conjecture that sending a message through memory on BG/Q takes less time than sending an off-node message on BG/P. Therefore, the probability of a message arriving late is lower on BG/Q than on BG/P.
Each MPI task has more processing power on BG/P, so the simulation advances quickly: each BG/Q core has 1.6 GHz of processing power divided among 4 threads, so each thread gets 400 MHz vs. 850 MHz on BG/P. This causes MPI ranks on BG/Q to advance more slowly through simulated time and also lowers the potential for rollbacks relative to BG/P.
20 ROSS Dragonfly Performance Results on BG/P vs. BG/Q
The event efficiency stays high on both BG/P and BG/Q, as each MPI task has a substantial workload.
This increases the computation performed at each MPI task, which dominates the number of events being rolled back.
22 Conclusion & Future work
We presented a parallel discrete-event simulation of a dragonfly network topology.
We validated our simulator against the cycle-accurate simulator booksim.
We demonstrated the ability of our simulator to scale to very large models with up to 50M nodes.
Future work:
  Introduce an improved queue congestion sensing policy for global adaptive routing.
  Experiment with other variations of nearest-neighbor traffic in dragonfly.
  Compare the dragonfly network model with other candidate topology models for exascale computing.