
Trace-Driven Optimization of Networks-on-Chip Configurations
Andrew B. Kahng †‡, Bill Lin ‡, Kambiz Samadi ‡, Rohit Sunkam Ramanujam ‡
University of California, San Diego
CSE † and ECE ‡ Departments
June 16, 2010

Outline
- Motivation
- Trace-Driven Problem Formulation
- Greedy Addition VC Allocation
- Greedy Deletion VC Allocation
- Runtime Analysis
- Experimental Setup
- Evaluation
- Experimental Results
- Power Impact
- Conclusions

Motivation
- NoCs are needed to interconnect many-core chips
- Scalable on-chip communication fabric
- An emerging interconnection paradigm for building complex VLSI systems
- NoCs can be used to interconnect general-purpose chip multiprocessors (CMPs) or application-specific multiprocessor systems-on-chip (MPSoCs)
[Figure labels: Processing Element, Router]

CMPs vs. MPSoCs
- Traditional application domains
  - MPSoCs target embedded domains
  - CMPs target general-purpose computing
- Common
  - Need for high memory bandwidth
  - Power efficiency, system control, etc.
- Different
  - CMPs must run a wide range of applications
  - MPSoCs have more irregularities
  - MPSoCs have tighter cost and time-to-market constraints
- Conclusion: application-specific optimization is required for MPSoCs

Trace-Driven vs. Average-Rate Driven
- Actual traffic behavior of two PARSEC benchmark traces
- Actual traffic tends to be very bursty, with substantial fluctuations over time
- Average-rate driven approaches are misled by the average traffic characteristics → poor design choices
- Our approach: trace-driven NoC configuration optimization

Head-of-Line (HOL) Blocking Problem
- HOL blocking happens in input-buffered routers
- Flits queued behind a blocked head flit are blocked as well → significantly increases latency and reduces throughput
- Virtual channels overcome this problem by multiplexing the input buffers
[Figure: input buffers feeding Output 1, Output 2, and Output 3, with a blocked flit marked "Blocked!"]
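The difference between a single input FIFO and per-VC queues can be sketched in a few lines of Python. This is a toy model, not a router implementation: flits are represented only by the output port they request, and the function names `fifo_forwardable` and `vc_forwardable` are hypothetical.

```python
from collections import deque

def fifo_forwardable(queue, blocked_outputs):
    """Single FIFO: only the head flit may leave, so a blocked head stalls everything behind it."""
    if queue and queue[0] not in blocked_outputs:
        return [queue[0]]
    return []

def vc_forwardable(virtual_channels, blocked_outputs):
    """One FIFO per VC: the head of each VC is a candidate, so a blocked head stalls only its own VC."""
    return [vc[0] for vc in virtual_channels if vc and vc[0] not in blocked_outputs]

# Flits are labelled by the output port they request (assumed example data).
single_queue = deque([1, 2, 3])          # head wants Output 1
two_vcs = [deque([1]), deque([2, 3])]    # same flits split across two VCs
blocked = {1}                            # Output 1 is currently busy

fifo_result = fifo_forwardable(single_queue, blocked)
vc_result = vc_forwardable(two_vcs, blocked)
print(fifo_result, vc_result)
```

With Output 1 busy, the single FIFO forwards nothing, while the two-VC arrangement can still forward the flit bound for Output 2.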

Average-Rate Driven Shortcoming
- 3 packets with the following (source, destination) pairs: (A, G), (B, E), (F, E)
- Suppose all 3 packets are 10 flits in size and are all injected at t = 0
- Channels 2 and 3 will each carry two packets, from (A, G) and (B, E), and Channel 4 will also carry two packets, from (B, E) and (F, E)
- Average-rate analysis concludes that adding a VC to Channels 2 and 3 is as good as adding a VC to Channel 4, since all 3 channels have the same "load"
- Average-rate driven approaches lead to poor design choices
[Figure: nodes A, B, C, D, F, G, E connected by Channels 1 to 6, with the routes of packets (B, E), (A, G), and (F, E)]
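The "same load" observation can be checked with a short Python sketch that sums flits per channel. The slide states which packets share Channels 2, 3, and 4; the full routes below, including the use of Channels 1, 5, and 6, are an assumption reconstructed from the example, and `channel_loads` is a hypothetical helper.

```python
from collections import Counter

PACKET_FLITS = 10  # every packet is 10 flits, as in the example

# Assumed routes (channel numbers as in the figure).
routes = {
    ("A", "G"): [1, 2, 3, 6],
    ("B", "E"): [2, 3, 4],
    ("F", "E"): [5, 4],
}

def channel_loads(routes, flits_per_packet):
    """Total flits carried by each channel, ignoring when the flits arrive."""
    loads = Counter()
    for channels in routes.values():
        for channel in channels:
            loads[channel] += flits_per_packet
    return loads

loads = channel_loads(routes, PACKET_FLITS)
print({ch: loads[ch] for ch in (2, 3, 4)})
```

Because this metric discards all timing information, Channels 2, 3, and 4 look identical, which is exactly the blind spot the slide describes.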

Wormhole Configuration
- At t = 1, the color-coded channels in the figure are held by each packet, assuming a single VC (i.e., wormhole routing)
- At t = 2, Packet (A, G) is blocked from proceeding because Channel 2 is already held by Packet (B, E)
- At t = 12 (= 3 + 9), Packet (B, E) can proceed to Channel 4, since it has been released by Packet (F, E)
- At t = 20, Packet (A, G) acquires Channel 3
- At t = 21, Packet (A, G) acquires Channel 6 as well, and Packet (B, E) completes
- Packet (A, G) will complete at t = 35
[Figure: nodes A, B, C, D, F, G, E connected by Channels 1 to 6]

Latency Reduction via VC Allocation
- Now assume Channels 2 and 3 each have 2 VCs
- In this case, Packet (A, G) can bypass Packet (B, E) while Packet (B, E) is blocked by Packet (F, E) at Channel 4
- At t = 12, Packet (F, E) completes, and Packet (B, E) can proceed on Channel 4
- At t = 13, the last flit of Packet (A, G) is at Channel 6
- At t = 22, the last flit of Packet (B, E) is at Channel 4, and Packet (A, G) has already completed
- With 2 VCs at Channels 2 and 3, the completion time is 23 cycles vs. 35 cycles without these VCs
- The main reason for the improvement is that Channels 2 and 3 are prevented from sitting idle
[Figure: nodes A, B, C, D, F, G, E connected by Channels 1 to 6]


Problem Formulation
- Given:
  - Application communication trace, C_trace
  - Network topology, T(P, L)
  - Deterministic routing algorithm, R
  - Target latency, D_target
- Determine:
  - A mapping n_VC from the set of links L to the set of positive integers, i.e., n_VC : L → Z+, where for any l ∈ L, n_VC(l) gives the number of VCs associated with link l
- Objective:
  - Minimize
- Subject to:
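The objective and constraint expressions are not present in the slide text. One reading that is consistent with the symbols defined on this slide and with the latency check D(n_VC, R) ≤ D_target used by the heuristics that follow is sketched below; it is an assumption, not a quotation of the authors' formulation.

```latex
\begin{aligned}
\text{minimize}   \quad & \sum_{l \in L} n_{VC}(l) \\
\text{subject to} \quad & D(n_{VC}, R) \le D_{target}, \\
                        & n_{VC}(l) \in \mathbb{Z}^{+} \quad \forall\, l \in L
\end{aligned}
```

Here D(n_VC, R) denotes the average packet latency obtained by simulating the trace C_trace on topology T(P, L) under routing algorithm R with VC allocation n_VC.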

Greedy Addition VC Allocation Heuristic (1)
- Inputs:
  - Communication traffic trace, C_trace
  - Network topology, T(P, L)
  - Routing algorithm, R
  - Target latency, D_target
- Output:
  - Vector n_VC, which contains the number of VCs associated with each link

time  src    dest   packet size
1     (1,0)  (0,2)  4
2     (2,2)  (3,1)  4
5     (1,3)  (3,1)  4
7     (2,1)  (3,2)  4
…

[Figure: 4x4 mesh with nodes (0,0) through (3,3)]
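A trace with these four columns could be loaded as follows. This is a minimal sketch: the on-disk text layout, the `TraceRecord` type, and the `parse_trace` function are assumptions for illustration and do not describe the trace format of any particular simulator.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class TraceRecord:
    time: int     # injection time
    src: tuple    # source node coordinates
    dest: tuple   # destination node coordinates
    size: int     # packet size, in the units used by the trace

# Assumed layout: "time (sx,sy) (dx,dy) size", one record per line.
_LINE = re.compile(r"(\d+)\s+\((\d+),(\d+)\)\s+\((\d+),(\d+)\)\s+(\d+)")

def parse_trace(text):
    """Parse trace text into a list of TraceRecord objects."""
    records = []
    for line in text.strip().splitlines():
        match = _LINE.fullmatch(line.strip())
        if match is None:
            raise ValueError(f"unrecognized trace line: {line!r}")
        t, sx, sy, dx, dy, size = map(int, match.groups())
        records.append(TraceRecord(t, (sx, sy), (dx, dy), size))
    return records

SAMPLE = """
1 (1,0) (0,2) 4
2 (2,2) (3,1) 4
5 (1,3) (3,1) 4
7 (2,1) (3,2) 4
"""
records = parse_trace(SAMPLE)
print(records[0])
```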

Greedy Addition VC Allocation Heuristic (2)
- The algorithm initializes every link with one VC
- The algorithm proceeds in greedy fashion
- In each iteration, the performance of all VC perturbations is evaluated
- Each perturbation consists of adding exactly one VC to one link
- The average packet latency (APL) of each perturbed VC configuration is evaluated → the configuration with the smallest APL is chosen for the next iteration
- The algorithm stops if either (1) the total number of allocated VCs exceeds the VC budget, or (2) a configuration with an APL better than the target latency is achieved

Greedy Addition VC Allocation Heuristic (3)
    for l = 1 to N_L                          // initialize to the wormhole configuration
        n_VC^current(l) = 1;
    end for
    n_VC^best = n_VC^current;
    N_VC = N_L;
    while (N_VC <= budget_VC)                 // check the VC budget
        for l = 1 to N_L                      // VC perturbations are evaluated in parallel in each iteration
            n_VC^new = n_VC^current;
            n_VC^new(l) = n_VC^current(l) + 1;
            run trace simulation on n_VC^new and record D(n_VC^new, R)
        end for
        find n_VC^best;                       // best configuration of the current iteration
        n_VC^current = n_VC^best;
        if (D(n_VC^best, R) <= D_target)
            break;
        end if
        N_VC++;
    end while
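The addition heuristic can be sketched in Python as follows. The trace simulator is abstracted as a `simulate` callable that returns the APL for a VC configuration; the function name `greedy_addition`, the toy latency model, and its weights are all hypothetical stand-ins. Unlike the pseudocode, this sketch checks the target before the first addition and never lets the total exceed the budget, which is a simplifying assumption.

```python
def greedy_addition(num_links, budget, d_target, simulate):
    """Add one VC at a time to the link that yields the lowest simulated APL."""
    current = [1] * num_links            # wormhole configuration: one VC per link
    total = num_links
    best_latency = simulate(current)
    while total < budget and best_latency > d_target:
        candidates = []
        for link in range(num_links):    # each perturbation adds exactly one VC to one link
            trial = list(current)
            trial[link] += 1
            candidates.append((simulate(trial), link, trial))
        best_latency, _, current = min(candidates)   # smallest APL wins
        total += 1
    return current, best_latency

# Toy stand-in for trace simulation (assumed model, not a real simulator):
# each link contributes weight / number_of_VCs to the latency.
WEIGHTS = [8.0, 2.0, 1.0]

def toy_simulate(config):
    return sum(w / n for w, n in zip(WEIGHTS, config))

config, latency = greedy_addition(num_links=3, budget=6, d_target=6.0, simulate=toy_simulate)
print(config, round(latency, 3))
```

Since the perturbations within one iteration are independent, the inner loop is the natural place to run simulations in parallel.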

Greedy Addition VC Allocation Heuristic Drawback
- Packets (A, F) and (A, E) share links A→B and B→C, both of which have only one VC
- (A, F) turns west and (A, E) turns east at Node C → adding a VC to only one of links A→B and B→C may not have a significant impact on APL
- If VCs are added to both links A→B and B→C, the APL may be significantly reduced
- The greedy addition approach may fail to realize the benefit of such combined additions and pick neither link
[Figure: nodes A, B, C, D, E, F with the routes of packets (A, E) and (A, F)]

Greedy Deletion VC Allocation Heuristic
    n_VC^current = n_VC^initial;              // start with a given VC configuration
    n_VC^best = n_VC^current;
    N_VC = Σ_l n_VC^current(l);
    while (N_VC >= budget_VC)
        for l = 1 to N_L
            n_VC^new = n_VC^current;
            if (n_VC^current(l) > 1)          // each link keeps at least 1 VC, i.e., the wormhole configuration
                n_VC^new(l) = n_VC^current(l) - 1;
                run trace simulation on n_VC^new and record D(n_VC^new, R)
            end if
        end for
        find n_VC^best;                       // best configuration of the current iteration, i.e., the one with the least degradation in APL
        n_VC^current = n_VC^best;
        if (D(n_VC^best, R) <= D_target)
            break;
        end if
        N_VC--;
    end while
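A simplified Python sketch of the deletion direction is shown below. It keeps only the budget-driven stopping rule and leaves out the latency-target check, which is an assumption made to keep the example short; `greedy_deletion` and the toy latency model are hypothetical.

```python
def greedy_deletion(initial, budget, simulate):
    """Remove one VC at a time, choosing the removal that degrades APL the least."""
    current = list(initial)
    total = sum(current)
    while total > budget:
        candidates = []
        for link, vcs in enumerate(current):
            if vcs > 1:                   # every link keeps at least one VC
                trial = list(current)
                trial[link] -= 1
                candidates.append((simulate(trial), link, trial))
        if not candidates:                # already at the wormhole configuration
            break
        _, _, current = min(candidates)   # least degradation in APL
        total -= 1
    return current

# Toy stand-in for trace simulation (assumed model, not a real simulator).
WEIGHTS = [8.0, 2.0, 1.0]

def toy_simulate(config):
    return sum(w / n for w, n in zip(WEIGHTS, config))

result = greedy_deletion(initial=[3, 3, 3], budget=6, simulate=toy_simulate)
print(result)
```

Starting from a uniform allocation, the sketch strips VCs from the lightly weighted links first and leaves the heavily weighted link untouched.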

Addition and Deletion Heuristics Comparison
- APL decreases as VCs are added (addition heuristic)
- APL increases as VCs are removed (deletion heuristic)
- Adding a single VC to a link may not have a significant impact on APL
- The APL change is much smoother with the deletion heuristic

Runtime Analysis
- Let m be the number of VCs added to (deleted from) an initial VC configuration
- T_heuristic = m × N_L × T(trace simulation)
  - T(trace simulation) is the average time to run a trace simulation over all VC configurations explored by the algorithm
- Our heuristics can easily be parallelized
  - Evaluate all VC configurations of an iteration in parallel
  - T_heuristic = m × T(trace simulation)_max
  - T(trace simulation)_max represents the average of the maximum trace simulation runtimes at each iteration
- For larger networks, maintaining a reasonable runtime requires O(L) processing nodes
  - Trace compression
  - Other metrics to capture the impact of VCs on APL more efficiently
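The two runtime expressions can be compared with a quick calculation. The numbers below (32 added VCs, 64 links, 60 s average and 75 s worst-case simulation time) are made-up inputs for illustration only, and the function names are hypothetical.

```python
def sequential_runtime(m, num_links, t_sim):
    """T_heuristic = m x N_L x T(trace simulation), with all simulations run one after another."""
    return m * num_links * t_sim

def parallel_runtime(m, t_sim_max):
    """T_heuristic = m x T(trace simulation)_max, with one processing node per perturbation."""
    return m * t_sim_max

# Assumed example values, in seconds.
seq = sequential_runtime(m=32, num_links=64, t_sim=60.0)
par = parallel_runtime(m=32, t_sim_max=75.0)
print(seq, par, seq / par)
```

Under these assumed inputs the parallel version is bounded by the slowest simulation in each iteration rather than by the number of links.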


Experimental Setup (1)
- We use Popnet for trace simulation
- Popnet models a typical four-stage router pipeline
- The head flit of a packet traverses all four stages, while body flits bypass the first stage
- The number of VCs at each input port can be individually configured, allowing a nonuniform VC configuration at a router
- The latency of a packet is measured as the delay between the time the head flit is injected into the network and the time the tail flit is consumed at the destination
- The reported APL value is the average latency over all packets in the input traffic trace
[Figure: tool flow with labels GEMS, Simics, network configuration, workload, communication trace]
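The latency and APL definitions above amount to the following calculation. The packet timestamps are invented example data, and `average_packet_latency` is a hypothetical helper rather than part of Popnet.

```python
def average_packet_latency(packets):
    """packets: iterable of (head_injected, tail_consumed) cycle timestamps."""
    latencies = [tail_consumed - head_injected for head_injected, tail_consumed in packets]
    return sum(latencies) / len(latencies)

# Assumed example: three packets with latencies of 12, 17, and 20 cycles.
packets = [(0, 12), (3, 20), (5, 25)]
apl = average_packet_latency(packets)
print(round(apl, 2))
```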

Experimental Setup (2)
- To evaluate our VC allocation heuristics, we use seven different applications from the PARSEC benchmark suite
- Network traffic traces are generated by running these applications on Virtutech Simics
- The GEMS toolset is used for accurate timing simulation
- We simulate a 16-core, 4x4 mesh configuration:

Cores             16
Private L1 Cache  32KB
Shared L2         1MB distributed over 16 banks
Memory Latency    170 cycles
Network           4x4 mesh
Packet Sizes      72B data packets, 8B control packets


Comparison vs. Uniform-2VC
- The average-rate driven method is outperformed by uniform VC allocation
- Our addition and deletion heuristics achieve up to 36% and 34% reduction in the number of VCs, respectively (w.r.t. the uniform-2VC configuration)
- On average, both of our heuristics reduce the number of VCs by around 21% across all traces (w.r.t. the uniform-2VC configuration)

Comparison vs. Uniform-3VC
- Our addition and deletion heuristics achieve up to 48% and 51% reduction in the number of VCs, respectively
- On average, our addition and deletion heuristics achieve up to 31% and 41% reduction in the number of VCs across all traces
- We observe up to 35% reduction in the number of VCs compared against an existing average-rate driven approach

Latency and #VC Reductions
- With #VC = 128, our greedy deletion heuristic improves the APL by 32% and 74% for the fluidanimate and vips traces, respectively, compared with the uniform-2VC configuration
- Our deletion heuristic also achieves 50% and 42% reduction in the number of VCs, respectively, compared with the uniform-4VC configuration
- Our proposed trace-driven approach can potentially be used to (1) improve performance within a given power constraint, and (2) reduce power within a given performance constraint
[Figure: two panels, vips trace and fluidanimate trace, each annotated with "Latency reduction" and "VC reduction"]

Impact on Power
- We use ORION 2.0 to assess the impact of our approach on power consumption
- ORION 2.0 assumes the same number of VCs at every port in the router
- We need to compute the router power for nonuniform VC configurations
  - Estimate the power overhead of adding a single VC to all router ports
  - Estimate the power overhead of adding a single VC to just one port
- A similar approach is used to estimate the area overhead of adding a single VC to one router port
- We observe that our proposed approach achieves up to 7% and 14% reduction in power compared against the uniform-2VC and uniform-3VC configurations, respectively (without any performance degradation)
- Similarly, we observe up to 9% and 16% reduction in area compared against the uniform-2VC and uniform-3VC configurations, respectively
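The two estimation steps could be combined as in the sketch below. The slide does not say how the single-port overhead is derived from the all-ports overhead; dividing evenly by the number of ports and scaling linearly with the number of extra VCs are both assumptions made here. The function names and the power values are hypothetical, and the uniform-configuration powers stand in for numbers that would come from a power model.

```python
def per_port_vc_overhead(p_uniform_v, p_uniform_v_plus_1, num_ports):
    """Overhead of one extra VC on one port, assuming the all-ports overhead splits evenly."""
    return (p_uniform_v_plus_1 - p_uniform_v) / num_ports

def nonuniform_router_power(p_base, vcs_per_port, base_vcs, overhead_per_port_vc):
    """Base router power plus a linear term for every VC above the uniform baseline (assumed model)."""
    extra_vcs = sum(v - base_vcs for v in vcs_per_port)
    return p_base + extra_vcs * overhead_per_port_vc

# Assumed numbers in arbitrary power units for a 5-port router.
P_UNIFORM_1VC = 100.0
P_UNIFORM_2VC = 120.0
overhead = per_port_vc_overhead(P_UNIFORM_1VC, P_UNIFORM_2VC, num_ports=5)
power = nonuniform_router_power(P_UNIFORM_1VC, [2, 1, 1, 3, 1], base_vcs=1,
                                overhead_per_port_vc=overhead)
print(overhead, power)
```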


Conclusions
- Proposed a trace-driven method for optimizing NoC configurations
- Considered the problem of application-specific VC allocation
- Showed that existing average-rate driven VC allocation approaches fail to capture the application-specific characteristics needed to further improve performance and reduce power
- In comparison with uniform VC allocation, our approaches achieve up to 51% and 74% reduction in the number of VCs and in average packet latency, respectively
- In comparison with an existing average-rate driven approach, we observe up to 35% reduction in the number of VCs
- Ongoing work
  - New metrics to capture the impact of VC allocation on average packet latency more efficiently
  - New metaheuristics to further increase the performance and VC-reduction gains

Thank You
