Traffic Steering Between a Low-Latency Unswitched TL Ring and a High-Throughput Switched On-chip Interconnect. Jungju Oh, Alenka Zajic, Milos Prvulovic.


1 Traffic Steering Between a Low-Latency Unswitched TL Ring and a High-Throughput Switched On-chip Interconnect
Jungju Oh, Alenka Zajic, Milos Prvulovic

2 Contents
Introduction
Hybrid Network
– Low-Latency Transmission Line Ring
– Traffic Steering
Evaluation Result
Conclusion

3 Introduction
On-chip communication latency is increasing
Broadcast interconnect
– Insufficient bandwidth and excessive delay for many-core
– Growing core counts → contention
– Growing core counts → longer wire → larger wire capacitance → longer delay
– Unfavorable wire delay with technology scaling
Packet-switched on-chip network (OCN)
+ Short links → fast communication between adjacent nodes
+ Scalable aggregated bandwidth
– Packets travel many links and pipelined routers
– Growing core counts → increasing hop counts/latency for far-apart cores
ITRS 2012

4 Motivation
Switched on-chip network
– Good latency for local traffic, but not for long-distance traffic
– Much more local than long-distance traffic
Broadcast interconnect
– Avoids routing latency even for long-distance traffic
– Cannot handle much traffic

5 Hybrid Network
Exploit the strengths
– Broadcast on Transmission Line: low latency
– Switched on-chip network: throughput
… and alleviate the weaknesses
– Limited TL throughput: use the TL only for critical and/or long-distance traffic
– High switching overhead for long-distance traffic: use the TL
Two critical components to this work
– Transmission Line Broadcast Interconnect: the Why and the How
– Traffic Steering: which messages use which interconnect

6 Transmission Line
Why Transmission Line?
– Extremely fast propagation
    Uses an electromagnetic wave for signal propagation
– 0.0075 ns/mm (unrepeated wire: 0.54 ns/mm)
– Not affected by technology scaling
– But expensive in terms of metal area (20 µm-wide vs. 0.135 µm global wire)
    Limited throughput
[Figure: cross-section comparing the transmission line (TL) and its ground with a traditional global wire; labeled dimensions: 4.193 µm, 4.571 µm, 8.457 µm, 4.1 µm, 16 µm vs. 0.135 µm]
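To put the two per-millimetre figures side by side, here is a minimal Python sketch. It assumes delay scales linearly with distance, which is a simplification based only on the slide's ns/mm numbers, and the 20 mm distance is a hypothetical value, not one taken from the slides.

```python
# Per-millimetre propagation delays quoted on the slide.
TL_NS_PER_MM = 0.0075      # transmission line
WIRE_NS_PER_MM = 0.54      # unrepeated conventional wire


def propagation_delay_ns(distance_mm, ns_per_mm):
    """Delay under a simple linear model (an assumption of this sketch)."""
    return distance_mm * ns_per_mm


# Hypothetical signalling distance; not a value from the slides.
distance_mm = 20.0

tl_delay = propagation_delay_ns(distance_mm, TL_NS_PER_MM)
wire_delay = propagation_delay_ns(distance_mm, WIRE_NS_PER_MM)
ratio = wire_delay / tl_delay

print(f"TL: {tl_delay:.2f} ns, unrepeated wire: {wire_delay:.2f} ns, ratio: {ratio:.0f}x")
```

Under this linear model the ratio between the two delays is the same at any distance, since it depends only on the two per-millimetre figures.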

7 Transmission Line Ring
Transmission Line
– Extremely fast propagation
– But expensive in terms of metal area
Why Ring?
– Minimizes overall TL cost
– Allows fast arbitration (token passing)
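The slide names token passing but gives no protocol details. As a rough illustration only, the sketch below assumes the token moves in one direction around the ring and that the first node it reaches with a pending packet gets to transmit. The function name `next_sender` and the list-of-flags representation are hypothetical.

```python
def next_sender(token_at, has_packet):
    """Return the index of the next node allowed to transmit, or None.

    token_at   -- index of the node currently holding the token
    has_packet -- list of booleans, one per ring node (hypothetical model)

    Assumption of this sketch: the token travels in increasing node order
    and wraps around, and the first node with a pending packet wins.
    """
    n = len(has_packet)
    for offset in range(n):
        node = (token_at + offset) % n
        if has_packet[node]:
            return node
    return None  # nobody has anything to send


# Token is at node 2; only node 0 has a packet, so the token wraps around.
print(next_sender(2, [True, False, False, False]))
```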

8 Unidirectional Transmission Line Ring
Two major problems with TL caused by many connections in many-core
– Attenuation of signal (power split at connections)
– Signal reflections/reverberations (discontinuity at connections)
– Signal needs to stay stronger than sum of noise and reverberations!
Unidirectional Transmission Line (UTL) ring makes it easy to design
– Chained directional couplers in a ring shape
– Control of attenuation
– Almost no reflected signal
Directional Coupler
– Two TL lines running in parallel

9 Unidirectional Transmission Line Ring
Directional Coupler
– Two TL lines running in parallel
– Signal into one end ①
    Most comes out on other end ②
    But some is transferred (EM-coupled) to same direction on other line ③
– Directivity: (almost) no signal on ④
– Chain couplers using one line, use the other to connect transmitters/receivers
[Figure: directional coupler with ports ① to ④ on the transmission line, with Core 1 (Tx1, Rx1) and Core 2 (Tx2, Rx2) attached]

10 Using the UTL Ring
Simple receiver/transmitter
– Simple modulation: on-off keying
– 1 bit = one or more consecutive pulses
How fast can we transfer?
– Depends on available spectrum of the transmission medium
– UTL coupler: 20–60 GHz
– 40 GHz clock, 2 pulses/bit → 20 Gbps
Transmitter
– PLL (pulses)
– Pass-gate (on/off pulses)
– Amplifier (impedance matching)
Receiver
– Pulse detector
– Shift register (collect high rate bits)
[Figure: block diagram of the transmitter (PLL, Data, Amp) and the receiver (Detector, Shift register, Data)]
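The data-rate arithmetic and the on-off keying scheme above can be sketched in a few lines of Python. The function names are hypothetical, and the decoding rule (a bit is 1 if any pulse is detected in its group of slots) is an assumption of this sketch. The slides only say that the receiver uses a pulse detector and a shift register.

```python
def data_rate_gbps(clock_ghz, pulses_per_bit):
    """Bit rate when each bit occupies `pulses_per_bit` clock pulses."""
    return clock_ghz / pulses_per_bit


def ook_encode(bits, pulses_per_bit=2):
    """On-off keying: a 1 bit sends pulses in all its slots, a 0 bit sends none."""
    slots = []
    for bit in bits:
        slots.extend([bit] * pulses_per_bit)
    return slots


def ook_decode(slots, pulses_per_bit=2):
    """Assumed rule: a bit is 1 if any pulse was detected in its slots."""
    bits = []
    for i in range(0, len(slots), pulses_per_bit):
        bits.append(1 if any(slots[i:i + pulses_per_bit]) else 0)
    return bits


rate = data_rate_gbps(40, 2)          # 40 GHz clock, 2 pulses per bit
message = [1, 0, 1, 1, 0]
pulses = ook_encode(message)
print(rate, pulses, ook_decode(pulses))
```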

11 Traffic Steering
Which packet should use which network?
Static steering
– E.g. >8 hops go to TL, rest goes on mesh
– Lacks adaptivity
    When traffic is low, 8-hop, 7-hop, etc. packets could also benefit from the ring
    When traffic is high, the ring can become saturated
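A static rule like the ">8 hops" example could look as follows in Python. The sketch assumes an 8×8 mesh with row-major tile numbering and counts hops as Manhattan distance. Both are assumptions made for illustration, as are the function names.

```python
MESH_WIDTH = 8  # assumed 8x8 mesh with row-major tile numbering


def hop_count(src, dst, width=MESH_WIDTH):
    """Manhattan distance between two tiles (assumed hop metric)."""
    sx, sy = src % width, src // width
    dx, dy = dst % width, dst // width
    return abs(sx - dx) + abs(sy - dy)


def steer_static(src, dst, hop_threshold=8):
    """Send packets travelling more than `hop_threshold` hops to the ring."""
    return "ring" if hop_count(src, dst) > hop_threshold else "mesh"


print(steer_static(0, 63))  # corner to corner: 14 hops
print(steer_static(0, 36))  # exactly 8 hops, so it stays on the mesh
print(steer_static(0, 1))   # neighbouring tiles
```

Because the threshold is fixed, the 8-hop packet in the second call stays on the mesh even when the ring is idle, which is the lack of adaptivity the slide points out.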

12 Adaptive Steering
Ring-Affinity Score
– More hops → more benefit from using the ring
– Non-critical packet → no benefit
– Ring-Affinity Score = latency difference plus criticality adjustment
Threshold
– Score above threshold → use ring
– Adjust threshold to prevent ring bandwidth saturation
    Too much traffic on the ring → queuing delays → all benefit disappears
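The score and threshold test described above can be written down as a small Python sketch. The slides do not say how the latency estimates or the criticality adjustment are obtained, so here they are passed in as plain numbers. All values and function names are hypothetical.

```python
def ring_affinity_score(mesh_latency, ring_latency, criticality_adjustment):
    """Score = latency difference plus criticality adjustment.

    The latency estimates (in cycles) and the adjustment are inputs;
    how they are computed is not specified here.
    """
    return (mesh_latency - ring_latency) + criticality_adjustment


def use_ring(score, threshold):
    """A packet is steered to the ring when its score exceeds the threshold."""
    return score > threshold


# Hypothetical numbers: a long-distance critical packet...
critical = ring_affinity_score(mesh_latency=30, ring_latency=6, criticality_adjustment=0)
# ...and the same route for a non-critical packet, penalised by the adjustment.
non_critical = ring_affinity_score(mesh_latency=30, ring_latency=6, criticality_adjustment=-24)

print(critical, use_ring(critical, threshold=20))
print(non_critical, use_ring(non_critical, threshold=20))
```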

13 Ring-Affinity Score

14 Ring Affinity Scoring
[Figure: scoring example; cores 3 and 10 are marked]
Core 3 sent packet on ring at cycle 10
Core 10 sent packet on ring at cycle 20

15 Threshold and Re-steering
Threshold adjusted to manage UTL ring utilization
– Low enough to avoid excessive queuing
– But high enough not to waste the ring throughput
– Target utilizations around 75% tend to work well
Threshold Management
– Packet steered to ring when its score exceeds the threshold
– Increase threshold when ring utilization is higher than desired
– Decrease the threshold if ring utilization is too low
Re-Steering
– Sudden burst of high-scoring packets…
    Threshold adaptation takes a while
    Meanwhile, ring packets have very long latencies
– If ring-steered packet sits in queue too long, re-steer to the mesh
    How long is too long?
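One way to picture the threshold management and re-steering rules is the Python sketch below. The 75% target comes from the slide. The step size, the tolerance band around the target, the initial threshold and the `max_wait_cycles` limit are hypothetical parameters, and the slide itself leaves "how long is too long" open.

```python
class ThresholdController:
    """Raises or lowers the steering threshold based on ring utilization."""

    def __init__(self, threshold=10.0, target_utilization=0.75,
                 step=1.0, tolerance=0.05):
        # All defaults except the 75% target are illustrative assumptions.
        self.threshold = threshold
        self.target = target_utilization
        self.step = step
        self.tolerance = tolerance

    def update(self, measured_utilization):
        """Adjust the threshold after observing ring utilization (0.0 to 1.0)."""
        if measured_utilization > self.target + self.tolerance:
            self.threshold += self.step   # ring too busy: admit fewer packets
        elif measured_utilization < self.target - self.tolerance:
            self.threshold -= self.step   # ring underused: admit more packets
        return self.threshold


def should_resteer(cycles_in_queue, max_wait_cycles):
    """Re-steer a ring-bound packet to the mesh if it has waited too long."""
    return cycles_in_queue > max_wait_cycles


controller = ThresholdController()
print(controller.update(0.90))  # over target: threshold goes up
print(controller.update(0.50))  # under target: threshold comes back down
print(should_resteer(cycles_in_queue=40, max_wait_cycles=32))
```

The tolerance band is there only to keep the threshold from oscillating when utilization hovers near the target. It is a design choice of this sketch rather than something stated on the slide.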

16 Evaluation
Simulated using SESC
– 64-tile CMP, 2-issue OoO, 1 GHz, 32 KB L1 D/I cache, 1 MB slice of L2
– 8×8 mesh (switched NoC) with 128-bit link width, 8 VC (24 buffers)
Applications from the PARSEC 3 and SPLASH-2 benchmark suites
– Half of the applications show <20% improvement with ideal interconnect
– Focus analysis on applications that are sensitive to on-chip latency

17 Speedup
1.14×

18 Speedup
4-concentrated mesh + UTL Ring
– 8.7% improvement: 1.13× → 1.23×

19 Speedup
4-concentrated mesh + UTL Ring
– 8.7% improvement: 1.13× → 1.23×
Flattened Butterfly + UTL Ring
– 5.7% improvement: 1.10× → 1.16×

20 Summary
Increasing core counts worsens on-chip latency
Unidirectional Transmission Line Ring
– Low-latency
– But limited throughput
Use UTL Ring with switched interconnect synergistically
– UTL Ring for low latency
– Switched interconnect for throughput
Adaptive traffic steering enables judicious use of the ring
– Proposed traffic steering provides 14% performance improvement

21 Thank you!

22 Result: Latency Reduction of UTL Ring
UTL Ring latency is 55% lower than the mesh
– Lower latency than advanced interconnects
– >44% latency reduction over concentrated mesh and flattened butterfly
– But we can only do this for 13% to 44% of messages (2.0% to 9.9% of the bits)
[Figure labels: 44.3%, 43.9%]

23 Result: Speedup vs. Mesh Alone
[Figure labels: 1.14×, 1.10×, 1.13×]

24 Adaptive vs Non-Adaptive Steering
Non-adaptive random steering
– 0.63× slowdown on application (ocean-nc) with high on-chip traffic
– 1.02× speedup if 30% of packets use UTL Ring randomly (RND30)
– 0.96× slowdown if 50% (RND50)
Adaptive traffic steering
– 1.14× speedup (up to 1.20× with 64 Gbps configuration)

