Light NUCA: a proposal for bridging the inter-cache latency gap Darío Suárez 1, Teresa Monreal 1, Fernando Vallejo 2, Ramón Beivide 2, and Victor Viñals.

Light NUCA: a proposal for bridging the inter-cache latency gap
Darío Suárez (1), Teresa Monreal (1), Fernando Vallejo (2), Ramón Beivide (2), and Victor Viñals (1)
(1) Universidad de Zaragoza and (2) Universidad de Cantabria, Spain

2 Load-to-Use cache latency trend

3 Facing the inter-cache latency gap
Reconfigurable L1/L2 (Balasubramonian et al., MICRO00): single-ported memory cells; low bandwidth
NUCA (Kim et al., ASPLOS02): wire-delay dominated large caches; routing-cache-routing; network overhead
L-NUCA: L1 + small cache tiles + specialized networks; low latency; high bandwidth; large associativity

4 Summary
Motivation
Introduction to L-NUCAs
Networking: Topologies, Routing, Messages
Global Miss Determination
Single cycle cache look-up plus one-hop routing
Experimental Platform
Results
Conclusions

5 L-NUCA introduction
Latency  Tiles  Size (KB)
1        1      8
3        4      32
4        6      48
5        9      72
6        13     104
7        15     120
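The latency, tile-count, and size figures on this slide are consistent with a fixed tile capacity: in every row, size divided by tiles gives 8 KB. The short check below is only a sketch of that arithmetic. The name `TILE_KB` and the helper `capacity_kb` are hypothetical, and the 8 KB value is inferred from the numbers rather than stated on this slide.

```python
# Assumption: every tile holds 8 KB (inferred from size / tiles on the slide).
TILE_KB = 8

# (latency in cycles, number of tiles, size in KB) as listed on the slide.
ROWS = [
    (1, 1, 8),
    (3, 4, 32),
    (4, 6, 48),
    (5, 9, 72),
    (6, 13, 104),
    (7, 15, 120),
]


def capacity_kb(tiles, tile_kb=TILE_KB):
    """Capacity in KB of a group of equally sized tiles."""
    return tiles * tile_kb


def rows_are_consistent(rows=ROWS):
    """True if every listed size equals tiles * TILE_KB."""
    return all(capacity_kb(tiles) == size for _, tiles, size in rows)


if __name__ == "__main__":
    for latency, tiles, size in ROWS:
        print(f"latency {latency}: {tiles} tiles -> {capacity_kb(tiles)} KB (slide: {size} KB)")
```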

6 Topologies and Routing
Search, Transport, Replacement: independent operations, which ensures deadlock avoidance
Search: broadcast tree; no flow control
Transport: 2D mesh; On/Off flow control; Dynamic Distributed Routing (DDR)
Replacement: latency-driven topology; blocks ordered by temporal locality; On/Off flow control; DDR

7 Headerless messages
Operation    Message content            Source      Destination                  Width (bits or wires)
Search       @ + MSHR + st data + ctrl  r-tile      rest of tiles                41 + 4 + 64 + 2 = 111
Transport    block + MSHR               tile (hit)  r-tile                       256 + 4 = 260
Replacement  block + @                  tile i      tile k, lat(k) = lat(i) + 1  256 + 41 = 297
Assuming 32-byte blocks
No header overhead: message = packet = flit = phit
Implicit destination
More than 1k m4/m5 wires fit in one side of an 8KB cache (Intel 32nm, Natarajan et al., IEDM08)
8KB cache: > 1000 wires; worst case: 668
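The width arithmetic on this slide can be reproduced in a few lines. The sketch below recomputes the three message widths from the field sizes given on the slide (41-bit address, 4-bit MSHR identifier, 64-bit store data, 2 control bits, 32-byte block). The constant and function names are hypothetical. Reading the 668-wire worst case as the sum of the three widths is an assumption; it matches the numbers (111 + 260 + 297 = 668) but the slide does not spell it out.

```python
# Field widths in bits, taken from the slide.
ADDRESS_BITS = 41
MSHR_BITS = 4
STORE_DATA_BITS = 64
CONTROL_BITS = 2
BLOCK_BITS = 32 * 8  # 32-byte blocks

# Wire budget on one side of an 8 KB cache quoted on the slide ("more than 1k").
WIRES_PER_SIDE = 1000


def message_widths():
    """Width in bits (one wire per bit) of each headerless message type."""
    return {
        "search": ADDRESS_BITS + MSHR_BITS + STORE_DATA_BITS + CONTROL_BITS,
        "transport": BLOCK_BITS + MSHR_BITS,
        "replacement": BLOCK_BITS + ADDRESS_BITS,
    }


def worst_case_wires():
    """Assumption: worst case is one link of each network on the same tile side."""
    return sum(message_widths().values())


if __name__ == "__main__":
    for name, width in message_widths().items():
        print(f"{name}: {width} wires")
    print("worst case:", worst_case_wires(), "fits:", worst_case_wires() < WIRES_PER_SIDE)
```

Because a message is a single phit, no serialization or header decoding is needed; the cost is paid in wires, which is why the slide compares the total against the available wiring.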

8 Global Miss Determination Logic
Tiles stop miss propagation on hits
L-NUCA miss iff all last-level tiles miss
Scalable hierarchical organization, taken from SRAM bitlines (Yang and Kim, JSSC05)
Global miss known one cycle after the last-level look-up
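The rule "L-NUCA miss iff all last-level tiles miss" is a logical AND over the miss signals of the last-level tiles, and the hierarchical organization can be pictured as a tree of small AND gates. The following is a minimal software sketch of that idea, not the circuit from the talk. The function name `global_miss` and the `fan_in` parameter are hypothetical.

```python
def global_miss(last_level_tile_misses, fan_in=4):
    """Combine per-tile miss signals with a tree of fan_in-input AND stages.

    last_level_tile_misses: iterable of booleans, True if that tile missed.
    Returns True only if every last-level tile missed.
    """
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    signals = list(last_level_tile_misses)
    if not signals:
        raise ValueError("at least one tile signal is required")
    # Each loop iteration models one level of the hierarchy.
    while len(signals) > 1:
        signals = [
            all(signals[i:i + fan_in])
            for i in range(0, len(signals), fan_in)
        ]
    return signals[0]


if __name__ == "__main__":
    print(global_miss([True] * 7))            # every tile missed
    print(global_miss([True, False, True]))   # one tile hit, so no global miss
```

A tile that hits never forwards the request, so in this sketch a hit anywhere on the path simply shows up as a False input at the last level.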

9 Single-cycle tiles
Three networks
Headerless messages
No DC, RT, and VA
No virtual channels
Low ST/LT latency
Avoidance of multiple routing stages
Parallel data array access and switch allocation
XBar: 3 inputs, 2 outputs; low latency

10 Summary
Motivation
Introduction to L-NUCAs
Experimental Platform
Results
Conclusions

11 Simulator
Enhanced SimpleScalar 3.0d (Alpha) with cycle-accurate memory and network models
4-issue processor: speculative wake-up and selective recovery (similar to the Intel Pentium 4); 128-entry ROB; 64-entry LSQ
Load-to-use L1 miss penalty: 4 + cache latency
Memory system:
L1/RT: 32KB-4Way-32B (lat. 2 / init. rate 1) (2 ports)
L3: 8MB-16Way-128B (lat. 20 / init. rate 15)
16-entry L1/RT MSHR
32 nm technology and 19 FO4s cycle-time
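As a small worked example of the miss-penalty formula, the sketch below applies "4 + cache latency" to the L3 latency listed on this slide. It assumes that "cache latency" means the latency of the level that finally services the load; the slide does not state this explicitly. The names `FIXED_PENALTY_CYCLES` and `load_to_use_miss_penalty` are hypothetical.

```python
# The constant "4" from the slide; what it accounts for is not detailed there.
FIXED_PENALTY_CYCLES = 4

# L3 latency in cycles, as listed in the memory system configuration.
L3_LATENCY_CYCLES = 20


def load_to_use_miss_penalty(cache_latency_cycles):
    """Load-to-use penalty of an L1 miss, per the slide: 4 + cache latency.

    Assumption: cache_latency_cycles is the latency of the servicing level.
    """
    return FIXED_PENALTY_CYCLES + cache_latency_cycles


if __name__ == "__main__":
    print(load_to_use_miss_penalty(L3_LATENCY_CYCLES))  # 4 + 20 cycles
```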

12 Workload and Delay, Power, and Area Models
Workloads: all SPEC CPU 2006 benchmarks but one (483.xalancbmk could not be run on Alpha)
Delay, power, and area modelling: Cacti 5.3 and an improved Orion for caches and routers

13 Summary
Motivation
Introduction to L-NUCAs
Experimental Platform
Results: 3-level conventional cache vs. L-NUCA and L3; D-NUCA vs. L-NUCA and D-NUCA
Conclusions

14 Tested Scenarios
3-level conventional cache vs. L-NUCA and L3
D-NUCA vs. L-NUCA and D-NUCA

15 Average IPC, 3-level vs. L-NUCA
[Figure: average IPC; annotated gains: +6.1% and +15%]

16 Hierarchy energy, 3-level vs. L-NUCA
[Figure: hierarchy energy; annotated change: -14.2%]

17 IPC and Area Comparison
[Figure: IPC and area for L2-256KB, L2-512KB, LN2-72KB, LN3-144KB, and LN4-248KB; area labels: 0.91 mm², 1.29 mm², 0.86 mm², 1.59 mm², 0.46 mm²]
Small L-NUCA network overhead (14 to 19%)
The low density of L-NUCAs discourages the use of large sizes

18 Tested Scenarios
3-level conventional cache vs. L-NUCA and L3
D-NUCA vs. L-NUCA and D-NUCA

19 Average IPC, L-NUCA with D-NUCA
[Figure: average IPC; annotated gains: +4.2% and +6.8%]

20 Hierarchy Energy, L-NUCA with D-NUCA
[Figure: hierarchy energy; annotated value: 4.25%]

21 L-NUCA load-to-use latency
Configuration  IPC
L2-256KB       1.46
LN3-144KB      1.66
In 10 benchmarks, Le2 captures more than 75% of L2 read hits

22 Summary
Motivation
Introduction to L-NUCAs
Experimental Platform
Results
Conclusions

23 Conclusions & Future Work
L-NUCAs leverage the advantages of NoChips for NoCaches (low latency and high bandwidth) and reduce the inter-cache latency gap
Design based on 3 specialized networks conveying headerless messages
Performance and energy gains with conventional and D-NUCA LLCs
Future work:
Integrate L-NUCAs in CMP and SMT environments
Study the effect of prefetching for increasing spatial locality


25 L-NUCA summary

26 Out-of-Order processor pipeline

27 Out-of-Order processor pipeline

28 Tile internals
Networks: Search, Transport, Replacement
MA: Miss Address Register (Search)
U bf: Upstream buffer (Replacement)
D bf: Downstream buffer (Transport)
Every D and U buffer has 2 entries (2-cycle round-trip delay)
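The pairing of 2-entry buffers with a 2-cycle round-trip delay can be illustrated with a toy On/Off flow-control model. This is a sketch under assumptions, not the tile logic from the talk. The receiver publishes On only while its free slots cover the round trip, the sender acts on a signal that is one cycle old, and link delay is folded into that signal delay. The function `simulate_on_off` and its parameters are hypothetical.

```python
def simulate_on_off(buffer_entries, round_trip, stall_cycles, total_cycles):
    """Toy On/Off link; returns (messages_sent, max_buffer_occupancy).

    stall_cycles: set of cycle numbers in which the receiver cannot drain.
    """
    occupancy = 0
    max_occupancy = 0
    sent = 0
    # On/Off values still travelling towards the sender, oldest first.
    signals = [True] * (round_trip - 1)
    for cycle in range(total_cycles):
        # Receiver drains one buffered message unless it is stalled.
        if occupancy > 0 and cycle not in stall_cycles:
            occupancy -= 1
        # Receiver publishes On only if free slots cover the round trip.
        signals.append(buffer_entries - occupancy >= round_trip)
        # Sender acts on the oldest signal, so it reacts with a delay.
        if signals.pop(0):
            occupancy += 1
            sent += 1
        max_occupancy = max(max_occupancy, occupancy)
    return sent, max_occupancy


if __name__ == "__main__":
    always_stalled = set(range(6))
    print(simulate_on_off(2, 2, always_stalled, 6))  # stalled receiver
    print(simulate_on_off(2, 2, set(), 6))           # free-running receiver
    print(simulate_on_off(1, 2, set(), 6))           # buffer smaller than round trip
```

In this model, two entries never overflow when the receiver stalls and still allow one message per cycle when it does not, whereas a single entry cannot keep the link busy.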
