Light NUCA: a proposal for bridging the inter-cache latency gap Darío Suárez 1, Teresa Monreal 1, Fernando Vallejo 2, Ramón Beivide 2, and Victor Viñals.


Light NUCA: a proposal for bridging the inter-cache latency gap Darío Suárez 1, Teresa Monreal 1, Fernando Vallejo 2, Ramón Beivide 2, and Victor Viñals 1 1 Universidad de Zaragoza and 2 Universidad de Cantabria Spain

2 Load-to-Use cache latency trend

3 Facing the inter-cache latency gap
Reconfigurable L1/L2 (Balasubramonian et al., MICRO'00)
–single-ported memory cells: low bandwidth
NUCA (Kim et al., ASPLOS'02)
–wire-delay-dominated large caches
–routing-cache-routing network overhead
L-NUCA: L1 + small cache tiles + specialized networks
–low latency, high bandwidth, large associativity

4 Summary
Motivation
Introduction to L-NUCAs
Networking
–Topologies
–Routing
–Messages
Global Miss Determination
Single-cycle cache look-up plus one-hop routing
Experimental Platform
Results
Conclusions

5 L-NUCA introduction
(figure: latency, number of tiles, and size in KB for each L-NUCA level)

6 Topologies and Routing
Search, Transport, and Replacement are independent operations, which ensures deadlock avoidance
Search: broadcast tree, no flow control
Transport: 2D mesh, On/Off flow control, Dynamic Distributed Routing (DDR)
Replacement: latency-driven topology (blocks ordered by temporal locality), On/Off flow control, DDR
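A behavioral sketch of the search operation described above: a request fans out level by level over the broadcast tree, one level per cycle, and stops propagating on a hit. The tile contents and fan-out below are illustrative assumptions, not taken from the deck.

```python
# Sketch of the L-NUCA search broadcast: the request enters at the first
# level and fans out over the broadcast tree, one level per cycle.
# Tiles are modeled as sets of block addresses (an assumption for
# illustration only).
def search_levels(levels, addr):
    """Return the level (cycle) at which addr hits, or None on a global miss."""
    for cycle, tiles in enumerate(levels, start=1):
        if any(addr in tile for tile in tiles):
            return cycle  # hit: propagation stops here
    return None  # every level missed

# Level 1 holds one tile, level 2 holds two (fan-out is illustrative).
levels = [[{0x40}], [{0x80}, {0xC0}]]
print(search_levels(levels, 0xC0))  # 2: found at the second level
print(search_levels(levels, 0x100))  # None: global miss
```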

7 Headerless messages

Operation   | Message content           | Source     | Destination           | Width (bits or wires)
Search      | … + MSHR + st data + ctrl | r-tile     | rest of tiles         | 111
Transport   | block + MSHR              | tile (hit) | r-tile                | 260
Replacement | block                     | tile i     | tile k, lat(k)=lat(i) | 297

Assuming 32-byte blocks: no header overhead, so message = packet = flit = phit, with implicit destinations
Worst case, the three networks together need 111 + 260 + 297 = 668 wires; more than 1,000 m4/m5 wires fit on one side of an 8KB cache (Intel 32 nm, Natarajan et al., IEDM'08)
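The transport width above can be reproduced from the stated parameters: a 32-byte block is 256 bits, and the 16-entry L1/RT MSHR given on slide 11 needs a 4-bit index. The field breakdown is an assumption; the slide only gives the totals.

```python
# Sanity check of the transport message width on slide 7, assuming
# 32-byte blocks and a 4-bit MSHR index (16-entry MSHR from slide 11).
BLOCK_BITS = 32 * 8   # 32-byte cache block = 256 bits
MSHR_ID_BITS = 4      # log2(16 MSHR entries), an assumed encoding

def transport_width(block_bits=BLOCK_BITS, mshr_bits=MSHR_ID_BITS):
    """Transport messages carry the block plus the MSHR index."""
    return block_bits + mshr_bits

print(transport_width())  # 260, matching the slide
```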

8 Global Miss Determination Logic
Tiles stop miss propagation on hits
L-NUCA misses iff all last-level tiles miss
Scalable hierarchical organization, borrowed from SRAM bitlines (Yang and Kim, JSSC'05)
Miss determined one cycle after the last-level look-up
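The miss rule above is a simple reduction: the L-NUCA misses only if every last-level tile misses. Hardware realizes it hierarchically, like SRAM bitlines; the sketch below models only the logical function.

```python
# Behavioral model of the global miss determination logic: hits inside
# the L-NUCA stop miss propagation, so a global miss requires every
# last-level tile to report a miss. all() stands in for the
# hierarchical bitline-style reduction done in hardware.
def global_miss(last_level_tile_misses):
    """True iff all last-level tiles report a miss."""
    return all(last_level_tile_misses)

print(global_miss([True, True, True, True]))   # True: request leaves the L-NUCA
print(global_miss([True, False, True, True]))  # False: some tile hit
```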

9 Single-cycle tiles
Three networks, headerless messages
No DC, RT, and VA stages; no virtual channels
Low ST/LT latency
Avoids multiple routing stages
Parallel data array access and switch allocation
Crossbar: 3 inputs, 2 outputs, low latency

10 Summary
Motivation
Introduction to L-NUCAs
Experimental Platform
Results
Conclusions

11 Simulator
Enhanced SimpleScalar 3.0d (Alpha) with cycle-accurate memory and network models
4-issue processor: speculative wake-up and selective recovery (Intel Pentium 4-like), 128-entry ROB, 64-entry LSQ
Load-to-use L1 miss penalty: 4 cycles + cache latency
Memory system:
–L1/RT: 32KB, 4-way, 32B blocks (latency 2, initiation rate 1), 2 ports
–L3: 8MB, 16-way, 128B blocks (latency 20, initiation rate 15)
–16-entry L1/RT MSHR
32 nm technology, 19 FO4 cycle time
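The load-to-use rule above can be written down directly: an L1 miss pays 4 cycles plus the latency of the level that serves it. The latencies below are the ones the slide gives for L1/RT and L3; the intermediate level's latency depends on which tile hits.

```python
# Load-to-use latency rule from the simulator configuration:
# an L1 miss costs a fixed 4-cycle penalty plus the latency of the
# cache level that finally hits.
L1_LATENCY = 2     # L1/RT hit latency (cycles)
L3_LATENCY = 20    # L3 hit latency (cycles)
MISS_PENALTY = 4   # fixed L1 miss penalty (cycles)

def load_to_use(hit_level_latency):
    """Load-to-use cycles when an L1 miss is served at the given level."""
    return MISS_PENALTY + hit_level_latency

print(load_to_use(L3_LATENCY))  # 24 cycles for a load served by L3
```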

12 Workload and Delay, Power, and Area Models
Workloads: all SPEC CPU2006 benchmarks but one (unable to run 483.xalancbmk on Alpha)
Delay, power, and area modeling: CACTI 5.3 and an improved Orion for caches and routers

13 Summary
Motivation
Introduction to L-NUCAs
Experimental Platform
Results
–3-level conventional cache vs. L-NUCA and L3
–D-NUCA vs. L-NUCA and D-NUCA
Conclusions

14 Tested Scenarios
3-level conventional cache vs. L-NUCA and L3
D-NUCA vs. L-NUCA and D-NUCA

15 Average IPC, 3-level vs. L-NUCA
(chart: average IPC gains for L-NUCA, up to +15 %)

16 Hierarchy energy, 3-level vs. L-NUCA
(chart: L-NUCA reduces hierarchy energy by 14.2 %)

17 IPC and Area Comparison

Config | L2-256KB | L2-512KB | LN2-72KB | LN3-144KB | LN4-248KB
IPC    | …        | …        | …        | …         | …
Area   | 0.91 mm² | … mm²    | … mm²    | … mm²     | … mm²

Small L-NUCA network area overhead (14 to 19 %)
The low density of L-NUCAs discourages the use of large sizes

18 Tested Scenarios 3-level conventional cache vs. L-NUCA and L3 D-NUCA vs. L-NUCA and D-NUCA

19 Average IPC, L-NUCA with D-NUCA
(chart: average IPC; L-NUCA adds up to +4.2 % over D-NUCA)

20 Hierarchy Energy, L-NUCA with D-NUCA
(chart: 4.25 % hierarchy energy reduction)

21 L-NUCA load-to-use latency

Config    | IPC
L2-256KB  | 1.46
LN3-144KB | 1.66

In 10 benchmarks, Le2 captures more than 75 % of L2 read hits
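The two IPC figures above imply the speedup of the three-level L-NUCA over the conventional L2-256KB baseline, which can be checked with one division:

```python
# Speedup implied by the IPC figures on slide 21.
ipc_baseline = 1.46   # L2-256KB
ipc_lnuca = 1.66      # LN3-144KB

speedup = ipc_lnuca / ipc_baseline
print(f"{(speedup - 1) * 100:.1f}%")  # 13.7% higher IPC than the baseline
```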

22 Summary
Motivation
Introduction to L-NUCAs
Experimental Platform
Results
Conclusions

23 Conclusions & Future Work
L-NUCAs leverage the advantages of networks-on-chip for networks of caches (low latency and high bandwidth) and reduce the inter-cache latency gap
Design based on 3 specialized networks conveying headerless messages
Performance and energy gains over both conventional and D-NUCA LLCs
Future work:
–Integrate L-NUCAs in CMP and SMT environments
–Study the effect of prefetching on increasing spatial locality

Light NUCAs: a proposal for bridging the inter-cache latency gap Darío Suárez 1, Teresa Monreal 1, Fernando Vallejo 2, Ramón Beivide 2, and Victor Viñals 1 1 Universidad de Zaragoza and 2 Universidad de Cantabria Spain

25 L-NUCA summary

26 Out-of-Order processor pipeline

27 Out-of-Order processor pipeline

28 Tile internals
Three networks: Search, Transport, Replacement
MA: Miss Address register (search)
U bf: upstream buffer (replacement)
D bf: downstream buffer (transport)
Every D and U buffer has 2 entries (2-cycle round-trip delay)
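The 2-entry sizing above follows the standard On/Off flow-control rule: after a receiver asserts "off", one flit can still arrive per cycle of the round trip, so the buffer must hold at least round-trip-cycles worth of flits. A minimal sketch of that rule (the one-flit-per-cycle rate is an assumption matching single-flit messages):

```python
# Buffer sizing under On/Off flow control: once "off" is raised, flits
# already in flight keep arriving for one full round trip, so the
# receive buffer needs at least that many entries to avoid overflow.
def min_buffer_entries(round_trip_cycles, flits_per_cycle=1):
    """Smallest buffer that never overflows under On/Off flow control."""
    return round_trip_cycles * flits_per_cycle

print(min_buffer_entries(2))  # 2 entries, as in the L-NUCA tile buffers
```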