Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing. Barroso, Gharachorloo, McNamara, et al. Proceedings of the 27th Annual ISCA, June 2000.

Presentation transcript:

Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing. Barroso, Gharachorloo, McNamara, et al. Proceedings of the 27th Annual ISCA, June 2000. Presented by Garver Moore, ECE259 Spring 2006, Professor Daniel Sorin.

Motivation
Economic: high demand for OLTP machines; disconnect between the ILP focus of current designs and this demand
OLTP workloads:
-- High memory latency
-- Little ILP (get, process, store)
-- Large TLP
OLTP is unserved by aggressive ILP machines
Use "old" cores and an ASIC design methodology for "glueless," scalable OLTP machines with low development cost and time to market
Amdahl's Law

The Piranha Processing Node*
*Directly from Barroso et al.
Separate I/D L1 caches for each CPU
Logically shared, interleaved L2 cache
Eight memory controllers interface to a bank of up to 32 Rambus DRAM chips each; aggregate max bandwidth of 12.8 GB/sec
180 nm process (2000)
Almost entirely ASIC design: roughly 50% clock speed and 200% area versus a full-custom methodology
CPU: simple Alpha core (ECE152-style), single in-order 8-stage pipeline
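The slide notes that the L2 is logically shared and interleaved across the banks fronting the eight memory controllers. A minimal sketch of one plausible interleaving function, using low-order line-index bits so consecutive lines spread across banks (the bank count matches the eight controllers, but the line size and exact bit selection here are assumptions, not taken from the slide):

```python
LINE_SIZE = 64   # bytes per cache line (assumed)
NUM_BANKS = 8    # one L2 bank per memory controller

def l2_bank(addr: int) -> int:
    """Select the L2 bank for a physical address by taking the
    low-order bits of the line index, so that consecutive cache
    lines are interleaved round-robin across the banks."""
    return (addr // LINE_SIZE) % NUM_BANKS

# Consecutive lines land in consecutive banks, wrapping at 8.
for a in (0, 64, 128, 512):
    print(hex(a), "->", "bank", l2_bank(a))
```

Interleaving at line granularity keeps any single bank from becoming a hot spot when a CPU streams through sequential addresses.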

Communication Assist
+ Home Engine and Remote Engine support shared memory across multiple nodes
+ System Control tackles system miscellany: interrupts, exceptions, init, monitoring, etc.
+ OQ, Router, IQ, and Switch are standard
+ Total inter-node I/O bandwidth: 32 GB/sec
+ Each link and block here corresponds to actual wiring and modules
+ This allows rapid parallel development and a semi-custom design methodology
+ Also facilitates multiple clock domains
THERE IS NO INHERENT I/O CAPABILITY.

I/O Organization
+ Smaller than a processing node
+ Router has only 2 links, alleviating the need for a routing table
+ Memory is globally visible and part of the coherence scheme
+ CPU placement is optimized for drivers, translations, etc. with low-latency access needs to I/O
+ Re-used dL1 design provides the interface to the PCI/X bus
+ Supports arbitrary I/O-to-processing-node ratios and network topologies
+ Glueless scaling up to 1024 nodes of any type supports application-specific customization

Coherence: Local
+ Each L2 bank and its associated controller contain the directory data for intra-chip requests (a centralized directory)
+ The on-chip ICS is responsible for all on-chip communication
+ L2 is "non-inclusive": it acts as a "large victim buffer" for the L1s and keeps duplicate tags and state for L1 data
+ The L2 controller can therefore determine whether data is cached remotely, and if so whether exclusively; the majority of L1 requests then require no CA assist
+ On a request, L2 can service it directly, forward it to the owner L1, forward it to a protocol engine, or fetch from memory
+ While a forward is outstanding, L2 blocks conflicting requests

Coherence: Global
Trades ECC granularity for "free" directory-data storage (computing ECC at 4x granularity leaves 44 bits per 64-byte line)
Invalidation-based distributed directory protocol
Some optimizations:
No NACKing; deadlock avoidance through I/O-, L-, and H-priority virtual lanes
-- L: requests to the home node, low priority
-- H: forwarded requests and replies
Also guarantee forwards are always serviced by their targets: e.g., an owner writing back to home holds the data until home acknowledges
Removes NACK/retry traffic, as well as "ownership change" (DASH), retry counts (Origin), and "no, seriously" (Token) mechanisms
Routing old messages toward empty buffers gives a linear buffer dependence on N; buffer space is shared among lanes, and "CMI" invalidations avoid deadlock
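The "free" directory storage comes from standard SEC-DED arithmetic: wider ECC words need proportionally fewer check bits, so computing ECC at 4x granularity frees bits in each line for directory state. A small sketch of that arithmetic (the 64-bit-to-256-bit widening is the natural reading of "4x granularity"; treat the exact widths as an assumption):

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits for single-error-correct, double-error-detect ECC:
    the smallest r with 2**r >= data_bits + r + 1 (Hamming bound),
    plus one extra parity bit for double-error detection."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1

line_bits = 512                       # 64-byte cache line
per_64 = secded_check_bits(64)        # 8 check bits per 64-bit word
per_256 = secded_check_bits(256)      # 10 check bits per 256-bit word

# Per line: eight 64-bit ECC words vs. two 256-bit ECC words.
freed = (line_bits // 64) * per_64 - (line_bits // 256) * per_256
print(freed)  # -> 44 bits per 64-byte line available for the directory
```

Eight narrow words cost 64 check bits per line; two wide words cost only 20, and the 44-bit difference holds the directory entry at no extra SRAM cost.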

Evaluation Methodology
Admittedly favorable OLTP benchmarks chosen (modified TPC-B and TPC-D)
Simulated and compared against an aggressive OOO core (Alpha 21364) with integrated coherence and cache hardware
Results "fudged" to approximate full-custom effects
Four configurations evaluated: P1 (single-core 500 MHz Piranha), INO (1 GHz single-issue in-order aggressive core), OOO (4-issue 1 GHz out-of-order core), and P8 (the full 8-core Piranha system)

Results

Questions/Discussion
Deadlock avoidance without NACKs
CMP vs. SMP
"Fishy" evaluation methodology?
Specialized computing
Buildability?