Presented by: Nick Kirchem Feb 13, 2004

Presentation transcript:

Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing
Luiz A. Barroso et al. (Compaq Computer Corporation)
Presented by: Nick Kirchem, Feb 13, 2004

Target and Motivation
- Commercial applications (databases, OLTP)
  - Most important market for high-performance servers
  - Data-dependent computation (low ILP)
  - Little gained from complex multiple-issue, out-of-order processors
- Complexity of current processors
  - Long design times
  - High development costs
- Better use of transistors?

Project Goals
- Design a Chip Multiprocessing (CMP) system
  - Integrate 8 simple processor cores on a single chip
  - Exploit thread-level parallelism instead of ILP
- High performance, low cost
  - Achieve superior performance on commercial workloads
  - Small team, modest investment, short design time

Architecture Overview

Architecture Elements
- Simple processors (500 MHz, in-order)
- No I/O capability on chip (separate I/O nodes)
- Up to 1024 nodes in a system
- Individual L1 caches (64 KB, 2-way set-associative)
- One logical L2 cache, interleaved, 1 MB
- Intra-chip switch
  - Unidirectional crossbar
  - Transaction-based, atomic transfers
  - Bandwidth ~3x memory bandwidth
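A minimal sketch of what "interleaved" could mean for the single logical L2: consecutive cache lines map to different banks, so accesses spread across the bank controllers. The line size (64 bytes) and bank count (8) are assumptions chosen for illustration, and `l2_bank` is a hypothetical helper name; the slide only states that the L2 is interleaved.

```python
LINE_BYTES = 64  # assumed cache-line size (not stated on the slide)
NUM_BANKS = 8    # assumed number of L2 banks (not stated on the slide)


def l2_bank(addr: int) -> int:
    """Return the L2 bank that would hold the line containing `addr`.

    Uses the low-order bits of the line address, so neighbouring
    lines land in different banks.
    """
    line_number = addr // LINE_BYTES
    return line_number % NUM_BANKS


# Bank index for 16 consecutive cache lines starting at address 0.
banks = [l2_bank(a) for a in range(0, 16 * LINE_BYTES, LINE_BYTES)]
```

With these assumed parameters, a sequential walk through memory visits every bank once per 512 bytes, which is the property that lets the banks work in parallel.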

Intra-Chip Cache Coherence
- MESI protocol
- No inclusion (1 MB aggregate L1, 1 MB L2)
  - But L2 holds a copy of the L1 tags and state (no snooping required at L1)
  - L1 filled directly from memory (L2 = victim cache)
- Coherence handled by the L2 controllers
  - Can service a request directly, forward it to the owner L1, forward it to a protocol engine, or obtain the data from memory
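The four outcomes listed for the L2 controllers can be sketched as a small decision function. This is a simplified illustration, not the actual Piranha controller logic: the order of the checks, the data structures, and every name here (`Mesi`, `Action`, `route_read_miss`, `l1_tags`, `l2_lines`, `home_is_local`) are assumptions made for the example. It relies only on the point from the slide that the L2 keeps a copy of the L1 tags and state, so it can decide without snooping the L1s.

```python
from enum import Enum, auto


class Mesi(Enum):
    MODIFIED = auto()
    EXCLUSIVE = auto()
    SHARED = auto()
    INVALID = auto()


class Action(Enum):
    SERVICE_FROM_L2 = auto()
    FORWARD_TO_OWNER_L1 = auto()
    FORWARD_TO_PROTOCOL_ENGINE = auto()
    FETCH_FROM_MEMORY = auto()


def route_read_miss(line, requester, l1_tags, l2_lines, home_is_local):
    """Pick how an L2 controller might handle an L1 read miss.

    l1_tags:  {line: {core_id: Mesi}}, the L2's duplicate of the L1 tags
    l2_lines: set of lines currently held in the L2 (victim cache)
    home_is_local: True if this node's memory is the home of the line
    """
    # Another L1 on this chip holds a valid copy: let it supply the data.
    holders = [core for core, state in l1_tags.get(line, {}).items()
               if core != requester and state is not Mesi.INVALID]
    if holders:
        return Action.FORWARD_TO_OWNER_L1
    # The line was evicted from some L1 earlier and now sits in the L2.
    if line in l2_lines:
        return Action.SERVICE_FROM_L2
    # Not on chip: local memory if this node is home, else go off-node.
    if home_is_local:
        return Action.FETCH_FROM_MEMORY
    return Action.FORWARD_TO_PROTOCOL_ENGINE


# Example: core 0 misses on line 0x40 while core 3 holds it modified.
example = route_read_miss(0x40, 0, {0x40: {3: Mesi.MODIFIED}}, set(), True)
```

In this sketch `example` is `Action.FORWARD_TO_OWNER_L1`; with empty tag and L2 state the same call would fall through to memory or the protocol engine depending on `home_is_local`.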

Inter-Node Coherence
- Protocol engines (microprogrammable controllers)
  - Home: exports local memory
  - Remote: imports remote memory
- Directory storage
  - Compute ECC at coarse granularity and use the extra bits for directory info, so there is no memory space overhead
  - Directory granularity = 1 node (not an individual processor)
- Interconnect: I/O queues, router (point-to-point, 4 links)
- No NAKs: avoid deadlock through sufficient buffering, and guarantee that forwarded requests can be serviced
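A rough worked example of why coarser ECC frees bits for the directory. It uses the standard check-bit count for a single-error-correct, double-error-detect (SECDED) Hamming code. The 64-byte line and the two granularities compared (64-bit and 256-bit words) are assumptions for illustration; the slide only says "coarse granularity".

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits needed for a SECDED Hamming code over `data_bits`."""
    r = 1
    # Single-error correction needs 2**r >= data_bits + r + 1.
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1  # one extra overall parity bit for double-error detection


LINE_BITS = 64 * 8  # assumed 64-byte memory line


def ecc_bits_per_line(word_bits: int) -> int:
    """Total ECC bits for one line when ECC is computed per `word_bits`."""
    return (LINE_BITS // word_bits) * secded_check_bits(word_bits)


fine = ecc_bits_per_line(64)     # 8 words x 8 check bits
coarse = ecc_bits_per_line(256)  # 2 words x 10 check bits
freed_bits = fine - coarse       # bits available for directory state
```

Under these assumptions the memory already stores 64 ECC bits per line, the coarser code needs only 20, and the remaining 44 bits per line can hold directory information without any extra storage.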

Performance Evaluation
- OLTP and DSS workloads: TPC-B/D, Oracle database
- SimOS-Alpha environment
- Compared:
  - Piranha (P8) @ 500 MHz and Full-Custom (P8F) @ 1.25 GHz
  - Next-generation microprocessor (OOO) @ 1 GHz
- Single-chip evaluation
  - OOO outperforms P1 (individual processor) by 2.3x
  - P8 outperforms OOO by 3x
  - Speedup of P8 over P1 = 7x
- Multi-chip configurations
  - Four chips (only 4 CPUs per chip?!)
  - Results show that Piranha scales better than OOO
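A quick arithmetic check that the three single-chip numbers are consistent with each other. The variable names are made up for this example, and the parallel-efficiency figure is derived here from the quoted speedup and the 8 cores per chip; it is not a number stated on the slides.

```python
ooo_over_p1 = 2.3  # OOO vs. a single Piranha core (from the slide)
p8_over_ooo = 3.0  # 8-core Piranha vs. OOO (from the slide)
p8_over_p1 = 7.0   # 8-core Piranha vs. a single core (from the slide)
cores = 8

# Chaining the first two ratios should roughly reproduce the third.
implied_p8_over_p1 = ooo_over_p1 * p8_over_ooo

# Fraction of ideal linear speedup achieved on 8 cores.
parallel_efficiency = p8_over_p1 / cores
```

The chained ratio comes to about 6.9, in line with the quoted 7x, and a 7x speedup on 8 cores corresponds to 87.5% of ideal linear scaling.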

Questions/Concerns
- Would the Piranha design be worthwhile if there were a well-designed SMT processor (with 4 or 8 threads)?
- Reliability better or worse with multiple chips per processor?
- Power consumption?