Presented by: Nick Kirchem Feb 13, 2004


Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing
Luiz A. Barroso et al. (Compaq Computer Corporation)

Target and Motivation
- Commercial applications (databases, OLTP)
  - Most important market for high-performance servers
  - Data-dependent computation (low ILP)
  - Little gained from complex multiple-issue, out-of-order processors
- Complexity of current processors
  - Long design times
  - High development costs
- Better use of transistors?

Project Goals
- Design a chip multiprocessing (CMP) system
  - Integrate 8 simple processor cores on a single chip
  - Exploit thread-level parallelism instead of ILP
- High performance, low cost
  - Achieve superior performance on commercial workloads
  - Small team, modest investment, short design time

Architecture Overview

Architecture Elements
- Simple processors (500 MHz, in-order)
- No I/O capability on chip (separate I/O nodes)
- Up to 1024 nodes in a system
- Individual L1 caches (64 KB, 2-way set-associative)
- One logical L2 cache, interleaved, 1 MB
- Intra-chip switch
  - Unidirectional crossbar
  - Transaction-based, atomic transfers
  - Bandwidth ~3x memory bandwidth
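
As a rough illustration of what "one logical L2 cache, interleaved" can mean, the sketch below maps a physical address to an L2 bank using the low-order bits of the cache-line number. The bank count (8) and line size (64 bytes) are assumptions chosen for illustration and are not stated on the slide; `l2_bank` is a hypothetical helper, not part of any real tool.

```python
LINE_BYTES = 64   # assumed cache-line size (not stated on the slide)
NUM_BANKS = 8     # assumed number of L2 banks (not stated on the slide)


def l2_bank(addr: int) -> int:
    """Return the index of the L2 bank that would hold the line containing addr."""
    line_number = addr // LINE_BYTES   # drop the byte offset within the line
    return line_number % NUM_BANKS     # consecutive lines rotate across banks


if __name__ == "__main__":
    for addr in (0x0000, 0x0040, 0x0080, 0x01C0, 0x0200):
        print(hex(addr), "-> bank", l2_bank(addr))
```

With this mapping, consecutive cache lines land in different banks, so sequential accesses spread across all banks instead of queuing at one.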

Intra-Chip Cache Coherence
- MESI protocol
- No inclusion (1 MB aggregate L1, 1 MB L2)
- But L2 holds a copy of the L1 tags and state (no snooping required at L1)
- L1 filled directly from memory (L2 = victim cache)
- Coherence handled by L2 controllers
  - Can service the request directly, forward to owner L1, forward to protocol engine, or obtain from memory
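
The four outcomes listed for an L2 controller can be sketched as a simple dispatch on the duplicate L1 tags. This is a simplified model under stated assumptions: it ignores MESI states, the order of the checks is a guess, and the `L2Bank` class, its fields, and `route` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class L2Bank:
    # Duplicate L1 tags: line address -> id of the L1 that currently owns the line.
    l1_owner: Dict[int, int] = field(default_factory=dict)
    # Lines held in this L2 bank (victims evicted from the L1s).
    l2_lines: Set[int] = field(default_factory=set)
    # Lines whose home memory is on this node (assumed bookkeeping).
    local_lines: Set[int] = field(default_factory=set)

    def route(self, line: int, requester: int) -> str:
        """Decide how an L1 miss for `line` from L1 `requester` is handled."""
        owner = self.l1_owner.get(line)
        if owner is not None and owner != requester:
            return f"forward to owner L1 {owner}"   # another L1 has the line
        if line in self.l2_lines:
            return "service from L2"                # victim copy is on chip
        if line in self.local_lines:
            return "obtain from memory"             # home memory is local
        return "forward to protocol engine"         # line lives on a remote node


if __name__ == "__main__":
    bank = L2Bank(l1_owner={0x10: 3}, l2_lines={0x20},
                  local_lines={0x10, 0x20, 0x30})
    for line in (0x10, 0x20, 0x30, 0x40):
        print(hex(line), "->", bank.route(line, requester=0))
```

Because the L2 controller consults its own copy of the L1 tags, the L1 caches themselves are never probed unless they actually own the line.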

Inter-Node Coherence
- Protocol engines (microprogrammable controllers)
  - Home: exports local memory
  - Remote: imports remote memory
- Directory storage
  - Compute ECC at coarse granularity and use the extra bits for directory info, so there is no memory space overhead
  - Directory granularity = 1 node (not individual processor)
- Interconnect: I/O queues, router (point-to-point, 4 links)
- No NAKs: avoid deadlock by sufficient buffering, and guarantee forwarded requests can be serviced
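
To see why computing ECC at a coarser granularity frees bits for the directory, the sketch below counts check bits for a standard Hamming SEC-DED code. The 64-byte line, the conventional 64-bit granularity, and the 256-bit coarse granularity are assumptions for illustration; none of them is stated on the slide, and the helper names are hypothetical.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits needed by a Hamming SEC-DED code protecting data_bits bits."""
    r = 0
    while (1 << r) < data_bits + r + 1:   # Hamming bound for single-error correction
        r += 1
    return r + 1                          # one extra parity bit for double-error detection


LINE_BITS = 64 * 8  # assumed 64-byte memory line


def ecc_bits_per_line(granularity_bits: int) -> int:
    """Total ECC bits per line when each granularity_bits block is protected separately."""
    blocks = LINE_BITS // granularity_bits
    return blocks * secded_check_bits(granularity_bits)


fine = ecc_bits_per_line(64)     # conventional: ECC per 64-bit word
coarse = ecc_bits_per_line(256)  # coarse: ECC per 256-bit block (assumed)
freed = fine - coarse            # bits left over for directory state

print(fine, coarse, freed)       # prints: 64 20 44
```

Under these assumptions, the memory still stores 64 extra bits per line, but only 20 are needed for ECC, leaving 44 for directory information at no additional storage cost.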

Performance Evaluation
- OLTP and DSS workloads: TPC-B/D, Oracle database
- SimOS-Alpha environment
- Compared:
  - Piranha (P8) @ 500 MHz and full-custom (P8F) @ 1.25 GHz
  - Next-generation microprocessor (OOO) @ 1 GHz
- Single-chip evaluation
  - OOO outperforms P1 (a single Piranha processor) by 2.3x
  - P8 outperforms OOO by 3x
  - Speedup of P8 over P1 = 7x
- Multi-chip configurations
  - Four chips (only 4 CPUs per chip ?!)
  - Results show that Piranha scales better than OOO
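
The single-chip figures can be cross-checked with a little arithmetic: multiplying the two reported ratios should give roughly the reported P8-over-P1 speedup. The scaling-efficiency figure below is derived here for illustration and is not a number from the slides.

```python
ooo_over_p1 = 2.3   # from the slide: OOO vs. one Piranha core
p8_over_ooo = 3.0   # from the slide: 8-core Piranha vs. OOO
cores = 8           # P8 integrates 8 cores

p8_over_p1 = ooo_over_p1 * p8_over_ooo   # implied speedup of P8 over P1
efficiency = p8_over_p1 / cores          # fraction of ideal 8x scaling (derived)

print(f"implied P8/P1 speedup: {p8_over_p1:.1f}x")   # prints: implied P8/P1 speedup: 6.9x
print(f"scaling efficiency: {efficiency:.0%}")        # prints: scaling efficiency: 86%
```

The implied 6.9x agrees with the reported 7x after rounding, so the three single-chip numbers are mutually consistent.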

Questions/Concerns
- Would the Piranha design be worthwhile if there were a well-designed SMT processor (with 4 or 8 threads)?
- Reliability better or worse with multiple chips per processor?
- Power consumption?