EPL 605: Advanced Computer Architecture. Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing. Luiz André Barroso, Kourosh Gharachorloo, et al.


EPL 605: Advanced Computer Architecture
Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing
Luiz André Barroso, Kourosh Gharachorloo, Robert McNamara, Andreas Nowatzyk, Shaz Qadeer, Barton Sano, Scott Smith, Robert Stets, and Ben Verghese
In Proceedings of the 27th Annual International Symposium on Computer Architecture, June 2000

Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing
Problem: complex processors are ill-suited for commercial applications
Solution: the chip multiprocessing (CMP) approach
Piranha system:
- Research prototype developed at Compaq
- Exploits CMP: integrates 8 simple Alpha processor cores with a two-level cache hierarchy on a single chip
Piranha's unique design choices:
- Shared second-level cache with no inclusion
- Highly optimized cache coherence protocol
- Novel I/O architecture

Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing
Behavior of commercial workloads relative to technical workloads:
- Large memory stalls
- Data-dependent nature of the computation and lack of ILP
- No use of high-performance FP and multimedia functionality
Techniques:
- SMT (simultaneous multithreading)
- CMP (chip multiprocessing)
Goal of Piranha: build a system that achieves superior performance on commercial workloads

Piranha: Architecture Overview

Alpha CPU Core and L1 Caches
CPU:
- Single-issue, in-order design capable of executing the Alpha ISA
- 500 MHz pipelined datapath
- Performance-enhancing features: branch target buffer, pre-compute logic for branch conditions, fully bypassed datapath
L1 caches:
- 64KB two-way set-associative blocking caches
- 2-bit state field per cache line, encoding the 4 states of a typical MESI protocol
- I-cache and D-cache are kept coherent by hardware
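The link between a 2-bit state field and four MESI states can be made concrete with a small sketch. The numeric encoding and the helper names `pack_line` and `unpack_line` are illustrative assumptions; the slides do not give the actual bit assignment.

```python
from enum import IntEnum
from typing import Tuple


class MESI(IntEnum):
    """The four MESI states; this bit assignment is an assumption."""
    INVALID = 0b00
    SHARED = 0b01
    EXCLUSIVE = 0b10
    MODIFIED = 0b11


STATE_BITS = 2  # 2 bits are enough to distinguish 4 states


def pack_line(tag: int, state: MESI) -> int:
    """Store the state in the low bits next to the tag (hypothetical layout)."""
    return (tag << STATE_BITS) | int(state)


def unpack_line(word: int) -> Tuple[int, MESI]:
    """Recover (tag, state) from a packed word."""
    mask = (1 << STATE_BITS) - 1
    return word >> STATE_BITS, MESI(word & mask)
```

Every value a 2-bit field can hold maps to exactly one state, so no encoding is wasted.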

Intra-Chip Switch
- Uses a uni-directional, push-only interface
- The initiator sources the data; if the destination is ready, the ICS schedules the data transfer
- A grant is issued to the initiator to commence the data transfer
- The destination receives a request signal carrying the ID of the initiator and the type of transfer
- Each port to the ICS consists of 2 independent datapaths
- Implemented by a set of 8 internal datapaths
- Supports 2 logical lanes (low and high priority)
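The grant-based, push-only handshake can be sketched as a toy scheduler. This is a simplified model under stated assumptions, not the hardware design: the class name `IntraChipSwitch`, serving the high-priority lane first, and the explicit `release` call are all hypothetical.

```python
class IntraChipSwitch:
    """Toy model: a transfer is granted only if its destination is ready
    and an internal datapath is free."""

    def __init__(self, num_datapaths=8):
        self.free_datapaths = num_datapaths
        # Two logical lanes: 0 = low priority, 1 = high priority.
        self.pending = {0: [], 1: []}

    def request(self, initiator, dest, kind, lane=0):
        """An initiator asks to push data of type `kind` to `dest`."""
        self.pending[lane].append((initiator, dest, kind))

    def release(self):
        """A finished transfer returns its datapath."""
        self.free_datapaths += 1

    def schedule(self, ready):
        """Issue grants; assumption: the high-priority lane is served first."""
        grants = []
        for lane in (1, 0):
            waiting = []
            for transfer in self.pending[lane]:
                _initiator, dest, _kind = transfer
                if dest in ready and self.free_datapaths > 0:
                    self.free_datapaths -= 1
                    grants.append(transfer)  # grant goes to the initiator
                else:
                    waiting.append(transfer)
            self.pending[lane] = waiting
        return grants
```

With a single free datapath, a high-priority fill is granted before a low-priority read, and the read is granted once the datapath is released.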

Second-Level Cache
- 1MB unified instruction/data cache, physically partitioned into 8 banks
- Each bank is 8-way set associative and uses a round-robin replacement policy
- L2 controllers are responsible for intra-chip coherence and cooperate with the protocol engines to enforce inter-chip coherence
Non-inclusive on-chip cache:
- A duplicate copy of the L1 tags and state is kept at the L2 controllers
- L1 misses that also miss in L2 are filled directly from memory
- L2 behaves as a victim cache
- The duplicate L1 state is extended to include “ownership”
Intra-chip coherence protocol:
- The L2 controllers are responsible for it
- Similar to a full-map, centralized directory-based protocol
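A minimal sketch of the bank partitioning and round-robin replacement follows. The 8 banks and 8 ways come from the slide; the 64-byte line size, the choice of low-order line-address bits for bank selection, and the names `bank_of` and `RoundRobinSet` are assumptions made for illustration.

```python
NUM_BANKS = 8    # from the slide
WAYS = 8         # from the slide
LINE_BYTES = 64  # assumption: line size is not stated on the slide


def bank_of(addr):
    """Pick a bank from the low-order bits of the line address (assumed)."""
    return (addr // LINE_BYTES) % NUM_BANKS


class RoundRobinSet:
    """One cache set whose victim pointer advances on every insertion."""

    def __init__(self, ways=WAYS):
        self.tags = [None] * ways
        self.next_victim = 0

    def lookup(self, tag):
        return tag in self.tags

    def insert(self, tag):
        """Fill the way under the pointer; return the evicted tag or None."""
        evicted = self.tags[self.next_victim]
        self.tags[self.next_victim] = tag
        self.next_victim = (self.next_victim + 1) % len(self.tags)
        return evicted
```

Round-robin needs only one pointer per set and no per-access bookkeeping, which is cheaper than tracking recency as LRU does.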

Piranha architecture
Memory controller:
- Does not have direct access to the ICS; it is controlled by and routed through the L2 controller
- Two parts: 1) RAC and 2) memory controller engine
Protocol engines:
- Home engine: responsible for exporting memory whose home is at the local node
- Remote engine: imports memory whose home is remote

Inter-node Coherence Protocol
- Invalidation-based directory protocol
- Support for 4 request types: read, read-exclusive, exclusive, exclusive-without-data
- Supported features: clean-exclusive optimization, reply forwarding from remote owner, eager exclusive replies
- Unique property: avoids NAKs
Unique techniques:
1) Network uses “hot potato” routing
2) Buffer space is shared among all lanes
3) Cruise-missile-invalidates (CMI)
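How an invalidation-based directory reacts to the four request types can be sketched as below. This is a deliberately simplified model, not Piranha's protocol: it ignores reply forwarding, eager replies, and CMI, and the `DirectoryEntry` class and its bookkeeping are hypothetical.

```python
from enum import Enum


class Request(Enum):
    READ = "read"
    READ_EXCLUSIVE = "read-exclusive"
    EXCLUSIVE = "exclusive"
    EXCLUSIVE_WITHOUT_DATA = "exclusive-without-data"


class DirectoryEntry:
    """Per-line directory state: a set of sharers or a single owner."""

    def __init__(self):
        self.sharers = set()
        self.owner = None

    def handle(self, node, req):
        """Update the entry; return the nodes whose copies must be invalidated."""
        if req is Request.READ:
            if self.owner is not None:
                # Simplification: the previous owner is downgraded to a sharer.
                self.sharers.add(self.owner)
                self.owner = None
            self.sharers.add(node)
            return set()
        # All three exclusive variants leave the requester as the only holder.
        holders = set(self.sharers)
        if self.owner is not None:
            holders.add(self.owner)
        self.sharers = set()
        self.owner = node
        return holders - {node}
```

In this sketch the exclusive variants differ only in whether data travels with the reply, which the model does not track.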

System Interconnect
- Output queue (OQ): accepts packets via the packet switch from the protocol engines or from the system controller
- Router: transmits and receives packets to and from other nodes
- Input queue (IQ): receives packets that are addressed to the local node and forwards them to the target module via the packet switch

Reliability Features
RAS features:
- Redundancy on all memory components
- CRC protection on most datapaths
- Redundant datapaths
- Protocol error recovery
- Error logging
- Hot-swappable links
- In-band system reconfiguration support

Evaluation
Workloads:
- DSS workload with the TPC-D benchmark
- OLTP workload with the TPC-B benchmark
Simulation environment:
- SimOS-Alpha: simulates the hardware components of Alpha-based multiprocessors
Simulated architectures

Performance Evaluation of Piranha

Conclusions
- Use of CMP in future multiprocessor designs
- Piranha: the evaluation shows that it outperforms the other designs