Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing (Barroso, Gharachorloo, McNamara, et al., Proceedings of the 27th Annual ISCA, June 2000)

Presentation transcript:

Computer Science and Engineering
Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing
Barroso, Gharachorloo, McNamara, et al., Proceedings of the 27th Annual ISCA, June 2000
Presented by Wael Kdouh, Spring 2006
Professor: Dr. Hesham El-Rewini

Motivation
Economic: high demand for OLTP (on-line transaction processing) machines
Disconnect between the industry's ILP focus and this demand
OLTP: high memory latency, little ILP (get, process, store), large TLP
OLTP is therefore poorly served by aggressive ILP machines
Use "old" cores and an ASIC design methodology for "glueless," scalable OLTP machines with low development cost and short time to market
Short wires, as opposed to costly and slow long wires that can limit cycle time
Amdahl's Law

Other Innovations
The shared second-level cache uses a sophisticated protocol that does not enforce inclusion of the first-level instruction and data caches, in order to maximize utilization of the on-chip caches.
The cache coherence protocol among nodes incorporates a number of unique features that result in fewer protocol messages and lower protocol-engine occupancies compared to other designs.
It has a unique I/O architecture, with an I/O node that is a full-fledged member of the interconnect and of the global shared-memory coherence protocol.

The Piranha Processing Node
Separate instruction and data L1 caches (64 KB, 2-way set-associative) for each CPU
Logically shared, interleaved L2 cache (1 MB)
Eight memory controllers, each interfacing to a bank of up to 32 Rambus DRAM chips; aggregate peak bandwidth of 12.8 GB/s (a back-of-the-envelope check follows below)
180 nm process (2000), almost entirely ASIC design: roughly 50% of the clock speed and 200% of the area of a full-custom methodology
CPU: simple Alpha core (ECE152-class), single-issue, in-order, 8-stage pipeline, 500 MHz
The Intra-Chip Switch (ICS) is a unidirectional crossbar
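A minimal back-of-the-envelope sketch (in Python) of how these numbers fit together. It assumes 64-byte cache lines and one L2 bank and one Rambus channel per memory controller; none of these details are stated on this slide, so treat the figures as illustrative.

```python
# Hedged sanity check of the Piranha node parameters quoted above.
# Assumptions: 64-byte lines, one L2 bank and one Rambus channel per controller.

def num_sets(cache_bytes: int, ways: int, line_bytes: int) -> int:
    """Sets in a set-associative cache = capacity / (associativity * line size)."""
    return cache_bytes // (ways * line_bytes)

LINE = 64                                          # assumed line size in bytes
print("L1 sets:", num_sets(64 * 1024, 2, LINE))    # 512 sets per 64 KB, 2-way L1

L2_BYTES, L2_BANKS = 1 * 1024 * 1024, 8            # assuming one bank per controller
print("L2 bytes per bank:", L2_BYTES // L2_BANKS)  # 131072 bytes = 128 KB per bank

# 12.8 GB/s aggregate bandwidth spread across 8 memory controllers.
print("GB/s per controller:", 12.8 / 8)            # 1.6 GB/s, one Rambus channel each
```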

Communication Assist
+ The Home Engine (exporting) and Remote Engine (importing) support shared memory across multiple nodes
+ System Control handles system miscellany: interrupts, exceptions, initialization, monitoring, etc.
+ The Output Queue (OQ), Router, Input Queue (IQ), and Packet Switch form the standard interconnect that links multiple Piranha chips
+ Total inter-node interconnect bandwidth: 32 GB/s
+ Each link and block here corresponds to actual wiring and a module; the processing node has no inherent I/O capability (I/O is handled by the separate I/O node)

I/O Organization
The I/O node is smaller than the processing node
Its Router has only 2 links, which eliminates the need for a routing table
Its memory is globally visible and part of the coherence scheme
The on-chip CPU is placed for low-latency access to I/O, which drivers, translations, etc. need
A re-used L1 data-cache design provides the interface to the PCI/X interface
Supports an arbitrary I/O-node-to-processing-node ratio and arbitrary network topology
Glueless scaling up to 1024 nodes of either type supports application-specific customization

Piranha System

Coherence: Local
Each L2 bank and its associated controller holds the directory data for intra-chip requests (a centralized on-chip directory)
The chip's ICS is responsible for all on-chip communication
The L2 is "non-inclusive": it acts as a "large victim buffer" for the L1s while keeping duplicate copies of the L1 tags and state
The L2 controller can therefore determine whether data is cached remotely, and if so whether exclusively; the majority of L1 requests then require no communication-assist involvement
On a request, the L2 can service it directly, forward it to the owner L1, forward it to a protocol engine, or fetch the line from memory (see the sketch below)
While a request is being forwarded, the L2 blocks conflicting requests to the same line
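The last three bullets amount to a small dispatch decision at the home L2 bank. Below is a purely illustrative Python sketch of that decision; the inputs (L2 hit, owning L1 from the duplicate tags, remote caching from the directory) and their priority are assumptions for exposition, not Piranha's actual controller logic, which also depends on the request type and protocol state.

```python
# Illustrative sketch of how a home L2 bank might dispatch an intra-chip L1 miss.
# Not Piranha's real controller: a real design also considers read vs. write,
# exclusivity, and transient protocol states.

from enum import Enum, auto
from typing import Optional

class Action(Enum):
    SERVICE_FROM_L2 = auto()             # data present in this L2 bank
    FORWARD_TO_OWNER_L1 = auto()         # duplicate L1 tags name an on-chip owner
    FORWARD_TO_PROTOCOL_ENGINE = auto()  # directory says the line is cached remotely
    FETCH_FROM_MEMORY = auto()           # not cached anywhere on or off chip

def handle_l1_miss(hit_in_l2: bool, owner_l1: Optional[int], cached_remotely: bool) -> Action:
    """Decide how the home L2 bank services an L1 miss (simplified)."""
    if cached_remotely:
        return Action.FORWARD_TO_PROTOCOL_ENGINE
    if owner_l1 is not None:
        return Action.FORWARD_TO_OWNER_L1
    if hit_in_l2:
        return Action.SERVICE_FROM_L2
    return Action.FETCH_FROM_MEMORY

# Example: the L2 is non-inclusive, so a line may miss in L2 yet live in another L1.
print(handle_l1_miss(hit_in_l2=False, owner_l1=3, cached_remotely=False))
# -> Action.FORWARD_TO_OWNER_L1
```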

Coherence: Global
Trades ECC granularity for "free" directory storage: computing ECC at 4x coarser granularity leaves 44 bits per 64-byte line for directory state (see the sketch below)
Invalidation-based distributed directory protocol
Some optimizations:
No NACKing; deadlock is avoided through I/O-, L-, and H-priority virtual lanes (L: requests to the home node, low priority; H: forwarded requests and replies, high priority)
Forwards are also guaranteed to be serviced by their targets: e.g. an owner writing back to the home holds the data until the home acknowledges
This removes NACK/retry traffic, as well as the "ownership change" messages of DASH, the retry counts of Origin, and the "No, seriously" persistent requests of Token Coherence
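The "44 bits per line" figure follows from simple SEC-DED arithmetic. The sketch below assumes a conventional design computes SEC-DED ECC per 64-bit word while Piranha computes it per 256-bit block (the 4x granularity), over a 64-byte cache line; these exact parameters are an assumption, since the slide does not spell them out.

```python
# Rough arithmetic behind "4x ECC granularity frees 44 bits per 64-byte line".

def secded_check_bits(data_bits: int) -> int:
    """SEC-DED check bits: smallest r with 2**r >= data_bits + r + 1,
    plus one extra parity bit for double-error detection."""
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1

LINE_BITS = 64 * 8  # assumed 64-byte line = 512 data bits

per_word  = (LINE_BITS // 64)  * secded_check_bits(64)   # 8 words  * 8 bits  = 64
per_block = (LINE_BITS // 256) * secded_check_bits(256)  # 2 blocks * 10 bits = 20

print("ECC bits per line at 64-bit granularity: ", per_word)
print("ECC bits per line at 256-bit granularity:", per_block)
print("Bits freed for directory state:", per_word - per_block)  # 44
```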

Evaluation Methodology
Admittedly favorable OLTP benchmarks chosen (modified TPC-B and TPC-D)
Simulated and compared against the performance of an aggressive out-of-order core (Alpha 21364) with integrated coherence and cache hardware
Results "fudged" to approximate the full-custom effect
Four configurations evaluated: P1 (one Piranha core, 500 MHz), INO (1 GHz single-issue in-order aggressive core), OOO (4-issue, 1 GHz), and P8 (the full eight-core Piranha chip)

Parameters for different processor designs.

Results

Performance Evaluation
OLTP and DSS workloads: TPC-B/TPC-D on an Oracle database, run in the SimOS-Alpha environment
Compared: Piranha at 500 MHz (and a full-custom variant at 1.25 GHz) against a next-generation out-of-order microprocessor (OOO) at 1 GHz
Single-chip evaluation:
OOO outperforms P1 (an individual Piranha core) by 2.3x
P8 outperforms OOO by 3x
Speedup of P8 over P1 is about 7x, consistent with the product of the two (checked below)
Multi-chip configurations:
Four chips (only 4 CPUs per chip?!)
Results show that Piranha scales better than OOO
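For what it's worth, the three reported speedups are mutually consistent; a one-line check:

```python
# 2.3x (OOO over P1) composed with 3x (P8 over OOO) gives about the 7x reported.
print(f"Implied P8 over P1: {2.3 * 3.0:.1f}x")  # 6.9x
```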

Questions/Discussion
Evaluation methodology?
Would the Piranha design be worthwhile if there were a well-designed SMT processor (with 4 or 8 threads)?
Is reliability better or worse with multiple processors per chip?
Power consumption?

Conclusion
The authors maintain that:
1) The use of chip multiprocessing is inevitable in future microprocessor designs.
2) As more transistors become available, further increasing on-chip cache sizes or building more complex cores will only lead to diminishing performance gains and possibly longer design cycles.
Given the enormous emphasis that Intel engineers are placing on massive L2 caches, Intel appears to disagree.
Given the huge investment that both Intel and Compaq/HP have put into the Itanium family, and the fact that Alpha is a moribund architecture, it is unlikely that the innovative Piranha microprocessor will ever see the light of day.

The Future
No more penguins to eat…
Harvey G. Cragon, in his paper "Forty Five Years of Computer Architecture—All That's Old is New Again," finds that most of the performance-improvement advances in computer microarchitecture have been based on the exploitation of only two ideas: locality and pipelining.
In my personal opinion, the upcoming years are going to exploit two ideas: SMT and CMP.

Questions