The Stanford FLASH Multiprocessor

Slides:



Advertisements
Similar presentations
L.N. Bhuyan Adapted from Patterson’s slides
Advertisements

Distributed Systems CS
Slides Prepared from the CI-Tutor Courses at NCSA By S. Masoud Sadjadi School of Computing and Information Sciences Florida.
Extensibility, Safety and Performance in the SPIN Operating System Presented by Allen Kerr.
AMLAPI: Active Messages over Low-level Application Programming Interface Simon Yau, Tyson Condie,
Helper Threads via Virtual Multithreading on an experimental Itanium 2 processor platform. Perry H Wang et. Al.
Multiple Processor Systems
CS 258 Parallel Computer Architecture Lecture 15.1 DASH: Directory Architecture for Shared memory Implementation, cost, performance Daniel Lenoski, et.
The Stanford Directory Architecture for Shared Memory (DASH)* Presented by: Michael Bauer ECE 259/CPS 221 Spring Semester 2008 Dr. Lebeck * Based on “The.
Cache Coherent Distributed Shared Memory. Motivations Small processor count –SMP machines –Single shared memory with multiple processors interconnected.
DDM – A Cache Only Memory Architecture Hagersten, Landin, and Haridi (1991) Presented by Patrick Eibl.
Latency Tolerance: what to do when it just won’t go away CS 258, Spring 99 David E. Culler Computer Science Division U.C. Berkeley.
Multiple Processor Systems Chapter Multiprocessors 8.2 Multicomputers 8.3 Distributed systems.
Disco: Running Commodity Operating Systems on Scalable Multiprocessors Bugnion et al. Presented by: Ahmed Wafa.
G Robert Grimm New York University Disco.
DISTRIBUTED CONSISTENCY MANAGEMENT IN A SINGLE ADDRESS SPACE DISTRIBUTED OPERATING SYSTEM Sombrero.
1  1998 Morgan Kaufmann Publishers Chapter 9 Multiprocessors.
Chapter Hardwired vs Microprogrammed Control Multithreading
1 Interfacing Processors and Peripherals I/O Design affected by many factors (expandability, resilience) Performance: — access latency — throughput — connection.
Disco Running Commodity Operating Systems on Scalable Multiprocessors.
Active Messages: a Mechanism for Integrated Communication and Computation von Eicken et. al. Brian Kazian CS258 Spring 2008.
Chapter 17 Parallel Processing.
CS533 - Concepts of Operating Systems
Flexible Snooping: Adaptive Forwarding and Filtering of Snoops in Embedded-Ring Multiprocessors Karin Strauss, Xiaowei Shen*, Josep Torrellas University.
Implications for Programming Models Todd C. Mowry CS 495 September 12, 2002.
7/2/2015 slide 1 PCOD: Scalable Parallelism (ICs) Per Stenström (c) 2008, Sally A. McKee (c) 2011 Scalable Multiprocessors What is a scalable design? (7.1)
User-Level Interprocess Communication for Shared Memory Multiprocessors Brian N. Bershad, Thomas E. Anderson, Edward D. Lazowska, and Henry M. Levy Presented.
CS252/Patterson Lec /28/01 CS 213 Lecture 10: Multiprocessor 3: Directory Organization.
CS 258 Parallel Computer Architecture LimitLESS Directories: A Scalable Cache Coherence Scheme David Chaiken, John Kubiatowicz, and Anant Agarwal Presented:
DDM - A Cache-Only Memory Architecture Erik Hagersten, Anders Landlin and Seif Haridi Presented by Narayanan Sundaram 03/31/2008 1CS258 - Parallel Computer.
Multiple Processor Systems. Multiprocessor Systems Continuous need for faster and powerful computers –shared memory model ( access nsec) –message passing.
August 15, 2001Systems Architecture II1 Systems Architecture II (CS ) Lecture 12: Multiprocessors: Non-Uniform Memory Access * Jeremy R. Johnson.
SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill,
Architectural Characterization of an IBM RS6000 S80 Server Running TPC-W Workloads Lei Yang & Shiliang Hu Computer Sciences Department, University of.
Main Memory CS448.
Using Prediction to Accelerate Coherence Protocols Authors : Shubendu S. Mukherjee and Mark D. Hill Proceedings. The 25th Annual International Symposium.
I/O Computer Organization II 1 Interconnecting Components Need interconnections between – CPU, memory, I/O controllers Bus: shared communication channel.
The Cosmic Cube Charles L. Seitz Presented By: Jason D. Robey 2 APR 03.
Computer Network Lab. Korea University Computer Networks Labs Se-Hee Whang.
RSIM: An Execution-Driven Simulator for ILP-Based Shared-Memory Multiprocessors and Uniprocessors.
CALTECH cs184c Spring DeHon CS184c: Computer Architecture [Parallel and Multithreaded] Day 9: May 3, 2001 Distributed Shared Memory.
Background Computer System Architectures Computer System Software.
Application-Specific Customization of Soft Processor Microarchitecture Peter Yiannacouras J. Gregory Steffan Jonathan Rose University of Toronto Electrical.
Lecture 1: Introduction CprE 585 Advanced Computer Architecture, Fall 2004 Zhao Zhang.
CMSC 611: Advanced Computer Architecture Shared Memory Most slides adapted from David Patterson. Some from Mohomed Younis.
Running Commodity Operating Systems on Scalable Multiprocessors Edouard Bugnion, Scott Devine and Mendel Rosenblum Presentation by Mark Smith.
VU-Advanced Computer Architecture Lecture 1-Introduction 1 Advanced Computer Architecture CS 704 Advanced Computer Architecture Lecture 1.
Introduction to Operating Systems Concepts
INTRODUCTION TO WIRELESS SENSOR NETWORKS
CS 704 Advanced Computer Architecture
12.4 Memory Organization in Multiprocessor Systems
Flow Path Model of Superscalars
Michael Bedford Taylor, Walter Lee, Saman Amarasinghe, Anant Agarwal
CS775: Computer Architecture
CMSC 611: Advanced Computer Architecture
Section 1: Introduction to Simics
Example Cache Coherence Problem
Architecture of Parallel Computers CSC / ECE 506 Summer 2006 Scalable Programming Models Lecture 11 6/19/2006 Dr Steve Hunter.
CS 258 Reading Assignment 4 Discussion Exploiting Two-Case Delivery for Fast Protected Messages Bill Kramer February 13, 2002 #
Lecture 1: Parallel Architecture Intro
Multiple Processor Systems
Introduction to Multiprocessors
CS 213 Lecture 11: Multiprocessor 3: Directory Organization
DDM – A Cache-Only Memory Architecture
Leveraging Optical Technology in Future Bus-based Chip Multiprocessors
CS510 - Portland State University
Latency Tolerance: what to do when it just won’t go away
Chapter 4 Multiprocessors
Lecture 18: Coherence and Synchronization
Interconnection Network and Prefetching
Presentation transcript:

The Stanford FLASH Multiprocessor J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum, J. Hennessy. In Proceedings of the 21st Annual International Symposium on Computer Architecture (ISCA) 1994. Henry Cook CS258 4/2/2008

Goals Support both cache-coherent shared memory and message passing Not just either/or, but both at the same time Design a custom node controller Build actual hardware 256 node target, scaling to thousands (Actually built 64)

Big Idea Principal difference between CCSM and MP is protocol for transferring data Overall machine structure is the same Functions performed by node controller are also the same By making the controller a special purpose protocol processor, can leverage flexibility of software while reducing overheads

System Architecture Each processor is a single MIPS chip MAGIC has a protocol processor Different messages types are processed by different software handlers

Protocols - CCSM Directory-based, with dynamic pointer allocation structure Similar to DASH protocol Separate request-reply networks to eliminate cycles Handlers must yield if they cannot run to completion

Protocols - MP Long messages vs short messages Block transfer vs synchronization User-level parameters passed to transfer handler running on MAGIC When all user message components have arrived at dest., a reception handler is invoked

Protocols - Extensions Just have to change the handlers Emulate COMA attraction memories Implement synchronization primitives as MAGIC handlers Short message support similar to active messages Not user-level active messages

MAGIC Architecture Separation and specialization Use hardwired data movement logic for speed Use control logic that runs software protocols for flexibility

MAGIC Architecture Must operate quickly enough to avoid being the bottleneck Hardware based speculative message dispatch to PP Separate hardware for message sends

Protocol Processor Implements subset of DLX ISA with extensions for common protocol ops Statically-scheduled dual-issue superscalar processor No interrupts, exceptions, address translation or interlocks

Performance How do latencies look for local read misses? Derived from Verilog model Assuming PP cache hits

Performance PP occupancies are longest sub-operation that must be performed Block transfer bandwidth of 300-400MB/s

Testbed System-level simulator in C++ Protocol verifier, SPLASH benchmarks Verilog description of MAGIC N-1 simulated nodes and one Verilog node

Retrospective Flexibility of software handlers is key Supporting multiple protocols Scaling to arch to multiple machine sizes Debugging protocol operation Building hardware in an academic environment is difficult and time consuming