

The Effect of Network Total Order, Broadcast, and Remote-Write on Network-Based Shared Memory Computing
Robert Stets, Sandhya Dwarkadas, Leonidas Kontothanassis, Umit Rencuzogullari, and Michael L. Scott
University of Rochester and Compaq Research
HPCA 2000

High Performance Cluster Computing
A cluster connected by a System Area Network (SAN) is a cost-effective parallel computing platform
A Software Distributed Shared Memory (SDSM) system provides an attractive shared-memory paradigm on such a cluster
In SDSM, what is the performance impact of the SAN?

Cashmere SDSM
Designed to leverage special network features: remote write, broadcast, total message order
Use network features to reduce receiver and ack overhead
–provides an 18-44% improvement in three of ten apps
Benefits are outweighed by protocol optimizations!
Use broadcast to reduce data propagation overhead
–provides a small improvement on an eight-node cluster
Provides up to 51% on an emulated 32-node cluster!

Cashmere: Network Features
I. Introduction
II. Cashmere design (use of network features)
III. Performance of full network features
IV. Performance of Adaptive Data Broadcast
V. Conclusions

II. Cashmere Design
Implements a Lazy Release Consistency model
Applications must be free of data races and must use Cashmere synchronization ops (Acquire, Release)
Traps data accesses through Virtual Memory hardware
Uses invalidation messages (write notices)
Employs home nodes and a global page directory
Prototype: AlphaServer SMPs, Memory Channel SAN
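Under lazy release consistency, all coherence work can be deferred to these synchronization points. A minimal sketch, assuming hypothetical csm_acquire/csm_release names rather than Cashmere's actual API, of how a data-race-free application brackets a shared update:

```c
/* A minimal sketch; csm_acquire/csm_release and shared_counter are
 * illustrative assumptions, not Cashmere's real interface. */
void csm_acquire(int lock_id);   /* apply pending write notices: stale pages are invalidated */
void csm_release(int lock_id);   /* push local modifications (diffs) toward the home node */

extern double *shared_counter;   /* assumed to live in the shared virtual address space */

void add_to_counter(int lock_id, double delta)
{
    csm_acquire(lock_id);        /* acquire: incoming write notices take effect here */
    *shared_counter += delta;    /* first write traps via VM protection; the page is twinned */
    csm_release(lock_id);        /* release: the update becomes visible to the next acquirer */
}
```

Because invalidations are applied only at acquires and modifications propagated only at releases, correctly synchronized programs see consistent data while the protocol avoids per-write communication.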

Use of Network Features in Cashmere
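As context for how remote write and total ordering can be exploited, the following hedged sketch posts a write notice directly into a peer's remotely writable memory so the receiver is never interrupted and no explicit acks are needed; the ring layout and all names are assumptions for illustration, not Cashmere's actual data structures.

```c
#include <stdint.h>

#define WN_RING_SIZE 256

/* Each node exports one ring per peer through the remotely writable segment.
 * Assumes a single producer per ring and ignores overflow handling. */
struct wn_ring {
    volatile uint32_t head;                 /* advanced remotely by the producer */
    volatile uint32_t pages[WN_RING_SIZE];  /* page numbers needing invalidation */
};

/* Sender: deposit a notice with a remote write; the receiver is not interrupted. */
void post_write_notice(struct wn_ring *peer_ring, uint32_t page)
{
    uint32_t slot = peer_ring->head;
    peer_ring->pages[slot % WN_RING_SIZE] = page;  /* payload lands first */
    peer_ring->head = slot + 1;                    /* ordered after the payload write */
}

/* Receiver: drain notices lazily at the next acquire; returns the new tail. */
uint32_t drain_write_notices(struct wn_ring *my_ring, uint32_t tail,
                             void (*invalidate)(uint32_t page))
{
    while (tail != my_ring->head) {
        invalidate(my_ring->pages[tail % WN_RING_SIZE]);
        tail++;
    }
    return tail;
}
```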

Cashmere Design Variants
Memory Channel Cashmere
–shared data propagation (diffs)*
–meta-data propagation (directory, write notices)
–synchronization mechanisms (locks, barriers)
*Remotely-accessible shared data space is limited!
Explicit Messages Cashmere
–diffs, write notices, locks, barriers: use plain messages
–directory: maintain master entry only at the home
–hide ack latency by pipelining diffs, write notices
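The diffs mentioned above can be illustrated with a small sketch; the page size and helper names are assumptions rather than anything taken from the Cashmere sources. On the first write to a page the protocol saves a pristine copy (a twin); at release time the twin is compared word by word against the current contents and only the changed words are sent to the home.

```c
#include <stdint.h>
#include <string.h>
#include <stdlib.h>

#define PAGE_WORDS (8192 / sizeof(uint32_t))   /* assume 8 KB pages */

/* Save a pristine copy of the page before the first local write. */
uint32_t *make_twin(const uint32_t *page)
{
    uint32_t *twin = malloc(PAGE_WORDS * sizeof(uint32_t));
    memcpy(twin, page, PAGE_WORDS * sizeof(uint32_t));
    return twin;
}

/* At release time, record only the words that changed since the twin was made. */
size_t make_diff(const uint32_t *page, const uint32_t *twin,
                 uint32_t *off_out, uint32_t *val_out)
{
    size_t n = 0;
    for (size_t i = 0; i < PAGE_WORDS; i++) {
        if (page[i] != twin[i]) {
            off_out[n] = (uint32_t)i;   /* word offset within the page */
            val_out[n] = page[i];       /* new value to apply at the home node */
            n++;
        }
    }
    return n;                           /* number of (offset, value) pairs in the diff */
}
```

The twin copy and the word-by-word comparison are exactly the overhead that the home migration optimization on the next slide tries to eliminate.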

Home Migration Optimization
Reduce twin/diff overhead by migrating the home node to the active writer
–migration is not possible when using remote write to propagate data
Send a migration request as part of each write fault
–the home grants the migration request if it is not actively writing the page
–the old home forwards any incoming requests to the new home
Migration very effectively reduces twin/diff operations!
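A sketch of the migration handshake described above; the structure and function names are assumed for illustration only. The faulting node piggybacks a migration request on its write-fault message, and the current home grants it only if it is not actively writing the page.

```c
#include <stdbool.h>

struct page_entry {
    int  home_node;        /* current home of the page */
    bool home_is_writing;  /* true if the home itself has the page mapped writable */
    int  forward_to;       /* set after migration so stale requests are forwarded */
};

/* Run at the current home when a write-fault request arrives with a
 * piggybacked migration request.  Returns the node that is home afterwards. */
int handle_migration_request(struct page_entry *pe, int requester)
{
    if (!pe->home_is_writing) {
        pe->forward_to = requester;   /* old home forwards later requests */
        pe->home_node  = requester;   /* grant: requester becomes the new home */
    }
    return pe->home_node;
}
```

Once the active writer is the home, it can update the page in place and skip the twin/diff path entirely.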

III. Performance of Full Features
Platform: AlphaServer 4100 cluster, 32 processors across 8 nodes
–Alpha 21164A, 600 MHz
–Memory Channel II SAN
Microbenchmark: round-trip null message latency of 15 μs

Results for Full Network Features
Left, middle, right bars: MC Features, Explicit Messages, Explicit Messages + Migration (32 processors)

IV. Adaptive Data Broadcast (ADB)
Use broadcast to reduce contention for widely shared data
Augment Cashmere to include a set of broadcast buffers (avoids mapping data into remotely-accessible memory)
Identify widely shared data:
–multiple requests for the same page in the same interval, or
–two or more requests for the same page in the last interval
Little performance improvement (<13%) at eight nodes
Does ADB have a larger impact beyond eight nodes?
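The detection rule listed above needs only a little per-page state at the home. The following sketch is an illustration under simplified assumptions (counter names and interval handling are invented), not the actual Cashmere-ADB code:

```c
#include <stdbool.h>

struct page_stats {
    int requests_this_interval;   /* fetches seen since the last interval boundary */
    int requests_last_interval;   /* fetches seen during the previous interval */
};

/* Called at the home on every page-fetch request: decide whether to
 * broadcast the page instead of replying point-to-point. */
bool should_broadcast(struct page_stats *ps)
{
    ps->requests_this_interval++;
    return ps->requests_this_interval > 1 ||    /* multiple requests this interval */
           ps->requests_last_interval >= 2;     /* or two or more last interval */
}

/* Called when an interval ends (e.g., at a release). */
void end_interval(struct page_stats *ps)
{
    ps->requests_last_interval = ps->requests_this_interval;
    ps->requests_this_interval = 0;
}
```

The two thresholds mirror the two bullets on the slide: sharing observed in the current interval, or repeated sharing in the previous one.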

Results for Cashmere-ADB
Left, right bars: Cashmere, Cashmere-ADB (emulated 32-node cluster)

V. Conclusions
Special network features provide some benefit, but can limit scalability
–three of ten applications improve by 18-44%
Home node migration can largely recover the benefits of the network features
–can even lead to as much as a 67% improvement
In larger clusters, a scalable broadcast mechanism may be worthwhile
–three of ten applications improve by 18-51%