OpenMP for Networks of SMPs
Y. Charlie Hu, Honghui Lu, Alan L. Cox, Willy Zwaenepoel
Presented by Vicky Tsang, ECE1747 – Parallel Programming

Background
- Published in the Journal of Parallel and Distributed Computing, vol. 60, no. 12, December 2000
- Work to further improve TreadMarks
- Presents an alternative solution to MPI

Roadmap
- Motivation
- Solution
- OpenMP API
- TreadMarks
- OpenMP Translator
- Performance Measurement
- Results
- Conclusion

Motivation
- To enable the programmer to rely on a single, standard, shared-memory API for parallelization within and between multiprocessors
- To provide a standard alternative to MPI?

Solution
- Presents the first system that implements OpenMP on a network of shared-memory multiprocessors
- Implemented via a translator that converts OpenMP directives to calls into a modified TreadMarks
- The modified TreadMarks uses POSIX threads for parallelism within an SMP node

Solution
- Original version of TreadMarks:
  - A Unix process was executed on each processor of the multiprocessor node, and communication between processes was achieved through message passing
  - Fails to take advantage of hardware shared memory

Solution
- Modified version of TreadMarks:
  - POSIX threads used to implement parallelism
  - OpenMP threads within a multiprocessor share a single address space
  - Positive:
    - Reduces the number of changes to TreadMarks needed to support multithreading on a multiprocessor
    - The OS maintains the coherence of page mappings automatically
  - Negative:
    - More difficult to provide uniform sharing of memory between threads on the same node and threads on different nodes

OpenMP API
- Three kinds of directives:
  - Parallelism/work sharing
  - Data environment
  - Synchronization
- Based on a fork-join model
- Sequential code sections are executed by the master thread
- Parallel code sections are executed by all threads, including the master thread

OpenMP API
- Parallel directive: all threads perform the same computation
- Work-sharing directive: the computation is divided among the threads
- Data environment directives: control the sharing of program variables
- Synchronization directives: control the synchronization between threads
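
To make the directive kinds concrete, here is a minimal OpenMP example in C (illustrative only, not taken from the paper): the parallel-for line combines parallelism and work sharing, the reduction clause is a data environment directive, and the critical directive is a synchronization directive.

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        double a[100];
        int sum = 0;
        /* Parallelism/work sharing: iterations are divided among threads.
           Data environment: reduction(+:sum) gives each thread a private
           copy of sum and combines the copies when the loop ends. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < 100; i++) {
            a[i] = i * 0.5;
            sum += i;
        }
        /* Synchronization: the critical directive serializes this block. */
        #pragma omp parallel
        {
            #pragma omp critical
            printf("hello from thread %d\n", omp_get_thread_num());
        }
        printf("sum = %d, a[99] = %.1f\n", sum, a[99]);
        return 0;
    }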

TreadMarks
- User-level software distributed shared memory (SDSM) system
- Provides a global shared address space on top of physically distributed memories
- Key functions performed are memory coherence and synchronization
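
As a point of reference, a minimal TreadMarks-style program looks roughly like the following. The API names (Tmk_startup, Tmk_malloc, Tmk_barrier, the lock calls, and the Tmk_proc_id/Tmk_nprocs globals) come from the published TreadMarks papers; the header name and exact signatures here are assumptions.

    #include <stdio.h>
    #include "Tmk.h"                 /* assumed header name */

    int *counter;                    /* points into the shared address space */

    int main(int argc, char **argv) {
        Tmk_startup(argc, argv);     /* join the DSM computation */
        if (Tmk_proc_id == 0) {
            counter = (int *) Tmk_malloc(sizeof(int));
            *counter = 0;
            /* publish the pointer to the other processes */
            Tmk_distribute((char *) &counter, sizeof(counter));
        }
        Tmk_barrier(0);              /* everyone now sees counter */
        Tmk_lock_acquire(0);         /* consistency information piggybacks */
        (*counter)++;                /* on lock and barrier messages */
        Tmk_lock_release(0);
        Tmk_barrier(1);
        if (Tmk_proc_id == 0)
            printf("count = %d\n", *counter);   /* equals Tmk_nprocs */
        Tmk_exit(0);
        return 0;
    }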

TreadMarks – Memory Coherence
- Minimizes the amount of communication performed to maintain memory consistency by:
  - a lazy implementation of release consistency
  - reducing the impact of false sharing by allowing multiple concurrent writers to modify a page
- Propagation of consistency information is postponed until the time of an acquire
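
The multiple-writer scheme is commonly realized with "twins" and "diffs": on the first write to a page the runtime saves a pristine copy, and at release time it encodes only the bytes that changed, so two nodes writing disjoint parts of the same page exchange small diffs instead of bouncing the whole page. A simplified sketch of that mechanism (based on the description in the TreadMarks papers, not the actual implementation):

    #include <stddef.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    /* On the first write fault for a page, save a pristine copy (the twin). */
    void make_twin(const char *page, char *twin) {
        memcpy(twin, page, PAGE_SIZE);
    }

    /* At release time, record only the bytes that differ from the twin.
       Writers to disjoint parts of a page produce non-overlapping diffs,
       which is how the impact of false sharing is reduced. */
    size_t make_diff(const char *page, const char *twin,
                     size_t offsets[], char values[]) {
        size_t n = 0;
        for (size_t i = 0; i < PAGE_SIZE; i++)
            if (page[i] != twin[i]) {
                offsets[n] = i;
                values[n] = page[i];
                n++;
            }
        return n;                    /* number of modified bytes to ship */
    }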

TreadMarks – Synchronization
- Barriers are implemented as acquire and release messages
- Governed by a centralized manager

TreadMarks – Modifications for OpenMP
- Inclusion of two primitives:
  - Tmk_fork
  - Tmk_join
- All threads are created at the start of a program's execution to minimize overhead
- Slave threads are blocked during sequential execution until the next Tmk_fork is issued by the master thread (see the sketch below)
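
A self-contained sketch of this fork-join machinery on POSIX threads follows. The talk names only Tmk_fork and Tmk_join; the signature, the generation counter, and the condition-variable protocol are guesses at one reasonable implementation.

    #include <pthread.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t fork_cv = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t join_cv = PTHREAD_COND_INITIALIZER;
    static void (*region)(void *);   /* outlined parallel region */
    static void *region_arg;
    static int generation;           /* bumped by each Tmk_fork */
    static int done_count;
    static int nslaves;              /* set when slaves are created at startup */

    /* Slaves live in this loop for the whole run: blocked while the master
       executes sequential code, woken by Tmk_fork, counted in by Tmk_join. */
    static void *slave_loop(void *unused) {
        int seen = 0;
        for (;;) {
            pthread_mutex_lock(&m);
            while (generation == seen)
                pthread_cond_wait(&fork_cv, &m);
            seen = generation;
            pthread_mutex_unlock(&m);
            region(region_arg);                  /* run the forked region */
            pthread_mutex_lock(&m);
            if (++done_count == nslaves)
                pthread_cond_signal(&join_cv);
            pthread_mutex_unlock(&m);
        }
        return unused;
    }

    void Tmk_fork(void (*f)(void *), void *arg) {   /* hypothetical signature */
        pthread_mutex_lock(&m);
        region = f;
        region_arg = arg;
        done_count = 0;
        generation++;
        pthread_cond_broadcast(&fork_cv);        /* wake every slave */
        pthread_mutex_unlock(&m);
    }

    void Tmk_join(void) {                        /* master waits for slaves */
        pthread_mutex_lock(&m);
        while (done_count < nslaves)
            pthread_cond_wait(&join_cv, &m);
        pthread_mutex_unlock(&m);
    }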

TreadMarks – Modifications for Networks of Multiprocessors
- POSIX threads enable sharing of data between processors within a node.
- Some data structures, such as message buffers, were placed in thread-private memory for data that must remain private to a thread.
- A per-page mutex was added to allow greater concurrency in the page fault handler.
- The TreadMarks synchronization functions were modified to use POSIX-thread-based synchronization between processors within a node and the existing TreadMarks synchronization functions between nodes.
- A second mapping was added for the memory that is shared between nodes, so shared-memory pages can be updated while the first mapping remains invalid until the update is complete. This reduces the number of page protection operations performed by TreadMarks (see the sketch below).
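
The second-mapping trick can be illustrated with standard mmap machinery: two virtual mappings of the same physical pages let the DSM layer write an incoming update through one mapping while application threads still fault on the other. How TreadMarks actually arranges this on AIX is not described in the slides; the sketch below uses POSIX shared memory as a stand-in.

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        int fd = shm_open("/dsm_page", O_RDWR | O_CREAT, 0600);
        ftruncate(fd, 4096);
        /* First mapping: what application threads access; kept invalid
           (PROT_NONE) while the page contents are out of date. */
        char *user_map = mmap(NULL, 4096, PROT_NONE, MAP_SHARED, fd, 0);
        /* Second mapping: always writable; the DSM layer applies incoming
           updates here without touching the user mapping's protection. */
        char *dsm_map = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                             MAP_SHARED, fd, 0);
        dsm_map[0] = 42;                      /* apply the update */
        mprotect(user_map, 4096, PROT_READ);  /* validate in a single step */
        shm_unlink("/dsm_page");
        return 0;
    }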

OpenMP Translator
- Synchronization directives translate directly to TreadMarks synchronization operations.
- The compiler translates the code sections marked with parallel directives to fork-join code (see the sketch below).
- Data environment directives are implemented to work with both TreadMarks and POSIX threads, hiding the interface issues from the programmer.
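
For a sense of what the translation looks like, a parallel loop can be outlined into a function and dispatched with the fork-join primitives. This is a sketch of the general scheme, not the paper's generated code; thread_id() and num_threads() are hypothetical helpers.

    /* Original OpenMP source: */
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        a[i] = b[i] + c[i];

    /* Translated form (sketch): the loop body becomes an outlined function
       and each thread takes a contiguous slice of the iteration space. */
    struct loop_arg { double *a, *b, *c; int n; };

    static void outlined_loop(void *p) {
        struct loop_arg *r = p;
        int id = thread_id(), nt = num_threads();  /* hypothetical helpers */
        int lo = r->n * id / nt;
        int hi = r->n * (id + 1) / nt;
        for (int i = lo; i < hi; i++)
            r->a[i] = r->b[i] + r->c[i];
    }

    /* ...emitted in place of the original loop: */
    struct loop_arg arg = { a, b, c, n };
    Tmk_fork(outlined_loop, &arg);   /* slaves execute the region */
    outlined_loop(&arg);             /* the master participates too */
    Tmk_join();                      /* implicit barrier at region end */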

Performance Measurement
- Platform:
  - IBM SP2 consisting of four SMP nodes
  - Per node:
    - Four IBM PowerPC 604 processors
    - 1 GB memory
    - Running AIX 4.2

Performance Measurement
- Applications:
  - SPLASH-2 Barnes-Hut
  - NAS 3D-FFT
  - SPLASH-2 CLU
  - SPLASH-2 Water
  - Red-Black SOR
  - TSP
  - Modified Gram-Schmidt (MGS)

Results

Conclusion
- Enables the programmer to rely on a single, standard, shared-memory API for parallelization within and between multiprocessors
- Using the hardware shared memory reduced the amount of data and the number of messages transmitted
- The speedups of multithreaded TreadMarks codes on four four-way SMP SP2 nodes are within 7-30% of the MPI versions

Critique
- The solution allows easier parallelization of programs across multiprocessors when speedup is not crucial
- OpenMP is easier on the programmer, but the speedup is still not as good as MPI's

Critique
- Issues:
  - AIX has an inefficient implementation of page protection
    - The paper claims that every other brand of Unix, including Linux, uses data structures that handle mprotect operations more efficiently
    - Why wasn't the solution implemented on another platform?
  - The paper failed to present a strong motivation for using this solution over MPI

Thank You