Implementing an OpenMP Execution Environment on InfiniBand Clusters
Jie Tao ¹, Wolfgang Karl ¹, and Carsten Trinitis ²
¹ Institut für Technische Informatik, Universität Karlsruhe
² Lehrstuhl für Rechnertechnik und Rechnerorganisation, Technische Universität München

Outline
- Motivation
- ViSMI: Virtual Shared Memory for InfiniBand clusters
- Omni/Infini: towards OpenMP execution on cluster systems
- Initial experimental results
- Conclusions

Motivation
- InfiniBand
  - Point-to-point, switched I/O interconnect architecture
  - Low latency, high bandwidth
  - Basic connection: serial link at a data rate of 2.5 Gbps
  - Bundled links: e.g. a 4X connection delivers 10 Gbps, a 12X connection 30 Gbps
  - Special feature: Remote Direct Memory Access (RDMA)
- The InfiniBand cluster at TUM
  - Configuration: 6 Xeon nodes (2-way), 4 Itanium 2 nodes (4-way), 36 Opteron nodes
  - MPI available

Software Distributed Shared Memory
- Provides a virtually global address space on cluster architectures
- Memory consistency models
  - Sequential consistency: any execution has the same result as if all operations were executed in some sequential order
  - Relaxed consistency: consistency is enforced only at explicit synchronization points (acquire & release)
    - Lazy Release Consistency (LRC)
    - Home-based Lazy Release Consistency (HLRC): multiple writable copies of the same page, with a designated home node for each page
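
To make the acquire/release model concrete, here is a minimal, self-contained C sketch of the protocol steps. The dsm_acquire/dsm_release functions are hypothetical stand-ins (stubbed out so the example runs); the comments describe what a real HLRC runtime would do at each point.

    #include <stdio.h>

    static int shared_counter;      /* imagine this variable lives on a DSM-managed page */

    /* Stubs standing in for the DSM runtime. */
    static void dsm_acquire(int lock) { (void)lock; }  /* real DSM: apply pending invalidations   */
    static void dsm_release(int lock) { (void)lock; }  /* real DSM: diff pages, propagate updates */

    int main(void)
    {
        dsm_acquire(0);     /* pages written by other nodes since their last
                               release are invalidated locally                */
        shared_counter++;   /* the first write faults in a clean copy of the
                               page and keeps a "twin" for later diffing      */
        dsm_release(0);     /* the diff against the twin is computed and,
                               under HLRC, sent to the page's home node       */
        printf("counter = %d\n", shared_counter);
        return 0;
    }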

ViSMI: Software DSM for InfiniBand Clusters
- HLRC implementation
  - Home node assignment: first touch
  - Diff-based mechanism: a clean copy of the page is kept before the first write; diffs are propagated by invalidations
  - InfiniBand hardware-based multicast
- Programming interface: a set of annotations
  - HLRC_Malloc
  - HLRC_Myself
  - HLRC_InitParallel
  - HLRC_Barrier
  - HLRC_Acquire
  - HLRC_Release
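
A hand-written ViSMI program might use these annotations roughly as follows. The HLRC_* names are taken from the slide above, but their exact signatures, the header name, and the fixed process count are assumptions for illustration.

    #include <stdio.h>
    #include "hlrc.h"   /* hypothetical ViSMI header name */

    #define N      1024
    #define NPROCS 6    /* e.g. the six Xeon nodes; a real code would query the runtime */

    int main(int argc, char **argv)
    {
        HLRC_InitParallel(argc, argv);          /* start the DSM processes          */

        double *a = (double *)HLRC_Malloc(N * sizeof(double)); /* shared pages      */
        int me = HLRC_Myself();                 /* my process rank                  */

        for (int i = me; i < N; i += NPROCS)    /* cyclic distribution of the work  */
            a[i] = 2.0 * i;

        HLRC_Barrier(0);                        /* all writes become visible here   */

        if (me == 0)
            printf("a[1] = %f\n", a[1]);
        return 0;
    }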

Omni/Infini: Compiler & Runtime for OpenMP Execution on InfiniBand
- Omni for SMPs as the basis
- A new runtime library, adapted to the ViSMI interface:
  - Scheduling: ompc_static_bschd(), ompc_get_num_threads()
  - Parallelization: ompc_do_parallel()
  - Synchronization: ompc_lock(), ompc_unlock(), ompc_barrier()
  - Reduction operation: ompc_reduction()
  - Specific code regions: ompc_do_single(), ompc_is_master(), ompc_critical()
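
As a rough illustration of how such a runtime is driven, the sketch below shows an OpenMP loop and the kind of outlined code a compiler like Omni emits for it. The ompc_* names come from the slide; their signatures here are assumptions, not the actual Omni/Infini interface.

    /* Original OpenMP source:
     *
     *   #pragma omp parallel for
     *   for (int i = 0; i < n; i++)
     *       a[i] = b[i] + c[i];
     */

    /* Assumed declarations of the runtime entry points named above: */
    extern void ompc_static_bschd(int *lo, int *hi, int first, int last);
    extern void ompc_barrier(void);
    extern void ompc_do_parallel(void (*func)(void **), void **args);

    static void body(void **args)          /* the outlined parallel region     */
    {
        double *a = args[0];
        double *b = args[1];
        double *c = args[2];
        int     n = *(int *)args[3];
        int     lo, hi;

        ompc_static_bschd(&lo, &hi, 0, n); /* my block of the iteration space  */
        for (int i = lo; i < hi; i++)
            a[i] = b[i] + c[i];

        ompc_barrier();                    /* implicit barrier ending the loop */
    }

    void vadd(double *a, double *b, double *c, int n)
    {
        void *args[4] = { a, b, c, &n };
        ompc_do_parallel(body, args);      /* fork workers, run the region     */
    }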

Omni/Infini (cont.)
- Shared data allocation
  - Static data are turned into pointers and allocated in the shared region via HLRC_Malloc
- Adapting the process structure to the thread structure
  - ViSMI provides a process structure
  - OpenMP expects a thread structure
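
A minimal sketch of that data transformation, assuming the HLRC_Malloc signature shown (the actual generated code is not given in the slides):

    /* Before: the program declares its shared data statically, e.g.
     *
     *   double field[1024];
     */

    extern void *HLRC_Malloc(unsigned long size);   /* signature assumed */

    #define N 1024

    static double *field;           /* a pointer replaces the static array */

    static void init_shared_data(void)
    {
        field = (double *)HLRC_Malloc(N * sizeof(double)); /* shared pages */
    }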

Thread-level & Process-level Parallelization
[Figure: thread-level parallelization (master plus threads T1–T3, which are idle during sequential regions) contrasted with process-level parallelization (processes P1–P4 executing sequential and parallel regions)]
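
One plausible way to mimic the thread-level scheme with DSM processes, sketched with the ViSMI calls listed earlier (sequential_part/parallel_part are hypothetical application routines, and the exact scheme used by Omni/Infini may differ):

    extern int  HLRC_Myself(void);      /* signature assumed */
    extern void HLRC_Barrier(int id);   /* signature assumed */
    extern void sequential_part(void);  /* hypothetical application code */
    extern void parallel_part(void);

    void step(void)
    {
        if (HLRC_Myself() == 0)         /* only the master process runs   */
            sequential_part();          /* sequential regions; the others
                                           are idle meanwhile             */
        HLRC_Barrier(0);                /* re-join, make updates visible  */
        parallel_part();                /* executed by every process      */
    }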

Experimental Results
- Platform and setup
  - Initially: 6 Xeon nodes
  - Most recently: Opteron nodes
- Applications
  - NAS Parallel Benchmarks
  - OpenMP version of SPLASH-2
  - Small codes from an SMP programming course, plus self-coded programs

Experimental Results: Platform
[Figure: platform configuration]

Speedup
[Figure: speedup results]

Normalized Execution Time Breakdown
[Figure: normalized execution time breakdown]

CG Time Breakdown
[Figure: CG execution time breakdown]

Comparison with Omni/SCASH
[Figure: performance comparison with Omni/SCASH]

Scalability
[Figure: scalability results]

Conclusions
- Summary
  - An OpenMP execution environment for InfiniBand clusters
  - Built on top of a software DSM (ViSMI)
  - Omni compiler as the basis
  - Speedup on 6 nodes: up to 5.22
- Future work
  - Optimizations: barrier, page operations
  - Tests with realistic applications