Matrix Transpose Results with Hybrid OpenMP / MPI
O. Haan, Gesellschaft für wissenschaftliche Datenverarbeitung Göttingen (GWDG), Germany, SCICOMP 2000

Presentation transcript:

Matrix Transpose Results with Hybrid OpenMP / MPI
O. Haan
Gesellschaft für wissenschaftliche Datenverarbeitung Göttingen (GWDG), Germany
SCICOMP 2000, SDSC, La Jolla

Overview
- Hybrid Programming Model
- Distributed Matrix Transpose
- Performance Measurements
- Summary of Results

Architecture of Scalable Parallel Computers
Two-level hierarchy:
- cluster of SMP nodes: distributed memory, high-speed interconnect
- SMP nodes with multiple processors: shared memory, bus or switch connected

Programming Models
- message passing over all processors: MPI implementation for shared memory, multiple access to switch adapters; SP: 4-way Winterhawk2 +, 8-way Nighthawk -
- shared memory over all processors: virtual global address space; SP: -
- hybrid message passing / shared memory: message passing between nodes, shared memory within nodes; SP: +

Hybrid Programming Model
- SPMD program with MPI tasks
- OpenMP threads within each task
- communication between MPI tasks

Example of Hybrid Program

      program hybrid_example
!     declarations added for clarity; the size of thread_res is an
!     assumption (the slide leaves it implicit)
      implicit none
      include 'mpif.h'
      integer com, ierr, nk, my_task, kp, my_thread, i
      integer OMP_GET_NUM_PROCS, OMP_GET_THREAD_NUM
      real    thread_res(0:127), node_res, glob_res
      com = MPI_COMM_WORLD
      call MPI_INIT(ierr)
      call MPI_COMM_SIZE(com, nk, ierr)
      call MPI_COMM_RANK(com, my_task, ierr)
      kp = OMP_GET_NUM_PROCS()
      thread_res = 0.0
!$OMP PARALLEL PRIVATE(my_thread)
      my_thread = OMP_GET_THREAD_NUM()
      call work(my_thread, kp, my_task, nk, thread_res)
!$OMP END PARALLEL
      node_res = 0.0
      do i = 0, kp-1
         node_res = node_res + thread_res(i)
      end do
      call MPI_REDUCE(node_res, glob_res, 1,
     :                MPI_REAL, MPI_SUM, 0, com, ierr)
      call MPI_FINALIZE(ierr)
      stop
      end
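
The work routine itself is not in the transcript. Below is a minimal sketch with the same interface, assuming each thread merely accumulates a partial sum into its own slot of thread_res; the loop body and the constant n are purely illustrative.

      subroutine work(my_thread, kp, my_task, nk, thread_res)
!     hypothetical body: only the argument list follows the call above
      implicit none
      integer my_thread, kp, my_task, nk
      integer i, n
      parameter (n = 1000000)
      real    thread_res(0:kp-1)
      real    s
      s = 0.0
!     static partition: thread my_thread of task my_task takes every
!     kp-th index of this task's range
      do i = my_thread + 1, n, kp
         s = s + 1.0 / real(i + my_task*n)
      end do
      thread_res(my_thread) = s
      return
      end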

Hybrid Programming vs. Pure Message Passing
+ works on all SP configurations
+ coarser internode communication granularity
+ faster intranode communication
- larger programming effort
- additional synchronization steps
- reduced reuse of cached data
The net score depends on the problem.

Distributed Matrix Transpose

Three-step Transpose
Transpose the n1 x n2 matrix A( i1, i2 ) --> the n2 x n1 matrix B( i2, i1 ).
Decompose n1, n2 into local and global parts: n1 = n1l * np, n2 = n2l * np.
Write the matrices A, B as 4-dim arrays: A( i1l, i1g, i2l ; i2g ), B( i2l, i2g, i1l ; i1g ),
where the index after the ";" is the one distributed over the np tasks.
- step 1: local reorder    A( i1l, i1g, i2l ; i2g )  -> a1( i1l, i2l, i1g ; i2g )
- step 2: global reorder   a1( i1l, i2l, i1g ; i2g ) -> a2( i1l, i2l, i2g ; i1g )
- step 3: local transpose  a2( i1l, i2l, i2g ; i1g ) -> B( i2l, i2g, i1l ; i1g )
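
As an illustration, a small stand-alone Fortran sketch of the two local steps follows (toy sizes and array names are illustrative, not the author's code); the global reorder of step 2 is only stubbed here and is sketched as MPI code after the Global Reorder slide below.

! sketch: local steps of the three-step transpose, seen from one task
program transpose_local_steps
  implicit none
  integer, parameter :: np = 4, n1l = 3, n2l = 2   ! illustrative sizes
  real :: A (n1l, np, n2l)   ! A(i1l, i1g, i2l)  held for this task's i2g
  real :: a1(n1l, n2l, np)   ! a1(i1l, i2l, i1g) held for this task's i2g
  real :: a2(n1l, n2l, np)   ! a2(i1l, i2l, i2g) held for this task's i1g
  real :: B (n2l, np, n1l)   ! B(i2l, i2g, i1l)  held for this task's i1g
  integer :: i1l, i1g, i2l, i2g

  call random_number(A)

  ! step 1: local reorder  A(i1l,i1g,i2l) -> a1(i1l,i2l,i1g)
  do i1g = 1, np
    do i2l = 1, n2l
      do i1l = 1, n1l
        a1(i1l, i2l, i1g) = A(i1l, i1g, i2l)
      end do
    end do
  end do

  ! step 2: global reorder a1 -> a2 exchanges the distributed index
  ! between tasks; stubbed as a local copy so that step 3 has input
  a2 = a1

  ! step 3: local transpose a2(i1l,i2l,i2g) -> B(i2l,i2g,i1l)
  do i1l = 1, n1l
    do i2g = 1, np
      do i2l = 1, n2l
        B(i2l, i2g, i1l) = a2(i1l, i2l, i2g)
      end do
    end do
  end do

  print *, 'B(1,1,1) =', B(1,1,1)
end program transpose_local_steps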

Local Steps: Copy with Reorder
- data in memory: speed limited by the performance of the bus and memory subsystems; Winterhawk2: all processors share the same bus, bandwidth 1.6 GB/s
- data in cache: speed limited by processor performance; Winterhawk2: one load plus one store per cycle, i.e. one 8-byte word copied per cycle, 8 B x 375 MHz = 3 GB/s
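
For illustration, copy rates of this kind can be probed with a kernel like the following (a sketch only; the timing via system_clock, the array size and the repeat count are assumptions, and the copied volume is counted as one 8-byte word per element, as in the rates quoted on this slide):

! sketch: measure the rate of a plain copy B = A; with n large the
! arrays are memory resident, with n small they fit into the cache
program copy_rate
  implicit none
  integer, parameter :: n = 4*1024*1024   ! 32 MB per array: memory resident
  integer, parameter :: nrep = 10
  real(kind=8), allocatable :: A(:), B(:)
  integer :: i, irep, c0, c1, crate
  real(kind=8) :: t, mwords

  allocate(A(n), B(n))
  call random_number(A)

  call system_clock(c0, crate)
  do irep = 1, nrep
    do i = 1, n
      B(i) = A(i)
    end do
    A(1) = A(1) + B(n)     ! keep the compiler from dropping the copy
  end do
  call system_clock(c1)

  t = real(c1 - c0, 8) / real(crate, 8)
  mwords = real(nrep, 8) * real(n, 8) / 1.0d6
  print *, 'copy rate:', mwords / t, 'Mword/s =', 8.0d0 * mwords / t, 'MB/s'
end program copy_rate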

Copy: Data in Memory (figure)

Copy: Prefetch (figure)

Copy: Data in Cache (figure)

Global Reorder
a1( *, *, i1g ; i2g ) -> a2( *, *, i2g ; i1g )
Global reorder on np processors in np steps (figure: processors p0, p1, p2 exchanging blocks in steps 0, 1, 2).
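
A sketch of this exchange as MPI code, under the assumption that the two local indices of each block are flattened into one dimension of length nblk and that each task sends block i1g = k to task k while receiving its block i2g = k from task k; the routine name, the cyclic partner schedule and the double-precision type are illustrative. In the hybrid version the same pattern runs over the nk nodes instead of the np MPI tasks.

! sketch: global reorder a1(:,k) -> a2(:,k) over np tasks in np steps
subroutine global_reorder(a1, a2, nblk, np, me, comm)
  implicit none
  include 'mpif.h'
  integer :: nblk, np, me, comm        ! block length, #tasks, own rank, communicator
  real(kind=8) :: a1(nblk, 0:np-1)     ! a1(i1l,i2l, i1g) held for own i2g = me
  real(kind=8) :: a2(nblk, 0:np-1)     ! a2(i1l,i2l, i2g) held for own i1g = me
  integer :: step, sendto, recvfrom, ierr
  integer :: status(MPI_STATUS_SIZE)

  do step = 0, np - 1
    sendto   = mod(me + step, np)          ! task that needs my block i1g = sendto
    recvfrom = mod(me - step + np, np)     ! task from which I get my i2g = recvfrom block
    if (step == 0) then
      a2(:, me) = a1(:, me)                ! own block: local copy, no communication
    else
      call MPI_SENDRECV(a1(:, sendto),   nblk, MPI_DOUBLE_PRECISION, sendto,   0, &
                        a2(:, recvfrom), nblk, MPI_DOUBLE_PRECISION, recvfrom, 0, &
                        comm, status, ierr)
    end if
  end do
end subroutine global_reorder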

Performance Modelling
- hardware model: nk nodes with kp procs each; np = nk * kp is the total proc count
- switch model: nk concurrent links between nodes; latency tlat, bandwidth c
- execution model for Hybrid: reorder on nk nodes, nk steps with n1*n2 / nk**2 data per node
- execution model for MPI: reorder on np processors, np steps with n1*n2 / np**2 data per node; switch links shared between kp procs

Performance Modelling
Hybrid timing model and MPI timing model (shown as formulas in the slide figure; see the reconstruction below).
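
The two timing formulas were images on the slide and did not survive into this transcript. The following is only a plausible reconstruction from the execution model on the previous slide (latency tlat, link bandwidth c, matrix size n1*n2, np = nk*kp), not the author's exact formulas:

\begin{align*}
  t_{\mathrm{hybrid}} &\approx nk \left( t_{\mathrm{lat}} + \frac{n_1 n_2}{nk^2\, c} \right)
                       = nk\, t_{\mathrm{lat}} + \frac{n_1 n_2}{nk\, c} \\
  t_{\mathrm{MPI}}    &\approx np \left( t_{\mathrm{lat}} + \frac{kp\, n_1 n_2}{np^2\, c} \right)
                       = np\, t_{\mathrm{lat}} + \frac{n_1 n_2}{nk\, c}
\end{align*}

In this reading both models move the same data volume per node, and the difference lies in the latency term: pure MPI pays tlat np = nk*kp times instead of nk times, consistent with the later summary that hybrid programming has lower communication latency.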

Timing of Global Reorder, internode part (figure)

Timing of Global Reorder, internode part (figure)

Timing of Global Reorder (figure)

Timing of Transpose (figure)

Scaling of Transpose (figure)

Timing of Transpose Steps (figure)

Summary of Results: Hardware
Memory access in the Winterhawk2 is not adequate:
- copy rate of 400 MB/s = 50 Mwords/s vs. a peak CPU rate of 6000 Mflop/s per node
- i.e. about a factor of 100 between computational speed and memory speed
Sharing of the switch link by 4 processors degrades communication speed:
- bandwidth smaller by more than a factor of 4 (factor of 4 expected)
- latency larger by nearly a factor of 4 (factor of 1 expected)

Summary of Results: Hybrid vs. MPI
Hybrid OpenMP / MPI programming is profitable for the distributed matrix transpose:
- 1000 x 1000 matrix on 16 nodes: 2.3 times faster
- ... x ... matrix on 16 nodes: 1.1 times faster
Competing influences:
- pure MPI programming enhances the reuse of cached data
- hybrid programming has lower communication latency and coarser communication granularity

Summary of Results: Use of Transpose in FFT
2-dim complex array of size ...
Execution time on nk nodes: (formula in the slide figure; see the reconstruction below)
where r: computational speed per node, c: transpose speed per node
Effective execution speed per node: (formula in the slide figure)
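
The execution-time and effective-speed formulas were likewise images. Purely as an assumed form, consistent with the definitions of r and c above, a standard model for a transpose-based parallel FFT of n complex points (5 n log2 n floating-point operations, n words moved by the transpose, both spread over the nk nodes) would be:

\begin{align*}
  t(nk) &\approx \frac{5\, n \log_2 n}{nk\, r} + \frac{n}{nk\, c} \\
  r_{\mathrm{eff}} &= \frac{5\, n \log_2 n}{nk\, t(nk)}
                    = \frac{r}{1 + r / (5\, c \log_2 n)}
\end{align*}

Under such a form the effective speed per node approaches r only when c log2 n is large compared with r, which is why the example on the next slide quotes different effective speeds for the hybrid and the MPI values of c.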

Summary of Results: Use of Transpose in FFT, Example
SP: r = 4 * 200 Mflop/s = 800 Mflop/s
c depends on n, nk and the programming model
For nk = 16 and n = 10**6, 10**9:
- hybrid: c = ... Mword/s
- MPI: c = ... Mword/s
- effective execution speed per node: hybrid = ... Mflop/s, MPI = ... Mflop/s