Load Balancing Hybrid Programming Models for SMP Clusters and Fully Permutable Loops
Nikolaos Drosinos and Nectarios Koziris
National Technical University of Athens, Computing Systems Laboratory
Oslo, June 15, 2005 (ICPP-HPSEC)

Motivation
- fully permutable loops remain a computational challenge for HPC
- hybrid parallelization is attractive for DSM architectures
- currently, popular free message passing libraries provide only limited multi-threading support
- SPMD hybrid parallelization suffers from intrinsic load imbalance
Contribution
- two static thread load balancing schemes (constant and variable) for coarse-grain funneled hybrid parallelization of fully permutable loops
  - generic
  - simple to implement
- experimental evaluation against micro-kernel benchmarks for different programming models
  - message passing
  - fine-grain hybrid
  - coarse-grain hybrid (unbalanced, balanced)
Algorithmic model

foracross tile_1 do
  …
  foracross tile_{n-1} do
    for tile_n do
      Receive(tile);
      Compute(A, tile);
      Send(tile);

Restrictions:
- fully permutable loops
- unitary inter-process dependencies
Message passing parallelization
- tiling transformation
- (overlapped?) computation and communication phases
- pipelined execution
- portable
- scalable
- highly optimized
Hybrid parallelization
So… why bother?
Hybrid parallelization: why bother I
- on a shared memory architecture, the shared memory programming model competes with the message passing programming model
Hybrid parallelization: why bother II
- DSM architectures are popular!
Fine-grain hybrid parallelization
- incremental parallelization of computational loops
- relatively easy to implement
- popular
- Amdahl's law restricts parallel efficiency
- overhead of thread structure re-initialization
- restrictive programming model for many applications
Coarse-grain hybrid parallelization
- generic SPMD programming style
- good parallelization efficiency
- no thread re-initialization overhead
- more difficult to implement
- intrinsic load imbalance, assuming the common funneled thread support level
MPI thread support levels
- single
- masteronly (fine-grain hybrid: alternating communication and computation phases)
- funneled (coarse-grain hybrid: communication funneled through the master thread)
- serialized
- multiple
Load balancing
- Idea / Consequence: the master thread assumes a smaller fraction of the process tile computational load compared to the other threads
Load balancing (2)
- T … total number of threads
- p … current process id
Load balancing (3)
Experimental results
- 8-node dual SMP Linux cluster (800 MHz Pentium III, 256 MB RAM)
- MPICH ( --with-device=ch_p4, --with-comm=shared, P4_SOCKBUFSIZE=104KB )
- Intel C++ compiler 8.1 ( -O3 -static -mcpu=pentiumpro )
- Fast Ethernet interconnection network
Alternating Direction Implicit (ADI)
- stencil computation used for solving partial differential equations
- unitary data dependencies
- 3D iteration space (X × Y × Z)
ADI (results)
Synthetic benchmark (results)
Conclusions
- fine-grain hybrid parallelization is inefficient
- unbalanced coarse-grain hybrid parallelization is also inefficient
- balancing improves hybrid model performance
- the variable balanced coarse-grain hybrid model is the most efficient approach overall
- the relative performance improvement grows as communication needs increase relative to computation
Thank You! Questions?