Scheduling of Tiled Nested Loops onto a Cluster with a Fixed Number of SMP Nodes. Maria Athanasaki, Evangelos Koukis, Nectarios Koziris, National Technical University of Athens.

Presentation transcript:

Scheduling of Tiled Nested Loops onto a Cluster with a Fixed Number of SMP Nodes
Maria Athanasaki, Evangelos Koukis, Nectarios Koziris
National Technical University of Athens, School of Electrical and Computer Engineering, Computing Systems Laboratory

National Technical University of Athens Computing Systems Laboratory PDP 2004 Scheduling of Tiled Nested Loops onto a Cluster with a Fixed Number of SMP Nodes

Previous work
- M. Athanasaki, A. Sotiropoulos, G. Tsoukalas, N. Koziris, "Pipelined Scheduling of Tiled Nested Loops onto Clusters of SMPs using Memory Mapped Network Interfaces", SuperComputing Conference on High Performance Networking and Computing (SC2002), Baltimore, Maryland, November 16-22.
- G. Goumas, A. Sotiropoulos and N. Koziris, "Minimizing Completion Time for Loop Tiling with Computation and Communication Overlapping", Proceedings of the 2001 International Parallel and Distributed Processing Symposium (IPDPS2001), IEEE Press, San Francisco, California, April 2001.

Overview
- Tiling for parallelization
- Non-overlapping vs. overlapping execution scheme
- Grouping
- Application on a cluster of SMPs with a fixed number of nodes
- Experimental and simulation results

Nested For-Loops
for (i_1 = l_1; i_1 <= u_1; i_1++)
  for (i_2 = l_2; i_2 <= u_2; i_2++)
    ...
      for (i_n = l_n; i_n <= u_n; i_n++) {
        Loop Body
      }

Dependence Vectors
[Figure: iteration space with axes i_1 and i_2, showing the dependence vectors]
for (i_1 = 0; i_1 <= 7; i_1++)
  for (i_2 = 0; i_2 <= 7; i_2++)
    A[i_1][i_2] = A[i_1-1][i_2] + A[i_1][i_2-1];

Tiling
[Figure: iteration space with axes i_1 and i_2, partitioned into tiles]

Tiling
[Figure: tiled iteration space with axes i_1 and i_2; tiles assigned to Processor 0 and Processor 1]
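As a rough illustration of the tiling step pictured above, the following C sketch runs the two-deep loop from the Dependence Vectors slide once untiled and once tiled into rectangular tiles, and checks that both orders give the same array. The tile edge TILE, the border initialization, the shift of the loop bounds to 1..N (so that no index is negative) and all function names are assumptions made for this example, not taken from the slides.

```c
#include <assert.h>

#define N 8      /* iteration space is N x N, as in the 8 x 8 example */
#define TILE 4   /* hypothetical tile edge; must divide N in this sketch */

/* Hypothetical boundary setup: row 0 and column 0 hold 1, so the loops
   below can start at index 1 and never read out of bounds. */
static void init_borders(long a[N + 1][N + 1]) {
    for (int i = 0; i <= N; i++) {
        a[i][0] = 1;
        a[0][i] = 1;
    }
}

/* Original (untiled) loop nest with dependences on [i1-1][i2] and [i1][i2-1]. */
static void run_untiled(long a[N + 1][N + 1]) {
    init_borders(a);
    for (int i1 = 1; i1 <= N; i1++)
        for (int i2 = 1; i2 <= N; i2++)
            a[i1][i2] = a[i1 - 1][i2] + a[i1][i2 - 1];
}

/* Tiled loop nest: the two outer loops walk over tiles, the two inner
   loops walk over the points of one tile. Both dependence vectors are
   non-negative, so executing tiles in this order respects them. */
static void run_tiled(long a[N + 1][N + 1]) {
    assert(N % TILE == 0);
    init_borders(a);
    for (int t1 = 0; t1 < N / TILE; t1++)
        for (int t2 = 0; t2 < N / TILE; t2++)
            for (int i1 = t1 * TILE + 1; i1 <= (t1 + 1) * TILE; i1++)
                for (int i2 = t2 * TILE + 1; i2 <= (t2 + 1) * TILE; i2++)
                    a[i1][i2] = a[i1 - 1][i2] + a[i1][i2 - 1];
}

/* Returns 1 if the tiled and untiled orders produce identical arrays. */
static int tiled_matches_untiled(void) {
    long u[N + 1][N + 1], t[N + 1][N + 1];
    run_untiled(u);
    run_tiled(t);
    for (int i = 0; i <= N; i++)
        for (int j = 0; j <= N; j++)
            if (u[i][j] != t[i][j])
                return 0;
    return 1;
}
```

In a parallel setting each tile (or column of tiles) would be handed to a processor; here the tiles simply run one after another, which is enough to show that the reordering preserves the result.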

Non-Overlapping Scheme
[Figure: tiled iteration space with axes i_1 and i_2; Processor 0, Processor 1, Processor 2]

Non-Overlapping vs. Overlapping Scheme
[Figure: two timing diagrams, each over processors P0, P1, P2, P3]

Overlapping Scheme
[Figure: tiled iteration space with axes i_1 and i_2; Processor 0, Processor 1, Processor 2]

Generalization to SMPs - “Grouping”
[Figure: four SMP nodes (SMP0 to SMP3), each with CPU0 and CPU1]

Example: Grouping + Non-Overlapping Communication Scheme
[Figure: tile space and group space; SMP node0, SMP node1]
Scheduling vector Π = (1,0)

Example: Grouping + Overlapping Communication Scheme
[Figure: tile space and group space; SMP node0, SMP node1]
Scheduling vector Π = (1,1)

Scheduling onto a Fixed Number of SMPs
- Dynamic scheduling by the operating system
  - Run-time overhead from creating many processes
  - Context switching slows down execution
- Static scheduling at compile time

Scheduling onto a Fixed Number of SMPs
- Cyclic Assignment Schedule
- Mirror Assignment Schedule
- Cluster Assignment Schedule
- Retiling

Cyclic Assignment
[Figure: cyclic assignment on 2 SMP nodes with 2 CPUs each; SMP0 and SMP1, each with CPU0 and CPU1]

Cyclic Assignment
[Figure: cyclic assignment on 2 SMP nodes with 2 CPUs each; the SMP0, SMP1 pattern repeats, one repetition labelled "chunk"]

Cyclic Assignment - Non-Overlapping Communication
[Figure: time diagram (t) of cyclic assignment on 2 SMP nodes with 2 CPUs each]

Cyclic Assignment - Overlapping Communication
[Figure: time diagram (t) of cyclic assignment on 2 SMP nodes with 2 CPUs each]

Cyclic Assignment - Communication
[Figure: communication under cyclic assignment on 2 SMP nodes with 2 CPUs each; the SMP0, SMP1 pattern repeats, one repetition labelled "chunk"]
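The cyclic pattern in these figures can be sketched as a round-robin mapping. The C fragment below is only an illustration under assumptions: it takes tile columns numbered from 0 along the mapping dimension and a flat CPU numbering across the SMP nodes, and the function names (cyclic_owner, chunk_of, smp_of, local_cpu_of) are hypothetical.

```c
#include <assert.h>

/* Global CPU index (0 .. total_cpus-1) that owns tile column `col`
   under a cyclic (round-robin) assignment. */
static int cyclic_owner(int col, int total_cpus) {
    assert(col >= 0 && total_cpus > 0);
    return col % total_cpus;
}

/* Index of the chunk (one full round over all CPUs) containing `col`. */
static int chunk_of(int col, int total_cpus) {
    assert(col >= 0 && total_cpus > 0);
    return col / total_cpus;
}

/* SMP node that hosts global CPU `cpu`, assuming CPUs are numbered
   node by node. */
static int smp_of(int cpu, int cpus_per_node) {
    assert(cpu >= 0 && cpus_per_node > 0);
    return cpu / cpus_per_node;
}

/* CPU index inside its SMP node (CPU0, CPU1, ...). */
static int local_cpu_of(int cpu, int cpus_per_node) {
    assert(cpu >= 0 && cpus_per_node > 0);
    return cpu % cpus_per_node;
}
```

With 2 SMP nodes of 2 CPUs each (4 CPUs in total), column 5 falls in chunk 1 and maps to global CPU 1, that is CPU1 of SMP0, while column 6 maps to CPU0 of SMP1.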

Mirror Assignment
[Figure: mirror assignment on 2 SMP nodes with 2 CPUs each; node order SMP0, SMP1, SMP1, SMP0, one group labelled "chunk"]

Mirror Assignment - Non-Overlapping Communication
[Figure: time diagram (t) of mirror assignment on 2 SMP nodes with 2 CPUs each]

Mirror Assignment - Overlapping Communication
[Figure: time diagram (t) of mirror assignment on 2 SMP nodes with 2 CPUs each]

Mirror Assignment - Communication
[Figure: communication under mirror assignment on 2 SMP nodes with 2 CPUs each; node order SMP0, SMP1, SMP1, SMP0]
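One way to read the mirrored node order in these figures (SMP0, SMP1, SMP1, SMP0) is that every second chunk assigns the CPUs in reverse. The C sketch below encodes that reading; the column numbering, the flat CPU numbering and the name mirror_owner are assumptions for illustration, not details given on the slides.

```c
#include <assert.h>

/* Global CPU index that owns tile column `col` under a mirror
   assignment: even-numbered chunks run 0, 1, ..., P-1 and odd-numbered
   chunks run P-1, ..., 1, 0 (P = total_cpus). */
static int mirror_owner(int col, int total_cpus) {
    assert(col >= 0 && total_cpus > 0);
    int chunk = col / total_cpus;   /* which round over the CPUs */
    int pos = col % total_cpus;     /* position inside the round */
    return (chunk % 2 == 0) ? pos : total_cpus - 1 - pos;
}
```

In this sketch the two columns on either side of a chunk boundary (for example columns 3 and 4 with 4 CPUs) land on the same CPU, so data crossing that boundary needs no message.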

Cluster Assignment
[Figure: cluster assignment on 2 SMP nodes with 2 CPUs each; SMP0 and SMP1, each with CPU0 and CPU1; labels: tiles, “TILE”]

Cluster Assignment
[Figure: cluster assignment on 2 SMP nodes with 2 CPUs each; labels: TILES, GROUPS]

Cluster Assignment - Non-Overlapping Communication
[Figure: time diagram (t) of cluster assignment on 2 SMP nodes with 2 CPUs each]

Cluster Assignment - Overlapping Communication
[Figure: time diagram (t) of cluster assignment on 2 SMP nodes with 2 CPUs each]

Cluster Assignment - Communication
[Figure: communication under cluster assignment on 2 SMP nodes with 2 CPUs each; labels: TILES, GROUPS]
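Cluster assignment can be sketched as a block mapping in which each CPU receives a run of neighbouring tile columns. The C fragment below is a minimal illustration under assumptions: columns are numbered from 0, blocks are of equal size (rounded up), and the name cluster_owner and its parameters are hypothetical.

```c
#include <assert.h>

/* Global CPU index that owns tile column `col` when `num_cols` columns
   are split into contiguous blocks, one block per CPU. The block size
   is rounded up, so the last CPU may receive fewer columns. */
static int cluster_owner(int col, int num_cols, int total_cpus) {
    assert(total_cpus > 0 && num_cols > 0);
    assert(col >= 0 && col < num_cols);
    int per_cpu = (num_cols + total_cpus - 1) / total_cpus;
    return col / per_cpu;
}
```

With 8 columns and 4 CPUs, columns 0 and 1 go to CPU 0, columns 2 and 3 to CPU 1, and so on. Only the columns at block borders exchange data with another CPU, while a CPU further down the line waits longer for its first input.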

Retiling
[Figure: retiling on 2 SMP nodes with 2 CPUs each; SMP0 and SMP1, each with CPU0 and CPU1; old tiles and new tiles]

Retiling
[Figure: retiling on 2 SMP nodes with 2 CPUs each; old tiles and new tiles, retaining the computation volume of a tile]

Retiling - Non-Overlapping Communication
[Figure: time diagram (t) of retiling on 2 SMP nodes with 2 CPUs each]

Retiling - Overlapping Communication
[Figure: time diagram (t) of retiling on 2 SMP nodes with 2 CPUs each]

Retiling - Communication
[Figure: communication under retiling on 2 SMP nodes with 2 CPUs each]
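The slides state only that the new tiles retain the computation volume of a tile. One possible reading, sketched below in C for a 2-D rectangular tile, is to stretch the tile along the mapping dimension until there are exactly as many tile columns as CPUs, and to shrink it by the same factor along the other dimension. The tile_shape type, the retile function and the divisibility requirements are all assumptions of this sketch, not the authors' stated method.

```c
#include <assert.h>

/* Hypothetical 2-D tile shape: `width` lies along the dimension that is
   mapped onto CPUs, `height` along the other dimension. */
typedef struct {
    int width;
    int height;
} tile_shape;

/* Reshape tiles so that the tile space has exactly `total_cpus` columns
   instead of `num_cols`, keeping width * height (the number of
   iterations per tile) unchanged. This sketch requires exact
   divisibility. */
static tile_shape retile(tile_shape old, int num_cols, int total_cpus) {
    assert(total_cpus > 0 && num_cols > 0 && num_cols % total_cpus == 0);
    int factor = num_cols / total_cpus;
    assert(old.height % factor == 0);
    tile_shape t = { old.width * factor, old.height / factor };
    return t;
}
```

For example, a 4 x 8 tile in a tile space of 8 columns becomes an 8 x 4 tile on 4 CPUs: the volume stays at 32 iterations, but the shape changes, which is the cache-related drawback the summary slide attributes to retiling.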

Experimental Platform
- Linux SMP (Symmetric Multi-Processor) cluster
  - 2 nodes
  - 1GB RAM
  - 2 Pentium III 1266MHz
- Myrinet high-performance interconnect
- GM low-level message passing system

The Myrinet interconnect
- User-level networking
  - Based on the GM message passing interface
  - All message exchange uses DMA
  - Directly to/from pinned userspace buffers
  - Communication is offloaded to the NIC
- Programmable NIC
  - LANai RISC MHz
  - 2-8MB SRAM
- 2+2Gbps full-duplex fiber links

GM Architecture
- Comprises three main parts
  - User library
  - Kernel driver
  - Firmware on the NIC
- OS bypass design
- Regions of NIC memory are mapped into the virtual memory (VM) of a process
[Figure: Application and GM library (user), GM kernel module (kernel), GM firmware (NIC)]

Sending and Receiving Messages over Myrinet/GM
[Figure: sending application on the host with send queue, buffer and event queue; receiving application on the host with receive queue, buffer and event queue; each NIC with host DMA, send DMA, receive DMA and LANai]

Initial Code
for (i = 1; i <= X; i++)
  for (j = 1; j <= Y; j++)
    for (k = 1; k <= Z; k++) {
      A[i][j][k] = func(A[i-1][j][k], A[i][j-1][k], A[i][j][k-1]);
    }
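To make the dependence structure of this loop concrete, the C sketch below runs it in the original order and in wavefront order (all points with the same i + j + k handled together) and checks that the results agree. The slides do not define func, the problem sizes or the boundary values, so the stand-in func (a plain sum), the sizes X, Y, Z and the border value 1 are assumptions chosen only to make the example checkable.

```c
#include <assert.h>

#define X 4   /* hypothetical problem sizes */
#define Y 4
#define Z 4

typedef long grid[X + 1][Y + 1][Z + 1];

/* Hypothetical stand-in for the slide's unspecified func(). */
static long func(long a, long b, long c) { return a + b + c; }

/* Hypothetical boundary values: every cell starts as 1; cells with
   i, j, k >= 1 are overwritten by the loops below. */
static void init(grid a) {
    for (int i = 0; i <= X; i++)
        for (int j = 0; j <= Y; j++)
            for (int k = 0; k <= Z; k++)
                a[i][j][k] = 1;
}

/* The loop nest in its original order. */
static void run_sequential(grid a) {
    init(a);
    for (int i = 1; i <= X; i++)
        for (int j = 1; j <= Y; j++)
            for (int k = 1; k <= Z; k++)
                a[i][j][k] = func(a[i - 1][j][k], a[i][j - 1][k], a[i][j][k - 1]);
}

/* Wavefront order: points on the plane i + j + k = s read only from the
   plane s - 1 (or the border), so points of one plane are independent. */
static void run_wavefront(grid a) {
    init(a);
    for (int s = 3; s <= X + Y + Z; s++)
        for (int i = 1; i <= X; i++)
            for (int j = 1; j <= Y; j++) {
                int k = s - i - j;
                if (k >= 1 && k <= Z)
                    a[i][j][k] = func(a[i - 1][j][k], a[i][j - 1][k], a[i][j][k - 1]);
            }
}

/* Returns 1 if two grids hold identical values. */
static int same(grid a, grid b) {
    for (int i = 0; i <= X; i++)
        for (int j = 0; j <= Y; j++)
            for (int k = 0; k <= Z; k++)
                if (a[i][j][k] != b[i][j][k])
                    return 0;
    return 1;
}
```

The independence of points within one plane is the property that pipelined, tile-level schedules rely on; here the planes are still executed by a single thread.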

Experimental results
[Figure: two plots of speedup / number of processors versus height of iteration space, one for the non-overlapping execution scheme and one for the overlapping execution scheme; curves: cyclic, mirror, cluster, retile]

Simulation results
[Figure: two plots of speedup / number of processors versus height of iteration space, one for the overlapping execution scheme and one for the non-overlapping execution scheme; curves: cyclic, mirror, cluster, retile]

Simulation results
[Figure: two plots of speedup / number of processors versus height of iteration space, one for the non-overlapping execution scheme and one for the overlapping execution scheme; curves: cyclic, mirror, cluster, retile]

Advantages - Disadvantages
(+ advantage, - disadvantage)
cyclic
  + fast pipeline filling
  - communication overhead
mirror
  + better communication than cyclic
  - idle time steps
  - worse communication than cluster and retile
cluster
  + communication: 1) small volume of data to be transferred, 2) data combined into fewer messages
  - slow pipeline filling
retile
  + fast pipeline filling
  + communication: small volume of data to be transferred
  - reorganizes tiles, which annuls the optimal tile shape for cache hits

The End

Cyclic Assignment - Overlapping Communication
[Figure: equivalent schedulings on SMP0 and SMP1 (CPU0 and CPU1 each), drawn on axes P and t: scheduling on an unlimited number of processors versus scheduling on a fixed number of processors; annotation: empty pipeline waiting for the necessary data to become available]