Computer Science 320 Load Balancing

Behavior of a Parallel Program: Why do three threads take longer than two?

Get the running time of each thread. Ideally, each one should run for T / K, where T is the total running time and K is the number of threads. Only when K = 2 is this the case: the threads assigned to the middle rows of the matrix always do more work.
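
A minimal sketch of how such per-thread times can be collected with the Parallel Java (PJ) library. The ParallelTeam, ParallelRegion, and IntegerForLoop classes and the per-thread start()/finish() hooks are PJ's standard API; the timing itself is our own illustration, not code from the course:

    import edu.rit.pj.IntegerForLoop;
    import edu.rit.pj.ParallelRegion;
    import edu.rit.pj.ParallelTeam;

    public class ThreadTimes {
        public static void main(String[] args) throws Exception {
            new ParallelTeam().execute(new ParallelRegion() {
                public void run() throws Exception {
                    execute(0, 399, new IntegerForLoop() {
                        long t0;
                        public void start() {
                            // Called once in each thread, before its iterations.
                            t0 = System.currentTimeMillis();
                        }
                        public void run(int first, int last) {
                            for (int r = first; r <= last; ++r) {
                                // Compute row r here; the work varies by row.
                            }
                        }
                        public void finish() {
                            // Called once in each thread, after its iterations.
                            System.out.println("thread time = "
                                + (System.currentTimeMillis() - t0) + " ms");
                        }
                    });
                }
            });
        }
    }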

The Cause of the Problem: Each thread is assigned the same number of outer-loop iterations (rows), yet Thread 1's iterations take more time. Why?
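
The cost per row varies because of the Mandelbrot escape-time loop: it runs until the point diverges or an iteration cap is reached, so pixels in or near the set (concentrated in the middle rows) cost far more. A generic sketch; maxIter and the coordinate arguments are illustrative, not the course's exact code:

    // Iterate z = z^2 + c until |z| > 2 or the cap is hit.
    // Points in or near the set use many more iterations.
    static int escapeTime(double cx, double cy, int maxIter) {
        double x = 0.0, y = 0.0;
        int i = 0;
        while (i < maxIter && x * x + y * y <= 4.0) {
            double xt = x * x - y * y + cx;
            y = 2.0 * x * y + cy;
            x = xt;
            ++i;
        }
        return i;
    }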

Load Balance: the extent to which each processor (thread) does the same amount of work. If the work is the same, we have a balanced load; if not, we have an unbalanced load.

Quantifying Load Balance: B = T_p(K) / (T_p(1) / K) = K * T_p(K) / T_p(1)
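
Here T_p(K) is the measured running time on K threads, and T_p(1) / K is the ideal time if the work split perfectly. A worked example with made-up numbers: if T_p(1) = 12 s and K = 3, the ideal time is 12 / 3 = 4 s; if the measured T_p(3) = 6 s, then B = 6 / 4 = 1.5. B = 1 means a perfectly balanced load, and larger values mean worse imbalance.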

Achieving a Balanced Load: If some rows take more work than others, the rows cannot simply be divided evenly among the threads. Some threads must get more rows; ideally, the extra rows are the ones with the fewest points in the Mandelbrot set.

PJ Resources for Load Balancing: Parallel Java can divide the iterations of a parallel for loop unevenly among the threads by setting the loop's schedule. Different schedules can be set in the program code or via the command line.

Fixed Schedule: By default, each thread gets one chunk of loop iterations, and every chunk is the same size. With 100 iterations and 4 threads, each thread gets a contiguous chunk of 25 iterations.
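
A few lines of plain Java (not PJ) make the fixed partition concrete by printing the bounds each thread would receive; since the slide describes this as the default, it corresponds to PJ's IntegerSchedule.fixed():

    public class FixedChunks {
        public static void main(String[] args) {
            int n = 100, k = 4;        // 100 iterations, 4 threads
            int size = n / k;          // one equal chunk per thread (n divisible by k here)
            for (int t = 0; t < k; ++t) {
                System.out.println("thread " + t + ": iterations "
                    + (t * size) + " .. " + ((t + 1) * size - 1));
            }
        }
    }

Running it prints "thread 0: iterations 0 .. 24" through "thread 3: iterations 75 .. 99".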

Dynamic Schedule: All chunks are the same size, but there can be many more chunks than threads. When a thread's run method completes, it is called again on a different chunk. The default chunk size is 1 iteration, but it can be made larger; with 100 iterations and a chunk size of 5, for example, there are 20 chunks handed out to threads on demand.
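
In code, a dynamic schedule is requested with the same override pattern shown on the final slide below; IntegerSchedule.dynamic is PJ's API, and the chunk size 5 matches the example above:

    new IntegerForLoop() {
        public IntegerSchedule schedule() {
            // 100 iterations / chunk size 5 = 20 chunks, claimed on demand
            return IntegerSchedule.dynamic(5);
        }
        ...
    }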

Guided Schedule: Chunk sizes decrease as the loop proceeds; each chunk is the number of remaining iterations divided by 2 * K, with a default minimum chunk size of 1. When a thread's run method completes, it is called again on a different chunk. [Slide figures illustrating the chunk pattern for different values of K are garbled in the transcript.]
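
The shrinking chunk sizes can be simulated in a few lines of plain Java under the rule stated above (each chunk is the remaining iteration count divided by 2 * K, minimum 1). For 100 iterations and K = 2 this prints 25 18 14 10 8 6 4 ..., tailing off to chunks of 1:

    public class GuidedChunks {
        public static void main(String[] args) {
            int remaining = 100, k = 2;    // 100 iterations, K = 2 threads
            while (remaining > 0) {
                int chunk = Math.max(1, remaining / (2 * k));
                System.out.print(chunk + " ");
                remaining -= chunk;
            }
            System.out.println();
        }
    }

In PJ, the corresponding schedule is IntegerSchedule.guided().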

On the Command Line or in Code

On the command line, set the pj.schedule property:

    java -Dpj.schedule=guided

In the program code, override the parallel for loop's schedule method:

    new IntegerForLoop() {
        public IntegerSchedule schedule() {
            return IntegerSchedule.guided();
        }
        ...
    }
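
Putting the pieces together, a minimal sketch of a complete PJ program with a schedule override. The class and method names are PJ's standard API; one assumption about the library's behavior worth flagging is that the command-line property above takes effect when schedule() returns IntegerSchedule.runtime():

    import edu.rit.pj.IntegerForLoop;
    import edu.rit.pj.IntegerSchedule;
    import edu.rit.pj.ParallelRegion;
    import edu.rit.pj.ParallelTeam;

    public class BalancedLoop {
        public static void main(String[] args) throws Exception {
            new ParallelTeam().execute(new ParallelRegion() {
                public void run() throws Exception {
                    execute(0, 99, new IntegerForLoop() {
                        public IntegerSchedule schedule() {
                            // Defer to -Dpj.schedule given on the java command line.
                            return IntegerSchedule.runtime();
                        }
                        public void run(int first, int last) {
                            for (int i = first; i <= last; ++i) {
                                // Per-iteration work goes here.
                            }
                        }
                    });
                }
            });
        }
    }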