Programming for Performance Laxmikant Kale CS 433

Causes of performance loss
If each processor is rated at k MFLOPS and there are p processors, why don't we see k*p MFLOPS of performance?
– There are several causes.
– Each must be understood separately,
– but they interact with each other in complex ways:
  a solution to one problem may create another;
  one problem may mask another, which then manifests itself under other conditions (e.g., increased p).

Causes
– Sequential inefficiencies: cache performance
– Communication overhead
– Algorithmic overhead ("extra work")
– Speculative work
– Load imbalance
– (Long) critical paths
– Bottlenecks

Algorithmic overhead
Parallel algorithms may have a higher operation count than the best sequential algorithm.
Example: parallel prefix (also called "scan"). How do we parallelize this sequential loop?

    B[0] = A[0];
    for (i = 1; i < N; i++)
        B[i] = B[i-1] + A[i];

Parallel prefix: continued
How do we do this operation in parallel?
– It seems inherently sequential.
– Recursive doubling algorithm; operation count: N * log(P).
A better algorithm (sketched below):
– Take the blocking of the data into account.
– Each processor computes the sum of its block, participates in a parallel algorithm to obtain the sum of all blocks to its left, and then adds that offset to each of its elements.
– Operation count: N + log(P) + N, i.e., roughly a doubling of the sequential operation count.
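As an illustration of the block-based algorithm above, here is a minimal sketch in C with MPI (not from the original slides; the function name and data layout are assumed). MPI_Exscan supplies the "sum of everything to my left" step in about log(P) parallel steps:

    /* Hedged sketch: block-based parallel prefix sum.
     * Each rank holds n local elements in a[]; results go to b[]. */
    #include <mpi.h>

    void block_prefix_sum(const double *a, double *b, int n, MPI_Comm comm)
    {
        /* 1. Local prefix sum over this rank's block (N work in total). */
        double local_sum = 0.0;
        for (int i = 0; i < n; i++) {
            local_sum += a[i];
            b[i] = local_sum;
        }

        /* 2. Exclusive scan over block sums: log(P) parallel steps. */
        double offset = 0.0;
        MPI_Exscan(&local_sum, &offset, 1, MPI_DOUBLE, MPI_SUM, comm);

        /* 3. MPI_Exscan leaves rank 0's result undefined; treat it as 0. */
        int rank;
        MPI_Comm_rank(comm, &rank);
        if (rank == 0) offset = 0.0;

        /* 4. Add the offset to every local element (another N work). */
        for (int i = 0; i < n; i++)
            b[i] += offset;
    }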

Bottlenecks
Consider the "primes" program (or the "pi" program):
– What happens when we run it on 1000 PEs?
How to eliminate bottlenecks:
– Two structures are useful in most such cases:
  spanning trees: organize the processors into a tree;
  hypercube-based dimensional exchange (sketched below).
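For concreteness, a minimal sketch (assumed code, not from the slides) of hypercube-based dimensional exchange for combining per-PE counts, assuming MPI and a power-of-two number of ranks:

    /* Hedged sketch: dimensional-exchange all-reduce of a per-PE count.
     * In log(P) steps, each rank exchanges partial sums with the partner
     * whose rank differs in one bit; no single processor is a bottleneck. */
    #include <mpi.h>

    long dimensional_exchange_sum(long my_count, MPI_Comm comm)
    {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);          /* assumed to be a power of two */

        long sum = my_count;
        for (int mask = 1; mask < p; mask <<= 1) {
            int partner = rank ^ mask;    /* flip one hypercube dimension */
            long theirs;
            MPI_Sendrecv(&sum, 1, MPI_LONG, partner, 0,
                         &theirs, 1, MPI_LONG, partner, 0,
                         comm, MPI_STATUS_IGNORE);
            sum += theirs;
        }
        return sum;                       /* every rank holds the global sum */
    }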

Communication overhead
Components:
– per-message and per-byte costs
– sending, receiving, and network costs
– capacity constraints
Grainsize analysis:
– How much computation is there per message?
– Computation-to-communication ratio

Communication overhead: examples
Usually, data or work must be reorganized to reduce communication.
Combining communication also helps.
Examples:

Communication overhead
Communication delay: the time interval from sending on one processor to receipt on another; for an N-byte message, time = alpha + beta*N.
Communication overhead: the time a processor is held up (both the sender and the receiver are held up); again of the form alpha + beta*N.
Typical values: alpha on the order of microseconds, beta around 2-10 ns per byte.
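An illustrative calculation with assumed values (not taken from the slides): with alpha = 10 microseconds and beta = 5 ns/byte, a 1000-byte message costs about 10 us + 5 us = 15 us; the per-message term dominates until messages reach roughly alpha/beta = 2000 bytes.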

Grainsize control
A simple definition of grainsize:
– the amount of computation per message
– Problem: this does not distinguish short messages from long ones.
A more realistic measure:
– the computation-to-communication ratio

Example: matrix multiplication
How do we parallelize this?

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            C[i][j] = 0;
            for (k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
        }

A simple algorithm
Distribute A by rows and B by columns.
– So any processor can request a row of A and get it (in two messages); the same holds for a column of B.
– Distribute the work of computing each element of C using some load-balancing scheme,
  so that it works even on machines with varying processor capabilities (e.g., timeshared clusters).
– What is the computation-to-communication ratio?
  For each object (one element of C): 2*N operations, 2 messages of N bytes each.
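A rough check with assumed machine parameters (not from the slides): for N = 1000, each element of C costs 2*N = 2000 floating-point operations, about 2 us of arithmetic on a ~1 GFLOP/s processor, while the two messages cost on the order of 2 * 10 us = 20 us in per-message overhead alone. The communication cost dwarfs the computation, so this grain is far too small.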

A better algorithm
Store A as a collection of row bunches:
– each bunch stores g rows;
– the same for B's columns.
Each object now computes a g x g section of C (see the sketch below).
Computation-to-communication ratio:
– 2*g*g*N operations
– 2 messages of g*N bytes each
– alpha ratio (operations per message): 2*g*g*N / 2 = g*g*N; beta ratio (operations per byte): g
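A minimal sketch (assumed code and data layout, not from the slides) of the work one such object does, given a bunch of g rows of A and g columns of B:

    /* Hedged sketch: one object computes a g x g block of C.
     * a_rows:  g rows of A, a_rows[r*N + k],  r in [0,g), k in [0,N)
     * b_cols:  g columns of B, b_cols[c*N + k], c in [0,g), k in [0,N)
     * c_block: the resulting g x g block of C, row-major.
     * Work: 2*g*g*N flops for 2 messages of g*N values each. */
    void compute_block(int g, int N,
                       const double *a_rows,
                       const double *b_cols,
                       double *c_block)
    {
        for (int r = 0; r < g; r++) {
            for (int c = 0; c < g; c++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++)
                    sum += a_rows[r * N + k] * b_cols[c * N + k];
                c_block[r * g + c] = sum;
            }
        }
    }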

Alpha vs. beta
The per-message cost is significantly larger than the per-byte cost:
– by a factor of several thousand.
– So several optimizations are possible that trade a larger beta cost for a smaller alpha cost,
– i.e., send fewer (but larger) messages.
Applications of this idea:
– message combining
– complex communication patterns: each-to-all, ...

Example: each-to-all communication
– Each processor wants to send a distinct N-byte message to every other processor.
– Simple implementation: cost is about alpha*P + beta*N*P per processor.
– What are typical values?
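A hedged comparison with assumed numbers (the combined scheme and the values are illustrative, not from the slides): combining messages along the log(P) dimensions of a hypercube sends only log(P) messages per processor, each carrying about N*P/2 bytes, for a cost of roughly alpha*log(P) + beta*N*(P/2)*log(P). With alpha = 10 us, beta = 5 ns/byte, N = 100 bytes, and P = 1024, the simple implementation costs about 10.24 ms + 0.5 ms = 10.7 ms, while the combined scheme costs about 0.1 ms + 2.56 ms = 2.7 ms: more bytes moved, but far fewer messages, and a net win.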

Programming for performance: steps
– Select or design a parallel algorithm.
– Decide on the decomposition.
– Select a load-balancing strategy.
– Plan the communication structure.
– Examine synchronization needs: global synchronizations, critical paths.

Design philosophy
Parallel algorithm design:
– Ensure good performance (total operation count).
– Generate sufficient parallelism.
– Avoid/minimize "extra work".
Decomposition:
– Break the computation into many small pieces: the smallest grain that still sufficiently amortizes the overhead.

Design principles, continued
Load balancing:
– Select a static, dynamic, or quasi-dynamic strategy.
– Measurement-based vs. prediction-based load estimation.
– Principle: it is better to let a processor idle than to overload one (think about why).
Reduce communication overhead:
– algorithmic reorganization (change the mapping)
– message combining
– use of efficient communication libraries

Design principles: synchronization
Eliminate unnecessary global synchronization.
– If T(i,j) is the time spent in the i-th phase on the j-th PE:
  with synchronization, total time = sum over i of ( max over j of T(i,j) );
  without it, total time = max over j of ( sum over i of T(i,j) ), which is never larger (see the snippet below).
Critical paths:
– Look for long chains of dependences.
– Draw timeline pictures with the dependences marked.
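A small illustrative snippet (assumed, not from the slides) that computes both quantities from a table of per-phase, per-PE times; it makes the inequality above easy to check on real measurements:

    /* Hedged sketch: compare total time with and without per-phase barriers.
     * T[i*pes + j] = time of phase i on PE j (phases x pes entries, row-major). */
    #include <stdio.h>

    void compare_sync_costs(int phases, int pes, const double *T)
    {
        double with_sync = 0.0;     /* sum over phases of (max over PEs) */
        for (int i = 0; i < phases; i++) {
            double phase_max = 0.0;
            for (int j = 0; j < pes; j++)
                if (T[i * pes + j] > phase_max) phase_max = T[i * pes + j];
            with_sync += phase_max;
        }

        double without_sync = 0.0;  /* max over PEs of (sum over phases) */
        for (int j = 0; j < pes; j++) {
            double pe_sum = 0.0;
            for (int i = 0; i < phases; i++)
                pe_sum += T[i * pes + j];
            if (pe_sum > without_sync) without_sync = pe_sum;
        }

        printf("with barriers: %g   without: %g\n", with_sync, without_sync);
    }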

Diagnosing performance problems
Tools:
– back-of-the-envelope (i.e., simple) analysis
– post-mortem analysis with performance logs:
  visualization of performance data,
  automatic analysis,
  phase-by-phase analysis (a program may have many phases)
What to measure:
– load distribution, (communication) overhead, idle time
– their averages, max/min, and variances
– profiling: time spent in individual modules/subroutines

Diagnostic techniques
Tell-tale signs:
– Max load >> average, and the number of PEs above the average is >> 1: load imbalance.
– Max load >> average, and the number of PEs above the average is ~1: a possible bottleneck (if there is a dependence on that PE).
– The profile shows the total time spent in a routine f increasing as the number of PEs increases: algorithmic overhead.
– Communication overhead: obvious from the measurements.
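A small hedged sketch (assumed helper, not part of the original material) that computes these indicators from per-PE load measurements:

    /* Hedged sketch: compute the tell-tale indicators from per-PE loads. */
    #include <stdio.h>

    void load_indicators(int pes, const double *load)
    {
        double sum = 0.0, max = 0.0;
        for (int j = 0; j < pes; j++) {
            sum += load[j];
            if (load[j] > max) max = load[j];
        }
        double avg = sum / pes;

        int above_avg = 0;
        for (int j = 0; j < pes; j++)
            if (load[j] > avg) above_avg++;

        /* max/avg >> 1 with many PEs above average suggests load imbalance;
         * max/avg >> 1 with only ~1 PE above average suggests a bottleneck. */
        printf("max/avg = %.2f, PEs above average = %d\n", max / avg, above_avg);
    }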