Complexity Measures for Parallel Computation

Problem parameters:
  n      index of problem size
  p      number of processors

Algorithm parameters:
  t_p    running time on p processors
  t_1    time on 1 processor = sequential time = "work"
  t_∞    time on unlimited processors = critical path length = "span"
  v      total communication volume

Performance measures:
  speedup                  s  = t_1 / t_p
  efficiency               e  = t_1 / (p * t_p) = s / p
  (potential) parallelism  pp = t_1 / t_∞
  computational intensity  q  = t_1 / v
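
As a concrete illustration (not part of the original slides), the sketch below computes these measures from hypothetical timing numbers; the values of t_1, t_p, t_∞, and v are made up purely for the example.

# Sketch: computing the performance measures from measured quantities.
# All numeric values below are hypothetical, chosen only for illustration.

def performance_measures(t1, tp, t_inf, v, p):
    """Return speedup, efficiency, potential parallelism, and intensity."""
    s = t1 / tp             # speedup                  s  = t_1 / t_p
    e = t1 / (p * tp)       # efficiency               e  = t_1 / (p*t_p) = s / p
    pp = t1 / t_inf         # potential parallelism    pp = t_1 / t_inf
    q = t1 / v              # computational intensity  q  = t_1 / v
    return s, e, pp, q

if __name__ == "__main__":
    # Example: 100 s of sequential work, 8 processors take 16 s,
    # the span is 5 s, and 50 words are communicated in total.
    s, e, pp, q = performance_measures(t1=100.0, tp=16.0, t_inf=5.0, v=50.0, p=8)
    print(f"speedup={s:.2f}  efficiency={e:.2f}  parallelism={pp:.1f}  intensity={q:.1f}")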

Several possible models!
  Execution time and parallelism: Work / Span Model
  Total cost of moving data: Communication Volume Model
  Detailed models that try to capture time for moving data:
    Latency / Bandwidth Model (for message-passing)
    Cache Memory Model (for hierarchical memory)
  Other detailed models we won't discuss: LogP, UMH, …

Work / Span Model

t_p = execution time on p processors
t_1 = work
t_∞ = span*

*Also called critical-path length or computational depth.

Work Law:  t_p ≥ t_1 / p
Span Law:  t_p ≥ t_∞
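
To make the two laws concrete, here is a small sketch (an illustration, not from the slides) that counts work and span for a balanced divide-and-conquer sum, charging unit cost per addition: for n values the work is n-1 additions and the span is roughly log2(n).

# Sketch: counting work and span of a divide-and-conquer parallel sum.
# Each addition costs 1 unit; the two halves could run in parallel,
# so the span of a node is 1 + max(span of its children).

def work_and_span(n):
    """Work and span (in unit additions) of a balanced parallel sum of n values."""
    if n <= 1:
        return 0, 0                      # a single value needs no additions
    left = n // 2
    w1, s1 = work_and_span(left)
    w2, s2 = work_and_span(n - left)
    work = w1 + w2 + 1                   # total operations add up
    span = max(s1, s2) + 1               # halves run concurrently
    return work, span

if __name__ == "__main__":
    for n in (8, 1024, 10**6):
        w, s = work_and_span(n)
        print(f"n={n:>8}  work={w}  span={s}  parallelism={w / max(s, 1):.1f}")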

Series Composition

Work: t_1(A∪B) = t_1(A) + t_1(B)
Span: t_∞(A∪B) = t_∞(A) + t_∞(B)

Parallel Composition

Work: t_1(A∪B) = t_1(A) + t_1(B)
Span: t_∞(A∪B) = max{t_∞(A), t_∞(B)}
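
The two composition rules can be applied mechanically. The sketch below (illustrative only; the sub-computations A and B and their costs are made up) represents each sub-computation by its (work, span) pair and combines pairs with the series and parallel rules.

# Sketch: combining (work, span) pairs with the series / parallel rules.

def series(a, b):
    """A followed by B: works add, spans add."""
    return (a[0] + b[0], a[1] + b[1])

def parallel(a, b):
    """A alongside B: works add, spans take the max."""
    return (a[0] + b[0], max(a[1], b[1]))

if __name__ == "__main__":
    A = (40, 10)   # hypothetical sub-computation: work 40, span 10
    B = (60, 5)    # hypothetical sub-computation: work 60, span 5
    print("series:  ", series(A, B))    # (100, 15)
    print("parallel:", parallel(A, B))  # (100, 10)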

Speedup

Def. t_1/t_p = speedup on p processors.
If t_1/t_p = Θ(p), we have linear speedup;
  if it = p, we have perfect linear speedup;
  if it > p, we have superlinear speedup
  (which is not possible in this model, because of the Work Law t_p ≥ t_1/p).

Parallelism

Because the Span Law requires t_p ≥ t_∞, the maximum possible speedup is
  t_1/t_∞ = (potential) parallelism
          = the average amount of work per step along the span.

Laws of Parallel Complexity

Work law:     t_p ≥ t_1 / p
Span law:     t_p ≥ t_∞
Amdahl's law: If a fraction f, between 0 and 1, of the work must be done
              sequentially, then speedup ≤ 1 / f.

Exercise: prove Amdahl's law from the span law.
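
As a numeric illustration (not in the original slides): with sequential fraction f, the classical Amdahl formula gives speedup 1 / (f + (1-f)/p), which approaches the 1/f bound stated above as p grows.

# Sketch: Amdahl's law. With sequential fraction f, the classical formula
# speedup(p) = 1 / (f + (1 - f)/p) tends to the bound 1/f as p -> infinity.

def amdahl_speedup(f, p):
    return 1.0 / (f + (1.0 - f) / p)

if __name__ == "__main__":
    f = 0.1                              # assume 10% of the work is sequential
    for p in (1, 10, 100, 1000, 10**6):
        print(f"p={p:>7}  speedup={amdahl_speedup(f, p):6.2f}  (bound 1/f = {1/f:.0f})")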

Communication Volume Model

Network of p processors, each with local memory; message-passing.
Communication volume (v):
  total size (in words) of all messages passed during the computation.
Broadcasting one word costs volume p (actually, p-1).
No explicit accounting for communication time;
  thus, we can't really model parallel efficiency or speedup with it.
  For that, we'd use the latency/bandwidth model (see later slide).
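
As a sanity check on the broadcast cost (an added sketch, not from the slides): counting the words sent when one word is broadcast to p processors gives p-1 regardless of scheduling, since each non-root processor receives the word exactly once. The binomial-tree schedule below is just one way to generate those sends.

# Sketch: total communication volume of broadcasting one word to p processors.
# In the volume model only the number of words moved matters, not the timing,
# so a flat broadcast and a binomial-tree broadcast both cost p - 1 words.

def tree_broadcast_volume(p):
    """Count words sent in a binomial-tree broadcast of a single word."""
    have = {0}                       # ranks that already hold the word
    volume = 0
    while len(have) < p:
        new = set()
        for r in have:               # every holder forwards to one new rank
            target = r + len(have)
            if target < p:
                new.add(target)
                volume += 1          # one word crosses the network
        have |= new
    return volume

if __name__ == "__main__":
    for p in (2, 8, 13, 64):
        print(f"p={p:3d}  volume={tree_broadcast_volume(p)}  (expected p-1 = {p-1})")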

Complexity Measures for Parallel Computation

Problem parameters:
  n      index of problem size
  p      number of processors

Algorithm parameters:
  t_p    running time on p processors
  t_1    time on 1 processor = sequential time = "work"
  t_∞    time on unlimited processors = critical path length = "span"
  v      total communication volume

Performance measures:
  speedup                  s  = t_1 / t_p
  efficiency               e  = t_1 / (p * t_p) = s / p
  (potential) parallelism  pp = t_1 / t_∞
  computational intensity  q  = t_1 / v

Detailed complexity measures for data movement I: Latency/Bandwidth Model

Moving data between processors by message-passing.
Machine parameters:
  α, or t_startup   latency (message startup time, in seconds)
  β, or t_data      inverse bandwidth (in seconds per word)
  (between nodes of Triton, α ≈ … and β ≈ …)
Time to send & recv or bcast a message of w words:  α + w*β
  t_comm  total communication time
  t_comp  total computation time
Total parallel time:  t_p = t_comp + t_comm
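
A minimal sketch of the α-β cost model follows (not from the slides); the latency and bandwidth constants and the message counts are assumptions chosen only to show how the pieces combine.

# Sketch: latency/bandwidth (alpha-beta) cost model for message passing.
# The machine parameters below are hypothetical, for illustration only.

ALPHA = 2e-6      # startup latency per message, in seconds (assumed)
BETA = 1e-9       # inverse bandwidth, in seconds per word (assumed)

def message_time(words, alpha=ALPHA, beta=BETA):
    """Time to send & recv one message of `words` words: alpha + words*beta."""
    return alpha + words * beta

def parallel_time(t_comp, message_sizes):
    """Total parallel time = computation time + summed per-message costs."""
    t_comm = sum(message_time(w) for w in message_sizes)
    return t_comp + t_comm

if __name__ == "__main__":
    # Example: 0.5 s of computation plus 1000 messages of 10,000 words each.
    t = parallel_time(0.5, [10_000] * 1000)
    print(f"t_p = {t:.4f} s")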

Detailed complexity measures for data movement II: Cache Memory Model

Moving data between cache and memory on one processor.
Assume just two levels in memory hierarchy, fast and slow;
  all data initially in slow memory.
  m   = number of memory elements (words) moved between fast and slow memory
  t_m = time per slow memory operation
  f   = number of arithmetic operations
  t_f = time per arithmetic operation, t_f << t_m
  q   = f / m   (computational intensity: flops per slow element access)
Minimum possible time = f * t_f, when all data is in fast memory.
Actual time:  f * t_f + m * t_m = f * t_f * (1 + t_m/t_f * 1/q)
Larger q means time closer to the minimum f * t_f.
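
The sketch below evaluates the actual-time formula from this slide; the operation counts and per-operation times are assumed values, used only to show how larger intensity q pushes the time toward the minimum f * t_f.

# Sketch: two-level cache memory model. Only the formula comes from the slide;
# the operation counts and per-op times below are hypothetical.

def model_time(f, m, t_f, t_m):
    """Actual time f*t_f + m*t_m, written as f*t_f * (1 + (t_m/t_f) / q)."""
    q = f / m                                  # computational intensity
    return f * t_f * (1.0 + (t_m / t_f) / q)

if __name__ == "__main__":
    t_f, t_m = 1e-9, 1e-7                      # assumed: slow memory 100x slower than a flop
    f = 10**9                                  # one billion arithmetic operations
    for m in (10**9, 10**8, 10**7):            # fewer slow-memory accesses => larger q
        q = f / m
        print(f"q={q:6.0f}  time={model_time(f, m, t_f, t_m):7.3f} s  "
              f"(minimum f*t_f = {f * t_f:.3f} s)")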