ECE 669 Parallel Computer Architecture
Lecture 6: Programming for Performance
February 17, 2004

Introduction
° Rich space of techniques and issues
  - They trade off and interact with one another
° Issues can be addressed/helped by software or hardware
  - Algorithmic or programming techniques
  - Architectural techniques
° Focus here on performance issues and software techniques
  - Partitioning
  - Communication
  - Orchestration

Partitioning for Performance
° Initially consider how to segment the program without a view of the programming model
° Important factors:
  - Balancing the workload
  - Reducing communication
  - Reducing the extra work needed for management
° Goals are similar for parallel computing and VLSI design
° Algorithmic or manual approaches
° Perhaps the most important factor for performance

Performance Goal => Speedup
° Architect's goal: observe how a program uses the machine and improve the design to enhance performance
° Programmer's goal: observe how the program uses the machine and improve the implementation to enhance performance
° What do you observe?
° Who fixes what?

Partitioning for Performance
° Balancing the workload and reducing wait time at synchronization points
° Reducing inherent communication
° Reducing extra work
° Even these algorithmic issues trade off:
  - Minimize communication => run on 1 processor => extreme load imbalance
  - Maximize load balance => random assignment of tiny tasks => no control over communication
  - A good partition may imply extra work to compute or manage it
° The goal is to compromise
  - Fortunately, this is often not difficult in practice

Load Balance and Synch Wait Time
° Limit on speedup:

    Speedup_problem(p) ≤ Sequential Work / Max Work on any Processor

  - Work includes data access and other costs
  - Not just equal work: processors must also be busy at the same time
° Four parts to load balancing and reducing synch wait time:
  1. Identify enough concurrency
  2. Decide how to manage it
  3. Determine the granularity at which to exploit it
  4. Reduce serialization and the cost of synchronization
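A quick worked instance of this bound (the numbers are illustrative, not from the lecture): suppose the sequential work is 100 time units and a 4-process partition assigns 25, 25, 20, and 30 units to the processors. Then

\[
\text{Speedup}(4) \;\le\; \frac{100}{\max(25,\,25,\,20,\,30)} \;=\; \frac{100}{30} \;\approx\; 3.3,
\]

so the most heavily loaded processor, not the processor count, caps the achievable speedup.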

Identifying Concurrency
° Techniques seen for the equation solver:
  - Loop structure, fundamental dependences, new algorithms
° Data parallelism versus function parallelism
° Often see orthogonal levels of parallelism, e.g. VLSI routing

Load Balance and Synchronization

    Speedup_problem(p) ≤ Sequential Work / Max Work on any Processor

refined to

    Speedup_problem(p) ≤ Sequential Work / Max (Work + Synch Wait Time)

° Instantaneous load imbalance revealed as wait time
  - at completion
  - at barriers
  - at receive
  (Figure: execution timelines for processors P0-P3, with idle/wait time shown at synchronization points)

Improving Load Balance
° Decompose into more, smaller tasks (many more than P)
° Distribute variable-sized tasks uniformly
  - randomize (see the interleaved-assignment sketch below)
  - bin packing
  - dynamic assignment
° Schedule more carefully
  - avoid serialization
  - estimate work
  - use history information
° Example: triangular iteration space, where rows shrink as i grows, so equal-sized contiguous row blocks per processor are imbalanced:

    for_all i = 1 to n do
      for_all j = i to n do
        A[i, j] = A[i-1, j] + A[i, j-1] + ...
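A minimal sketch of the interleaving idea for this triangular loop (the function name, the (n+1) x (n+1) array convention with boundary row/column 0, and the fixed process count are my own illustration, not code from the slides). Assigning rows to processes cyclically gives each process a mix of long and short rows:

    /* Cyclic (interleaved) row assignment for a triangular loop.
     * Process pid of nprocs takes rows i = pid+1, pid+1+nprocs, ...,
     * so long (small i) and short (large i) rows are spread across processes.
     * A is assumed (n+1) x (n+1) with boundary row/column 0 initialized.
     * The cross-row dependence on A[i-1][j] would need point-to-point
     * synchronization in a real parallel run; it is omitted here. */
    void sweep_cyclic(double **A, int n, int pid, int nprocs)
    {
        for (int i = pid + 1; i <= n; i += nprocs) {   /* interleaved rows */
            for (int j = i; j <= n; j++) {             /* triangular inner loop */
                A[i][j] = A[i-1][j] + A[i][j-1];       /* stencil-style update */
            }
        }
    }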

Example: Barnes-Hut
° Divide space into regions with roughly equal numbers of particles
° Particles close together in space should be on the same processor
° The distribution is nonuniform and dynamically changing

Dynamic Scheduling with Task Queues
° Centralized versus distributed queues (a centralized-queue sketch follows below)
° Task stealing with distributed queues
  - Can compromise communication and locality, and increase synchronization
  - Whom to steal from, how many tasks to steal, ...
  - Termination detection
  - Maximum imbalance related to task size
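A minimal sketch of the centralized-queue variant (the distributed version adds per-process queues and a steal protocol). All names and sizes are illustrative; this is not code from the lecture:

    /* Centralized task queue shared by all worker threads (pthreads).
     * Workers repeatedly dequeue a task index and process it; one mutex
     * protects the queue, which is exactly the serialization bottleneck
     * that motivates distributed queues and task stealing. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTASKS   1000
    #define NTHREADS 4

    static int next_task = 0;                        /* head of the queue */
    static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;

    static void process_task(int t) { (void)t; }     /* placeholder for real work */

    static void *worker(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&qlock);              /* centralized: one lock for all */
            int t = (next_task < NTASKS) ? next_task++ : -1;
            pthread_mutex_unlock(&qlock);
            if (t < 0) break;                        /* queue empty: terminate */
            process_task(t);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        printf("all %d tasks done\n", NTASKS);
        return 0;
    }

Termination detection is trivial here because the queue is a bounded counter; with distributed queues and stealing, deciding that every queue is empty becomes a protocol of its own.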

Deciding How to Manage Concurrency
° Static versus dynamic techniques
° Static:
  - Algorithmic assignment based on the input; won't change
  - Low runtime overhead
  - Computation must be predictable
  - Preferable when applicable (except in multiprogrammed/heterogeneous environments)
° Dynamic:
  - Adapt at runtime to balance load
  - Can increase communication and reduce locality
  - Can increase task management overheads

Dynamic Assignment
° Profile-based (semi-static):
  - Profile the work distribution at runtime, and repartition dynamically
  - Applicable in many computations, e.g. Barnes-Hut, some graphics
° Dynamic tasking:
  - Deals with unpredictability in the program or environment (e.g. Raytrace)
    - computation, communication, and memory system interactions
    - multiprogramming and heterogeneity
    - used by runtime systems and the OS too
  - Pool of tasks; take and add tasks until done
  - E.g. "self-scheduling" of loop iterations via a shared loop counter (see the sketch below)
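A hedged sketch of self-scheduling with a shared loop counter (the names, the C11 atomics, and the iteration count are illustrative; thread creation, e.g. with pthreads, is omitted):

    /* Self-scheduling a loop: each worker claims the next iteration from a
     * shared counter with an atomic fetch-and-add, so faster workers
     * naturally take more iterations and the load balances itself. */
    #include <stdatomic.h>

    #define N 100000

    static atomic_int next_iter = 0;           /* shared loop counter */

    extern void do_iteration(int i);           /* assumed application work */

    void self_schedule_worker(void)
    {
        for (;;) {
            int i = atomic_fetch_add(&next_iter, 1);  /* claim one iteration */
            if (i >= N) break;                        /* all iterations claimed */
            do_iteration(i);
        }
    }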

Impact of Dynamic Assignment
° Barnes-Hut and ray tracing on the SGI Origin 2000 and Challenge (cache-coherent shared memory)
° Semi-static: periodic run-time re-evaluation of the task assignment

Determining Task Granularity
° Task granularity: the amount of work associated with a task
° General rules:
  - Coarse-grained => often worse load balance
  - Fine-grained => more overhead; often more communication and contention
° Communication and contention are actually affected by the assignment, not the task size
  - Overhead is an issue, particularly with task queues (a chunk-size sketch follows below)
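One common way to expose this tradeoff in shared-memory code is the chunk size of a dynamically scheduled loop; this OpenMP sketch (the chunk size of 16 and the function are arbitrary illustrations) hands out iterations in chunks so that queue overhead is amortized while load balance is retained:

    /* Granularity knob: the chunk size of a dynamically scheduled loop.
     * chunk = 1 gives the finest tasks (best balance, most queue overhead);
     * a very large chunk approaches a static block assignment. */
    #include <omp.h>

    void scale_array(double *a, int n)
    {
        #pragma omp parallel for schedule(dynamic, 16)  /* 16 iterations per task */
        for (int i = 0; i < n; i++)
            a[i] = 2.0 * a[i];
    }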

Reducing Serialization
° Be careful about assignment and orchestration, including scheduling
° Event synchronization
  - Reduce the use of conservative synchronization
    - point-to-point synchronization instead of global barriers
  - Fine-grained synch is harder to program and uses more synch operations
° Mutual exclusion
  - Separate locks for separate data (see the per-bucket sketch below)
    - e.g. locking records in a database: lock per process, per record, or per field
    - lock per task in a task queue, not per queue
    - finer grain => less contention, more space, less reuse
  - Smaller, less frequent critical sections
    - don't do reading/testing in the critical section, only modification
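A hedged sketch of "separate locks for separate data": a hash table with one lock per bucket instead of one table-wide lock (the structure and names are illustrative, not from the lecture):

    /* Per-bucket locks: two threads updating different buckets do not contend,
     * whereas a single table-wide lock would serialize every update. */
    #include <pthread.h>

    #define NBUCKETS 256

    struct bucket {
        pthread_mutex_t lock;      /* one lock per bucket, not per table */
        long            count;     /* stand-in for the real record data */
    };

    static struct bucket table[NBUCKETS];

    void table_init(void)
    {
        for (int i = 0; i < NBUCKETS; i++) {
            pthread_mutex_init(&table[i].lock, NULL);
            table[i].count = 0;
        }
    }

    void record_update(unsigned key)
    {
        struct bucket *b = &table[key % NBUCKETS];
        pthread_mutex_lock(&b->lock);     /* critical section covers only the update */
        b->count++;
        pthread_mutex_unlock(&b->lock);
    }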

Implications of Load Balance
° Extends the speedup limit expression to:

    Speedup_problem(p) ≤ Sequential Work / Max (Work + Synch Wait Time)

° Generally the responsibility of software
° Architecture can support task stealing and synchronization efficiently
  - Fine-grained communication, low-overhead access to queues
    - efficient support allows smaller tasks, better load balance
  - Naming logically shared data in the presence of task stealing
  - Efficient support for point-to-point communication

Architectural Implications of Load Balancing
° Naming
  - Global, position-independent naming separates decomposition from layout
  - Allows diverse, even dynamic assignments
° Efficient fine-grained communication & synchronization, requiring:
  - messages
  - locks
  - point-to-point synchronization
° Automatic replication of tasks

Implications of Comm-to-Comp Ratio
° Architects examine application needs to see where to spend money
° If the denominator is execution time, the ratio gives average bandwidth needs
° If it is operation count, it gives the extremes of the impact of latency and bandwidth
  - Latency: assume no latency hiding
  - Bandwidth: assume all latency hidden
  - Reality is somewhere in between
° The actual impact of communication depends on its structure and cost as well
  - Communication must also be kept balanced across processors
° The speedup limit becomes:

    Speedup ≤ Sequential Work / Max (Work + Synch Wait Time + Comm Cost)

Reducing Extra Work
° Common sources of extra work:
  - Computing a good partition
    - e.g. partitioning in Barnes-Hut
  - Using redundant computation to avoid communication
  - Task, data, and process management overhead
    - applications, languages, runtime systems, OS
  - Imposing structure on communication
    - coalescing messages, allowing effective naming
° Architectural implications:
  - Reduce the need by making communication and orchestration efficient
° The speedup limit becomes:

    Speedup ≤ Sequential Work / Max (Work + Synch Wait Time + Comm Cost + Extra Work)

Reducing Inherent Communication
° Communication is expensive!
° Measure: communication-to-computation ratio
° Inherent communication
  - Determined by the assignment of tasks to processes
  - One task produces data consumed by others
° Ways to reduce it:
  - Replicate computations
  - Use algorithms that communicate less
  - Assign tasks that access the same data to the same process
    - e.g. the same row or block to the same process in each iteration

Domain Decomposition
° Works well for scientific, engineering, graphics, ... applications
° Exploits the locally-biased nature of physical problems
  - Information requirements are often short-range
  - Or long-range but falling off with distance
° Simple example: nearest-neighbor grid computation
  - Perimeter-to-area comm-to-comp ratio (area-to-volume in 3-d)
  - Depends on n and p: decreases with n, increases with p

Domain Decomposition
° The best domain decomposition depends on information requirements
° Nearest-neighbor example: block versus strip decomposition
° Comm-to-comp ratio: 4√p / n for a block decomposition, 2p / n for a strip decomposition (derivation below)
° Application dependent: strip may be better in other cases
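A short derivation of those two ratios, under the usual assumptions (an n x n grid of points, p processes, nearest-neighbor communication only across partition boundaries); this is standard bookkeeping rather than text taken from the slides:

\[
\text{block } \tfrac{n}{\sqrt{p}} \times \tfrac{n}{\sqrt{p}}:\quad
\frac{\text{comm}}{\text{comp}} \approx \frac{4\,n/\sqrt{p}}{n^{2}/p} = \frac{4\sqrt{p}}{n},
\qquad
\text{strip of } \tfrac{n}{p} \text{ rows}:\quad
\frac{\text{comm}}{\text{comp}} \approx \frac{2\,n}{n^{2}/p} = \frac{2\,p}{n}.
\]

For a fixed n, the block decomposition's communication grows only as √p while the strip's grows linearly in p, which is why the ratios above favor blocks as p grows.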

Finding a Domain Decomposition
° Static, by inspection
  - Must be predictable: grid example, Ocean
° Static, but not by inspection
  - Input-dependent; requires analyzing the input structure
  - e.g. sparse matrix computations
° Semi-static (periodic repartitioning)
  - Characteristics change, but slowly; e.g. Barnes-Hut
° Static or semi-static, with dynamic task stealing
  - Initial decomposition, but highly unpredictable execution

Summary: Analyzing Parallel Algorithms
° Requires characterization of the multiprocessor and the algorithm
° Historical focus on algorithmic aspects: partitioning, mapping
° PRAM model: data access and communication are free
  - Only load balance (including serialization) and extra work matter:

    Speedup ≤ Sequential Instructions / Max (Instructions + Synch Wait Time + Extra Instructions)

  - Useful for early development, but unrealistic for real performance
  - Ignores communication and the imbalances it causes
  - Can lead to a poor choice of partitions as well as orchestration

Orchestration for Performance
° Reducing the amount of communication:
  - Inherent: change the logical data-sharing patterns in the algorithm
  - Artifactual: exploit spatial and temporal locality in the extended hierarchy
    - techniques are often similar to those on uniprocessors
° Structuring communication to reduce its cost

Reducing Communication
° Message passing model
  - Communication and replication are both explicit
  - Communication is in messages
° Shared address space model
  - More interesting from an architectural perspective
  - Occurs transparently due to interactions of the program and the system
    - sizes and granularities in the extended memory hierarchy
° We use the shared address space model to illustrate the issues

Exploiting Temporal Locality
° Structure the algorithm so that working sets map well onto the hierarchy
  - techniques that reduce inherent communication often do well here
  - schedule tasks for data reuse once they are assigned
° Solver example: blocking (see the sketch below)
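A hedged sketch of blocking for the grid sweep (the tile size is an illustrative choice; the real solver also carries convergence testing and parallel synchronization, omitted here). Blocking changes the update order, which this Gauss-Seidel-style relaxation tolerates:

    /* Blocked sweep over an n x n grid: process B x B tiles so that a tile's
     * elements stay in cache while they are reused, instead of streaming
     * through whole rows between reuses. */
    #define B 64   /* tile size; tune to the cache */

    void blocked_sweep(double **A, int n)
    {
        for (int ii = 1; ii < n - 1; ii += B)
            for (int jj = 1; jj < n - 1; jj += B)
                for (int i = ii; i < ii + B && i < n - 1; i++)
                    for (int j = jj; j < jj + B && j < n - 1; j++)
                        A[i][j] = 0.2 * (A[i][j] + A[i-1][j] + A[i+1][j]
                                         + A[i][j-1] + A[i][j+1]);  /* 5-point stencil */
    }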

Exploiting Spatial Locality
° Besides capacity, granularities are important:
  - Granularity of allocation
  - Granularity of communication or data transfer
  - Granularity of coherence
° Major spatial causes of artifactual communication:
  - Conflict misses
  - Data distribution/layout (allocation granularity)
  - Fragmentation (communication granularity)
  - False sharing of data (coherence granularity)
° All depend on how spatial access patterns interact with data structures
  - Fix problems by modifying data structures, or layout/alignment
° Examined later in the context of architectures; one simple example here: data distribution in the SAS solver

Spatial Locality Example
° Repeated sweeps over a 2-d grid, each time adding 1 to the elements
° Natural 2-d versus higher-dimensional array representation (see the sketch below)
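A hedged sketch of the two layouts (the "higher-dimensional" representation is commonly written as a 4-d, block-contiguous array; the names, indexing helpers, and even-divisibility assumption here are my own illustration):

    /* Same storage, two indexings, for an n x n grid split among p processors
     * arranged as a sqrt(p) x sqrt(p) grid of blocks (assume both divide evenly).
     *
     * Row-major "2-d" indexing: A[i*n + j]. A processor's block straddles cache
     * lines and pages that also hold its neighbors' data (false sharing, poor
     * page placement).
     *
     * Block-contiguous "4-d" indexing: each block occupies one contiguous chunk,
     * so the allocation granularity matches the partition and a block can be
     * placed in its owner's local memory. */
    #include <math.h>
    #include <stdlib.h>

    double *alloc_grid(int n)
    {
        return malloc((size_t)n * n * sizeof(double));  /* storage is the same either way */
    }

    static inline double *elem_2d(double *A, int n, int i, int j)
    {
        return &A[(size_t)i * n + j];
    }

    static inline double *elem_4d(double *A, int n, int p, int i, int j)
    {
        int nb  = (int)lround(sqrt((double)p));         /* blocks per dimension */
        int bsz = n / nb;                               /* block side length */
        int bi = i / bsz, bj = j / bsz;                 /* which block owns (i, j) */
        int li = i % bsz, lj = j % bsz;                 /* offset inside that block */
        return &A[(((size_t)bi * nb + bj) * bsz + li) * bsz + lj];
    }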

Architectural Implications of Locality
° A communication abstraction that makes exploiting locality easy
° For cache-coherent SAS:
  - Size and organization of the levels of the memory hierarchy
    - cost-effectiveness: caches are expensive
    - caveats: flexibility for different and time-shared workloads
  - Is replication in main memory useful? If so, how should it be managed?
    - hardware, OS/runtime, or program?
  - Granularities of allocation, communication, coherence (?)
    - small granularities => high overheads, but easier to program
° Machine granularity (resource division among processors, memory, ...)

Example Performance Impact
° Equation solver on the SGI Origin2000

Summary of Tradeoffs
° Different goals often have conflicting demands
  - Load balance
    - fine-grain tasks
    - random or dynamic assignment
  - Communication
    - usually coarse-grain tasks
    - decompose to obtain locality: not random/dynamic
  - Extra work
    - coarse-grain tasks
    - simple assignment
  - Communication cost
    - big transfers: amortize overhead and latency
    - small transfers: reduce contention