
Parallel Programming and Algorithms: A Primer
Kishore Kothapalli, IIIT-H
Workshop on Multi-core Technologies, International Institute of Information Technology, Hyderabad, July 23–25, 2009.

GRAND CHALLENGE PROBLEMS
Global change, human genome, fluid turbulence, vehicle dynamics, ocean circulation, viscous fluid dynamics, superconductor modeling, quantum chromodynamics, vision.

APPLICATIONS
- Nature of workloads: the computational and storage demands of technical, scientific, digital media and business applications, driven by ever finer degrees of spatial and temporal resolution.
- A computational fluid dynamics (CFD) calculation on an airplane wing: a 512 x 64 x 256 grid, 5000 fl-pt operations per grid point, 5000 time steps, i.e., about 2.1 x 10^14 fl-pt operations; roughly 3.5 minutes on a machine sustaining 1 trillion fl-pt operations per second.
- A simulation of the full aircraft: about 3.5 x 10^17 grid points and a total of about 8.7 x 10^24 fl-pt operations; on the same machine this requires more than 275,000 years to complete.
- Simulation of magnetic materials at the level of 2000-atom systems requires 2.64 Tflops of computational power and 512 GB of storage; a full hard-disk simulation needs about 30 Tflops and 2 TB. Current investigations are limited to about 1000 atoms (0.5 Tflops, 250 GB); future investigations involving 10,000 atoms will need about 100 Tflops and 2.5 TB.
- Digital movies and special effects: roughly 10^14 fl-pt operations per frame at 50 frames per second, so a 90-minute movie represents about 2.7 x 10^19 fl-pt operations; it would take 2,000 1-Gflops CPUs approximately 150 days to complete the computation.
- Business applications: inventory planning, risk analysis, workforce scheduling and chip design.
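As a quick sanity check on the wing-CFD figures above, here is a small C calculation (a sketch of my own, using only the numbers stated on the slide):

    /* cfd_count.c -- operation count and runtime for the wing CFD example. */
    #include <stdio.h>

    int main(void) {
        double grid_points = 512.0 * 64.0 * 256.0;   /* 512 x 64 x 256 grid       */
        double ops = grid_points * 5000.0 * 5000.0;  /* ops per point x time steps */
        double seconds = ops / 1e12;                 /* machine sustaining 1 Tflops */

        printf("total operations : %.2e\n", ops);                    /* about 2.1e14 */
        printf("time at 1 Tflops : %.1f minutes\n", seconds / 60.0); /* about 3.5 min */
        return 0;
    }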

Conventional Wisdom (CW) in Computer Architecture (Patterson)
- Old CW: Power is free, transistors expensive. New CW: "Power wall": power is expensive, transistors are free (can put more on a chip than one can afford to turn on).
- Old CW: Multiplies are slow, memory access is fast. New CW: "Memory wall": memory is slow, multiplies are fast (200 clocks to DRAM memory, 4 clocks for an FP multiply).
- Old CW: Increase instruction-level parallelism via compilers and innovation (out-of-order, speculation, VLIW, ...). New CW: "ILP wall": diminishing returns on more ILP.
- New: Power wall + memory wall + ILP wall = brick wall.
- Old CW: Uniprocessor performance 2x every 1.5 years. New CW: Uniprocessor performance only 2x every 5 years?

Multicore and Manycore Processors
- IBM Cell
- NVIDIA GeForce 8800 (128 scalar processors) and Tesla
- Sun T1 and T2
- Tilera Tile64
- Picochip (combines 430 simple RISC cores)
- Cisco (188 cores)
- TRIPS

Parallel Programming?
- Programming where concurrent executions are explicitly specified, possibly in a high-level language.
- Stakeholders:
  - Architects: understand workloads.
  - Algorithm designers: focus on designs for real systems.
  - Programmers: understand performance issues and engineer for better performance.

Parallel Programming: 4 approaches
1. Extend an existing compiler, e.g., a parallelizing Fortran compiler.
2. Extend an existing language with new constructs or libraries, e.g., OpenMP and MPI (a minimal OpenMP example follows).
3. Add a parallel programming layer. Not popular.
4. Design a new parallel language and build a compiler for it. Most difficult.
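To illustrate the second approach (a minimal sketch of my own, not from the slides; the file name and comments are mine), OpenMP extends C with compiler directives and a small runtime library:

    /* hello_omp.c -- C extended with OpenMP constructs.
     * Build (assuming GCC): gcc -fopenmp hello_omp.c -o hello_omp
     */
    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        /* The pragma is the "new construct": it runs the following block
         * on a team of threads created by the OpenMP runtime. */
        #pragma omp parallel
        {
            printf("Hello from thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }
        return 0;
    }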

Parallel Programming
- How is this different from programming a uniprocessor?
- On a uniprocessor the program is largely fixed and mostly taken for granted: other entities such as compilers and the operating system change, but the source need not be rewritten.

Parallel Programming
- Programs have to be written to suit the available architecture.
- A continuous evolutionary model that takes both parallel software and architecture into account.
- Some challenges: more processors, the memory hierarchy, and scope for several optimizations/trade-offs, e.g., communication.

Parallelization Process
- Assume that a description of the sequential program is available.
- Does the sequential program lend itself to direct parallelization? There are enough cases where it does and where it does not; we will see an example of both.

Parallelization Process
- Identify tasks that can be done in parallel.
- Goal: a high-performance implementation with reasonable effort and resources.
- Who should do it? The compiler, the OS, the run-time system, or the programmer; each approach poses different challenges.

Parallelization Process – 4 Steps
1. Decomposition: break the computation into tasks.
2. Assignment: assign tasks to processes.
3. Orchestration: work out the necessary communication and synchronization.
4. Mapping: map processes to physical processors.

Parallelization Process – In Pictures: the sequential computation passes through Decomposition -> Assignment -> Orchestration -> Mapping onto processors P1-P4.

Decomposition
- Break the computation into a collection of tasks; tasks can also be generated dynamically.
- The goal is to expose as much concurrency as possible, while keeping the overhead of managing tasks manageable.

Decomposition
- Limitation: the available concurrency. Formalized as Amdahl's law.
- Let s be the fraction of operations in a computation that must be performed sequentially, with 0 <= s <= 1. The maximum speed-up achievable by a parallel computer with p processors is

      speed-up <= 1 / (s + (1 - s)/p)

Decomposition
- Implications of Amdahl's law: some processors may have to idle because of the sequential part of the program. The same argument also applies to other resources.
- Quick example: if 20% of the program is sequential, then the best speed-up with 10 processors is limited to 1 / (0.2 + 0.8/10), about 3.5.
- Amdahl's law: as p -> infinity, the speed-up is bounded by 1/s.
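The bound is easy to evaluate numerically; a small C sketch (my own, with a hypothetical helper amdahl_bound) that reproduces the quick example above:

    /* amdahl.c -- evaluate the Amdahl's law bound 1 / (s + (1 - s)/p). */
    #include <stdio.h>

    /* s: sequential fraction (0 <= s <= 1), p: number of processors. */
    static double amdahl_bound(double s, int p) {
        return 1.0 / (s + (1.0 - s) / p);
    }

    int main(void) {
        /* The quick example from the slide: 20% sequential, 10 processors. */
        printf("bound with 10 processors   : %.2f\n", amdahl_bound(0.2, 10));
        /* As p grows, the bound approaches 1/s = 5 for s = 0.2. */
        printf("bound with 1000 processors : %.2f\n", amdahl_bound(0.2, 1000));
        return 0;
    }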

Decomposition
- Amdahl's law: as p -> infinity, the speed-up is bounded by 1/s.
- Example: a 2-phase calculation. First sweep over an n-by-n grid and do some independent computation; then sweep again and add each value into a global sum.
- Time for the first phase = n^2/p.
- The second phase is serialized at the global variable, so its time = n^2.
- Speed-up <= 2n^2 / (n^2/p + n^2), which is at most 2.
- Trick: divide the second phase into two parts. Accumulate into a private sum during the sweep, then add the per-process private sums into the global sum.
- Parallel time is then n^2/p + n^2/p + p, and the speed-up is at best 2n^2 / (2n^2/p + p) = 2pn^2 / (2n^2 + p^2), close to p when n is large.
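The private-sum trick maps directly onto OpenMP's reduction clause; a sketch of the two-phase computation in C (my own illustration, with an arbitrary placeholder for the "independent computation" and an arbitrary grid size):

    /* two_phase.c -- grid sweep plus global sum, with per-thread private sums.
     * Build: gcc -fopenmp two_phase.c -o two_phase
     */
    #include <stdio.h>
    #include <omp.h>

    #define N 1024

    int main(void) {
        static double grid[N][N];
        double global_sum = 0.0;

        /* Phase 1: independent computation on every grid point. */
        #pragma omp parallel for collapse(2)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                grid[i][j] = (double)(i + j) * 0.5;   /* placeholder computation */

        /* Phase 2: sum all values.  The reduction clause accumulates into a
         * private copy per thread and combines the private sums at the end,
         * instead of serializing every update to global_sum. */
        #pragma omp parallel for collapse(2) reduction(+:global_sum)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                global_sum += grid[i][j];

        printf("sum = %f\n", global_sum);
        return 0;
    }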

Assignment
- Distribution of tasks among processes.
- Issue: balance the load among the processes. Load includes both the number of tasks and the inter-process communication.
- One has to be careful because inter-process communication is expensive and load imbalance can hurt performance.

Assignment: Static vs. Dynamic
- Static assignment: the assignment is completely specified at the beginning and does not change after that. Useful for very structured applications.

Assignment: Static vs. Dynamic
- Dynamic assignment: the assignment changes at runtime; imagine a task pool. Has a chance to correct load imbalance. Useful for unstructured applications. (A sketch contrasting the two appears below.)
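To make the contrast concrete, here is a small OpenMP sketch (my own illustration; the work-per-iteration function is invented purely to create imbalance). schedule(static) fixes the iteration-to-thread assignment up front, while schedule(dynamic) hands out chunks from a shared pool at runtime:

    /* schedule_demo.c -- static vs. dynamic assignment of loop iterations.
     * Build: gcc -fopenmp schedule_demo.c -o schedule_demo
     */
    #include <stdio.h>
    #include <omp.h>

    /* Deliberately unbalanced work: iteration i costs roughly i units. */
    static double work(int i) {
        double x = 0.0;
        for (int k = 0; k < i * 1000; k++)
            x += k * 1e-9;
        return x;
    }

    int main(void) {
        const int n = 2000;
        double total = 0.0;

        double t0 = omp_get_wtime();
        #pragma omp parallel for schedule(static) reduction(+:total)
        for (int i = 0; i < n; i++) total += work(i);
        printf("static : %.3f s\n", omp_get_wtime() - t0);

        total = 0.0;
        t0 = omp_get_wtime();
        /* Chunks of 16 iterations are grabbed from a shared pool as threads
         * become free -- a simple form of the task-pool idea. */
        #pragma omp parallel for schedule(dynamic, 16) reduction(+:total)
        for (int i = 0; i < n; i++) total += work(i);
        printf("dynamic: %.3f s\n", omp_get_wtime() - t0);

        return 0;
    }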

Orchestration
- Brings in the architecture, the programming model, and the programming language.
- Consider the available mechanisms for: data exchange, synchronization, inter-process communication, and the various programming-model primitives and their relative merits.

Orchestration
- Data structures and their organization.
- Exploit temporal locality among the tasks assigned to a process by proper scheduling.
- Implicit vs. explicit communication.
- Size of messages.

Orchestration – Goals
- Preserve data locality.
- Schedule tasks so as to remove inter-task waiting.
- Reduce the overhead of managing parallelism.

Mapping
- Closer and more specific to the system and the programming environment.
- User controlled: which process runs on which processor? We want an assignment that preserves locality of communication. (See the sketch below.)
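On Linux, one user-controlled way to map a process onto a particular processor is the sched_setaffinity call; a minimal sketch (my own illustration, not from the slides; the choice of CPU 2 is arbitrary):

    /* pin_to_cpu.c -- pin the calling process to one CPU (Linux-specific).
     * Build: gcc pin_to_cpu.c -o pin_to_cpu
     */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(2, &mask);              /* request CPU 2 only (arbitrary choice) */

        /* pid 0 means "the calling process". */
        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        printf("process %d pinned to CPU 2\n", (int)getpid());
        /* ... the actual computation would run here ... */
        return 0;
    }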

Mapping
- System controlled: the OS schedules processes on processors dynamically; processes may be migrated across processors.
- In-between approach: take user requests into account, but the system may change the placement.

Parallelizing Computation vs. Data
- It is the computation that is decomposed and assigned (partitioned).
- Partitioning the data is often a natural view too: the computation follows the data ("owner computes"). Examples: grid computations, data mining.
- The distinction between computation and data is stronger in many applications, e.g., Raytrace.

Parallelization Process – Summary
- Of the 4 stages, decomposition and assignment are independent of the architecture and the programming language/environment.

  Step               Architecture dependent?   Goals
  1. Decomposition   Mostly no                 Expose enough concurrency
  2. Assignment      Mostly no                 Load balancing
  3. Orchestration   Yes                       Reduce IPC, inter-task dependence, synchronization
  4. Mapping         Yes                       Exploit communication locality


Rest of the Lecture
- Concentrate on Steps 1 and 2; these are algorithmic in nature.
- Steps 3 and 4 are programming in nature and mostly self-taught; a few inputs from my side.

Parallelization Process – In Pictures (shown again): Decomposition -> Assignment -> Orchestration -> Mapping onto processors P1-P4.

A Similar View
- Along similar lines, a four-step methodology proposed by Ian Foster:
  - Partitioning: akin to decomposition.
  - Communication: understand the communication required by the partition.
  - Agglomeration: combine tasks to reduce communication, preserve locality, and ease the programming effort.
  - Mapping: map processes to processors.
- See Parallel Programming in C with MPI and OpenMP, M. J. Quinn.

Foster's Design Methodology: Partitioning -> Communication -> Agglomeration -> Mapping.

Example 1 – Sequential to Parallel
- Matrix multiplication.

  Listing 1: Sequential code
    for i = 1 to n do
      for j = 1 to n do
        C[i][j] = 0
        for k = 1 to n do
          C[i][j] += A[i][k] * B[k][j]
        end
      end
    end

Matrix Multiplication
- It is easy to modify the sequential algorithm into a parallel algorithm.
- Several techniques are available: a recursive approach, computing sub-matrices in parallel, or computing rows/columns in parallel (a sketch of the last option follows).
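Rows-in-parallel in C with OpenMP (a sketch of my own based on Listing 1; the matrix size and the identity-matrix initialization are arbitrary choices used only to check the result): each thread computes a disjoint set of rows of C, so no synchronization is needed inside the loop nest.

    /* matmul_rows.c -- parallelize the outer (row) loop of Listing 1.
     * Build: gcc -fopenmp matmul_rows.c -o matmul_rows
     */
    #include <stdio.h>

    #define N 512

    static double A[N][N], B[N][N], C[N][N];

    int main(void) {
        /* Arbitrary initialization so the result is easy to check. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                A[i][j] = i + j;
                B[i][j] = (i == j) ? 1.0 : 0.0;   /* identity, so C should equal A */
            }

        /* Each iteration of i (one row of C) is independent of the others. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                C[i][j] = 0.0;
                for (int k = 0; k < N; k++)
                    C[i][j] += A[i][k] * B[k][j];
            }

        printf("C[10][20] = %f (expected %f)\n", C[10][20], A[10][20]);
        return 0;
    }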

Example 2 – New Parallel Algorithm
- Prefix computations: given an array A of n elements and an associative operation o, compute A(1) o A(2) o ... o A(i) for each i.
- A very simple sequential algorithm exists for this problem.

  Listing 2:
    S(1) = A(1)
    for i = 2 to n do
      S(i) = S(i-1) o A(i)

Parallel Prefix Computation
- The sequential algorithm in Listing 2 is not efficient in parallel.
- We need a new algorithmic approach: the balanced binary tree.

Balanced Binary Tree
- An algorithm design approach for parallel algorithms.
- Many problems can be solved with this design technique.
- Easily amenable to parallelization and analysis.

Balanced Binary Tree
- A complete binary tree with a processor at each internal node.
- The input is at the leaf nodes.
- Define the operation to be executed at the internal nodes; the inputs for this operation at a node are the values at the children of that node.
- The computation proceeds as a tree traversal from the leaves to the root.

Balanced Binary Tree – Prefix Sums (figure: a complete binary tree with the inputs a0, a1, ..., a7 at the leaves).

Balanced Binary Tree – Sum (figure: the leaves hold a0, ..., a7; the first internal level holds a0+a1, a2+a3, a4+a5, a6+a7; the next level holds a0+a1+a2+a3 and a4+a5+a6+a7; the root holds the total sum of all a_i).

Balanced Binary Tree – Sum
- The above approach is called an "upward traversal": data flows from the children to the root. It is helpful in other situations as well, such as computing the max or expression evaluation. (A code sketch of the upward traversal follows.)
- Analogously, one can define a downward traversal, in which data flows from the root to the leaves. It helps in settings such as broadcasting an element.
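A small illustration of the upward traversal for a sum, in C (my own sketch; it works in place on an array whose length is assumed to be a power of two and uses + as the operation). Each round combines pairs a distance "stride" apart, which is exactly the level-by-level flow from the leaves to the root; all combinations within one round are independent and can run in parallel.

    /* tree_sum.c -- upward (leaves-to-root) traversal for a sum reduction.
     * Assumes n is a power of two.  Build: gcc -fopenmp tree_sum.c -o tree_sum
     */
    #include <stdio.h>

    #define N 8   /* small power of two, for illustration */

    int main(void) {
        double a[N] = {1, 2, 3, 4, 5, 6, 7, 8};

        /* log2(N) rounds; round r combines pairs 2*stride apart. */
        for (int stride = 1; stride < N; stride *= 2) {
            #pragma omp parallel for
            for (int i = 0; i < N; i += 2 * stride)
                a[i] = a[i] + a[i + stride];   /* parent value of the two children */
        }

        printf("sum = %f (expected 36)\n", a[0]);
        return 0;
    }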

Balanced Binary Tree
- One can also use a combination of an upward and a downward traversal.
- Prefix computation requires exactly that; an illustration follows in the next slides.

Balanced Binary Tree – Sum (figure, now with inputs a1, ..., a8 at the leaves: the first internal level holds a1+a2, a3+a4, a5+a6, a7+a8; the next level holds a1+a2+a3+a4 and a5+a6+a7+a8; the root holds the total sum of all a_i).

Balanced Binary Tree – Prefix Sum (figure: the upward traversal computes the same pairwise sums level by level, from the leaves a1, ..., a8 up to the root).

Balanced Binary Tree – Prefix Sum (figure: the downward traversal fills in the prefix sums at the even indices, s2 = a1+a2, s4 = a1+a2+a3+a4, s6 = a1+...+a6, s8 = a1+...+a8, using the partial sums computed during the upward traversal).

Balanced Binary Tree – Prefix Sum (figure: the downward traversal then fills in the odd indices, s1 = a1, s3 = (a1+a2) + a3, s5 = (a1+a2+a3+a4) + a5, s7 = (a1+...+a6) + a7).

Balanced Binary Tree – Prefix Sums
- Two traversals of a complete binary tree. The tree is only a visual aid: map processors to locations in the tree and perform the equivalent computations.
- The algorithm is designed in the PRAM model. It works in logarithmic time with an optimal number of operations. (A C sketch follows the listing.)

  // upward traversal
  1. for i = 1 to n/2 do in parallel
       b_i = a_{2i-1} o a_{2i}
  2. Recursively compute the prefix sums of B = (b_1, b_2, ..., b_{n/2})
     and store them in C = (c_1, c_2, ..., c_{n/2})
  // downward traversal
  3. for i = 1 to n do in parallel
       i even       : s_i = c_{i/2}
       i = 1        : s_1 = a_1
       i odd, i > 1 : s_i = c_{(i-1)/2} o a_i
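A sketch of this recursive scheme in C with OpenMP (my own illustration; it assumes n is a power of two, uses + as the associative operation o, and switches to 0-based indexing). The parallel loops correspond to steps 1 and 3, and the recursion on the half-sized array B corresponds to step 2.

    /* prefix_tree.c -- recursive parallel prefix sums (balanced-binary-tree scheme).
     * Assumes n is a power of two.  Build: gcc -fopenmp prefix_tree.c -o prefix_tree
     */
    #include <stdio.h>
    #include <stdlib.h>

    /* Compute s[i] = a[0] + ... + a[i] for i = 0..n-1 (0-based indexing). */
    static void prefix(const double *a, double *s, int n) {
        if (n == 1) { s[0] = a[0]; return; }

        double *b = malloc((n / 2) * sizeof *b);
        double *c = malloc((n / 2) * sizeof *c);

        /* Step 1 (upward): combine adjacent pairs. */
        #pragma omp parallel for
        for (int i = 0; i < n / 2; i++)
            b[i] = a[2 * i] + a[2 * i + 1];

        /* Step 2: recursively compute prefix sums of the half-sized array. */
        prefix(b, c, n / 2);

        /* Step 3 (downward): fill in the answers. */
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            if (i == 0)          s[i] = a[0];
            else if (i % 2 == 1) s[i] = c[i / 2];            /* "even" position, 1-based */
            else                 s[i] = c[i / 2 - 1] + a[i]; /* "odd" position, 1-based  */
        }

        free(b);
        free(c);
    }

    int main(void) {
        double a[8] = {1, 2, 3, 4, 5, 6, 7, 8}, s[8];
        prefix(a, s, 8);
        for (int i = 0; i < 8; i++)
            printf("%g ", s[i]);          /* expect: 1 3 6 10 15 21 28 36 */
        printf("\n");
        return 0;
    }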

The PRAM Model
- An extension of the von Neumann model.
- (Figure: processors P1, P2, P3, ..., Pn all connected to a global shared memory.)

The PRAM Model
- A set of n identical processors.
- A shared memory that all processors can access.
- Synchronous time steps.
- An access to the shared memory costs the same as a unit of computation.
- Different variants provide semantics for concurrent access to the shared memory: EREW, CREW, CRCW (Common, Arbitrary, Priority, ...).

PRAM Model – Advantages and Drawbacks
- Advantages: a simple model for algorithm design; hides architectural details from the designer; a good starting point.
- Drawbacks: ignores architectural features such as memory bandwidth, communication cost and latency, and scheduling; the hardware may be difficult to realize.

Other Models – The Network Model
- A graph G of processors (figure: processors P1-P7 connected by edges).
- Processors send/receive messages over the edges.
- Computation proceeds through communication.
- Efficiency depends on the graph G.

The Network Model
- There are a few disadvantages: the algorithm has to change if the network changes, and it is difficult to specify and design algorithms in this model.

More Design Paradigms
- Divide and conquer: similar to the sequential design technique.
- Partitioning: a case of divide and conquer where the subproblems are independent of each other, so there is no need to combine solutions. Well suited to algorithms such as merging.
- Path doubling, or pointer jumping: suitable where the data is in linked lists.

More Design Paradigms
- Accelerated cascading: a technique to combine two parallel algorithms into a better one. Algorithm A may be very fast but perform a lot of operations; Algorithm B is slower but work-optimal. Combining Algorithm A and Algorithm B gets both advantages.

References
- Parallel Computer Architecture: A Hardware/Software Approach, D. Culler, J. P. Singh, and A. Gupta.
- Parallel Programming in C with MPI and OpenMP, M. J. Quinn.
- An Introduction to Parallel Algorithms, J. JaJa.

List Ranking – Another Example
- Process a linked list to answer the distance of each node from one end of the list.
- Linked lists are a fundamental data structure.

List Ranking – Another Example
- Approaches: pointer jumping; independent-set based. (A C sketch of pointer jumping follows.)
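To close, a sequential C sketch of the pointer-jumping idea for list ranking (my own illustration, not from the slides; it simulates the synchronous parallel rounds with double buffering and uses index N as the nil successor of the last node). After about log2(n) rounds every node knows its distance from the end of the list.

    /* list_rank.c -- list ranking by pointer jumping (simulated synchronous rounds).
     * Build: gcc list_rank.c -o list_rank
     */
    #include <stdio.h>
    #include <string.h>

    #define N 8

    int main(void) {
        /* The list is stored as an array of successor indices; index N acts as nil.
         * For illustration the list is simply 0 -> 1 -> ... -> 7 -> nil, but the
         * algorithm works for any linked order. */
        int next[N + 1], rank[N + 1];
        int next2[N + 1], rank2[N + 1];

        for (int i = 0; i < N; i++) { next[i] = i + 1; rank[i] = 1; }
        rank[N - 1] = 0;            /* the last node is at distance 0 from the end */
        next[N] = N; rank[N] = 0;   /* nil sentinel points to itself              */

        /* Each round, every node adds in its successor's rank and then jumps over
         * it.  On a PRAM all nodes would do this in parallel; the double buffers
         * next2/rank2 simulate that synchrony.  O(log n) rounds suffice. */
        int done = 0;
        while (!done) {
            done = 1;
            for (int i = 0; i <= N; i++) {
                rank2[i] = rank[i] + rank[next[i]];
                next2[i] = next[next[i]];
                if (next[i] != N) done = 0;   /* some node still has jumping to do */
            }
            memcpy(rank, rank2, sizeof rank);
            memcpy(next, next2, sizeof next);
        }

        for (int i = 0; i < N; i++)
            printf("rank[%d] = %d\n", i, rank[i]);   /* expect 7, 6, ..., 0 */
        return 0;
    }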