Parallel Programming and Algorithms: A Primer. Kishore Kothapalli, IIIT-H. Workshop on Multi-core Technologies, International Institute of Information Technology, July 23 – 25, 2009, Hyderabad.
GRAND CHALLENGE PROBLEMS: Global change; Human genome; Fluid turbulence; Vehicle dynamics; Ocean circulation; Viscous fluid dynamics; Superconductor modeling; Quantum chromodynamics; Vision.
APPLICATIONS
Nature of workloads: computational and storage demands of technical, scientific, digital media and business applications; finer degrees of spatial and temporal resolution.
A computational fluid dynamics (CFD) calculation on an airplane wing: a 512 x 64 x 256 grid, 5000 floating-point operations per grid point per step, and 5000 steps give about $2.1 \times 10^{14}$ floating-point operations, or 3.5 minutes on a machine sustaining 1 trillion flops.
A simulation of a full aircraft: $3.5 \times 10^{17}$ grid points and a total of $8.7 \times 10^{24}$ floating-point operations; the same machine requires more than 275,000 years to complete it.
Simulation of magnetic materials at the level of 2000-atom systems requires 2.64 Tflops of computational power and 512 GB of storage. Full hard disk simulation: 30 Tflops and 2 TB. Current investigations, limited to about 1000 atoms: 0.5 Tflops and 250 GB. Future investigations involving 10,000 atoms: 100 Tflops and 2.5 TB.
Digital movies and special effects: $10^{14}$ floating-point operations per frame at 50 frames per second, so a 90-minute movie represents $2.7 \times 10^{19}$ floating-point operations. It would take 2,000 1-Gflops CPUs approximately 150 days to complete the computation.
Inventory planning, risk analysis, workforce scheduling and chip design.
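The estimates on this slide can be re-derived with a few lines of arithmetic. The sketch below is only a sanity check; it assumes that the 5000 operations per grid point are spent in each of the 5000 steps, which is the reading that reproduces the quoted total. The variable names are chosen here for illustration.

```python
# Re-derive the workload estimates quoted on the slide.
# Assumption: 5000 operations per grid point are performed in every one of 5000 steps.
SECONDS_PER_YEAR = 365 * 24 * 3600
SECONDS_PER_DAY = 24 * 3600

# Airplane wing: grid points * ops per point per step * steps
wing_flops = 512 * 64 * 256 * 5000 * 5000
wing_minutes = wing_flops / 1e12 / 60          # on a machine sustaining 1 Tflops

# Full aircraft: total operation count taken directly from the slide
aircraft_years = 8.7e24 / 1e12 / SECONDS_PER_YEAR

# Digital movie: ops per frame * frames per second * seconds in 90 minutes
movie_flops = 1e14 * 50 * 90 * 60
movie_days = movie_flops / (2000 * 1e9) / SECONDS_PER_DAY  # 2,000 CPUs at 1 Gflops each

print(f"wing: {wing_flops:.2e} flops, {wing_minutes:.1f} minutes")
print(f"aircraft: {aircraft_years:,.0f} years")
print(f"movie: {movie_flops:.1e} flops, {movie_days:.0f} days")
```

The results land close to the rounded figures on the slide (about 3.5 minutes, a little over 275,000 years, and roughly 150 days).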
Conventional Wisdom (CW) in Computer Architecture - Patterson
Old CW: Power is free, transistors are expensive. New CW: “Power wall”: power is expensive, transistors are free (can put more on a chip than one can afford to turn on).
Old CW: Multiplies are slow, memory access is fast. New CW: “Memory wall”: memory is slow, multiplies are fast (200 clocks to DRAM memory, 4 clocks for an FP multiply).
Old CW: Increasing Instruction Level Parallelism (ILP) via compilers and innovation (out-of-order, speculation, VLIW, …). New CW: “ILP wall”: diminishing returns on more ILP.
New CW: Power Wall + Memory Wall + ILP Wall = Brick Wall.
Old CW: Uniprocessor performance 2X / 1.5 yrs. New CW: Uniprocessor performance only 2X / 5 yrs?
Multicore and Manycore Processors: IBM Cell; NVidia GeForce 8800 (includes 128 scalar processors) and Tesla; Sun T1 and T2; Tilera Tile64; Picochip (combines 430 simple RISC cores); Cisco 188; TRIPS.
Parallel Programming? Programming where concurrent executions are explicitly specified, possibly in a high-level language. Stakeholders: Architects: understand workloads. Algorithm designers: focus on designs for real systems. Programmers: understand performance issues and engineer for better performance.
Parallel Programming: 4 approaches. 1. Extend an existing compiler, e.g. a Fortran compiler. 2. Extend an existing language with new constructs, e.g. MPI and OpenMP. 3. Add a parallel programming layer. Not popular. 4. Design a new parallel language and build a compiler. Most difficult.
Parallel Programming: How is it different from programming a uniprocessor? On a uniprocessor the program is mostly fixed and taken for granted: other entities such as compilers and the operating system change, but the source need not be rewritten.
Parallel Programming Programs have to be written to suit the available architecture. A continuous evolutionary model taking into account parallel software and architecture. Some Challenges More processors Memory hierarchy Scope for several optimizations/trade-offs. e.g., communication.
Parallelization Process Assume that a description of the sequential program is available. Does the sequential program lend itself to direct parallelization? Enough cases where it does and where it does not Will see an example of both.
Parallelization Process Identify tasks that can be done in parallel. Goal: To get a high-performance implementation with reasonable effort and resources. Who should do it? Compiler, OS, run-time system, programmer Different challenges in different approaches.
Parallelization Process – 4 Steps 1. Decomposition Computation to tasks 2. Assignment Task – Process assignment 3. Orchestration Understand communication and synchronization 4. Mapping Map to physical processors
Parallelization Process – In Pictures [Figure: Decomposition, Assignment, Orchestration, Mapping, ending on processors P1, P2, P3, P4.]
Decomposition Break the computation into a collection of tasks. Can have dynamic generation of tasks. Goal is to expose as much concurrency as possible. Careful to keep the overhead manageable.
Decomposition Limitation: available concurrency, formalized as Amdahl’s law. Let $s$ be the fraction of operations in a computation that must be performed sequentially, with $0 \le s \le 1$. The maximum speed-up achievable by a parallel computer with $p$ processors is: $\text{speed-up} \le \frac{1}{s + (1 - s)/p}$
Decomposition Implications of Amdahl’s Law: Some processors may have to be idle due to the sequential nature of the program. Also applicable to other resources. Quick example: if 20% of the program is sequential, then the best speed-up with 10 processors is limited to $1/(0.2 + 0.08) \approx 3.57$. Amdahl’s Law: as $p \to \infty$, the speed-up is bounded by $1/s$.
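Amdahl's bound, written as 1/(s + (1 - s)/p) for sequential fraction s and p processors, is easy to evaluate numerically. The following is a minimal sketch; the function name `amdahl_speedup` is chosen here and is not part of the original slides.

```python
def amdahl_speedup(s, p):
    """Upper bound on speed-up for sequential fraction s on p processors."""
    if not 0 <= s <= 1:
        raise ValueError("s must lie in [0, 1]")
    if p < 1:
        raise ValueError("p must be at least 1")
    return 1.0 / (s + (1.0 - s) / p)

# The quick example from the slide: 20% sequential, 10 processors.
print(round(amdahl_speedup(0.2, 10), 2))

# Adding processors never pushes the bound past 1/s = 5 for s = 0.2.
for p in (10, 100, 1000, 10**6):
    print(p, round(amdahl_speedup(0.2, p), 3))
```

With s = 0.2 the bound creeps toward 5 as p grows but never reaches it, which is the 1/s limit stated above.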
Decomposition
Amdahl’s Law: as $p \to \infty$, the speed-up is bounded by $1/s$.
Example: a 2-phase calculation. Sweep over an n-by-n grid and do some independent computation; then sweep again and add each value to a global sum.
Time for the first phase = $n^2/p$. The second phase is serialized at the global variable, so its time = $n^2$.
Speedup $\le \frac{2n^2}{n^2/p + n^2}$, or at most 2.
Trick: divide the second phase into two: accumulate into a private sum during the sweep, then add the per-process private sums into the global sum.
Parallel time is $n^2/p + n^2/p + p$, and the speedup is at best $\frac{2n^2}{2n^2/p + p} = \frac{2pn^2}{2n^2 + p^2}$.
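The private-sum trick can be sketched in Python as below. This is an illustration under assumptions, not the workshop's code: the names `two_phase_sum`, `speedup_global_sum` and `speedup_private_sums` are invented here, a thread pool stands in for the p processes, and squaring stands in for the "independent computation". CPython threads do not run CPU-bound work truly in parallel, so the sketch shows the structure of the decomposition rather than a measured speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def two_phase_sum(grid, p, f=lambda x: x * x):
    """Apply f to every grid cell and sum the results using p private sums."""
    cells = [v for row in grid for v in row]
    chunk = max(1, (len(cells) + p - 1) // p)
    blocks = [cells[i:i + chunk] for i in range(0, len(cells), chunk)]

    def work(block):
        private = 0                      # per-worker private sum, no sharing
        for v in block:
            private += f(v)
        return private

    with ThreadPoolExecutor(max_workers=p) as pool:
        partials = list(pool.map(work, blocks))
    return sum(partials)                 # short serial step: at most p additions

def speedup_global_sum(n, p):
    """Second phase serialized at the global variable."""
    return 2 * n * n / (n * n / p + n * n)

def speedup_private_sums(n, p):
    """Second phase split into private sums plus p final additions."""
    return 2 * n * n / (2 * n * n / p + p)

print(two_phase_sum([[1, 2], [3, 4]], p=2))
print(speedup_global_sum(1000, 10), speedup_private_sums(1000, 10))
```

For n = 1000 and p = 10 the first formula stays below 2 while the second is close to 10, matching the bounds on the slide.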
Assignment: Distribution of tasks among processes. Issue: balance the load among the processes. Load includes the number of tasks and inter-process communication. One has to be careful because inter-process communication is expensive and load imbalance can affect performance.
Assignment: Static vs. Dynamic Static assignment: Assignment completely specified at the beginning. Does not change after that Useful for very structured applications.
Assignment: Static vs. Dynamic Dynamic Assignment Assignment changes at runtime. Imagine a task pool. Has a chance to correct load imbalance. Useful for unstructured applications.
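The difference between the two assignment styles can be seen in a small simulation. The sketch below is hypothetical: task costs are made-up numbers, `static_makespan` models contiguous blocks fixed up front, and `dynamic_makespan` models a task pool in which whichever process becomes idle first takes the next task.

```python
import heapq

def static_makespan(costs, p):
    """Contiguous blocks decided at the start; finish time is the heaviest block."""
    chunk = max(1, (len(costs) + p - 1) // p)
    return max((sum(costs[i:i + chunk]) for i in range(0, len(costs), chunk)),
               default=0)

def dynamic_makespan(costs, p):
    """Task pool: the process that becomes idle first takes the next task."""
    finish = [0] * p                     # min-heap of per-process finish times
    for c in costs:
        earliest = heapq.heappop(finish)
        heapq.heappush(finish, earliest + c)
    return max(finish)

# Unstructured workload: two heavy tasks followed by six light ones.
costs = [9, 9, 1, 1, 1, 1, 1, 1]
print(static_makespan(costs, 2))
print(dynamic_makespan(costs, 2))
```

On this workload the static split puts both heavy tasks on one process, while the task pool corrects the imbalance at run time. The simulation ignores the cost of managing the pool, which a real system would have to pay.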
Orchestration Bring in the architecture, programming model, and the programming language. Consider available mechanisms for Data exchange Synchronization Inter-process communication Various programming model primitives and their relative merits
Orchestration Data structures and their organization. Exploit temporal locality among tasks assigned to a process by proper scheduling. Implicit vs. explicit communication Size of messages.
Orchestration – Goals Preserving data locality Task scheduling to remove inter-task waiting. Reduce the overhead of managing parallelism.
Mapping Closer and specific to the system and the programming environment. User controlled Which process runs on which processor? Want an assignment that preserves locality of communication.
Mapping System controlled The OS schedules processes on processors dynamically. Processes may be migrated across processors In-between approach Take user requests into account but the system may change it.
Parallelizing Computation vs. Data Computation is decomposed and assigned (partitioned) Partitioning data is often a natural view too Computation follows data: owner computes Grid example; data mining; Distinction between comp. and data stronger in many applications: E.g. Raytrace
Parallelization Process – Summary
Of the 4 stages, decomposition and assignment are independent of architecture and programming language/environment.
Step               Architecture dependent   Goals
1. Decomposition   Mostly no                Expose enough concurrency
2. Assignment      Mostly no                Load balancing
3. Orchestration   Yes                      Reduce IPC, inter-task dependence, synchronization
4. Mapping         Yes                      Exploit communication locality
Rest of the Lecture Concentrate on Steps 1 and 2 – These are algorithmic in nature Steps 3 and 4 : Programming in nature. Mostly self-taught. Few inputs from my side.
A Similar View: Along similar lines, proposed by Ian Foster: Partitioning: similar to decomposition. Communication: understand the communication required by the partition. Agglomeration: combine tasks to reduce communication, preserve locality, and ease programming effort. Mapping: map processes to processors. See Parallel Programming in C with MPI and OpenMP, M. J. Quinn.
Example 1 – Sequential to Parallel: Matrix Multiplication
Listing 1: Sequential Code
for i = 1 to n do
  for j = 1 to n do
    C[i][j] = 0
    for k = 1 to n do
      C[i][j] += A[i][k] * B[k][j]
    end
  end
end
Matrix Multiplication Easy to modify the sequential algorithm to a parallel algorithm Several techniques available Recursive approach Sub-matrices in parallel Rows/Columns in parallel
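The "rows in parallel" option can be sketched as follows. This is a minimal Python illustration, not code from the workshop: `row_times_matrix` and `parallel_matmul` are names chosen here, matrices are plain lists of lists, and a thread pool stands in for the processors. Because CPython threads share one interpreter lock, the sketch demonstrates the decomposition (one task per row of C) rather than an actual speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def row_times_matrix(row, B):
    """Compute one row of C = A * B; rows are independent of each other."""
    inner = len(B)
    cols = len(B[0])
    return [sum(row[k] * B[k][j] for k in range(inner)) for j in range(cols)]

def parallel_matmul(A, B, workers=4):
    """One task per row of A; pool.map returns the rows of C in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda row: row_times_matrix(row, B), A))

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(parallel_matmul(A, B))
```

No synchronization is needed between tasks because each task reads A and B and writes only its own row of C.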
Example 2 – New Parallel Algorithm
Prefix Computations: Given an array A of n elements and an associative operation o, compute A(1) o A(2) o ... o A(i) for each i. A very simple sequential algorithm exists for this problem.
Listing 1:
S(1) = A(1)
for i = 2 to n do
  S(i) = S(i-1) o A(i)
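For reference, the sequential prefix computation translates directly into Python. This sketch uses 0-based indexing and a function name, `prefix_sequential`, chosen here; the operation `op` can be any associative binary function.

```python
import operator

def prefix_sequential(a, op=operator.add):
    """S[i] = a[0] o a[1] o ... o a[i], computed left to right."""
    if not a:
        return []
    s = [a[0]]
    for i in range(1, len(a)):
        s.append(op(s[i - 1], a[i]))     # each step needs the previous result
    return s

print(prefix_sequential([1, 2, 3, 4]))
print(prefix_sequential([3, 1, 4, 1, 5], max))
```

Each iteration depends on the one before it, which is why this loop offers no concurrency as written.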
Parallel Prefix Computation: The sequential algorithm in Listing 1 is not efficient in parallel. A new algorithmic approach is needed.
Balanced Binary Tree: An algorithm design approach for parallel algorithms. Many problems can be solved with this design technique. Easily amenable to parallelization and analysis.
Balanced Binary Tree A complete binary tree with processors at each internal node. Input is at the leaf nodes Define operations to be executed at the internal nodes. Input for this operation at a node are the values at the children of this node. Computation as a tree traversal from leaf to root.
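Balanced Binary Tree – Sum [Figure: a balanced binary tree over the leaves a0, ..., a7. Each internal node adds its two children: the first level holds a0 + a1, a2 + a3, a4 + a5, a6 + a7; the next holds a0 + a1 + a2 + a3 and a4 + a5 + a6 + a7; the root holds the sum of all a_i.]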
Balanced Binary Tree – Sum: The above approach is called an "upward traversal": data flows from the children to the root. It is helpful in other situations also, such as computing the max and expression evaluation. Analogously, one can define a downward traversal, where data flows from the root to the leaves. This helps in settings such as element broadcast.
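An upward traversal can be simulated level by level. In the sketch below, which is an illustration with names chosen here (`tree_reduce`), all pairs within one level are independent, so a parallel machine could combine them in a single step; the function also reports how many levels were needed.

```python
import operator

def tree_reduce(values, op=operator.add):
    """Combine leaves pairwise, level by level; returns (result, rounds)."""
    level = list(values)
    if not level:
        raise ValueError("need at least one value")
    rounds = 0
    while len(level) > 1:
        # Every pair in this level could be combined by a different processor.
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])        # odd element is carried up unchanged
        level = nxt
        rounds += 1
    return level[0], rounds

print(tree_reduce(range(8)))
print(tree_reduce([5, 2, 9], max))
```

For 8 leaves the loop runs 3 times, that is log2(8) levels, compared with 7 dependent steps for a left-to-right sum.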
Balanced Binary Tree Can use a combination of both upward and downward traversal. Prefix computation requires that. Illustration in the next slide.
Balanced Binary Tree – Sum [Figure: the same tree over the leaves a1, ..., a8. The first level holds a1 + a2, a3 + a4, a5 + a6, a7 + a8; the next holds a1 + a2 + a3 + a4 and a5 + a6 + a7 + a8; the root holds the sum of all a_i.]
Balanced Binary Tree – Prefix Sum, upward traversal [Figure: the same tree; the upward traversal computes the partial sums at the internal nodes.]
Balanced Binary Tree – Prefix Sum, downward traversal, even indices [Figure: the prefix sums at even positions are read off the tree: s2 = a1 + a2, s4 = a1 + a2 + a3 + a4, s6 = $\sum_{i=1}^{6} a_i$, s8 = $\sum_{i=1}^{8} a_i$.]
Balanced Binary Tree – Prefix Sum, downward traversal, odd indices [Figure: the prefix sums at odd positions combine the preceding even prefix with one leaf: s1 = a1, s3 = (a1 + a2) + a3, s5 = ($\sum_{i=1}^{4} a_i$) + a5, s7 = ($\sum_{i=1}^{6} a_i$) + a7.]
Balanced Binary Tree – Prefix Sums
Two traversals of a complete binary tree. The tree is only a visual aid: map processors to locations in the tree and perform the equivalent computations. The algorithm is designed in the PRAM model and works in logarithmic time with an optimal number of operations.
//upward traversal
1. for i = 1 to n/2 do in parallel: b_i = a_{2i-1} o a_{2i}
2. Recursively compute the prefix sums of B = (b_1, b_2, ..., b_{n/2}) and store them in C = (c_1, c_2, ..., c_{n/2})
//downward traversal
3. for i = 1 to n do in parallel:
   i is even: s_i = c_{i/2}
   i = 1: s_1 = a_1
   i is odd, i > 1: s_i = c_{(i-1)/2} o a_i
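The two-traversal prefix algorithm can be simulated sequentially to check the index arithmetic. The sketch below is an illustration, not the workshop's code: it uses 0-based lists, so adjacent elements `a[2i]` and `a[2i+1]` are paired on the way up, and the function name `prefix_tree` is chosen here. The list comprehension and the final loop mark the places where a PRAM would run all iterations in parallel.

```python
import operator

def prefix_tree(a, op=operator.add):
    """Prefix computation by pairing, recursing, and expanding (0-based lists)."""
    n = len(a)
    if n == 0:
        return []
    if n == 1:
        return [a[0]]
    # Upward traversal: combine adjacent pairs (parallel over i on a PRAM).
    b = [op(a[2 * i], a[2 * i + 1]) for i in range(n // 2)]
    # Recursively compute the prefix sums of the half-length sequence.
    c = prefix_tree(b, op)
    # Downward traversal: every position is filled independently.
    s = [None] * n
    s[0] = a[0]
    for i in range(1, n):
        if i % 2 == 1:
            s[i] = c[i // 2]                   # end of a pair: take c directly
        else:
            s[i] = op(c[i // 2 - 1], a[i])     # start of a pair: previous c, then a[i]
    return s

print(prefix_tree([1, 2, 3, 4, 5, 6, 7, 8]))
print(prefix_tree(list("abcd"), lambda x, y: x + y))
```

String concatenation is used in the second call because it is associative but not commutative, so it would expose any mistake in the order of operands.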
The PRAM Model: An extension of the von Neumann model. [Figure: processors P1, P2, P3, ..., Pn connected to a global shared memory.]
The PRAM Model: A set of n identical processors. A common-access shared memory. Synchronous time steps. Access to the shared memory costs the same as a unit of computation. Different models provide semantics for concurrent access to the shared memory: EREW, CREW, CRCW (Common, Arbitrary, Priority, ...).
PRAM Model – Advantages and Drawbacks
Advantages: A simple model for algorithm design. Hides architectural details from the designer. A good starting point.
Disadvantages: Ignores architectural features such as memory bandwidth, communication cost and latency, scheduling, ... Hardware may be difficult to realize.
Other Models: The Network Model. A graph G of processors. Send/receive messages over edges. Computation through communication. Efficiency depends on the graph G. [Figure: processors P1 to P7 connected as a graph G.]
The Network Model There are a few disadvantages Algorithm has to change if the network changes. Difficult to specify and design algorithms.
More Design Paradigms: Divide and Conquer: similar to the sequential design technique. Partitioning: a case of divide and conquer where the subproblems are independent of each other, so there is no need to combine solutions; better suited for algorithms such as merging. Path Doubling or Pointer Jumping: suitable where data is in linked lists.
More Design Paradigms Accelerated Cascading A technique to combine two parallel algorithms to get a better algorithm Algorithm A could be very fast but does lot of operations Algorithm B is slow but is work-optimal. Combine Algorithm A and Algorithm B and get both advantages.
References Parallel Architectures and Programming, Culler, Gupta, and Singh. Parallel Programming in C with MPI and OpenMP, M. J. Quinn. Introduction to Parallel Algorithms, J. JaJa.
List Ranking – Another Example Process a linked list to answer the distance of nodes from one end of the list. Linked lists are a fundamental data structure.
List Ranking – Another Example Pointer jumping – 3 Ind. set based - 3
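Pointer jumping for list ranking can be sketched as below. The representation is an assumption made for this example: the list is stored as an array `succ` where `succ[i]` is the node after `i`, and the last node points to itself. The function name `list_rank` is chosen here. Each round reads the old arrays and builds new ones, which mimics the synchronous steps of a PRAM.

```python
def list_rank(succ):
    """rank[i] = number of links from node i to the end of the list.

    succ[i] is the successor of node i; the tail satisfies succ[t] == t.
    """
    n = len(succ)
    nxt = list(succ)
    rank = [0 if nxt[i] == i else 1 for i in range(n)]
    # Repeat until every pointer has reached the tail.
    while any(nxt[i] != nxt[nxt[i]] for i in range(n)):
        # One synchronous round: all nodes update from the old arrays.
        rank = [rank[i] + rank[nxt[i]] for i in range(n)]
        nxt = [nxt[nxt[i]] for i in range(n)]   # jump over the successor
    return rank

print(list_rank([1, 2, 3, 3]))      # list 0 -> 1 -> 2 -> 3
print(list_rank([2, 3, 1, 3]))      # list 0 -> 2 -> 1 -> 3
```

The distance covered by each pointer doubles every round, so a list of n nodes needs about log2(n) rounds, though every round touches all n nodes.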