
1 Parallel Programming and Algorithms: A Primer. Kishore Kothapalli, IIIT-H (kkishore@iiit.ac.in). Workshop on Multi-core Technologies, International Institute of Information Technology, Hyderabad, July 23 – 25, 2009.

2 GRAND CHALLENGE PROBLEMS: global change, the human genome, fluid turbulence, vehicle dynamics, ocean circulation, viscous fluid dynamics, superconductor modeling, quantum chromodynamics, vision.

3 APPLICATIONS – Nature of workloads: computational and storage demands of technical, scientific, digital-media, and business applications; finer degrees of spatial and temporal resolution.
- A computational fluid dynamics (CFD) calculation on an airplane wing: a 512 x 64 x 256 grid, 5000 floating-point operations per grid point, 5000 steps, about 2.1 x 10^14 floating-point operations; roughly 3.5 minutes on a machine sustaining 1 trillion floating-point operations per second.
- A simulation of a full aircraft: 3.5 x 10^17 grid points and a total of 8.7 x 10^24 floating-point operations; on the same machine this requires more than 275,000 years to complete.
- Simulation of magnetic materials at the level of 2000-atom systems requires 2.64 Tflops of computational power and 512 GB of storage. A full hard-disk simulation: 30 Tflops and 2 TB. Current investigations are limited to about 1000 atoms (0.5 Tflops, 250 GB); future investigations involving 10,000 atoms would need 100 Tflops and 2.5 TB.
- Digital movies and special effects: 10^14 floating-point operations per frame at 50 frames per second; a 90-minute movie represents 2.7 x 10^19 floating-point operations. It would take 2,000 1-Gflops CPUs approximately 150 days to complete the computation.
- Business applications: inventory planning, risk analysis, workforce scheduling, and chip design.

4 Conventional Wisdom (CW) in Computer Architecture - Patterson
- Old CW: Power is free, transistors are expensive. New CW: "power wall": power is expensive, transistors are free (more can be put on a chip than can be afforded to turn on).
- Old CW: Multiplies are slow, memory access is fast. New CW: "memory wall": memory is slow, multiplies are fast (200 clocks to DRAM, 4 clocks for a floating-point multiply).
- Old CW: Increase instruction-level parallelism via compilers and innovation (out-of-order execution, speculation, VLIW, ...). New CW: "ILP wall": diminishing returns on more ILP.
- New CW: power wall + memory wall + ILP wall = brick wall.
- Old CW: uniprocessor performance doubles every 1.5 years. New CW: uniprocessor performance doubles only every 5 years?

5

6 Multicore and Manycore Processors: IBM Cell; NVIDIA GeForce 8800 (128 scalar processors) and Tesla; Sun T1 and T2; Tilera Tile64; Picochip (combines 430 simple RISC cores); Cisco (188 cores); TRIPS.

7 Parallel Programming?  Programming where concurrent executions are explicitly specified, possibly in a high-level language.  Stake-holders: architects, who must understand workloads; algorithm designers, who focus on designs for real systems; programmers, who must understand performance issues and engineer for better performance.

8 Parallel Programming  Four approaches:
1. Extend an existing compiler, e.g., a parallelizing Fortran compiler.
2. Extend an existing language with new constructs, e.g., MPI and OpenMP.
3. Add a parallel programming layer. Not popular.
4. Design a new parallel language and build a compiler for it. Most difficult.
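To make the second approach concrete, here is a minimal C/OpenMP sketch (my own illustration, not from the slides): a single compiler directive added to an otherwise ordinary C loop is enough to run it in parallel.

    #include <stdio.h>

    int main(void) {
        double a[1000];
        /* The pragma is the "new construct"; the base C code is unchanged
           and still compiles as a sequential program without OpenMP. */
        #pragma omp parallel for
        for (int i = 0; i < 1000; i++)
            a[i] = 2.0 * i;
        printf("a[999] = %.1f\n", a[999]);
        return 0;
    }

Compiled with OpenMP enabled (e.g., -fopenmp), the loop iterations are split across threads; without it the pragma is ignored and the program runs sequentially, which is exactly the appeal of this approach.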

9 Parallel Programming  How is it different from programming a uniprocessor? In the latter case the program is mostly fixed and mostly taken for granted: other entities such as compilers and operating systems change, but the source need not be rewritten.

10 Parallel Programming  Programs have to be written to suit the available architecture.  A continuous evolutionary model taking into account parallel software and architecture.  Some challenges: more processors, the memory hierarchy, and scope for several optimizations/trade-offs, e.g., communication.

11 Parallelization Process  Assume that a description of the sequential program is available.  Does the sequential program lend itself to direct parallelization? There are enough cases where it does and enough where it does not; we will see an example of each.

12 Parallelization Process  Identify tasks that can be done in parallel.  Goal: to get a high-performance implementation with reasonable effort and resources.  Who should do it? The compiler, the OS, the run-time system, or the programmer; each approach has different challenges.

13 Parallelization Process – 4 Steps
1. Decomposition: computation to tasks.
2. Assignment: task-to-process assignment.
3. Orchestration: understand communication and synchronization.
4. Mapping: map to physical processors.

14 Parallelization Process – In Pictures (figure: the computation flows through Decomposition, Assignment, Orchestration, and Mapping, ending with processes placed on processors P1–P4).

15 Decomposition  Break the computation into a collection of tasks. Can have dynamic generation of tasks.  Goal is to expose as much concurrency as possible. Careful to keep the overhead manageable.

16 Decomposition  Limitation: the available concurrency, formalized as Amdahl's law. Let s be the fraction of operations in a computation that must be performed sequentially, with 0 ≤ s ≤ 1. The maximum speed-up achievable by a parallel computer with p processors is 1 / (s + (1 - s)/p).

17 Decomposition  Implications of Amdahl's Law: some processors may have to sit idle due to the sequential portion of the program; the law also applies to other resources.  Quick example: if 20% of the program is sequential, then the best speed-up with 10 processors is limited to 1/(0.2 + 0.8/10) = 1/0.28 ≈ 3.6.  Amdahl's Law: as p → ∞, the speed-up is bounded by 1/s.
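A small C helper (my illustration) that evaluates the bound above; with s = 0.2 and p = 10 it reproduces the roughly 3.6x figure from the example.

    #include <stdio.h>

    /* Amdahl's law: speed-up on p processors is at most 1 / (s + (1 - s)/p). */
    static double amdahl_speedup(double s, int p) {
        return 1.0 / (s + (1.0 - s) / p);
    }

    int main(void) {
        printf("s = 0.2, p = 10:   speed-up <= %.2f\n", amdahl_speedup(0.2, 10)); /* ~3.57 */
        printf("s = 0.2, p -> inf: speed-up <= %.2f\n", 1.0 / 0.2);               /* 5.00  */
        return 0;
    }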

18 Decomposition
 Amdahl's Law: as p → ∞, the speed-up is bounded by 1/s.
 Example: a 2-phase calculation. Sweep over an n-by-n grid and do some independent computation; then sweep again and add each value to a global sum.
 Time for the first phase = n^2/p.
 The second phase is serialized at the global variable, so its time = n^2.
 Speedup ≤ 2n^2 / (n^2/p + n^2) = 2p/(p + 1), or at most 2.
 Trick: divide the second phase into two. Accumulate into a private sum during the sweep; then add each per-process private sum into the global sum.
 Parallel time is then n^2/p + n^2/p + p, and the speedup is at best 2pn^2 / (2n^2 + p^2).
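A hedged C/OpenMP sketch of the trick above (the flat grid layout and the doubling computation are my stand-ins): each thread accumulates into a private sum during the sweep, and only the final per-thread additions into the global sum are serialized.

    #include <omp.h>

    /* Two-phase sweep over an n-by-n grid stored row-major in grid[0..n*n-1]. */
    double sweep_and_sum(double *grid, int n) {
        double global_sum = 0.0;
        #pragma omp parallel
        {
            double private_sum = 0.0;                 /* per-thread accumulator   */
            #pragma omp for
            for (int i = 0; i < n * n; i++) {
                grid[i] = 2.0 * grid[i];              /* independent computation  */
                private_sum += grid[i];               /* phase 2, done privately  */
            }
            #pragma omp critical                      /* only p serialized adds   */
            global_sum += private_sum;
        }
        return global_sum;
    }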

19 Assignment  Distribution of tasks among processes.  Issue: balance the load among the processes; load includes both the number of tasks and inter-process communication.  One has to be careful because inter-process communication is expensive and load imbalance can affect performance.

20 Assignment: Static vs. Dynamic  Static assignment: the assignment is completely specified at the beginning and does not change after that. Useful for very structured applications.

21 Assignment: Static vs. Dynamic  Dynamic assignment: the assignment changes at runtime; imagine a task pool. It has a chance to correct load imbalance. Useful for unstructured applications.
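A minimal C/OpenMP sketch of the task-pool idea (the function names are my placeholders): with schedule(dynamic) each thread pulls the next task from a shared pool as soon as it finishes, which corrects load imbalance when task costs vary; schedule(static) would instead fix the assignment up front.

    #include <omp.h>

    /* Run num_tasks independent tasks, handing them out one at a time
       from a shared pool (dynamic assignment). */
    void run_tasks(int num_tasks, void (*do_task)(int)) {
        #pragma omp parallel for schedule(dynamic, 1)
        for (int t = 0; t < num_tasks; t++)
            do_task(t);
    }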

22 Orchestration  Bring in the architecture, the programming model, and the programming language.  Consider the available mechanisms for data exchange, synchronization, and inter-process communication, and the various programming model primitives and their relative merits.

23 Orchestration  Data structures and their organization.  Exploit temporal locality among tasks assigned to a process by proper scheduling.  Implicit vs. explicit communication  Size of messages.

24 Orchestration – Goals  Preserving data locality  Task scheduling to remove inter-task waiting.  Reduce the overhead of managing parallelism.

25 Mapping  Closer and specific to the system and the programming environment.  User controlled  Which process runs on which processor? Want an assignment that preserves locality of communication.

26 Mapping  System controlled: the OS schedules processes on processors dynamically; processes may be migrated across processors.  In-between approach: take user requests into account, but the system may change the assignment.

27 Parallelizing Computation vs. Data  Computation is decomposed and assigned (partitioned).  Partitioning data is often a natural view too: computation follows data ("owner computes"). Examples: grid computations, data mining.  The distinction between computation and data is stronger in many applications, e.g., Raytrace.

28 Parallelization Process – Summary  Of the 4 stages, decomposition and assignment are independent of the architecture and the programming language/environment.
Step | Architecture dependent? | Goals
1. Decomposition | Mostly no | Expose enough concurrency
2. Assignment | Mostly no | Load balancing
3. Orchestration | Yes | Reduce IPC, inter-task dependence, synchronization
4. Mapping | Yes | Exploit communication locality

29 Parallelization Process – Summary (the table from the previous slide, repeated).

30 Rest of the Lecture  Concentrate on Steps 1 and 2; these are algorithmic in nature.  Steps 3 and 4 are programming in nature and mostly self-taught; a few inputs from my side.

31 Parallelization Process – In Pictures (figure repeated: Decomposition, Assignment, Orchestration, and Mapping onto processors P1–P4).

32 A Similar View  Along similar lines, proposed by Ian Foster: Partitioning: like decomposition. Communication: understand the communication required by the partition. Agglomeration: combine tasks to reduce communication, preserve locality, and ease the programming effort. Mapping: map processes to processors.  See Parallel Programming in C with MPI and OpenMP, M. J. Quinn.

33 Foster’s Design Methodology PartitioningCommunication Agglomeration Mapping

34 Example 1 – Sequential to Parallel  Matrix Multiplication. Listing 1: Sequential Code
    for i = 1 to n do
        for j = 1 to n do
            C[i][j] = 0
            for k = 1 to n do
                C[i][j] += A[i][k] * B[k][j]

35 Matrix Multiplication  It is easy to modify the sequential algorithm into a parallel algorithm.  Several techniques are available: a recursive approach; sub-matrices in parallel; rows/columns in parallel.
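A hedged C/OpenMP sketch of the simplest of these options, rows in parallel (flat row-major arrays are my choice): each thread computes complete rows of C, so no synchronization is needed on the output.

    #include <omp.h>

    /* C = A * B for n x n matrices stored row-major. */
    void matmul_rows(const double *A, const double *B, double *C, int n) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)          /* rows of C handled in parallel */
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += A[i * n + k] * B[k * n + j];
                C[i * n + j] = sum;
            }
    }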

36 Example 2 – New Parallel Algorithm  Prefix Computations: given an array A of n elements and an associative operation o, compute A(1) o A(2) o ... o A(i) for each i.  A very simple sequential algorithm exists for this problem. Listing 1:
    S(1) = A(1)
    for i = 2 to n do
        S(i) = S(i-1) o A(i)
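The same sequential loop written out in C (my packaging), with the associative operation o passed in as a function pointer:

    /* Inclusive prefix computation: s[i] = a[0] o a[1] o ... o a[i]. */
    void seq_prefix(const double *a, double *s, int n,
                    double (*op)(double, double)) {
        s[0] = a[0];
        for (int i = 1; i < n; i++)
            s[i] = op(s[i - 1], a[i]);
    }

With op returning x + y this computes prefix sums. Each s[i] depends on s[i-1], so the loop is inherently sequential, which is exactly why a different parallel algorithm is needed.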

37 Parallel Prefix Computation  The sequential algorithm in Listing 1 is not efficient in parallel.  We need a new algorithmic approach: the balanced binary tree.

38 Balanced Binary Tree  An algorithm design approach for parallel algorithms.  Many problems can be solved with this design technique.  Easily amenable to parallelization and analysis.

39 Balanced Binary Tree  A complete binary tree with a processor at each internal node.  The input is at the leaf nodes.  Define the operations to be executed at the internal nodes; the inputs to the operation at a node are the values at its children.  The computation proceeds as a tree traversal from the leaves to the root.

40 Balanced Binary Tree – Prefix Sums (figure: a complete binary tree of + operators over the leaves a0, a1, ..., a7).

41 Balanced Binary Tree – Sum (figure: the tree of + operators over a0, ..., a7; the first level computes a0+a1, a2+a3, a4+a5, a6+a7, the next level computes a0+a1+a2+a3 and a4+a5+a6+a7, and the root holds Σ a_i).

42 Balanced Binary Tree – Sum  The above approach is called an "upward traversal": data flows from the children to the root. It is helpful in other situations as well, such as computing the maximum or evaluating expressions.  Analogously, one can define a downward traversal, where data flows from the root to the leaves; this helps in settings such as broadcasting an element.
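A hedged C/OpenMP sketch of the upward traversal for the sum, using an array instead of an explicit tree (my simplification): in each round an element combines with its sibling one stride away, so after log2(n) rounds a[0] holds the total.

    #include <omp.h>

    /* In-place balanced-tree sum; n is assumed to be a power of two. */
    double tree_sum(double *a, int n) {
        for (int stride = 1; stride < n; stride *= 2) {   /* one tree level per round */
            #pragma omp parallel for
            for (int i = 0; i < n; i += 2 * stride)
                a[i] += a[i + stride];                    /* parent = left + right    */
        }
        return a[0];
    }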

43 Balanced Binary Tree  Can use a combination of both upward and downward traversal.  Prefix computation requires that.  Illustration in the next slide.

44 Balanced Binary Tree – Sum (figure: the same tree over the leaves a1, ..., a8; the first level computes a1+a2, a3+a4, a5+a6, a7+a8, the next level computes a1+a2+a3+a4 and a5+a6+a7+a8, and the root holds Σ a_i).

45 Balanced Binary Tree – Prefix Sum (figure: the upward traversal over a1, ..., a8, computing the same pairwise and four-way sums up to Σ a_i at the root).

46 Balanced Binary Tree – Prefix Sum (figure: the downward traversal for even indices; the prefix sums at even positions, namely a1+a2, a1+a2+a3+a4, Σ_{i=1..6} a_i, and Σ_{i=1..8} a_i, are read off directly from the internal-node values computed in the upward traversal).

47 Balanced Binary Tree – Prefix Sum (figure: the downward traversal for odd indices; the prefix sums at odd positions are obtained as a1, (a1+a2) + a3, (Σ_{i=1..4} a_i) + a5, and (Σ_{i=1..6} a_i) + a7).

48 Balanced Binary Tree – Prefix Sums
 Two traversals of a complete binary tree.
 The tree is only a visual aid: map processors to locations in the tree and perform the equivalent computations. The algorithm is designed in the PRAM model; it works in logarithmic time with an optimal number of operations.
// upward traversal
1. for i = 1 to n/2 do in parallel: b_i = a_{2i-1} o a_{2i}
2. Recursively compute the prefix sums of B = (b_1, b_2, ..., b_{n/2}) and store them in C = (c_1, c_2, ..., c_{n/2})
// downward traversal
3. for i = 1 to n do in parallel:
     i = 1:          s_1 = a_1
     i even:         s_i = c_{i/2}
     i odd, i > 1:   s_i = c_{(i-1)/2} o a_i
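A hedged C translation of the recursive scheme above, specialized to + (the temporary arrays, 0-based indexing, and OpenMP pragmas are my additions); it performs O(n) work over O(log n) recursion levels.

    #include <stdlib.h>
    #include <omp.h>

    /* Inclusive prefix sums of a[0..n-1] into s[0..n-1]; n assumed a power of two. */
    void prefix_sums(const double *a, double *s, int n) {
        if (n == 1) { s[0] = a[0]; return; }

        double *b = malloc((n / 2) * sizeof *b);
        double *c = malloc((n / 2) * sizeof *c);

        /* Step 1 (upward): pair up neighbouring elements. */
        #pragma omp parallel for
        for (int i = 0; i < n / 2; i++)
            b[i] = a[2 * i] + a[2 * i + 1];

        /* Step 2: prefix sums of the half-sized array. */
        prefix_sums(b, c, n / 2);

        /* Step 3 (downward): even 1-based positions copy a value of C,
           odd 1-based positions (except the first) add one input element. */
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            if (i == 0)          s[i] = a[0];
            else if (i % 2 == 1) s[i] = c[(i - 1) / 2];       /* position i+1 is even */
            else                 s[i] = c[i / 2 - 1] + a[i];  /* position i+1 is odd  */
        }

        free(b);
        free(c);
    }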

49 The PRAM Model  An extension of the von Neumann model (figure: processors P1, P2, P3, ..., Pn connected to a global shared memory).

50 The PRAM Model  A set of n identical processors.  A shared memory accessible by all processors.  Synchronous time steps.  An access to the shared memory costs the same as a unit of computation.  Different variants provide semantics for concurrent access to the shared memory: EREW, CREW, CRCW (Common, Arbitrary, Priority, ...).

51 PRAM Model – Advantages and Drawbacks
Advantages:
 A simple model for algorithm design.
 Hides architectural details from the designer.
 A good starting point.
Disadvantages:
 Ignores architectural features such as memory bandwidth, communication cost and latency, scheduling, ...
 The hardware may be difficult to realize.

52 Other Models  The Network Model (figure: a graph of processors P1–P7 connected by edges).  A graph G of processors.  Processors send/receive messages over the edges.  Computation proceeds through communication.  Efficiency depends on the graph G.

53 The Network Model  There are a few disadvantages: the algorithm has to change if the network changes, and algorithms are difficult to specify and design.

54 More Design Paradigms  Divide and Conquer: like the sequential design technique.  Partitioning: a case of divide and conquer where the subproblems are independent of each other, so there is no need to combine solutions; better suited for algorithms such as merging.  Path Doubling or Pointer Jumping: suitable where the data is in linked lists.

55 More Design Paradigms  Accelerated Cascading: a technique to combine two parallel algorithms to get a better algorithm. Algorithm A may be very fast but do a lot of operations; Algorithm B is slower but work-optimal. Combining Algorithm A and Algorithm B gives both advantages.

56 References  Parallel Computer Architecture: A Hardware/Software Approach, D. Culler, J. P. Singh, and A. Gupta.  Parallel Programming in C with MPI and OpenMP, M. J. Quinn.  An Introduction to Parallel Algorithms, J. JaJa.

57 List Ranking – Another Example  Process a linked list to compute, for each node, its distance from one end of the list.  Linked lists are a fundamental data structure.

58 List Ranking – Another Example  Two approaches: pointer jumping – 3; independent-set based – 3.
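Since the slides stop here, a hedged C/OpenMP sketch of the pointer-jumping approach (the array representation and names are mine): every node repeatedly adds its successor's rank and then jumps its pointer past it, so O(log n) synchronous rounds suffice.

    #include <stdlib.h>
    #include <omp.h>

    /* next_in[i] is the successor of node i, with next_in[i] == i at the end of the list.
       On return, rank[i] is the distance from node i to the end of the list. */
    void list_rank(const int *next_in, int *rank, int n) {
        int *next     = malloc(n * sizeof *next);
        int *next_new = malloc(n * sizeof *next_new);
        int *rank_new = malloc(n * sizeof *rank_new);

        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            next[i] = next_in[i];
            rank[i] = (next[i] == i) ? 0 : 1;          /* the last node has rank 0 */
        }

        for (int step = 1; step < n; step *= 2) {      /* O(log n) rounds */
            #pragma omp parallel for
            for (int i = 0; i < n; i++) {
                rank_new[i] = rank[i] + rank[next[i]]; /* add the successor's rank */
                next_new[i] = next[next[i]];           /* jump the pointer         */
            }
            #pragma omp parallel for
            for (int i = 0; i < n; i++) {              /* commit this round        */
                rank[i] = rank_new[i];
                next[i] = next_new[i];
            }
        }
        free(next); free(next_new); free(rank_new);
    }

The double-buffered arrays mimic the synchronous PRAM rounds, so no node reads a value that has already been updated within the same round.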

