1 Day 2

2 Agenda Parallelism basics Parallel machines Parallelism again High Throughput Computing –Finding the right grain size

3 One thing to remember Easy Hard

4 Seeking Concurrency Data dependence graphs Data parallelism Functional parallelism Pipelining

5 Data Dependence Graph Directed graph Vertices = tasks Edges = dependences

6 Data Parallelism Independent tasks apply same operation to different elements of a data set Okay to perform operations concurrently for i ← 0 to 99 do a[i] ← b[i] + c[i] endfor
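
A minimal sketch (not from the original slides) of the same data-parallel loop in C with OpenMP; the arrays and their contents are made up for illustration.

#include <stdio.h>

#define N 100

int main(void) {
    double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2 * i; }   /* sample data */

    /* each iteration is independent, so the iterations may run concurrently */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];

    printf("a[99] = %f\n", a[99]);
    return 0;
}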

7 Functional Parallelism Independent tasks apply different operations to different data elements First and second statements Third and fourth statements a ← 2 b ← 3 m ← (a + b) / 2 s ← (a² + b²) / 2 v ← s − m²
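
As an illustrative sketch (assumed, not from the slides), the third and fourth statements above are independent of each other; in C with OpenMP this could be expressed with parallel sections.

#include <stdio.h>

int main(void) {
    double a = 2.0, b = 3.0, m, s, v;

    /* m and s depend only on a and b, so they can be computed concurrently */
    #pragma omp parallel sections
    {
        #pragma omp section
        m = (a + b) / 2.0;
        #pragma omp section
        s = (a * a + b * b) / 2.0;
    }
    v = s - m * m;          /* v needs both m and s, so it runs afterwards */
    printf("mean = %f, variance = %f\n", m, v);
    return 0;
}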

8 Pipelining Divide a process into stages Produce several items simultaneously

9 Data Clustering Data mining = looking for meaningful patterns in large data sets Data clustering = organizing a data set into clusters of “similar” items Data clustering can speed retrieval of related items

10 Document Vectors (figure: document vectors for Alice in Wonderland, A Biography of Jules Verne, The Geology of Moon Rocks, and The Story of Apollo 11, plotted against the terms "Moon" and "Rocket")

11 Document Clustering

12 Clustering Algorithm Compute document vectors Choose initial cluster centers Repeat –Compute performance function –Adjust centers Until function value converges or max iterations have elapsed Output cluster centers
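
A compact, hypothetical C sketch of this clustering loop. To keep it short it uses one-dimensional "document vectors" (plain doubles), two clusters, and a fixed iteration count in place of the convergence test; all data values are invented.

#include <math.h>
#include <stdio.h>

#define N 8   /* number of document vectors */
#define K 2   /* number of clusters */

int main(void) {
    double doc[N] = {0.1, 0.2, 0.15, 0.9, 0.85, 0.8, 0.05, 0.95};
    double center[K] = {0.0, 1.0};          /* initial cluster centers */
    int assign[N];

    for (int iter = 0; iter < 100; iter++) {
        /* assign each document to the closest center (a data-parallel step) */
        for (int i = 0; i < N; i++) {
            int best = 0;
            for (int k = 1; k < K; k++)
                if (fabs(doc[i] - center[k]) < fabs(doc[i] - center[best]))
                    best = k;
            assign[i] = best;
        }
        /* adjust each center to the mean of its members */
        for (int k = 0; k < K; k++) {
            double sum = 0.0; int count = 0;
            for (int i = 0; i < N; i++)
                if (assign[i] == k) { sum += doc[i]; count++; }
            if (count > 0) center[k] = sum / count;
        }
    }
    for (int k = 0; k < K; k++)
        printf("center %d = %f\n", k, center[k]);
    return 0;
}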

13 Data Parallelism Opportunities Operation being applied to a data set Examples –Generating document vectors –Finding closest center to each vector –Picking initial values of cluster centers

14 Functional Parallelism Opportunities Draw data dependence diagram Look for sets of nodes such that there are no paths from one node to another

15 Data Dependence Diagram Build document vectors Compute function value Choose cluster centers Adjust cluster centers Output cluster centers

16 Programming Parallel Computers Extend compilers: translate sequential programs into parallel programs Extend languages: add parallel operations Add parallel language layer on top of sequential language Define totally new parallel language and compiler system

17 Strategy 1: Extend Compilers Parallelizing compiler –Detect parallelism in sequential program –Produce parallel executable program Focus on making Fortran programs parallel

18 Extend Compilers (cont.) Advantages –Can leverage millions of lines of existing serial programs –Saves time and labor –Requires no retraining of programmers –Sequential programming easier than parallel programming

19 Extend Compilers (cont.) Disadvantages –Parallelism may be irretrievably lost when programs written in sequential languages –Performance of parallelizing compilers on broad range of applications still up in the air

20 Extend Language Add functions to a sequential language –Create and terminate processes –Synchronize processes –Allow processes to communicate

21 Extend Language (cont.) Advantages –Easiest, quickest, and least expensive –Allows existing compiler technology to be leveraged –New libraries can be ready soon after new parallel computers are available

22 Extend Language (cont.) Disadvantages –Lack of compiler support to catch errors –Easy to write programs that are difficult to debug

23 Add a Parallel Programming Layer Lower layer –Core of computation –Process manipulates its portion of data to produce its portion of result Upper layer –Creation and synchronization of processes –Partitioning of data among processes A few research prototypes have been built based on these principles

24 Create a Parallel Language Develop a parallel language “from scratch” –occam is an example Add parallel constructs to an existing language –Fortran 90 –High Performance Fortran –C*

25 New Parallel Languages (cont.) Advantages –Allows programmer to communicate parallelism to compiler –Improves probability that executable will achieve high performance Disadvantages –Requires development of new compilers –New languages may not become standards –Programmer resistance

26 Current Status Low-level approach is most popular –Augment existing language with low-level parallel constructs –MPI and OpenMP are examples Advantages of low-level approach –Efficiency –Portability Disadvantage: More difficult to program and debug

27 Architectures Interconnection networks Processor arrays (SIMD/data parallel) Multiprocessors (shared memory) Multicomputers (distributed memory) Flynn’s taxonomy

28 Interconnection Networks Uses of interconnection networks –Connect processors to shared memory –Connect processors to each other Interconnection media types –Shared medium –Switched medium

29 Shared versus Switched Media

30 Shared Medium Allows only one message at a time Messages are broadcast Each processor “listens” to every message Arbitration is decentralized Collisions require resending of messages Ethernet is an example

31 Switched Medium Supports point-to-point messages between pairs of processors Each processor has its own path to switch Advantages over shared media –Allows multiple messages to be sent simultaneously –Allows scaling of network to accommodate increase in processors

32 Switch Network Topologies View switched network as a graph –Vertices = processors or switches –Edges = communication paths Two kinds of topologies –Direct –Indirect

33 Direct Topology Ratio of switch nodes to processor nodes is 1:1 Every switch node is connected to –1 processor node –At least 1 other switch node

34 Indirect Topology Ratio of switch nodes to processor nodes is greater than 1:1 Some switches simply connect other switches

35 Evaluating Switch Topologies Diameter Bisection width Number of edges / node Constant edge length? (yes/no)

36 2-D Mesh Network Direct topology Switches arranged into a 2-D lattice Communication allowed only between neighboring switches Variants allow wraparound connections between switches on edge of mesh

37 2-D Meshes

38 Vector Computers Vector computer: instruction set includes operations on vectors as well as scalars Two ways to implement vector computers –Pipelined vector processor: streams data through pipelined arithmetic units –Processor array: many identical, synchronized arithmetic processing elements

39 Why Processor Arrays? Historically, high cost of a control unit Scientific applications have data parallelism

40 Processor Array

41 Data/instruction Storage Front end computer –Program –Data manipulated sequentially Processor array –Data manipulated in parallel

42 Processor Array Performance Performance: work done per time unit Performance of processor array –Speed of processing elements –Utilization of processing elements

43 Performance Example 1 1024 processors Each adds a pair of integers in 1 μsec What is performance when adding two 1024-element vectors (one element per processor)?
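
A worked answer (assuming, per the previous slide, that performance is operations per unit time and all processing elements operate in lockstep): the 1024 additions complete in a single 1 μsec step, so performance is 1024 operations / 1 μsec ≈ 1.0 × 10^9 operations per second.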

44 Performance Example 2 512 processors Each adds two integers in 1 μsec Performance adding two vectors of length 600?
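
A worked answer under the same assumptions: 512 processors can add at most 512 pairs per 1 μsec step, so the 600 additions need two steps (512 in the first, the remaining 88 in the second). Performance is therefore 600 operations / 2 μsec = 3 × 10^8 operations per second; utilization falls because 424 processors sit idle during the second step.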

45 2-D Processor Interconnection Network Each VLSI chip has 16 processing elements

46 if (COND) then A else B

49 Processor Array Shortcomings Not all problems are data-parallel Speed drops for conditionally executed code Don’t adapt to multiple users well Do not scale down well to “starter” systems Rely on custom VLSI for processors Expense of control units has dropped

50 Multicomputer, aka Distributed Memory Machines Distributed memory multiple-CPU computer Same address on different processors refers to different physical memory locations Processors interact through message passing Commercial multicomputers Commodity clusters

51 Asymmetrical Multicomputer

52 Asymmetrical MC Advantages Back-end processors dedicated to parallel computations → easier to understand, model, and tune performance Only a simple back-end operating system needed → easy for a vendor to create

53 Asymmetrical MC Disadvantages Front-end computer is a single point of failure Single front-end computer limits scalability of system Primitive operating system in back-end processors makes debugging difficult Every application requires development of both front-end and back-end program

54 Symmetrical Multicomputer

55 Symmetrical MC Advantages Alleviate performance bottleneck caused by single front-end computer Better support for debugging Every processor executes same program

56 Symmetrical MC Disadvantages More difficult to maintain illusion of single “parallel computer” No simple way to balance program development workload among processors More difficult to achieve high performance when multiple processes on each processor

57 Commodity Cluster Co-located computers Dedicated to running parallel jobs No keyboards or displays Identical operating system Identical local disk images Administered as an entity

58 Network of Workstations Dispersed computers First priority: person at keyboard Parallel jobs run in background Different operating systems Different local images Checkpointing and restarting important

59 DM programming model Communicating sequential programs Disjoint address spaces Communicate by sending “messages” A message is an array of bytes –send(dest, char *buf, int len); –receive(&dest, char *buf, int &len);
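
A hedged sketch of the same send/receive pattern using MPI's point-to-point calls in C (MPI_Send, MPI_Recv, and MPI_Get_count are the standard API; the message text, tag, and ranks are made up). Run with at least two processes.

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[]) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        char buf[] = "hello from rank 0";
        /* send(dest, buf, len) */
        MPI_Send(buf, (int)strlen(buf) + 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        char buf[64];
        MPI_Status status;
        int len;
        /* receive(&source, buf, &len) */
        MPI_Recv(buf, sizeof buf, MPI_CHAR, MPI_ANY_SOURCE, 0,
                 MPI_COMM_WORLD, &status);
        MPI_Get_count(&status, MPI_CHAR, &len);
        printf("rank 1 got %d bytes from rank %d: %s\n",
               len, status.MPI_SOURCE, buf);
    }
    MPI_Finalize();
    return 0;
}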

60 Multiprocessors Multiprocessor: multiple-CPU computer with a shared memory Same address on two different CPUs refers to the same memory location Avoid three problems of processor arrays –Can be built from commodity CPUs –Naturally support multiple users –Maintain efficiency in conditional code

61 Centralized Multiprocessor Straightforward extension of uniprocessor Add CPUs to bus All processors share same primary memory Memory access time same for all CPUs –Uniform memory access (UMA) multiprocessor –Symmetrical multiprocessor (SMP)

62 Centralized Multiprocessor

63 Private and Shared Data Private data: items used only by a single processor Shared data: values used by multiple processors In a multiprocessor, processors communicate via shared data values

64 Problems Associated with Shared Data Cache coherence –Replicating data across multiple caches reduces contention –How to ensure different processors have same value for same address? Synchronization –Mutual exclusion –Barrier
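
A small illustrative sketch (assumed, not from the slides) of the two synchronization mechanisms just listed, using OpenMP in C; the shared counter is an invented example.

#include <stdio.h>

int main(void) {
    int counter = 0;                 /* shared data */

    #pragma omp parallel
    {
        /* mutual exclusion: only one thread updates the shared value at a time */
        #pragma omp critical
        counter++;

        /* barrier: no thread proceeds until every thread has updated counter */
        #pragma omp barrier

        #pragma omp single
        printf("all %d threads checked in\n", counter);
    }
    return 0;
}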

65 Cache-coherence Problem (figure: memory holds X = 7; the caches of CPU A and CPU B are empty)

66 Cache-coherence Problem (figure: CPU A reads X and now caches the value 7)

67 Cache-coherence Problem (figure: CPU B also reads X, so both caches hold the value 7)

68 Cache-coherence Problem (figure: CPU A updates X to 2; memory and CPU A’s cache hold 2, but CPU B’s cache still holds the stale value 7)

69 Write Invalidate Protocol (figure: both caches hold X = 7; a cache control monitor watches the shared bus)

70 Write Invalidate Protocol (figure: CPU A broadcasts an “intent to write X” message on the bus)

71 Write Invalidate Protocol (figure: CPU B’s copy of X is invalidated; only CPU A’s cache still holds 7)

72 Write Invalidate Protocol (figure: CPU A writes X = 2 in its cache; CPU B’s copy remains invalid)

73 Distributed Multiprocessor Distribute primary memory among processors Increase aggregate memory bandwidth and lower average memory access time Allow greater number of processors Also called non-uniform memory access (NUMA) multiprocessor

74 Distributed Multiprocessor

75 Cache Coherence Some NUMA multiprocessors do not support it in hardware –Only instructions, private data in cache –Large memory access time variance Implementation more difficult –No shared memory bus to “snoop” –Directory-based protocol needed

76 Flynn’s Taxonomy Instruction stream Data stream Single vs. multiple Four combinations –SISD –SIMD –MISD –MIMD

77 SISD Single Instruction, Single Data Single-CPU systems Note: co-processors don’t count –Functional –I/O Example: PCs

78 SIMD Single Instruction, Multiple Data Two architectures fit this category –Pipelined vector processor (e.g., Cray-1) –Processor array (e.g., Connection Machine)

79 MISD Multiple Instruction, Single Data Example: systolic array

80 MIMD Multiple Instruction, Multiple Data Multiple-CPU computers –Multiprocessors –Multicomputers

81 Summary Commercial parallel computers appeared in 1980s Multiple-CPU computers now dominate Small-scale: Centralized multiprocessors Large-scale: Distributed memory architectures (multiprocessors or multicomputers)

82 Programming the Beast Task/channel model Algorithm design methodology Case studies

83 Task/Channel Model Parallel computation = set of tasks Task –Program –Local memory –Collection of I/O ports Tasks interact by sending messages through channels

84 Task/Channel Model Task Channel

85 Foster’s Design Methodology Partitioning Communication Agglomeration Mapping

86 Foster’s Methodology

87 Partitioning Dividing computation and data into pieces Domain decomposition –Divide data into pieces –Determine how to associate computations with the data Functional decomposition –Divide computation into pieces –Determine how to associate data with the computations

88 Example Domain Decompositions

89 Example Functional Decomposition

90 Partitioning Checklist At least 10x more primitive tasks than processors in target computer Minimize redundant computations and redundant data storage Primitive tasks roughly the same size Number of tasks an increasing function of problem size

91 Communication Determine values passed among tasks Local communication –Task needs values from a small number of other tasks –Create channels illustrating data flow Global communication –Significant number of tasks contribute data to perform a computation –Don’t create channels for them early in design

92 Communication Checklist Communication operations balanced among tasks Each task communicates with only a small group of neighbors Tasks can perform communications concurrently Tasks can perform computations concurrently

93 Agglomeration Grouping tasks into larger tasks Goals –Improve performance –Maintain scalability of program –Simplify programming In MPI programming, goal often to create one agglomerated task per processor

94 Agglomeration Can Improve Performance Eliminate communication between primitive tasks agglomerated into consolidated task Combine groups of sending and receiving tasks

95 Agglomeration Checklist Locality of parallel algorithm has increased Replicated computations take less time than communications they replace Data replication doesn’t affect scalability Agglomerated tasks have similar computational and communications costs Number of tasks increases with problem size Number of tasks suitable for likely target systems Tradeoff between agglomeration and code modification costs is reasonable

96 Mapping Process of assigning tasks to processors Centralized multiprocessor: mapping done by operating system Distributed memory system: mapping done by user Conflicting goals of mapping –Maximize processor utilization –Minimize interprocessor communication

97 Mapping Example

98 Optimal Mapping Finding optimal mapping is NP-hard Must rely on heuristics

99 Mapping Decision Tree Static number of tasks –Structured communication: constant computation time per task → agglomerate tasks to minimize communication and create one task per processor; variable computation time per task → cyclically map tasks to processors –Unstructured communication → use a static load balancing algorithm Dynamic number of tasks

100 Mapping Strategy Static number of tasks Dynamic number of tasks –Frequent communication between tasks → use a dynamic load balancing algorithm –Many short-lived tasks → use a run-time task-scheduling algorithm

101 Mapping Checklist Considered designs based on one task per processor and multiple tasks per processor Evaluated static and dynamic task allocation If dynamic task allocation chosen, task allocator is not a bottleneck to performance If static task allocation chosen, ratio of tasks to processors is at least 10:1

102 Case Studies Boundary value problem Finding the maximum The n-body problem Adding data input

103 Boundary Value Problem Ice water Rod Insulation

104 Rod Cools as Time Progresses

105 Finite Difference Approximation

106 Partitioning One data item per grid point Associate one primitive task with each grid point Two-dimensional domain decomposition

107 Communication Identify communication pattern between primitive tasks Each interior primitive task has three incoming and three outgoing channels

108 Agglomeration and Mapping Agglomeration

109 Sequential Execution Time χ – time to update an element n – number of elements m – number of iterations Sequential execution time: m(n−1)χ

110 Parallel Execution Time p – number of processors λ – message latency Parallel execution time: m(χ⌈(n−1)/p⌉ + 2λ)

111 Reduction Given associative operator ⊕ a₀ ⊕ a₁ ⊕ a₂ ⊕ … ⊕ aₙ₋₁ Examples –Add –Multiply –And, Or –Maximum, Minimum
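
A minimal sketch of a sum reduction using MPI's built-in collective (MPI_Reduce with MPI_SUM is the standard API; the per-process value is invented).

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, local, global;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    local = rank + 1;                 /* stand-in for a locally computed value */
    MPI_Reduce(&local, &global, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %d\n", global);
    MPI_Finalize();
    return 0;
}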

112 Parallel Reduction Evolution

115 Binomial Trees Subgraph of hypercube

116 Finding Global Sum (figure: initial values, one per task, in a 4 × 4 grid) 4 2 0 7 / -3 5 -6 -3 / 8 1 2 3 / -4 4 6 -1

117 Finding Global Sum (figure: after the first reduction step) 1 7 -6 4 / 4 5 8 2

118 Finding Global Sum (figure: after the second step) 8 -2 / 9 10

119 Finding Global Sum (figure: after the third step) 17 8

120 Finding Global Sum (figure: final result) 25 Binomial Tree
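
For comparison, a hand-rolled sketch of the binomial-tree sum above using MPI point-to-point calls; the pairing by bit position mirrors the figure, and the local value (the process rank) is just a stand-in.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, p, sum, other;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    sum = rank;                        /* stand-in for a local partial sum */
    for (int step = 1; step < p; step <<= 1) {
        if (rank & step) {
            /* this process is done: ship its partial sum to its partner */
            MPI_Send(&sum, 1, MPI_INT, rank - step, 0, MPI_COMM_WORLD);
            break;
        } else if (rank + step < p) {
            MPI_Recv(&other, 1, MPI_INT, rank + step, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            sum += other;
        }
    }
    if (rank == 0)
        printf("global sum = %d\n", sum);
    MPI_Finalize();
    return 0;
}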

121 Agglomeration

122 sum

123 The n-body Problem

125 Partitioning Domain partitioning Assume one task per particle Task has particle’s position, velocity vector Iteration –Get positions of all other particles –Compute new position, velocity

126 Gather

127 All-gather
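
A sketch of the all-gather step used in the n-body iteration (MPI_Allgather is the standard MPI collective; the 3-D position layout and values are illustrative).

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    double my_pos[3] = {1.0 * rank, 0.0, 0.0};     /* this task's particle position */
    double *all_pos = malloc(3 * p * sizeof(double));

    /* after the call, every process holds the positions of all p particles */
    MPI_Allgather(my_pos, 3, MPI_DOUBLE, all_pos, 3, MPI_DOUBLE, MPI_COMM_WORLD);
    /* all_pos[3*i .. 3*i+2] is particle i's position; compute new velocities here */

    free(all_pos);
    MPI_Finalize();
    return 0;
}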

128 Complete Graph for All-gather

129 Hypercube for All-gather

130 Communication Time Hypercube Complete graph

131 Adding Data Input

132 Scatter

133 Scatter in log p Steps (figure: eight items 1–8 distributed in three steps; at each step every holder sends half of its remaining items to a partner)
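
A sketch of distributing the input with MPI_Scatter (the standard MPI collective); the block size and the data created on rank 0 are invented for illustration.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define BLOCK 2   /* elements each process receives */

int main(int argc, char *argv[]) {
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    double *all = NULL, mine[BLOCK];
    if (rank == 0) {                       /* only the root holds the full input */
        all = malloc(p * BLOCK * sizeof(double));
        for (int i = 0; i < p * BLOCK; i++) all[i] = i;
    }
    MPI_Scatter(all, BLOCK, MPI_DOUBLE, mine, BLOCK, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    printf("rank %d received %.0f %.0f\n", rank, mine[0], mine[1]);

    free(all);                             /* free(NULL) is a no-op on non-root ranks */
    MPI_Finalize();
    return 0;
}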

134 Summary: Task/channel Model Parallel computation –Set of tasks –Interactions through channels Good designs –Maximize local computations –Minimize communications –Scale up

135 Summary: Design Steps Partition computation Agglomerate tasks Map tasks to processors Goals –Maximize processor utilization –Minimize inter-processor communication

136 Summary: Fundamental Algorithms Reduction Gather and scatter All-gather

137 High Throughput Computing Easy problems – formerly known as “embarrassingly parallel” – now known as “pleasingly parallel” Basic idea – “Gee, I have a whole bunch of jobs (each a single run of a program) that I need to do; why not run them concurrently rather than sequentially?” Sometimes called “bag of tasks” or parameter sweep problems

138 Bag-of-tasks

139 Examples A large number of proteins – each represented by a different file – to “dock” with a target protein –For all files x, execute f(x,y) Exploring a parameter space in n dimensions –Uniform –Non-uniform Monte Carlo simulations

140 Tools The most common tool is a queuing system – sometimes called a load management system or a local resource manager PBS, LSF, and SGE are the three most common. Condor is also often used. They all have the same basic functions; we’ll use PBS as an exemplar. Script languages (bash, Perl, etc.)

141 PBS qsub [options] script-file –Submit the script to run –Options can specify number of processors, other required resources (memory, etc.) –Returns the job ID (a string)

144 Other PBS Commands qstat – gives the status of jobs submitted to the queue qdel – deletes a job from the queue

145 Blasting a set of jobs

146 Issues Overhead per job is substantial –Don’t want to run millisecond jobs –May need to “bundle them up” May not be enough jobs to saturate resources –May need to break up jobs The I/O system may become saturated –Copy large files to /tmp, check for existence in your shell script, and copy only if not there May be more jobs than the queuing system can handle (many start to break down at several thousand jobs) Jobs may fail for no good reason –Develop scripts to check for output and re-submit failed jobs, up to k times

147 Homework 1. Submit a simple job to the queue that echoes the host name; redirect output to a file of your choice. 2. Via a script, submit 100 “hostname” jobs to the queue. Output should go to “output.X”, where X is the job number. 3. For each file in a rooted directory tree, run “wc” to count the words. Maintain the results in a “shadow” directory tree. Your script should be able to detect results that have already been computed.

