Principles of Parallel Algorithm Design Prof. Dr. Cevdet Aykanat Bilkent Üniversitesi Bilgisayar Mühendisliği Bölümü.


1 Principles of Parallel Algorithm Design Prof. Dr. Cevdet Aykanat Bilkent Üniversitesi Bilgisayar Mühendisliği Bölümü

2 Identifying concurrent tasks Mapping tasks onto multiple processes Distributing input, output, intermediate data Managing access to shared data Synchronizing processors Principles of Parallel Algorithm Design

3 several choices for each step relatively few combinations lead to a good parallel algorithm different choices yield best performance on –different parallel architectures –different parallel programming paradigms Principles of Parallel Algorithm Design Identifying concurrent tasks Mapping tasks onto multiple processes Distributing input, output, intermediate data Managing access to shared data Synchronizing processors

4 Decomposition, Tasks decomposition: –dividing a computation into smaller parts –some or all parts can be executed concurrently atomic task –user defined –indivisible units of computation –same size or different sizes

6 Task Dependence Graphs (TDG) directed acyclic graph nodes : atomic tasks directed edges : dependencies –some tasks use data produced by other tasks TDG can be weighted: –node wgt: amount of computation –edge wgt: amount of data multiple ways of expressing certain computations –different ways of arranging computations –lead to different TDGs
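The TDG idea above can be sketched in a few lines of Python. This is a minimal, hypothetical five-task example (task names and edges are made up, not from the slides): the graph is a dict mapping each task to the set of tasks whose output it consumes, and at each step every task whose dependencies are satisfied runs "simultaneously".

```python
# Hypothetical TDG: each task maps to the set of tasks it depends on.
deps = {
    "A": set(),         # A and B have no dependencies
    "B": set(),
    "C": {"A"},         # C consumes data produced by A
    "D": {"A", "B"},
    "E": {"C", "D"},
}

def runnable(deps, done):
    """Tasks not yet done whose dependencies are all satisfied."""
    return {t for t, d in deps.items() if t not in done and d <= done}

# Simulate a run: at each step, all runnable tasks execute concurrently.
done, schedule = set(), []
while len(done) < len(deps):
    step = runnable(deps, done)
    schedule.append(sorted(step))
    done |= step

print(schedule)  # [['A', 'B'], ['C', 'D'], ['E']]
```

With unit task weights, the number of steps (here 3) is the critical-path length of this DAG.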

10 Granularity, Concurrency granularity: number (#) and size of tasks –fine grain : large # of small tasks –coarse grain : small # of large tasks degree of concurrency (DoC): –# of tasks that can be executed simultaneously max DoC : maximum DoC at any given time –tree TDGs: max DoC = # of leaves (usually) avg DoC : DoC averaged over entire duration

12 Degree of Concurrency depends on granularity –finer task granularity : larger DoC –bound on fine granularity of a decomposition depends on shape of TDG –shallow and wide TDG : larger DoC –deep and thin TDG : smaller DoC –critical path: longest directed path between a start node and a finish node –critical path length = sum of wgts along the path –avg DoC = total work / critical path length
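The critical-path formulas above can be checked on a small node-weighted TDG. This is a hedged sketch: the task names, weights, and edges are invented for illustration. The critical-path length is the heaviest directed path from a start node to a finish node, and avg DoC = total work / critical path length.

```python
from functools import lru_cache

# Made-up node weights (amount of computation) and dependence edges.
work = {"A": 10, "B": 10, "C": 6, "D": 4, "E": 5}
deps = {"A": [], "B": [], "C": ["A"], "D": ["A", "B"], "E": ["C", "D"]}

@lru_cache(maxsize=None)
def finish_time(t):
    """Weight of the heaviest path from any start node through task t."""
    return work[t] + max((finish_time(p) for p in deps[t]), default=0)

critical_path_length = max(finish_time(t) for t in deps)  # A->C->E: 10+6+5 = 21
avg_doc = sum(work.values()) / critical_path_length       # 35 / 21
print(critical_path_length, round(avg_doc, 2))            # 21 1.67
```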

14 Task Interaction Graph (TIG) tasks share input, output or intermediate data interactions among independent tasks of a TDG TIG: pattern of interactions among tasks –node: task –edge: connects tasks that interact with each other TIG can be weighted: –node wgt: amount of computation –edge wgt: amount of interaction

16 Processes and Mapping process vs processor: –processes are logical computing agents that perform tasks mapping: assigning tasks to processes conflicting goals in a good mapping –maximize concurrency: map independent tasks to different processes –minimize idle time / interaction overhead: map tasks along critical path to same process, map tasks with high interaction to same process –these goals conflict: e.g., mapping all tasks to the same process eliminates interaction but also all concurrency

18 Decomposition Techniques recursive decomposition data decomposition exploratory decomposition speculative decomposition

19 Recursive Decomposition divide-and-conquer strategy → natural concurrency divide problem into a set of independent subproblems conquer: recursively solve each subproblem combine: solutions to subproblems into a solution of the whole problem if sequential algorithm is not based on DAC –restructure computation as a DAC algorithm –recursive decomposition to extract concurrency –e.g., finding minimum of an array A of n numbers
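The array-minimum example above can be sketched directly. The two recursive calls below operate on disjoint halves and are independent subproblems, so they could run concurrently; they are shown sequentially here for clarity.

```python
def rec_min(a, lo, hi):
    """Divide-and-conquer minimum of a[lo:hi]."""
    if hi - lo == 1:
        return a[lo]                # base case: single element
    mid = (lo + hi) // 2
    left = rec_min(a, lo, mid)      # independent subproblem 1
    right = rec_min(a, mid, hi)     # independent subproblem 2
    return min(left, right)        # combine step

A = [7, 3, 9, 1, 8, 4]
print(rec_min(A, 0, len(A)))  # 1
```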

23 Data Decomposition partition/decompose computational data domain use this partition to induce task decomposition –tasks: similar operations on different data parts partitioning output data –each output can be computed independently as a fn of input –example: block matrix multiplication –data decomposition may not lead to unique task decomposition –another example: computing itemset frequencies input: transactions & output: itemset frequencies
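The block matrix multiplication example can be sketched as follows. This is a minimal illustration (matrix size n=4 and block size b=2 are assumed): each (i, j) block of the output C is an independent task that reads a block-row of A and a block-column of B.

```python
import numpy as np

n, b = 4, 2                          # assumed matrix size and block size
A = np.arange(n * n, dtype=float).reshape(n, n)
B = np.eye(n)

C = np.zeros((n, n))
for i in range(0, n, b):             # each (i, j) pair is one output task
    for j in range(0, n, b):
        for k in range(0, n, b):     # task reads block-row i of A, block-col j of B
            C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]

assert np.allclose(C, A @ B)         # same result as the unblocked product
```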

28 Data Decomposition partitioning input data –may not be possible or desirable to partition output data e.g., finding min, sum of a set of numbers, sorting –a task created for each part of the input data –task: all computations that can be done using local data –a combine step may be needed to combine results of tasks –example: finding the sum of an array A of n numbers –example: computing itemset frequencies partitioning both output and input data –output data partitioning is feasible –partitioning of input data offers additional concurrency –example: computing itemset frequencies
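The array-sum example above can be sketched as input-data partitioning plus a combine step. The number of tasks p=4 and the input 1..100 are assumed for illustration: each task sums only its own chunk (local data), and the combine step adds the partial results.

```python
def partial_sums(a, p):
    """Split a among p tasks; each task sums its own chunk locally."""
    n = len(a)
    chunks = [a[k * n // p : (k + 1) * n // p] for k in range(p)]
    return [sum(c) for c in chunks]   # one local result per task

A = list(range(1, 101))               # 1 + 2 + ... + 100
parts = partial_sums(A, 4)            # local computation per task
total = sum(parts)                    # combine step
print(parts, total)  # [325, 950, 1575, 2200] 5050
```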

29 Data Decomposition partitioning intermediate data –multistage computations partitioning input or output data of an intermediate stage –may lead to higher concurrency –some restructuring of the algorithm may be needed –example: block matrix multiplication owner computes rule –each part performs all computations involving data it owns –input: perform all computations that can be done using local data –output: compute all data in the partition

33 Other Decomposition Techniques exploratory decomposition –search of a configuration space for a solution –partition the search space into smaller parts –search each part concurrently –total parallel work may be less than or greater than total serial work –example: 15-puzzle problem speculative decomposition hybrid decompositions –computation structured into multiple stages –may apply different decompositions in different stages –examples: finding min of an array and quicksort –data decomposition then recursive decomposition

39 Characteristics of Tasks task generation: static vs dynamic task generation –static: all tasks are known a priori to execution of algorithm data decomposition: matrix multiplication recursive decomposition: finding min of an array –dynamic: actual tasks and TDG/TIG not available a priori rules, guidelines governing task generation may be known recursive decomposition: quicksort another example: ray tracing task sizes: uniform vs non-uniform –complexity of mapping depends on this –tasks in matrix multiplication: uniform –tasks in quicksort: non-uniform

40 Characteristics of Tasks knowledge of task sizes –can be used in mapping –known: tasks in decompositions for matrix multiplication –unknown: tasks in 15-puzzle problem do not know a priori how many moves will lead to a soln. size of data associated with tasks –associated data must be available to the process –size and location of the associated data –consider data migration overhead in the mapping

41 Characteristics of Inter-Task Interactions static vs dynamic –static: pattern and timing of interactions known a priori –static interaction: decompositions for matrix multiplication –message-passing paradigm (MPP): active involvement of both interacting tasks static interactions easy to program dynamic interactions harder to program tasks assigned additional synchronization and polling responsibilities –shared-address-space (SASP): can handle both equally easily regular vs irregular (spatial structure) –regular: structure that can be exploited for efficient implementation structured/curvilinear grids (implicit connectivity) image dithering (example) –irregular: no such regular pattern exists unstructured grids (connectivity maintained explicitly) SpMxV (sparse matrix vector multiplication) –irregular and dynamic interactions harder to handle in MPP

43 Characteristics of Inter-Task Interactions read-only vs read-write –read-only: tasks require read-only access to shared data –example: decompositions for matrix multiplication –read-write: tasks need to read and write on shared data –example: heuristic search for 15-puzzle problem one-way vs two-way –2-way: data/work needed by a task explicitly supplied by another –usually involve predefined producer and consumer –1-way: only one of a pair of comm. tasks initiates & completes interaction –read-only → 1-way & read-write → either 1-way or 2-way –SASP can handle both interactions equally easily –MPP cannot handle 1-way interaction directly source of data should explicitly send it to the recipient static 1-way: easily converted to 2-way via program restructuring dynamic 1-way: nontrivial program restructuring for converting to 2-way –polling: task checks for pending requests from others at regular intervals

44 Mapping Techniques minimize overheads of parallel task execution –overhead: inter-process interaction –overhead: process idle time (uneven load distribution) load balancing –balanced aggregate load: necessary but not sufficient –computations & interactions well balanced at each stage –example: 12-task decomposition (9-12 depends on 1-8)

46 Static vs Dynamic Mapping static: distribute tasks prior to execution –static task generation: either static or dynamic mapping –good mapping: knowledge of task sizes, data sizes, TIG –non-trivial problem (usually NP-hard) –task sizes known but non-uniform even if no TDG/TIG → number partitioning problem dynamic: distribute workload during execution –dynamic task generation: dynamic mapping –task sizes unknown: dynamic mapping more effective –large data size: dynamic mapping costly (in MPP)

47 Static-Mapping Schemes mapping based on data partitioning –data partitioning induces a decomposition –partitioning selected with final mapping in mind i.e., p-way data decomposition –dense arrays –sparse data structures, graphs (FE meshes) mapping based on task partitioning –task dependence graphs, task interaction graphs hierarchical partitioning –hybrid decomposition and mapping techniques

48 Array Distribution Schemes block distributions: spatial locality of interaction –each process receives a contiguous block of entries –1D: each part contains a block of consecutive rows i.e., kth part contains rows kn/p... (k+1)n/p-1 –2D: checkerboard partitioning –higher dimensional distributions higher degree of concurrency less inter-process interaction example: matrix multiplication
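The 1D block formula above (part k gets rows kn/p through (k+1)n/p-1) can be sketched directly; integer arithmetic handles the case where p does not divide n. The sizes n=10 and p=4 are assumed for illustration.

```python
def block_range(k, n, p):
    """Rows owned by part k under the 1D block distribution."""
    return range(k * n // p, (k + 1) * n // p)

n, p = 10, 4
parts = [list(block_range(k, n, p)) for k in range(p)]
print(parts)  # [[0, 1], [2, 3, 4], [5, 6], [7, 8, 9]]

# Every row is owned by exactly one part, even though 4 does not divide 10.
owner_of = {row: k for k in range(p) for row in block_range(k, n, p)}
```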

56 Array Distribution Schemes cyclic distribution –amount of work differs for different matrix entries examples: ray casting, dense LU factorization block distribution leads to load imbalance –all processes have tasks from all parts of the matrix –good load balance, but complete loss of locality block-cyclic distribution –partition array into more than p blocks –map blocks to processes in a round-robin (scattered) manner randomized block distribution –when the distribution of work has some special pattern adaptive 2D array partitionings –rectilinear, jagged, orthogonal bisection
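The block-cyclic idea above can be sketched in one function. This is a 1D illustration with assumed sizes (12 rows, block size 2, 3 processes): the array is partitioned into more blocks than processes and the blocks are dealt round-robin, which spreads load when work is concentrated in one region (as in dense LU factorization).

```python
def block_cyclic_owner(i, block_size, p):
    """Process owning row i under a 1D block-cyclic distribution."""
    return (i // block_size) % p     # block index, dealt round-robin

n, block_size, p = 12, 2, 3          # assumed sizes for illustration
owners = [block_cyclic_owner(i, block_size, p) for i in range(n)]
print(owners)  # [0, 0, 1, 1, 2, 2, 0, 0, 1, 1, 2, 2]
```

With block_size = n/p this degenerates to the plain block distribution; with block_size = 1 it is the pure cyclic distribution.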

68 Dynamic Mapping Schemes centralized schemes –all tasks maintained in a common pool or by a process –idle processes take task(s) from central pool or master process –easier to implement –limited scalability: central pool/process becomes a bottleneck –chunk scheduling: idle processes get group of tasks danger of load imbalance due to large chunk sizes decrease chunk size as program progresses –e.g., sorting entries in each row of a matrix non-uniform tasks & unknown task sizes –e.g., image-space parallel ray casting
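The chunk-scheduling idea above (decrease chunk size as the program progresses) can be sketched as follows. The pool of 20 task ids and 4 workers are assumed; handing out roughly 1/p of the remaining tasks each time makes chunks shrink toward the end, reducing the load-imbalance danger of large fixed chunks.

```python
from collections import deque

pool = deque(range(20))          # 20 hypothetical task ids in a central pool
p = 4                            # number of worker processes

def take_chunk(pool, p):
    """Hand an idle worker ~1/p of the remaining tasks, at least one."""
    size = max(1, len(pool) // p)
    return [pool.popleft() for _ in range(size)]

chunks = []
while pool:
    chunks.append(take_chunk(pool, p))

print([len(c) for c in chunks])  # sizes shrink: [5, 3, 3, 2, 1, 1, ...]
```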

69 Dynamic Mapping Schemes distributed schemes –tasks are distributed among processes –more scalable (no bottleneck) –critical parameters of distributed load balancing how are sending and receiving processes paired? who initiates the work transfer: sender or receiver? how much work is transferred in each exchange? when is the work transfer performed? suitability to parallel architectures –both can be implemented in both SAS and MP paradigms –dynamic schemes require movement of tasks –computational granularity of tasks should be high in MP systems

70 Methods for Interaction Overheads factors: –volume and frequency of interaction –spatial and temporal pattern of interactions maximizing data locality –minimize volume of data exchange minimize overall volume of shared data similar to maximizing temporal data locality –minimize frequency of interaction high startup cost associated with each interaction restructure algorithm: shared data accessed in large pieces similar to increasing spatial locality of data access minimizing contention and hot spots –multiple tasks try to access same resource concurrently multiple simultaneous access to same memory block/bank multiple processes sending messages to same process at the same time

71 Methods for Interaction Overheads minimizing contention and hot spots –multiple tasks try to access same resource concurrently multiple simultaneous access to same memory block/bank multiple processes sending messages to same process simult. –e.g., matrix multiplication based on 2D partitioning overlapping computations with interactions –early initiation of an interaction –support from programming paradigm, OS, hardware –MP: non-blocking message-passing primitives

72 Methods for Interaction Overheads replicating data or computation –replicating frequently accessed read-only shared data –MP paradigm benefits more from data replication –replicated computation for shared intermediate results using optimized collective interaction operations –usually use available implementations (e.g., by MPI) –sometimes, it may be better to write your own procedure overlapping interactions with other interactions –example: one-to-all broadcast

74 Parallel Algorithm Models data-parallel model –data parallelism: identical operations applied concurrently on different data items task graph model –task parallelism: independent tasks in a TDG –quicksort, sparse matrix factorization work-pool or task-pool model –dynamic mapping of tasks onto processes –mapping may be centralized or distributed

75 Parallel Algorithm Models master-slave or manager-worker model –master process generates work & allocates to worker processes pipeline or producer-consumer model –stream parallelism: execution of different programs on a data stream –each process in the pipeline: consumer of the data items produced by the preceding process producer of data for the process following it in the pipeline –pipeline may not be a linear chain (it can be a DAG) hybrid models –multiple models applied hierarchically –multiple models applied sequentially to different stages
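The pipeline model above can be sketched with Python generators. This is an illustrative three-stage chain (the stages and data are invented): each stage consumes the stream produced by the preceding stage and produces a stream for the next, so items flow through the pipeline one at a time.

```python
def produce(n):                    # stage 1: producer of the data stream
    yield from range(n)

def square(stream):                # stage 2: consumer of stage 1, producer for stage 3
    for x in stream:
        yield x * x

def scale(stream, factor):         # stage 3: final consumer
    for x in stream:
        yield x * factor

result = list(scale(square(produce(5)), 10))
print(result)  # [0, 10, 40, 90, 160]
```

In a real pipeline the stages would be separate processes; generators capture the same producer-consumer dataflow in one process.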

