
1 Auburn University http://www.eng.auburn.edu/~xqin
COMP8330/7330/7336 Advanced Parallel and Distributed Computing: Communication Costs (cont.) Dr. Xiao Qin, Auburn University

2 Recap: Routing Techniques
The total time to transfer a message over a network has three components:
Startup time (ts): time spent at the sending and receiving nodes (executing the routing algorithm, programming routers).
Per-hop time (th): a function of the number of hops; includes switch latencies and network delays.
Per-word transfer time (tw): per-word overheads determined by the length of the message.
Figure: passing a message from node P0 to P3 (a) through a store-and-forward communication network; (b) and (c) extending the concept to cut-through routing. The shaded regions represent the time the message is in transit; the startup time for this transfer is assumed to be zero. Store-and-forward makes poor use of communication resources.

3 Recap: Packet Routing
Packet routing breaks messages into packets and pipelines them through the network. Since packets may take different paths, each packet must carry routing information, error checking, sequencing, and other header information.
For a message of m words sent over l hops, the total communication time for packet routing is approximated by t_comm = ts + l*th + m*tw.
Compare with store-and-forward routing, where each intermediate node receives the entire message before forwarding it: t_comm = ts + (m*tw + th)*l, which for long messages is dominated by ts + m*tw*l.

4 Cut-Through Routing
Cut-through routing takes packet routing to an extreme, further dividing messages into basic units called flits and forcing all flits to take the same path. Since flits are typically small, the header information must be minimized: a tracer message first programs all intermediate routers, and all flits then follow the same route. Error checks are performed on the entire message rather than on individual flits, and no sequence numbers are needed.
Q1: Why the same path? Any benefit?

5 Cut-Through Routing
The total communication time for cut-through routing is approximated by t_comm = ts + l*th + m*tw. This has the same form as the packet-routing expression, but tw is typically much smaller.
Communication cost for shared-address-space machines: while the basic messaging cost applies to these machines as well, a number of other factors make accurate cost modeling more difficult.
Memory layout is typically determined by the system.
Finite cache sizes can result in cache thrashing.
Overheads associated with invalidate and update operations are difficult to quantify.
Spatial locality is difficult to model.
Prefetching can play a role in reducing the overhead associated with data access.
False sharing and contention are difficult to model.
Q2: What's new?
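A minimal sketch of the two cost models above, written in Python with the symbols used on these slides (ts, th, tw, message length m in words, l hops); the parameter values at the bottom are made up purely for illustration.

```python
def store_and_forward_cost(ts, th, tw, m, l):
    """Store-and-forward: every one of the l hops receives and stores
    the whole m-word message before forwarding it."""
    return ts + (m * tw + th) * l

def cut_through_cost(ts, th, tw, m, l):
    """Cut-through (and packet) routing: the path is set up once, so the
    per-hop and per-word terms add instead of multiplying."""
    return ts + l * th + m * tw

# Illustrative (made-up) parameters: a 1000-word message over 5 hops.
ts, th, tw, m, l = 10.0, 1.0, 0.5, 1000, 5
print("store-and-forward:", store_and_forward_cost(ts, th, tw, m, l))
print("cut-through:", cut_through_cost(ts, th, tw, m, l))
```

For these made-up numbers the store-and-forward cost grows with m*l, while the cut-through cost grows with m + l, which is why cut-through makes better use of the communication links.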

6 Auburn University http://www.eng.auburn.edu/~xqin
COMP8330/7330/7336 Advanced Parallel and Distributed Computing: Decomposition and Parallel Tasks. Dr. Xiao Qin, Auburn University
Overview: Concurrency and Mapping
Mapping Techniques for Load Balancing: Static and Dynamic Mapping
Methods for Minimizing Interaction Overheads: Maximizing Data Locality; Minimizing Contention and Hot-Spots; Overlapping Communication and Computations; Replication vs. Communication; Group Communications vs. Point-to-Point Communication
Parallel Algorithm Design Models: Data-Parallel, Work-Pool, Task Graph, Master-Slave, Pipeline, and Hybrid Models

7 Decomposition
Decomposition breaks a problem into smaller tasks that can be executed in parallel. The first step in developing a parallel algorithm is to decompose the problem into tasks that can be executed concurrently.
A given problem may be decomposed into tasks in many different ways. Tasks may be of the same, different, or even indeterminate sizes.
A decomposition can be illustrated in the form of a directed graph with nodes corresponding to tasks and edges indicating that the result of one task is required for processing the next. Such a graph is called a task dependency graph.
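As a sketch, a task dependency graph can be recorded with a plain dictionary that maps each task to its predecessors; the task names below are hypothetical and not taken from any particular figure.

```python
# Hypothetical decomposition: each task maps to the tasks whose results it needs.
dependencies = {
    "scan_A": [],
    "scan_B": [],
    "combine": ["scan_A", "scan_B"],  # needs both scans to finish first
    "report": ["combine"],
}

def ready_tasks(done):
    """Tasks not yet run whose predecessors have all completed."""
    return [t for t, preds in dependencies.items()
            if t not in done and all(p in done for p in preds)]

print(ready_tasks(set()))                 # ['scan_A', 'scan_B']
print(ready_tasks({"scan_A", "scan_B"}))  # ['combine']
```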

8 Example 1: Multiplying a Dense Matrix with a Vector
Q3: How can this computation be divided into small tasks? Computation of each element of the output vector y is independent of the other elements, so a dense matrix-vector product can be decomposed into n tasks. The figure highlights the portion of the matrix and vector accessed by Task 1.
Q4: Is any data shared among the tasks? Yes, the tasks share the vector b.
Q5: Are there any control dependencies among the tasks? No. No task needs to wait for the (partial) completion of any other, and all tasks are of the same size in terms of number of operations.
Q6: Is this the maximum number of tasks we could decompose this problem into?
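A sketch of this fine-grained decomposition: one task per element of y, each reading one row of A and the shared vector b. The thread pool is only there to make the n independent tasks explicit; the sizes and values are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def task(i, A, b):
    # Task i computes the single output element y[i] = A[i] . b;
    # the only data shared among tasks is the (read-only) vector b.
    return sum(A[i][j] * b[j] for j in range(len(b)))

n = 4                                           # illustrative problem size
A = [[i + j for j in range(n)] for i in range(n)]
b = [1.0] * n

with ThreadPoolExecutor() as pool:              # n independent, equal-sized tasks
    y = list(pool.map(lambda i: task(i, A, b), range(n)))
print(y)
```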

9 Example 2: Database Query Processing
Consider the execution of the query
MODEL = "CIVIC" AND YEAR = 2001 AND (COLOR = "GREEN" OR COLOR = "WHITE")
on the following database:

ID#   Model    Year  Color  Dealer  Price
4523  Civic    2002  Blue   MN      $18,000
3476  Corolla  1999  White  IL      $15,000
7623  Camry    2001  Green  NY      $21,000
9834  Prius    2001  Green  CA      $18,000
6734  Civic    2001  White  OR      $17,000
5342  Altima   2001  Green  FL      $19,000
3845  Maxima   2001  Blue   NY      $22,000
8354  Accord   2000  Green  VT      $18,000
4395  Civic    2001  Red    CA      $17,000
7352  Civic    2002  Red    WA      $18,000

Question: how are you going to decompose this problem?

10 Example 2: Database Query Processing
The execution of the query can be divided into subtasks in various ways. Each task can be thought of as generating an intermediate table of entries that satisfy a particular clause. The figure shows one decomposition of the given query into a number of tasks; edges in this task dependency graph denote that the output of one task is needed to accomplish the next.
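A sketch of this decomposition in Python, using the records from the table on the previous slide (reduced to id, model, year, and color): one task per clause produces an intermediate table of matching IDs, and the combining tasks follow the edges of the dependency graph.

```python
# Records from the table on the previous slide, as (id, model, year, color).
cars = [
    (4523, "Civic", 2002, "Blue"),  (3476, "Corolla", 1999, "White"),
    (7623, "Camry", 2001, "Green"), (9834, "Prius", 2001, "Green"),
    (6734, "Civic", 2001, "White"), (5342, "Altima", 2001, "Green"),
    (3845, "Maxima", 2001, "Blue"), (8354, "Accord", 2000, "Green"),
    (4395, "Civic", 2001, "Red"),   (7352, "Civic", 2002, "Red"),
]

# One independent task per clause, each producing an intermediate table (a set of IDs).
civic = {c[0] for c in cars if c[1] == "Civic"}
year_2001 = {c[0] for c in cars if c[2] == 2001}
white = {c[0] for c in cars if c[3] == "White"}
green = {c[0] for c in cars if c[3] == "Green"}

# Combining tasks, following the edges of the task dependency graph.
white_or_green = white | green
civic_and_2001 = civic & year_2001
result = civic_and_2001 & white_or_green
print(sorted(result))   # 2001 Civics that are white or green
```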

11 Example 2: An Alternate Decomposition
An alternate decomposition of the given problem into subtasks, along with their data dependencies. Note that the same problem can be decomposed into subtasks in other ways as well. Different task decompositions may lead to significant differences with respect to their eventual parallel performance.

12 Task Granularity Q7: What factor affects task granularity?
The number of tasks into which a problem is decomposed determines its granularity; task granularity refers to the size of the tasks. A decomposition into a large number of small tasks is fine-grained; a decomposition into a small number of large tasks is coarse-grained.
A coarse-grained counterpart to the dense matrix-vector product example: each task computes three elements of the result vector (see the sketch below).
Q8: Can you provide a coarse-grained decomposition?
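A sketch of that coarse-grained decomposition, assuming a block size of three rows per task as on the slide; the matrix contents and problem size are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def block_task(rows, A, b):
    # One coarse-grained task: compute y[i] for every row index i in `rows`.
    return [sum(A[i][j] * b[j] for j in range(len(b))) for i in rows]

n, block = 12, 3                            # illustrative sizes: 3 rows per task
A = [[i * j + 1 for j in range(n)] for i in range(n)]
b = [1.0] * n

blocks = [range(i, min(i + block, n)) for i in range(0, n, block)]
with ThreadPoolExecutor() as pool:          # only n/block tasks instead of n
    parts = list(pool.map(lambda r: block_task(r, A, b), blocks))
y = [value for part in parts for value in part]
print(len(blocks), "tasks produced", len(y), "output elements")
```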

13 Degree of Concurrency The number of tasks that can be executed in parallel is the degree of concurrency of a decomposition. Since the number of tasks that can be executed in parallel may change over program execution, the maximum degree of concurrency is the maximum number of such tasks at any point during execution.

14 Q9: What is the maximum degree of concurrency of the database query examples?
4

15 Degree of Concurrency
The average degree of concurrency is the average number of tasks that can be processed in parallel over the execution of the program. As the decomposition becomes finer in granularity, the degree of concurrency typically increases.

16 Critical Path Length A directed path in the task dependency graph represents a sequence of tasks that must be processed one after the other. The longest such path determines the shortest time in which the program can be executed in parallel. The length of the longest path in a task dependency graph is called the critical path length.

17 Q12: What is the maximum degree of concurrency?
Consider the task dependency graphs of the two database query decompositions:
Q10: What are the critical path lengths for the two task dependency graphs?
Q11: How many processors are needed in each case to achieve this minimum parallel execution time?
Q12: What is the maximum degree of concurrency?

18 Q13: What is the average degree of concurrency in each decomposition?
Average degree of concurrency = total amount of work / critical path length.
(a) Critical path length = 27, total amount of work = 63. Average degree of concurrency = 63/27 = 2.33.
(b) Critical path length = 34, total amount of work = 64. Average degree of concurrency = 64/34 = 1.88.
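A sketch of how these two quantities can be computed from a task dependency graph. The task weights below are hypothetical, chosen only so that the totals match decomposition (a) on the slide (total work 63, critical path 27); the original figure may distribute the weights differently.

```python
# Hypothetical weights and dependencies for decomposition (a) of the query.
weight = {"civic": 10, "y2001": 10, "white": 10, "green": 10,
          "and_1": 6, "or_1": 9, "final": 8}
preds = {"civic": [], "y2001": [], "white": [], "green": [],
         "and_1": ["civic", "y2001"], "or_1": ["white", "green"],
         "final": ["and_1", "or_1"]}

def longest_path_to(task):
    """Length of the longest weighted path ending at `task`."""
    return weight[task] + max((longest_path_to(p) for p in preds[task]), default=0)

total_work = sum(weight.values())                              # 63
critical_path = max(longest_path_to(t) for t in weight)        # 27
print(total_work, critical_path, round(total_work / critical_path, 2))  # 2.33
```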

19 Limits on Parallel Performance
It might appear that the parallel time can be made arbitrarily small by making the decomposition ever finer in granularity. In practice, however, there is an inherent bound on how fine the granularity of a computation can be.

20 Q14: What is the upper bound on the number of concurrent tasks?
Multiplying a dense matrix with a vector: there can be no more than n² concurrent tasks, one per multiplication of a matrix element with the corresponding vector element.

21 Q15: The larger the number of concurrent tasks, the better?
Not necessarily. Concurrent tasks may have to exchange data with other tasks, which introduces communication overhead. The tradeoff between the granularity of a decomposition and the associated overheads often determines performance bounds.

22 Summary
Decomposition
Task Granularity
Degree of Concurrency
Critical Path

