Adaptive Latency-Aware Parallel Resource Mapping: Task Graph Scheduling → Heterogeneous Network Topology Liwen Shih, Ph.D. Computer Engineering U of Houston – Clear Lake


1 Adaptive Latency-Aware Parallel Resource Mapping: Task Graph Scheduling → Heterogeneous Network Topology Liwen Shih, Ph.D. Computer Engineering U of Houston – Clear Lake

2 ADAPTIVE PARALLEL TASK TO NETWORK TOPOLOGY MAPPING Latency-adaptive: –Topology –Traffic –Bandwidth –Workload –System hierarchy Thread partition: –Coarse –Medium –Fine

3 Fine-Grained Mapping System [Shih 1988] Parallel Mapping –Compile-time vs. run-time Task migration –Vertical vs. Horizontal Domain decomposition –Data vs. Function Execution order –Eager data-driven vs. Lazy demand-driven

4 PRIORITIZE TASK DFG NODES Task priority factors: 1. Level depth 2. Critical paths 3. In/out degree Data flow partial order: {(n7 → n5), (n7 → n4), (n6 → n4), (n6 → n3), (n5 → n1), (n4 → n2), (n3 → n2), (n2 → n1)} → total task priority order: {n1 > n2 > n4 > n3 > n5 > n6 > n7} → P2 thread: {n1 > n2 > n4 > n3 > n6}; P3 thread: {n5 > n7}
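The stated orders can be checked mechanically: in a demand-driven priority order, every consumer must outrank its producers, i.e. the total order is a linear extension of the reversed data-flow partial order. A minimal sketch using the slide's node names:

```python
# Data-flow partial order from the slide (producer -> consumer).
edges = [("n7", "n5"), ("n7", "n4"), ("n6", "n4"), ("n6", "n3"),
         ("n5", "n1"), ("n4", "n2"), ("n3", "n2"), ("n2", "n1")]

# Total task priority order from the slide, highest priority first.
priority = ["n1", "n2", "n4", "n3", "n5", "n6", "n7"]
rank = {n: i for i, n in enumerate(priority)}

# Demand-driven: every consumer is scheduled before its producers.
assert all(rank[c] < rank[p] for p, c in edges)
print("valid linear extension")  # prints "valid linear extension"
```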

5 SHORTEST-PATH NETWORK ROUTING Shortest latency and routes are updated after each task-processor allocation.
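Such latency/routing tables can be maintained with single-source shortest paths over the channel graph. A minimal sketch (the 4-processor ring, link latencies, and function name below are illustrative, not from the talk); in the scheduler this table pair would be recomputed incrementally after each task-processor allocation:

```python
import heapq

def shortest_latency(adj, src):
    """Single-source shortest latencies plus first-hop routing out of src."""
    dist = {src: 0.0}
    nxt = {}                        # destination -> first hop out of src
    pq = [(0.0, src, None)]         # (latency, node, first hop)
    while pq:
        d, u, hop = heapq.heappop(pq)
        if d > dist[u]:
            continue                # stale queue entry
        if hop is not None:
            nxt[u] = hop
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v, v if hop is None else hop))
    return dist, nxt

# Illustrative 4-processor ring; channel 2-3 is a slow 3.0-latency link.
ring = {0: [(1, 1.0), (3, 1.0)], 1: [(0, 1.0), (2, 1.0)],
        2: [(1, 1.0), (3, 3.0)], 3: [(2, 3.0), (0, 1.0)]}
dist, nxt = shortest_latency(ring, 0)
print(dist)   # {0: 0.0, 1: 1.0, 3: 1.0, 2: 2.0}
print(nxt)    # routes to 2 go via first hop 1, avoiding the slow link
```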

6 Adaptive A* Parallel Processor Scheduler Given a directed, acyclic task DFG G(V, E) with task vertex set V connected by data-flow edge set E, and a processor network topology N(P, C) with processor node set P connected by channel link set C, find a processor assignment and schedule S: V(G) → P(N) such that S minimizes the total parallel computation time of G. A* heuristic mapping reduces scheduling complexity from NP to P.
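In symbols (with $T_{\mathrm{par}}$ as an assumed shorthand for the total parallel computation time, which the slide states in words):

```latex
S^{*} : V(G) \to P(N), \qquad
S^{*} \;=\; \operatorname*{arg\,min}_{S}\; T_{\mathrm{par}}(G, N, S)
```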

7 Demand-Driven Task-Topology Mapping STEP 1 – assign a level to each task node vertex in G. STEP 2 – count critical paths passing through each DFG edge and node with a 2-pass bottom-up and then top-down graph traversal. STEP 3 – initially load and prioritize all deepest-level task nodes that produce outputs into the working task node list. STEP 4 – WHILE the working task node list is not empty, schedule the best processor for the top-priority task, then replace it with its parent task nodes inserted into the working task node priority list.
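STEPS 1-2 can be sketched as memoized passes over the slide-4 example graph: a bottom-up pass counts paths from a node down to the outputs, a top-down pass counts paths from the inputs down to it, and their product is the number of end-to-end paths through the node. The level convention (0 at the outputs) is an assumption, not stated on the slide:

```python
from functools import lru_cache

# Edges (producer -> consumer) of the slide-4 example graph.
edges = [("n7", "n5"), ("n7", "n4"), ("n6", "n4"), ("n6", "n3"),
         ("n5", "n1"), ("n4", "n2"), ("n3", "n2"), ("n2", "n1")]
nodes = {n for e in edges for n in e}
succ = {n: [] for n in nodes}          # consumers of each node
pred = {n: [] for n in nodes}          # producers feeding each node
for a, b in edges:
    succ[a].append(b)
    pred[b].append(a)

# STEP 1: level = longest distance from a node down to an output.
@lru_cache(maxsize=None)
def level(n):
    return 0 if not succ[n] else 1 + max(level(s) for s in succ[n])

# STEP 2, bottom-up pass: number of paths from n to the outputs.
@lru_cache(maxsize=None)
def paths_down(n):
    return 1 if not succ[n] else sum(paths_down(s) for s in succ[n])

# STEP 2, top-down pass: number of paths from the inputs to n.
@lru_cache(maxsize=None)
def paths_up(n):
    return 1 if not pred[n] else sum(paths_up(p) for p in pred[n])

# End-to-end path count through each node.
through = {n: paths_up(n) * paths_down(n) for n in sorted(nodes)}
print(through)
# {'n1': 4, 'n2': 3, 'n3': 1, 'n4': 2, 'n5': 1, 'n6': 2, 'n7': 2}
```

Note that the output node n1 carries all 4 end-to-end paths, and n2 carries 3 of them, consistent with those nodes leading the slide-4 priority order.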

8 Demand-Driven Processor Scheduling STEP 4 – WHILE working task node list is not empty: BEGIN – STEP 4.1 – initialize if first time, otherwise update the inter-processor shortest-path latency/routing table pair affected by the last task-processor allocation. – STEP 4.2 – assign a nearby capable processor to minimize thread computation time for the highest-priority task node at the top of the remaining prioritized working list. – STEP 4.3 – remove the newly scheduled task node and replace it with its parent nodes, inserted/appended onto the working list (demand-driven) by priority, based on tie-breaker rules which, along with node level depth, estimate the time cost of the entire computation thread involved. END{WHILE}
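The loop can be sketched greedily. This is a simplified reading, not the talk's exact rules: uniform unit compute cost, a static latency table standing in for STEP 4.1's incremental updates, and no tie-breakers beyond priority rank:

```python
import heapq

def demand_driven_schedule(pred, succ, rank, latency, n_proc, compute=1.0):
    """Greedy sketch of the STEP 4 loop: returns task -> (processor, finish)."""
    placed = {}
    free_at = [0.0] * n_proc                  # earliest idle time per processor
    work = [(rank[t], t) for t in succ if not succ[t]]   # output tasks first
    heapq.heapify(work)
    while work:
        _, task = heapq.heappop(work)
        if task in placed:
            continue                          # duplicate queue entry
        # STEP 4.2: minimize estimated finish time, charging shortest-path
        # latency to reach already-placed consumers of this task's output.
        def finish(p):
            comm = max((latency[p][placed[c][0]]
                        for c in succ[task] if c in placed), default=0.0)
            return max(free_at[p], comm) + compute
        best = min(range(n_proc), key=finish)
        free_at[best] = finish(best)
        placed[task] = (best, free_at[best])
        for parent in pred[task]:             # STEP 4.3: pull in producers
            heapq.heappush(work, (rank[parent], parent))
    return placed

# Slide-4 graph and priority ranks; illustrative 3-processor linear array.
edges = [("n7", "n5"), ("n7", "n4"), ("n6", "n4"), ("n6", "n3"),
         ("n5", "n1"), ("n4", "n2"), ("n3", "n2"), ("n2", "n1")]
succ = {n: [] for e in edges for n in e}
pred = {n: [] for e in edges for n in e}
for a, b in edges:
    succ[a].append(b)
    pred[b].append(a)
rank = {n: i for i, n in enumerate(["n1", "n2", "n4", "n3", "n5", "n6", "n7"])}
lat = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
placed = demand_driven_schedule(pred, succ, rank, lat, 3)
print(sorted(placed))   # ['n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7']
```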

9 QUANTIFY SW/HW MAPPING QUALITY Example 1 – Latency-Adaptive Tree-Task to Tree-Machine Mapping Example 2 – Scaling to Larger Tree-to-Tree Mapping Example 3 – Select the Best Processor Topology Match for an Irregular Task Graph

10 Example 1 – Latency-Adaptive Tree-Task to Tree-Machine Mapping K-th Largest Selection Will the tree algorithm [3] match the tree machine [4]?
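The cited tree algorithm [3] is not reproduced here; as a hedged illustration of why a tree machine is a plausible match, a pairwise tournament reduction has a balanced-binary-tree computation DAG (shown for the largest element only; the k-th-largest extension via replayed comparisons is omitted):

```python
def tournament_rounds(values, op=max):
    """Pairwise reduction: each round halves the list; the DAG is a tree."""
    rounds = [list(values)]
    while len(rounds[-1]) > 1:
        prev = rounds[-1]
        rounds.append([op(prev[i:i + 2]) for i in range(0, len(prev), 2)])
    return rounds

rounds = tournament_rounds([3, 7, 2, 9, 5, 1, 8, 6])
print(rounds[-1])   # [9] after log2(8) = 3 rounds
```

Each round's comparisons are independent, so on a tree machine every tree level runs in parallel, which is the task-shape/machine-shape match the slide asks about.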

11 Example 1 – Latency-Adaptive Tree-Task to Tree-Machine Mapping Adaptive mapping moves toward sequential processing when the inter/intra communication latency ratio increases.

12 Example 1 – Latency-Adaptive Tree-Task to Tree-Machine Mapping Adaptive Mapper allocates fewer processors and channels with fewer hops.

13 Example 1 – Latency-Adaptive Tree-Task to Tree-Machine Mapping Adaptive Mapper achieves higher speedups consistently. (Bonus! Pipeline processing speedup can be extrapolated when the inter/intra communication latency ratio < 1.)

14 Example 1 – Latency-Adaptive Tree-Task to Tree-Machine Mapping Adaptive Mapper results in better efficiencies consistently. (Bonus! Pipeline processing efficiency (%) can be extrapolated when the inter/intra communication latency ratio < 1.)

15 Example 2 – Scaling to Larger Tree-to-Tree Mapping Adaptive Mapper achieves sub-optimal speedups as tree sizes scale larger, still trailing the fixed tree-to-tree mapping closely.

16 Example 2 – Scaling to Larger Tree-to-Tree Mapping Adaptive Mapper is always more cost-efficient, using fewer resources, with comparable sub-optimal speedups to the fixed tree-to-tree mapping as tree sizes scale.

17 Example 3 – Select the Best Processor Topology Match for an Irregular Task Graph Lack of matching topology clues for the irregular-shaped Robot Elbow Manipulator [5]: 105 task nodes, 161 data-flow edges, 29 node levels

18 Example 3 – Select the Best Processor Topology Match for an Irregular Task Graph Candidate topologies Compare schedules for each topology Farther processors may not be selected –Linear Array –Tree

19 Example 3 – Select the Best Processor Topology Match for an Irregular Task Graph Best network topology performers (# channels) Complete (28) Mesh (12) Chordal ring (16) Systolic array (16) Cube (12)

20 Example 3 – Select the Best Processor Topology Match for an Irregular Task Graph Fewer processors selected for higher diameter networks Tree Linear Array

21 Example 3 – Select the Best Processor Topology Match for an Irregular Task Graph Deducing network switch hops: low multi-hop data exchanges (< 10%), moderate 0-hop (30% to 50%), high near-neighbor direct 1-hop (50% to 70%)

22 Future Speed/Memory/Power Optimization Latency-adaptive –Topology –Traffic –Bandwidth –Workload –System hierarchy Thread partition –Coarse –Mid –Fine Latency/Routing tables –Neighborhood –Network hierarchy –Worm-hole –Dynamic mobile network routing –Bandwidth –Heterogeneous system Algorithm-specific network topology

23 References

24 Liwen Shih, Ph.D. Professor in Computer Engineering University of Houston – Clear Lake Q & A?

25 xScale13 paper

26

27 Thank You!

