M4 and Parallel Programming


1 M4 and Parallel Programming

2 Milestone 4: Travelling Courier Problem
Given:
- Depots
- Pick-ups
- Drop-offs
- Truck capacity

Find:
- Fastest delivery route

How?
- Estimate travel time between all locations
- Find a good delivery ordering
- Return the final tour in < 45 seconds (see the deadline sketch below)
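One way to respect the 45-second budget is to check a wall-clock deadline while improving the ordering. A minimal sketch, assuming a hypothetical tryImproveOrdering helper and a 44-second cutoff (both are illustrative, not part of the milestone API):

    #include <chrono>
    #include <vector>

    // Hypothetical improvement step: returns true if it improved the
    // tour. Stubbed here so the sketch compiles and terminates.
    bool tryImproveOrdering(std::vector<int>& tour) { (void)tour; return false; }

    // Keep improving the delivery ordering until the time budget is
    // nearly spent, leaving a safety margin before the 45 s limit.
    void optimizeUntilDeadline(std::vector<int>& tour) {
        using Clock = std::chrono::steady_clock;
        const auto deadline = Clock::now() + std::chrono::seconds(44);
        while (Clock::now() < deadline && tryImproveOrdering(tour)) {
            // All work happens inside tryImproveOrdering.
        }
    }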

3 Estimating Travel Times
Travel times between the k locations of interest form a k-by-k matrix, with rows indexed by the "from" intersection and columns by the "to" intersection:

$$\begin{pmatrix} t_{11} & \cdots & t_{1k} \\ \vdots & \ddots & \vdots \\ t_{k1} & \cdots & t_{kk} \end{pmatrix}$$

Option 1: Estimate by distance (sketched below)
- Fast, but inaccurate

Option 2: Exact path
- Accurate, but slow
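A minimal sketch of Option 1, assuming projected (x, y) coordinates in metres and a fixed average speed; both assumptions are illustrative, and real code would use the map's own distance helper:

    #include <cmath>

    struct Point { double x, y; };  // projected coordinates, in metres

    // Straight-line distance between two points.
    double distanceBetween(Point a, Point b) {
        return std::hypot(a.x - b.x, a.y - b.y);
    }

    // Option 1: estimate travel time as straight-line distance divided
    // by an assumed average speed. Fast, but it ignores the road
    // network, so it can be far from the true driving time.
    double estimateTravelTime(Point from, Point to) {
        const double kAvgSpeedMps = 50.0 * 1000.0 / 3600.0;  // assume ~50 km/h
        return distanceBetween(from, to) / kAvgSpeedMps;
    }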

4 Computing Travel Time Matrix
k = depot/pick-up/drop-off intersections
n = total intersections

for (src : intersections) {
    for (dst : intersections) {
        paths[src][dst] = find_path_astar(src, dst);
    }
}

Complexity: $O(k^2 n \log n)$

On large test cases (over 500 intersections of interest), this may limit the time left for ordering optimization.
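A compilable sketch of the full matrix computation, following the slide's find_path_astar name; its signature (returning a travel time in seconds) and the stub body are assumptions for illustration:

    #include <vector>

    // Assumed signature: travel time (s) of the best path from src to
    // dst. Stubbed so the sketch compiles; real code runs A* here.
    double find_path_astar(int src, int dst) { (void)src; (void)dst; return 0.0; }

    // One A* search per (src, dst) pair over the k interesting
    // intersections. Each search costs O(n log n) over all n map
    // intersections, so the whole matrix is O(k^2 n log n).
    std::vector<std::vector<double>>
    computeTravelTimeMatrix(const std::vector<int>& interesting) {
        const std::size_t k = interesting.size();
        std::vector<std::vector<double>> times(k, std::vector<double>(k, 0.0));
        for (std::size_t i = 0; i < k; ++i)
            for (std::size_t j = 0; j < k; ++j)
                if (i != j)
                    times[i][j] = find_path_astar(interesting[i], interesting[j]);
        return times;
    }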

5 How to speed up finding many paths?
Algorithm? What if we want even more speed?

k = depot/pick-up/drop-off intersections
n = total intersections

for (src : intersections) {
    paths[src] = find_path_dijkstra_multitarget(src, intersections);
}

Complexity: $O(k n \log n)$
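A compilable sketch of one multi-target Dijkstra expansion; the adjacency-list graph representation is an assumption, and the slide's find_path_dijkstra_multitarget would wrap something like this:

    #include <limits>
    #include <queue>
    #include <vector>

    struct Edge { int to; double time; };
    using Graph = std::vector<std::vector<Edge>>;  // one entry per intersection

    // One Dijkstra expansion from src reaches every intersection, so a
    // single O(n log n) pass yields travel times to all k targets at
    // once, instead of k separate searches from the same source.
    std::vector<double> dijkstraAllTargets(const Graph& g, int src) {
        const double kInf = std::numeric_limits<double>::infinity();
        std::vector<double> dist(g.size(), kInf);
        using Entry = std::pair<double, int>;  // (travel time, intersection)
        std::priority_queue<Entry, std::vector<Entry>, std::greater<>> pq;
        dist[src] = 0.0;
        pq.push({0.0, src});
        while (!pq.empty()) {
            auto [d, u] = pq.top();
            pq.pop();
            if (d > dist[u]) continue;  // stale queue entry, already settled
            for (const Edge& e : g[u]) {
                if (d + e.time < dist[e.to]) {
                    dist[e.to] = d + e.time;
                    pq.push({dist[e.to], e.to});
                }
            }
        }
        return dist;  // read off dist[t] for each target t of interest
    }

In practice the search can also stop early once all k targets of interest have been settled, rather than expanding the whole map.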

6 Parallel Programming

7 A Multi-Core Processor
[Diagram: four cores (Core 0 through Core 3) connected to a shared memory]

8 A Multi-Core Processor
- Each core can execute its own instructions
- All cores share a common memory space

Want to learn more? ECE552 – Computer Architecture, or graduate school.

[Diagram: four cores connected to a shared memory, as on the previous slide]

9 Applications with a Single Thread
- Most applications use one thread
- You have been taught to write single-threaded applications
- Single-threaded applications do not need to worry about sharing memory

[Diagram: four cores connected to a shared memory]

10 Applications with Multiple Threads
[Diagram: four cores connected to a shared memory]

- Multi-threaded applications do need to worry about sharing memory
- Multiple threads could modify the same variable (see the example below)
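A minimal example of the problem (deliberately buggy; the counter and loop count are illustrative): two threads perform unsynchronized read-modify-write updates to the same variable, so increments can be lost:

    #include <iostream>
    #include <thread>

    int counter = 0;  // shared by both threads, with no synchronization

    int main() {
        auto work = [] {
            for (int i = 0; i < 100000; ++i)
                ++counter;  // read-modify-write: a data race across threads
        };
        std::thread t1(work), t2(work);
        t1.join();
        t2.join();
        // Expected 200000, but often less: concurrent increments can
        // overwrite each other's writes.
        std::cout << counter << "\n";
    }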

11 Simple Parallelization
#pragma omp parallel for
for (i = 0; i < 100; i++) {
    c[i] = a[i] + b[i];
}

OpenMP splits the iterations across the threads:

Thread 1: for (i = 0; i < 25; i++) { c[i] = a[i] + b[i]; }
Thread 2: for (i = 25; i < 50; i++) { c[i] = a[i] + b[i]; }
Thread 3: for (i = 50; i < 75; i++) { c[i] = a[i] + b[i]; }
Thread 4: for (i = 75; i < 100; i++) { c[i] = a[i] + b[i]; }

// Following code runs only after all threads have finished
std::cout << "Done\n";
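The same loop as a complete program (a sketch; compile with OpenMP enabled, e.g. g++ -fopenmp):

    #include <iostream>
    #include <vector>

    int main() {
        std::vector<int> a(100, 1), b(100, 2), c(100, 0);

        // OpenMP splits the 100 iterations across the available
        // threads. Safe: each iteration writes a distinct c[i].
        #pragma omp parallel for
        for (int i = 0; i < 100; i++) {
            c[i] = a[i] + b[i];
        }

        // Runs only after every thread has finished: the parallel for
        // has an implicit barrier at its end.
        std::cout << "Done\n";
    }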

12 Simple Parallelization Demo

13 Path Finding Parallelization Demo

14 Performance Optimization Results (k=20)
Approach                          Time (s)    Speed-up vs. baseline
Iterative A*                      7.93        1.0x
Multitarget Dijkstra              1.71        4.6x
Parallel Iterative A*             1.93        4.1x
Parallel Multitarget Dijkstra     0.37        21.4x

15 M4 Traversal Queue Privatization Example
Original (modifies a shared global):

priority_queue traversal_q;   // shared global

for (src : intersections) {
    paths[src] = find_paths(src, intersections);
}

find_paths(src, intersections) {
    ...
    while (...) {
        traversal_q.push(...);   // modifies the shared global
    }
}

Parallel (privatized):

#pragma omp parallel for
for (src : intersections) {
    paths[src] = find_paths(src, intersections);
}

find_paths(src, intersections) {
    priority_queue traversal_q;   // privatized: one copy per call
    ...
    while (...) {
        traversal_q.push(...);   // modifies the private copy
    }
}
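A compilable sketch of the pattern, with the search body stubbed out; the point is only where traversal_q lives (the names and types here are illustrative):

    #include <queue>
    #include <vector>

    using PathTimes = std::vector<double>;  // stub result type

    // Privatized version: traversal_q is a local variable, so every
    // call (and therefore every thread running an iteration) gets its
    // own queue. A shared global queue would race as soon as two
    // threads pushed or popped concurrently.
    PathTimes find_paths(int src, int n) {
        std::priority_queue<std::pair<double, int>,
                            std::vector<std::pair<double, int>>,
                            std::greater<>> traversal_q;  // private copy
        PathTimes times(n, 0.0);
        traversal_q.push({0.0, src});
        while (!traversal_q.empty()) {
            traversal_q.pop();  // stubbed: real code expands neighbours here
        }
        return times;
    }

    int main() {
        const int n = 1000;  // illustrative number of intersections
        std::vector<PathTimes> paths(n);
        #pragma omp parallel for
        for (int src = 0; src < n; ++src)
            paths[src] = find_paths(src, n);  // each iteration writes its own slot
    }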

16 Limits to Parallelization: Amdahl’s Law
- Suppose we have n cores
- f is the fraction of the work that is parallelizable
- The maximum parallel speed-up is limited by the serial work, (1 − f):

$$\text{Speedup} = \frac{1}{(1 - f) + \frac{f}{n}}$$
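For example, with f = 0.95 of the work parallelizable and n = 4 cores (illustrative numbers):

$$\text{Speedup} = \frac{1}{(1 - 0.95) + \frac{0.95}{4}} = \frac{1}{0.05 + 0.2375} \approx 3.5$$

Even with unlimited cores, the speed-up can never exceed $1/(1-f) = 20\times$.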

17 Parallel Programming Takeaways
- Improve your algorithm first! Usually a bigger reward
- Add OpenMP pragmas
- Avoid modifying shared state: minimizes bugs
  - Privatize variables so each thread has its own copy
- Parallelize large chunks of work: minimizes run-time overhead
- It is hard: even experts get it wrong
  - What/how to parallelize are open research questions

#pragma omp parallel for
for (i = 0; i < N; ++i) {
    do_work();
}

18 M4 Suggestions
Get something simple working first:
- Simple travel time estimate
- Simple ordering heuristic
- Return a legal solution

Then:
- Try using better cost estimates
- Pre-compute exact paths
- Parallelize only as a last resort

