CS 284a Lecture, Tuesday, 4 November 1997. Copyright (c) 1997-98, John Thornley.

1 CS 284a Lecture, Tuesday, 4 November 1997

2 PSRS Summary
Step 1: O(k) – Divide data into segments.
Step 2: O(n log2(n/k)) – Sort data segments.
Step 3: O(2k^2) – Sample sorted data segments.
Step 4: O(2k^2 log2(2k^2)) – Sort data sample.
Step 5: O(k) – Choose pivots from sorted data sample.
Step 6: O(k^2 log2(n/k)) – Partition sorted data segments.
Step 7: O(k^2) – Compute result partition sizes.
Step 8: O(n log2(k)) – Merge data into result partitions.
Notes:
–Almost all the time is spent in steps 2 and 8.
–Steps 3-5 increase the likelihood of a good pivot choice.
–An O(nk) version of step 8 would be simpler.
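Steps 3-5 (sample, sort the sample, choose pivots) can be sketched in C. This is only an illustration: it assumes the 2k^2 sample is taken as 2k regularly spaced elements from each of the k sorted segments, that n is divisible by k, and the function and variable names are invented for this sketch.

```c
#include <stdlib.h>

/* Comparison function for qsort. */
static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Steps 3-5 of PSRS: sample each of the k sorted segments at 2k
   regular offsets (2k^2 samples in total), sort the sample, and take
   k-1 pivots at regular positions in the sorted sample.
   data:   n elements in k sorted segments of n/k each (n % k == 0).
   pivots: output array of k-1 chosen pivots. */
void choose_pivots(const int *data, int n, int k, int *pivots) {
    int seg = n / k, per = 2 * k, total = per * k;
    int *sample = malloc(total * sizeof *sample);
    for (int s = 0; s < k; s++)
        for (int j = 0; j < per; j++)
            sample[s * per + j] = data[s * seg + (j * seg) / per];
    qsort(sample, total, sizeof *sample, cmp_int);
    for (int p = 1; p < k; p++)
        pivots[p - 1] = sample[p * per];   /* every 2k-th sample */
    free(sample);
}
```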

3 The Key Algorithm: Sequential K-Way Merge (Step 8)
[Diagram: k sorted data segments are combined by repeated sequential k-way merges into the sorted result partitions.]
Sequential complexity: O(n log2(k)) or O(nk).
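The earlier slide notes that an O(nk) version of step 8 would be simpler than the O(n log2(k)) one. A minimal sketch of that simple variant: each output step linearly scans the k segment heads for the smallest element. The function name and the segment-offset convention (seg[0..k] with seg[k] == n) are assumptions of this sketch.

```c
#include <stdlib.h>

/* Sequential k-way merge, O(nk) variant: for each of the n output
   positions, scan the k segment heads and take the minimum.
   data: n elements in k sorted segments.
   seg:  k+1 start offsets into data; seg[k] == n.
   out:  receives the n merged elements. */
void kway_merge(const int *data, const int *seg, int k, int *out) {
    int n = seg[k];
    int *pos = malloc(k * sizeof *pos);   /* current head of each segment */
    for (int s = 0; s < k; s++)
        pos[s] = seg[s];
    for (int i = 0; i < n; i++) {
        int best = -1;
        for (int s = 0; s < k; s++)       /* linear scan over the k heads */
            if (pos[s] < seg[s + 1] &&
                (best < 0 || data[pos[s]] < data[pos[best]]))
                best = s;
        out[i] = data[pos[best]++];
    }
    free(pos);
}
```

Replacing the linear scan with a min-heap over the k heads gives the O(n log2(k)) version at the cost of extra code.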

4 PSRS: Multithreaded Performance Issues
Load balance:
–How evenly sized will the partitions be?
–What if the data is not uniformly distributed?
–What if there are lots of duplicates in the data?
–Can we solve load balancing by having k > t?
Algorithm overhead:
–How does sequential performance compare with quicksort?
–How does sequential performance depend on k?
Multithreading:
–What is the cost of thread creation? Should we use barriers?
–What are the cache/memory access issues?

5 Multithreading Performance Issues
–Thread maintenance overheads.
–Load balancing.
–Granularity.
–Memory contention.
–Underlying algorithm.

6 Multithreading Performance Issues: Thread Maintenance Overheads
–Thread creation and termination costs.
–Thread scheduling costs.
–Thread synchronization costs.

7 Thread Creation and Termination Costs
#pragma multithreadable mapping(blocked(4))
for (i = 0; i < N; i++)
    f(i);
[Diagram: ideal vs. real execution timelines – sequential startup/shutdown overhead plus per-thread creation/termination overhead.]
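The pragma above belongs to the course's own threading system. As a sketch of what a blocked(4) mapping amounts to, here is a hypothetical POSIX-threads rendering: each of 4 threads runs one contiguous block of N/4 iterations, and the pthread_create/pthread_join calls are exactly the creation/termination overhead the diagram refers to. All names (blocked_for, worker) and the stand-in loop body (summing the index in place of f(i)) are illustrative.

```c
#include <pthread.h>

#define N 1000
#define T 4

static long partial[T];            /* one result slot per thread */

struct range { int lo, hi, id; };

static void *worker(void *arg) {
    struct range *r = arg;
    long sum = 0;
    for (int i = r->lo; i < r->hi; i++)
        sum += i;                  /* stands in for f(i) */
    partial[r->id] = sum;
    return NULL;
}

/* Blocked mapping of N iterations onto T threads: thread t handles
   iterations [t*(N/T), (t+1)*(N/T)).  Creation and join are the
   startup/shutdown overhead shown in the diagram. */
long blocked_for(void) {
    pthread_t tid[T];
    struct range r[T];
    for (int t = 0; t < T; t++) {
        r[t] = (struct range){ t * (N / T), (t + 1) * (N / T), t };
        pthread_create(&tid[t], NULL, worker, &r[t]);
    }
    long total = 0;
    for (int t = 0; t < T; t++) {
        pthread_join(tid[t], NULL);
        total += partial[t];
    }
    return total;
}
```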

8 Thread Scheduling Model
[Diagram: a pool of runnable threads, a pool of suspended threads, and a pool of processors; the scheduler maps runnable threads onto processors.]

9 Thread Scheduling Costs
Threads can be either runnable or suspended.
Scheduling policy:
–Runnable threads replace suspended threads.
–High-priority runnable threads replace low-priority runnable threads.
–(Preemption) Idle runnable threads replace running runnable threads.
A thread switch takes time. Cache reloading takes time.

10 Thread Synchronization Costs
[Diagram: ideal barrier vs. real barrier – arrival overhead before the synchronization point, departure overhead after it.]

11 Multithreading Performance Issues: Load Balancing
Load balancing = keeping all processors busy.
A large number of small, equal-sized threads gives better load balance.
[Diagram: processor/time charts comparing 9 threads against 18 threads on the same processors.]

12 Multithreading Performance Issues: Granularity
Granularity = a measure of the amount of computation between threading and synchronization operations.
Fine-grained = little computation between operations.
Coarse-grained = lots of computation between operations.
Balance required:
–Too fine-grained = too much threading overhead.
–Too coarse-grained = poor load balancing.

13 Multithreading Performance Issues: Memory Contention
Cache misses are very expensive and can cause memory contention.
Rewrite the program to increase memory access locality.
Optimizing sequential cache behavior will minimize multithreaded memory contention.
This is a big advantage of our "multithreaded like sequential" development methodology.
A trap to watch out for: "false sharing".
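False sharing arises when threads write to distinct variables that happen to share a cache line, so each write invalidates the line in the other processors' caches. The standard fix is to pad per-thread data out to a full cache line. A minimal sketch, assuming 64-byte cache lines and using POSIX threads (all names are illustrative); the assertion checks correctness, while the performance difference only shows up under a profiler.

```c
#include <pthread.h>

#define NTH   4
#define ITERS 100000

/* Without the pad, the four counters would sit in one or two cache
   lines and the threads' writes would invalidate each other's lines
   (false sharing).  Padding to a 64-byte line (size assumed here)
   keeps each counter on its own line. */
struct padded { long count; char pad[64 - sizeof(long)]; };

static struct padded counter[NTH];

static void *bump(void *arg) {
    struct padded *c = arg;
    for (int i = 0; i < ITERS; i++)
        c->count++;                /* each thread touches only its own line */
    return NULL;
}

/* Runs NTH threads, each incrementing its own padded counter, and
   returns the grand total. */
long run_counters(void) {
    pthread_t tid[NTH];
    long total = 0;
    for (int t = 0; t < NTH; t++)
        pthread_create(&tid[t], NULL, bump, &counter[t]);
    for (int t = 0; t < NTH; t++) {
        pthread_join(tid[t], NULL);
        total += counter[t].count;
    }
    return total;
}
```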

14 Multithreading Performance Issues: Underlying Algorithm
The sequential component limits speedup (Amdahl's law).
A multithreaded algorithm may be less efficient than the best sequential algorithm.
Partitioning the problem may increase total workload (e.g., the PSRS algorithm).
Partitioning the problem may decrease total workload (e.g., route optimization).
Partitioning the problem will change cache behavior – often for the better.
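Amdahl's law makes the first point quantitative: if a fraction s of the work is inherently sequential, the speedup on p processors is 1 / (s + (1 - s)/p), which is capped at 1/s no matter how many processors are added. A one-line sketch (the function name is illustrative):

```c
/* Amdahl's law: speedup on p processors when a fraction `serial`
   (0.0 to 1.0) of the work cannot be parallelized.  As p grows the
   result approaches 1/serial, the hard cap on speedup. */
double amdahl(double serial, int p) {
    return 1.0 / (serial + (1.0 - serial) / p);
}
```

For example, with even a 10% sequential component, no number of processors can deliver more than a 10x speedup.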

15 Multithreading Performance Issues: Summary
No magic answer to obtaining performance.
Granularity is very important:
–Choose enough threads to provide good load balance.
–But make sure threads are not too fine-grained.
Memory access patterns are very important:
–Optimize cache behavior in the sequential interpretation.
–Watch out for false sharing.
The underlying algorithm is very important:
–There is no point speeding up a slow algorithm.
–Consider the effect of multithreading on total workload.
Fortunately, Windows NT is very flexible.

