
1 Multithreading Why & How

2 Intel 8086
First PC microprocessor (1978)
29,000 transistors
5 MHz
~10 clocks / instruction
~500,000 instructions / s

3 Intel Core i7 (“Haswell”)
1.5 billion transistors
4.0 GHz
~15 clocks / instruction, but ~30 instructions in flight
Average ~2 instructions completed / clock
~8 billion instructions / s
(die photo courtesy techpowerup.com)

4 CPU Scaling: Past & Future
1978 to 2015:
50,000x more transistors
16,000x more instructions / s
The future:
2x transistors every ~3 years, but transistors not much faster → CPU clock speed saturating
~30 instructions in flight; the complexity and power needed to go beyond this climb rapidly → slow growth in instructions / cycle
Impact: CPU speed not growing much
But many processors (cores) now fit on a single chip
Using multiple cores is now important
Multithreading: one program using multiple cores at once

5 A Single-Threaded Program
[Diagram: one CPU / core, holding a program counter and a stack pointer, runs against one memory image: instructions (code), global variables, heap variables (new), and a stack holding local variables.]

6 A Multi-Threaded Program
[Diagram: two cores, each running one thread with its own program counter and stack pointer. The instructions (code), global variables, and heap variables (new) are shared by all threads; each thread gets its own local variables in its own stack: Stack1 for thread 1, Stack2 for thread 2.]

7 Thread Basics
Each thread has its own program counter:
Can be executing a different function
Is (almost always) executing a different instruction from other threads
Each thread has its own stack:
Has its own copy of local variables (all different)
Each thread sees the same global variables
Dynamically allocated memory:
Shared by all threads
Any thread with a pointer to it can access it

8 Implications
Threads can communicate through memory:
Global variables
Dynamically allocated memory
Fast communication!
Must be careful threads don't conflict in reads/writes to the same memory:
What if two threads update the same global variable at the same time? Not clear which update wins! (See the sketch below.)
Can have more threads than CPUs:
Time-share the CPUs
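A minimal sketch (not from the slides) of the update conflict described above: two C++11 threads increment the same global counter, and increments are lost because counter++ is a read-modify-write that is not atomic.

    #include <iostream>
    #include <thread>

    int counter = 0;  // global: shared by all threads

    void add_many() {
        for (int i = 0; i < 100000; i++)
            counter++;  // read-modify-write: not atomic, so updates can be lost
    }

    int main() {
        std::thread t1(add_many);  // both threads update the same global
        std::thread t2(add_many);
        t1.join();
        t2.join();
        // Expected 200000, but typically prints less: some increments "lose".
        std::cout << "counter = " << counter << std::endl;
    }

Compile with -pthread; the result varies from run to run, which is exactly the "not clear which update wins" problem.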

9 Multi-Threading Libraries
Program start: 1 “main” thread created
Need more? Create them with an API/library. Options:
1. OpenMP
Compiler directives: higher-level code
Will compile and run serially if the compiler doesn't support OpenMP
    #pragma omp parallel for
    for (int i = 0; i < 10; i++) {
2. C++11 threads
More control, but lower-level code (a complete runnable example follows below)
    #include <thread>  // C++11 feature
    int main() {
        thread myThread(start_func_for_thread);
3. pthreads
Same idea as C++11 threads, but nastier syntax
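A complete, runnable version of the C++11 snippet from option 2 above; the body of start_func_for_thread is an illustrative placeholder, since the slide only names the function.

    #include <iostream>
    #include <thread>  // C++11 feature

    // Placeholder body: the slide does not show what the thread does.
    void start_func_for_thread() {
        std::cout << "hello from the new thread\n";
    }

    int main() {
        std::thread myThread(start_func_for_thread);  // runs concurrently with main
        myThread.join();  // wait for the thread to finish before main exits
        return 0;
    }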

10 Convert to Use Multiple Threads
    int main() {
        vector<int> a(100), b(100);
        ...
        for (int i = 0; i < a.size(); i++) {
            a[i] = a[i] + b[i];
            cout << "a[" << i << "] is " << a[i] << endl;
        }
    }

11 Convert to Use Multiple Threads
    int main() {
        vector<int> a(100,50), b(100,2);  // these variables shared by all threads
        ...
        #pragma omp parallel for
        for (int i = 0; i < a.size(); i++) {
            // local variables declared within the block → separate per thread
            a[i] = a[i] + b[i];
            cout << "a[" << i << "] is " << a[i] << endl;
        }
    }

12 Output?
    a[0] is 52
    a[1] is 52
    a[2] is a[25] is 52
    a[26] is 52
    52
    a[27] a[50] is 52
    is 52
    ...
std::cout is a global variable being shared by multiple threads: it is not "thread safe".
Each << is a function call to operator<<, so output from different threads randomly interleaves.

13 What Did OpenMP Do?
Split the loop iterations across threads (4 to 8 on the UG machines); a sketch of one possible division follows below.
    // thread 1
    for (int i = 0; i < 25; i++) {
        cout << a[i] << endl;
    }
    // thread 2
    for (int i = 25; i < 50; i++) {
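A small sketch of how the iterations might be divided; this assumes OpenMP's common static schedule, and the exact split is implementation-defined (the standard only requires that each iteration run exactly once).

    #include <iostream>

    int main() {
        const int N = 100, T = 4;  // 100 loop iterations split across 4 threads
        for (int t = 0; t < T; t++) {
            int begin = t * N / T;        // first iteration given to thread t
            int end   = (t + 1) * N / T;  // one past the last iteration
            std::cout << "thread " << t << ": i = " << begin
                      << " .. " << end - 1 << std::endl;
        }
        return 0;
    }

This prints thread 0: i = 0 .. 24 through thread 3: i = 75 .. 99 — every iteration runs on exactly one thread.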

14 Fixing the Output Problem
    int main() {
        vector<int> a(100,50), b(100,2);
        ...
        #pragma omp parallel for
        for (int i = 0; i < a.size(); i++) {
            #pragma omp critical
            cout << "a[" << i << "] is " << a[i] << endl;
        }
    }
Only one thread at a time can execute the critical block.
Make everything critical? → Destroys your performance (serial again).
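For comparison (not shown on the slides): the same "one thread at a time" idea expressed with C++11 threads instead of OpenMP, using a std::mutex to guard cout. print_range and the fixed two-thread split are illustrative choices, not the slides' code.

    #include <iostream>
    #include <mutex>
    #include <thread>
    #include <vector>

    std::mutex cout_mutex;  // guards the shared cout stream

    void print_range(int begin, int end, const std::vector<int>& a) {
        for (int i = begin; i < end; i++) {
            std::lock_guard<std::mutex> lock(cout_mutex);  // critical section
            std::cout << "a[" << i << "] is " << a[i] << std::endl;
        }  // lock released here, at end of each iteration's scope
    }

    int main() {
        std::vector<int> a(100, 52);
        std::thread t1(print_range, 0, 50, std::cref(a));
        std::thread t2(print_range, 50, 100, std::cref(a));
        t1.join();
        t2.join();
    }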

15 M4: What to Multithread?
Want:
Takes much (ideally the majority of) CPU time
Work is independent → no or few dependencies
Candidates?
Path finding:
2N invocations of multi-destination Dijkstra
All doing independent work
Biggest CPU time for most teams
Multi-start of the order optimization algorithm:
Independent work
Large CPU time if the algorithm is complex
Multiple perturbations to the order at once?
Harder: need a critical section to update the best solution (see the sketch below)
Maybe update only periodically
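A hedged sketch of the multi-start idea above: each thread runs an independent optimization attempt, and only the update of the shared best solution sits in a critical section. solve_one_start and its cost are hypothetical placeholders, not the course's actual API.

    #include <iostream>
    #include <limits>
    #include <vector>

    // Hypothetical stand-in for one optimization attempt from one start.
    double solve_one_start(int seed, std::vector<int>& order) {
        order.assign(5, seed);       // placeholder "solution"
        return 1000.0 / (seed + 1);  // placeholder cost
    }

    int main() {
        const int num_starts = 8;
        std::vector<int> best_order;
        double best_cost = std::numeric_limits<double>::infinity();

        #pragma omp parallel for
        for (int s = 0; s < num_starts; s++) {
            std::vector<int> order;                   // local: private per iteration
            double cost = solve_one_start(s, order);  // independent work
            #pragma omp critical                      // short critical section:
            if (cost < best_cost) {                   // only the best-solution update
                best_cost = cost;
                best_order = order;
            }
        }
        std::cout << "best cost: " << best_cost << std::endl;
    }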

16 How Much Speedup Can I Get?
Path order: 40% of time. Pathfinding: 60% of time.
Make pathfinding parallel? 4 cores → pathfinding time drops by at most 4x.
Total time = 0.4 + 0.6 / 4 = 0.55 of serial → 1.8x faster.
Limited by the serial code → Amdahl's law (general form below).
Can run 8 threads with reduced per-thread performance (hyperthreading) → time ≈ 0.51 of serial in this case.
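The general form of Amdahl's law behind this calculation (a standard result, stated here for reference): with a fraction p of the work parallelized across s cores, and the remaining 1 − p left serial,

    speedup = 1 / ((1 − p) + p / s)

Here p = 0.6 and s = 4, giving 1 / (0.4 + 0.6/4) = 1 / 0.55 ≈ 1.8x. Even with s → ∞, the speedup is capped at 1 / (1 − p) = 2.5x.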

17 Gotcha?
    vector<int> reaching_edge;  // for all intersections: how did I reach you?

    #pragma omp parallel for
    for (int i = 0; i < deliveries.size(); i++) {
        path_times = multi_dest_dijkstra(i, deliveries);
    }
Writing to a global variable (reaching_edge, path_times) in multiple threads?
Refactor the code to make the variable local (best); a sketch follows below.
Use omp critical only if truly necessary.
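A sketch of the "make it local" refactor suggested above: each loop iteration gets its own reaching_edge, and each iteration writes only its own path_times[i]. The signature and body of multi_dest_dijkstra here are invented placeholders matching the slide's names, not the course's actual API.

    #include <vector>

    // Hypothetical placeholder: fills the caller-supplied reaching_edge
    // instead of a global, and returns placeholder travel times.
    std::vector<double> multi_dest_dijkstra(int source,
                                            const std::vector<int>& deliveries,
                                            std::vector<int>& reaching_edge) {
        reaching_edge.assign(deliveries.size(), source);     // placeholder bookkeeping
        return std::vector<double>(deliveries.size(), 1.0);  // placeholder times
    }

    int main() {
        std::vector<int> deliveries(100);
        std::vector<std::vector<double>> path_times(deliveries.size());

        #pragma omp parallel for
        for (int i = 0; i < (int)deliveries.size(); i++) {
            // reaching_edge is now local, so each iteration (thread) has its own;
            // writing to path_times[i] is safe because each i is distinct.
            std::vector<int> reaching_edge;
            path_times[i] = multi_dest_dijkstra(i, deliveries, reaching_edge);
        }
    }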

18 Keep It Simple!
Do not try multi-threading until you have a good serial implementation!
Multi-threaded code is much harder to write and debug.
You do not need to multi-thread to have a good implementation.
Resources:
OpenMP Tutorial
Debugging threads with gdb

19 To Learn More
ECE 454: Computer Systems Programming
How to make programs go fast:
Profiling
Writing fast code for a compiler
Parallelism

