Multithreading: Why & How

Intel 8086
- First PC microprocessor (1978)
- 29,000 transistors
- 5 MHz clock
- ~10 clocks / instruction
- ~500,000 instructions / s

Intel Core i7 ("Haswell")
- 1.5 billion transistors
- 4.0 GHz clock
- ~15 clocks / instruction, but ~30 instructions in flight
- Average of ~2 instructions completed / clock
- ~8 billion instructions / s
(die photo courtesy techpowerup.com)

CPU Scaling: Past & Future
- 1978 to 2015: 50,000x more transistors, 16,000x more instructions / s
- The future: 2x transistors every ~3 years
  - But the transistors are not much faster -> CPU clock speed is saturating
  - ~30 instructions in flight; the complexity & power needed to go beyond this climb rapidly -> slow growth in instructions / cycle
- Impact: CPU speed is not growing much, but many processors (cores) fit on a single chip
  - Using multiple cores is now important
  - Multithreading: one program using multiple cores at once

A Single-Threaded Program
Memory layout: instructions (code), global variables, heap variables (new), ..., and the stack (local variables). The CPU / core holds a program counter and a stack pointer.

A Multi-Threaded Program
Memory layout: the instructions (code), global variables, and heap variables (new) are shared by all threads. Each thread gets its own stack of local variables (stack1 for thread 1, stack2 for thread 2), and each core/thread pair (core 1 / thread 1, core 2 / thread 2) has its own program counter and stack pointer.

Thread Basics
- Each thread has its own program counter
  - It can be executing a different function
  - It is (almost always) executing a different instruction from the other threads
- Each thread has its own stack
  - It has its own copy of local variables (all different)
- Each thread sees the same global variables
- Dynamically allocated memory is shared by all threads
  - Any thread with a pointer to it can access it

Implications
- Threads can communicate through memory (global variables, dynamically allocated memory)
  - Fast communication!
- Must be careful that threads don't conflict in reads/writes to the same memory
  - What if two threads update the same global variable at the same time? Not clear which update wins! (See the sketch below.)
- Can have more threads than CPUs
  - The threads time-share the CPUs
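
A minimal sketch of that lost-update problem (not from the slides; the variable names and counts are illustrative): two threads repeatedly increment the same global counter with no synchronization, so updates can be lost and the final value usually comes out below the expected 2,000,000. Compile with g++ -std=c++11 -pthread.

    #include <iostream>
    #include <thread>

    int counter = 0;                 // global variable shared by both threads

    void add_many() {
        for (int i = 0; i < 1000000; i++)
            counter = counter + 1;   // read-modify-write: not atomic, so updates race
    }

    int main() {
        std::thread t1(add_many);    // both threads update the same global
        std::thread t2(add_many);
        t1.join();
        t2.join();
        std::cout << "counter = " << counter << std::endl;   // rarely 2000000
        return 0;
    }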

Multi-Threading Libraries
At program start, one "main" thread is created. If you need more, create them with an API/library. Options:

1. OpenMP
   Compiler directives: higher-level code. The program will still compile and run serially if the compiler doesn't support OpenMP.
       #pragma omp parallel for
       for (int i = 0; i < 10; i++) {
           ...
       }

2. C++11 threads
   More control, but lower-level code (a complete version appears below).
       #include <thread>   // C++11 feature
       int main() {
           thread myThread(start_func_for_thread);
           ...
       }

3. pthreads
   Same functionality as C++11 threads, but nastier syntax.
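
For reference, a complete, runnable version of the C++11 threads snippet above (only the names from the slide are kept; the thread function body is an assumed placeholder). Compile with g++ -std=c++11 -pthread.

    #include <iostream>
    #include <thread>
    using namespace std;

    void start_func_for_thread() {
        cout << "Hello from the new thread!" << endl;
    }

    int main() {
        thread myThread(start_func_for_thread);   // create and start the second thread
        myThread.join();                          // wait for it to finish before exiting
        return 0;
    }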

Convert to Use Multiple Threads
    int main() {
        vector<int> a(100), b(100);
        ...
        for (int i = 0; i < a.size(); i++) {
            a[i] = a[i] + b[i];
            cout << "a[" << i << "] is " << a[i] << endl;
        }
    }

Convert to Use Multiple Threads
    int main() {
        vector<int> a(100, 50), b(100, 2);        // these variables are shared by all threads
        ...
        #pragma omp parallel for
        for (int i = 0; i < a.size(); i++) {      // local variables declared within the block are separate per thread
            a[i] = a[i] + b[i];
            cout << "a[" << i << "] is " << a[i] << endl;
        }
    }
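
For completeness, a self-contained sketch of the same example (headers and return value added; with GCC or Clang, enable OpenMP with the -fopenmp flag, otherwise the pragma is ignored and the loop simply runs serially):

    #include <iostream>
    #include <vector>
    using namespace std;

    int main() {
        vector<int> a(100, 50), b(100, 2);        // shared by all threads

        #pragma omp parallel for
        for (int i = 0; i < (int)a.size(); i++) { // i is private to each thread
            a[i] = a[i] + b[i];
            cout << "a[" << i << "] is " << a[i] << endl;
        }
        return 0;
    }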

Output?
    a[0] is 52 a[1] is 52 a[2] is a[25] is 52 a[26] is 52 52 a[27] a[50] is 52 is 52 ...
Why the jumbled output? std::cout is a global variable shared by multiple threads, and it is not "thread safe". Each << is a function call to operator<<, so the output from different threads randomly interleaves.

What Did OpenMP Do?
It split the loop iterations across the threads (4 to 8 threads on the UG machines):
    // thread 1
    for (int i = 0; i < 25; i++) {
        cout << a[i] << endl;
    }
    // thread 2
    for (int i = 25; i < 50; i++) {
        ...
    }

Fixing the Output Problem
    int main() {
        vector<int> a(100, 50), b(100, 2);
        ...
        #pragma omp parallel for
        for (int i = 0; i < a.size(); i++) {
            a[i] = a[i] + b[i];
            #pragma omp critical
            cout << "a[" << i << "] is " << a[i] << endl;
        }
    }
Only one thread at a time can execute the critical block.
Make everything critical? That destroys your performance (you are serial again).
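
One common refinement (an assumption, not from the slides): do the formatting into a thread-local string outside the critical section, so only the actual write is serialized and the critical region stays short.

    #include <iostream>
    #include <sstream>
    #include <vector>
    using namespace std;

    int main() {
        vector<int> a(100, 50), b(100, 2);

        #pragma omp parallel for
        for (int i = 0; i < (int)a.size(); i++) {
            a[i] = a[i] + b[i];

            ostringstream line;                        // local to this thread/iteration
            line << "a[" << i << "] is " << a[i] << "\n";

            #pragma omp critical                       // only the write is serialized
            cout << line.str();
        }
        return 0;
    }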

M4: What to Multithread?
Want:
- Takes much (ideally the majority of) CPU time
- Independent work -> no or few dependencies
Candidates?
- Path finding
  - 2N invocations of multi-destination Dijkstra, all doing independent work
  - Biggest CPU time for most teams
- Multi-start of the order optimization algorithm
  - Independent work
  - Large CPU time if the algorithm is complex
- Multiple perturbations to the order at once?
  - Harder: need a critical section to update the best solution (see the sketch below)
  - Maybe update only periodically
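
A minimal sketch of that last idea (try_perturbation and the cost bookkeeping are hypothetical placeholders): each thread evaluates perturbations independently, and only the comparison/update of the shared best solution sits inside a critical section.

    #include <iostream>
    #include <limits>
    using namespace std;

    // Hypothetical stand-in for evaluating one perturbation of the delivery order.
    double try_perturbation(int i) {
        return 1000.0 / (i + 1);   // placeholder cost
    }

    int main() {
        const int num_perturbations = 1000;
        double best_cost = numeric_limits<double>::infinity();
        int best_index = -1;

        #pragma omp parallel for
        for (int i = 0; i < num_perturbations; i++) {
            double cost = try_perturbation(i);   // independent work, runs in parallel

            // Only the update of the shared best solution is serialized.
            #pragma omp critical
            if (cost < best_cost) {
                best_cost = cost;
                best_index = i;
            }
        }
        cout << "best perturbation " << best_index << " with cost " << best_cost << endl;
        return 0;
    }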

How Much Speedup Can I Get?
- Path order: 40% of the time; pathfinding: 60% of the time
- Make pathfinding parallel? With 4 cores, the pathfinding time drops by at most 4x:
      Total time = 0.4 + 0.6 / 4 = 0.55 of serial  ->  1.8x faster
- Limited by the serial code -> Amdahl's law
- Can run 8 threads with reduced per-thread performance (hyperthreading) -> time drops to about 0.51 of serial in this case
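
The general form is Amdahl's law: if a fraction p of the run time is parallelized perfectly across n cores, the best overall speedup is

    speedup = 1 / ((1 - p) + p / n)

With the numbers above (p = 0.6, n = 4): 1 / (0.4 + 0.6/4) = 1 / 0.55 ≈ 1.8x, as on the slide.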

Gotcha?
Writing to a global variable in multiple threads?
    vector<int> reaching_edge;   // For all intersections: how did I reach you?

    #pragma omp parallel for
    for (int i = 0; i < deliveries.size(); i++) {
        path_times = multi_dest_dijkstra(i, deliveries);
    }
- Refactor the code to make the shared data local (best); see the sketch below
- Use omp critical only if truly necessary
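
A sketch of the "make it local" refactor (the types and helper here are hypothetical stand-ins for the milestone code): declare the scratch vector inside the loop so each thread/iteration gets its own copy, and have each iteration write only to its own element of the shared results vector.

    #include <vector>
    using namespace std;

    struct Delivery {};   // hypothetical placeholder type

    // Hypothetical stand-in: fills reaching_edge and returns a travel time.
    double multi_dest_dijkstra(int src, const vector<Delivery>& deliveries,
                               vector<int>& reaching_edge) {
        reaching_edge.assign(10, src);   // placeholder; real code would run Dijkstra here
        return src * 1.0;
    }

    int main() {
        vector<Delivery> deliveries(100);
        vector<double> path_times(deliveries.size());    // each iteration writes its own slot

        #pragma omp parallel for
        for (int i = 0; i < (int)deliveries.size(); i++) {
            vector<int> reaching_edge;   // local: a separate copy per thread/iteration
            path_times[i] = multi_dest_dijkstra(i, deliveries, reaching_edge);
        }
        return 0;
    }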

Keep It Simple!
- Do not try multi-threading until you have a good serial implementation!
- Multi-threaded code is much harder to write and debug
- You do not need to multi-thread to have a good implementation
Resources
- OpenMP Tutorial
- Debugging threads with gdb

To Learn More
ECE 454: Computer Systems Programming
- How to make programs go fast
- Profiling
- Writing fast code for a compiler
- Parallelism