Threaded Programming Methodology Intel Software College.

1 Threaded Programming Methodology Intel Software College

2 Objectives
After completion of this module you will be able to rapidly prototype and estimate the effort required to thread time-consuming regions.
Copyright © 2006, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

3 Agenda
- A Generic Development Cycle
- Case Study: Prime Number Generation
- Common Performance Issues

4 What is Parallelism?
Two or more processes or threads execute at the same time.
Parallelism for threading architectures:
- Multiple processes: communication through Inter-Process Communication (IPC)
- Single process, multiple threads: communication through shared memory

5 Amdahl’s Law
Describes the upper bound of parallel execution speedup. Serial code limits speedup.
With $n$ = number of processors and $P$ = the fraction of the serial run time that can run in parallel:
$T_{parallel} = \{(1-P) + P/n\} \, T_{serial}$
$Speedup = T_{serial} / T_{parallel}$
Example with $P = 0.5$ and $T_{serial} = 1.0$:
- $n = 2$: $T_{parallel} = 0.5 + 0.25 = 0.75$, so $Speedup = 1.0/0.75 = 1.33$
- $n = \infty$: $T_{parallel} = 0.5 + 0.0 = 0.5$, so $Speedup = 1.0/0.5 = 2.0$
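The formula above can be checked numerically. The following is a minimal sketch; the function names `AmdahlSpeedup` and `AmdahlLimit` are illustrative and do not come from the course material.

```cpp
#include <cassert>

// Amdahl's Law: speedup = T_serial / T_parallel = 1 / ((1 - P) + P / n).
// parallelFraction is P (0 <= P <= 1); numProcessors is n (n >= 1).
double AmdahlSpeedup(double parallelFraction, double numProcessors) {
    assert(parallelFraction >= 0.0 && parallelFraction <= 1.0);
    assert(numProcessors >= 1.0);
    return 1.0 / ((1.0 - parallelFraction) + parallelFraction / numProcessors);
}

// Upper bound as n grows without limit: the P/n term vanishes.
double AmdahlLimit(double parallelFraction) {
    assert(parallelFraction >= 0.0 && parallelFraction < 1.0);
    return 1.0 / (1.0 - parallelFraction);
}
```

Calling `AmdahlSpeedup(0.5, 2)` gives about 1.33 and `AmdahlLimit(0.5)` gives 2.0, matching the two cases on the slide.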

6 Processes and Threads
- Modern operating systems load programs as processes
  - Resource holder
  - Execution
- A process starts executing at its entry point as a thread
- Threads can create other threads within the process
  - Each thread gets its own stack
- All threads within a process share code & data segments
[Figure: a process with a code segment, a data segment, and one stack per thread (the main() thread and further threads)]

7 Threads – Benefits & Risks
Benefits
- Increased performance and better resource utilization
  - Even on single-processor systems, for hiding latency and increasing throughput
- IPC through shared memory is more efficient
Risks
- Increases complexity of the application
- Difficult to debug (data races, deadlocks, etc.)

8 Commonly Encountered Questions with Threading Applications
- Where to thread?
- How long would it take to thread?
- How much re-design/effort is required?
- Is it worth threading a selected region?
- What should the expected speedup be?
- Will the performance meet expectations?
- Will it scale as more threads/data are added?
- Which threading model to use?

9 Prime Number Generation
    bool TestForPrime(int val)
    {   // let’s start checking from 3
        int limit, factor = 3;
        limit = (long)(sqrtf((float)val)+0.5f);
        while( (factor <= limit) && (val % factor) )
            factor ++;
        return (factor > limit);
    }

    void FindPrimes(int start, int end)
    {
        int range = end - start + 1;
        for( int i = start; i <= end; i += 2 ) {
            if( TestForPrime(i) )
                globalPrimes[gPrimesFound++] = i;
            ShowProgress(i, range);
        }
    }
Factors tested for each candidate i:
    i     factor
    61    3 5 7
    63    3
    65    3 5
    67    3 5 7
    69    3
    71    3 5 7
    73    3 5 7 9
    75    3 5
    77    3 5 7
    79    3 5 7 9
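For readers who want to try the serial algorithm without the course project, here is a self-contained sketch of the same trial-division approach. It is an adaptation, not the PrimeSingle source: a `std::vector` stands in for `globalPrimes` and `gPrimesFound`, and the `ShowProgress` call is left out.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Same trial-division test as the slide; intended for odd val >= 3.
bool TestForPrime(int val) {
    assert(val >= 3 && val % 2 == 1);
    int factor = 3;
    const int limit =
        static_cast<int>(std::sqrt(static_cast<float>(val)) + 0.5f);
    while (factor <= limit && (val % factor) != 0)
        ++factor;
    return factor > limit;
}

// Collects the odd primes in [start, end]; start must be odd.
std::vector<int> FindPrimes(int start, int end) {
    assert(start % 2 == 1);
    std::vector<int> primes;  // stands in for globalPrimes / gPrimesFound
    for (int i = start; i <= end; i += 2)
        if (TestForPrime(i))
            primes.push_back(i);
    return primes;
}
```

For the candidates listed on the slide, `FindPrimes(61, 79)` returns 61, 67, 71, 73, and 79.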

10 Activity 1
Run the serial version of the Prime code
- Locate the PrimeSingle directory
- Compile with the Intel compiler in Visual Studio
- Run a few times with different ranges

11 Development Methodology
- Analysis: find computationally intense code
- Design (Introduce Threads): determine how to implement the threading solution
- Debug for correctness: detect any problems resulting from using threads
- Tune for performance: achieve the best parallel performance

12 Development Cycle
- Analysis
  - VTune™ Performance Analyzer
- Design (Introduce Threads)
  - Intel® Performance libraries: IPP and MKL
  - OpenMP* (Intel® Compiler)
  - Explicit threading (Win32*, Pthreads*)
- Debug for correctness
  - Intel® Thread Checker
  - Intel Debugger
- Tune for performance
  - Intel® Thread Profiler
  - VTune™ Performance Analyzer

13 Analysis - Sampling
Use VTune Sampling to find hotspots in the application. Sampling identifies the time-consuming regions.
Let’s use the project PrimeSingle for analysis.
PrimeSingle usage: ./PrimeSingle 1 1000000
    bool TestForPrime(int val)
    {   // let’s start checking from 3
        int limit, factor = 3;
        limit = (long)(sqrtf((float)val)+0.5f);
        while( (factor <= limit) && (val % factor))
            factor ++;
        return (factor > limit);
    }

    void FindPrimes(int start, int end)
    {   // start is always odd
        int range = end - start + 1;
        for( int i = start; i <= end; i+= 2 ){
            if( TestForPrime(i) )
                globalPrimes[gPrimesFound++] = i;
            ShowProgress(i, range);
        }
    }

14 Analysis - Call Graph
Used to find the proper level in the call tree to thread.
[Figure: call graph with a callout marking the level in the call tree where we need to thread]

15 Analysis
Where to thread?
- FindPrimes()
Is it worth threading a selected region?
- Appears to have minimal dependencies
- Appears to be data-parallel
- Consumes over 95% of the run time
[Figure: baseline measurement]

16 Activity 2
- Run code with ‘1 5000000’ range to get baseline measurement
  - Make note for future reference
- Run VTune analysis on serial code
  - What function takes the most time?

17 Foster’s Design Methodology
From Designing and Building Parallel Programs by Ian Foster
Four steps:
- Partitioning: dividing computation and data
- Communication: sharing data between computations
- Agglomeration: grouping tasks to improve performance
- Mapping: assigning tasks to processors/threads

18 Designing Threaded Programs
- Partition: divide problem into tasks
- Communicate: determine amount and pattern of communication
- Agglomerate: combine tasks
- Map: assign agglomerated tasks to created threads
[Figure: The Problem, Initial tasks, Communication, Combined Tasks, Final Program]

19 Parallel Programming Models
Functional Decomposition
- Task parallelism
- Divide the computation, then associate the data
- Independent tasks of the same problem
Data Decomposition
- Same operation performed on different data
- Divide data into pieces, then associate computation

20 Decomposition Methods
Functional Decomposition
- Focusing on computations can reveal structure in a problem
[Figure: Atmosphere Model, Ocean Model, Land Surface Model, Hydrology Model]
Domain Decomposition
- Focus on largest or most frequently accessed data structure
- Data Parallelism: same operation applied to all data
[Figure: grid reprinted with permission of Dr. Phu V. Luong, Coastal and Hydraulics Laboratory, ERDC]

21 Pipelined Decomposition
Computation done in independent stages
Functional decomposition
- Threads are assigned a stage to compute
- Automobile assembly line
Data decomposition
- Thread processes all stages of a single instance
- One worker builds an entire car

22 LAME Encoder Example
LAME MP3 encoder
- Open source project
- Educational tool used for learning
The goal of the project is
- To improve the psychoacoustics quality
- To improve the speed of MP3 encoding

23 LAME Pipeline Strategy
Work done per frame, by stage:
- Prelude: fetch next frame, frame characterization, set encode parameters
- Acoustics: psycho analysis, FFT long/short, filter assemblage
- Encoding: apply filtering, noise shaping, quantize & count bits
- Other: add frame header, check correctness, write to disk
[Figure: timeline of threads T1 to T4 running the Prelude, Acoustics, Encoding, and Other stages on frames N, N+1, N+2, N+3, separated by a hierarchical barrier]

24 Design
- What is the expected benefit?
- How do you achieve this with the least effort?
- How long would it take to thread?
- How much re-design/effort is required?
Rapid prototyping with OpenMP
Speedup(2P) = 100/(96/2+4) = ~1.92X
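The 1.92X estimate is Amdahl’s Law applied to two processors. Reading the numbers off the slide’s expression, the calculation appears to assume that 96% of the run time is parallelizable and 4% remains serial, with the serial run time normalized to 100:

```latex
\text{Speedup}(2P) = \frac{T_{serial}}{T_{parallel}}
                   = \frac{100}{\frac{96}{2} + 4}
                   = \frac{100}{52}
                   \approx 1.92
```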

25 OpenMP
Fork-join parallelism:
- Master thread spawns a team of threads as needed
- Parallelism is added incrementally
  - Sequential program evolves into a parallel program
[Figure: master thread forking into parallel regions]

26 Design
OpenMP
    #pragma omp parallel for
    for( int i = start; i <= end; i+= 2 ){
        if( TestForPrime(i) )
            globalPrimes[gPrimesFound++] = i;
        ShowProgress(i, range);
    }
- parallel: create threads here for this parallel region
- for: divide iterations of the for loop

27 Activity 3
Run OpenMP version of code
- Locate PrimeOpenMP directory and solution
- Compile code
- Run with ‘1 5000000’ for comparison
  - What is the speedup?

28 Design
- What is the expected benefit?
- How do you achieve this with the least effort?
- How long would it take to thread?
- How much re-design/effort is required?
- Is this the best speedup possible?
Speedup of 1.40X (less than 1.92X)

29 Debugging for Correctness
Is this threaded implementation right?
No! The answers are different each time …

30 Debugging for Correctness
Intel® Thread Checker pinpoints notorious threading bugs like data races, stalls and deadlocks.
[Figure: Intel® Thread Checker within VTune™ Performance Analyzer. Binary instrumentation produces an instrumented Primes.exe and instrumented DLLs; the runtime data collector writes the result file threadchecker.thr]

31 Thread Checker

32 Activity 4
Use Thread Checker to analyze threaded application
- Create Thread Checker activity
- Run application
- Are any errors reported?

33 Debugging for Correctness
- How much re-design/effort is required?
- How long would it take to thread?
Thread Checker reported only 2 dependencies, so effort required should be low

34 Debugging for Correctness
    #pragma omp parallel for
    for( int i = start; i <= end; i+= 2 ){
        if( TestForPrime(i) )
    #pragma omp critical
            globalPrimes[gPrimesFound++] = i;
        ShowProgress(i, range);
    }
Will create a critical section for this reference

    #pragma omp critical
    {
        gProgress++;
        percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f);
    }
Will create a critical section for both these references
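A self-contained sketch of the critical-section fix follows. It is an adaptation rather than the PrimeOpenMP source: `CountPrimesCritical` is a hypothetical name, the output array is passed in as a `std::vector`, and the progress display is omitted. The code also compiles without OpenMP support, in which case the pragmas are ignored and the loop runs serially.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Trial-division test from the slides; intended for odd val >= 3.
bool TestForPrime(int val) {
    int factor = 3;
    const int limit =
        static_cast<int>(std::sqrt(static_cast<float>(val)) + 0.5f);
    while (factor <= limit && (val % factor) != 0)
        ++factor;
    return factor > limit;
}

// Stores the odd primes in [start, end] into globalPrimes (in no
// particular order when run in parallel) and returns how many were found.
int CountPrimesCritical(int start, int end, std::vector<int>& globalPrimes) {
    assert(start >= 3 && start % 2 == 1 && end >= start);
    globalPrimes.assign((end - start) / 2 + 1, 0);
    int gPrimesFound = 0;  // shared by all threads in the team

    #pragma omp parallel for
    for (int i = start; i <= end; i += 2) {
        if (TestForPrime(i)) {
            // Only one thread at a time may read and bump the shared index.
            #pragma omp critical
            globalPrimes[gPrimesFound++] = i;
        }
    }
    globalPrimes.resize(gPrimesFound);
    return gPrimesFound;
}
```

With the critical section in place the count is the same on every run, although the order of the stored primes can still vary between runs.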

35 Activity 5
Modify and run OpenMP version of code
- Add critical region pragmas to code
- Compile code
- Run from within Thread Checker
  - If errors are still present, make appropriate fixes to code and run again in Thread Checker
- Run with ‘1 5000000’ for comparison
  - Compile and run outside Thread Checker
  - What is the speedup?

36 Correctness
Correct answer, but performance has slipped to ~1.33X
Is this the best we can expect from this algorithm?
No! From Amdahl’s Law, we expect speedup close to 1.9X

37 Common Performance Issues
- Parallel Overhead: due to thread creation, scheduling …
- Synchronization: excessive use of global data, contention for the same synchronization object
- Load Imbalance: improper distribution of parallel work
- Granularity: insufficient parallel work

38 Tuning for Performance
Thread Profiler pinpoints performance bottlenecks in threaded applications.
[Figure: Thread Profiler within VTune™ Performance Analyzer. Either the compiler applies source instrumentation (/Qopenmp_profile) when building Primes.c into Primes.exe, or binary instrumentation produces an instrumented Primes.exe and instrumented DLLs; the runtime data collector writes the result file Bistro.tp/guide.gvs]

39 Thread Profiler for OpenMP

40 Thread Profiler for OpenMP
Speedup Graph
- Estimates threading speedup and potential speedup
  - Based on Amdahl’s Law computation
  - Gives upper and lower bound estimates

41 Thread Profiler for OpenMP
[Figure: Thread Profiler view with regions labeled serial and parallel]

42 Thread Profiler for OpenMP
[Figure: Thread Profiler per-thread view for Thread 0, Thread 1, Thread 2, Thread 3]

43 Thread Profiler (for Explicit Threads)

44 Thread Profiler (for Explicit Threads)
Why so many transitions?

45 Performance
- This implementation has implicit synchronization calls
- This limits scaling performance due to the resulting context switches
Back to the design stage

46 Activity 6
Use Thread Profiler to analyze threaded application
- Use /Qopenmp_profile to compile and link
- Create Thread Profiler Activity (for explicit threads)
- Run application in Thread Profiler
- Find the source line that is causing the threads to be inactive

47 Performance
Is that much contention expected?
The algorithm has many more updates than the 10 needed for showing progress.
    void ShowProgress( int val, int range )
    {
        int percentDone;
        gProgress++;
        percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f);
        if( percentDone % 10 == 0 )
            printf("\b\b\b\b%3d%%", percentDone);
    }

    void ShowProgress( int val, int range )
    {
        int percentDone;
        static int lastPercentDone = 0;
    #pragma omp critical
        {
            gProgress++;
            percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f);
        }
        if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10){
            printf("\b\b\b\b%3d%%", percentDone);
            lastPercentDone++;
        }
    }
This change should fix the contention issue
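To see how many display updates each condition triggers, the two tests can be replayed without any printing. This is a sketch for illustration only; `CountProgressUpdates` is a hypothetical helper that mimics the `gProgress` counter over the odd numbers of a range and counts how often the `printf` branch would be taken.

```cpp
#include <cassert>

// Counts how many times the progress display would be refreshed while
// scanning the odd numbers of a range holding `range` values.
//   throttled == false: original test   (percentDone % 10 == 0)
//   throttled == true:  modified test   (also checks lastPercentDone)
int CountProgressUpdates(int range, bool throttled) {
    assert(range >= 100);
    int updates = 0;
    int lastPercentDone = 0;
    // Only odd candidates are visited, hence range / 2 calls and the
    // factor of 200 (rather than 100) in the percentage formula.
    for (int gProgress = 1; gProgress <= range / 2; ++gProgress) {
        const int percentDone = static_cast<int>(
            (float)gProgress / (float)range * 200.0f + 0.5f);
        bool refresh = (percentDone % 10 == 0);
        if (throttled)
            refresh = refresh && (lastPercentDone < percentDone / 10);
        if (refresh) {
            ++updates;
            if (throttled)
                ++lastPercentDone;
        }
    }
    return updates;
}
```

For a range of 1000000, the throttled test fires exactly 10 times, while the original test fires on every call whose rounded percentage happens to be a multiple of 10, which is many thousands of times.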

48 Design
Goals
- Eliminate the contention due to implicit synchronization
Speedup is 2.32X! Is that right?

49 Performance
Our original baseline measurement had the “flawed” progress update algorithm
Is this the best we can expect from this algorithm?
Speedup is actually 1.40X (<<1.9X)!

50 Activity 7
Modify ShowProgress function (both serial and OpenMP) to print only the needed output
- Recompile and run the code
  - Be sure no instrumentation flags are used
- What is speedup from serial version now?
    if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10){
        printf("\b\b\b\b%3d%%", percentDone);
        lastPercentDone++;
    }

51 Performance Re-visited
Still have 62% of execution time in locks and synchronization

52 Performance Re-visited
Let’s look at the OpenMP locks…
    void FindPrimes(int start, int end)
    {   // start is always odd
        int range = end - start + 1;
    #pragma omp parallel for
        for( int i = start; i <= end; i += 2 ) {
            if( TestForPrime(i) )
    #pragma omp critical
                globalPrimes[gPrimesFound++] = i;
            ShowProgress(i, range);
        }
    }
Lock is in a loop
    void FindPrimes(int start, int end)
    {   // start is always odd
        int range = end - start + 1;
    #pragma omp parallel for
        for( int i = start; i <= end; i += 2 ) {
            if( TestForPrime(i) )
                // InterlockedIncrement returns the incremented value,
                // so subtract 1 to get the same slot as gPrimesFound++
                globalPrimes[InterlockedIncrement(&gPrimesFound) - 1] = i;
            ShowProgress(i, range);
        }
    }
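`InterlockedIncrement` is a Win32 call. On other platforms, one possible substitute is OpenMP’s own `atomic capture` construct (assumed here to be available, which requires OpenMP 3.1 or later). The sketch below is an adaptation under that assumption, not the course solution; `CountPrimesAtomic` is a hypothetical name and the progress display is omitted. Note that the capture form yields the counter’s value before the increment, exactly like `gPrimesFound++`, whereas `InterlockedIncrement` returns the value after the increment.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Trial-division test from the slides; intended for odd val >= 3.
bool TestForPrime(int val) {
    int factor = 3;
    const int limit =
        static_cast<int>(std::sqrt(static_cast<float>(val)) + 0.5f);
    while (factor <= limit && (val % factor) != 0)
        ++factor;
    return factor > limit;
}

// Stores the odd primes in [start, end] into globalPrimes (unordered when
// run in parallel) and returns the number found.
int CountPrimesAtomic(int start, int end, std::vector<int>& globalPrimes) {
    assert(start >= 3 && start % 2 == 1 && end >= start);
    globalPrimes.assign((end - start) / 2 + 1, 0);
    int gPrimesFound = 0;  // shared counter

    #pragma omp parallel for
    for (int i = start; i <= end; i += 2) {
        if (TestForPrime(i)) {
            int slot;
            // Atomically fetch the old count and increment it; each thread
            // gets a distinct slot without taking a lock.
            #pragma omp atomic capture
            slot = gPrimesFound++;
            globalPrimes[slot] = i;
        }
    }
    globalPrimes.resize(gPrimesFound);
    return gPrimesFound;
}
```

Only the index update is atomic; the store into `globalPrimes[slot]` needs no protection because no two threads receive the same slot.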

53 Performance Re-visited
Let’s look at the second lock
    void ShowProgress( int val, int range )
    {
        int percentDone;
        static int lastPercentDone = 0;
    #pragma omp critical
        {
            gProgress++;
            percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f);
        }
        if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10){
            printf("\b\b\b\b%3d%%", percentDone);
            lastPercentDone++;
        }
    }
This lock is also being called within a loop
    void ShowProgress( int val, int range )
    {
        int percentDone;
        long localProgress;
        static int lastPercentDone = 0;
        localProgress = InterlockedIncrement(&gProgress);
        percentDone = (int)((float)localProgress/(float)range*200.0f+0.5f);
        if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10){
            printf("\b\b\b\b%3d%%", percentDone);
            lastPercentDone++;
        }
    }

54 Activity 8
Modify OpenMP critical regions to use InterlockedIncrement instead
- Re-compile and run code
- What is speedup from serial version now?

55 Thread Profiler for OpenMP
[Figure: the range up to 1000000 split among four threads at 250000, 500000, and 750000, with one sample candidate per thread]
    Thread      Sample candidate    Factors to test
    Thread 0    116747              342
    Thread 1    373553              612
    Thread 2    623759              789
    Thread 3    873913              934

56 Fixing the Load Imbalance
Distribute the work more evenly
    void FindPrimes(int start, int end)
    {   // start is always odd
        int range = end - start + 1;
    #pragma omp parallel for schedule(static, 8)
        for( int i = start; i <= end; i += 2 ) {
            if( TestForPrime(i) )
                // InterlockedIncrement returns the incremented value,
                // so subtract 1 to get the same slot as gPrimesFound++
                globalPrimes[InterlockedIncrement(&gPrimesFound) - 1] = i;
            ShowProgress(i, range);
        }
    }
Speedup achieved is 1.68X
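The effect of `schedule(static, 8)` can be estimated without running any threads. The sketch below is a rough model, not a measurement: it assumes that the cost of testing candidate `i` is proportional to `sqrt(i)` (the upper bound on factors to try, which composites often do not reach) and that the schedule without a chunk size hands each thread one contiguous block. `ImbalanceRatio` is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Ratio of the most-loaded to the least-loaded thread under a static
// schedule. chunk == 0 models one contiguous block per thread;
// chunk > 0 models schedule(static, chunk), i.e. round-robin chunks.
double ImbalanceRatio(int start, int end, int numThreads, int chunk) {
    assert(start % 2 == 1 && end >= start && numThreads >= 1 && chunk >= 0);
    std::vector<double> work(numThreads, 0.0);
    const long iterations = (end - start) / 2 + 1;
    const long blockSize = (iterations + numThreads - 1) / numThreads;

    for (long k = 0; k < iterations; ++k) {
        const long candidate = start + 2 * k;
        const long owner = (chunk > 0) ? (k / chunk) % numThreads
                                       : k / blockSize;
        // Assumed cost model: about sqrt(candidate) trial divisions.
        work[owner] += std::sqrt(static_cast<double>(candidate));
    }
    const auto range = std::minmax_element(work.begin(), work.end());
    assert(*range.first > 0.0);  // every thread must receive some work
    return *range.second / *range.first;
}
```

Under this model, four threads working on 1 to 1000000 in contiguous blocks differ by a factor of roughly 2.8 between the first and last thread, while chunks of 8 bring the ratio to within a fraction of a percent of 1.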

57 Activity 9
Modify code for better load balance
- Add schedule (static, 8) clause to OpenMP parallel for pragma
- Re-compile and run code
- What is speedup from serial version now?

58 Final Thread Profiler Run
Speedup achieved is 1.80X

59 Comparative Analysis
Threading an application requires multiple iterations through the software development cycle

60 Threading Methodology: What's Been Covered
Four-step development cycle for writing threaded code from serial code, and the Intel® tools that support each step
    Analysis
    Design (introduce threads)
    Debug for correctness
    Tune for performance
Threading an application requires multiple iterations of the design, debug and performance-tuning steps
Use tools to improve productivity


62 Backup Slides

63 Parallel Overhead
Thread creation overhead
    Overhead increases rapidly as the number of active threads increases
Solution
    Use re-usable threads and thread pools
    Amortizes the cost of thread creation
    Keeps the number of active threads relatively constant
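
One way to amortize thread start-up in OpenMP code is to open a single parallel region and reuse its team for many work-sharing loops, instead of opening a new region for each one. The following is a minimal sketch of that idea, not code from the course. `scale_rows` is a hypothetical example function, and many OpenMP runtimes already pool their worker threads, so the gain depends on the implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Multiplies every element of a rows-by-cols matrix (row-major) by factor. */
void scale_rows(double *a, int rows, int cols, double factor)
{
    assert(a != NULL && rows >= 0 && cols >= 0);

    /* One team of threads is started here and reused for every row. */
    #pragma omp parallel
    {
        /* Every thread runs the row loop; r is private to each thread. */
        for (int r = 0; r < rows; r++) {
            /* The columns of row r are shared out among the existing team.
               The implicit barrier after the loop keeps the rows in step. */
            #pragma omp for
            for (int c = 0; c < cols; c++)
                a[(size_t)r * cols + c] *= factor;
        }
    }
}
```

Compiled without OpenMP support, the pragmas are ignored and the function runs serially with the same result.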

64 Synchronization
Heap contention
    Allocation from the heap causes implicit synchronization
    Allocate on the stack or use thread-local storage
Atomic updates versus critical sections
    Some global data updates can use atomic operations (Interlocked family)
    Use atomic updates whenever possible
Critical sections versus mutexes
    CRITICAL_SECTION objects reside in user space
    Use CRITICAL_SECTION objects when visibility across process boundaries is not required
    They introduce less overhead
    They have a spin-wait variant that is useful for some applications
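
The Interlocked functions are specific to Win32. As a portable alternative (a swapped-in technique, not what the slides use), OpenMP's `atomic capture` construct can reserve an array slot in one atomic step. The sketch below assumes OpenMP 3.1 or later; `is_prime`, `find_odd_primes`, `primes`, `primes_found` and `MAX_PRIMES` are hypothetical names chosen for illustration. The order in which primes land in the array depends on thread timing.

```c
#include <assert.h>

#define MAX_PRIMES 1024          /* assumed capacity for this sketch */

int primes[MAX_PRIMES];
int primes_found = 0;

/* Trial division; returns 1 if n is prime, 0 otherwise. */
int is_prime(int n)
{
    if (n < 2) return 0;
    if (n % 2 == 0) return n == 2;
    for (int d = 3; d * d <= n; d += 2)
        if (n % d == 0) return 0;
    return 1;
}

/* Tests the odd candidates in [start, end]; returns how many primes were stored. */
int find_odd_primes(int start, int end)
{
    assert(start % 2 == 1);      /* start is always odd, as on the slide */
    primes_found = 0;

    #pragma omp parallel for schedule(static, 8)
    for (int i = start; i <= end; i += 2) {
        if (is_prime(i)) {
            int slot;
            /* Read the old counter value and increment it as one atomic step. */
            #pragma omp atomic capture
            slot = primes_found++;
            if (slot < MAX_PRIMES)   /* drop primes that do not fit */
                primes[slot] = i;
        }
    }
    return primes_found < MAX_PRIMES ? primes_found : MAX_PRIMES;
}
```

Because the captured value is the counter before the increment, the first prime found is stored in slot 0.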

65 Load Imbalance
Unequal work loads lead to idle threads and wasted time
[Figure: per-thread timeline along a Time axis showing Busy and Idle periods]

66 Granularity
[Figure: coarse-grain and fine-grain decompositions, each drawn with a Serial portion and a Parallelizable portion; scaling labels in slide order: ~3X, ~1.10X, ~2.5X, ~1.05X]

