Presentation transcript:

10/24/2015

Upon completion of this module, you will be able to:
• Use Thread Checker to detect and identify a variety of threading correctness issues in Windows* threaded applications
• Determine if library functions are thread-safe

• What is Intel® Thread Checker?
• Detecting race conditions
• Thread Checker as threading assistant
• Some other threading errors
• Checking library thread-safety
• Other features of Thread Checker

Motivation
• Developing threaded applications can be a complex task
• A new class of problems is caused by the interaction between concurrent threads:
  ◦ Data races or storage conflicts: more than one thread accesses the same memory without synchronization, and at least one access is a write
  ◦ Deadlocks: a thread waits for an event that will never happen

Intel® Thread Checker
• Debugging tool for threaded software
• Finds threading bugs in Windows*, POSIX*, and OpenMP* threaded software
• Quickly locates bugs that can take days to find using traditional methods and tools
  ◦ Isolates problems, not the symptoms
  ◦ A bug does not have to occur to be found
• Plug-in to VTune™ Performance Analyzer
  ◦ Same look, feel, and interface as the VTune™ environment

Intel® Thread Checker Features
• Supports several different compilers
  ◦ Intel® C++ and Fortran Compilers, v7 and higher
  ◦ Microsoft* Visual* C++, v6
  ◦ Microsoft* Visual* C++ .NET* 2002, 2003 & 2005 Editions
    ▪ Integrated into the Microsoft Visual Studio .NET* IDE
• View (drill down to) source code for diagnostics
• One-click help for diagnostics
  ◦ Possible causes and solution suggestions
• API for user-defined synchronization primitives

Thread Checker: Analysis
• Dynamic, as the software runs:
  ◦ Data (workload)-driven execution
• Includes monitoring of:
  ◦ Thread and sync APIs used
  ◦ Thread execution order
    ▪ Scheduler impacts results
  ◦ Memory accesses between threads
• A code path must be executed to be analyzed

Thread Checker: Before You Start
• Instrumentation: background
  ◦ Adds calls to library to record information
    ▪ Thread and sync APIs
    ▪ Memory accesses
  ◦ Increases execution time and size
• Use small data sets (workloads)
  ◦ Execution time and space are expanded
  ◦ Multiple runs over different paths yield best results
• Workload selection is important!

Workload Guidelines
• Execute problem code once per thread to be identified
• Use smallest possible working data set
  ◦ Minimize data set size
    ▪ Smaller image sizes
  ◦ Minimize loop iterations or time steps
    ▪ Simulate minutes rather than days
  ◦ Minimize update rates
    ▪ Lower frames per second
• Finds threading errors faster!

Building for Thread Checker
• Compile
  ◦ Use dynamically linked thread-safe runtime libraries (/MD, /MDd)
  ◦ Generate symbolic information (/Zi, /ZI, /Z7)
  ◦ Disable optimization (/Od)
• Link
  ◦ Preserve symbolic information (/debug)
  ◦ Specify relocatable code sections (/fixed:no)

Binary Instrumentation
• Build with supported compiler
• Running the application
  ◦ Must be run from within Thread Checker
  ◦ Application is instrumented when executed
  ◦ External DLLs are instrumented as used

Source Instrumentation
• Intel® C++ or Fortran Compilers
  ◦ Compile with /Qtcheck
• Running the application
  ◦ Start in VTune™ environment
  ◦ Start from Windows* command line
    ▪ Data collected in threadchecker.thr results file
    ▪ View results (.thr file) in VTune™ environment
    ▪ Additional DLLs not instrumented or analyzed
• More detailed diagnostics

Starting Thread Checker
• Intel® Thread Checker Wizard
• Intel® Thread Profiler Wizard
• Advanced Activity Configuration
• Callouts: 1) Must Select 2) To see these Wizards

Thread Checker Diagnostics

Diagnostics Grouping

Source Code Viewer

Diagnostic Help
• Callouts: 1) Right-click here... 2) More help!

Activity 1a - Potential Energy
• Build and run serial version
• Build threaded version
• Run application in Thread Checker to identify threading problems

Dependence Analysis
Consider the serial code:

    S1: A = 1.0;
    S2: B = A;
    S3: A = 1/3 * (C - D);
    S4: A = (B * 3.8) / 2.7;

• Flow dependence between S1 and S2
  ◦ Value of A updated in S1 is used in S2
• Anti-dependence between S2 and S3
  ◦ Value of A is read in S2 before it is written in S3
• Output dependence between S3 and S4
  ◦ Assignment to A in S3 must occur before the assignment in S4
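As a small illustration of the three dependences, the statements S1 to S4 can be written as a C function that runs them in program order. This is only a sketch: the function name `run_serial` and the `out_b` parameter are hypothetical, and the fraction is written as `1.0 / 3.0` because the integer expression `1/3` evaluates to 0 in C.

```c
#include <assert.h>
#include <stddef.h>

/* Runs S1..S4 in program order. Returns the final value of A and
   stores B through out_b. C and D are inputs. */
double run_serial(double C, double D, double *out_b)
{
    double A, B;

    assert(out_b != NULL);

    A = 1.0;                     /* S1: writes A                        */
    B = A;                       /* S2: reads A   (flow dep. on S1)     */
    A = (1.0 / 3.0) * (C - D);   /* S3: writes A  (anti-dep. on S2)     */
    A = (B * 3.8) / 2.7;         /* S4: writes A  (output dep. on S3)   */

    *out_b = B;
    return A;
}
```

Reordering any dependent pair changes the result. For example, running S3 before S2 would make B read the S3 value instead of 1.0.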

How to Avoid Data Races
Thread Checker Dependencies:
• Output dependence
  ◦ Write-Write conflict: one thread updates a variable that is subsequently updated by another thread
• Anti-dependence
  ◦ Read-Write conflict: one thread reads a variable that is subsequently updated by another thread
• Flow dependence
  ◦ Write-Read conflict: one thread updates a variable that is subsequently read by another thread

Race Conditions
• Execution order is assumed but cannot be guaranteed
  ◦ Concurrent access of same variable by multiple threads
• Most common error in multithreaded programs
• May not be apparent at all times

Solving Race Conditions
Solution: Scope variables to be local to threads
• When to use
  ◦ Value computed is not used outside parallel region
  ◦ Temporary or “work” variables
• How to implement
  ◦ OpenMP scoping clauses (private, shared)
  ◦ Declare variables within threaded functions
  ◦ Allocate variables on thread stack
  ◦ TLS (Thread Local Storage) API
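The "declare variables within threaded functions" option might look like the following sketch. It uses POSIX threads as a portable stand-in for the Windows* threading API used elsewhere in these slides; the names `work_t`, `sum_range`, and `parallel_sum` are hypothetical. Each thread accumulates into a variable on its own stack and writes one result to a slot that no other thread touches.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_THREADS 4
#define N 1000000

typedef struct {
    int       id;       /* thread index, 0..NUM_THREADS-1        */
    long long partial;  /* result slot written only by this thread */
} work_t;

static void *sum_range(void *arg)
{
    work_t *w = (work_t *)arg;
    long long local = 0;                  /* private: lives on this thread's stack */

    for (long long i = w->id; i < N; i += NUM_THREADS)
        local += i;

    w->partial = local;                   /* single write to a private slot */
    return NULL;
}

long long parallel_sum(void)
{
    pthread_t tid[NUM_THREADS];
    work_t    work[NUM_THREADS];
    long long total = 0;

    for (int t = 0; t < NUM_THREADS; t++) {
        work[t].id = t;
        work[t].partial = 0;
        int rc = pthread_create(&tid[t], NULL, sum_range, &work[t]);
        assert(rc == 0);
    }
    for (int t = 0; t < NUM_THREADS; t++) {
        pthread_join(tid[t], NULL);
        total += work[t].partial;         /* combined only after the join */
    }
    return total;
}
```

No lock is needed because the only shared reads happen after `pthread_join`, when the writing thread has already finished.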

Solving Race Conditions
Solution: Control shared access with critical regions
• When to use
  ◦ Value computed is used outside parallel region
  ◦ Shared value is required by each thread
• How to implement
  ◦ Mutual exclusion and synchronization
  ◦ Lock, semaphore, event, critical section, atomic…
  ◦ Rule of thumb: Use one lock per data element
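A minimal sketch of a critical region, again assuming POSIX threads in place of a Windows* CRITICAL_SECTION. The names `shared_count`, `count_lock`, and `run_counter` are hypothetical. One lock guards one data element, following the rule of thumb above.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_THREADS 4
#define INCREMENTS  100000

static long shared_count = 0;                                  /* the shared data element */
static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER; /* its one lock            */

static void *add_counts(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++) {
        pthread_mutex_lock(&count_lock);    /* enter critical region */
        shared_count++;
        pthread_mutex_unlock(&count_lock);  /* leave critical region */
    }
    return NULL;
}

long run_counter(void)
{
    pthread_t tid[NUM_THREADS];

    shared_count = 0;
    for (int t = 0; t < NUM_THREADS; t++) {
        int rc = pthread_create(&tid[t], NULL, add_counts, NULL);
        assert(rc == 0);
    }
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(tid[t], NULL);

    return shared_count;
}
```

Without the lock, the read-modify-write in `shared_count++` could interleave between threads and lose updates.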

Activity 1b - Potential Energy
• Fix errors found by Thread Checker

Implementation Assistant
• When implementing threads
  ◦ Obvious shared and private variables can be identified and handled
  ◦ Should you analyze remaining variables for dependencies?
  ◦ What if parallel code is 100’s of lines long?
  ◦ What about variable use in called functions?
  ◦ Can you tell if pointers refer to same memory location?
• Use Thread Checker as a threading assistant
  ◦ Speculatively insert threading (OpenMP prototype?)
  ◦ Compile and run program in Thread Checker
  ◦ Review diagnostics
  ◦ Update directives and/or restructure
• Let Thread Checker do the “heavy lifting”

Deadlock
• Caused by a thread waiting on some event that will never happen
• Most common cause is locking hierarchies
  ◦ Always lock and unlock in the same order
  ◦ Avoid hierarchies if possible

    DWORD WINAPI threadA(LPVOID arg)
    {
        EnterCriticalSection(&L1);
        EnterCriticalSection(&L2);
        processA(data1, data2);
        LeaveCriticalSection(&L2);
        LeaveCriticalSection(&L1);
        return(0);
    }

    DWORD WINAPI threadB(LPVOID arg)
    {
        EnterCriticalSection(&L2);
        EnterCriticalSection(&L1);
        processB(data2, data1);
        LeaveCriticalSection(&L1);
        LeaveCriticalSection(&L2);
        return(0);
    }

ThreadA: L1, then L2
ThreadB: L2, then L1
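One way to apply "always lock in the same order" to the two threads above is sketched here with POSIX mutexes standing in for the critical sections. The counters `data1` and `data2` and the function names are hypothetical placeholders for the slide's `processA` and `processB` work. Both threads take L1 first and L2 second, so neither can hold one lock while waiting for the other in the opposite order.

```c
#include <assert.h>
#include <pthread.h>

#define ROUNDS 10000

static pthread_mutex_t L1 = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t L2 = PTHREAD_MUTEX_INITIALIZER;
static int data1 = 0, data2 = 0;

static void *thread_a(void *arg)
{
    (void)arg;
    for (int i = 0; i < ROUNDS; i++) {
        pthread_mutex_lock(&L1);      /* L1, then L2 */
        pthread_mutex_lock(&L2);
        data1++; data2++;
        pthread_mutex_unlock(&L2);
        pthread_mutex_unlock(&L1);
    }
    return NULL;
}

static void *thread_b(void *arg)
{
    (void)arg;
    for (int i = 0; i < ROUNDS; i++) {
        pthread_mutex_lock(&L1);      /* same order as thread_a */
        pthread_mutex_lock(&L2);
        data2++; data1++;
        pthread_mutex_unlock(&L2);
        pthread_mutex_unlock(&L1);
    }
    return NULL;
}

int run_ordered_locks(void)
{
    pthread_t a, b;
    int rc;

    data1 = data2 = 0;
    rc = pthread_create(&a, NULL, thread_a, NULL);
    assert(rc == 0);
    rc = pthread_create(&b, NULL, thread_b, NULL);
    assert(rc == 0);
    pthread_join(a, NULL);
    pthread_join(b, NULL);

    return data1 + data2;
}
```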

Deadlock
• Add lock per element
• Lock only elements, not whole array of elements

    typedef struct {
        // some data things
        SomeLockType mutex;
    } shape_t;

    shape_t Q[1024];

    void swap(shape_t A, shape_t B)
    {
        lock(A.mutex);
        lock(B.mutex);
        // Swap data between A & B
        unlock(B.mutex);
        unlock(A.mutex);
    }

• One thread calls swap(Q[986], Q[34]) and grabs mutex 986; another calls swap(Q[34], Q[986]) and grabs mutex 34 (Threads 1 and 4 on the slide)
• Each thread then waits for the mutex the other one holds
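A common remedy for the per-element swap deadlock, not shown on the slide, is to acquire the two element locks in a fixed global order, for example by address. The sketch below assumes POSIX mutexes for `SomeLockType`, passes elements by pointer so the real locks are taken, and uses hypothetical names (`swap_shapes`, `run_swaps`, an integer `value` field as the element data).

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define SWAPS 10000

typedef struct {
    int value;               /* stand-in for "some data things" */
    pthread_mutex_t mutex;
} shape_t;

static shape_t Q[1024];

/* Locks the lower-addressed element first, whatever the argument order. */
void swap_shapes(shape_t *a, shape_t *b)
{
    if (a == b)
        return;

    shape_t *first  = ((uintptr_t)a < (uintptr_t)b) ? a : b;
    shape_t *second = (first == a) ? b : a;

    pthread_mutex_lock(&first->mutex);
    pthread_mutex_lock(&second->mutex);

    int tmp = a->value;
    a->value = b->value;
    b->value = tmp;

    pthread_mutex_unlock(&second->mutex);
    pthread_mutex_unlock(&first->mutex);
}

static void *swap_fwd(void *arg)
{
    (void)arg;
    for (int i = 0; i < SWAPS; i++)
        swap_shapes(&Q[986], &Q[34]);
    return NULL;
}

static void *swap_rev(void *arg)
{
    (void)arg;
    for (int i = 0; i < SWAPS; i++)
        swap_shapes(&Q[34], &Q[986]);
    return NULL;
}

/* Returns 1 if both threads finish and the elements end up unchanged
   (an even number of swaps in total). */
int run_swaps(void)
{
    pthread_t t1, t4;
    int rc;

    for (int i = 0; i < 1024; i++) {
        Q[i].value = i;
        pthread_mutex_init(&Q[i].mutex, NULL);
    }
    rc = pthread_create(&t1, NULL, swap_fwd, NULL);
    assert(rc == 0);
    rc = pthread_create(&t4, NULL, swap_rev, NULL);
    assert(rc == 0);
    pthread_join(t1, NULL);
    pthread_join(t4, NULL);

    int ok = (Q[34].value == 34 && Q[986].value == 986);
    for (int i = 0; i < 1024; i++)
        pthread_mutex_destroy(&Q[i].mutex);
    return ok;
}
```

Ordering by array index would work equally well here; the point is only that every thread agrees on the order.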

Thread Stalls
• Thread waits for an inordinate amount of time
  ◦ Usually on a resource
  ◦ Commonly caused by dangling locks
• Be sure threads release all locks held

What’s Wrong?

    int data;

    DWORD WINAPI threadFunc(LPVOID arg)
    {
        int localData;
        EnterCriticalSection(&lock);
        if (data == DONE_FLAG)
            return(1);
        localData = data;
        LeaveCriticalSection(&lock);
        process(localData);
        return(0);
    }

Lock never released: the early return leaves the critical section held
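A corrected version could release the lock on every path out of the function. The sketch below assumes a POSIX mutex in place of the critical section and drops the Win32 thread signature so it can be called directly; `DONE_FLAG`'s value, `process`, and the small accessor helpers are hypothetical.

```c
#include <assert.h>
#include <pthread.h>

#define DONE_FLAG (-1)   /* assumed sentinel value */

static int data;
static int processed;    /* records what process() last received */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void process(int v) { processed = v; }

int thread_func_fixed(void)
{
    int localData;

    pthread_mutex_lock(&lock);
    if (data == DONE_FLAG) {
        pthread_mutex_unlock(&lock);   /* release before the early return */
        return 1;
    }
    localData = data;
    pthread_mutex_unlock(&lock);

    process(localData);                /* work on the private copy, lock not held */
    return 0;
}

/* Helpers for trying the function out. */
void set_data(int v)      { data = v; }
int  last_processed(void) { return processed; }

int lock_is_free(void)
{
    if (pthread_mutex_trylock(&lock) != 0)
        return 0;                      /* someone still holds it */
    pthread_mutex_unlock(&lock);
    return 1;
}
```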

Activity 2 - Deadlock
• Use Intel® Thread Checker to find and correct the potential deadlock problem.

Thread Safe Routines
• All routines called concurrently from multiple threads must be thread safe
• How to test for thread safety?
  ◦ Use OpenMP and Thread Checker for analysis
    ▪ OpenMP simulator is systematic
    ▪ Use sections to create concurrent execution

Thread Safety Example
• Check for safety issues between
  ◦ Multiple instances of routine1()
  ◦ Instances of routine1() and routine2()
• Set up sections to test all permutations
• Still need to provide data sets that exercise relevant portions of code

    #pragma omp parallel sections
    {
        #pragma omp section
        routine1(&data1);
        #pragma omp section
        routine1(&data2);
        #pragma omp section
        routine2(&data3);
    }

Two Ways to Ensure Thread Safety
• Routines can be written to be reentrant
  ◦ Any variables changed by the routine must be local to each invocation
    ▪ Don’t modify globally shared variables
• Routines can use mutual exclusion to avoid conflicts with other threads
  ◦ If accessing shared variables cannot be avoided
• What if third-party libraries are not thread safe?
  ◦ Will likely need to control threads’ access to the library
• It is better to make a routine reentrant than to add synchronization
  ◦ Avoids potential overhead
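The difference between a non-reentrant and a reentrant routine can be sketched with a tiny ID generator. All names here are hypothetical, and POSIX threads are assumed. The unsafe version keeps its state in a static variable shared by every caller; the reentrant version takes its state from the caller, so each thread can pass its own.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_THREADS 4
#define CALLS       1000

/* Not reentrant: state lives in a static shared by every caller. */
int next_id_unsafe(void)
{
    static int last = 0;
    return ++last;
}

/* Reentrant: the caller owns the state and passes it in. */
int next_id_r(int *last)
{
    return ++(*last);
}

static void *use_ids(void *arg)
{
    int *final_id = (int *)arg;
    int state = 0;                     /* private to this thread */
    int id = 0;

    for (int i = 0; i < CALLS; i++)
        id = next_id_r(&state);

    *final_id = id;
    return NULL;
}

/* Returns 1 if every thread saw its own sequence 1..CALLS. */
int run_reentrant(void)
{
    pthread_t tid[NUM_THREADS];
    int final_id[NUM_THREADS];
    int ok = 1;

    for (int t = 0; t < NUM_THREADS; t++) {
        int rc = pthread_create(&tid[t], NULL, use_ids, &final_id[t]);
        assert(rc == 0);
    }
    for (int t = 0; t < NUM_THREADS; t++) {
        pthread_join(tid[t], NULL);
        if (final_id[t] != CALLS)
            ok = 0;
    }
    return ok;
}
```

Calling `next_id_unsafe` from several threads would need a lock around the static, which is the overhead the last bullet refers to.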

Activity 3 – Thread Safety
• Use OpenMP framework to call library routines concurrently
  ◦ Three library calls = 6 combinations to test
    ▪ A:A, B:B, C:C, A:B, A:C, B:C
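The count of six comes from pairing each routine with itself and with every other routine, ignoring order: n(n+1)/2 pairs for n routines. A short sketch that enumerates them follows; `list_pairs` is a hypothetical helper, and indices 0, 1, 2 stand for routines A, B, C.

```c
#include <assert.h>

/* Fills pairs[] with every unordered pair (i, j), i <= j, of n routines.
   The caller must provide room for n * (n + 1) / 2 entries.
   Returns the number of pairs written. */
int list_pairs(int n, int pairs[][2])
{
    int count = 0;

    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        for (int j = i; j < n; j++) {   /* j starts at i: includes A:A, skips B:A */
            pairs[count][0] = i;
            pairs[count][1] = j;
            count++;
        }
    }
    return count;
}
```

For n = 3 the pairs come out in the order A:A, A:B, A:C, B:B, B:C, C:C.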

Instrumentation Levels
• Higher levels increase memory usage and analysis time, but provide more details
• Binary instrumentation lowers level from default until successful
• Manually adjust level of instrumentation to increase speed or control amount of information gathered

Instrumentation Level: Description
Full Image: Each instruction in the module is instrumented to be checked to see if it might generate a diagnostic message.
Custom Image: Same as “Full Image” except user can disable selected functions from instrumentation.
All Functions: Turns on full instrumentation for those parts of a module that were compiled with debugging information.
Custom Functions: Same as “All Functions” except user can disable selected functions from instrumentation.
API Imports: Only system API functions that are needed to be instrumented by the tool will be instrumented. No user code is instrumented.
Module Imports: Disables instrumentation. This is default on system images, images without base relocations, and images not containing debug information.

Large Diagnostics Counts
• What do you do if you have 5000 diagnostics?
• Where do you begin debugging?
• Are all the diagnostic messages equally important/serious?
• Suggestions for organizing and prioritizing
  ◦ Add “1st Access” column
  ◦ Group by “1st Access”
  ◦ Sort by “Short Description” column

Large Diagnostics Counts
• Groups errors reported for the same source line; each group can be seen as the same issue

Large Diagnostics Counts
• Sort on the “Short Description” column

Summary
• Threading errors are easy to introduce
• Debugging these errors by traditional techniques is hard
• Intel® Thread Checker catches these errors
  ◦ Errors do not have to occur to be detected
  ◦ Greatly reduces debugging time
  ◦ Improves robustness of the application

Course Description

Learning Objectives
After successful completion of this module you will be able to:
• Use Thread Profiler to recognize and fix common performance problems in applications using Windows* threads

Agenda
• Look at Intel® Thread Profiler features
• Define Critical Path Analysis
• Examine Thread Profiler data views available
• Review common performance issues of multithreaded applications
  ◦ Focus on load imbalance
  ◦ Focus on synchronization contention
• Describe general optimizations to gain better performance

Motivation
• Developing efficient multithreaded applications is hard
• New performance problems are caused by the interaction between concurrent threads:
  ◦ Load imbalance
  ◦ Contention on synchronization objects
  ◦ Threading overhead

Intel® Thread Profiler Features
• Supports several different compilers
  ◦ Intel® C++ and Fortran Compilers, v7 and higher
  ◦ Microsoft* Visual* C++, v6
  ◦ Microsoft* Visual* C++ .NET* 2002, 2003 & 2005 Editions
    ▪ Integrated into the Microsoft Visual Studio .NET* IDE
• Binary instrumentation of applications
• Different views and filters available to assist and organize analysis
• Uses critical path analysis

What is the Critical Path?
• Threaded applications contain multiple execution flows:
  ◦ A new flow is created when a thread is created or resumes
  ◦ A flow ends when a thread terminates or blocks on a synchronization primitive
• The critical path is the longest execution flow

Critical Path Analysis
• System utilization
  ◦ Relative to the system executing the application
• Thread interaction categories
• If the critical path is shortened, the application will run in less time

System Utilization
• Examines processor utilization to determine concurrency level of the application
• Concurrency is the number of active threads
• Categorization shown for a system configuration with 2 processors

Execution Time Categories
• Analyze thread interaction and behavior along critical path
• Record objects that cause critical path (CP) transitions
• Categorization shown for a system configuration with 2 processors

Merging Concurrency and Behavior
• Start with system utilization
• Further categorize by behavior
• Figure labels: Concurrency Level, Critical Path, Thread Behavior, Time

Thread Profiler Views
• Critical Path View
  ◦ Shows breakdown of the critical path
• Profile View
  ◦ Shows the breakdown of selected critical paths
  ◦ User can select other views of the selected profile
  ◦ Concurrency level, threads, objects
• Timeline View
  ◦ Shows thread activity and critical path transitions for the entire application
• Source View
  ◦ Transition source view, creation source view

Activity 1a
• Threaded version of potential code
  ◦ Is there a performance issue?
• Goal
  ◦ Run application through Thread Profiler
  ◦ Examine thread activities by reviewing different views

Thread Profiler Profile View
• Profile Pane
• Timeline Pane

Profile Pane – Concurrency Level View
• Two threads ran in parallel ~33% of the time
• Ran single threaded ~65% of the time
• Let’s look at the Thread View

Profile Pane – Thread View
• Time on the critical path
• Active time of the thread
• Lifetime of the thread
• Let’s look at the Object View

Profile Pane – Object View
• This object caused all of the impact
• Let’s look at Timeline View

Timeline Pane

Source View

Activity 1b
• Threaded version of potential code
  ◦ Is there a performance issue?
• Goal
  ◦ Examine thread activities by reviewing different views
  ◦ Determine system utilization
  ◦ Identify any performance issues

Review Activity 1
• Concurrency Level view can be used to determine system utilization by the application
• Timeline view enables you to understand the thread activity in your application
• Instrumentation time will be included in first-run results; thus, for applications that run for only a short time, a second run may produce more realistic timings

Common Performance Issues
• Load balance
  ◦ Improper distribution of parallel work
• Synchronization
  ◦ Excessive use of global data, contention for the same synchronization object
• Parallel overhead
  ◦ Due to thread creation, scheduling, etc.
• Granularity
  ◦ Not enough parallel work

Load Imbalance
• Figure: busy and idle time for Threads 0 to 3 between “Start threads” and “Join threads”

Redistribute Work to Threads
Static assignment
• Are the same number of tasks assigned to each thread?
• Do tasks take different processing time?
  ◦ Do tasks change in a predictable pattern?
    ▪ Rearrange (static) order of assignment to threads
  ◦ Use dynamic assignment of tasks

Redistribute Work to Threads
Dynamic assignment
• Is there one big task being assigned?
  ◦ Break up large task into smaller parts
• Are small computations agglomerated into larger task?
  ◦ Adjust number of computations in a task
  ◦ More small computations into single task?
  ◦ Fewer small computations into single task?
  ◦ Bin packing heuristics
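Dynamic assignment can be sketched with a shared task counter that each thread advances when it is ready for more work, so faster threads naturally take more tasks. This is an illustrative assumption, not the method used in the course labs; it relies on POSIX threads and C11 atomics, and the names `next_task`, `done`, and `run_dynamic` are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define NUM_THREADS 4
#define NUM_TASKS   1000

static atomic_int next_task;      /* index of the next unclaimed task */
static int done[NUM_TASKS];       /* how many times each task ran     */

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        /* Atomically claim one task index; no two threads get the same one. */
        int t = atomic_fetch_add(&next_task, 1);
        if (t >= NUM_TASKS)
            break;
        done[t]++;                /* stand-in for the real task body */
    }
    return NULL;
}

/* Returns the number of tasks that ran exactly once. */
int run_dynamic(void)
{
    pthread_t tid[NUM_THREADS];
    int count = 0;

    atomic_store(&next_task, 0);
    for (int i = 0; i < NUM_TASKS; i++)
        done[i] = 0;

    for (int t = 0; t < NUM_THREADS; t++) {
        int rc = pthread_create(&tid[t], NULL, worker, NULL);
        assert(rc == 0);
    }
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(tid[t], NULL);

    for (int i = 0; i < NUM_TASKS; i++)
        if (done[i] == 1)
            count++;
    return count;
}
```

Claiming several indices per fetch is one way to adjust the number of computations per task, at the cost of coarser balancing.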

Unbalanced Workloads
• Threads are unbalanced
• Active times are not equal

Activity 2 – Load Imbalance
• Threaded version of potential code with thread pools
  ◦ Has a load balance performance issue

Review Activity 2
• Threads view can be used to determine activity levels of each thread within the application
• Timeline view enables you to understand the thread activity in your application

Synchronization
• By definition, synchronization serializes execution
• Lock contention means more idle time for threads
• Figure: busy, idle, and in-critical time for Threads 0 to 3

Synchronization Fixes
• Eliminate synchronization
  ◦ Expensive but necessary “evil”
  ◦ Use storage local to threads
    ▪ Use local variable for partial results, update global after local computations
    ▪ Allocate space on thread stack (alloca)
    ▪ Use thread-local storage API (TlsAlloc)
  ◦ Use atomic updates whenever possible
    ▪ Some global data updates can use atomic operations (Interlocked API family)

Atomic Updates
• Use Win32 Interlocked* intrinsics in place of synchronization object

    static long counter;

    // Fast
    InterlockedIncrement(&counter);

    // Slower
    EnterCriticalSection(&cs);
    counter++;
    LeaveCriticalSection(&cs);
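For readers without the Win32 headers, roughly the same idea can be sketched with C11 atomics, where `atomic_fetch_add` plays the role of `InterlockedIncrement`. This is an assumed portable equivalent, not part of the original slides; POSIX threads are used and the names `bump` and `run_atomic_counter` are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define NUM_THREADS 4
#define INCREMENTS  100000

static atomic_long counter;

static void *bump(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++)
        atomic_fetch_add(&counter, 1);   /* one indivisible read-modify-write */
    return NULL;
}

long run_atomic_counter(void)
{
    pthread_t tid[NUM_THREADS];

    atomic_store(&counter, 0);
    for (int t = 0; t < NUM_THREADS; t++) {
        int rc = pthread_create(&tid[t], NULL, bump, NULL);
        assert(rc == 0);
    }
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(tid[t], NULL);

    return atomic_load(&counter);
}
```

No lock object is created, acquired, or released, which is where the speed advantage on the slide comes from.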

Synchronization Fixes
• Reduce size of critical regions protected by synchronization object
  ◦ Larger critical regions tie up sync objects longer; other threads sit idle longer waiting to acquire objects
  ◦ Only accesses to shared variables need to be protected

Synchronization Fixes
• Use best synchronization object for job
• Critical Section
  ◦ Local object
  ◦ Available to threads within the same process
  ◦ Lower overhead (~8X faster than mutex)
• Mutex
  ◦ Kernel object
  ◦ Accessible to threads within different processes
  ◦ Deadlock safety (can only be released by owner)
• Other objects are available

Object Contention
• Callouts: These four threads… …are impacted by this object

Activity 3
• Threaded version of numerical integration
  ◦ Has serious performance issues
• Goal
  ◦ Understand thread activity
  ◦ Use the Thread Profiler groupings
  ◦ Examine synchronization and its effect on performance
  ◦ Fix performance issue

Review Activity 3
• Grouping objects and threads shows which objects impact which threads
• Apply the heuristics from labs for locating bottlenecks in the source code
• For longer running applications, the difference between first and second run times is negligible

General Optimizations
• Serial optimizations
  ◦ Serial optimizations along the critical path should affect execution time
• Parallel optimizations
  ◦ Reduce synchronization object contention
  ◦ Balance workload
  ◦ Functional parallelism
• Analyze benefit of increasing number of processors
• Analyze the effect of increasing the number of threads on scaling performance

Summary
• Identifying performance issues can be time consuming without tools
• Tools are required to understand and to optimize parallel efficiency and hardware utilization
• Thread Profiler helps you understand your application’s thread activity, system utilization, and scaling performance