Intel® Threading Building Blocks
Software and Services Group

Agenda
−Overview
−Intel® Threading Building Blocks
 >Parallel Algorithms
 >Task Scheduler
 >Concurrent Containers
 >Sync Primitives
 >Memory Allocator
−Summary

Intel and the Intel logo are trademarks of Intel Corporation in the United States and other countries
Multi-Core is Mainstream

Gaining performance from multi-core requires parallel programming.
Multi-threading is used to:
−Reduce or hide latency
−Increase throughput
Going Parallel

From a typical serial C++ program to an ideal parallel C++ program, and the issues on the way:
−Algorithms → Parallel algorithms. Parallelizing from scratch requires many code changes; it often takes a threading expert to get it right.
−Data structures → Thread-safe, scalable data structures. Serial data structures usually require global locks to make operations thread-safe.
−Dependencies → A minimum of dependencies and efficient use of synchronization primitives. Too many dependencies mean expensive synchronization and poor parallel performance.
−Memory management → A scalable memory manager. The standard memory allocator is often inefficient in a multi-threaded app.
Intel® Threading Building Blocks

−Generic Parallel Algorithms: an efficient, scalable way to exploit the power of multi-core without having to start from scratch
−Task scheduler: the engine that powers the parallel algorithms, employing task stealing to maximize concurrency
−Concurrent Containers: common idioms for concurrent access, a scalable alternative to a serial container with a lock around it
−Synchronization Primitives: user-level and OS wrappers for mutual exclusion, ranging from atomic operations to several flavors of mutexes and condition variables
−Memory Allocation: per-thread scalable memory manager and false-sharing-free allocators
−Threads: OS API wrappers
−Thread Local Storage: scalable implementation of thread-local data that supports an unlimited number of thread-local values
−Miscellaneous: thread-safe timers
−TBB Flow Graph – New!
Intel® Threading Building Blocks: Extend C++ for Parallelism

A portable C++ runtime library that handles thread management, letting developers focus on proven parallel patterns.
−Scalable
−Composable
−Flexible
−Portable

Both GPL and commercial licenses are available: http://threadingbuildingblocks.org
*Other names and brands may be claimed as the property of others
Intel® Threading Building Blocks: Parallel Algorithms
Generic Parallel Algorithms

Loop parallelization: parallel_for, parallel_reduce, parallel_scan
>Load-balanced parallel execution of a fixed number of independent loop iterations

Parallel algorithms for streams: parallel_do, parallel_for_each, pipeline / parallel_pipeline
>Use for an unstructured stream or pile of work

Parallel function invocation: parallel_invoke
>Parallel execution of a number of user-specified functions

Parallel sort: parallel_sort
>Comparison sort with an average time complexity of O(N log N)
Parallel Algorithm Usage Example

#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
using namespace tbb;

class ChangeArray {
    int* array;
public:
    ChangeArray(int* a) : array(a) {}
    void operator()(const blocked_range<int>& r) const {
        for (int i = r.begin(); i != r.end(); i++) {
            Foo(array[i]);
        }
    }
};

void ChangeArrayParallel(int* a, int n) {
    parallel_for(blocked_range<int>(0, n), ChangeArray(a));
}

int main() {
    int A[N]; // initialize array here…
    ChangeArrayParallel(A, N);
    return 0;
}

−The ChangeArray class defines the for-loop body for parallel_for
−blocked_range is a TBB template representing a 1D iteration space
−As usual with C++ function objects, the main work is done inside operator()
−parallel_for is a template function called with a Range (blocked_range) and a Body (ChangeArray)
Range splitting: tasks available to thieves

parallel_for(Range(Data), Body(), Partitioner());

The range [Data, Data+N) is split recursively into [Data, Data+N/2) and [Data+N/2, Data+N), then into [Data, Data+N/k), and so on, down to subranges of roughly [Data, Data+GrainSize). Each subrange becomes a task available to thieves.
Two Execution Orders

Depth first (stack): small space, excellent cache locality, but no parallelism.
Breadth first (queue): large space, poor cache locality, but maximum parallelism.
Work Depth First; Steal Breadth First

Each thread works depth-first on its own tasks (hot in its L1/L2 caches) while thieves steal breadth-first from the other end of the victim's deque. The best choice for theft is the biggest piece of work, whose data is farthest from the victim's hot data; the next-oldest task is the second-best choice.
C++0x Lambda Expression Support

The parallel_for example transforms into:

#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
using namespace tbb;

void ChangeArrayParallel(int* a, int n) {
    parallel_for(0, n, 1, [=](int i) {
        Foo(a[i]);
    });
}

int main() {
    int A[N]; // initialize array here…
    ChangeArrayParallel(A, N);
    return 0;
}

−The lambda captures variables by value ([=]) from the surrounding scope to completely mimic the non-lambda implementation; [&] could be used to capture by reference instead
−The lambda expression implements the body's operator() right inside the call to parallel_for()
−parallel_for has an overload that takes start, stop, and step arguments and constructs a blocked_range internally
Functional parallelism has never been easier

int main(int argc, char* argv[]) {
    spin_mutex m;
    int a = 1, b = 2;
    parallel_invoke(
        foo,
        [a, b, &m]() { bar(a, b, m); },
        [&m]() {
            for (int i = 0; i < K; ++i) {
                spin_mutex::scoped_lock l(m);
                cout << i << endl;
            }
        },
        [&m]() {
            parallel_for(0, N, 1, [&m](int i) {
                spin_mutex::scoped_lock l(m);
                cout << i << " ";
            });
        });
    return 0;
}

−foo is a plain void() function handle; the first lambda wraps a call to bar(a, b, m)
−The for-loop lambda is a serial, thread-safe job executed in parallel with the three other functions
−The last lambda is itself a parallel job, also executed in parallel with the others
Now imagine writing all this code with just plain threads.
Strongly-typed parallel_pipeline

float RootMeanSquare(float* first, float* last) {
    float sum = 0;
    parallel_pipeline(/*max_number_of_tokens=*/16,
        make_filter<void, float*>(
            filter::serial,
            [&](flow_control& fc) -> float* {
                if (first < last) {
                    return first++;
                } else {
                    fc.stop(); // stop processing
                    return NULL;
                }
            })
        & make_filter<float*, float>(
            filter::parallel,
            [](float* p) { return (*p) * (*p); })
        & make_filter<float, void>(
            filter::serial,
            [&sum](float x) { sum += x; })
    );
    /* sum = first[0]^2 + first[1]^2 + … + last[-1]^2, computed in parallel */
    return sqrt(sum);
}

−Call the function tbb::parallel_pipeline to run the pipeline stages (filters)
−Create each pipeline stage with tbb::make_filter<InputDataType, OutputDataType>(mode, body)
−A stage's mode can be serial, parallel, serial_in_order, or serial_out_of_order
−The three stages here are: void → float* (get the next float), float* → float (square it), and float → void (sum += float²)
Intel® Threading Building Blocks: Task Scheduler
Task Scheduler

−The task scheduler is the engine driving Intel® Threading Building Blocks
−It manages the thread pool, hiding the complexity of native thread management
−It maps logical tasks to threads
−The parallel algorithms are built on the task scheduler interface
−The task scheduler is designed to address common performance issues of parallel programming with native threads:

Problem → Intel® TBB approach
−Oversubscription → one scheduler thread per hardware thread
−Fair scheduling → non-preemptive, unfair scheduling
−High overhead → the programmer specifies tasks, not threads
−Load imbalance → work stealing balances load
A logical task is just a C++ class:
−Derive from the tbb::task class
−Implement the execute() member function
−Create and spawn a root task and your tasks
−Wait for the tasks to finish

#include "tbb/task_scheduler_init.h"
#include "tbb/task.h"
using namespace tbb;

class ThisIsATask : public task {
public:
    task* execute() {
        WORK();
        return NULL;
    }
};
Task Tree Example

−Yellow arrows show the creation sequence; black arrows show task dependency
−A root task creates child1 and child2, which may run on different threads (Thread 1, Thread 2), then calls wait_for_all()
−Intel® TBB wait calls don't block the calling thread; they block only the task. An Intel TBB worker thread keeps stealing tasks while waiting
Intel® Threading Building Blocks: Concurrent Containers
Concurrent Containers

Intel® TBB provides highly concurrent containers:
−STL containers are not concurrency-friendly: an attempt to modify them concurrently can corrupt the container
−Wrapping a lock around an STL container turns it into a serial bottleneck, and still does not always guarantee thread safety
>STL containers are inherently not thread-safe

Intel TBB provides fine-grained locking or lockless implementations:
−Worse single-thread performance, but better scalability
−Can be used with the library, OpenMP*, or native threads
*Other names and brands may be claimed as the property of others
Concurrent Containers Key Features

−concurrent_hash_map
>Models a hash table of std::pair<const Key, T> elements
−concurrent_unordered_map
>Permits concurrent traversal and insertion (no concurrent erasure)
>Requires no visible locking; looks similar to the STL interfaces
−concurrent_vector
>Dynamically growable array of T: grow_by and grow_to_at_least
−concurrent_queue
>For a single-threaded run, concurrent_queue supports regular first-in-first-out ordering
>If one thread pushes two values and another thread pops those two values, they come out in the order they were pushed
−concurrent_bounded_queue
>Similar to concurrent_queue, but allows specifying a capacity; once the capacity is reached, push waits until other elements are popped before it can continue
−concurrent_priority_queue
>Similar to std::priority_queue, with scalable push and pop operations
Example: concurrent_hash_map

struct wordsCompare {
    bool equal(const string& w1, const string& w2) const {
        return w1 == w2;
    }
    size_t hash(const string& w) const {
        size_t h = 0;
        for (const char* s = w.c_str(); *s; s++)
            h = (h * 16777179) ^ *s;
        return h;
    }
};

void ParallelWordsCounting(const text_t& text) {
    parallel_for(blocked_range<size_t>(0, text.size()),
        [&text](const blocked_range<size_t>& r) {
            for (size_t i = r.begin(); i < r.end(); ++i) {
                concurrent_hash_map<string, int, wordsCompare>::accessor acc;
                wordCounters.insert(acc, text[i]);
                acc->second++;
            }
        });
}

−The user-defined "HashCompare" class implements a function for comparing two keys and a hashing function
−An element of a concurrent_hash_map is accessed by creating an "accessor" object, which acts as a smart pointer implementing the necessary data-access synchronization
Hash-map Examples

Concurrent operation support (TBB concurrent_unordered_map / TBB concurrent_hash_map / STL map):
−Traversal: Yes / No / No
−Insertion: Yes / Yes / No
−Erasure: No / Yes / No
−Search: Yes / Yes / No

#include <map>
typedef std::map<std::string, int> StringTable;
for (std::string* p = range.begin(); p != range.end(); ++p) {
    tbb::spin_mutex::scoped_lock lock(global_lock);
    table[*p] += 1;
}

#include "tbb/concurrent_hash_map.h"
typedef tbb::concurrent_hash_map<std::string, int> StringTable;
for (std::string* p = range.begin(); p != range.end(); ++p) {
    StringTable::accessor a; // local lock
    table.insert(a, *p);
    a->second += 1;
}

#include "tbb/concurrent_unordered_map.h"
typedef tbb::concurrent_unordered_map<std::string, tbb::atomic<int>> StringTable;
for (std::string* p = range.begin(); p != range.end(); ++p) {
    table[*p] += 1; // similar to STL, but the value is tbb::atomic<int>
}
Intel® Threading Building Blocks: Synchronization Primitives
Synchronization Primitives

Mutex traits (Scalable / Fair / Reentrant / Sleeps):
−mutex: OS dependent / OS dependent / No / Yes
−spin_mutex: No / No / No / No
−queuing_mutex: Yes / Yes / No / No
−spin_rw_mutex: No / No / No / No
−queuing_rw_mutex: Yes / Yes / No / No
−recursive_mutex: OS dependent / OS dependent / Yes / Yes
Synchronization Primitives Features

Atomic operations
−High-level abstractions

Exception-safe locks
−spin_mutex is VERY FAST in lightly contended situations; use it if you need to protect very few instructions
−Use queuing_rw_mutex when scalability and fairness are important
−Use recursive_mutex when your threading model requires that one thread can re-acquire a lock (all of a thread's locks must be released before another thread can acquire the lock)
−Use a reader-writer mutex to allow non-blocking reads by multiple threads

Portable condition variables
Example: spin_rw_mutex

#include "tbb/spin_rw_mutex.h"
using namespace tbb;

spin_rw_mutex MyMutex;

int foo() {
    // Construction of 'lock' acquires 'MyMutex'
    spin_rw_mutex::scoped_lock lock(MyMutex, /*is_writer*/ false);
    …
    if (!lock.upgrade_to_writer()) {
        …
    } else {
        …
    }
    return 0;
    // Destructor of 'lock' releases 'MyMutex'
}

−If an exception occurs within the protected code block, the destructor automatically releases the lock if it is held, avoiding a deadlock
−Any reader lock may be upgraded to a writer lock; upgrade_to_writer indicates whether the lock had to be released before it was upgraded
Intel® Threading Building Blocks: Scalable Memory Allocator
Scalable Memory Allocation

Problem
−Memory allocation is a bottleneck in a concurrent environment: threads acquire a global lock to allocate and deallocate memory from the global heap

Solution
−Intel® Threading Building Blocks provides a tested, tuned, and scalable memory allocator optimized for all object sizes:
>Manual and automatic replacement of memory-management calls
>A C++ interface for use as an underlying allocator for C++ objects (e.g. STL containers)
>Scalable memory pools
Memory API Calls Replacement

Manual
−Change your code to call Intel® TBB scalable_malloc/scalable_free instead of malloc and free
−Use the scalable_* API to implement operators new and delete
−Use tbb::scalable_allocator as an underlying allocator for C++ objects (e.g. STL containers)

Automatic (Windows* and Linux*)
−Requires no code changes; just re-link your binaries using the proxy libraries
>Linux*: libtbbmalloc_proxy.so.2 or libtbbmalloc_proxy_debug.so.2
>Windows*: tbbmalloc_proxy.dll or tbbmalloc_debug_proxy.dll
C++ Allocator Template

Use tbb::scalable_allocator as an underlying allocator for C++ objects.

Example:
// STL container used with the Intel® TBB scalable allocator
std::vector<int, tbb::scalable_allocator<int>> myVector;
Scalable Memory Pools

Allocate memory from a growable pool:

#include "tbb/memory_pool.h"
...
tbb::memory_pool<tbb::scalable_allocator<char>> my_pool;
void* my_ptr = my_pool.malloc(10);
void* my_ptr_2 = my_pool.malloc(20);
…
my_pool.recycle(); // the destructor also frees everything

Allocate and free from a fixed-size buffer:

#include "tbb/memory_pool.h"
...
char buf[1024*1024];
tbb::fixed_pool my_pool(buf, 1024*1024);
void* my_ptr = my_pool.malloc(10);
my_pool.free(my_ptr);
Scalable Memory Allocator Structure
Intel® TBB Memory Allocator Internals

Small blocks
−Per-thread memory pools

Large blocks
−Treat memory as "objects" of fixed size, not as ranges of address space; typically several dozen (or fewer) object sizes are in active use
−Keep released memory objects in a pool and reuse them when an object of that size is requested
−Pooled objects "age" over time; the cleanup threshold varies for different object sizes
−Low fragmentation is achieved using segregated free lists

The Intel TBB scalable memory allocator is designed for multi-threaded apps and optimized for multi-core.
Summary: Intel® Threading Building Blocks

−Generic Parallel Algorithms: an efficient, scalable way to exploit the power of multi-core without having to start from scratch
−Task scheduler: the engine that powers the parallel algorithms, employing task stealing to maximize concurrency
−Concurrent Containers: common idioms for concurrent access, a scalable alternative to a serial container with a lock around it
−Synchronization Primitives: user-level and OS wrappers for mutual exclusion, ranging from atomic operations to several flavors of mutexes and condition variables
−Memory Allocation: per-thread scalable memory manager and false-sharing-free allocators
−Threads: OS API wrappers
−Thread Local Storage: scalable implementation of thread-local data that supports an unlimited number of thread-local values
−Miscellaneous: thread-safe timers
−TBB Flow Graph
Supplementary Links

Commercial product web page: www.intel.com/software/products/tbb
Open source web portal: www.threadingbuildingblocks.org

Knowledge base, blogs and user forums:
−http://software.intel.com/en-us/articles/intel-threading-building-blocks/all/1
−http://software.intel.com/en-us/blogs/category/osstbb/
−http://software.intel.com/en-us/forums/intel-threading-building-blocks

Technical articles:
−"Demystify Scalable Parallelism with Intel Threading Building Blocks' Generic Parallel Algorithms": http://www.devx.com/cplus/Article/32935
−"Enable Safe, Scalable Parallelism with Intel Threading Building Blocks' Concurrent Containers": http://www.devx.com/cplus/Article/33334

Industry articles:
−Product review: Intel Threading Building Blocks: http://www.devx.com/go-parallel/Article/33270
−"The Concurrency Revolution", Herb Sutter, Dr. Dobb's, 1/19/2005: http://www.ddj.com/dept/cpp/184401916