
1 virtual techdays INDIA │ 18-20 august 2010 Parallelize applications using Intel Threading Building Blocks Om Sachan │ SSG, Intel Corporation

2 virtual techdays INDIA │ 18-20 august 2010 SESSION AGENDA
Intel® Threading Building Blocks overview
Generic Parallel Algorithms
Lab: Parallelize a serial application
Generic Concurrent Containers
Synchronization Primitives
Advanced Features Overview
Summary

3 virtual techdays INDIA │ 18-20 august 2010 Intel® Threading Building Blocks Overview
Enables you to specify tasks instead of threads: the library automatically maps tasks onto physical threads in a way that makes efficient use of processor resources.
Targets threading for performance: a solution for parallelizing computationally intensive work units while preserving good scalability across various hardware.
Compatible with other threading packages: works well for CPU-bound tasks, not I/O-bound ones; coexists with other threading packages.
Emphasizes scalable, data-parallel programming: scales well as the number of processors grows.
Relies on generic programming: the set of templates implemented in Intel® TBB allows writing flexible algorithms.

4 virtual techdays INDIA │ 18-20 august 2010 Intel® Threading Building Blocks Overview
– Supported platforms: IA-32, Intel® 64; Parallel Studio
– Product package includes: dynamic libraries (debug and release), header files, sample code, documentation (tutorial, getting started guide, reference)
– Intel® TBB is a set of generic algorithms and data structures (C++ templates)

Trivial Intel® TBB program:

#include "tbb/task_scheduler_init.h"
using namespace tbb;

int main () {
    task_scheduler_init TBB_Init;   // initializes the task scheduler
    return 0;
}

All public classes and functions are in the tbb namespace. The library requires explicit initialization: at least one task_scheduler_init object must be active.

5 virtual techdays INDIA │ 18-20 august 2010 Intel® Threading Building Blocks Usage Model
Algorithms and data structures that operate on concepts
– A concept is a set of requirements on a type
– A type models a concept
– Your program defines the types required by Intel® TBB constructs
Parallel Generic Algorithms and Concurrent Containers
– C++ programming experience plus basic STL and threading knowledge are enough to get started. No need to be a threading expert.
Task Scheduler
– The engine that powers the Generic Parallel Algorithms and hides the complexity of task management. The Task Scheduler may be used directly for advanced programming when your algorithm does not naturally map onto one of the pre-packaged Parallel Algorithms. Threaded programming and tuning experience are required.
Synchronization Primitives
– These objects should be used carefully: inappropriate use of synchronization may lead to performance and correctness issues. Solid threaded programming and tuning experience are required.

6 virtual techdays INDIA │ 18-20 august 2010 Intel® Threading Building Blocks Generic Parallel Algorithms

7 virtual techdays INDIA │ 18-20 august 2010 Intel® Threading Building Blocks Generic Parallel Algorithms: Basic Concepts

Splittable Concept
– A type X is splittable if it has a constructor that allows an instance to be split into two pieces
X::X (X&, split)                 Splitting constructor; splits x into x and y

Range Concept
– A type R represents a recursively divisible set of values; it must model the Splittable Concept
R::R (const R&)                  Copy constructor
R::~R ()                         Destructor
bool R::is_divisible() const     Returns true if the range can be partitioned into two sub-ranges
bool R::empty() const            Returns true if the range is empty
R::R (R&, split)                 Splitting constructor
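A minimal sketch of a user-defined type that models the Range Concept, assuming a simple 1D integer range; the class name TrivialIntRange and its grain-size handling are illustrative only, and in practice tbb::blocked_range<int> already provides exactly this:

#include "tbb/blocked_range.h"   // brings in tbb::split

// Hypothetical range, shown only to illustrate the Range Concept.
class TrivialIntRange {
    int my_begin, my_end, my_grain;
public:
    TrivialIntRange (int b, int e, int grain = 1)
        : my_begin(b), my_end(e), my_grain(grain) {}

    // Splitting constructor: takes roughly the upper half of r, leaves the rest in r
    TrivialIntRange (TrivialIntRange& r, tbb::split)
        : my_begin(r.my_begin + (r.my_end - r.my_begin) / 2),
          my_end(r.my_end),
          my_grain(r.my_grain)
    { r.my_end = my_begin; }

    bool empty() const        { return my_begin == my_end; }
    bool is_divisible() const { return my_end - my_begin > my_grain; }

    int begin() const { return my_begin; }
    int end()   const { return my_end; }
};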

8 virtual techdays INDIA │ 18-20 august 2010 Intel® Threading Building Blocks Generic Parallel Algorithms: parallel_for Template Function

#include "tbb/parallel_for.h"
template<typename Range, typename Body>
void parallel_for (const Range& range, const Body& body);
– represents parallel execution of Body over each value in the Range
– the Range type must model the Intel® Threading Building Blocks Range Concept described on the previous slide

parallel_for Body Concept requirements
Body::Body (const Body&)               Copy constructor
Body::~Body ()                         Destructor
void Body::operator() (Range&) const   Apply Body to Range

9 virtual techdays INDIA │ 18-20 august 2010 Intel® Threading Building Blocks Example: Parallelizing Simple Loops
Task: loop over a fixed-size array of elements and apply a function to each of them (the iterations are independent)

Serial version of the solution:

const int N = 20000000;

void ChangeArraySerial (int* array, int M) {
    for (int i = 0; i < M; i++) {
        array[i] *= 2;
    }
}

int main () {
    static int A[N];   // static so the large array does not overflow the stack
    for (int i = 0; i < N; i++) { A[i] = i; }
    ChangeArraySerial (A, N);
    return 0;
}

10 virtual techdays INDIA │ 18-20 august 2010 Intel® Threading Building Blocks
Parallel solution with Intel® TBB: using parallel_for

#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
using namespace tbb;

const int N = 20000000;
const int IdealGrainSize = 10000;   // placeholder value; the slide leaves it blank -- experiment with the grain size

class ChangeArray {                  // models the parallel_for Body Concept
    int* array;
public:
    ChangeArray (int* a): array(a) {}
    void operator()( const blocked_range<int>& r ) const {
        for (int i = r.begin(); i != r.end(); i++) {   // apply the change to each array element
            array[i] *= 2;
        }
    }
};

void ChangeArrayParallel (int* a, int n) {
    // blocked_range<int> is a pre-packaged 1D iteration space that models the Range Concept
    parallel_for (blocked_range<int>(0, n, IdealGrainSize), ChangeArray(a));
}

int main () {
    static int A[N];
    // initialize tbb, array here…
    ChangeArrayParallel (A, N);
    return 0;
}

– The ChangeArray class models the parallel_for Body Concept
– blocked_range is a pre-packaged 1D iteration space that models the Range Concept
– The change is applied to each array element in the body of operator()
– Call the generic function parallel_for: Range → blocked_range, Body → ChangeArray
– Experiment with the grain size

11 virtual techdays INDIA │ 18-20 august 2010 Intel® Threading Building Blocks Lab 1:
Convert the serial matrix multiplication application into a parallel application using parallel_for (a sketch of the approach follows below).
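A minimal sketch of how the lab solution might look, assuming a plain triple-nested-loop multiply of square N×N matrices stored as contiguous row-major arrays; the function and class names below are illustrative and not the lab's actual starter code:

#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
using namespace tbb;

// Parallelize the outer (row) loop: each task computes a band of rows of C = A * B.
class MatMulBody {
    const double* A; const double* B; double* C; size_t N;
public:
    MatMulBody (const double* a, const double* b, double* c, size_t n)
        : A(a), B(b), C(c), N(n) {}
    void operator()( const blocked_range<size_t>& r ) const {
        for (size_t i = r.begin(); i != r.end(); ++i)
            for (size_t j = 0; j < N; ++j) {
                double sum = 0;
                for (size_t k = 0; k < N; ++k)
                    sum += A[i*N + k] * B[k*N + j];
                C[i*N + j] = sum;
            }
    }
};

void MatMulParallel (const double* a, const double* b, double* c, size_t n) {
    // default grain size and partitioner; tune the grain size as in the previous example
    parallel_for (blocked_range<size_t>(0, n), MatMulBody(a, b, c, n));
}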

12 virtual techdays INDIA │ 18-20 august 2010 Intel® Threading Building Blocks Generic Parallel Algorithms: parallel_reduce Template Function

#include "tbb/parallel_reduce.h"
template<typename Range, typename Body>
void parallel_reduce (const Range& range, Body& body);
– represents parallel reduction of Body over each value in the Range
– the Range type must model the Intel® Threading Building Blocks Range Concept

parallel_reduce Body Concept requirements
Body::Body (const Body&)               Copy constructor
Body::Body (Body&, split)              Splitting constructor; must be able to run concurrently with `operator()' and `join'
Body::~Body ()                         Destructor
void Body::operator() (const Range&)   Apply Body to Range
void Body::join (const Body& rhs)      The result of rhs must be merged with the result of `this'

13 virtual techdays INDIA │ 18-20 august 2010 Intel® Threading Building Blocks
Parallel solution with Intel® TBB: using parallel_reduce

#include "tbb/blocked_range.h"
#include "tbb/parallel_reduce.h"
using namespace tbb;

const int IdealGrainSize = 10000;   // placeholder value; the slide leaves it blank

class SumArray {                     // models the parallel_reduce Body Concept
    int* array;
public:
    int sum;
    SumArray (int* a): array(a), sum(0) {}

    // Calculate the partial sum of the array elements in the body of operator()
    void operator()( const blocked_range<int>& r ) {
        for (int i = r.begin(); i != r.end(); i++) {
            sum += array[i];
        }
    }

    // Splitting constructor
    SumArray (SumArray& partial_sum, split): array(partial_sum.array), sum(0) {}

    // Perform the reduction in the body of join
    void join (const SumArray& partial_sum) { sum += partial_sum.sum; }
};

int SumArrayParallel (int* a, int n) {
    SumArray sum_array (a);
    // Call the generic function parallel_reduce
    parallel_reduce (blocked_range<int>(0, n, IdealGrainSize), sum_array);
    return sum_array.sum;
}

– The SumArray class models the parallel_reduce Body Concept
– Partial sums are calculated in the body of operator(); the reduction is performed in join
– Define a splitting constructor and call the generic function parallel_reduce

14 virtual techdays INDIA │ 18-20 august 2010 Intel® Threading Building Blocks Generic Concurrent Containers

15 virtual techdays INDIA │ 18-20 august 2010 Intel® Threading Building Blocks Concurrent Containers
Provides concurrent containers
– STL containers are not thread-safe: attempting to modify them concurrently can corrupt the container
– The standard practice of wrapping a lock around an STL container turns the container into a serial bottleneck
Interfaces are similar to STL but do not match 100%
– Some STL interfaces are inherently not thread-safe
Fine-grained locking or lock-free implementations
– Worse single-thread performance, but better scalability
– Can be used with the library, OpenMP, or native threads

16 virtual techdays INDIA │ 18-20 august 2010 Intel® Threading Building Blocks Concurrent Containers: concurrent_hash_table
– Maps a Key to an element of type T
– A hash table whose elements are of type std::pair<const Key, T>
– You implement a HashCompare class that defines two methods: ‘hash’ (maps a Key to a hash code of type size_t) and the predicate ‘equal’ (returns true if two Keys are equal)
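A minimal sketch of the HashCompare idea, written against tbb::concurrent_hash_map, which is the name the shipping library uses for this container; the word-counting use case and the class names below are illustrative:

#include <string>
#include "tbb/concurrent_hash_map.h"
using namespace tbb;

// HashCompare for std::string keys: provides 'hash' and the predicate 'equal'
struct StringHashCompare {
    static size_t hash (const std::string& s) {
        size_t h = 0;
        for (size_t i = 0; i < s.size(); ++i)
            h = h * 31 + s[i];          // simple illustrative hash
        return h;
    }
    static bool equal (const std::string& a, const std::string& b) {
        return a == b;
    }
};

typedef concurrent_hash_map<std::string, int, StringHashCompare> StringTable;

void CountWord (StringTable& table, const std::string& word) {
    StringTable::accessor a;            // accessor acts as a smart pointer plus a write lock
    table.insert (a, word);             // inserts <word, 0> if the key is not present
    a->second += 1;                     // the element stays locked until 'a' is destroyed
}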

17 virtual techdays INDIA │ 18-20 august 2010 Intel® Threading Building Blocks Concurrent Containers: concurrent_vector
– A dynamically growable array of T: grow_by and grow_to_at_least
– The clear() method is not thread-safe with respect to concurrent resizing
– concurrent_vector never moves an element until the array is cleared
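A minimal sketch of concurrent growth, assuming many threads append results through push_back while others read previously stored elements; the global vector and function names are illustrative:

#include "tbb/concurrent_vector.h"
using namespace tbb;

concurrent_vector<int> Results;

// Safe to call from many threads at once: push_back appends atomically, and
// because concurrent_vector never moves elements, references and indices to
// existing elements stay valid while other threads keep growing the vector.
void RecordValue (int v) {
    Results.push_back (v);
}

// After the parallel phase is over, single-threaded access is ordinary:
int SumResults () {
    int sum = 0;
    for (size_t i = 0; i < Results.size(); ++i)
        sum += Results[i];
    return sum;
}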

18 virtual techdays INDIA │ 18-20 august 2010 Intel® Threading Building Blocks Concurrent Containers: concurrent_queue
– Preserves “first-in-first-out” ordering per producer: if one thread pushes two values and another thread pops those two values, they come out in the order in which they were pushed
– The type of ‘size’ is a signed number: if the queue is empty and size() returns ‘–n’, it means ‘n’ pops are pending
– The method ‘empty’ returns true if size() is zero or negative
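A minimal producer/consumer sketch using push and try_pop; try_pop is the non-blocking pop offered by current TBB releases (the 2010-era class also offered a blocking pop, which later moved to concurrent_bounded_queue), and the queue and function names here are illustrative:

#include "tbb/concurrent_queue.h"
using namespace tbb;

concurrent_queue<int> WorkQueue;

// Producer: values pushed by one thread are popped in the same order.
void Produce (int first, int last) {
    for (int v = first; v < last; ++v)
        WorkQueue.push (v);
}

// Consumer: try_pop returns false immediately if the queue is empty,
// so the consumer can do other work instead of blocking.
bool ConsumeOne (int& out) {
    return WorkQueue.try_pop (out);
}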

19 virtual techdays INDIA │ 18-20 august 2010 Intel® Threading Building Blocks Synchronization Primitives

20 virtual techdays INDIA │ 18-20 august 2010 Intel® Threading Building Blocks Synchronization Primitives: Mutex Concept
Mutexes are C++ objects based on the scoped locking pattern
M ()                                Construct an unlocked mutex
M::~M ()                            Destroy an unlocked mutex
typename M::scoped_lock             Corresponding scoped_lock type
M::scoped_lock ()                   Construct a lock without acquiring a mutex
M::scoped_lock (M&)                 Construct a lock and acquire a lock on the mutex
M::scoped_lock::~scoped_lock ()     Release the lock if acquired
M::scoped_lock::acquire (M&)        Acquire a lock on the mutex
M::scoped_lock::release ()          Release the lock
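A minimal sketch of the scoped-locking pattern with spin_mutex; the protected counter is illustrative:

#include "tbb/spin_mutex.h"
using namespace tbb;

spin_mutex CountMutex;
long Count = 0;

void IncrementCount () {
    // The scoped_lock constructor acquires CountMutex; the destructor
    // releases it when the scope is left, even if an exception is thrown.
    spin_mutex::scoped_lock lock (CountMutex);
    ++Count;
}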

21 virtual techdays INDIA │ 18-20 august 2010 Intel® Threading Building Blocks Synchronization Primitives: Mutex Flavors
spin_mutex
– Non-reentrant, unfair, spins in user space
– VERY FAST in lightly contended situations; use it if you need to protect very few instructions
queuing_mutex
– Non-reentrant, fair, spins in user space
– Use queuing_mutex when scalability and fairness are important
queuing_rw_mutex
– Non-reentrant, fair, spins in user space
spin_rw_mutex
– Non-reentrant, unfair, spins in user space
– Use the reader-writer mutexes to allow non-blocking reads by multiple threads
mutex
– Wrapper around OS synchronization: CRITICAL_SECTION on Windows*, pthread_mutex on Linux*

22 virtual techdays INDIA │ 18-20 august 2010 Intel® Threading Building Blocks Synchronization Primitives: Example of spin_rw_mutex
– Allows multiple threads to read the protected data, but only one thread (the writer) can change the data exclusively
– Upgrade/downgrade operations: upgrade_to_writer returns true if it upgraded the lock without temporarily releasing the mutex; downgrade_to_reader converts a writer lock back into a reader lock

#include "tbb/spin_rw_mutex.h"
using namespace tbb;

spin_rw_mutex MyMutex;

int foo () {
    /* Construction of 'lock' acquires 'MyMutex' as a reader */
    spin_rw_mutex::scoped_lock lock (MyMutex, /*is_writer*/ false);
    …
    if (!lock.upgrade_to_writer ()) {
        … /* the mutex was temporarily released while upgrading: re-check the protected state */
    } else {
        … /* the upgrade succeeded without releasing the mutex */
    }
    return 0;
    /* Destructor of 'lock' releases 'MyMutex' */
}

23 virtual techdays INDIA │ 18-20 august 2010 Intel® Threading Building Blocks Advanced Features Overview

24 virtual techdays INDIA │ 18-20 august 2010 Intel® Threading Building Blocks: Advanced Features Overview
Generic Parallel Algorithms
– parallel_for, parallel_while, parallel_reduce, pipeline, parallel_sort, parallel_scan
Concurrent Containers
– concurrent_hash_table, concurrent_queue, concurrent_vector
task_scheduler
Low-Level Synchronization Primitives
– spin_mutex, queuing_rw_mutex, spin_rw_mutex, mutex
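Most of these components follow the same generic pattern shown earlier. As one example of the additional algorithms listed above, a minimal parallel_sort call; the data being sorted is illustrative:

#include <vector>
#include "tbb/parallel_sort.h"
using namespace tbb;

void SortValues (std::vector<float>& values) {
    // Sorts in ascending order using operator<; an overload taking a
    // comparator, parallel_sort(begin, end, comp), is also available.
    parallel_sort (values.begin(), values.end());
}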

25 virtual techdays INDIA │ 18-20 august 2010 Intel® Threading Building Blocks: Summary
– Scalable data-parallel decomposition, providing patterns for parallel algorithms and concurrent data structures
– A paradigm of logical tasks that are efficiently and automatically mapped onto physical threads by the task scheduler
– Works well for computationally intensive tasks, since the task scheduler efficiently load-balances tasks across the physical threads and is cache-aware

26 virtual techdays INDIA │ 18-20 august 2010 RESOURCES
– Intel® Threading Building Blocks: http://www.threadingbuildingblocks.org/
– You may participate in our community support web site.
– Tools Knowledge Base: http://software.intel.com/en-us/articles/tools
– User forums: http://software.intel.com/en-us/forums/
– Intel® Software Product support info: http://www.intel.com/software/support

27 virtual techdays INDIA │ 18-20 august 2010 RELATED CONTENT  Session-1  Speaker Name  Timing  Session-2  Speaker Name  Timing  Session-3  Speaker Name  Timing

28 virtual techdays THANKS │ 18-20 august 2010 email id │om.p.sachan@intel.com

