INTEL CONFIDENTIAL Threading for Performance with Intel® Threading Building Blocks 杨剑锋, Wuhan University

Slides:



Advertisements
Similar presentations
Implementing Task Decompositions Intel Software College Introduction to Parallel Programming – Part 5.
Advertisements

Analyzing Parallel Performance Intel Software College Introduction to Parallel Programming – Part 6.
Shared-Memory Model and Threads Intel Software College Introduction to Parallel Programming – Part 2.
Improving Parallel Performance Intel Software College Introduction to Parallel Programming – Part 7.
Implementing Domain Decompositions Intel Software College Introduction to Parallel Programming – Part 3.
© 2013 IBM Corporation Implement high-level parallel API in JDK Richard Ning – Enterprise Developer 1 st June 2013.
Chapter 7 Constructors and Other Tools. Copyright © 2006 Pearson Addison-Wesley. All rights reserved. 7-2 Learning Objectives Constructors Definitions.
11 Copyright © 2005, Oracle. All rights reserved. Using Arrays and Collections.
1 Copyright © 2005, Oracle. All rights reserved. Introducing the Java and Oracle Platforms.
Addison Wesley is an imprint of © 2010 Pearson Addison-Wesley. All rights reserved. Chapter 10 Arrays and Tile Mapping Starting Out with Games & Graphics.
Intel Software College Tuning Threading Code with Intel® Thread Profiler for Explicit Threads.
Parallel List Ranking Advanced Algorithms & Data Structures Lecture Theme 17 Prof. Dr. Th. Ottmann Summer Semester 2006.
David Luebke 1 6/7/2014 ITCS 6114 Skip Lists Hashing.
1 Linked Lists III Template Chapter 3. 2 Objectives You will be able to: Write a generic list class as a C++ template. Use the template in a test program.
1 Joe Meehean. Ordered collection of items Not necessarily sorted 0-index (first item is item 0) Abstraction 2 Item 0 Item 1 Item 2 … Item N.
INTEL CONFIDENTIAL Implementing a Task Decomposition Introduction to Parallel Programming – Part 9.
Executional Architecture
25 seconds left…...
INTEL CONFIDENTIAL Threading for Performance with Intel® Threading Building Blocks Session:
© Copyright 1992–2005 by Deitel & Associates, Inc. and Pearson Education Inc. All Rights Reserved. Tutorial 13 – Salary Survey Application: Introducing.
OpenMP Optimization National Supercomputing Service Swiss National Supercomputing Center.
MINJAE HWANG THAWAN KOOBURAT CS758 CLASS PROJECT FALL 2009 Extending Task-based Programming Model beyond Shared-memory Systems.
FOUNDATION TO PARALLEL PROGRAMMING. CONTENT 并行程序设计简介 并行程序设计模型 并行程序设计范型 2.
Starting Parallel Algorithm Design David Monismith Based on notes from Introduction to Parallel Programming 2 nd Edition by Grama, Gupta, Karypis, and.
INTEL CONFIDENTIAL Improving Parallel Performance Introduction to Parallel Programming – Part 11.
DCN 多核防火墙快速配置之 目的 NAT 配置 神州数码网络 蒋忠平.
计算机 在分析化学的应用 ( 简介 ) 陈辉宏. 一. 概述 信息时代的来临, 各门学科的研究方法都 有了新的发展. 计算机的介入, 为分析化学的进展提供了 一种更方便的研究方法.
嵌入式操作系统 陈香兰 Fall 系统调用 10/27/09 嵌入式 OS 3/12 系统调用的意义  操作系统为用户态进程与硬件设备进行交互提供 了一组接口 —— 系统调用  把用户从底层的硬件编程中解放出来  极大的提高了系统的安全性  使用户程序具有可移植性.
吉林大学远程教育课件 主讲人 : 杨凤杰学 时: 64 ( 第四十八讲 ) 离散数学. 例 设 S 是一个非空集合, ρ ( s )是 S 的幂集合。 不难证明 :(ρ(S),∩, ∪,ˉ, ,S) 是一个布尔代数。 其中: A∩B 表示 A , B 的交集; A ∪ B 表示 A ,
数 学 系 University of Science and Technology of China DEPARTMENT OF MATHEMATICS 第 3 章 曲线拟合的最小二乘法 给出一组离散点,确定一个函数逼近原函数,插值是这样的一种手段。 在实际中,数据不可避免的会有误差,插值函数会将这些误差也包括在内。
在发明中学习 线性代数 概念的引入 李尚志 中国科学技术大学. 随风潜入夜 : 知识的引入 之一、线性方程组的解法 加减消去法  方程的线性组合  原方程组的解是新方程的解 是否有 “ 增根 ” ?  互为线性组合 : 等价变形  初等变换  高斯消去法.
东南大学计算中心 网站应用与实践 主讲人 吴俊. 2 东南大学计算中心 网站制作流程  确定主题、风格  规划栏目、收集素材  版面设计、配色  编辑页面  测试发布 FrontPage 要完成的任务.
第 3 章 控制流分析 内容概述 – 定义一个函数式编程语言,变量可以指称函数 – 以 dynamic dispatch problem 为例(作为参数的 函数被调用时,究竟执行的是哪个函数) – 规范该控制流分析问题,定义什么是可接受的控 制流分析 – 定义可接受分析在语义模型上的可靠性 – 讨论分析算法.
吉林大学远程教育课件 主讲人 : 杨凤杰学 时: 64 ( 第五十三讲 ) 离散数学. 定义 设 G= ( V , T , S , P ) 是一个语法结构,由 G 产生的语言 (或者说 G 的语言)是由初始状态 S 演绎出来的所有终止符的集合, 记为 L ( G ) ={w  T *
平行线的平行公理与判定 九年制义务教育七年级几何 制作者:赵宁睿. 平行线的平行公理与判定 要点回顾 课堂练习 例题解析 课业小结 平行公理 平行判定.
1 第 7 章 存储过程、触发器和程序包 在很多时候,都需要保存 PL/SQL 程序块,以便 随后可以重新使用。这也意味着,程序块需要一个名 称,这样需才可以调用或者引用它。命名的 PL/SQL 程序块可被独立编译并存储在数据库中,任何与数据 库相连接的应用程序都可以访问这些存储的 PL/SQL 程序块。
INTEL CONFIDENTIAL Reducing Parallel Overhead Introduction to Parallel Programming – Part 12.
SEC(R) 2008 Intel® Concurrent Collections for C++ - a model for parallel programming Nikolay Kurtov Software and Services.
《 UML 分析与设计》 交互概述图 授课人:唐一韬. 知 识 图 谱知 识 图 谱知 识 图 谱知 识 图 谱.
INTEL CONFIDENTIAL Predicting Parallel Performance Introduction to Parallel Programming – Part 10.
软件调优基础 2004 年 2 月 23 日. 为什么需要调优? 相同的代码 >> 不同的性能 SELFRELEASE OPT : 4 IMSLCXMLATLASMKL50MKL s5.445s5.457s10.996s3.328s0.762s0.848s0.738s for(i=0;i
1 、如果 x + 5 > 4 ,那么两边都 可得 x >- 1 2 、在- 3y >- 4 的两边都乘以 7 可得 3 、在不等式 — x≤5 的两边都乘以- 1 可得 4 、将- 7x — 6 < 8 移项可得 。 5 、将 5 + a >- 2 a 移项可得 。 6 、将- 8x < 0.
Copyright © Curt Hill Generic Classes Template Classes or Container Classes.
Chapter 3 Programming Languages Unit 1 The Development of Programming Languages.
Virtual techdays INDIA │ august 2010 Parallelize applications using Intel Threading Building Blocks Om Sachan │ SSG, Intel Corporation.
§10.2 对偶空间 一、对偶空间与对偶基 二、对偶空间的有关结果 三、例题讲析.
表单自定义 “ 表单自定义 ” 功能是用于制作表单的 工具,用数飞 OA 提供的表单自定义 功能能够快速制作出内容丰富、格 式规范、美观的表单。
INTEL CONFIDENTIAL Shared Memory Considerations Introduction to Parallel Programming – Part 4.
SIMCOM EAT.
力的合成 力的合成 一、力的合成 二、力的平行四边形 上一页下一页 目 录 退 出. 一、力的合成 O. O. 1. 合力与分力 我们常常用 一个力来代替几个力。如果这个 力单独作用在物体上的效果与原 来几个力共同作用在物体上的效 果完全一样,那么,这一个力就 叫做那几个力的合力,而那几个 力就是这个力的分力。
M7120 型平面磨床 电气控制线路原理与故障检修 主讲 张振飞 湖南工学院电气与信息工程系 电工电子实习基地.
向日葵的花盘 画一画 用圆规画圆用圆规画圆 用圆规画圆用圆规画圆 用圆规画圆的方法: ( 1 )把圆规的两脚分开,定好两脚间 的距离(定长) ( 2 )把有针尖的一只脚固定在一点上 (定点) ( 3 )把装有铅笔尖的一只脚旋转一周 ,就画出一个圆(旋转)
八. 真核生物的转录 ㈠ 特点 ① 转录单元为单顺反子( single cistron ),每 个蛋白质基因都有自身的启动子,从而造成在功能 上相关而又独立的基因之间具有更复杂的调控系统。 ② RNA 聚合酶的高度分工,由 3 种不同的酶催化转 录不同的 RNA 。 ③ 需要基本转录因子与转录调控因子的参与,这.
U niversity of S cience and T echnology of C hina VxWorks 及其应用开发 陈香兰 年 7 月.
Chapter 4: Threads. 4.2/23 Chapter 4: Threads Overview Multithreading Models.
Parallel Computing Presented by Justin Reschke
韩文数据库使用说明 鲁锦松. 主要内容 一、为什么要用数据库 二、怎样利用中文数据库 三、怎样利用韩文数据库.
Tuning Threaded Code with Intel® Parallel Amplifier.
1 Parallel Processing Fundamental Concepts. 2 Selection of an Application for Parallelization Can use parallel computation for 2 things: –Speed up an.
Session 03 - Templates Outline 03.1Introduction 03.2Function Templates 03.3Overloading Template Functions 03.4Class Templates 03.5Class Templates and Non-type.
Parallel Computing Chapter 3 - Patterns R. HALVERSON MIDWESTERN STATE UNIVERSITY 1.
1 Game Developers Conference 2008 Comparative Analysis of Game Parallelization Dmitry Eremin Senior Software Engineer, Intel Software and Solutions Group.
Chapter 4: Threads Modified by Dr. Neerja Mhaskar for CS 3SH3.
Chapter 4: Multithreaded Programming
Task Scheduling for Multicore CPUs and NUMA Systems
Parallel Algorithm Design
Many-core Software Development Platforms
Chapter 4: Threads.
Chapter 4: Threads & Concurrency
Presentation transcript:

INTEL CONFIDENTIAL Threading for Performance with Intel® Threading Building Blocks 杨剑锋, Wuhan University

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 2 回顾 前面我们已经讲过 OpenMP 和 MPI 编程工具: OpenMP :共享内存、方法 MPI :分布式内存、方法 Data Dependence Graphs

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 3 为什么要学 TBB 为什么要学 TBB ? 面向任务编程 简单易用 丰富的辅助开发工具 TBB 是 C++ 的扩展库,同学们对 C++ 很熟悉,使用 TBB 进行多核编程上手快。 TBB 的可扩展性高( Scalability ) 与 OpenMP 相比: 线程的规约可以使用于任意类型,而 OpenMP 的规约只允许在内嵌类型上进行。 与 MPI 相比: MPI 需要人工把任务映射到线程上,而 TBB 直接通过任务来表达并行语义。

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 4 下载

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 5 Objectives At the end of this session, you will be able to: List the different components of the Intel Threading Building Blocks library Code parallel applications using TBB

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 6 Agenda Intel® Threading Building Blocks background Generic Parallel Algorithms parallel_for parallel_reduce parallel_sort example Task Scheduler Generic Highly Concurrent Containers Scalable Memory Allocation Low-level Synchronization Primitives

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 7 Intel® Threading Building Blocks (TBB) - Key Features You specify tasks (what can run concurrently) instead of threads Library maps your logical tasks onto physical threads, efficiently using cache and balancing load Full support for nested parallelism Targets threading for scalable performance Portable across Linux*, Mac OS*, Windows*, and Solaris* Emphasizes scalable data parallel programming Loop parallelism tasks are more scalable than a fixed number of separate tasks Compatible with other threading packages Can be used in concert with native threads and OpenMP Open source and licensed versions available

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 8 Limitations Intel® TBB is not intended for I/O bound processing Real-time processing

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 9 Components of TBB (version 2.1) Synchronization primitives atomic operations various flavors of mutexes (improved) Parallel algorithms parallel_for (improved) parallel_reduce (improved) parallel_do (new) pipeline (improved) parallel_sort parallel_scan Concurrent containers concurrent_hash_map concurrent_queue concurrent_vector (all improved) Task scheduler With new functionality Memory allocators tbb_allocator (new), cache_aligned_allocator, scalable_allocator Utilities tick_count tbb_thread (new)

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 10 Task-based Programming with Intel® TBB Tasks are light-weight entities at user-level Intel® TBB parallel algorithms map tasks onto threads automatically Task scheduler manages the thread pool –Scheduler is unfair to favor tasks that have been most recent in the cache Oversubscription and undersubscription of core resources is prevented by task-stealing technique of TBB scheduler

INTEL CONFIDENTIAL Generic Parallel Algorithms

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 12 Generic Programming Best known example is C++ STL Enables distribution of broadly-useful high-quality algorithms and data structures Write best possible algorithm with fewest constraints Do not force particular data structure on user Classic example: STL std::sort Instantiate algorithm to specific situation C++ template instantiation, partial specialization, and inlining make resulting code efficient Standard Template Library, overall, is not thread-safe

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 13 Generic Programming - Example The compiler creates the needed versions template T max (T x, T y) { if (x < y) return y; return x; } int main() { int i = max(20,5); double f = max(2.5, 5.2); MyClass m = max(MyClass(“foo”),MyClass(“bar”)); return 0; } T must define a copy constructor and a destructor T must define operator<

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 14 Intel® Threading Building Blocks Patterns Task scheduler powers high level parallel patterns that are pre- packaged, tested, and tuned for scalability parallel_for: load-balanced parallel execution of loop iterations where iterations are independent parallel_reduce: load-balanced parallel execution of independent loop iterations that perform reduction (e.g. summation of array elements) parallel_while: load-balanced parallel execution of independent loop iterations with unknown or dynamically changing bounds (e.g. applying function to the element of linked list) parallel_scan: template function that computes parallel prefix pipeline: data-flow pipeline pattern parallel_sort: parallel sort

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 15 Grain Size 粒度:分配给处理器的合理的迭代数量。 OpenMP has similar parameter Part of parallel_for, not underlying task scheduler Grain size exists to amortize overhead, not balance load Units of granularity are loop iterations Typically only need to get it right within an order of magnitude

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 16 Tuning Grain Size Tune by examining single-processor performance When in doubt, err on the side of making it a little too large, so that performance is not hurt when only one core is available. too fine  scheduling overhead dominates too coarse  lose potential parallelism

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 17 The parallel_for Template Requires definition of: A range type to iterate over – Must define a copy constructor and a destructor – Defines is_empty() – Defines is_divisible() – Defines a splitting constructor, R(R &r, split) A body type that operates on the range (or a subrange) – Must define a copy constructor and a destructor – Defines operator() template void parallel_for(const Range& range, const Body &body);

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 18 Body is Generic Requirements for parallel_for Body parallel_for partitions original range into subranges, and deals out subranges to worker threads in a way that: Balances load Uses cache efficiently Scales Body::Body(const Body&)Copy constructor Body::~Body()Destructor void Body::operator() (Range& subrange) constApply the body to subrange.

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 19 Range is Generic Requirements for parallel_for Range Library provides predefined ranges blocked_range and blocked_range2d You can define your own ranges R::R (const R&)Copy constructor R::~R()Destructor bool R::is_empty() constTrue if range is empty bool R::is_divisible() constTrue if range can be partitioned R::R (R& r, split)Splitting constructor; splits r into two subranges

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 20 tasks available to be scheduled to other threads (thieves) How splitting works on blocked_range2d Split range..... recursively......until  grainsize.

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 21 const int N = ; void change_array(float array, int M) { for (int i = 0; i < M; i++){ array[i] *= 2; } int main (){ float A[N]; initialize_array(A); change_array(A, N); return 0; } An Example using parallel_for Independent iterations and fixed/known bounds

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 22 #include “tbb/task_scheduler_init.h” #include “tbb/blocked_range.h” #include “tbb/parallel_for.h” using namespace tbb; int main (){ task_scheduler_init init; float A[N]; initialize_array(A); parallel_change_array(A, N); return 0; } An Example using parallel_for Include Library Headers Use namespace Include and initialize the library Initialize scheduler int main (){ float A[N]; initialize_array(A); change_array(A, N); return 0; } blue = original code green = provided by TBB red = boilerplate for library

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 23 class ChangeArrayBody { float *array; public: ChangeArrayBody (float *a): array(a) {} void operator()( const blocked_range & r ) const{ for (int i = r.begin(); i != r.end(); i++ ){ array[i] *= 2; } }; void parallel_change_array(float *array, int M) { parallel_for (blocked_range (0, M, IdealGrainSize), ChangeArrayBody(array)); } An Example using parallel_for Define Task Use Pattern Establish grain size Use the parallel_for pattern void change_array(float *array, int M) { for (int i = 0; i < M; i++){ array[i] *= 2; } blue = original code green = provided by TBB red = boilerplate for library

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 24 Activity 1: Matrix Multiplication Convert serial matrix multiplication application into parallel application using parallel_for Triple-nested loops

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 25 The parallel_reduce Template Requirements for parallel_reduce Body Reuses Range concept from parallel_for Body::Body( const Body&, split )Splitting constructor Body::~Body()Destructor void Body::operator() (Range& subrange) constAccumulate results from subrange void Body::join( Body& rhs );Merge result of rhs into the result of this. template void parallel_reduce (const Range& range, Body &body);

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 26 Serial Example // Find index of smallest element in a[0...n-1] long SerialMinIndex ( const float a[], size_t n ) { float value_of_min = FLT_MAX; long index_of_min = -1; for( size_t i=0; i<n; ++i ) { float value = a[i]; if( value<value_of_min ) { value_of_min = value; index_of_min = i; } return index_of_min; }

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 27 Parallel Version (1 of 3) class MinIndexBody { const float *const my_a; public: float value_of_min; long index_of_min;... MinIndexBody ( const float a[] ) : my_a(a), value_of_min(FLT_MAX), index_of_min(-1) {} }; // Find index of smallest element in a[0...n-1] long ParallelMinIndex ( const float a[], size_t n ) { MinIndexBody mib(a); parallel_reduce(blocked_range (0,n,GrainSize), mib ); return mib.index_of_min; } blue = original code green = provided by TBB red = boilerplate for library On next slides: operator() Split constructor join

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 28 Parallel Version (2 of 3) class MinIndexBody { const float *const my_a; public:... void operator()( const blocked_range & r ) { const float* a = my_a; for( size_t i = r.begin(); i != r.end; ++i ) { float value = a[i]; if( value<value_of_min ) { value_of_min = value; index_of_min = i; }... accumulate result blue = original code green = provided by TBB red = boilerplate for library

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 29 Parallel Version (3 of 3) class MinIndexBody { const float *const my_a; public:... MinIndexBody( MinIndexBody& x, split ) : my_a(x.my_a), value_of_min(FLT_MAX), index_of_min(-1) {} void join( const MinIndexBody& y ) { if( y.value_of_min < value_of_min ) { value_of_min = y.value_of_min; index_of_min = y.index_of_min; }... join split blue = original code green = provided by TBB red = boilerplate for library

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 30 Activity 2: parallel_reduce Lab Numerical Integration code to compute Pi

INTEL CONFIDENTIAL Parallel Sort Example (with work stealing)

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 32 Quicksort – Step 1 THREAD Thread 1 starts with the initial data tbb::parallel_sort (color, color+64);

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners Quicksort – Step THREAD THREAD 2THREAD 3THREAD 4 Thread 1 partitions/splits its data

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners THREAD 1THREAD Thread 2 gets work by stealing from Thread 1 THREAD 3THREAD 4 Quicksort – Step 2

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners THREAD THREAD Thread 1 partitions/splits its data Quicksort – Step 3 Thread 2 partitions/splits its data

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners THREAD THREAD 2THREAD 3THREAD Thread 3 gets work by stealing from Thread 1 Thread 4 gets work by stealing from Thread 2 Quicksort – Step 3

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners Quicksort – Step 4 THREAD 1THREAD 2THREAD 3THREAD Thread 1 sorts the rest of its data Thread 4 sorts the rest of its data Thread 2 sorts the rest its data Thread 3 partitions/splits its data

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners THREAD THREAD 2THREAD 3THREAD Quicksort – Step 5 Thread 1 gets more work by stealing from Thread 3 Thread 3 sorts the rest of its data

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners THREAD THREAD 2THREAD 3THREAD Thread 1 partitions/splits its data Quicksort – Step 6

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners THREAD THREAD 2THREAD 3THREAD Thread 2 gets more work by stealing from Thread 1 Thread 1 sorts the rest of its data Quicksort – Step 6

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners THREAD THREAD 2THREAD 3THREAD Thread 2 sorts the rest of its data DONE Quicksort – Step 7

INTEL CONFIDENTIAL Task Scheduler

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 43 Task Based Approach Intel ® TBB provides C++ constructs that allow you to express parallel solutions in terms of task objects Task scheduler manages thread pool Task scheduler avoids common performance problems of programming with threads ProblemIntel® TBB Approach Oversubscription One scheduler thread per hardware thread Fair scheduling Non-preemptive unfair scheduling High overhead Programmer specifies tasks, not threads Load imbalance Work-stealing balances load

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 44 Example: Naive Fibonacci Calculation Recursion typically used to calculate Fibonacci number Widely used as toy benchmark Easy to code Has unbalanced task graph F(n)=F(n-1)+F(n-2) long SerialFib( long n ) { if( n<2 ) return n; else return SerialFib(n-1) + SerialFib(n-2); }

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 45 Example: Naive Fibonacci Calculation Can envision Fibonacci computation as a task graph SerialFib(4) SerialFib(3) SerialFib(2) SerialFib(1) SerialFib(2) SerialFib(1) SerialFib(0) SerialFib(2) SerialFib(1) SerialFib(0) SerialFib(3) SerialFib(2) SerialFib(1) SerialFib(0) SerialFib(1) SerialFib(0)

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 46 Fibonacci - Task Spawning Solution Use TBB tasks to thread creation and execution of task graph Create new root task Allocate task object Construct task Spawn (execute) task, wait for completion long ParallelFib( long n ) { long sum; FibTask& a = *new(Task::allocate_root()) FibTask(n,&sum); Task::spawn_root_and_wait(a); return sum; }

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 47 class FibTask: public task { public: const long n; long* const sum; FibTask( long n_, long* sum_ ) : n(n_), sum(sum_) {} task* execute() { // Overrides virtual function task::execute if( n<CutOff ) { *sum = SerialFib(n); } else { long x, y; FibTask& a = *new( allocate_child() ) FibTask(n-1,&x); FibTask& b = *new( allocate_child() ) FibTask(n-2,&y); set_ref_count(3); // 3 = 2 children + 1 for wait spawn( b ); spawn_and_wait_for_all( a ); *sum = x+y; } return NULL; } }; Fibonacci - Task Spawning Solution Derived from TBB task class Create new child tasks to compute (n-1) th and (n-2) th Fibonacci numbers Reference count is used to know when spawned tasks have completed Set before spawning any children Spawn task; return immediately Can be scheduled at any time Spawn task; block until all children have completed execution The execute method does the computation of a task

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 48 Further Optimizations Enabled by Scheduler Recycle tasks Avoid overhead of allocating/freeing Task Avoid copying data and rerunning constructors/destructors Continuation passing Instead of blocking, parent specifies another Task that will continue its work when children are done. Further reduces stack space and enables bypassing scheduler Bypassing scheduler Task can return pointer to next Task to execute –For example, parent returns pointer to its left child –See include/tbb/parallel_for.h for example Saves push/pop on deque (and locking/unlocking it)

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 49 Activity #3: Task Scheduler Interface for Recursive Algorithms Develop code to launch Intel TBB tasks to build and traverse a binary tree.

INTEL CONFIDENTIAL End of this part

INTEL CONFIDENTIAL Generic Highly Concurrent Containers

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 52 Concurrent Containers TBB Library provides highly concurrent containers STL containers are not concurrency-friendly: attempt to modify them concurrently can corrupt container Standard practice is to wrap a lock around STL containers –Turns container into serial bottleneck Library provides fine-grained locking or lockless implementations Worse single-thread performance, but better scalability. Can be used with the library, OpenMP, or native threads.

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 53 Concurrency-Friendly Interfaces Some STL interfaces are inherently not concurrency-friendly For example, suppose two threads each execute: Solution: concurrent_queue has pop_if_present extern std::queue q; if(!q.empty()) { item=q.front(); q.pop(); } At this instant, another thread might pop last element.

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 54 Concurrent Queue Container concurrent_queue Preserves local FIFO order –If thread pushes and another thread pops two values, they come out in the same order that they went in Method push(const T&) places copy of item on back of queue Two kinds of pops –Blocking – pop(T&) –non-blocking – pop_if_present(T&) Method size() returns signed integer –If size() returns –n, it means n pops await corresponding pushes Method empty() returns size() == 0 –Difference between pushes and pops –May return true if queue is empty, but there are pending pop()

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 55 Concurrent Queue Container Example Simple example to enqueue and print integers Constructor for queue Push items onto queue While more things on queue Pop item off Print item #include “tbb/concurrent_queue.h” #include using namespace tbb; int main () { concurrent_queue queue; int j; for (int i = 0; i < 10; i++) queue.push(i); while (!queue.empty()) { queue.pop(&j); printf(“from queue: %d\n”, j); } return 0; }

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 56 Concurrent Vector Container concurrent_vector Dynamically growable array of T –Method grow_by(size_type delta) appends delta elements to end of vector –Method grow_to_at_least(size_type n) adds elements until vector has at least n elements –Method size() returns the number of elements in the vector –Method empty() returns size() == 0 Never moves elements until cleared –Can concurrently access and grow –Method clear() is not thread-safe with respect to access/resizing

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 57 Concurrent Vector Container Example Append a string to the array of characters held in concurrent_vector Grow the vector to accommodate new string grow_by() returns old size of vector (first index of new element) Copy string into vector void Append( concurrent_vector & V, const char* string) { size_type n = strlen(string)+1; memcpy( &V[V.grow_by(n)], string, n+1 ); }

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 58 Concurrent Hash Table Container concurrent_hash_map Maps Key to element of type T You define class HashCompare with two methods –hash() maps Key to hashcode of type size_t –equal() returns true if two Keys are equal Enables concurrent find(), insert(), and erase() operations –find() and insert() set “smart pointer” that acts as lock on item –accessor grants read-write access –const_accessor grants read-only access –lock released when smart pointer is destroyed

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 59 Concurrent Hash Table Container Example MyHashCompare methods User-defined method hash() takes a string as a key and maps to an integer User-defined method equal() returns true if two strings are equal struct MyHashCompare { static size_t hash( const string& x ) { size_t h = 0; for( const char* s = x.c_str(); *s; s++ ) h = (h*157)^*s; return h; } static bool equal( const string& x, const string& y ) { return strcmp(x, y) == 0; } };

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 60 Concurrent Hash Table Container Example Key Insert If insert() returns true, new string insertion Value is key’s place within sequence of strings from getNextString() Otherwise, string has been previously seen typedef concurrent_hash_map myHash; myHash table; string newstring; int place = 0; … while (getNextString(&newString)) { myHash::accessor a; if (table.insert( a, newString )) // new string inserted a->second = ++place; }

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 61 Concurrent Hash Table Container Example Key Find If find() returns true, key was found within hash table myHash table; string s1, s2; int p1, p2; … { myHash::const_accessor a; // read_lock myHash::const_accessor b; if (table.find(a,s1) && table.find(b,s2)) { // find strings p1 = a->second; p2 = b->second; if (p1 < p2) printf(“%s came before %s\n”,s1,s2); else printf(“%s came before %s\n”,s2,s1); } else printf(“One or both strings not seen before\n”); }

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 62 Activity 4: Concurrent Container Lab Use a hash table to keep track of the number of string occurrences.

INTEL CONFIDENTIAL Scalable Memory Allocation

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 64 Scalable Memory Allocators Serial memory allocation can easily become a bottleneck in multithreaded applications Threads require mutual exclusion into shared heap False sharing - threads accessing the same cache line Even accessing distinct locations, cache line can ping-pong Intel® Threading Building Blocks offers two choices for scalable memory allocation Similar to the STL template class std::allocator scalable_allocator –Offers scalability, but not protection from false sharing –Memory is returned to each thread from a separate pool cache_aligned_allocator –Offers both scalability and false sharing protection

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 65 Methods for scalable_allocator #include “tbb/scalable_allocator.h” template class scalable_allocator; Scalable versions of malloc, free, realloc, calloc void *scalable_malloc( size_t size ); void scalable_free( void *ptr ); void *scalable_realloc( void *ptr, size_t size ); void *scalable_calloc( size_t nobj, size_t size ); STL allocator functionality T* A::allocate( size_type n, void* hint=0 ) –Allocate space for n values void A::deallocate( T* p, size_t n ) –Deallocate n values from p void A::construct( T* p, const T& value ) void A::destroy( T* p )

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 66 Scalable Allocators Example #include “tbb/scalable_allocator.h” typedef char _Elem; typedef std::basic_string<_Elem, std::char_traits, tbb::scalable_allocator > MyString;... {... int *p; MyString str1 = "qwertyuiopasdfghjkl"; MyString str2 = "asdfghjklasdfghjkl"; p = tbb::scalable_allocator ().allocate(24);... } Use TBB scalable allocator for STL basic_string class Use TBB scalable allocator to allocate 24 integers

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 67 Activity 5: Memory Allocation Comparison Do scalable memory exercise that first uses “new” and then asks user to replace with TBB scalable allocator

INTEL CONFIDENTIAL Low-Level Synchronization Primitives

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 69 Intel® TBB: Synchronization Primitives Parallel tasks must sometimes touch shared data When data updates might overlap, use mutual exclusion to avoid race High-level generic abstraction for HW atomic operations Atomically protect update of single variable Critical regions of code are protected by scoped locks The range of the lock is determined by its lifetime (scope) Leaving lock scope calls the destructor, making it exception safe Minimizing lock lifetime avoids possible contention Several mutex behaviors are available –Spin vs. queued –“are we there yet” vs. “wake me when we get there” –Writer vs. reader/writer (supports multiple readers/single writer) –Scoped wrapper of native mutual exclusion function

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 70 Atomic Execution atomic T should be integral type or pointer type Full type-safe support for 8, 16, 32, and 64-bit integers Operations atomic i;... int z = i.fetch_and_add(2); ‘= x’ and ‘x = ’read/write value of x x.fetch_and_store (y)z = x, y = x, return z x.fetch_and_add (y)z = x, x += y, return z x.compare_and_swap (y,p)z = x, if (x==p) x=y; return z

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 71 Mutex Concepts Mutexes are C++ objects based on scoped locking pattern Combined with locks, provide mutual exclusion M()Construct unlocked mutex ~M()Destroy unlocked mutex typename M::scoped_lockCorresponding scoped_lock type M::scoped_lock ()Construct lock w/out acquiring a mutex M::scoped_lock (M&)Construct lock and acquire lock on mutex M::~scoped_lock ()Release lock if acquired M::scoped_lock::acquire (M&)Acquire lock on mutex M::scoped_lock::release ()Release lock

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 72 Mutex Flavors spin_mutex Non-reentrant, unfair, spins in the user space VERY FAST in lightly contended situations; use if you need to protect very few instructions queuing_mutex Non-reentrant, fair, spins in the user space Use Queuing_Mutex when scalability and fairness are important queuing_rw_mutex Non-reentrant, fair, spins in the user space spin_rw_mutex Non-reentrant, fair, spins in the user space Use ReaderWriterMutex to allow non-blocking read for multiple threads mutex Wrapper for OS sync: CRITICAL_SECTION for Windows*, pthread_mutex on Linux*

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 73 Reader-Writer Lock Example #include “tbb/spin_rw_mutex.h” using namespace tbb; spin_rw_mutex MyMutex; int foo (){ /* Construction of ‘lock’ acquires ‘MyMutex’ */ spin_rw_mutex::scoped_lock lock (MyMutex, /*is_writer*/ false); read_shared_data (data); if (!lock.upgrade_to_writer ()) { /* lock was released to upgrade; may be unsafe to access data, recheck status before use */ } else { /* lock was not released; no other writer was given access */ } return 0; /* Destructor of ‘lock’ releases ‘MyMutex’ */ }

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 74 One last question… Do not ask! Not even the scheduler knows how many threads really are available –There may be other processes running on the machine Routine may be nested inside other parallel routines Focus on dividing your program into tasks of sufficient size Task should be big enough to amortize scheduler overhead Choose decompositions with good depth-first cache locality and potential breadth-first parallelism Let the scheduler do the mapping How do I know how many threads are available?

Wuhan University Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. * Other brands and names are the property of their respective owners. 75 Review Intel® Threading Building Blocks is a parallel programming model for C++ applications –Used for computationally intense code –A focus on data parallel programming –Uses generic programming Intel® Threading Building Blocks provides –Generic parallel algorithms –Highly concurrent containers –Low-level synchronization primitives –A task scheduler that can be used directly Know when to select Intel® Threading Building Blocks, the OpenMP API or Explicit Threading