Parallel Programming Overview & Examples


1 Parallel Programming Overview & Examples
SF 2009
Warren He, Ph.D. - Computing Continuum Platform Architect, SSG PRC Office/Innovation Team, Intel Software & Services Group
Segments: Life Science, Energy, CFD/MFG, NWS, FSI

2 Agenda
Parallel Computing and HPC overview
Parallel programming performance methodology
Multi-thread & OpenMP
MPI Tips
Hybrid Parallelization: MPI+OpenMP
Micro-Architecture optimization

3 HPC Demand Drivers
“Today, the world’s data outstrips our ability to comprehend it, much less take maximum advantage of it.” - Justin Rattner, Intel CTO
Data explosion demands performance: demand for real-time models, exponential growth in data-intensive computing, explosion in data volume growth.
But with issues: 42% of data center owners said they would exceed power capacity within 12 to 24 months without expansion; 39% said they would exceed cooling capacity in 12 to 24 months. - Computerworld
What more performance buys: increase the problem size (at constant time); decrease the elapsed time (at constant size); increase the problem complexity (at constant elapsed time and size).
“This compels us to rethink how we will manage, retrieve, explore, analyze, and communicate this abundance of data.” - US National Science Foundation
Notes: Demand for data centers continues to grow. A Computerworld prediction stated that enterprise data centers will need to cope with a zettabyte of data by 2010 (that is a billion terabytes), with corporate data growing fifty-fold in three years and worldwide systems outpaced by data growth. IDC projects data center construction costs to exceed $1,000/sq ft and $40,000/rack (IDC's Datacenter Trends Survey, 2007). Will that next server cost you $40 million? Building a new data center may be the right answer, but before you pull the trigger it is best to make sure. Construction of a new data center will cost $1,000 per square foot, maybe more; 63.6% of data center construction costs are associated with power and cooling (Gartner, June 2006). By analyzing the resources you have and the opportunities they offer, you can plan when your data center expansion or replacement will take place, and gain the benefit of time and advances in technology before you break ground. What does this mean to you? It means you will outgrow the capacity you can deliver today, and you are not alone: 81% of IT managers say they will exceed capacity for power or space in the next 5 years (Green Tech World, TMC 2007).

4 HPC Workload Characterization
Workload classes: CPU bound, memory-capacity bound, memory-BW bound. Size of input drives processing time and memory requirements, which drive algorithmic changes, application development and application support.
"As the size of the input to an algorithm increases, how do the running time and memory requirements of the algorithm change, and what are the implications and ramifications of that change?"
Use an OS that understands the platform architecture: Red Hat* Enterprise Linux* 5.3 and SUSE* Linux Enterprise 11 include platform feature support from newer mainline kernels; Microsoft HPC Server 2008; or a build-your-own recipe, e.g. Fedora 10 (distribution) + kernel from kernel.org.
Enable EIST + Intel® Turbo Mode Technology; use a good memory configuration.
Intel® Hyper-Threading Technology: consider pinning if used with MPI or OpenMP.
Optimize software for the micro-architecture: Intel® C++ Compiler 11.0, Intel® Fortran Compiler 11.0. Flexibility/cost sensitive. For demonstration purposes only.

5 Hardware for Parallel Computing
General introduction of HPC systems: hardware for parallel computing.
Taxonomy: Single Instruction Multiple Data (SIMD)§ vs. Multiple Instruction Multiple Data (MIMD). Shared address space: Symmetric Multiprocessor (SMP), Non-Uniform Memory Architecture (NUMA). Disjoint address space: Massively Parallel Processor (MPP), Cluster. Distributed computing is booming.
§SIMD has failed as a way to organize large-scale computers with multiple processors. It has succeeded, however, as a mechanism to increase instruction-level parallelism in modern microprocessors (as in Intel® MMX™ technology).
Notes: SIMD failed during the late 1980s; it did not work too well and needed custom silicon (Jurassic Park had a Thinking Machine in the background as a pretty computer). SMPs are the nodes in Intel® clusters. NUMA would be Multibus II, SGI Origin, or NEC Susa(?). MPP = Paragon or ASCI Red, all custom hardware with the exception of the processor and memory. Another example: 64 nodes of Sony PlayStation running NAMD. Cluster = what we are working on, COTS everywhere. Distributed computing is the current rage (the Grid), e.g. GPE and Unicore.
Intel, MMX, and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries.

6 Stop! Think twice (but not in parallel) before jumping into parallelization
When to say no (or maybe) to parallelism (Intel whitepaper):
Don't parallelize un-optimized serial code.
Don't parallelize if the serial code is already running fast enough.
Don't parallelize by rewriting code from scratch without careful consideration.
Don't parallelize if someone else has already done the work for you.

7 Foster’s Design Methodology
From Designing and Building Parallel Programs by Ian Foster (1994). Four steps:
Partitioning: dividing computation and data
Communication: sharing data between computations
Agglomeration: grouping tasks to improve performance
Mapping: assigning tasks to processors/threads
Diagram: the problem, initial tasks, communication, combined tasks, final program.
Purpose of the slide: introduce Foster's design methodology for parallel programming.
Details: This somewhat long digression, 8 slides of parallel design points and examples, is intended to prepare the design discussion for our own primes example. Ian Foster's 1994 book is well known to practitioners of this dark art, and it is available online (free!).

8 Or look at it as parallel programming patterns
Patterns for Parallel Programming, Timothy G. Mattson, Beverly A. Sanders, Berna L. Massingill, Addison-Wesley, 2005.
Details a pattern language for parallel algorithm design; examples are given in MPI, OpenMP and Java; represents the authors' hypothesis for how programmers think about parallel programming.
Script: OK, here's the book; this is a screenshot of the cover, and the citation is in the blue box. It was published in 2005. The premise is to create patterns for parallel algorithm design, not just serial algorithm design. One of the nice features of the book is that it does not focus on just one particular programming method for parallelism: it has examples in MPI, OpenMP and Java throughout the whole text. One of the key points is that this book is really the authors' hypothesis for how programmers think about parallel programming. No one as yet has a whole lot of experience laying out the entire formalization of parallel programming, but these authors have a lot of parallel programming experience and have distilled their thinking process into some good rules and design patterns. We'll look at some of these in more depth as we go along.
Instructor's note: a lot of this material comes right out of the book, so if you've read the book you should be a pretty good candidate for teaching this module.

9 Pattern Language’s Structure
A software design can be viewed as a series of refinements. Consider the process in terms of 4 design spaces, adding progressively lower-level elements to the design:
Finding Concurrency: tasks, shared data, partial orders
Algorithm Structure: thread/process structures, schedules
Supporting Structures: source code organization, shared data
Implementation Mechanisms: messages, synchronization, spawn
Script: The authors have broken the pattern language structure into 4 design spaces; each one is a refinement of the previous space. First, finding the concurrency: take your serial application and find where the concurrent portions of the code can be. What parts can actually run in parallel? What tasks are independent of each other? Is there any data shared between those tasks? Is there any partial ordering required, i.e., some execution order you need to follow once you've divided those tasks into independent pieces? Second, the algorithm structure: decide how that concurrency can be structured as an algorithm, how the threads and processes are organized, and what schedules need to go with those threads to create the concurrent or parallel version of the serial algorithm. Third, supporting structures: here you start getting into the code itself, how the source code gets organized, how the data gets shared, and what mechanisms you will incorporate into your program to carry out the parallel algorithm. Fourth, implementation mechanisms: once you have decided on the algorithm structure and the supporting structure, you actually do the implementation step. This includes the messages, the synchronization between threads or processes, and how you create and manage units of execution.

10 Finding Concurrency Design Space
Finding the scope for parallelization:
Begin with a sequential application that solves the original problem.
Decompose the application into tasks or data sets (decomposition analysis: task decomposition, domain decomposition).
Analyze the dependencies among the decomposed tasks (dependency analysis: group tasks, order groups, data sharing).
Script: Now we will look at each of those four steps in some depth, with more emphasis on the middle two steps, the Algorithm Structure and the Supporting Structures. But first, let's take a look at finding concurrency. We assume you are starting with a sequential application that solves some original problem. One good reason for starting with a serial application is to ensure that the answers you get from your parallel algorithm are correct: given a set of inputs, do the parallel application and the original serial application generate the same set of outputs? Now figure out where the concurrency is and decompose the application into tasks or data sets. This is a standard way of finding a parallel solution: do I see my application as a set of tasks, or as a big chunk of data being operated on that might be broken into concurrently computed subsets? So first decide whether you are going to do a domain decomposition, organizing your concurrency efforts around data, or a task decomposition, looking at your application as a series of independent tasks. Once you've decided what type of decomposition you have, figure out what your dependencies are: is there some way to group tasks to avoid dependencies, is there some ordering needed between the groups, and how do you do any requisite data sharing? All of these questions may relate to each other, so this may be an iterative process; that's why the arrows in the diagram point both ways, to indicate the iterative nature of the design process.

11 Algorithm Structure Design Space
How is the computation structured?
Organized by tasks (linear? recursive?): Task Parallelism, Divide and Conquer
Organized by data (linear? recursive?): Geometric Decomposition, Recursive Data
Organized by flow of data (regular? irregular?): Pipeline, Event-based Coordination
Script: Just a reminder of the tree structure we saw before. For the next several foils we will focus on the middle branch, where computation is organized by tasks that are computed in a more or less linear order; that brings us to the task parallelism algorithm structure.

12 Supporting Structures Design Space
High-level constructs used to organize the source code, categorized into program structures and data structures.
Program structures: SPMD, Loop Parallelism, Boss/Worker, Fork/Join
Data structures: shared data, shared queue, distributed array
Script: This slide is replicated here as a reminder of what supporting structures are and how they fit into the design pattern. For the Supporting Structures design space we will look at three of these structures: SPMD, Loop Parallelism and Boss/Worker. These are high-level constructs used to organize the source code; we are not actually implementing code yet (that is the 4th step), but here we need to get our heads around how we are going to organize the source code into one of these models.

13 Implementation Mechanisms Design Space
Low-level constructs implementing specific mechanisms used in parallel computing. Not proper design patterns; included to make the pattern language self-contained.
UE* management: process control, thread control
Synchronization: mutual exclusion, memory sync/fences, barriers
Communications: collective communication, message passing, other communication
* UE = unit of execution
Script: The fourth piece we will look at is the Implementation Mechanisms design space. This is the actual coding and API work you have to put in place to implement your parallel algorithm. It is not really a design pattern per se, but it is included to make the pattern language self-contained: we consider not only the design and how we divide up data, but also how to implement parallelism, how we share data, define threads, manage processes, synchronize, and communicate data. The pieces in blue boxes on the slide are the pieces that pertain to thread programming and the ones the book is concerned with: how we control threads, create them, modify them, wait for them, spawn them, wake them up, put them to sleep, and so on. Synchronization between threads is important because they are implemented in shared memory, so we may need mutual exclusion and barriers to control the order of execution. In shared memory we do not have to worry about collective communication or message passing, because everything is done through shared memory, where we have other synchronization/communication needs. While this is covered in more depth in the book, this foil is the last time we will touch on these implementation topics.

14 Thinking in a Parallel Way
Key concepts and theory of parallel processing:
Concurrency vs. parallelism
Processes vs. threads
Scalability, speedup & Amdahl's Law
Granularity, load balance and overhead
Throughput vs. latency
Hotspots vs. bottlenecks and the 80/20 rule
Critical path, CPI & performance indicators
Flops, memory bandwidth & compute efficiency
Auto-parallelization, vectorization and SSE
CPU, memory, I/O, communication
Data, task & pipeline parallelism
Notes (translated): The concept of a thread; how user threads are scheduled onto kernel threads; the problem of thread "drift" (migration); thread pinning on multi-core systems and the one-to-one mapping between cores and threads. The strict distinction between concurrency and parallelism. Scalability: scaling with the number of cores; speedup and Amdahl's Law: for a given workload and problem size, parallel speedup is limited by the share of the serial portion. Granularity: the ratio of computation to communication; load balance and overhead. Two broad categories: throughput computing and low-latency computing, which sometimes merge (e.g. search). Hotspot: a place where some activity is concentrated, but not necessarily the bottleneck.

15 Amdahl's Law: Ideal Speedup = 1/α
General introduction of HPC systems. How much faster can you go for a given parallel algorithm?
Timeline: load data, compute T1 ... compute TN, consume results. Computing N independent tasks with P processors:
Ttotal(P) = Tload + Tconsume + (N/P)·Ttask
Parallel terms: fraction β of the total computation. Serial terms, not parallelized: fraction α of the total computation (α + β = 1).
S = Speedup = Ttotal(sequential) / Ttotal(parallel) = T(seq) / (α·T(seq) + β·T(seq)/P), which tends to 1/α for infinite P and perfect parallelization.
The serial fraction of an algorithm limits the possible speedup of a parallel implementation.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries.
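As a quick numeric check of the formula above, here is a minimal sketch (not from the original slides; the serial fraction and processor counts are illustrative values only) that evaluates the Amdahl speedup and its 1/α ceiling:

    #include <stdio.h>

    /* Amdahl speedup for serial fraction alpha on p processors:
       S(p) = 1 / (alpha + (1 - alpha) / p)                       */
    static double amdahl_speedup(double alpha, int p)
    {
        return 1.0 / (alpha + (1.0 - alpha) / (double)p);
    }

    int main(void)
    {
        const double alpha = 0.05;              /* 5% serial work (assumed) */
        const int p_values[] = { 1, 8, 64, 1024 };
        for (int i = 0; i < 4; i++)
            printf("P = %4d  speedup = %6.2f  (limit = %.1f)\n",
                   p_values[i], amdahl_speedup(alpha, p_values[i]), 1.0 / alpha);
        return 0;
    }

With a 5% serial fraction the speedup saturates near 20 no matter how many processors are added, which is exactly the 1/α ceiling the slide refers to.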

16 The Picket Fence (Amdahl revisited)
Task        1 person    10 people   100 people
Prepare      1.0 h       1.1 h       1.2 h
Paint       10.0 h       1.0 h       0.1 h
Clean-up     1.0 h       1.1 h       1.2 h
Total       12 hours     3.2 hours   2.5 hours
Speed-up     1           3.75        4.8
Efficiency   100%        37.5%       4.8%
Overhead impact due to: communications, synchronisations, additional work, serial part.

17 Agenda (this section: Parallel programming performance methodology)
Parallel Computing and HPC overview
Parallel programming performance methodology
Multi-thread & OpenMP
MPI Tips
Hybrid Parallelization: MPI+OpenMP
Micro-Architecture optimization

18 Performance Indicators
Performance is the reciprocal of the time of execution: Performance = 1 / Texec, where Texec = L × CPI × Tc.
L = code length (# of machine instructions); CPI = clock cycles per instruction; Tc = clock period (ns).
Substituting IPC = instructions per cycle = 1/CPI and F = frequency = 1/Tc gives Performance = (IPC × F) / L. For example, 10^9 instructions at CPI 0.7 on a 3 GHz clock take 10^9 × 0.7 / (3 × 10^9) ≈ 0.23 s.
Three ways to do things faster:
Work harder: architecture enhancements / higher IPC (improve ILP, improve timing).
Do less: reduce the critical path / use a better algorithm.
Do it together: parallelization / clustering.

19 Design Metrics
IPC = instructions per cycle; the more the better.
Latency (same as response time): the time interval between when a request for data is made and when the data transfer completes; the less the better.
Throughput: the amount of work completed by the system per unit of time (ops/sec).

20 High Performance & Throughput Computing: Optimized HW + Familiar SW Development
Common tools suite and programming model.
Intel® MIC products for highly parallel applications: first intercept on the 22nm process; software development platforms available this year.
Xeon® serves most HPC workloads: Xeon is right for most workloads; extending the instruction set architecture (AVX, etc.); tools become architecturally aware (Xeon, MIC); Ct programming model.
Notes: Xeon is the right solution. The vast majority of application workloads are, and will continue to be, well served by Xeon processors. Our experience has shown that in competitive situations, when we have an opportunity to work with the customer to tune their code via the SSG CRT, we have been able to win >90% of those designs on Xeon versus alternative architectures. We will continue to invest to expand the Advanced Vector Extensions (AVX) instructions, among other investments, to maintain Xeon competitiveness. Common tools suite and programming model: we are investing with SSG to expand our software tools to support Intel MIC transparently for the user; a key competitive advantage is to provide a common programming model and tool suite to manage the complexity for the user. Ct programming model: a new data-parallel programming model, well suited to throughput workloads; it counters the CUDA programming model with a higher-level, more productive abstraction. Common tool suites ensure application development continuity and fast time to performance.

21

22 Application Classification and Performance Analysis
Intel® Xeon® processor 5600 series: CPU-frequency-bound workloads; choose by size of problem and price. Intel® Xeon® processor 7500 series: memory-BW-bound workloads. DCC rendering (e.g. ray tracing) is typically CPU bound; DCC video is BW bound.

23 We take a Top-down Performance approach
1. High: system-level tuning. Improve how the application interacts with the whole software stack to boost performance. Key areas: network problems, disk issues, memory access issues. Benefit: high.
2. Medium: application tuning. Improve the implementation of the application's algorithm and the application itself. Key areas: process/thread locks, heap contention, thread implementation, API choices, library selection, data structures. Benefit: medium.
3. Low: micro-architecture-level tuning. Improve how the application runs on a specific platform at the micro-code level (micro-arch tuning is difficult). Key areas: data/operation localization (cache efficiency), data alignment, vectorization/SSE. Benefit: low.

24 Typical Performance Pitfall - System
Memory configuration: balance BW and capacity (max BW, balanced, or max capacity?).
BIOS configuration: SMT / EIST / Turbo / NUMA / prefetchers.
Disk & network I/O: SSD vs. HDD; RAID0 vs. RAID1 vs. RAID5; stripe size, etc.
Update the OS version and re-architect to match the new HW platform in order to unleash the hardware performance.

25 Typical Performance Pitfall - System
Considerations for I/O-intensive workloads:
Maximize disk throughput, I/O request depth, scheduler choice.
Hide I/O latency by overlapping I/O and computation (see the sketch below).
Carefully design the buffering strategy: system buffer cache vs. self-managed memory cache.
Synchronous I/O vs. POSIX AIO vs. native (kernel) AIO? pread vs. readv vs. mmap vs. direct I/O?
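As an illustration of hiding I/O latency, here is a minimal double-buffered sketch using POSIX AIO (not from the original deck; the file name and block size are placeholders, and older glibc requires linking with -lrt):

    #include <aio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define BLOCK (1 << 20)                      /* 1 MiB per request (assumed) */

    static void compute(const char *buf, ssize_t n)
    {
        (void)buf; (void)n;                      /* stand-in for real work on the block */
    }

    int main(void)
    {
        static char buf[2][BLOCK];               /* double buffer */
        int fd = open("input.dat", O_RDONLY);    /* placeholder file name */
        if (fd < 0) { perror("open"); return 1; }

        struct aiocb cb;
        off_t offset = 0;
        int cur = 0;

        memset(&cb, 0, sizeof cb);               /* issue the first asynchronous read */
        cb.aio_fildes = fd;
        cb.aio_buf    = buf[cur];
        cb.aio_nbytes = BLOCK;
        cb.aio_offset = offset;
        aio_read(&cb);

        for (;;) {
            const struct aiocb *list[1] = { &cb };
            aio_suspend(list, 1, NULL);          /* wait for the outstanding read */
            ssize_t n = aio_return(&cb);
            if (n <= 0) break;

            int done = cur;                      /* block that just arrived       */
            cur ^= 1;                            /* prefetch into the other slot  */
            offset += n;

            memset(&cb, 0, sizeof cb);           /* start the next read ...       */
            cb.aio_fildes = fd;
            cb.aio_buf    = buf[cur];
            cb.aio_nbytes = BLOCK;
            cb.aio_offset = offset;
            aio_read(&cb);

            compute(buf[done], n);               /* ... while processing this one */
        }

        close(fd);
        return 0;
    }

The computation on the completed block runs while the next read is already in flight, which is the overlap the slide asks for.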

26 Typical Performance Pitfall – Algorithm implementation
The Big O does matter: look for published algorithms; do not reinvent the wheel.
Real case study, PForDelta vs. v-byte: in a search engine the inverted index is always compressed to reduce file size. SSE4-optimized PForDelta can achieve 3 GB/s decompression speed, 10x faster than v-byte.
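For readers unfamiliar with the baseline being compared against, here is a minimal sketch of v-byte decoding (an assumption: the classic little-endian 7-bits-per-byte scheme with a continuation bit, which may differ in detail from the production code); PForDelta replaces this byte-at-a-time loop with wide SIMD unpacking, which is where the 10x comes from:

    #include <stdint.h>
    #include <stddef.h>

    /* Decode one variable-byte (v-byte) integer: 7 payload bits per byte,
       continuation bit set on every byte except the last one of a value. */
    static const uint8_t *vbyte_decode(const uint8_t *in, uint32_t *out)
    {
        uint32_t value = 0;
        int shift = 0;
        while (*in & 0x80) {                  /* continuation bit set          */
            value |= (uint32_t)(*in++ & 0x7F) << shift;
            shift += 7;
        }
        value |= (uint32_t)(*in++) << shift;  /* final byte, high bit clear    */
        *out = value;
        return in;                            /* points past the encoded value */
    }

    /* Decode a whole posting list of n doc-id deltas. */
    static void vbyte_decode_list(const uint8_t *in, uint32_t *out, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            in = vbyte_decode(in, &out[i]);
    }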

27 Algorithm Complexity: JPEG image downsizing; AES vs. TEA vs. AES-NI
JPEG image downsizing: downsample in the frequency domain to the closest 1/(2^n) size first, then downsize to the target size. Performance: ~2.5x.
AES vs. TEA vs. AES-NI: AES is more than 3x faster than TEA with much higher encryption strength; switching to AES led to an 80% application-level performance improvement.

28 First: do it serially
Use the compiler optimization options and ask for a detailed report.
Check what has been done, within the report and using the hardware counters. Help the compiler and go back to step 2. When done, move to SSE/AVX intrinsics for more instructions per cycle.

29 Second: do it in parallel
CPU bound: "HPL". Memory bound: "Stream". Real-world applications: a main task plus worker threads.

30 The Big Picture Performance is all about…
“Keeping the CPU busy in the most efficient manner”:
Reduce instruction cache issues.
Reduce the "retired" component by minimizing the number of instructions.
Reduce "stalls" by removing memory access and other bottlenecks.
Reduce the "non-retired" component by reducing branch mispredictions.
Utilize vectorization / SSE / AVX.
Fine-tune memory utilization, threading, I/O and the OS.

31 First order approximation: only a matter of Freq and BW
The flops-versus-BW ratio defines the "efficiency" (or scalability). Frequency is "not an issue"; data movement is the issue.
    for (i=0; i<=MAX; i++)
        c[i] = a[i] + b[i]*d[i];     /* per iteration: 3 loads, 1 store, 1 add, 1 mul */
Efficiency (achieved flops / peak flops) won't be high enough if the loads and stores are not fast enough (GB/s).

32 Memory Bandwidth: a key point
Basic rule for theoretical memory BW: 1 [byte] × 8 [DRAM components] × memory frequency [Gtransfers/s] × number of channels × number of sockets.
Example for a Westmere DP server with 1333 MHz DDR3: 8 × 1.333 × 3 × 2 ≈ 64 GB/s, where Stream triad gives ~42 GB/s.
Future: more channels and higher memory frequencies.

33 So why don't applications scale?
Amdahl's law: the serial bottleneck.
Software: compiler, OpenMP, MPI.
Data locality: false sharing, TLB misses, latency.
Load balancing: limited parallel efficiency.
Parallel overhead: limits parallel scaling.
Operating system: scheduling, placement, I/O, ...
Hardware: CPU MHz, latency, BW, topology, I/O.

34 Tools: Delivering Choice of Parallelism
Software types and Intel® tools for parallelism (multi-core and beyond):
Single thread / sequential: Fortran, C/C++ compilers; performance libraries (Intel® Math Kernel Library, Intel® Integrated Performance Primitives).
Data-level parallelism: Ct, Intel® Streaming SIMD Extensions / Intel® Advanced Vector Extensions.
Task and data parallel (shared memory): OpenMP*, Intel® Threading Building Blocks, Intel® Cilk++ compiler.
Cluster parallel: Message Passing Interface (MPI).
Analysis tools: Intel® VTune™ Performance Analyzer, Intel® Thread Checker, Intel® Trace Analyzer and Collector.
Notes (partly translated): Bottom-up view. Our tools help you identify thread-level flow-control instructions that may be optimized; instruction-level parallelism relies on the compiler (e.g. if/then/else). They help you optimize data-parallel code and provide a strong foundation for adapting to new architectures (APIs). They optimize task-level parallelism at the process level and support the obvious standards, including MPI and OpenMP. They quickly identify loop-level parallelism, for example adding a number to a matrix. What I want to leave you with is that these tools are very comprehensive, but as Andrew Tanenbaum said, "Sequential programming is really hard," and the difficulty is that "parallel programming is a step beyond that." Given the complexity of parallelism, Intel has developed a series of tools for mainstream HPC server clusters; some have been integrated, simplified, and brought to the Windows development side as a subset of the Linux-side functionality: Amplifier (VTune, Thread Profiler), Composer (Compiler, IPP, OpenMP, TBB), Inspector (Thread Checker). Our tools will not simplify parallel programming itself; they simplify how you go about doing it, and that is what our software tools are all about. Selective to comprehensive parallelism methods and supporting standards, from HPC tools to Parallel Studio.

35 Segmentation of Intel Developer Tools Two Product Lines for Two Needs
Maximize parallel performance: C++ and Fortran on Windows*, Linux*, Mac OS* X; available now, investments continue.
Maximize parallel productivity: C++ using Visual Studio* on Windows*; available now, investments grow.

36 Intel® C++ and Fortran Compiler Professional Editions
Most comprehensive multicore and standards support: OpenMP* 3.0, auto-vectorization, auto-parallelization, parallel valarray, "parallel lint".
Advanced compilers and libraries support Intel and compatible processors in a single binary.
Make multithreaded application development practical: OpenMP*, Intel® Threading Building Blocks, auto-parallelization; the OpenMP* compatibility library supports Microsoft and GNU* OpenMP* implementations; includes a parallel debugger for IA-32/Intel® 64 Linux.
Maximize application performance: built-in optimization features including high-level optimization, the automatic vectorizer, interprocedural optimization (IPO), profile-guided optimization (PGO), and support for the latest processor capabilities (e.g. SSE4). Create optimized applications that run on Intel and AMD processors.
Utilize one application development toolset for 32- and 64-bit Windows*, Linux* and Mac OS* X.
Future Intel instruction set support for Advanced Vector Extensions (AVX) and Advanced Encryption Standard (AES); see whatif.intel.com for an instruction emulator.
The Professional Editions include: Intel® C++ Compiler and/or Intel® Fortran Compiler, Intel® Integrated Performance Primitives, Intel® Math Kernel Library, Intel® Threading Building Blocks.
Architectures: IA-32 (Intel 32-bit architecture), Intel® 64 (commonly known as x86-64, the 64-bit capability), IA-64 (Itanium). OS support: Windows*, Linux*, Mac OS* X.

37 Agenda
Parallel Computing and HPC overview
Parallel programming performance methodology
Multi-thread & OpenMP
MPI Tips
Hybrid Parallelization: MPI+OpenMP
Micro-Architecture optimization

38 Parallel Programming Models
Grid reprinted with permission of Dr. Phu V. Luong, Coastal and Hydraulics Laboratory, ERDC.
Functional decomposition (task parallelism): divide the computation, then associate the data; independent tasks of the same problem, e.g. atmosphere model, ocean model, land surface model, hydrology model.
Data decomposition: the same operation performed on different data; divide the data into pieces, then associate the computation.
Purpose of the slide: introduce the primary conceptual partitions in parallel programming, task and data.
Details: task parallelism has traditionally been used in threaded desktop apps (partitioned among screen update, disk read, print, etc.), data parallelism in HPC apps; both may be appropriate in different sections of an app.

39 Thread issues “We wrote regression tests that achieved 100% code coverage. The nightly build and regression tests ran on a two-processor SMP machine, which exhibited different thread behavior than the development machines, which all had a single processor. The Ptolemy II system itself became widely used, and every use of the system exercised this code. No problems were observed until the code deadlocked in April 2004, four years later.” -- “The Problem with Threads” by Edward A. Lee, IEEE Computer from May 2006.

40 Typical Performance Pitfall in Threading
Granularity: too fine-grained leads to higher overhead; too coarse-grained leads to load imbalance.
Concurrency: higher concurrency brings thread-switch overhead and higher latency; lower concurrency may lead to resource under-utilization.
Locks: freedom and order; you are the governor!
Thread library choice: use the correct API (LinuxThreads vs. NPTL); choose a high-level abstract threading model (Threading Building Blocks, Boost, ICE, ACE, etc.).

41 A Lock Case
The following case shows threading conflicts on the application's memory cache, which is visited frequently. Grouping the cache memory into different segments, each with its own lock, relieves the synchronization pressure between threads; splitting one lock into a set of locks led to a 21% performance boost.
Diagram: computation threads contending on one cache memory lock, versus computation threads hitting separate cache memory segments, one lock per segment.
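A minimal sketch of the lock-splitting idea (not the application's actual code; the hash function, segment count, and cache entry type are illustrative assumptions): each cache segment gets its own mutex, so threads touching different segments no longer contend on a single global lock.

    #include <pthread.h>
    #include <stdint.h>
    #include <string.h>

    #define NSEGMENTS 16                       /* number of lock stripes (assumed) */

    struct entry { uint64_t key; void *value; struct entry *next; };

    struct striped_cache {
        pthread_mutex_t lock[NSEGMENTS];       /* one lock per segment            */
        struct entry   *bucket[NSEGMENTS];     /* one bucket chain per segment    */
    };

    static void cache_init(struct striped_cache *c)
    {
        memset(c->bucket, 0, sizeof c->bucket);
        for (int i = 0; i < NSEGMENTS; i++)
            pthread_mutex_init(&c->lock[i], NULL);
    }

    /* map a key to its segment; a real implementation would use a better hash */
    static unsigned segment_of(uint64_t key) { return (unsigned)(key % NSEGMENTS); }

    static void *cache_lookup(struct striped_cache *c, uint64_t key)
    {
        unsigned s = segment_of(key);
        void *v = NULL;
        pthread_mutex_lock(&c->lock[s]);       /* contention limited to one stripe */
        for (struct entry *e = c->bucket[s]; e; e = e->next)
            if (e->key == key) { v = e->value; break; }
        pthread_mutex_unlock(&c->lock[s]);
        return v;
    }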

42 User thread & kernel thread inside Linux 2.6
Diagram: user space contains the computation processes, the OpenMP library and the thread library; kernel space contains active, ready and waiting kernel threads scheduled through per-CPU run queues; below that, the HAL and the hardware.

43 Affinity settings: important for multi-socket platforms
Works best with the default static schedule.
When using all logical processors: KMP_AFFINITY=compact[,0,verbose]. With 1 thread per core and HT enabled: GOMP_CPU_AFFINITY=1,3,5,7,9,11,13,15.
Set KMP_AFFINITY (e.g. =physical) for OpenMP* apps; this is especially important if Intel® Hyper-Threading Technology is enabled and the threads do not saturate all logical processors.
Compile-time option (Intel 11.1): -par_affinity=compact resembles the PGI* default; its functionality is in doubt, and it can't be overridden at run time.

44 What Is OpenMP?
Portable, shared-memory threading API for Fortran, C, and C++, with multi-vendor support for both Linux and Windows.
Standardizes task- and loop-level parallelism; supports coarse-grained parallelism; combines serial and parallel code in a single source.
Standardizes ~20 years of compiler-directed threading experience. The current spec is OpenMP 3.0: 318 pages (combined C/C++ and Fortran).
Script: What is OpenMP? OpenMP is a portable (OpenMP codes can be moved between Linux and Windows, for example), shared-memory threading API that standardizes task- and loop-level parallelism. Because OpenMP clauses have both lexical and dynamic extent, it is possible to support broad, multi-file, coarse-grained parallelism. Often the best technique is to parallelize at the coarsest grain possible, often parallelizing tasks or loops from within the main driver itself, as this gives the most bang for the buck (the most computation for the necessary threading-overhead cost). Another key benefit is that OpenMP allows a developer to parallelize an application incrementally. Since OpenMP is primarily a pragma- or directive-based approach, we can easily combine serial and parallel code in a single source: compiling with or without the /openmp compiler flag turns OpenMP on or off, and code compiled without the flag simply ignores the OpenMP pragmas, which gives easy access back to the original serial application. OpenMP also standardizes about 20 years of compiler-directed threading experience. For more information, or to review the latest OpenMP spec (currently OpenMP 3.0), go to the OpenMP web site.

45 OpenMP Fork-join parallelism:
The master thread spawns a team of threads as needed. Parallelism is added incrementally: the sequential program evolves into a parallel program. Diagram: master thread flowing through successive parallel regions.

46 Design: #pragma omp parallel for
The OpenMP parallelism is defined by the for loop; threads are created here for this parallel region:
    #pragma omp parallel for
    for( int i = start; i <= end; i += 2 ) {
        if( TestForPrime(i) ) {
            #pragma omp critical
            globalPrimes[gPrimesFound++] = i;
        }
        ShowProgress(i, range);
    }

47 OpenMP labs (afternoon)
PrimeOpenMP lab; thread-correctness check lab.
Performance tuning for OpenMP: avoid naïve OpenMP.
Parallel overhead: due to thread creation, scheduling, ...
Synchronization: excessive use of global data, contention for the same synchronization object.
Load imbalance: improper distribution of parallel work.
Granularity: not enough parallel work.
(A reduction-based variant that avoids the shared counter is sketched below.)
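One way to address the synchronization and load-imbalance items above (a sketch, not the official lab solution; it assumes only the count of primes is needed rather than the list) is to replace the critical section with an OpenMP reduction and a dynamic schedule:

    #include <stdio.h>

    /* illustrative stand-in for the lab's primality test */
    static int TestForPrime(int n)
    {
        if (n < 2) return 0;
        for (int f = 2; f * f <= n; f++)
            if (n % f == 0) return 0;
        return 1;
    }

    int main(void)
    {
        const int start = 3, end = 1000000;
        long count = 1;                       /* count 2 up front, then test odds only */

        /* each thread accumulates into a private copy of 'count'; the copies are
           summed once at the end of the region, so there is no per-iteration
           critical section                                                      */
        #pragma omp parallel for reduction(+:count) schedule(dynamic, 1000)
        for (int i = start; i <= end; i += 2)
            if (TestForPrime(i))
                count++;

        printf("%ld primes found\n", count);
        return 0;
    }

The dynamic schedule also helps with load imbalance, since larger candidates take longer to test.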

48 Agenda
Parallel Computing and HPC overview
Parallel programming performance methodology
Multi-thread & OpenMP
MPI Tips
Hybrid Parallelization: MPI+OpenMP
Micro-Architecture optimization

49 MPI version: Hello World
Notes (translated): MPI is a complex system. It provides programmers with a parallel-environment library, and programmers call MPI library routines to achieve the parallelism they need. MPI provides C and Fortran interfaces; it contained 129 functions in the MPI standard released in 1994, the 1997 revision (MPI-2) has more than 200, and about 30 are in common use today. A complete MPI program that solves many problems can be written using only the 6 most basic functions.
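The slide's Hello World source did not survive the transcript; a minimal equivalent in C, using only the basic calls mentioned above, would look like this:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, size, len;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);                      /* enter the MPI system        */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);        /* who am I?                   */
        MPI_Comm_size(MPI_COMM_WORLD, &size);        /* how many processes in total */
        MPI_Get_processor_name(host, &len);          /* e.g. "c0101"                */

        printf("Hello World! Process %d of %d on %s\n", rank, size, host);

        MPI_Finalize();                              /* leave the MPI system        */
        return 0;
    }

Launched with something like mpirun -np 4 ./hello, it produces output of the form shown on the next slide; the order of the lines is not deterministic.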

50 Result run on 4 machines
Output with all 4 processes on one node (c0101):
Hello World! Process 1 of 4 on c0101
Hello World! Process 0 of 4 on c0101
Hello World! Process 2 of 4 on c0101
Hello World! Process 3 of 4 on c0101
Output with the 4 processes spread across 4 machines:
Hello World! Process 3 of 4 on c0104
Hello World! Process 0 of 4 on c0101
Hello World! Process 2 of 4 on c0103
Hello World! Process 1 of 4 on c0102

51 Hello World MPI Runtime Behavior

52 CFD mesh domain decomposition: physical model, whole compute zone, blocks
Processes 0 through 7 each own a set of blocks; the domain totals 127 blocks, decomposed onto 8 processes.
Notes (translated): Page 11 adds an introduction to the TKANPARA.01 file; you can find it at /panfs/intel.cn/home/warren/testx43/1/TKANPARA.01. In that file, NBLOCK 127 means there are 127 grid blocks in this computation, and NPROC 8 means the 127 blocks are computed by 8 processes. The first 17 means the current process owns 17 grid blocks; the remaining numbers are the global block IDs (as in the third row of the figure: 17, 63, 109, ...).

53 General structure of an MPI program
Include the MPI header file; declare the related variables.
Call MPI_INIT(): enter the MPI system; the communicator MPI_COMM_WORLD is formed.
Call MPI_COMM_RANK(), Call MPI_COMM_SIZE(); optionally create new communicators, define new data types and process topologies.
Application body: 1. computation and control code; 2. inter-process communication.
Call MPI_FINALIZE(): exit the MPI system.
END

54 Ghost grids store the data of the neighboring block
Parallelization by spatial-domain block decomposition. Different grid connections (2D example): data needs to be sent to another process; data is received from the neighboring block; ghost grids store the data of the neighboring block.

55 Parallel computation workflow: master process, worker process 1, worker process 2, worker process 3

56 Typical CFD Parallel Framework
Use MPI-IO to read grid data and to write field data block by block via file I/O.
      SCOUNTER=0
      DO 25 IMODE = ISEND,IRECV
        DO 20 IB = 1,IBM
          ......
          IF (IMODE.EQ.ISEND) THEN
            CALL WSEND (...)
          ELSE
            CALL WCOPY (...)
          ENDIF
 20     CONTINUE
 25   CONTINUE
      CALL MPI_WAITALL(SCOUNTER,SREQ,STATUS,INFO)

57 21 tuning tips for Intel MPI
1. Make sure your cluster is properly configured
2. Use the Intel MPI automatic tuning utility
3. Build the application for highest performance
4. Use the best available communication fabric
5. Disable the fallback device for benchmarking
6. Use multi-rail capability
7. Use connectionless communication
8. Select the proper process layout
9. Use proper process pinning
10. Enable MPI/OpenMP* mixed mode for threaded apps
11. Disable dynamic connection mode for small jobs
12. Apply wait mode to oversubscribed jobs
13. Use Intel MPI lightweight statistics
14. Adjust the eager/rendezvous protocol threshold
15. Bypass shared memory for intranode transfers
16. Choose the best collective algorithms
17. Bypass cache for intranode transfers
18. Tune the message-passing progress engine
19. Reduce the size of pre-reserved memory for the DAPL communication device
20. Reduce the amount of memory consumed by the DAPL provider
21. Tune the TCP/IP connection
Categories (from the slide graphic): Prerequisites, Basics, Advanced, Black belt.

58 Performance issues: I/O execution with serial I/O
Ranks 0, 1, 2, 3, ...: Send/Recv to rank 0, then serial I/O to the file.
All processes send their data to rank 0, and the Send/Recv calls are time consuming. Rank 0 writes the data to the file: serial execution, low performance.

59 Parallel I/O API (e.g. MPI_File_write_at / MPI_File_read_at)
Performance issues: I/O execution with parallel I/O. Ranks 0, 1, 2, 3, ... each access the file directly through the parallel I/O API (e.g. MPI_File_write_at / MPI_File_read_at).
Multiple processes of a parallel program access data (reading or writing) in a common file: performance, portability, convenience.
Notes (translated): This page shows the MPI-IO test data; emphasize that the test was done on a single Panasas shelf. (At first I did not understand why the bandwidth was so low; later I learned from Paul that each user is bound to one shelf, and a shelf's theoretical bandwidth is about 300 MB/s. On page 27, the data for 128 nodes reaches or comes close to that 300 MB/s bandwidth; in other words, parallel I/O can make the most of the parallel file system's bandwidth.)
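A minimal sketch of the parallel-I/O pattern the slide describes (file name, element count, and datatype are placeholders, not the benchmark code): each rank computes its own offset and writes its block with MPI_File_write_at, so no gather to rank 0 is needed.

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        int rank, size;
        const int n = 1 << 20;                         /* doubles per rank (placeholder) */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double *block = malloc(n * sizeof *block);
        for (int i = 0; i < n; i++) block[i] = rank;   /* fill this rank's field data */

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "field.dat",     /* placeholder file name */
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* every rank writes its own contiguous block at a rank-dependent offset */
        MPI_Offset offset = (MPI_Offset)rank * n * sizeof(double);
        MPI_File_write_at(fh, offset, block, n, MPI_DOUBLE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        free(block);
        MPI_Finalize();
        return 0;
    }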

60 Writing an 8.5 GB file with 16 processes: parallel I/O achieved a 1.53x+ speedup
Performance issues: I/O execution as seen in Intel® Trace Analyzer and Collector. Writing an 8.5 GB file with 16 processes, parallel I/O achieved a speedup of 1.53x or better.

61 Application-level tuning: MPI & serial code
Application level: MPI-OpenMP* hybrid parallelism; load balance; reduce collective communication calls; reduce large MPI messages in collective operations; hide communication latency behind computation; adjust the communication topology to reduce communication time.
System level: high-speed interconnect; use a high-performance MPI; choose the best launch mode per topology; set MPI parameters per ITC/ITA profiling results.
Serial code: eliminate pipeline stalls.

62 NWS Application Characterization and MPI Tuning
HW: high CPU frequency, high memory-BW system, high network BW and low latency.
Performance bottlenecks: memory BW and network.
Balance computation and communication; hide communication latency behind computation; match the communication topology to the HW topology.
Mixed parallel model: MPI+OpenMP* for a fat-node cluster system.

63 GRAPES* optimization on an Intel® platform cluster
Problem: the master process gathers the data and writes it to disk; a single process collects data and writes it, with too many MPI gather calls on every I/O step (I/O time per step; lower is better).
Solution: each process calls MPI-IO and writes its data to disk in parallel.
Results of the MPI-IO improvement: a 30% performance gain at 128 cores; as the process count increases, the parallel speedup improves from 1.4 to …
Background (translated): GRAPES is the new-generation Global/Regional Assimilation and PrEdiction System. The GRAPES regional mesoscale numerical prediction system (GRAPES-Meso), with 30 km horizontal resolution, went into official national operational use at the Central Meteorological Observatory in 2006; in 2007 the horizontal resolution was upgraded to 15 km. Before: inside the nested loops a single process writes the horizontal multi-variable grid data for each vertical level to the disk file. After: the processes owning each grid partition write to the file system in parallel.

64 Performance issues Poor parallel paradigm

65 MPI algorithm tuning
Poor parallel paradigm (the blocking receive serializes communication and computation):
……begin the loop……
      call MPI_ISEND (buf_s, len_s, MPI_REAL, nbr_s, tag, comm, requests, ierr)
      call MPI_RECV  (buf_r, len_r, MPI_REAL, nbr_r, tag, comm, status, ierr)
      call MPI_WAIT  (requests, status, ierr)
      Perform computation (whether or not it requires the message data)
……end the loop……
Improved paradigm (nonblocking send and receive, computation overlapped with communication):
……begin the loop……
      call MPI_ISEND (buf_s, len_s, MPI_REAL, nbr_s, tag, comm, requests(1), ierr)
      call MPI_IRECV (buf_r, len_r, MPI_REAL, nbr_r, tag, comm, requests(2), ierr)
      Perform computation (not requiring the message data)
      call MPI_WAITALL (2, requests, statuses, ierr)
      Perform computation (requiring the message data)
……end the loop……

66 MPI algorithm tuning: better parallel paradigm
The improved loop above works fine in the MPI system and achieves overlap of communication and computation.

67 Case Study: load balance with ITAC
Application: CCSM3.
Baseline: one process for the cpl model, one for the lnd model, one for the ice model; 4 processes for the ocn model and 4 for the atm model.
Increase the process count of the cpl model to reduce CCSM initialization time relative to the baseline: with cpl processes increased from 1 to 4, initialization time dropped from 85 s to … , and in the same period the total effective iterations increased from 5 to 10.
Increase only the process count of the atm model to reduce computation time relative to the baseline: with atm processes increased from 4 to 8, computation time dropped from 230 s to … .
Best balance: increase the process counts of both the cpl and atm models to achieve load balance between the components of CCSM.

68 Agenda
Parallel Computing and HPC overview
Parallel programming performance methodology
Multi-thread & OpenMP
MPI Tips
Hybrid Parallelization: MPI+OpenMP
Micro-Architecture optimization

69 HPC Clustering – Hybrid Parallelism
Cluster of shared-memory nodes: message passing (MPI) over the interconnect between nodes; multi-threading within each (SMP) node, each with its own cores, memory and I/O.
Notes: In general, modern HPC systems today are clusters of shared-memory nodes of some kind, more and more often consisting of multi-core SMP nodes, so a single cluster node increasingly becomes a larger SMP system in itself. As we have seen, multi-threading is a good way to implement parallelism within such multi-core/SMP nodes, while message passing can be used between the different (distributed) SMP nodes. For this reason, we focus on these two main programming models and software tools for parallel HPC application development.
Two main programming models in HPC: multi-threading & message passing. Multi-core/multi-threading + multi-processing = hybrid parallelism.

70 Hybrid MPI/OpenMP* programming
Short history (less than 10 years). SMP-based clusters provide a perfect development platform. The cost of MPI collective functions and the limited scalability of OpenMP* push users toward hybrid programming for solving large (huge) problems. This is a future trend for solving huge problems on large numbers of cores; commercial ISVs have recently started to join the game. Diagram: SMP node with socket 1 (quad-core CPU), socket 2, ...; node interconnect.

71 MPI vs. OpenMP*
Pure MPI pros: portable to distributed- and shared-memory machines; scales beyond one node; no data placement problem.
Pure MPI cons: difficult to develop and debug; high latency, low bandwidth; explicit inter-process communication; large granularity; difficult load balancing.
Pure OpenMP* pros: easy to implement parallelism; low latency, high bandwidth; implicit communication; coarse and fine granularity; dynamic load balancing.
Pure OpenMP* cons: only on shared-memory machines; scales only within one node; possible data placement problem; no specific thread order; mostly limited to "fork-join" parallelism.
MPI model: local data in each process; a sequential program on each CPU; explicit message passing by calling MPI_Send and MPI_Recv; data placement is not critical.
OpenMP model (shared data):
    some_serial_code
    #pragma omp parallel for
    for (j=…; …; j++)
        block_to_be_parallelized
    again_some_serial_code
The master thread runs the serial parts; the other threads are sleeping.

72 Why Hybrid? Is hybrid MPI/OpenMP* better than pure MPI?
The hybrid MPI/OpenMP* paradigm is a software trend for SMP-based clusters. It is natural in concept and architecture: MPI across nodes and OpenMP* within nodes. It makes good use of shared-memory system resources (memory, latency, and bandwidth) and avoids the extra communication overhead of MPI within a node or within a socket. OpenMP* adds fine granularity (MPI messages become larger) and allows increased and/or dynamic load balancing. Some problems have two-level parallelism naturally; some problems can only use a restricted number of MPI tasks. Hybrid could have better scalability than either MPI or OpenMP* alone. Is hybrid MPI/OpenMP* better than pure MPI?

73 Position of Hybrid Programming
The current "official" statement is that hybrid MPI/OpenMP* might be better than pure MPI at very high process counts. (Statement source: SC08.)

74 Try hybrid: SMP cluster performance model, Amdahl in the real world
Amdahl's rule: TN = (1-p)·T1 + (p/N)·T1
For the pure MPI model: TN = (1-p)·T1 + (p/N)·T1 + Tmpi
For MPI + OpenMP threads: TN = (1-p)·T1 + (p/(N_MPI × N_thread × TR%))·T1 + T'mpi + ΣTth
Structure: MPI_Init ... !$omp parallel ... !$omp end parallel ... MPI_Finalize, with inter-process communication between the MPI ranks.
Notes (translated): The upper MPI layer expresses parallelism across nodes; the lower OpenMP layer expresses multi-threaded parallelism within a node. First the problem is domain-decomposed into parts that do not communicate heavily; each part is assigned to one SMP node, and nodes communicate through MPI message passing. Then, within each process, OpenMP compiler directives decompose the work again and distribute it to the different processors of the SMP node, where it is executed in parallel by multiple threads communicating through shared memory. Figure 5.1 depicts this implementation of the hybrid MPI+OpenMP programming model on an SMP cluster.

75 Hybrid Parallelization Strategies
Starting from sequential code: decompose with MPI first, then add OpenMP*. Starting from OpenMP* code: treat it as serial code. Starting from MPI code: add OpenMP*. The simplest and least error-prone way is to use MPI outside the parallel regions and allow only the master thread to communicate between MPI tasks; MPI could also be used inside parallel regions with a thread-safe MPI. Progression: serial code, MPI code, hybrid MPI/OpenMP* code (steps 1, 2, 3). A sketch of the funneled style follows.
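A minimal sketch of the "MPI outside the parallel region, only the master thread communicates" style described above (the work array, its size, and the reduction are illustrative placeholders):

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    #define N 1000000                         /* local work size (placeholder) */

    int main(int argc, char *argv[])
    {
        int provided, rank;
        static double x[N];
        double local = 0.0, global = 0.0;

        /* FUNNELED: only the thread that called MPI_Init_thread makes MPI calls */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int i = 0; i < N; i++) x[i] = rank + i * 1e-6;

        /* OpenMP threads share the node-local computation ... */
        #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < N; i++)
            local += x[i] * x[i];

        /* ... and only the master thread, outside the parallel region,
           participates in the inter-node MPI communication              */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0) printf("global sum of squares = %g\n", global);
        MPI_Finalize();
        return 0;
    }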

76 Data decomposition in WRF
MPI for coarse-grained domain decomposition: point-to-point communications for halo exchange; collective operations if there are nested domains. The per-process domain part is further decomposed into multiple tiles; multiple decomposition algorithms are available to match different architectures; each OpenMP thread processes one or more tiles. Diagram: 16 MPI domains (MPI 1 through MPI 16), each split into tiles (Tile 1 through Tile 16) processed by OpenMP threads 1 through 4.

77 Hybrid setup & halo exchange
Pure MPI setup vs. hybrid setup (2 threads): in the hybrid case, some boundaries that were exchanged (EXCH) between MPI processes in the pure MPI case are simply shared between OpenMP threads 1 and 2 of one process. Therefore, less data is transmitted.

78 Mapping hybrid MPI processes to hardware
Process/thread placement should match the hardware: threads should share (at least some portion of) a cache; use explicit pinning to avoid thread migration; a process should not cross a socket boundary (excessive cache-coherency traffic, possible memory access penalties on NUMA setups). Process affinity setup used on a 2-way Intel® Xeon® E54xx node: each MPI process runs 2 OpenMP threads, and each thread is pinned to its own core (processes 1 and 2 on CPU 1, processes 3 and 4 on CPU 2). Experiments were made with other pinning setups; this one was found to be optimal. A pinning sketch follows.
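For reference, explicit pinning of OpenMP threads can be done on Linux roughly as follows (a sketch; in practice the KMP_AFFINITY / GOMP_CPU_AFFINITY environment variables from the earlier slide, or the MPI library's pinning options, achieve the same thing without code changes, and the simple thread-to-core mapping here ignores sockets and Hyper-Threading):

    #define _GNU_SOURCE
    #include <omp.h>
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        #pragma omp parallel
        {
            int tid = omp_get_thread_num();

            /* pin this OpenMP thread to core 'tid' */
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(tid, &set);
            sched_setaffinity(0, sizeof(set), &set);   /* 0 = calling thread */

            printf("thread %d pinned to core %d (now on CPU %d)\n",
                   tid, tid, sched_getcpu());
        }
        return 0;
    }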

79 CFD example
Source code: Main.F (the MPI environment is initialized here):
C     MPI INITIATE
      CALL SetParalle(NProc,CProc,Master)
      CALL INITIATE
      CALL INIPAR
C     PERFORM INPUT AND PRINT INPUT SHEET
      CALL READIN
C     SETUP DATA FOR CURRENT GRID LEVEL
      CALL SETUPM
      CALL RUNGEK
      ……
Source code: RUNGEK.F (computation intensive):
      ……begin the loop……
      call MPI_ISEND (buf_s, len_s, MPI_REAL, nbr_s, tag, comm, requests_s, ierr)
      call MPI_IRECV (buf_r, len_r, MPI_REAL, nbr_r, tag, comm, requests_i, ierr)
      Perform computation (not requiring message data)
      call MPI_WAITALL (2, requests_i&s, status, ierr)
      CALL TVISC
      CALL AUSM
      ……end the loop……
Source code: TVISC.F, AUSM.F (OpenMP implementations):
!$omp parallel do private(i,n,j,fluxc,fs,gs,dw) firstprivate(hs)
      DO 100 K = 2,KL
        FLUXWJ2(I,1,K,I2M,J2M,K2M,W,SI,SJ,SK,P,FLUXC)
 100  CONTINUE

80 Performance gained: scalability & hybrid paradigm
Pure MPI (8 processes) vs. hybrid (4 processes × 2 threads/process).

81 Performance issues Scalability & Hybrid paradigm

82 Hybrid has better scalability
Scalability & hybrid paradigm: hybrid overhead on Gigabit Ethernet vs. InfiniBand. The hybrid version has better scalability.

83 Yes & No for Hybrid Hybrid configurations show better performance in most of the cases Similar or better scalability than MPI Lower pressure on interconnect Less data participates in halo exchange (point-to-point communications) Fewer processes involved in collective communications We realized that hybrid MPI/OpenMP* version is only useful for specific cases on large number of cores.

84 Agenda
Parallel Computing and HPC overview
Parallel programming performance methodology
Multi-thread & OpenMP
MPI Tips
Hybrid Parallelization: MPI+OpenMP
Micro-Architecture optimization

85 Characterize App. by micro-arch indicators
BKM        Description                       Compile command                                              Time (s)   Speedup
Baseline   -                                 icc lbm.cpp -Wno-deprecated                                  11.199     -
BKM #1     double -> float,                  icc lbmf.cpp -Wno-deprecated                                   8.444    32.63%
           long double -> float
BKM #2     + -ipo                            icc lbmf.cpp -Wno-deprecated -ipo                              5.938    88.60%
BKM #3     + -no-prec-div                    icc lbmf.cpp -Wno-deprecated -ipo -no-prec-div                 5.081   120.41%
BKM #4     + -O3                             icc lbmf.cpp -Wno-deprecated -ipo -no-prec-div -O3             4.735   136.52%
Baseline -> BKM#4: time 11.2 s -> 4.7 s; CPI -> 0.7; x87 -> 0.0; SSE -> 43; sSSE 0.0 -> 56.6.

86 ISA Intel® Advanced Vector Extensions (Intel® AVX)
Vector FP offers scalable FLOPS and impressive FLOPS/W. Roughly 15% gain per year comes from frequency and micro-architecture alone.
Intel® microarchitecture (Nehalem, 2009): superior memory latency and BW; fast unaligned load support; 128-bit XMM registers (introduced in 1999).
Intel® microarchitecture (Westmere, 2010): cryptography acceleration instructions.
Intel® AVX (2011): 2x throughput for vector FP; 2x load throughput; 3-operand instructions; 256-bit YMM registers. Vectors increase from 128 to 256 bits; the new state extends/overlays SSE, and the lower part (bits 0-127) of the YMM registers is mapped onto the XMM registers.
Future extensions (beyond 2011): hardware FMA (fused multiply-add) and many more options.
Notes: Penryn added 47 new SSE instructions. Nehalem added 7 new instructions: 4 for acceleration of text/string/XML parsing, plus POPCNT and CRC32. Westmere added AES encryption/decryption with >3x performance improvement over existing code. All instruction details are public (on intel.com). Intel® AVX is a general-purpose architecture building upon SSE; a variety of software development tools to help exploit Intel® AVX instructions are available. All products, computer systems, dates, and figures specified are preliminary, based on current expectations, and are subject to change without notice.

87 SSE Data Types & Speedup Potential
SSE data types: 4x floats; 16x bytes, 8x 16-bit shorts, 4x 32-bit integers, 2x 64-bit integers, 1x 128-bit integer (SSE-2/3/4); 2x doubles.
The potential speedup (in the targeted loop) is roughly the same as the packing factor, i.e. for floats the speedup is ~4x. Example: the SSE scalar-to-vector speedup is 128 bits divided by the size of the data type: 128b/32b = 4x possible speedup.

88 Goal of SSE(x)
Scalar processing (traditional mode): one instruction produces one result, X + Y.
SIMD processing with SSE(2,3,4): one instruction produces multiple results; X + Y is computed as (x3,x2,x1,x0) + (y3,y2,y1,y0) = (x3+y3, x2+y2, x1+y1, x0+y0).
SIMD uses the full width of the XMM registers, many functional units, and a choice of many instructions; however, not all loops can be vectorized, and most function calls cannot be vectorized.
The goal is to take scalar code (on the left) and turn it into vector code (on the right); this is at the heart of using as much of the SIMD hardware as is available.
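As a concrete illustration (a sketch, not from the deck), the packed-add picture above corresponds directly to the _mm_add_ps intrinsic; the array size and the use of unaligned loads here are simplifying assumptions:

    #include <xmmintrin.h>   /* SSE: __m128, _mm_loadu_ps, _mm_add_ps, ... */
    #include <stdio.h>

    #define N 16

    int main(void)
    {
        float x[N], y[N], sum[N];
        for (int i = 0; i < N; i++) { x[i] = (float)i; y[i] = 100.0f + i; }

        /* process 4 floats per instruction: exactly the x3..x0 + y3..y0 picture */
        for (int i = 0; i < N; i += 4) {
            __m128 vx = _mm_loadu_ps(&x[i]);
            __m128 vy = _mm_loadu_ps(&y[i]);
            _mm_storeu_ps(&sum[i], _mm_add_ps(vx, vy));
        }

        for (int i = 0; i < N; i++)
            printf("%g ", sum[i]);
        printf("\n");
        return 0;
    }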

89 Use Compiler Vectorization Report
Add assertions using application knowledge to help the compiler. Modify the compiler switch settings (-x[S, T, P, O, W, N, B]). Modify the source if needed (pragmas: #pragma ivdep, etc.). Generate vec-reports: icc [optimizations] -vec-report[n] example.cpp. Note: use the maximum -vec-report level; the default level does not identify compiler problems or point to potential solutions. Version 11.1 has a better vectorizer.
Example 1:
    for (i=0; i<100; i++) {
        a[i] = 0;
        for (j=0; j<100; j++) {
            a[i] += b[j][i];
        }
    }
9.1:  file.c(14): (col. 3) remark: loop was not vectorized: not inner loop.
      file.c(16): (col. 5) remark: loop was not vectorized: vectorization possible but seems inefficient.
10.0: file.c(16): (col. 8) remark: PARTIAL LOOP WAS VECTORIZED.
      file.c(16): (col. 8) remark: loop was not vectorized: not inner loop.
      file.c(14): (col. 10) remark: PERMUTED LOOP WAS VECTORIZED.
Effectively the compiler generates:
    a[0:99:1] = 0;
    for (j=0; j<100; j++) {
        a[0:99:1] += b[j][0:99:1];
    }
Example 2:
    short a[N], e[N];
    …
    for (i=0; i<=n; i++)
        a[i] = (e[i] < 0) ? -e[i] : e[i];
9.1:  file.c(23): (col. 4) remark: loop was not vectorized: operator unsuited for vectorization.
10.0: file.c(22): (col. 1) remark: LOOP WAS VECTORIZED.  Effectively: a[0:n:1] = ABS(e[0:n:1]);

90 Micro-arch level "Vectorization" Tuning
Vectorization boosted performance by 30% on Nehalem (Intel® microarchitecture codenamed Nehalem); CPI went down from 0.99 to 0.65. Sometimes we cannot change the loop structure to help the compiler generate SSE code because of the complexity of the algorithm; in such cases, SSE intrinsics and the vector classes can be used to develop the code. In this example, SIN took ~40% of the CPU cycles and the compiler could not auto-vectorize the loop because of a branch inside it. The fix: separate the branch from the computation in the loop and parallelize the computation with packed SSE. The baseline took more than 600K cycles; after vectorization it takes only 90K cycles.

91 High CPU Clock High CPI

92 Scalar single/double mul/add/mov/... cost lots of clock cycles!
There is no FSB/cache bottleneck, but the code is inefficient: 1. it uses scalar operations; 2. it uses cvtps2pd to convert float to double and then cvttsd2si to convert double to integer. Scalar single/double mul/add/mov/... cost lots of clock cycles!

93 Vectorization: Modify the Source Code
Simple code may not lead to good performance, e.g. evaluating the result of a float multiply (or divide) and adding it to an integer variable. Break a complex, heavy loop into multiple simple loops, which can then be fully vectorized to use SSE/SSE2/SSE3 instructions; introduce temporary variables to simplify the micro-operations: a vectorized float multiply-add loop, a vectorized float-to-integer conversion loop, and a vectorized float evaluation loop. Result: nearly 8x performance improvement on the quad-core Intel® Xeon® processor (Clovertown). GeoBenchmark code modification and sharing granted by Kurin for Intel IDF usage.
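To make the loop-splitting idea concrete, here is a schematic before/after (a sketch with made-up arrays; the real GeoBenchmark kernel is not reproduced here). The mixed float/int loop defeats the vectorizer, while the split version gives the compiler three simple, independently vectorizable loops:

    #include <math.h>
    #include <stdio.h>

    #define N 4096

    /* Before: one heavy loop mixing float math, a conversion to int,
       and integer accumulation -- hard for the compiler to vectorize. */
    static void kernel_before(const float *a, const float *b, int *acc, float *out)
    {
        for (int i = 0; i < N; i++) {
            float v = a[i] * b[i] + 1.0f;
            acc[i] += (int)v;              /* float->int conversion in the loop */
            out[i]  = sqrtf(v);
        }
    }

    /* After: split into simple loops with a temporary array; each loop
       is a clean candidate for packed SSE code generation.             */
    static void kernel_after(const float *a, const float *b, int *acc, float *out)
    {
        static float tmp[N];               /* temporary carrying v between loops */

        for (int i = 0; i < N; i++)        /* vectorized float multiply-add      */
            tmp[i] = a[i] * b[i] + 1.0f;

        for (int i = 0; i < N; i++)        /* vectorized float->int conversion   */
            acc[i] += (int)tmp[i];

        for (int i = 0; i < N; i++)        /* vectorized float evaluation        */
            out[i] = sqrtf(tmp[i]);
    }

    int main(void)
    {
        static float a[N], b[N], out[N];
        static int acc[N];
        for (int i = 0; i < N; i++) { a[i] = i * 0.5f; b[i] = 2.0f; }
        kernel_before(a, b, acc, out);
        kernel_after(a, b, acc, out);
        printf("%f %d\n", out[N - 1], acc[N - 1]);
        return 0;
    }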

94 SSE Decompress Algorithm
Example: decompression using SSE intrinsics (17-bit values):
1. Load a pre-fetched 128-bit segment of input data into an SSE register.
2. Copy the compressed values into the target DWORD "32-bit segments" with _mm_shuffle_epi8 and the shuffle mask _mm_set_epi8(0xFF, 8, 7, 6, 0xFF, 6, 5, 4, 0xFF, 4, 3, 2, 0xFF, 2, 1, 0); this uses the super shuffle engine.
3. Align the values from the unequally shifted DWORDs using packed AND, packed multiply and packed shift right.
4. Store the uncompressed values (e.g. 2702, 1772, 65536, 110300).
5. Use PTEST to check for exceptions.
Note: the enhanced cache-line-split load in SSE4 is 2x-4x faster than the best case on Core2, but still not free: one cache-split load adds 4-6 cycles.

95 SSE Decompress Algorithm
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing.
Results (MB/s, higher is better) after library integration:
Function    Cycles   CPI
intersect   30%      1.7
Decode      20%      0.5
Decode() was further reduced to ~10% by SSE.

96 Micro-Arch Level “Vectorization” Tuning: VLC3D code Performance on Intel® AVX
Performance tests and ratings are measured using the SDE simulator with AVX support. Any difference in system hardware or software design or configuration may affect actual performance.
Instruction mix before optimization: BASE 13.9%; SSE/SSE2/SSE3/SSE4 ~0 (516042 ops); AVX scalar 83%; AVX-128 3%; AVX-256 0%.
Instruction mix after optimization: BASE 31%; SSE ~0 (516042 ops); AVX scalar 24%; AVX-128 17%; AVX-256 44%.
4.67x speedup from vectorization relative to X5400; the hotspot Potential_LJ(), which takes 95% of CPU time, is ~100% vectorized with AVX-256; the AVX-256 instruction share rose from 0 to 44%.

97 Optimizing memory allocation with Intel® TBB
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests; any difference in system hardware or software design or configuration may affect actual performance.
App 1 + Intel® TBB: 2.3x faster memory allocation vs. libc, 1.12x overall. App 2 with Intel® TBB, Google* TCMalloc & Hoard vs. libc: 1.5x faster memory allocation vs. libc, 1.08x overall. Both the memory allocation speedup and the overall app speedup are shown. malloc is no longer visible in libc's profile: calls are intercepted by libtbbmalloc_proxy.so.2 and forwarded to libtbbmalloc.so.2.
Notes: Intel® TBB malloc speedup: libc time initially 69K; libc with Intel® TBB 5K; diff = libc malloc = 46K; Intel® TBB time = 20K. Intel® TBB and the other allocators give additional improvement on threaded apps compared to libc malloc. Explain the impact of the program's allocated memory being larger, and the App 1 cache-manager slowdown with Intel® TBB. Hoard version 351.

98

