
Intel® Parallel Studio XE 2013 SP1 Intel® Cluster Studio XE 2013 SP1



Presentation on theme: "Intel® Parallel Studio XE 2013 SP1 Intel® Cluster Studio XE 2013 SP1"— Presentation transcript:

1 Intel® Parallel Studio XE 2013 SP1 Intel® Cluster Studio XE 2013 SP1
Helping Developers Efficiently Produce Fast, Scalable and Reliable Applications

2 Intel® Parallel Studio XE Intel® Cluster Studio XE (30 seconds & 3 minutes)
30 seconds – use just the next 2 slides
3 minutes – use the next 3 slides
Optional – “What’s new” slide for existing customers

3 Intel® Parallel Studio XE 2013 SP1 Intel® Cluster Studio XE 2013 SP1
Helping Developers Efficiently Produce Fast, Scalable and Reliable Applications

4 More Cores. Wider Vectors. Performance Delivered
Intel® Parallel Studio XE and Intel® Cluster Studio XE – Scaling Performance Efficiently

More Cores: multicore to many-core (50+ cores) – serial, task & data parallel, and distributed performance
Wider Vectors: 128 bits to 256 bits to 512 bits
Industry-leading performance from advanced compilers, comprehensive libraries, parallel programming models, and insightful analysis tools

Scientific, engineering, and enterprise hardware configurations are ever growing in compute capacity: more cores provide greater parallelism opportunities, and wider vectors allow more data throughput. Software applications require the right tools and methodologies to program efficiently for performance and scalability on the platforms that run them, and Intel provides these tools. Intel® Parallel Studio XE combines industry-leading compilers, libraries, error checking, and performance profiling tools for C/C++ and Fortran. The suite enables software developers to productively boost application performance on today’s and tomorrow’s hardware while preserving investment in existing code. Single-language editions are also available as Intel® C++ Studio XE and Intel® Fortran Studio XE. Intel® Cluster Studio XE builds on Intel® Parallel Studio XE by adding Intel’s high-performance MPI tools, enabling software developers to boost performance on multicore, many-core, and cluster systems.

5 Efficiently Produce Fast, Scalable and Reliable Applications
Intel® Parallel Studio XE 2013 and Intel® Cluster Studio XE 2013 Service Pack 1

Phase: Build
  Intel® Composer XE – Compilers, Performance and Threading Libraries – Out-of-the-box performance
  Intel® MPI Library† – High-Performance Message Passing (MPI) Library – Interconnect independence
  Intel® Advisor XE – Threading Prototyping Tool (Studio XE products only) – Simplifies parallel application design
Phase: Verify & Tune
  Intel® VTune™ Amplifier XE – Performance Profiler – Find performance bottlenecks
  Intel® Inspector XE – Memory & Threading Dynamic and Static Analysis – Code quality, improved security
  Intel® Trace Analyzer & Collector† – MPI Performance Profiler – Find performance bottlenecks in cluster-based applications

This table represents the products delivered with Intel® Parallel Studio XE and Intel® Cluster Studio XE. In each phase of the development cycle, key tools are needed, and Intel provides them: compiler technology, high-performance libraries, an MPI library, verification and tuning tools, and parallel programming models.

† Indicates the product is a component of Intel® Cluster Studio XE 2013 only.

6 Faster code + Simplified development
Top New Features

Performance
- Improved compiler and library performance
- Intel® AVX-512 ready
- Broadwell & Haswell-EP microarchitecture optimizations
- Windows* support for Intel® Xeon Phi™ coprocessor

Analysis Efficiency
- Better data mining for performance tuning
- Incremental analysis and easier suppression management
- Easier migration from other debugging tools
- Enhanced MPI analysis interface

Productivity
- Improved conditional numerical reproducibility
- Enhanced GDB for Linux* and OS X*

Cross-Platform Portability
- OpenMP* 4.0 SIMD and target constructs
- Expanded C++11
- Expanded Fortran 2003 & 2008
- Improved MPI 2.2 support

Performance: Our industry-leading compilers are ready to support Intel AVX-512 plus next-generation Xeon processors, including Broadwell and Haswell-EP. We have extended our Xeon Phi support to include Windows-hosted machines.

Analysis Efficiency: VTune Amplifier not only collects a wealth of useful data, it provides powerful analysis tools to mine that data for useful information. Inspector’s improved error suppression eliminates false errors, not real ones, and developers can now import suppressions from Purify* and Valgrind* into Inspector on Linux*. Cluster developers will find a fresh look and feel in the enhanced Intel® Trace Analyzer graphical interface, allowing a more streamlined analysis flow.

Productivity: Last year we introduced conditional numerical reproducibility. This year, Intel® Math Kernel Library saves developers time by removing the prerequisite of memory alignment. New debugging support on Linux* and OS X* comes via the GNU Project Debugger* (GDB*) with Intel extensions for branch tracing, data race detection, Pointer Checker support, and Intel® Transactional Synchronization Extensions.

Cross-Platform Portability: Intel is committed to supporting standards. This year we have added OpenMP 4.0 SIMD and target constructs, expanded our C++11, Fortran 2003, and Fortran 2008 support, and improved our MPI support.
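The OpenMP 4.0 SIMD construct mentioned above can be sketched in standard C++. This is a minimal illustration, not from the deck: the function name `dot` is mine, and when the code is built without OpenMP support the pragma is simply ignored, leaving a correct serial loop.

```cpp
#include <vector>
#include <cstddef>

// Reduction over a vectorizable loop using the OpenMP 4.0 simd construct.
// Without -fopenmp (or /Qopenmp) the pragma is ignored and the loop runs
// serially with identical results.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    #pragma omp simd reduction(+:sum)
    for (std::size_t i = 0; i < a.size(); ++i)
        sum += a[i] * b[i];
    return sum;
}
```

The same construct is what lets the compiler emit SSE, AVX, or AVX-512 code for the loop body without changing its serial semantics.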

7 Intel® Parallel Studio XE Intel® Cluster Studio XE (30 minutes)
Use this for a deeper dive into our tools

8 Intel® Parallel Studio XE 2013 Intel® Cluster Studio XE 2013
Helping Developers Efficiently Produce Fast, Scalable and Reliable Applications

9 More Cores. Wider Vectors. Performance Delivered
Intel® Parallel Studio XE and Intel® Cluster Studio XE – Scaling Performance Efficiently

More Cores: multicore to many-core (50+ cores) – serial, task & data parallel, and distributed performance
Wider Vectors: 128 bits to 256 bits to 512 bits
Industry-leading performance from advanced compilers, comprehensive libraries, parallel programming models, and insightful analysis tools

Scientific, engineering, and enterprise hardware configurations are ever growing in compute capacity: more cores provide greater parallelism opportunities, and wider vectors allow more data throughput. Software applications require the right tools and methodologies to program efficiently for performance and scalability on the platforms that run them, and Intel provides these tools. Intel® Parallel Studio XE combines industry-leading compilers, libraries, error checking, and performance profiling tools for C/C++ and Fortran. The suite enables software developers to productively boost application performance on today’s and tomorrow’s hardware while preserving investment in existing code. Single-language editions are also available as Intel® C++ Studio XE and Intel® Fortran Studio XE. Intel® Cluster Studio XE builds on Intel® Parallel Studio XE by adding Intel’s high-performance MPI tools, enabling software developers to boost performance on multicore, many-core, and cluster systems.

10 Efficiently Produce Fast, Scalable and Reliable Applications
Intel® Parallel Studio XE 2013 and Intel® Cluster Studio XE 2013 Service Pack 1

Phase: Build
  Intel® Composer XE – Compilers, Performance and Threading Libraries – Out-of-the-box performance
  Intel® MPI Library† – High-Performance Message Passing (MPI) Library – Interconnect independence
  Intel® Advisor XE – Threading Prototyping Tool (Studio XE products only) – Simplifies parallel application design
Phase: Verify & Tune
  Intel® VTune™ Amplifier XE – Performance Profiler – Find performance bottlenecks
  Intel® Inspector XE – Memory & Threading Dynamic and Static Analysis – Code quality, improved security
  Intel® Trace Analyzer & Collector† – MPI Performance Profiler – Find performance bottlenecks in cluster-based applications

This table represents the products delivered with Intel® Parallel Studio XE and Intel® Cluster Studio XE. In each phase of the development cycle, key tools are needed, and Intel provides them: compiler technology, high-performance libraries, an MPI library, verification and tuning tools, and parallel programming models.

† Indicates the product is a component of Intel® Cluster Studio XE 2013 only.

11 Faster code + Simplified development
Top New Features

Performance
- Improved compiler and library performance
- Intel® AVX-512 ready
- Broadwell & Haswell-EP microarchitecture optimizations
- Windows* support for Intel® Xeon Phi™ coprocessor

Analysis Efficiency
- Better data mining for performance tuning
- Incremental analysis and easier suppression management
- Easier migration from other debugging tools
- Enhanced MPI analysis interface

Productivity
- Improved conditional numerical reproducibility
- Enhanced GDB for Linux* and OS X*

Cross-Platform Portability
- OpenMP* 4.0 SIMD and target constructs
- Expanded C++11
- Expanded Fortran 2003 & 2008
- Improved MPI 2.2 support

Performance: Our industry-leading compilers are ready to support Intel AVX-512 plus next-generation Xeon processors, including Broadwell and Haswell-EP. We have extended our Xeon Phi support to include Windows-hosted machines.

Analysis Efficiency: VTune Amplifier not only collects a wealth of useful data, it provides powerful analysis tools to mine that data for useful information. Inspector’s improved error suppression eliminates false errors, not real ones, and developers can now import suppressions from Purify* and Valgrind* into Inspector on Linux*. Cluster developers will find a fresh look and feel in the enhanced Intel® Trace Analyzer graphical interface, allowing a more streamlined analysis flow.

Productivity: Last year we introduced conditional numerical reproducibility. This year, Intel® Math Kernel Library saves developers time by removing the prerequisite of memory alignment. New debugging support on Linux* and OS X* comes via the GNU Project Debugger* (GDB*) with Intel extensions for branch tracing, data race detection, Pointer Checker support, and Intel® Transactional Synchronization Extensions.

Cross-Platform Portability: Intel is committed to supporting standards. This year we have added OpenMP 4.0 SIMD and target constructs, expanded our C++11, Fortran 2003, and Fortran 2008 support, and improved our MPI support.

12 Boost Performance
Not just increase performance, but efficiently boost performance

13 Support for Latest Intel Processors and Coprocessors
Intel® Haswell microarchitecture – Intel® Broadwell microarchitecture – Intel® Xeon Phi™ coprocessor (now with Windows* support)

Supporting tools: Intel® C++ and Fortran Compilers, Intel® TBB library, Intel® MKL library, Intel® MPI library, Intel® VTune™ Amplifier XE†, Intel® Inspector XE††

All of the tools will support the new processors as they ship. MKL includes optimized support for next-generation processors before they are available, giving software developers time to update their applications so they are ready to scale before next-generation systems arrive.

MKL 11.0 includes support for two new architectures: Intel® Xeon Phi™ coprocessors and Haswell CPUs. For the Intel® Xeon Phi™ coprocessor, MKL includes advanced load estimation and balancing algorithms to automatically determine how much of the compute workload should be offloaded to one or more Intel® Xeon Phi™ coprocessor cards and how much should run on the CPU; alternatively, developers can exert full control over the offload process to optimize for their own needs. For the Haswell microarchitecture, which will ship in the first half of 2013, MKL will utilize the AVX2 and FMA3 instruction-set enhancements to deliver dramatic increases for all floating-point calculations. Finally, MKL also utilizes the Ivy Bridge microarchitecture’s digital random number generator for truly random seeding of vector statistics calculations. (Note: FMA3 = three-operand floating-point multiply-accumulate instructions.)

Intel® VTune™ Amplifier XE releases updates with events for new processors shortly after those processors ship. The initial release of the 2013 product includes Ivy Bridge microarchitecture and Intel® Xeon Phi™ architecture events; Haswell microarchitecture events will be available under NDA before introduction and in the released product shortly after shipment begins.

Intel® Inspector XE analyzes code for the latest microarchitecture. It can be used to analyze software for Intel® Xeon Phi™ products even though the analysis itself does not run on them: inspecting your app with Intel Inspector XE while running it on a multicore processor will detect the memory errors and threading errors that would occur when running on a many-core processor.

† Hardware events for new processors are added as new processors ship.
†† Analysis runs on multicore processors; provides analysis for multicore and many-core processors.

New product announcements embargoed until September 4, 8am Pacific Time.

14 Intel® C++, Intel® Fortran, with Performance Libraries Intel® Composer XE
Industry-leading application performance, serial and parallel

Intel compilers: Intel Fortran and Intel C++ with Intel® Cilk™ Plus
Intel Performance Libraries: Intel® Threading Building Blocks, Intel® Math Kernel Library, Intel® Integrated Performance Primitives
Architecture support: IA-32, Intel 64, Intel® Xeon Phi™ product family, Intel-compatible processors
Compatibility – Windows: Visual* C++ and Visual Studio* 2008, 2010, 2012; Linux and Mac OS X (including Mountain Lion): gcc, with Eclipse and, for C++ on Mac, Xcode

Developers interested in performance should take a look at Intel compilers. We offer C++ and Fortran in a variety of packages, some with analysis tools, some without. The compiler packages include language features focused on performance, such as Intel® Cilk™ Plus, and also libraries which deliver performance. The compilers support all contemporary IA platforms and are available for Windows* (compatible with Microsoft Visual C++*), Linux*, and OS X* (compatible with gcc*). Your performance will vary depending on the performance-sensitive parts of your application and memory-management issues, but here is an example of the performance advantage based on the SPEC benchmark.

15 Continued compiler performance leadership on C++ and Fortran
Intel® C++ and Fortran Compilers

16 Leadership Application Performance
More performance for your C++ applications:
- Just recompile
- Uses Intel® AVX and Intel® AVX2 instructions
- Intel® Xeon Phi™ product family support (Linux)
- Intel® Cilk™ Plus: tasking and vectorization

More performance for your Fortran applications:
- Intel® Xeon Phi™ product family: Linux compiler, debugger support
- Access to Intel® AVX and Intel® AVX2 instructions (-xa or /Qxa)
- Auto-parallelizer & directives to access SIMD instructions
- Coarrays & synchronization constructs support parallel programming
- Loop optimization directives: VECTOR, PARALLEL, SIMD
- More control over array data alignment (align arrayNbytes)

New in the 2013 XE SP1 release: more Fortran 2008 support. The following Fortran 2008 features are new in this release:
- The ENTRY statement is an obsolescent feature
- A source statement can begin with one or more semicolon characters
- Coarray intrinsic routines ATOMIC_DEFINE and ATOMIC_REF
- A polymorphic MOLD= specifier for ALLOCATE

The following Fortran 2008 features are supported:
- Coarrays, with image control statements SYNC ALL, SYNC IMAGES, SYNC MEMORY, CRITICAL, LOCK, and UNLOCK
- Coarray intrinsic routines IMAGE_INDEX, LCOBOUND, NUM_IMAGES, THIS_IMAGE, and UCOBOUND
- CRITICAL construct
- Maximum array rank raised to 31 dimensions (the Fortran 2008 standard specifies a maximum rank of 15)
- G0 and G0.d format edit descriptors
- FINAL routines
- GENERIC, OPERATOR, and ASSIGNMENT overloading in type-bound procedures
- A generic interface may have the same name as a derived type
- Bounds specification and bounds remapping list on a pointer assignment
- In formatting, a * indicates an unlimited repeat count
- NEWUNIT= specifier in OPEN
- A CONTAINS section can be empty
- CODIMENSION and CONTIGUOUS attributes
- Coarrays can be specified in ALLOCATABLE, ALLOCATE, and TARGET statements
- MOLD keyword in ALLOCATE
- DO CONCURRENT statement
- ERROR STOP statement
- Intrinsic functions BESSEL_J0, BESSEL_J1, BESSEL_JN, BESSEL_Y0, BESSEL_Y1, BESSEL_YN, BGE, BGT, BLE, BLT, DSHIFTL, DSHIFTR, ERF, ERFC, ERFC_SCALED, GAMMA, HYPOT, IALL, IANY, IPARITY, IS_CONTIGUOUS, LEADZ, LOG_GAMMA, MASKL, MASKR, MERGE_BITS, NORM2, PARITY, POPCNT, POPPAR, SHIFTA, SHIFTL, SHIFTR, STORAGE_SIZE, TRAILZ
- ISO_FORTRAN_ENV module constants ATOMIC_INT_KIND, ATOMIC_LOGICAL_KIND, CHARACTER_KINDS, INTEGER_KINDS, INT8, INT16, INT32, INT64, LOGICAL_KINDS, REAL_KINDS, REAL32, REAL64, REAL128, STAT_LOCKED, STAT_LOCKED_OTHER_IMAGE, STAT_UNLOCKED
- ISO_FORTRAN_ENV type LOCK_TYPE
- SCALAR keyword for ALLOCATED

17 Up to 4x Faster Performance with Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Support
Intel® Compilers and Intel® Math Kernel Library will be updated in Q4 with AVX-512 support
- Significant leap to 512-bit SIMD support
- Increased compatibility with AVX
- One-byte-longer EVEX prefix, enabling additional functionality
- First implemented in the future Intel® Xeon Phi™ coprocessor, code-named Knights Landing
- Up to 2x faster than AVX/AVX2; up to 4x faster than SSE

Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions represent a significant leap to 512-bit SIMD support. Programs can pack eight double-precision or sixteen single-precision floating-point numbers, or eight 64-bit integers, or sixteen 32-bit integers, within the 512-bit vectors. This enables processing of twice the number of data elements that AVX/AVX2 can process with a single instruction, and four times that of SSE. Intel AVX instructions use the VEX prefix, while Intel AVX-512 instructions use the EVEX prefix, which is one byte longer; the EVEX prefix enables the additional functionality of Intel AVX-512. AVX-512 offers a level of compatibility with AVX that is stronger than prior transitions to new SIMD widths: unlike SSE and AVX, which cannot be mixed without performance penalties, mixing AVX and AVX-512 instructions is supported without penalty. Intel AVX-512 will first be implemented in the future Intel® Xeon Phi™ processor and coprocessor known by the code name Knights Landing, and will also be supported by some future Xeon processors scheduled to be introduced after Knights Landing.

Note: speed-ups are actually 2x higher for peak flops when using FMA. In peak single-precision floating-point performance, AVX-512 is 8x faster than SSE2, 4x faster than AVX, and 2x faster than AVX2. Enables higher performance for the most demanding computational tasks.
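The 2x/4x claims above follow directly from register width: elements processed per instruction is just register bits divided by element bits. A back-of-envelope sketch (the `lanes` helper is mine, not an Intel API; it ignores the additional FMA factor the note mentions):

```cpp
// Data elements processed by one vector instruction at a given register
// width: 128-bit SSE, 256-bit AVX/AVX2, 512-bit AVX-512.
constexpr int lanes(int register_bits, int element_bits) {
    return register_bits / element_bits;
}
// AVX-512: lanes(512, 32) = 16 floats, lanes(512, 64) = 8 doubles per
// instruction -- twice AVX/AVX2 (256-bit) and four times SSE (128-bit).
```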

18 Intel® Math Kernel Library (Intel® MKL)
Vectorized and threaded for highest performance on all Intel and compatible processors
De facto standard APIs for simple code integration
Compatible with all C, C++ and Fortran compilers
Royalty-free, per-developer licensing for low-cost deployment
The #1 most-used math library in the world (source: Evans Data WW Developer Surveys)
Just link to the next Intel® MKL version to realize new processor performance
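As a concrete example of the kind of routine behind those standard APIs, here is a naive reference for what a BLAS GEMM computes: C = alpha*A*B + beta*C. The function `gemm_ref` is an illustrative sketch of the math only, not MKL's interface; MKL provides the same operation through `cblas_dgemm` with vectorized, threaded kernels, which is why relinking against a newer MKL picks up new processor performance without source changes.

```cpp
#include <vector>

// Naive row-major GEMM reference: C = alpha*A*B + beta*C,
// where A is m x k, B is k x n, and C is m x n.
void gemm_ref(int m, int n, int k, double alpha,
              const std::vector<double>& A, const std::vector<double>& B,
              double beta, std::vector<double>& C) {
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) {
            double acc = 0.0;
            for (int p = 0; p < k; ++p)
                acc += A[i*k + p] * B[p*n + j];   // dot product of row i, col j
            C[i*n + j] = alpha * acc + beta * C[i*n + j];
        }
}
```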

19 LAPACK Performance Improves with Intel® Math Kernel Library
Compilers & Libraries

DGETRF (LU factorization) is key to the processing time of the LINPACK benchmark: a BLAS 3 based routine that computes an LU factorization of a general M-by-N matrix A using partial pivoting with row interchanges, A = P * L * U, where P is a permutation matrix, L is lower triangular with unit diagonal elements (lower trapezoidal if m > n), and U is upper triangular (upper trapezoidal if m < n).
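The DGETRF math described above can be sketched directly. `lu_factor` below is an illustrative, unblocked implementation of LU with partial pivoting for square matrices, not LAPACK's interface; MKL's DGETRF computes the same factorization with blocked, vectorized, threaded kernels.

```cpp
#include <vector>
#include <cmath>
#include <utility>

// In-place LU factorization with partial pivoting: after the call, the
// strictly lower part of A holds L's multipliers (L's diagonal is an
// implicit 1), the upper part holds U, and piv[i] records which original
// row ended up in position i.
void lu_factor(std::vector<double>& A, std::vector<int>& piv, int n) {
    for (int i = 0; i < n; ++i) piv[i] = i;
    for (int col = 0; col < n; ++col) {
        int p = col;                          // partial pivoting: pick the
        for (int r = col + 1; r < n; ++r)     // largest magnitude in the column
            if (std::fabs(A[r*n + col]) > std::fabs(A[p*n + col])) p = r;
        std::swap(piv[col], piv[p]);
        for (int j = 0; j < n; ++j) std::swap(A[col*n + j], A[p*n + j]);
        for (int r = col + 1; r < n; ++r) {
            A[r*n + col] /= A[col*n + col];   // multiplier, stored as L
            for (int j = col + 1; j < n; ++j) // update the trailing submatrix
                A[r*n + j] -= A[r*n + col] * A[col*n + j];
        }
    }
}
```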

20 Intel® IPP Makes Your Applications Faster
Intel® Integrated Performance Primitives: highly optimized functions for multimedia & data applications

Optimized for performance and power efficiency:
- Primary focus on image and signal processing functions
- Highly optimized using the SSE and AVX instruction sets
- Performance beyond what an optimized compiler produces alone

Intel-engineered and future-proofed to save you time:
- Ready to use & royalty-free
- Fully optimized for current and past processors
- Saves development, debug, and maintenance time
- Code once now, receive future optimizations later

Wide range of cross-platform & OS functionality:
- Thousands of optimized functions
- Supports Windows*, Linux*, and Mac OS* X
- Supports Intel® Atom™, Intel® Core™, and Intel® Xeon® platforms

Availability: Composer XE and Studio XE products, as well as standalone

21 Intel® IPP Easy Access to Intel® AVX Performance

22 Gain Performance with Less Effort
Intel® Parallel Studio XE / Intel® Cluster Studio XE – Compilers & Libraries

C++ Performance Guide: Performance Wizard for Windows
- Quick 5-step process for more performance
- Get help choosing optimization options

The wizard guides developers through different compiler options to make sure their application is optimized for performance.

23 Intel® VTune™ Amplifier XE Performance Profiler
Where is my application…
Spending time? Focus tuning on the functions taking time; see call stacks; see time on source.
Wasting time? Find cache misses, branch mispredictions, and other inefficiencies; tune bandwidth.
Waiting too long? See locks by wait time, with red/green CPU utilization shown during waits.

Windows & Linux. Low overhead. No special recompiles.

“We improved the performance of the latest run 3-fold. We wouldn’t have found the problem without something like Intel® VTune™ Amplifier XE.” – Claire Cates, Principal Developer, SAS Institute Inc.

Advanced profiling for scalable multicore performance

24 Low Overhead Java* Profiling Intel® VTune™ Amplifier XE
Low overhead & precise:
- Sampling is fast and unobtrusive
- Hardware sampling is even faster (now with optional stacks!)
- Advanced profiles are unique (cache misses, bandwidth, …)

Versatile & easy to use:
- Multiple simultaneous JVMs
- Mixed Java / C++ / Fortran
- See results on the Java source

Low runtime overhead: most other Java* software tools use software instrumentation and messages back and forth to the JVM (Java Virtual Machine) to gather data. Intel VTune Amplifier XE’s main technology is interrupt-based sampling, with a little communication with the JVM to learn where JIT-compiled code ended up. We even offer hardware sampling using the on-chip PMU for very low overhead, as well as advanced tuning like cache misses. The mixed-programming-language capability described above is also a feature most other tools don’t have. We are not just another Java tool: we don’t duplicate what the other tools do well (Java heap profiling, …); we augment those tools with unique, low-overhead profiling and mixed-mode support. Advanced profiling for Java is a unique advantage of VTune Amplifier XE: no other vendor offers as wide a selection of pre-defined advanced profiles (general exploration, cache miss, bandwidth, branch mispredict, etc.) for the latest Intel processors, and results automatically highlight potential tuning opportunities.

In summary:
- Low-overhead performance profiling: less Java app slowdown when gathering performance data, via interrupt-based sampling; function-to-function calling sequences
- Performance data for CPU events (cache misses, branch mispredictions, …), highlighted when those events are performance problems; the “General Exploration” collector answers how many cache misses are too many
- Mixed-programming-language support: full performance information for apps with both Java and C/C++ code
- Drill down to Java* source code: see which executable statements are causing performance issues

Better data, lower overhead, easier to use

25 Intel® VTune™ Amplifier XE Tune Applications for Scalable Multicore Performance
Fast, accurate performance profiles:
- Hotspots (statistical call tree)
- Call counts (statistical)
- Hardware event sampling

Thread profiling:
- Visualize thread interactions on the timeline
- Balance workloads

Easy setup:
- Pre-defined performance profiles
- Use a normal production build

Find answers fast:
- Filter extraneous data
- View results on the source / assembly

Compatible:
- Microsoft, GCC, and Intel compilers
- C/C++, Fortran, Assembly, .NET, Java
- Latest Intel® processors and compatible processors¹
- Windows or Linux; Visual Studio integration (Windows); standalone user interface and command line; 32- and 64-bit

Accurate performance data – without data you are just guessing about the location of the performance bottleneck and can easily waste a lot of time.
Easy setup – we’ve added a number of pre-defined performance profiling experiments to the full custom capabilities of earlier versions of VTune™ analyzer, making it easier to get great profiling information without needing to know microarchitectural details.
Powerful profile analysis – getting good profiling data is only the first step. We’ve added features like the timeline, filtering, and frame analysis to help turn that data into actionable information.
Tune threaded and non-threaded code – identify the threads and synchronization objects that impact performance; see the distribution of work to threads and pinpoint load imbalances.
Low overhead – collecting data always has a cost. VTune™ Amplifier XE keeps the overhead low, making data collection faster and the results more accurate.
Normal production build – use a production build with symbols from your normal compiler or assembler; no special builds are required.
C++, Fortran, Assembly and more – use compilers from any vendor (Microsoft, GCC, Intel, …) that follow platform standards.
Intel® processors and compatible processors (IA-32 & Intel® 64) – many of the profiling features work on both genuine Intel® processors and compatible processors. Some features, which use the on-chip performance monitoring unit for event-based sampling, require a genuine Intel® processor for data collection, but the results can be saved and analyzed on any compatible processor.
Windows or Linux, 32- and 64-bit – both Windows and Linux versions are available: Windows 7, Vista, XP, Windows Server; RHEL, Fedora, SUSE, …
Windows: Visual Studio 2005, 2008 and 2010 integration – integrate performance analysis into the Visual Studio environment or run standalone.
Linux: no root privileges required for basic performance analysis. Installation of the driver for event-based sampling (EBS) requires root access, but it can be done later if needed.

¹ IA-32 and Intel® 64 architectures. Many features work with compatible processors; event-based sampling requires a genuine Intel® processor.

26 What’s New in 2013 SP1? Intel® VTune™ Amplifier XE
More profiling data:
- Intel® Xeon Phi™ – memory and vectorization profiling
- Gen graphics tuning – GT event counting, offload, OpenCL*, …

Better data mining – find answers faster:
- Search added to all grids
- Timeline sorting, band height, and time-scale configuration
- Loop hierarchy, overhead, and spin-time metrics
- OpenMP* 4.0 – affinity controls, tasking and scalability analysis

Easier to use:
- Attach to a running Java process
- Contextual help for hardware events and performance metrics
- Easier generation of command-line options from the user interface

New OS & processor support:
- Intel® Xeon Phi™, Haswell – Windows* & Linux*
- Windows 8 desktop and Visual Studio* 2012
- Collection on the Windows UI and Windows Blue
- Latest Linux distributions

New since the first 2013 release; some features released in earlier updates. Intel Confidential

27 Find performance issues faster Intel® VTune™ Amplifier XE
Search added to all grid views. Configurable band height, sorting, and time scale. Visualize overhead and spin times.

VTune not only collects a wealth of useful data, it provides powerful analysis tools to mine that data for useful information. Search lets you quickly locate functions of interest in the results; for example, a user interested in all functions in a particular class can simply type in the class name and see all of its functions. With increased numbers of threads, the user can choose to see more threads with less detail, or more detail for fewer threads; both are useful. Overhead and spin times are now shown on the timeline, making it easier to diagnose inefficiencies in parallel programs.

28 Scale Forward

29 OpenMP Coarray Fortran MPI
Compilers & Libraries – Scale Forward with Intel Parallel Models

Extend to Intel® Xeon Phi™ coprocessors, Intel® Xeon® processors, and compatible processors. Abstract, scalable, composable.

Supported standards: OpenMP, Coarray Fortran, MPI
Intel® Cilk™ Plus – C and C++ language extensions to simplify vectorization & parallelism
Intel® Threading Building Blocks – widely used C++ template library for thread management
Intel® Xeon Phi™ product family – open programming models as well as Intel products

Intel brings you a strong suite of standards-based tools and features, and continues to invest in parallel programming models to help you get the most performance from multicore. These models are all common to both multicore (e.g., Intel Xeon processors) and many-core (e.g., the Intel® Xeon Phi™ product family), so selecting the best models for your application today sets a path for you to take advantage of many-core performance. Start by implementing parallelism for multicore Intel® Architecture using models that will be ready for many-core. We recommend abstract models that allow scalability and friendly co-existence with other parallel model types.

Intel® Cilk™ Plus:
- Simple extensions to the C and C++ languages
- Three keywords (cilk_for, cilk_spawn, cilk_sync)
- Array notations, allowing greater vectorization in C and C++
- Supports elemental functions
Intel introduced Intel Cilk™ Plus building on research from M.I.T. and productization experience by industry leader Cilk Arts.

A note on Cilk Plus compatibility: there is a gcc development branch to which Cilk Plus is being added (Cilk Plus is open-sourced and available in Intel products). Compatibility with Cilk Plus as it ships in the Intel C++ compiler will grow over time. If you are using the mainline gcc release, or Visual C++, you will get syntax errors when you compile. That said, if you comment out the three keywords using #ifdef or #pragma, your application will compile (probably with warning messages about them), and it should run as it did before you applied the keywords. We will talk more about Cilk Plus in a moment when we discuss array notation and elemental functions for vectorization. Currently the “Plus” parts of Cilk Plus cannot be commented out and are essentially incompatible with the production releases of gcc and Visual C++, but, again, work is underway on a gcc branch that supports Cilk Plus.

Intel® Threading Building Blocks (Intel® TBB) is a popular parallel abstraction for C++ developers: a C++ template library with scalable memory allocation, load balancing, work-stealing task scheduling, a thread-safe pipeline, concurrent containers, high-level parallel algorithms, numerous synchronization primitives, and a flow graph.

We also support other standards such as OpenMP (v3.1), Coarray Fortran, and MPI. Don’t leave your code behind.
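The divide-and-conquer shape that cilk_spawn/cilk_sync parallelize can be shown in plain C++. `sum_range` below is an illustrative sketch, written serially, with comments marking where the keywords would go under an Intel or Cilk-enabled compiler; it relies on exactly the property described above, that removing the keywords leaves a correct serial program.

```cpp
#include <vector>
#include <cstddef>

// Divide-and-conquer reduction in the shape Cilk Plus parallelizes: under
// an Intel compiler, the first recursive call would be prefixed with
// cilk_spawn and the final '+' preceded by cilk_sync, letting the runtime
// steal the left half while this thread computes the right half.
double sum_range(const std::vector<double>& v, std::size_t lo, std::size_t hi) {
    if (hi - lo <= 4) {                       // small grain: sum directly
        double s = 0.0;
        for (std::size_t i = lo; i < hi; ++i) s += v[i];
        return s;
    }
    std::size_t mid = lo + (hi - lo) / 2;
    double left  = sum_range(v, lo, mid);     // cilk_spawn candidate
    double right = sum_range(v, mid, hi);
    return left + right;                      // join point (cilk_sync)
}
```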

30 Compilers & Libraries – Simplify Parallelism: Intel® Cilk™ Plus, Intel® Threading Building Blocks

Intel® Cilk™ Plus – language extensions to simplify task/data parallelism:
- #pragma simd and array notation: easy-to-use, powerful vectorization
- 3 simple keywords for parallelism
- Support for task & data parallelism
- Semantics similar to serial code
- A simple way to vectorize and parallelize your code
- Sequentially consistent, low-overhead, powerful solution
- Supports C, C++, Windows, and Linux
- Get more from your IA hardware

Intel® Threading Building Blocks – widely used C++ template library for task parallelism:
- Parallel algorithms and data structures
- Scalable memory allocation and task scheduling
- Synchronization primitives
- Rich feature set for general-purpose parallelism
- Available as open source or under a commercial license
- Supports C++, Windows, Linux, Mac OS X, other OSs

Intel® Cilk™ Plus offers convenient semantics: cilk_spawn, cilk_sync, cilk_for, with the option to serialize the code via a command-line switch. Vector notation is a natural expression of vector code – e.g. A[:] = B[:] + C[:] – that generates SIMD code (SSE, AVX) and supports vector functions (e.g. A[:] = sin(B[:])), including facilities for user-defined vector functions.

Notation example:
for (i = 0; i < N; i++) { A[i] = B[i] + t*C[i]; }
becomes
A[:] = B[:] + t*C[:]

Vectorization and parallelism made easier
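The notation example above can be unpacked into the equivalent element-wise loop. `saxpy_like` is an illustrative name of mine; the point of the array notation is that it guarantees SIMD code, where the plain loop below relies on the auto-vectorizer.

```cpp
#include <vector>
#include <cstddef>

// The element-wise operation that the Cilk Plus array notation
// A[:] = B[:] + t*C[:] expresses, written as an explicit loop.
void saxpy_like(std::vector<double>& A, const std::vector<double>& B,
                const std::vector<double>& C, double t) {
    for (std::size_t i = 0; i < A.size(); ++i)
        A[i] = B[i] + t * C[i];
}
```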

31 Intel® Threading Building Blocks (Intel® TBB)
Popular, proven parallelism abstraction layer delivered as a C++ template library Scalable memory allocation Load-balancing Work-stealing task scheduling Thread-safe pipeline Flexible flow graph Concurrent containers High-level parallel algorithms Numerous synchronization primitives Open source, portable across many OSs “Intel® TBB provided us with optimized code that we did not have to develop or maintain for critical system services. I could assign my developers to code what we bring to the software table.” Achieve linear performance scaling with Intel® TBB. The C++ library abstracts access to the multiple processors by allowing the operations to be treated as "tasks", which are allocated to individual cores dynamically by the library's run-time engine, and by automating efficient use of the CPU cache. A TBB program creates, synchronizes and destroys graphs of dependent tasks according to algorithms, i.e. high-level parallel programming paradigms. Tasks are then executed respecting graph dependencies. This approach groups TBB in a family of solutions for parallel programming aiming to decouple the programming from the particulars of the underlying machine. Michaël Rouillé, CTO, Golaem Simplify Parallelism with a Scalable Parallel Model
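The task scheduler described above is easiest to picture through the recursion it parallelizes. Below is a serial sketch in plain C (no TBB dependency; the function name is illustrative) of the split/join pattern behind TBB's parallel_reduce: split the range until it falls under a grain size, reduce each piece, join the results. TBB's work-stealing scheduler runs the two halves as stealable tasks; here they run sequentially.

```c
#include <stddef.h>

/* Serial sketch of the parallel_reduce recursion: split the index range
   until it is no larger than the grain size, reduce each leaf directly,
   then join the partial results on the way back up. */
double range_sum(const double *a, size_t lo, size_t hi, size_t grain) {
    if (hi - lo <= grain) {              /* leaf: reduce directly */
        double s = 0.0;
        for (size_t i = lo; i < hi; i++) s += a[i];
        return s;
    }
    size_t mid = lo + (hi - lo) / 2;     /* split the range */
    return range_sum(a, lo, mid, grain)  /* join the two halves */
         + range_sum(a, mid, hi, grain);
}
```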

32 Excellent scalability on Intel® Xeon Phi™ coprocessors

33 Intel® Cilk™ Plus: A C++ Language Extension
Scale Forward and Extend to Intel® Xeon Phi™ Coprocessors With Intel® Cilk™ Plus Intel® Cilk™ Plus: A C++ Language Extension Easier Task & Data Parallelism Intel® Cilk™ Plus Array Notation 3 simple keywords (cilk_for, cilk_spawn, cilk_sync) and #pragma SIMD Save time with powerful vectorization Code snippet before #pragma SIMD: for (int i = 2; i < n; i++) y[i] = y[i-2] + 1; In addition to “more cores” we are also seeing wider vectorization and larger vector instruction sets like AVX2. There are several ways that the Intel compiler can know to vectorize your code. Auto-vectorization is a great way to start since it is very low effort, but it may not vectorize your code in all cases. Cilk Plus gives you an easy way to express vector operations with a guarantee that the compiler will use the best vector instructions. It is recommended because it is easy to understand and less error prone than pragma simd. Pragma simd also works, but it is a riskier choice that should be used carefully. As noted earlier, there is a gcc development branch to which Cilk Plus is being added (Cilk Plus is open sourced and available in Intel products). Compatibility with Cilk Plus as it ships in the Intel C++ compiler will grow over time. If you are using the mainline gcc release, or Visual C++, you will get syntax errors when you compile. That said, as mentioned earlier, if you want to comment out the 3 keywords using #ifdef or #pragma, your application will compile, you’ll probably get warning messages about them, but your application should run as it did before you applied the keywords. Regarding the “Plus” parts that are the focus of this page, the vectorization capabilities of Cilk Plus cannot be commented out and are basically incompatible with the production releases of gcc and Visual C++, but, again, work is underway on a gcc branch that supports Cilk Plus.
Code snippet with #pragma SIMD: #pragma simd vectorlength(2) for (int i = 2; i < n; i++) y[i] = y[i-2] + 1; Intel® Cilk™ Plus #pragma SIMD Save time with powerful vectorization Minimize Software Re-Work for New Hardware
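The loop in this snippet carries a dependence at distance 2: y[i] reads y[i-2], so adjacent pairs of iterations are independent, which is exactly why vectorlength(2) is safe. A plain-C version showing what the loop computes (the function name is illustrative; the pragma itself is an Intel compiler hint that other compilers ignore or warn about):

```c
/* The slide's recurrence: y[i] depends on y[i-2], a distance-2 dependence,
   so iterations i and i+1 are independent and a vector length of 2 is
   legal under #pragma simd vectorlength(2). */
void recurrence(int *y, int n) {
    for (int i = 2; i < n; i++)
        y[i] = y[i - 2] + 1;
}
```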

34 Visualize OpenMP* scalability Intel® VTune™ Amplifier XE
Visualize time regions from the fork to the join See what is serial, balanced and imbalanced. seconds in an imbalanced region 3.652 seconds in a fairly well balanced region New VTune support to make it easier to understand OpenMP data including affinity controls, tasking and scalability analysis. Intel® VTune™ Amplifier XE now lets you visualize time regions from the fork to the join. You can see what is serial, what is balanced, and what is imbalanced to make sure your OpenMP*-based application is optimally scaling. Actionable Data for Better OpenMP Scalability New Product Announcements Embargoed until September 4, 8am Pacific Time

35 Add Parallelism with Less Effort, Less Risk and More Impact
Data Driven Threading Design Intel® Advisor XE – Threading Prototyping Tool for Architects Have you: Tried threading an app, but seen little performance benefit? Hit a “scalability barrier”? Performance gains level off as you add cores? Delayed a release that adds threading because of synchronization errors? Breakthrough for threading design: Quickly prototype multiple options Project scaling on larger systems Find synchronization errors before implementing threading Separate design and implementation, design without disrupting development Avoid wasted effort – increase your confidence of success We would like to make a tool that fully automates this process, but unfortunately the state of the art is not there yet. Instead, Advisor focuses on doing things that machines do well and gathers and distills information for the developers to help them make better decisions. We’ve had great customer feedback on the predictions Advisor makes. People who followed the advice and added threading where suggested actually saw the projected speedup. We’ve even had customers add threading in places where it was not recommended, and, sure enough, there was little performance gain. One of the real benefits is that you can identify issues (e.g., global variables, synchronization, …) and fix them before you implement the parallelism. This lets you test the app using your existing test framework and gain confidence that the threading will be successful, before adding the threading implementation. Add Parallelism with Less Effort, Less Risk and More Impact

36 Design Then Implement Intel® Advisor XE – Threading Prototyping Tool
Design Parallelism No disruption to regular development All test cases continue to work Tune and debug the design before you implement it 1) Analyze it. 2) Design it. (Compiler ignores these annotations.) 3) Tune it. 4) Check it. Implement Parallelism 5) Do it! Less Effort, Less Risk, More Impact

37 Thread Prototyping - Intel® Advisor XE What’s New in SP1?
Easier to Learn New tutorial New training videos Improved assistance window Easier to Compare Alternate Designs Snapshot copy saves a copy of your workspace Improved Performance Pause/resume eliminates analysis of low risk code New OS & Processor Support Haswell – Windows* & Linux* Windows 8* desktop Visual Studio* 2012 Latest Linux distributions New since the first 2013 release. Some features released in earlier updates. Intel Confidential

38 Increase Reliability

39 Excellent Support for C++ 11 on Windows* and Linux*
Expanded C++ 11 support New in SP1 (14.0 compilers) Unrestricted unions (Linux*, OS X*) Non-static data member initializers Explicit Virtual Overrides Allowing move constructors to throw Defining Move Special Member Functions Inline namespaces Rvalue references v2 Intel is committed to supporting the C++11 standard and with SP1 (icc 14.0) we are almost there. We expect to offer full support in our next release. The list below is long but some of these include unicode string literals, long long (64-bit integer type) and other item that reconcile C99 with subsequent work on C++0x. For a full list of supported features, see Excellent Support for C++ 11 on Windows* and Linux*

40 Excellent Fortran 2008 Support
Compilers & Libraries Excellent Fortran 2008 Support Maximum array rank has been raised to 31 dimensions (Fortran 2008 specifies 15) Recursive type may have ALLOCATABLE components Coarrays CODIMENSION attribute SYNC ALL statement SYNC IMAGES statement SYNC MEMORY statement CRITICAL and END CRITICAL statements LOCK and UNLOCK statements ERROR STOP statement ALLOCATE and DEALLOCATE may specify coarrays Intrinsic procedures IMAGE_INDEX, LCOBOUND, NUM_IMAGES, THIS_IMAGE, UCOBOUND CONTIGUOUS attribute MOLD keyword in ALLOCATE DO CONCURRENT NEWUNIT keyword in OPEN G0 and G0.d format edit descriptor Unlimited format item repeat count specifier CONTAINS section may be empty Intrinsic procedures BESSEL_J0, BESSEL_J1, BESSEL_JN, BESSEL_YN, BGE, BGT, BLE, BLT, DSHIFTL, DSHIFTR, ERF, ERFC, ERFC_SCALED, GAMMA, HYPOT, IALL, IANY, IPARITY, IS_CONTIGUOUS, LEADZ, LOG_GAMMA, MASKL, MASKR, MERGE_BITS, NORM2, PARITY, POPCNT, POPPAR, SHIFTA, SHIFTL, SHIFTR, STORAGE_SIZE, TRAILZ Additions to intrinsic module ISO_FORTRAN_ENV: ATOMIC_INT_KIND, ATOMIC_LOGICAL_KIND, CHARACTER_KINDS, INTEGER_KINDS, INT8, INT16, INT32, INT64, LOCK_TYPE, LOGICAL_KINDS, REAL_KINDS, REAL32, REAL64, REAL128, STAT_LOCKED, STAT_LOCKED_OTHER_IMAGE, STAT_UNLOCKED Intel Fortran supports all of the widely used features of the F2003 standard and key parts of the 2008 standard, including co-arrays. Leadership F2008 Support on Linux*, Windows* & OSX*

41 Expanded Fortran Capabilities in SP1
Added support for user defined derived type input and output Fortran 2008 ATOMIC_DEFINE and ATOMIC_REF initialization of polymorphic INTENT(OUT) dummy arguments standard handling of G format and of printing the value zero polymorphic source allocation Coarrays now support the Intel® Xeon Phi™ coprocessor Fortran 2003: Custom subroutines can be used to handle input or output for objects of a derived type Motivation: Allows user control of the way input and output is handled for derived type variables The default I/O handlers cannot be used for some derived types Component variables printed in order of declaration No pointers or allocatables F2008 Three coarray execution models Images run on host with offload regions (w/restrictions) Images run on both coprocessor and host (heterogeneous) Images run natively on the coprocessor Note: The last 2 models require manual upload of referenced shared object libs including MPI (impi.so) and coarray (libicaf.so) The following Fortran 2008 features are supported: Coarray intrinsic routines: ATOMIC_DEFINE and ATOMIC_REF A polymorphic MOLD= specifier for ALLOCATE Coarrays Image control statements: SYNC ALL, SYNC IMAGES, SYNC MEMORY, CRITICAL, LOCK, and UNLOCK CRITICAL construct Coarray intrinsic routines: IMAGE_INDEX, LCOBOUND, NUM_IMAGES, THIS_IMAGE, and UCOBOUND Maximum array rank has been raised to 31 dimensions; the Fortran 2008 Standard specifies a maximum rank of 15 G0 and G0.d format edit descriptors FINAL routines A generic interface may have the same name as a derived type GENERIC, OPERATOR, and ASSIGNMENT overloading in type-bound procedures Bounds specification and bounds remapping list on a pointer assignment In formatting, a * indicates an unlimited repeat count NEWUNIT= specifier in OPEN Attributes CODIMENSION and CONTIGUOUS A CONTAINS section can be empty Coarrays can be specified in ALLOCATABLE, ALLOCATE, and TARGET statements MOLD keyword in ALLOCATE DO CONCURRENT statement Intrinsic functions BESSEL_J0, BESSEL_J1, 
BESSEL_JN, BESSEL_Y0, BESSEL_Y1, BESSEL_YN, BGE, BGT, BLE, BLT, DSHIFTL, DSHIFTR, ERF, ERFC, ERFC_SCALED, GAMMA, HYPOT, IALL, IANY, IPARITY, IS_CONTIGUOUS, LEADZ, LOG_GAMMA, MASKL, MASKR, MERGE_BITS, NORM2, PARITY, POPCNT, POPPAR, SHIFTA, SHIFTL, SHIFTR, STORAGE_SIZE, TRAILZ ERROR STOP statement ISO_FORTRAN_ENV module constants ATOMIC_INT_KIND, ATOMIC_LOGICAL_KIND, CHARACTER_KINDS, INTEGER_KINDS, INT8, INT16,INT32, INT64, LOGICAL_KINDS, REAL_KINDS, REAL32, REAL64, REAL128, STAT_LOCKED, STAT_LOCKED_OTHER_IMAGE, STAT_UNLOCKED SCALAR keyword for ALLOCATED ISO_FORTRAN_ENV type LOCK_TYPE Intel Fortran: Leadership Performance with a Leading Feature Set

42 Conditional Numerical Reproducibility (CNR)
Compilers & Libraries Conditional Numerical Reproducibility (CNR) “I’m a C++ and Fortran developer and have high praise for the Intel® Math Kernel Library. One nice feature I’d like to stress is the numerical reproducibility of MKL which helps me get the assurance I need that I’m getting the same floating point results from run to run.” Intel® Math Kernel Library: New deterministic task scheduling and code path selection options OpenMP*: New deterministic reduction option Intel® Threading Building Blocks New parallel deterministic reduce option References: 1. MKL CBWR: 2. Compiler options to use and manage reproducible results: 3. The OpenMP RTL already has undocumented support for using a tree algorithm for OpenMP reductions to combine the contributions from different threads. This defines the order in which the contributions from different threads are combined, and yields reproducible results for a fixed number of threads. It may result in a slight loss of performance, but might also improve performance for large numbers of threads. Set the environment variable: KMP_DETERMINISTIC_REDUCTION=1 Here is the official name of this feature: “OpenMP deterministic reduction feature”. In the past, James has asked for the actual syntax for invoking such things. In this case, a user would set an environment variable: KMP_DETERMINISTIC_REDUCTION=true to turn it on. This forces run-to-run consistent results for any OpenMP reduction, given the same input data. Engineering considers this a minor feature but important to a small subset of customers. One known customer (Météo-France) said they would make it a condition of their next purchase. The benefit is to customers who must have reproducible results and consistency in computation. It takes away a source of imprecision leading to different results which otherwise might have to be analyzed for the source of the differences.
As explained to me, multiple threads in an application will calculate answers by combining data in the order the threads arrive at the end of the parallel region. Different thread ordering will result in slightly different results. This feature enables the use of parallelism to speed up computation and still get reproducible results, which would otherwise only be reproducible if you ran the application in serial (non-threaded) mode. 4. TBB: Franz Bernasek, Owner/CEO and Senior Developer, MSTC Modern Software Technology Help Achieve Reproducible Results, Despite Non-associative Floating Point Math
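The reordering problem these deterministic-reduction options address comes from floating-point addition not being associative. A tiny self-contained demonstration (function names are illustrative) that grouping the same three values differently changes the rounded result:

```c
/* Floating-point addition is not associative: the same three values summed
   with different grouping can round to different results. This is why
   combining thread partial sums in arrival order varies from run to run. */
double sum_left(void)  { return (1.0 + 1e-16) + 1e-16; } /* each 1e-16 is absorbed by 1.0 */
double sum_right(void) { return 1.0 + (1e-16 + 1e-16); } /* 2e-16 is large enough to survive rounding */
```

A deterministic reduction fixes the combining order, so every run makes the same grouping choice and gets the same answer.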

43 New and Simplified CNR Intel® Math Kernel Library now supports unaligned data for conditional numerical reproducibility Extends the feature to remove memory alignment as a prerequisite Balances performance with reproducible results. Allows greater flexibility in code branch choice and ensures algorithms are deterministic. Overcome the inherently non-associative nature of floating-point arithmetic with MKL, TBB and OpenMP New in this MKL release is the ability to achieve reproducibility without memory alignment. “I’m a C++ and Fortran developer and have high praise for the Intel® Math Kernel Library. One nice feature I’d like to stress is the bitwise reproducibility of MKL which helps me get the assurance I need that I’m getting the same floating point results from run to run.” Franz Bernasek CEO and Senior Developer, MSTC Modern Software Technology

44 Find errors earlier when they are less expensive to fix
Correctness tools increase ROI by 12%-21% Cost Factors – Square Project Analysis CERT: U.S. Computer Emergency Readiness Team, and Carnegie Mellon CyLab NIST: National Institute of Standards & Technology : Square Project Results Size and complexity of applications is growing Correctness tools find defects during development prior to shipment Reworking defects is 40%-50% of total project effort Reduce time, effort, and cost to repair The earlier you find errors the less expensive they are to fix. There are many studies, the %ROI varies, but they are consistent that finding and fixing errors early saves $. Start at unit test, some errors detected, then move to larger tests at integration. Find errors earlier in the cycle. Find and diagnose errors with less effort Find errors earlier when they are less expensive to fix

45 Deliver More Reliable Applications Intel® Inspector XE and Intel® Parallel Studio XE family of suites Dynamic Analysis Memory Errors Threading Errors Intel® Inspector XE alone Intel Inspector XE dynamically instruments & runs the application and watches for errors. Use any build, any compiler (debug build is best). Static Analysis Code & Security Errors Pointer Checker Pointer Errors Added bonus features in Intel® Parallel Studio XE and Intel® Cluster Studio XE suites We offer a wealth of reliability analysis tools in Parallel Studio XE. Multiple tools all with the same common user interface so only one thing to learn. The top row of dynamic analysis capabilities are included in Intel Inspector XE. The second row are bonus capabilities available in our Parallel Studio XE family of suites. (C++ Studio XE, Fortran Studio XE, Cluster Studio XE, Parallel Studio XE) Dynamic analysis: C, C++, C#, F#, Fortran Static analysis: C, C++, Fortran Pointer Checker: C, C++ Intel compiler inspects source. Use any compiler for production. Intel compiler run time checks. Use any compiler for production. Static Analysis & Pointer Checker are only available in the Parallel Studio XE family of suites. Not sold separately.

46 Diagnose errors with less effort Intel® Inspector XE Dynamic Memory & Thread Analysis
Diagnose heap growth. Get a list of memory allocations not freed in an interval set with the GUI or an API. Heap Growth Analysis Diagnose the problem. Break into the debugger just before the error occurs. Examine the variables and threads. Debugger Breakpoints More precise, easy to edit, team shareable. Choose which stack frame to suppress. Eliminate the false, not the real errors. Improved Error Suppression Pause/Resume Collection Speed-up analysis by limiting its scope. Turn on analysis only during the execution of the suspected problem. Find and eliminate errors Inspector XE is both a memory checker and a threading checker that can find errors earlier in the design cycle when they are less expensive to fix. This is especially important with threading errors like deadlocks and race conditions which are difficult to detect without a tool. Almost NEW! – With our recent service pack we’ve adapted both Inspector XE and Amplifier XE to analyze MPI applications (cluster apps). This is important because more and more MPI apps are adding threading and need these tools. NEW! Heap growth analysis – APIs are used to mark the region(s) for growth analysis. This lets you see what is causing the heap to grow in a problematic region. It is not same as heap profiler, but is a first step. Heap growth analysis feature reports list of memory allocations that are not freed within an interval. Interval is defined by way of GUI button clicks by user or programmatically using itt_notify API. Faster & Easier to use NEW! Debugger breakpoints – easier diagnosis of difficult bugs, lower overhead when running to a breakpoint Now you can break into the debugger when an error is detected and diagnose the cause. Supports MS debugger (2008, 2010 and soon Dev11) on Windows. On Linux both the Intel Debugger and GDB are supported. NEW! 
Pause/resume collection – we’ve had many requests to be able to “turn off” the analysis, run fast up to a point before the suspected error occurs, then “turn on” the analysis. Now you can. There is still overhead involved when analysis is off, but it is greatly reduced. Intel® Xeon Phi™ Products Software is analyzed on a regular multicore processor, fixed, then targeted at systems with Intel® Xeon Phi™ coprocessors. There is a class of errors unique to the hardware environment that this analysis will miss, but there is a very significant set of errors that can be found and eliminated. Q: Why would I want to do debug simultaneously with an analysis session when the IXE problem report gives me the code location (I can run the application under the debugger the way I always have and set a code breakpoint )? A: When debugging without memory or threading analysis a code breakpoint will stop execution at the right location, but that location might be executed thousands of times before the conditions are right that resulted in the problem being reported. By combining debug with analysis, the tool does the work of determining when the problem conditions have occurred and stops at the right time as well as location. Also new, but not mentioned: Hierarchical display of results – easier result navigation Share conclusions & comments among team members Threading intelligence maps errors to the source, not internal runtime gibberish for OpenMP, TBB and now Cilk Plus

47 What’s New in SP1 Intel® Inspector XE Dynamic Memory & Thread Analysis
Import Suppressions from Purify* & Valgrind* on Linux* Incremental Leak Analysis Improved Error Suppression bash-4.1$ inspxe-cl -convert-suppression-file -from=valgrind.sup -to=inspector.sup Converted old format or third-party suppression file /tmp/my_app/valgrind.sup to /tmp/my_app/inspector.sup. Incremental leak reports, no waiting! Set a base line and see the leaks as they are detected. Users of third party tools can leverage their investment creating suppression files. More precise, easy to edit, team shareable. Choose which stack frame to suppress. Eliminate the false, not the real errors. Incremental Leak Analysis Ask for a leak report any time during the run of the program and get it immediately. Set/reset a baseline for leak analysis – see just the leaks since the baseline. This lets you focus on leaks only in a given section of time or a given section of the code. Available via GUI buttons or API calls. Import Suppressions from Purify* & Valgrind* on Linux* Import suppression lists from other popular memory debuggers like Purify* and Valgrind*. Use Intel Inspector XE without wasting time recreating suppression lists. Now in an easy to edit text format. Improved Error Suppression Choose the exact stack frame or frames to suppress. Eliminate the false error without suppressing potential real errors. Share suppressions with your team. Find and diagnose errors with less effort

48 Cluster Tools

49 Scale Forward, Scale Faster Intel Cluster Tools
Intel® Cluster Studio XE Scale Forward, Scale Faster Intel Cluster Tools Compilers & Libraries Scale Performance – Perform on More Nodes MPI Latency - Intel® MPI Library - Up to <UpdateX> as fast as alternative MPI libraries Compiler Performance – Industry leading Intel® C/C++ & Fortran compilers Scale Forward – multicore now, many-core ready Intel MPI Library scales beyond 120k processes Focused to preserve programming investments for multicore and manycore machines Scale Efficiently – Tune & Debug on More Nodes Thread & Memory Correctness Checking – Intel® Inspector XE now MPI enabled across many nodes Rapid Node Level Performance Profiling – Intel VTune Amplifier XE can identify hotspots faster and on thousands of nodes High Performance Standards Driven Fabric Flexible MPI Library 49

50 Intel® MPI Library 4.1 Update 1 What’s New on Linux*
Improved MPI application performance and scalability Better scalability on the OFA fabric Improved support for NUMA applications and advanced process pinning controls Addition of a DAPL* auto-provider functionality for selecting the best fabric at startup Extended support for the Intel® Xeon Phi™ coprocessor architecture for improved bandwidth and latency through: Native port of the Tag Matching Interface (TMI) over the QLogic* PSM fabric Extended support for Checkpoint/Restart (BLCR*) on the Intel® Xeon Phi™ coprocessor Backwards compatibility with existing Intel® MPI Library 4.x applications New GUI-based installer Better scalability on the OFA fabric - A new connection manager for the OFA fabric was developed for use with the Hydra scheduler, enhancing scalability The MPI runtime DAPL selector has been enhanced to automatically select the best available fabric for the job

51 Intel® MPI Library 4.1 Update 1 What’s New on Windows*
Highly Scalable Hydra Process Manager Now available on Windows* OS Support for Microsoft*’s Network Direct Enabling low-latency RDMA devices on Windows*-based clusters Hydra has been ported to Windows Microsoft’s Network Direct is Microsoft’s version of DAPL

52 Intel® ITAC Intel® Trace Analyzer and Collector Optimize MPI Communications (part of Intel® Cluster Studio XE) Visually understand parallel application behavior Communications Patterns Hotspots Load Balance MPI Checking Detect Deadlocks Data Corruption Errors in Parameters, Data Types, etc. Understanding an MPI program’s communications is key to optimizing its performance.

53 Intel® Trace Analyzer and Collector 8.1 Update 3 What’s New
Intel® Cluster Studio XE Intel® Trace Analyzer and Collector 8.1 Update 3 What’s New Compilers & Libraries Fresh look-and-feel for the Intel® Trace Analyzer graphical interface New toolbars, icons, and dialogs for a more streamlined analysis flow Addition of a Welcome Page and easy access to past projects Support for the dynamic profiling control command MPI_Pcontrol Support for the MPI 2.x standard New GUI-based installer on Linux* Expanded MPI standards support with MPI_Pcontrol, allowing the user to have a single code base with dynamic profiling regardless of the profiling tool.

54 Pricing and Availability
Includes Intel® C++ Composer XE Intel® Fortran Composer XE Intel® Inspector XE Intel® VTune™ Amplifier XE Intel® MPI Library Intel® Trace Analyzer and Collector Price Intel® Parallel Studio XE $2,299 Intel® C++ Studio XE $1,599 Intel® Fortran Studio XE $1,899 Intel® Cluster Studio XE $2,949 Additional configurations are available on our website!

55 Intel® Parallel Studio XE Suites Leading development suites for application performance
Intel® Cluster Studio XE Intel® Parallel Studio XE Analysis Intel® VTune™ Amplifier XE - Performance Profiler Intel® Inspector XE - Memory & Thread Analyzer Static Analysis & Pointer Checker - Find Coding & Security Errors Intel® Advisor XE - Threading Prototyping Tool Intel® Trace Analyzer & Collector - MPI Optimizing Tool Compilers & Libraries Intel® Compiler - Optimizing Compiler for C, C++ and Fortran Intel® Integrated Performance Primitives† - Media and Data Optimizations Intel® Threading Building Blocks† - Parallelize Applications for Performance Intel® Math Kernel Library - High Performance Math Intel® MPI Library - Flexible, Efficient and Scalable Messaging C, C++ only and Fortran only versions of Parallel Studio XE are also available. † Available for C, C++ only Create fast, reliable code

56

57

58 Backup

59 Value of Suites Suite Only Features
Advisor XE Threading Prototyping Tool C++ Performance Guide Performance Wizard Pointer Checker Reduces memory corruption Code Complexity Analysis Find code likely to be less reliable Static Analysis Find Errors and Harden your Security

60 Interprocedural Optimization (IPO)
Cross-module optimization IPO is a seamless process. Most optimization actually happens during the link phase Benefits of IPO Optimization of a large number of frequently used small & medium functions, especially those called in loops Function inlining Eliminates need for argument setup, call branch/return overhead Enables opportunities for other optimizations (const prop, DCE, etc.) Dead code elimination, better register usage Improved alias analysis for better auto-vectorization & loop transformations May increase build time/binary size
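The inlining benefit described above is easiest to see with a small, frequently called function. A sketch with illustrative names, shown in one file for brevity, whereas IPO's value is performing this inlining when the callee lives in a different translation unit:

```c
/* A small helper like clamp, if defined in another source file, can only be
   inlined into the loop below when cross-module IPO (-ipo / /Qipo) is on.
   Inlining removes the call/return overhead per iteration and exposes the
   loop body to constant propagation and vectorization. */
static int clamp(int v, int lo, int hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

int clamp_sum(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += clamp(a[i], 0, 10);   /* candidate call for inlining */
    return s;
}
```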

61 “LOOP WAS VECTORIZED” build message
Auto-Vectorization “LOOP WAS VECTORIZED” build message subroutine quad(len,a,b,c,x1,x2) real(4) a(len), b(len), c(len), x1(len), x2(len), s do i=1,len s = b(i)**2 - 4.*a(i)*c(i) if (s.ge.0.) then x1(i) = sqrt(s) x2(i) = (-x1(i) - b(i)) *0.5 / a(i) x1(i) = ( x1(i) - b(i)) *0.5 / a(i) else x2(i)=0. x1(i)=0. endif enddo end > ifort -c -vec-report2 quad.f90 quad.f90(4): (col. 3) remark: LOOP WAS VECTORIZED. Auto-vectorizer exploits SIMD/DLP opportunities Auto-vectorizes sequential operations using SSE and AVX instructions No significant changes to source code Much easier to learn, debug, maintain, ... Forward looking w.r.t. compilers & processors! Optimized code for targeted processors Both Intel and Intel-compatible processors Mixed-processor environment supported Processor Specific Optimization Targeting specific Intel processor(s), e.g. for Intel® Core™ i7 use /QxSSE4.2 Best Applied to Code for IA-32 and Intel® 64; Additional Tuning Required for Intel® Xeon Phi™ coprocessors

62 Auto-Parallelization
Serial portion of code automatically translated into multi-threaded code when possible Determines good worksharing portion of serial code Performs dataflow analysis to verify correct parallel execution Partitions data for threaded code Parallel runtime support offers same features as in OpenMP* Handling details of loop iteration modification Thread scheduling Synchronization Enabled by “/Qparallel” switch
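The kind of loop the dataflow analysis described above can prove parallel is sketched below in plain C (function name is illustrative): the iterations are independent except for the sum, which the auto-parallelizer recognizes as a reduction and privatizes per thread. Compiled without /Qparallel (-parallel on Linux) the semantics are unchanged.

```c
/* A reduction loop: each iteration reads only its own element, and the
   accumulation into s is a recognizable reduction pattern, so the
   auto-parallelizer can split the iteration space across threads and
   combine per-thread partial sums at the end. */
double sum_squares(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i] * a[i];
    return s;
}
```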

63 Intel® Guided Auto Parallelism (GAP) Let the Compiler Tell You What it Needs
Motivation Effective, simple way to add parallelism to your applications Use built-in compiler technology to speed parallelism development What is GAP? Compiler-based analyzer that provides guidance to developers to change code so it can be compiled to automatically optimize code through vectorization, parallelization, or data transformation Built upon existing auto-vectorization and auto-parallelization technology GAP does not Analyze code and find hotspots for threading (see Advisor) Verify threading correctness (use Inspector, Inspector XE) Do any performance/hotspot analysis (use Amplifier, VTune Amplifier XE) Developer Must Verify Semantics of GAP Recommendations

64 Using GAP Requires optimization level set to /O2 or higher
Works with both command-line options and the MSVC IDE IPO or PGO is not required, but advice may change based on options User may apply all or a subset of the advice provided by GAP However, when multiple messages apply to a given loop ALL suggestions for that loop must be applied to get the desired optimization User can specify regions of a file or routine considered “hot” Advice will be restricted to the hot region Default is to provide advice on the entire compilation unit Advice may involve Suggestions for source changes that assert new properties Adding pragmas for a loop if semantics are satisfied Adding new options GAP output is a set of GAP messages, not an .exe

65 Optimization Reports GAP – Guided Auto Parallelism Other reports
New in Intel® Parallel Composer 2011! /Qguide switch Provides advice on source changes that could enable parallelism Doesn’t actually generate code, just provides analysis and suggestions Other reports /Qvec-report Which loops were vectorized, which were not Why they weren’t vectorized /Qpar-report /Qopt-report Reports available for a variety of optimizations icl -help reports for more details

66 High-Level Optimizations (HLO)
Enabled with -O3 (/O3 on Windows) With auto-vectorization, does more aggressive data dependency analysis than at /O2 Exploits properties of source code (loops & arrays) Best chance for performing loop transformations Performs loop transformations Loop distribution Loop interchange Loop fusion Loop unrolling Data pre-fetching PGO-based loop unrolling Etc.
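Loop fusion is one of the transformations listed above; writing the fused form out by hand shows what HLO produces from two adjacent loops over the same range (names are illustrative):

```c
/* Loop fusion: HLO can merge two adjacent loops over the same range into
   one, so a[i] is loaded once per iteration instead of once per loop, and
   the combined body gives the vectorizer more work per trip. This is the
   already-fused form of
       for (i...) b[i] = a[i] * 2.0;
       for (i...) c[i] = a[i] + 1.0;               */
void fused(double *b, double *c, const double *a, int n) {
    for (int i = 0; i < n; i++) {
        b[i] = a[i] * 2.0;   /* was loop 1 */
        c[i] = a[i] + 1.0;   /* was loop 2 */
    }
}
```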

67 Boost Parallel Application Performance running on Windows*, Linux* and OS X*
Boost Parallel Application Performance running on Windows*, Linux* and OS X*

68 Analysis - Intel® Inspector XE What’s New in SP1?
Easier Migration From Other Tools Import suppression lists from Purify* and Valgrind* on Linux* Fewer False Errors and Easier Suppression Management Precise suppressions specify single or multiple stack locations User editable suppression files (or use the GUI) Fortran – reduced false positives due to allocation Leak Reports No Waiting! Set a baseline for incremental analysis with GUI or API Report incremental leaks and heap growth since the baseline No waiting until the end of the analysis run New OS, Threading Model & Processor Support OpenMP 4.0 Haswell – Windows* & Linux* Windows* 8 desktop Visual Studio* 2012 Latest Linux* distributions New since the first 2013 release. Some features released in earlier updates.

69 CPU Power Analysis Intel® VTune™ Amplifier XE
Minimize wake-ups to decrease CPU power usage Identify wake-up causes Timers triggered by application Interrupts mapped to HW intr. level Show wake-up rate Display source code for events that wake-up processor Show CPU frequencies by core (CPU frequencies can change by CPU activity level) Linux only Select & filter to see a single wake up object: Identifies the cause of wake-ups and gives timer call stacks

70 Increase Power Efficiency Intel® VTune™ Amplifier XE
Traditional optimization Reduce total resource utilization Achieved by Use of new instructions Increase parallelism New optimization Increase uninterrupted idle time Achieved by Reduce the frequency of activity Consolidate activities The traditional approach to power optimization is to hurry up and finish the work. This is still a good idea. The new optimization approach is to also consolidate the work so that the processor spends longer uninterrupted periods of time in a low power state. To do that, you need to know what is waking up the processor… (next slide) Minimize Wake-ups from Timers and Interrupts

