Compiler Ecosystem
November 22, 2018
Computation Products Group


Compiler Comparisons Table
Critical Features Supported by x86 Compilers

Features compared: Vector SIMD support, peels loops, global IPA, OpenMP, links ACML libraries, profile-guided feedback, aligns parallel debuggers, large-array support, medium memory model
Compilers compared: PGI, GNU, Intel, Pathscale, Absoft, SUN, Microsoft
[The per-compiler support marks in the original table were not preserved in this transcript.]

Intel CPUID Checks
How to determine if they exist in a binary

The CPUID instruction reports:
- the types of x86/x86-64 instructions supported (SSE, SSE2, SSE3)
- the vendor of the processor ("GenuineIntel" or "AuthenticAMD")

The Intel C and FORTRAN compilers' runtime library environments check the processor vendor and then run down an alternate code path that either:
- segmentation faults, because Intel doesn't support non-Intel processors, or
- executes legacy code optimized for the Pentium Pro, PII or PIII

CPUID checks also exist in Intel's Math Kernel Library; applications calling FFTs or linear algebra routines are strongly impacted. ISVs and customers must utilize ACML instead (likely a 2x performance boost).

Intel CPUID Checks
How to determine if they exist in a binary

To check whether CPUID checks exist in a binary:
1. Dump all assembly instructions in the binary to a text file:
   objdump -d <binary> > binary.txt
2. Search the binary.txt file for lines containing cpuid instructions:
   grep "cpuid" binary.txt
   This search prints the instruction address at the beginning of each line containing cpuid.
3. The cpuid instruction is located in a function called "IntelProcessorIdentificationFunction". Determine how many times it is called in binary.txt by typing:
   grep "IntelProcessorIdentificationFunction" binary.txt

Illustrating to ISVs and customers the practices Intel employs at the user's inconvenience builds rapport and confidence between them and AMD.
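The two commands above can be wrapped into one small script. This is a sketch: the function name and the binary.txt file name are illustrative, and it assumes objdump and grep are on the PATH.

```shell
#!/bin/sh
# count_cpuid: dump a binary's disassembly and count the lines containing
# the cpuid instruction, following the objdump/grep steps above.
count_cpuid() {
    objdump -d "$1" > binary.txt || return 1   # dump all instructions to a text file
    grep -c 'cpuid' binary.txt                 # count lines containing cpuid
}
```

Run as `count_cpuid ./app`; a nonzero count means the binary contains at least one CPUID check worth inspecting further.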

Intel Compiler and MKL on Opteron
Threat Assessment of Using Intel Compilers

The compiler is a weapon: its maker can control the code generated and run on their own chip and on their competitor's.
- Working with PGI and NAG, we can address a customer's performance and functionality issues by modifying the compiler or ACML.

CPUID checks test the vendor ID rather than instruction compatibility.
- AMD platform issues are not supported unless reproducible on Intel platforms.
- CPUID checks are placed into code because Intel doesn't trust users' intellect: http://support.intel.com/support/performancetools/c/sb/cs-009787.htm
- Issues on AMD platforms cannot be addressed and will not be reproducible, since we do not report the same vendor ID in the CPUID instruction.
- ISVs and customers draw the conclusion that AMD platforms aren't dependable.

Intel Compiler and MKL on Opteron
Threat Assessment of Using Intel Compilers

The AMD Core Math Library (ACML) cannot be linked with the Intel 8.1 AMD64 compiler; the only option is Intel's MKL.
- Opteron runs many Intel MKL routines at 25-75% of the rate it runs the counterpart ACML routines (e.g. CFFT1D, CFFT2D, DGEMM, ...).
- ISVs and customers whose applications are performance-bound by FFTs, BLAS or LAPACK are strongly impacted (e.g. ANSYS performance increased 43% moving to 64-bit using ACML rather than MKL).
- This necessitates increasing the number of compilers and binaries required to support both AMD and Intel platforms.

PGI creates both AMD-tuned (-tp k8-64) and Intel-tuned (-tp p7-64) binaries.
- Work done by AMD tuning the PGI compiler is leveraged in the Intel binaries as well.
- On LS-DYNA, the PGI 64-bit binary targeted at Xeon with -tp p7-64 is 4% faster than the Intel 8.1 binary.

Intel Compiler and MKL on Opteron
Threat Assessment of Using Intel Compilers

Intel has stated at the link below that in the 8.1 Intel compilers, the switches to target chips without SSE2 or SSE3 will no longer function: http://support.intel.com/support/performancetools/c/sb/cs-009787.htm
- Opteron lacks SSE3 support until Jackhammer in Q2 '05, and the user will be unable to tell the compiler not to utilize SSE3 instructions.
- ISVs and customers will have no way to use binaries built by Intel compilers on Opteron.
- Occurrences like this will continue every time Intel introduces a new instruction set for x86-based systems (SSE4?).
- Users presently using the Intel compiler on Opteron-based systems, or ISVs supporting customers in a similar manner, will have no method of optimizing code for an AMD-based system other than compiling without optimization.

Tuning Performance with Compilers
Maintaining Stability While Optimizing

STEP 0: Build the application using the following procedure:
- Compile all files with the most aggressive optimization flags: -tp k8-64 -fastsse
- If compilation fails or the application doesn't run properly, turn off vectorization: -tp k8-64 -fast -Mscalarsse
- If problems persist, compile at optimization level 0: -tp k8-64 -O0

STEP 1: Profile the binary and determine the performance-critical routines.
STEP 2: Repeat STEP 0 on the performance-critical functions, one at a time, running the binary after each step to check stability.
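STEP 0 above can be sketched as a shell loop. The three flag sets are the ones from this slide; the compiler name (pgcc), source file, and function name are placeholders to adapt to the real build.

```shell
#!/bin/sh
# try_build: attempt the three PGI flag sets in order, from most to least
# aggressive, stopping at the first one that compiles successfully.
CC=${CC:-pgcc}     # placeholder: PGI compiler driver
SRC=${SRC:-app.c}  # placeholder: application source

try_build() {
    for flags in "-tp k8-64 -fastsse" \
                 "-tp k8-64 -fast -Mscalarsse" \
                 "-tp k8-64 -O0"; do
        if $CC $flags -o app "$SRC" 2>/dev/null; then
            echo "built with: $flags"
            return 0
        fi
    done
    echo "all flag sets failed" >&2
    return 1
}
```

Note that the slide asks you to verify the application also runs properly at each level, not merely that it compiles; a fuller script would run a test case after each successful build before accepting that flag set.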

Tuning Memory I/O Bandwidth
Optimizing Large Streaming Operations

There are 2 methods of writing to memory in x86/x86-64:

Traditional memory stores cause write-allocates to cache:
  mov %rax,[%rdi]
  movsd %xmm0,[%rdi]
  movapd %xmm0,[%rdi]
- The page to be modified is read into cache; the cache is modified and written back to memory when a new memory page is loaded.
- To write N bytes, 2N bytes of bandwidth are generated.

Non-temporal stores bypass the cache and write directly to memory:
- No write-allocate to cache: to write N bytes, only N bytes of bandwidth are generated.
- The data is not backed up into cache; do not use these with frequently reused data.
- Use only in functions that write half the L2 cache size of data or more, which would normally assure little cache-reuse value.

Group all eligible routines into a common file so as to simplify the compilation procedure. Enable non-temporal stores in the PGI compiler with the -Mnontemporal compiler option.
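The grouping advice can be sketched as a build fragment: only the file holding the streaming-store routines gets -Mnontemporal. The file names, function name, and use of pgcc here are illustrative.

```shell
#!/bin/sh
# build_with_nt_stores: compile the streaming-write routines (grouped into
# streaming.c) with non-temporal stores enabled, and everything else without.
CC=${CC:-pgcc}   # placeholder: PGI compiler driver

build_with_nt_stores() {
    $CC -c -tp k8-64 -fastsse -Mnontemporal streaming.c || return 1
    $CC -c -tp k8-64 -fastsse main.c || return 1
    $CC -o app main.o streaming.o
}
```

Keeping the non-temporal routines in one translation unit means a single extra flag on one compile line, rather than per-function pragmas scattered through the code.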

PGI Compiler Flags
Optimization Flags

Below are 3 sets of recommended PGI compiler flags for flag-mining application source bases:

Most aggressive: -tp k8-64 -fastsse -Mipa=fast
- Enables instruction-level tuning for Opteron, O2-level optimizations, SSE scalar and vector code generation, inter-procedural analysis, LRE optimizations and unrolling.
- Strongly recommended for any single-precision source code.

Middle of the road: -tp k8-64 -fast -Mscalarsse
- Enables all of the most aggressive set except vector code generation, which can reorder loops and generate slightly different results.
- A good substitute for double-precision source bases, since Opteron has the same throughput on scalar and vector code.

Least aggressive: -tp k8-64 -O0 (or -O1)

PGI Compiler Flags
Functionality Flags

-mcmodel=medium: use if your application statically allocates a net sum of data structures greater than 2GB
-Mlarge_arrays: use if any single array in your application is greater than 2GB
-KPIC: use when linking to shared-object (dynamically linked) libraries
-mp: process OpenMP/SGI directives/pragmas (build multi-threaded code)
-Mconcur: attempt auto-parallelization of your code on an SMP system

Absoft Compiler Flags
Optimization Flags

Below are 3 sets of recommended Absoft compiler flags for flag-mining application source bases:

Most aggressive: -O3
- Loop transformations, instruction-preference tuning, cache tiling, and SIMD code generation. Generally provides the best performance but may cause compilation failure or slow performance in some cases.
- Strongly recommended for any single-precision source code.

Middle of the road: -O2
- Enables most options of -O3, including SIMD code generation, instruction preferences, common sub-expression elimination, pipelining and unrolling.
- A good substitute for double-precision source bases, since Opteron has the same throughput on scalar and vector code.

Least aggressive: -O1

Absoft Compiler Flags
Functionality Flags

-mcmodel=medium: use if your application statically allocates a net sum of data structures greater than 2GB
-g77: enables full compatibility with g77-produced objects and libraries (you must use this option to link to the GNU ACML libraries)
-fpic: use when linking to shared-object (dynamically linked) libraries
-safefp: performs certain floating-point operations more slowly to avoid overflow and underflow and to assure proper handling of NaNs

Pathscale Compiler Flags
Optimization Flags

Most aggressive: -Ofast
- Equivalent to -O3 -ipa -OPT:Ofast -fno-math-errno

Aggressive: -O3
- Optimizations for the highest-quality code, enabled at the cost of compile time.
- Some of the generally beneficial optimizations included may hurt performance.

Reasonable: -O2
- Extensive conservative optimizations, almost always beneficial.
- Faster compile time; avoids changes that affect floating-point accuracy.

Pathscale Compiler Flags
Functionality Flags

-mcmodel=medium: use if static data structures are greater than 2GB
-ffortran-bounds-check: (Fortran) check array bounds
-shared: generate position-independent code for calling shared-object libraries
-march=(opteron|athlon64|athlon64fx): optimize code for the selected platform (opteron is the default)

Feedback-Directed Optimization
STEP 0: Compile the binary with -fb_create_fbdata
STEP 1: Run the code to collect data
STEP 2: Recompile the binary with -fb_opt fbdata
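The three FDO steps above can be sketched as one function, using the flags as named on the slide. The compiler variable, file names, and the bare training run are placeholders; a real training run would use representative input.

```shell
#!/bin/sh
# fdo_build: PathScale feedback-directed optimization, following the
# STEP 0-2 procedure above. CC and the file names are placeholders.
CC=${CC:-pathcc}

fdo_build() {
    $CC -fb_create_fbdata -o app app.c || return 1  # STEP 0: instrumented build
    ./app                              || return 1  # STEP 1: training run collects data
    $CC -fb_opt fbdata -o app app.c                 # STEP 2: rebuild using the profile
}
```

The quality of the final binary depends on how representative the STEP 1 run is: profile data from an unrepresentative workload can steer the optimizer toward the wrong hot paths.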

Microsoft Compiler Flags
Optimization Flags

Recommended flags: /O2 /Ob2 /GL /fp:fast
- /O2 turns on several general optimizations
- /Ob2 enables inline expansion
- /GL enables inter-procedural optimizations
- /fp:fast allows the compiler to use a fast floating-point model

Feedback-Directed Optimization
STEP 0: Compile the binary with /LTCG:PGI
STEP 1: Run the code to collect data
STEP 2: Recompile the binary with /LTCG:PGO

Turn Off Buffer Overrun Checking
- The compiler by default uses /GS to check for buffer overruns. Turning off checking by specifying /GS- may result in additional performance.

Microsoft Compiler Flags
Functionality Flags

/GT supports fiber safety for data allocated using static thread-local storage
/Wp64 detects most 64-bit portability problems
/LD creates a dynamic-link library
/Oa assumes no aliasing
/Ow assumes aliasing across function calls but not inside functions

64-Bit Operating Systems
Recommendations and Status

SUSE SLES 9 with the latest available Service Pack:
- Has technology for supporting the latest AMD processor features.
- Widest breadth of NUMA support, enabled by default.
- OProfile system profiler installable as an RPM and modularized.
- Complete support for statically and dynamically linked 32-bit binaries.

Red Hat Enterprise Server 3.0, Service Pack 2 or later:
- NUMA feature support not as complete as that of SUSE SLES 9.
- OProfile installable as an RPM, but installation is not modularized and may require a kernel rebuild if the RPM version isn't satisfactory.
- Only SP2 or later has complete 32-bit shared-object library support (a requirement to run all 32-bit binaries on a 64-bit OS).
- The POSIX threading library changed between 2.1 and 3.0, which may require users to rebuild their applications.