Parallel Computing Explained: Porting Issues
Slides prepared from the CI-Tutor courses at NCSA by S. Masoud Sadjadi, School of Computing and Information Sciences, Florida International University, March 2009.

Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
3.1 Recompile
3.2 Word Length
3.3 Compiler Options for Debugging
3.4 Standards Violations
3.5 IEEE Arithmetic Differences
3.6 Math Library Differences
3.7 Compute Order Related Differences
3.8 Optimization Level Too High
3.9 Diagnostic Listings
3.10 Further Information

Porting Issues
To run a program that presently runs on a workstation, a mainframe, a vector computer, or another parallel computer on a new parallel computer, you must first "port" the code.
After porting the code, it is important to have benchmark results to compare against. To do this, run the original program on a well-defined dataset and save the results from the old or "baseline" computer. Then run the ported code on the new computer and compare the results.
If the results differ, don't automatically assume that the new results are wrong; they may actually be better. There are several reasons why this might be true, including:
Precision Differences: the new results may actually be more accurate than the baseline results.
Code Flaws: porting your code to a new computer may have uncovered a hidden flaw that was already there. Detection methods for finding code flaws, their solutions, and workarounds are provided in this lecture.

Recompile
Some codes just need to be recompiled to produce accurate results on the new platform. The compilers available on the NCSA computer platforms are shown in the following table:

  Language                  | SGI Origin2000 | IA-32 Linux                   | IA-64 Linux
                            | MIPSpro        | Portland Group | Intel | GNU  | Intel | GNU
  --------------------------+----------------+----------------+-------+------+-------+-----
  Fortran 77                | f77            | pgf77          | ifort | g77  | ifort | g77
  Fortran 90                | f90            | pgf90          | ifort |      | ifort |
  Fortran 95                | f95            |                | ifort |      |       |
  High Performance Fortran  |                | pghpf          |       |      |       |
  C                         | cc             | pgcc           | icc   | gcc  | icc   | gcc
  C++                       | CC             | pgCC           | icpc  | g++  | icpc  | g++

Word Length
Code flaws can occur when you port your code to a computer with a different word length.
For C, the size of an integer variable differs depending on the machine and how the variable is declared. On the IA-32 and IA-64 Linux clusters, the size of an integer variable is 4 and 8 bytes, respectively. On the SGI Origin2000, the corresponding value is 4 bytes if the code is compiled with the -n32 flag, and 8 bytes if compiled without any flags or explicitly with the -64 flag.
For Fortran, the SGI MIPSpro and Intel compilers provide the following flags to set the default variable sizes:
-in, where n is a number: set the default INTEGER to INTEGER*n. The value of n can be 4 or 8 on SGI, and 2, 4, or 8 on the Linux clusters.
-rn, where n is a number: set the default REAL to REAL*n. The value of n can be 4 or 8 on SGI, and 4, 8, or 16 on the Linux clusters.
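As a quick sanity check when moving between platforms, a short program can print the default integer width; this is a minimal sketch (the program name is illustrative):

  ! wordlen.f90 -- report the size of the default INTEGER and REAL kinds
  program wordlen
    integer :: i
    real    :: r
    print *, 'default INTEGER bytes:', bit_size(i) / 8
    print *, 'default REAL kind:    ', kind(r)
  end program wordlen

Compiling this with, for example, -i8 should change the reported INTEGER size from 4 to 8 bytes, making any word-length mismatch with the baseline machine immediately visible.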

Compiler Options for Debugging
On the SGI Origin2000, the MIPSpro compilers provide debugging options through the -DEBUG option group. The syntax is as follows:
-DEBUG:option1[=value1]:option2[=value2]...
Two examples are:
Array-bound checking (check for subscripts out of range at runtime): -DEBUG:subscript_check=ON
Force all uninitialized stack, automatic, and dynamically allocated variables to be initialized: -DEBUG:trap_uninitialized=ON
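For instance, run-time subscript checking catches an out-of-bounds access like the one below; the fragment and file name are illustrative:

  ! bounds.f90 -- deliberate out-of-range subscript
  program bounds
    integer :: a(10), i
    i = 11
    a(i) = 0        ! out of range; caught when subscript checking is on
    print *, a(1)
  end program bounds

Built as f90 -DEBUG:subscript_check=ON bounds.f90, the out-of-range store is reported at run time instead of silently corrupting memory.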

Compiler Options for Debugging
On the IA-32 Linux cluster, the Fortran compiler provides the following -C flags for runtime diagnostics:
-CA: pointers and allocatable references
-CB: array and subscript bounds
-CS: consistent shape of intrinsic procedures
-CU: use of uninitialized variables
-CV: correspondence between dummy and actual arguments
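These flags can be combined on one compile line; for example, checking the illustrative bounds.f90 program from the previous slide for both bounds violations and uninitialized variables might look like:

  ifort -CB -CU bounds.f90 -o bounds
  ./bounds

The bounds violation is then reported at run time rather than producing a wrong answer.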

Standards Violations
Code flaws can occur when the program contains non-ANSI-standard Fortran coding. ANSI standard Fortran is a set of rules for compiler writers that specifies, for example, the value of the do-loop index upon exit from the loop.
Standards Violations Detection
To detect standards violations on the SGI Origin2000 computer, use the -ansi flag. This option generates a listing of warning messages for the use of non-ANSI-standard coding.
On the Linux clusters, the -ansi[-] flag enables/disables the assumption of ANSI conformance.
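A classic example is legacy code that assumes a particular value of the do-loop index after the loop completes; this sketch is illustrative:

  ! Legacy code sometimes assumes I still holds N after the loop;
  ! under the standard, the index has the value of the next iteration, N+1.
  subroutine last_index(n)
    integer :: n, i
    do i = 1, n
      continue
    end do
    print *, 'index after loop:', i   ! n+1 on a standard-conforming compiler
  end subroutine last_index

Code that relies on the non-standard value may produce correct results on the baseline compiler and wrong results after porting.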

IEEE Arithmetic Differences
Code flaws occur when the baseline computer conforms to the IEEE arithmetic standard and the new computer does not. The IEEE arithmetic standard is a set of rules governing arithmetic roundoff and overflow behavior. For example, it prohibits the compiler writer from replacing x/y with x*recip(y), since the two results may differ slightly for some operands.
You can make your program conform strictly to the IEEE standard. To do so on the SGI Origin2000 computer, use:
f90 -OPT:IEEE_arithmetic=n ... prog.f
where n is 1, 2, or 3. This option specifies the level of conformance to the IEEE standard, where 1 is the most stringent and 3 is the most liberal.
On the Linux clusters, the Intel compilers enforce conformance to the IEEE standard at a stringent level with the -mp flag, or at a slightly relaxed level with the -mp1 flag.
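The effect is easy to demonstrate: expressions that are mathematically equivalent need not be bitwise equivalent in floating point, so relaxed arithmetic modes can change results in the last bits. A minimal, illustrative sketch:

  ! ieee_demo.f90 -- x/y versus x*(1.0/y) can differ in the last bit
  program ieee_demo
    real :: x, y
    x = 1.0e-7
    y = 3.0
    print *, x / y
    print *, x * (1.0 / y)   ! may differ slightly under relaxed arithmetic
  end program ieee_demo

Comparing the two printed values under different conformance flags shows whether the compiler is applying this kind of substitution.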

Math Library Differences
Most high-performance parallel computers are equipped with vendor-supplied math libraries. On the SGI Origin2000 platform there are two: the SGI/Cray Scientific Library (SCSL) and Complib.sgimath.
SCSL contains Level 1, 2, and 3 Basic Linear Algebra Subprograms (BLAS), LAPACK, and Fast Fourier Transform (FFT) routines. SCSL can be linked with -lscs for the serial version, or -mp -lscs_mp for the parallel version.
The Complib library can be linked with -lcomplib.sgimath for the serial version, or -mp -lcomplib.sgimath_mp for the parallel version.
The Intel Math Kernel Library (MKL) contains the complete set of functions from BLAS, the extended BLAS (sparse), the complete set of LAPACK routines, and Fast Fourier Transform (FFT) routines.
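Because SCSL, Complib, and MKL all implement the standard BLAS interface, a call site like the one below stays unchanged across platforms; only the link line differs. A minimal sketch using the Level 1 routine SDOT (file name illustrative):

  ! dot.f90 -- dot product via the BLAS routine SDOT
  program dot
    real :: x(3), y(3), sdot
    external sdot
    x = (/ 1.0, 2.0, 3.0 /)
    y = (/ 4.0, 5.0, 6.0 /)
    ! sdot(n, x, incx, y, incy): increments of 1 walk the arrays contiguously
    print *, 'x . y =', sdot(3, x, 1, y, 1)
  end program dot

On the Origin2000 this links with -lscs or -lcomplib.sgimath; on the Linux clusters, with the MKL link lines shown on the next slide.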

Math Library Differences
On the IA-32 Linux cluster, the libraries to link to are:
For BLAS: -L/usr/local/intel/mkl/lib/32 -lmkl -lguide -lpthread
For LAPACK: -L/usr/local/intel/mkl/lib/32 -lmkl_lapack -lmkl -lguide -lpthread
When calling MKL routines from C/C++ programs, you also need to link with -lF90.
On the IA-64 Linux cluster, the corresponding libraries are:
For BLAS: -L/usr/local/intel/mkl/lib/64 -lmkl_itp -lpthread
For LAPACK: -L/usr/local/intel/mkl/lib/64 -lmkl_lapack -lmkl_itp -lpthread
When calling MKL routines from C/C++ programs, you also need to link with -lPEPCF90 -lCEPCF90 -lF90 -lintrins
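As a concrete illustration, the SDOT sketch from the previous slide could be built against MKL on the IA-32 cluster with the BLAS link line given above (source file name illustrative):

  ifort dot.f90 -L/usr/local/intel/mkl/lib/32 -lmkl -lguide -lpthread -o dot
  ./dot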

Compute Order Related Differences
Code flaws can occur because of the non-deterministic computation of data elements on a parallel computer. The order in which the threads run cannot be guaranteed. For example, in a data-parallel program, the 50th index of a do loop may be computed before the 10th index of the loop. Furthermore, the threads may run in one order on the first run, and in another order on the next run of the program.
Note: if your algorithm depends on data being compared in a specific order, your code is inappropriate for a parallel computer.
Use the following method to detect compute-order related differences. If your loop looks like:
DO I = 1, N
change it to:
DO I = N, 1, -1
The results should not change if the iterations are independent.
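Floating-point addition is not associative, so an order-dependent reduction typically changes slightly when the loop is reversed. A minimal sketch of the loop-reversal test (program name illustrative):

  ! order.f90 -- reversing a summation loop exposes order dependence
  program order
    real :: fwd, rev
    integer :: i
    fwd = 0.0
    rev = 0.0
    do i = 1, 1000000
      fwd = fwd + 1.0 / real(i)
    end do
    do i = 1000000, 1, -1
      rev = rev + 1.0 / real(i)
    end do
    print *, 'forward:', fwd, ' reverse:', rev, ' diff:', fwd - rev
  end program order

A nonzero difference here is not a bug in either loop; it signals that the result depends on evaluation order, which is exactly the property that makes a loop unsafe to parallelize as-is.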

Optimization Level Too High
Code flaws can occur when the optimization level is set too high, trading accuracy for speed. The compiler reorders and optimizes your code based on assumptions it makes about your program. This can sometimes cause answers to change at higher optimization levels.
Setting the Optimization Level
Both the SGI Origin2000 computer and the IBM Linux clusters provide Level 0 (no optimization) through Level 3 (most aggressive) optimization, using the -O{0,1,2,3} flag. Bear in mind that Level 3 optimization may carry out loop transformations that affect the correctness of calculations. Checking the correctness and precision of the calculation is highly recommended when -O3 is used.
For example, on the Origin2000:
f90 -O0 ... prog.f
turns off all optimizations.

Optimization Level Too High
Isolating Optimization Level Problems
You can sometimes isolate optimization level problems using the method of binary chop.
To do this, divide your program prog.f into halves. Name them prog1.f and prog2.f. Compile the first half with -O0 and the second half with -O3:
f90 -c -O0 prog1.f
f90 -c -O3 prog2.f
f90 prog1.o prog2.o
a.out > results
If the results are correct, the optimization problem lies in prog1.f.
Next divide prog1.f into halves. Name them prog1a.f and prog1b.f. Compile prog1a.f with -O0 and prog1b.f with -O3:
f90 -c -O0 prog1a.f
f90 -c -O3 prog1b.f
f90 prog1a.o prog1b.o prog2.o
a.out > results
Continue in this manner until you have isolated the section of code that is producing incorrect results.

Diagnostic Listings
The SGI Origin2000 compilers will generate many kinds of diagnostic warnings and messages, but not always by default. Some useful listing options are:
f90 -listing ...
f90 -fullwarn ...
f90 -showdefaults ...
f90 -version ...
f90 -help ...

Further Information
SGI:
man f77 / f90 / cc
man debug_group
man math
man complib.sgimath
MIPSpro 64-Bit Porting and Transition Guide
Online manuals
Linux clusters:
ifort/icc/icpc -help (IA-32, IA-64, Intel64)
Intel Fortran Compiler for Linux
Intel C/C++ Compiler for Linux

Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
4.1 Aggressive Compiler Options
4.2 Compiler Optimizations
4.3 Vendor Tuned Code
4.4 Further Information

Scalar Tuning
If you are not satisfied with the performance of your program on the new computer, you can tune the scalar code to decrease its runtime. This chapter describes several such techniques:
The use of the most aggressive compiler options
The use of loop unrolling
The use of subroutine inlining
The use of vendor-supplied tuned code
The detection of cache problems, and their solutions, are presented in the Cache Tuning chapter.

Aggressive Compiler Options
For the SGI Origin2000 and the Linux clusters, the main optimization switch is -On, where n ranges from 0 to 3.
-O0 turns off all optimizations.
-O1 and -O2 perform beneficial optimizations that will not affect the accuracy of results.
-O3 specifies the most aggressive optimizations. It takes the most compile time, may produce changes in accuracy, and turns on software pipelining.

Aggressive Compiler Options
It should be noted that -O3 may carry out loop transformations that produce incorrect results in some codes. It is recommended that one compare the answer obtained from Level 3 optimization with one obtained from a lower optimization level.
On the SGI Origin2000 and the Linux clusters, -O3 can be used together with -OPT:IEEE_arithmetic=n (n = 1, 2, or 3) and -mp (or -mp1), respectively, to enforce operation conformance to the IEEE standard at different levels.
On the SGI Origin2000, the option -Ofast=ip27 is also available. This option specifies the most aggressive optimizations that are specifically tuned for the Origin2000 computer.
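For example, an aggressive build on the Origin2000 that still pins down IEEE behavior combines the flags above (file name illustrative):

  f90 -O3 -OPT:IEEE_arithmetic=2 prog.f

or, tuned specifically for the Origin2000:

  f90 -Ofast=ip27 prog.f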