Performance Optimization: Getting your programs to run faster



Why optimize?
- Better turn-around on jobs
- Run more programs/scenarios
- Release resources to other applications
- You want the job to finish before you retire

Ways to get more performance
- Run on bigger, faster hardware (higher clock speed, more memory, …)
- Tweak your algorithm
- Optimize your code

Loop Unrolling
- Converting passes of a loop into in-line streams of code
- Useful when loops do calculations on data in arrays
- Unrolling can take advantage of pipelined processing units in processors
- The compiler may preload operands into CPU registers

Loop Unrolling – disadvantages
- The degree of unrolling may be limited by the number of floating-point registers:
    Pentium III: 8
    Pentium 4: 8
    Itanium: 128

Loop Unrolling – simple example

Rolled loop:
    do i=1,n
      a(i) = b(i) + x*c(i)
    enddo

Unrolled loop (by 4):
    do i=1,n,4
      a(i)   = b(i)   + x*c(i)
      a(i+1) = b(i+1) + x*c(i+1)
      a(i+2) = b(i+2) + x*c(i+2)
      a(i+3) = b(i+3) + x*c(i+3)
    enddo
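The unrolled version above assumes n is a multiple of 4; in the general case a cleanup loop handles the leftover iterations. A minimal C sketch of the same idea (the function name and unroll factor are illustrative, not from the original slides):

    #include <stddef.h>

    /* Unroll-by-4 version of a(i) = b(i) + x*c(i), with a cleanup loop
       for the iterations left over when n is not a multiple of 4. */
    void axpy_unrolled(float *a, const float *b, const float *c, float x, size_t n)
    {
        size_t i;
        for (i = 0; i + 3 < n; i += 4) {        /* main unrolled body */
            a[i]     = b[i]     + x * c[i];
            a[i + 1] = b[i + 1] + x * c[i + 1];
            a[i + 2] = b[i + 2] + x * c[i + 2];
            a[i + 3] = b[i + 3] + x * c[i + 3];
        }
        for (; i < n; i++)                       /* cleanup: remaining 0-3 iterations */
            a[i] = b[i] + x * c[i];
    }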

Loop Unrolling – simple example: performance

Rolled:
    Pentium III 550 MHz – 13 MFLOPS
    Itanium – 30 MFLOPS
Unrolled:
    Pentium III 550 MHz – 30 MFLOPS
    Itanium – 107 MFLOPS

*from: LCI and NCSA

Loop Unrolling

Original:
    int a[100];
    int i;
    for (i = 0; i < 100; i++) {
        a[i] = a[i] * 2;
    }

Unrolled by 5:
    int a[100];
    int i;
    for (i = 0; i < 100; i += 5) {
        a[i]   = a[i]   * 2;
        a[i+1] = a[i+1] * 2;
        a[i+2] = a[i+2] * 2;
        a[i+3] = a[i+3] * 2;
        a[i+4] = a[i+4] * 2;
    }

Loop unrolling

Original:
    int a[10][10];
    int i, j;
    for (i = 0; i < 10; i++) {
        for (j = 0; j < 10; j++) {
            a[i][j] = a[i][j] * 2;
        }
    }

Inner loop fully unrolled:
    int a[10][10];
    int i;
    for (i = 0; i < 10; i++) {
        a[i][0] = a[i][0] * 2;
        a[i][1] = a[i][1] * 2;
        a[i][2] = a[i][2] * 2;
        a[i][3] = a[i][3] * 2;
        a[i][4] = a[i][4] * 2;
        a[i][5] = a[i][5] * 2;
        a[i][6] = a[i][6] * 2;
        a[i][7] = a[i][7] * 2;
        a[i][8] = a[i][8] * 2;
        a[i][9] = a[i][9] * 2;
    }

Loop unrolling – Dot Product

Original:
    float a[100];
    float b[100];
    float z = 0.0f;
    int i;
    for (i = 0; i < 100; i++) {
        z = z + a[i] * b[i];
    }

Unrolled by 2:
    float a[100];
    float b[100];
    float z = 0.0f;
    int i;
    for (i = 0; i < 100; i += 2) {
        z = z + a[i]   * b[i];
        z = z + a[i+1] * b[i+1];
    }
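Both statements in the unrolled body still feed the single accumulator z, so the floating-point adds form one serial dependency chain. A common refinement, not shown in the original slides, is to unroll into independent partial sums so the adds can overlap; a minimal sketch, assuming that reassociating the sum is acceptable for your accuracy requirements (function name illustrative):

    /* Dot product unrolled by 2 with two independent accumulators.
       Reassociating the sum this way can change rounding slightly. */
    float dot_unrolled(const float *a, const float *b, int n)
    {
        float z0 = 0.0f, z1 = 0.0f;
        int i;
        for (i = 0; i + 1 < n; i += 2) {
            z0 += a[i]     * b[i];        /* even elements */
            z1 += a[i + 1] * b[i + 1];    /* odd elements  */
        }
        if (i < n)                         /* cleanup for odd n */
            z0 += a[i] * b[i];
        return z0 + z1;
    }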

Unrolling Loops
You can do it automatically – the compiler can unroll loops for you.

Unrolling Loops – compiler options

GNU Compilers:
    -funroll-loops
    -funroll-all-loops (not recommended)
PGI Compilers:
    -Munroll
    -Munroll=c:N
    -Munroll=n:M

Unrolling Loops – Compiler Options

Intel Compilers:
    -unrollM (unroll up to M times)
    -unroll

Design your program to minimize cache misses
- Align data arrays with cache boundaries

If your algorithm makes repeated passes across specific rows or columns, try making your array dimensions match the cache buffer size of your computer.
For example, if your array is 1000 x 1000 (single-byte integers) and you have a 1024-byte cache, allocate the array as 1024 x 1024.

Align data arrays with cache boundaries
…or if your array is 500 x 500, allocate it as 512 x 512.
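A minimal C sketch of the padding idea above, following the slide's suggestion; the array name and the 512 padded dimension are illustrative, and only the 500 x 500 region is actually used:

    #define N_USED 500   /* logical problem size      */
    #define N_PAD  512   /* padded leading dimension  */

    /* Rows are padded to 512 elements so each row starts on a
       cache-line boundary (assuming the array itself is aligned),
       even though the computation only touches 500 columns. */
    static float grid[N_USED][N_PAD];

    void scale_grid(float factor)
    {
        int i, j;
        for (i = 0; i < N_USED; i++)
            for (j = 0; j < N_USED; j++)   /* only the used 500 x 500 region */
                grid[i][j] *= factor;
    }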

Align data arrays with cache boundaries

[Figure: consecutive array elements a(1,1), a(1,2), … a(2,1), a(2,2), … laid out in memory against cache-line boundaries.]

Taking Memory in Order
Optimizing the use of cache: row-major order vs. column-major order
    row major    →  a(1,1), a(1,2), a(1,3), a(2,1), a(2,2), …
    column major →  a(1,1), a(2,1), a(3,1), a(1,2), a(2,2), …

Taking Memory in Order
Remember: C and Fortran store arrays in the opposite manner
    C – row major
    Fortran – column major
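Because C is row-major, the inner loop should run over the last (rightmost) index so that consecutive iterations touch adjacent memory. A minimal sketch; the array name and sizes are illustrative:

    #define ROWS 1000
    #define COLS 1000

    static double m[ROWS][COLS];

    /* Cache-friendly in C: the inner loop walks the rightmost index,
       so consecutive accesses are to adjacent memory locations. */
    double sum_good_order(void)
    {
        double s = 0.0;
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                s += m[i][j];
        return s;
    }

    /* Cache-unfriendly in C: walking down a column jumps
       COLS * sizeof(double) bytes between consecutive accesses,
       causing far more cache misses. */
    double sum_bad_order(void)
    {
        double s = 0.0;
        for (int j = 0; j < COLS; j++)
            for (int i = 0; i < ROWS; i++)
                s += m[i][j];
        return s;
    }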

Taking Memory in Order
[Figure: traversal order of a 2-D array in C (row-major) vs. Fortran (column-major).]

Taking Memory in Order – Fortran example

Row-wise traversal (wrong order for Fortran):
    do i=1,m
      do j=1,n
        a(i,j) = b(i,j) + c(i)
      end do
    end do
Measured: loop runs at 4.48 Mflops

Column-wise traversal (matches Fortran's column-major storage):
    do j=1,n
      do i=1,m
        a(i,j) = b(i,j) + c(i)
      end do
    end do
Measured: loop time 2.80

Floating Point Division
- FP division is very expensive – it takes many clock cycles to compute
- Usually not pipelined
- Correctly rounded FP division is required by the IEEE "rules", so compilers will not replace it by default

Floating point division – use the reciprocal

Original:
    float a[100];
    int i;
    for (i = 0; i < 100; i++) {
        a[i] = a[i] / 2;
    }

With a reciprocal multiply:
    float a[100];
    float denom;
    int i;
    denom = 1.0f / 2.0f;   /* note: integer 1/2 would evaluate to 0 */
    for (i = 0; i < 100; i++) {
        a[i] = a[i] * denom;
    }
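The same trick applies when the divisor is a loop-invariant variable rather than a constant: compute the reciprocal once outside the loop and multiply inside it. A minimal sketch (function and parameter names are illustrative):

    /* One divide total instead of one divide per element.
       The result may differ in the last bit from true division,
       which is why strict IEEE mode forbids this substitution. */
    void scale_down(float *a, int n, float divisor)
    {
        float recip = 1.0f / divisor;   /* hoisted out of the loop */
        for (int i = 0; i < n; i++)
            a[i] = a[i] * recip;
    }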

Compiler options for IEEE Compatibility
    PGI Compilers   – -Knoieee
    Intel Compilers – -mp
    GNU Compilers   – no equivalent option for relaxing FP division

- Compilers can't optimize the division if the divisor is not scalar
- Doing so breaks the IEEE "rules"
- May impact portability of results

Function Inlining
- Build functions/subroutines in as inline parts of the program's code, rather than as separate functions/subroutines
- Minimizes function calls (and the overhead of managing them)
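A minimal C sketch of the idea, assuming a small helper routine that the compiler (or the programmer) inlines into a hot loop; the function names are illustrative:

    /* Small helper – a typical candidate for inlining. */
    static inline float scale_and_add(float x, float scale, float offset)
    {
        return x * scale + offset;
    }

    /* With the helper inlined, the call overhead (argument passing,
       call/return, stack management) disappears and the body is
       optimized together with the surrounding loop. */
    void transform(float *a, int n, float scale, float offset)
    {
        for (int i = 0; i < n; i++)
            a[i] = scale_and_add(a[i], scale, offset);
    }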

Function Inlining
Compile with:
    -Minline              – the compiler tries to inline whatever meets its criteria
    -Minline=except:func  – excludes func from inlining
    -Minline=func         – inlines only func

Function Inlining
…Compile with:
    -Minline=myfile.lib   – inlines functions from an inline library file
    -Minline=levels:n     – inlines functions up to n levels of calls (default is usually 1)

MPI Tuning
- Minimize messages
- Pointers/counts
- MPI derived datatypes (see the sketch below)
- MPI_Pack / MPI_Unpack
- Use shared memory for message passing where you can:
      #PBS -l nodes=6:ppn=1   …but…
      #PBS -l nodes=3:ppn=2   …is better.
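A minimal sketch of the derived-datatype idea: send one strided column of a matrix as a single message instead of many small sends. It assumes MPI_Init has already been called; the matrix layout, tag, and function name are illustrative:

    #include <mpi.h>

    #define NROWS 100
    #define NCOLS 100

    /* Send column `col` of a row-major NROWS x NCOLS matrix as one
       message instead of NROWS separate sends. */
    void send_column(double mat[NROWS][NCOLS], int col, int dest)
    {
        MPI_Datatype column_type;

        /* NROWS blocks of 1 double, each separated by NCOLS doubles. */
        MPI_Type_vector(NROWS, 1, NCOLS, MPI_DOUBLE, &column_type);
        MPI_Type_commit(&column_type);

        /* One send: fewer messages, less per-message latency overhead. */
        MPI_Send(&mat[0][col], 1, column_type, dest, 0, MPI_COMM_WORLD);

        MPI_Type_free(&column_type);
    }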

Compiler optimizations
    -O0      – no optimization
    -O1      – local optimization, register allocation
    -O2      – local/limited global optimization
    -O3      – aggressive global optimization
    -Munroll – loop unrolling
    -Mvect   – vectorization
    -Minline – function inlining

gcc Compiler Optimizations
See the gcc documentation for the corresponding -O and -f optimization flags.