Performance Analysis: Theory and Practice
Chris Mueller
Tech Tuesday Talk, July 10, 2007

Computational Science
- Traditional supercomputing
  - Simulation
  - Modeling
  - Scientific visualization
- Key: problem (simulation) size

Informatics
- Traditional business applications, life sciences, security, the Web
- Workloads
  - Data mining
  - Databases
  - Statistical analysis and modeling
  - Graph problems
  - Information visualization
- Key: data size

Theory

Big-O Notation
Big-O notation is a method for describing the time or space requirements of an algorithm.

f(x) = O(g(x))

- f is an algorithm whose memory use or runtime is a function of x
- O(g(x)) is an idealized class of functions of x that captures the general behavior of f(x)
- Only the dominant term in g(x) is kept, and constants are dropped

Example: f(x) = 4x² + 2x + 1 = O(x²), so f(x) is of order x².

Asymptotic Analysis
Asymptotic analysis studies how algorithms scale and typically measures computation or memory use.

Order                  Example
Constant, O(1)         Table lookup: y = table[idx]
Linear, O(n)           Dot product: for x, y in zip(X, Y): result += x * y
Logarithmic, O(log n)  Binary search
Quadratic, O(n²)       Comparison matrix
Exponential, O(aⁿ)     Multiple sequence alignment:
                         Fin      GGAGTGACCGATTA---GATAGC
                         Sei      GGAGTGACCGATTA---GATAGC
                         Cow      AGAGTGACCGATTACCGGATAGC
                         Giraffe  AGGAGTGACCGATTACCGGATAGC
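
To make the first rows of the table concrete, here is a minimal Python sketch (illustrative helper names, not code from the talk) contrasting a linear scan with a binary search over the same sorted data:

def linear_search(items, target):
    # O(n): may examine every element once.
    for i, value in enumerate(items):
        if value == target:
            return i
    return -1

def binary_search(items, target):
    # O(log n): halves the search space each step; items must be sorted.
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid
        if items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

data = list(range(1000000))
print(linear_search(data, 987654))   # up to ~10^6 comparisons
print(binary_search(data, 987654))   # ~20 comparisons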

Example: Dotplot

function dotplot(qseq, sseq, win, strig):
    for q in qseq:
        for s in sseq:
            if CompareWindow(qseq[q:q+win], sseq[s:s+win], strig):
                AddDot(q, s)

win = 3, strig = 2
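
A minimal runnable Python version of the pseudocode above, assuming the window comparison simply counts exact character matches against the stringency threshold; the helper compare_window and the list-of-tuples output are illustrative choices, not the talk's implementation:

def compare_window(a, b, strig):
    # Count exact character matches in two equal-length windows against the stringency.
    return sum(1 for x, y in zip(a, b) if x == y) >= strig

def dotplot(qseq, sseq, win, strig):
    dots = []
    for q in range(len(qseq) - win + 1):
        for s in range(len(sseq) - win + 1):
            if compare_window(qseq[q:q + win], sseq[s:s + win], strig):
                dots.append((q, s))
    return dots

print(dotplot("GGAGTGACC", "AGAGTGACC", win=3, strig=2))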

Dotplot Analysis

function dotplot(qseq, sseq, win, strig):
    for q in qseq:
        for s in sseq:
            if CompareWindow(qseq[q:q+win], sseq[s:s+win], strig):
                AddDot(q, s)

Dotplot is a polynomial-time algorithm that is quadratic when the sequences are the same length.

- m is the number of elements read from q
- n is the number of elements read from s
- win is the number of comparisons for a single dot
- comp is the cost of a single character comparison
- win and comp are much smaller than the sequence lengths

O(m * n * win * comp) ≈ O(nm), or O(n²) if m = n

Memory Hierarchy
Unfortunately, real computers are not idealized compute engines with fast, random access to infinite memory stores…
Clock speed: 3.5 GHz

Memory Hierarchy
Data access occurs over a memory bus that is typically much slower than the processor…
Clock speed: 3.5 GHz; Bus speed: 1.0 GHz

Memory Hierarchy
Data is stored in a cache to decrease access times for commonly used items…
Clock speed: 3.5 GHz; Bus speed: 1.0 GHz; L1 cache access: 6 cycles

Memory Hierarchy
Caches are also cached…
Clock speed: 3.5 GHz; Bus speed: 1.0 GHz; L1 cache access: 6 cycles; L2 cache miss: 350 cycles

Memory Hierarchy
For big problems, multiple computers must communicate over a network…
Clock speed: 3.5 GHz; Bus speed: 1.0 GHz; L1 cache access: 6 cycles; L2 cache miss: 350 cycles; Network: Gigabit Ethernet

Asymptotic Analysis, Revisited
The effects of the memory hierarchy lead to an alternative method of performance characterization:

f(x) = (number of reads + number of writes) / (number of operations)

- f(x) is the ratio of memory operations to compute operations
- If f(x) is small, the algorithm is compute bound and will achieve good performance
- If f(x) is large, the algorithm is memory bound and limited by the memory system
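
As a rough worked example of this ratio (illustrative operation counts, not figures from the talk): a dot product of two length-n vectors does about 2n reads against 2n floating-point operations, so f ≈ 1 and it is memory bound, while an m×m matrix multiply does O(m²) memory operations against 2m³ flops and is compute bound:

def memory_compute_ratio(reads, writes, flops):
    # f = (number of reads + number of writes) / number of operations
    return (reads + writes) / flops

n = 1000000
# Dot product: ~2n reads, 1 write, ~2n flops -> f ~ 1 (memory bound)
print(memory_compute_ratio(reads=2 * n, writes=1, flops=2 * n))

m = 1000
# m x m matrix multiply: ~4m^2 memory operations against 2m^3 flops -> f ~ 2/m (compute bound)
print(memory_compute_ratio(reads=3 * m * m, writes=m * m, flops=2 * m ** 3))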

Dotplot Analysis, Memory Version

function dotplot(qseq, sseq, win, strig):
    for q in qseq:
        for s in sseq:
            if CompareWindow(qseq[q:q+win], sseq[s:s+win], strig):
                AddDot(q, s)

The communication-to-computation ratio depends on the sequence lengths, the size of the DNA alphabet, and the number of matches:

f = ( |Σ_DNA|(n + m) + σnm ) / (nm)

σ is the percentage of dots that are recorded as matches and is the asymptotic limit of the algorithm.

Parallel Scaling
What can we do with our time?
- Strong: fix problem size, vary N procs, measure speedup. Goal: faster execution
- Weak: vary problem size and N procs, fix execution time. Goal: higher resolution, more data
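
A small bookkeeping sketch for strong scaling, using hypothetical timings (not measurements from the talk): speedup is serial time over parallel time, and parallel efficiency is speedup divided by the processor count.

def speedup(t_serial, t_parallel):
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, nprocs):
    return speedup(t_serial, t_parallel) / nprocs

t1 = 120.0  # serial runtime in seconds (hypothetical)
for nprocs, tp in [(2, 62.0), (4, 33.0), (8, 19.0)]:
    print(nprocs, round(speedup(t1, tp), 2), round(efficiency(t1, tp, nprocs), 2))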

Dotplot: Parallel Decomposition
Dotplot is naturally parallel and is easily decomposed into blocks for parallel processing. Each block can be assigned to a different processor, and overlap between blocks prevents gaps by fully computing each possible window. Scaling is strong until the work unit approaches the window size.
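
A minimal sketch of this block decomposition, assuming square blocks extended by win - 1 positions of overlap so that windows starting near a block edge still see all of their characters; the block size and helper name are illustrative:

def blocks(qlen, slen, block, win):
    # Yield (q_start, q_end, s_start, s_end) work units with win - 1 positions of overlap.
    for q0 in range(0, qlen, block):
        for s0 in range(0, slen, block):
            yield (q0, min(q0 + block + win - 1, qlen),
                   s0, min(s0 + block + win - 1, slen))

# Each work unit can be handed to a separate process (e.g. multiprocessing.Pool),
# and the per-block dot lists are merged afterwards.
for q0, q1, s0, s1 in blocks(qlen=10000, slen=10000, block=2500, win=3):
    pass  # dispatch dotplot(qseq[q0:q1], sseq[s0:s1], win, strig) to a worker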

Practice

Application Architecture
Software architecture and implementation decisions have a large impact on the overall performance of an application, regardless of its complexity. Theory matters, and experience matters.
- Algorithm selection (linear search vs. binary search)
- Data structures (arrays vs. linked lists)
- Abstraction penalties (functions, objects, templates… oh my!)
- Data flow (blocking, comp/comm overlap, references/copies)
- Language (scripting vs. compiled vs. mixed)

Abstraction Penalties
Well-designed abstractions lead to more maintainable code. But compilers can't always optimize English: programmers must code to the compiler to achieve high performance.

Matrix class:

A = Matrix();
B = Matrix();
C = Matrix();
// Initialize matrices…
C = A + B * 2.0;

The expression expands to two steps and adds a temporary Matrix instance:

Matrix tmp = B.operator*(2.0);
C = A.operator+(tmp);

Arrays of double:

double A[SIZE];
double B[SIZE];
double C[SIZE];
// Initialize matrices…
for (i = 0; i < SIZE; ++i)
    C[i] = A[i] + B[i] * 2.0;

The loop is unrolled and the operation uses the fused multiply-add instruction to run at near peak performance:

…
fmadd c, a, b, TWO
…

Data Flow
Algorithms can be implemented to take advantage of a processor's memory hierarchy.

Naïve matrix-matrix multiplication:

def gemm(A, B, C):
    for i in range(M):
        for j in range(N):
            for k in range(K):
                C[i,j] += A[i,k] * B[k,j]

Optimized for the PowerPC 970:

def gemm(A, B, C, mc, kc, nc, mr, nr):
    for kk in range(0, K, kc):
        tA[:,:] = A[:, kk:kk+kc]
        for jj in range(0, N, nc):
            tB[:,:] = B[kk:kk+kc, jj:jj+nc]
            for i in range(0, M, mr):
                for j in range(0, nc, nr):
                    for k in range(0, kc):
                        for ai in range(mr):
                            a[ai].load(tA, ai * A_row_stride)
                        for bj in range(nr):
                            b[bj].load(tB, bj * B_col_stride)
                        for ci in range(mr):
                            for cj in range(nr):
                                c[ci][cj].v = fmadd(a[ci], b[cj], c[ci][cj])
                    store(c, C_aux[:, j:j+nr])
            C[jj:jj+nc] += C_aux
    return
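
A simplified, runnable NumPy sketch of the same cache-blocking idea, using a single block size rather than the multi-level blocking above (the block size and function name are illustrative):

import numpy as np

def gemm_blocked(A, B, C, bs=64):
    # Accumulate C += A @ B one bs x bs block at a time so that the working
    # set of each inner update stays small enough to fit in cache.
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and C.shape == (M, N)
    for ii in range(0, M, bs):
        for kk in range(0, K, bs):
            for jj in range(0, N, bs):
                C[ii:ii+bs, jj:jj+bs] += A[ii:ii+bs, kk:kk+bs] @ B[kk:kk+bs, jj:jj+bs]
    return C

A = np.random.rand(512, 512)
B = np.random.rand(512, 512)
C = gemm_blocked(A, B, np.zeros((512, 512)))
print(np.allclose(C, A @ B))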

Language
Any language can be used as a high-performance language.* But it takes a skilled practitioner to properly apply a language to a particular problem.
*(ok, any language plus a few carefully designed native libraries)
Python-based matrix-matrix multiplication versus hand-coded assembly.
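
The footnote is the key point: a scripting language reaches high performance by delegating inner kernels to native libraries. A minimal illustration (mine, not the talk's benchmark), assuming NumPy is linked against an optimized BLAS; absolute numbers will vary by machine:

import time
import numpy as np

n = 1024
A = np.random.rand(n, n)
B = np.random.rand(n, n)

start = time.time()
C = np.dot(A, B)          # executed by a native, vectorized BLAS kernel
elapsed = time.time() - start

# Rough sustained rate: a matrix-matrix multiply performs ~2*n^3 flops.
print(f"{2 * n**3 / elapsed / 1e9:.1f} GFLOP/s")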

Dotplot: Application Architecture
- Algorithm: pre-computed score vectors, running window
- Data structures: indexed arrays for sequences; std::vector for dots (row, col, score)
- Abstractions: inline function for saving results
- Data flow: blocked for parallelism
- Language: C++

System Architecture
Features of the execution environment have a large impact on application performance.
- Processor: execution units (scalar, superscalar), instruction order, cache, multi-core, commodity vs. exotic
- Node: memory bandwidth
- Network: speed, bandwidth

Porting for Performance
- Recompile (cheap): limited by the capabilities of the language and compiler
- Refactor (expensive): can the architecture fundamentally change the performance characteristics? Targeted optimizations (“accelerator”):
  - Algorithm (SIMD, etc.)
  - Data structures
  - System resources (e.g., malloc vs. mmap, GPU vs. CPU)
- Backhoe: complete redesign (high cost, potential for big improvement)

Dotplot: System-specific Optimizations
- SIMD instructions allowed the algorithm to process 16 streams in parallel.
- A memory-mapped file replaced std::vector for saving the results.
- All input and output files were stored on the local system; results were aggregated as a post-processing step.
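
A hedged Python sketch of the memory-mapped output idea (the talk's implementation was C++; the record layout and file name here are invented for illustration): dot records are written straight into an OS-managed mapping instead of growing an in-memory container.

import mmap
import struct

RECORD = struct.Struct("iih")      # (row, col, score): two ints and a short
MAX_DOTS = 1000000

# Pre-size the output file, then map it so dots can be written in place.
with open("dots.bin", "wb") as f:
    f.truncate(RECORD.size * MAX_DOTS)

with open("dots.bin", "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)
    count = 0
    RECORD.pack_into(mm, count * RECORD.size, 10, 42, 7)   # record one dot
    count += 1
    mm.flush()
    mm.close()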

Dotplot: Results

(Chart: runtime comparison of the Base, SIMD 1, SIMD 2, and Thread versions under the Ideal, NFS, NFS Touch, Local, and Local Touch configurations.)

- Base is a direct port of the DOTTER algorithm
- SIMD 1 is the SIMD algorithm using a sparse matrix data structure based on STL vectors
- SIMD 2 is the SIMD algorithm using a binary format and memory-mapped output files
- Thread is the SIMD 2 algorithm on 2 processors

                      Ideal Speedup   Real Speedup   Ideal/Real Throughput
SIMD                  8.3x            9.7x           75%
Thread                15x             18.1x          77%
Thread (large data)

Application “Clock”
Application performance can be limited by a number of different factors. Identifying the proper clock is essential for efficient use of development resources.
- System clock (GHz): compute bound. Example: matrix-matrix multiplication
- Memory bus (MHz): memory bound. Examples: data clustering, GPU rendering, Dotplot
- Network (Kb/s to Mb/s): network bound. Examples: parallel simulations, Web applications
- User (FPS): user bound. Examples: data entry, small-scale visualization

Conclusions
- Theory is simple and useful
- Practice is messy: many factors and decisions affect final performance
  - Application architecture
  - System implementation
  - Programmer skill
  - Programmer time

Thank You! Questions?