1 “How Can We Address the Needs and Solve the Problems in HPC Benchmarking?”
Jack Dongarra
Innovative Computing Laboratory, University of Tennessee
http://www.cs.utk.edu/~dongarra/
Workshop on the Performance Characterization of Algorithms

2 LINPACK Benchmark
- Accidental benchmarking: designed to help users extrapolate execution time for the LINPACK software
- First benchmark report dates from 1979
- "My iPAQ running the benchmark in Java comes in here today."

3 Accidental Benchmarking
- Portable: runs on any system
- Easy to understand
- Content changed over time: n = 100, 300, 1000, and as large as possible (Top500)
- Allows restructuring of the algorithm
- Performance data reported with the same arithmetic precision
- Checks whether the "correct solution" was achieved (a scaled residual test; sketched below)
- Not intended to measure entire machine performance; the benchmark report itself warns: "One further note: The following performance data should not be taken too seriously."
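The correctness check is a scaled residual of the computed solution. The C sketch below shows the usual form of such a test; the function name and the O(1) acceptance interpretation are illustrative assumptions, not the benchmark's exact code.

    #include <float.h>
    #include <math.h>

    /* Scaled residual ||A*x - b|| / (||A|| * ||x|| * n * eps), using
       infinity norms; a run is accepted when this quantity is O(1). */
    double scaled_residual(const double *A, const double *x,
                           const double *b, int n)
    {
        double norm_A = 0.0, norm_x = 0.0, norm_r = 0.0;
        for (int i = 0; i < n; i++) {
            double row = 0.0, ri = -b[i];
            for (int j = 0; j < n; j++) {
                row += fabs(A[i * n + j]);
                ri  += A[i * n + j] * x[j];      /* r_i = (A*x - b)_i */
            }
            if (row > norm_A)        norm_A = row;
            if (fabs(ri) > norm_r)   norm_r = fabs(ri);
            if (fabs(x[i]) > norm_x) norm_x = fabs(x[i]);
        }
        return norm_r / (norm_A * norm_x * n * DBL_EPSILON);
    }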

4 LINPACK Benchmark
- Historical data: for n = 100, the same software for the last 22 years
- Unbiased reporting
- Software and results freely available worldwide
- One should be able to achieve high performance on this problem; if not…
- Compiler test at n = 100; heavily hand-optimized at TPP (modified ScaLAPACK implementation)
- Scalable benchmark, in both problem size and parallelism
- Pressure on vendors to optimize my software and provide a set of kernels that benefit others
- Run rules are very important
- Today, n = 0.5×10^6 at 7.2 TFlop/s requires 3.3 hours
- On a Petaflops machine, n = 5×10^6 will require about 1 day (the arithmetic is checked below)
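These run times follow from the roughly (2/3)n^3 floating-point operations of the underlying LU factorization. A quick back-of-the-envelope check in C (the estimate ignores the lower-order 2n^2 term and assumes the quoted sustained rates):

    #include <stdio.h>

    /* Time-to-solution estimate: flops(n) ~ (2/3) n^3. */
    static double hours(double n, double rate_flops_per_s)
    {
        return (2.0 / 3.0) * n * n * n / rate_flops_per_s / 3600.0;
    }

    int main(void)
    {
        /* n = 0.5e6 at 7.2 TFlop/s -> ~3.2 hours, close to the 3.3 quoted */
        printf("%.1f hours\n", hours(0.5e6, 7.2e12));
        /* n = 5e6 at 1 PFlop/s -> ~23 hours, i.e. about a day */
        printf("%.1f hours\n", hours(5.0e6, 1.0e15));
        return 0;
    }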

5 Benchmark
- Machine signatures
- Algorithm characteristics
- Make improvements in applications
- Users are looking for performance portability
- Many of the things we do are specific to one system's parameters
- Need a way to understand, and rapidly develop, software that has a chance at high performance

6 Self-Adapting Numerical Software (SANS)
- Today's processors can achieve high performance, but this requires extensive machine-specific hand tuning
- Even simple operations like matrix-vector products require many person-hours per platform; software lags far behind hardware introduction, and the work is only done where there is a financial incentive
- Compilers are not up to the optimization challenge
- Hardware, compilers, and software have a large design space with many parameters: blocking sizes, loop-nesting permutations, loop-unrolling depths, software-pipelining strategies, register allocations, and instruction schedules (a kernel exposing one such parameter is sketched below)
- Complicated interactions with the increasingly sophisticated micro-architectures of new microprocessors
- Need for quick/dynamic deployment of optimized routines
- ATLAS: Automatically Tuned Linear Algebra Software
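To make the design space concrete, here is a hypothetical cache-blocked matrix-multiply kernel in C with the blocking factor exposed as a tunable parameter. It is a sketch of the kind of code such systems generate, not ATLAS's actual output; NB = 64 is an arbitrary starting point for the search.

    #define NB 64   /* blocking factor: one of the searched parameters */

    /* Cache-blocked C += A*B on n x n row-major matrices.  A generator
       would emit many variants of this loop nest (different NB, loop
       order, unroll depth) and keep the fastest measured one. */
    void gemm_blocked(int n, const double *A, const double *B, double *C)
    {
        for (int ii = 0; ii < n; ii += NB)
          for (int jj = 0; jj < n; jj += NB)
            for (int kk = 0; kk < n; kk += NB)
              for (int i = ii; i < ii + NB && i < n; i++)
                for (int j = jj; j < jj + NB && j < n; j++) {
                    double s = C[i * n + j];         /* kept in a register */
                    for (int k = kk; k < kk + NB && k < n; k++)
                        s += A[i * n + k] * B[k * n + j];
                    C[i * n + j] = s;
                }
    }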

7 Software Generation Strategy - BLAS
- "New" model of high-performance programming: critical code is machine-generated using parameter optimization
- Designed for RISC and superscalar architectures; needs only a reasonable C compiler
- Parameter study of the hardware: generate multiple versions of the code with different values of the key performance parameters, run and measure the performance of each version, then pick the best and generate the library (a search skeleton is sketched below)
- Takes about 20 minutes to run
- The Level 1 cache multiply optimizes for: TLB access, L1 cache reuse, FP unit usage, memory fetch, register reuse, and loop-overhead minimization
- Today ATLAS is in use by Matlab, Mathematica, Octave, Maple, Debian, Scyld Beowulf, SuSE, ...
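The generate-measure-pick loop itself is straightforward. A minimal sketch, assuming a hypothetical compile_and_time() harness that builds a kernel variant with a given blocking factor, runs it, and returns its measured Mflop/s:

    /* Empirical search skeleton: try candidate parameter values,
       time each generated variant, keep the best. */
    double compile_and_time(int nb);   /* hypothetical: build the variant
                                          with blocking nb, run it, return
                                          measured Mflop/s */

    int pick_best_blocking(void)
    {
        int candidates[] = { 16, 24, 32, 40, 48, 64, 80 };
        int n = sizeof candidates / sizeof candidates[0];
        int best_nb = candidates[0];
        double best_rate = 0.0;

        for (int i = 0; i < n; i++) {
            double rate = compile_and_time(candidates[i]);
            if (rate > best_rate) {
                best_rate = rate;
                best_nb = candidates[i];
            }
        }
        return best_nb;   /* the winner is baked into the installed library */
    }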

8 ATLAS (DGEMM, n = 500)
- ATLAS is faster than all other portable BLAS implementations, and is comparable with the machine-specific libraries provided by vendors.

9 Related Tuning Projects
- PHiPAC: Portable High-Performance ANSI C; the initial automatic GEMM-generation project
- FFTW: the Fastest Fourier Transform in the West
- UHFFT: tuning parallel FFT algorithms
- SPIRAL: Signal Processing Algorithms Implementation Research for Adaptable Libraries; maps DSP algorithms to architectures
- Sparsity: sparse matrix-vector and sparse matrix-matrix multiplication; tunes code to the sparsity structure of the matrix (more later in this tutorial)

10 Experiments with C, Fortran, and Java for ATLAS (DGEMM kernel)

11 Machine-Assisted Application Development and Adaptation
- Communication libraries: optimize for the specifics of one's configuration
- Algorithm layout and implementation: look at the different ways to express an implementation

12 Work in Progress: ATLAS-like Approach Applied to Broadcast (PII 8-way cluster with 100 Mb/s switched network)

    Message size (bytes)   Optimal algorithm   Buffer size (bytes)
    8                      binomial            8
    16                     binomial            -
    32                     binary              -
    64                     binomial            -
    128                    binomial            -
    256                    binomial            -
    512                    binomial            512
    1K                     sequential          1K
    2K                     binary              2K
    4K                     binary              2K
    8K                     binary              2K
    16K                    binary              4K
    32K                    binary              4K
    64K                    ring                4K
    128K                   ring                4K
    256K                   ring                4K
    512K                   ring                4K
    1M                     binary              4K

(Diagram: the four broadcast topologies from the root node: sequential, binary, binomial, ring.)
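Once such a table has been measured, using it is a matter of a size-keyed dispatch. A hedged sketch in C with MPI: the thresholds below are a simplification of the measured table (which is not perfectly monotone), and the bcast_* topology implementations are hypothetical.

    #include <mpi.h>

    /* Hypothetical tuned implementations of the broadcast topologies,
       each with an MPI_Bcast-like interface. */
    void bcast_binomial(void *buf, int count, MPI_Datatype t, int root, MPI_Comm c);
    void bcast_binary  (void *buf, int count, MPI_Datatype t, int root, MPI_Comm c);
    void bcast_ring    (void *buf, int count, MPI_Datatype t, int root, MPI_Comm c);

    /* Dispatch on message size, roughly following the table above. */
    void tuned_bcast(void *buf, int bytes, int root, MPI_Comm comm)
    {
        if (bytes <= 512)
            bcast_binomial(buf, bytes, MPI_BYTE, root, comm);   /* small  */
        else if (bytes <= 32 * 1024)
            bcast_binary(buf, bytes, MPI_BYTE, root, comm);     /* medium */
        else if (bytes <= 512 * 1024)
            bcast_ring(buf, bytes, MPI_BYTE, root, comm);       /* large  */
        else
            bcast_binary(buf, bytes, MPI_BYTE, root, comm);     /* 1M row */
    }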

13 Conjugate Gradient Variants by Dynamic Selection at Run Time
- Variants combine inner products to reduce the communication bottleneck, at the expense of more scalar operations
- Same number of iterations, so no advantage on a sequential processor
- With a large number of processors and a high-latency network, they may be advantageous
- Improvements can range from 15% to 50%, depending on size

14 Conjugate Gradient Variants by Dynamic Selection at Run Time
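The communication saving comes from fusing the iteration's inner products into a single reduction; the algebraic rearrangement that makes both products available at the same time is what costs the extra scalar operations mentioned above. A hedged MPI sketch of the mechanism, not of any specific variant from the talk:

    #include <mpi.h>

    /* Standard CG pays one latency-bound collective per inner product.
       A rearranged variant computes both local partial sums first and
       fuses them into a single MPI_Allreduce, halving the number of
       synchronizations per iteration. */
    void fused_inner_products(const double *r, const double *p,
                              const double *q /* = A*p */, int nlocal,
                              double out[2], MPI_Comm comm)
    {
        double local[2] = { 0.0, 0.0 };
        for (int i = 0; i < nlocal; i++) {
            local[0] += r[i] * r[i];   /* (r, r)   */
            local[1] += p[i] * q[i];   /* (p, A*p) */
        }
        MPI_Allreduce(local, out, 2, MPI_DOUBLE, MPI_SUM, comm);
    }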

15 Reformulating/Rearranging/Reuse
- Example: the reduction to narrow band form for the SVD
- Fetch each entry of A once
- Restructure and combine operations
- Results in a speedup of more than 30%
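The flavor of the restructuring: two operations that would each sweep over A, such as y = A*x and z = A^T*w, can be fused into a single pass that fetches each entry of A once. A minimal illustrative sketch in C, not the actual band-reduction kernel:

    /* Fused pass: y = A*x and z = A'*w with one read of each A[i][j],
       instead of two separate matrix-vector products over A. */
    void fused_gemv(int n, const double *A, const double *x,
                    const double *w, double *y, double *z)
    {
        for (int j = 0; j < n; j++)
            z[j] = 0.0;
        for (int i = 0; i < n; i++) {
            double wi = w[i], yi = 0.0;
            for (int j = 0; j < n; j++) {
                double a = A[i * n + j];   /* fetched once */
                yi   += a * x[j];          /* contributes to y = A*x  */
                z[j] += a * wi;            /* contributes to z = A'*w */
            }
            y[i] = yi;
        }
    }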

16 Tools for Performance Evaluation
- Timing and performance evaluation has been an art: clock resolution, cache effects, and differences between systems all get in the way
- Can be cumbersome and inefficient with traditional tools
- The situation is about to change: today's processors have internal counters
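A minimal example of the care this requires: repeat the kernel until the elapsed time comfortably exceeds the clock's resolution, and make a warm-up call so the cache state is controlled. The sketch uses POSIX clock_gettime; the kernel() being timed is a stand-in.

    #include <time.h>

    static volatile double sink;

    static void kernel(void)           /* stand-in for the code under test */
    {
        double s = 0.0;
        for (int i = 0; i < 100000; i++)
            s += i * 0.5;
        sink = s;
    }

    double seconds_per_call(void)
    {
        struct timespec t0, t1;
        long reps = 1;
        double elapsed;

        kernel();                      /* warm-up: load caches and TLB */
        do {
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (long i = 0; i < reps; i++)
                kernel();
            clock_gettime(CLOCK_MONOTONIC, &t1);
            elapsed = (t1.tv_sec - t0.tv_sec)
                    + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
            reps *= 2;                 /* grow until well above clock resolution */
        } while (elapsed < 0.1);
        return elapsed / (reps / 2);
    }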

17 Performance Counters
- Almost all high-performance processors include hardware performance counters
- Some are easy to access; others are not available to users
- On most platforms the APIs, where they exist, are poorly documented and not appropriate for the end user
- Existing performance counter APIs: Compaq Alpha EV6 & 6/7, SGI MIPS R10000, IBM Power series, Cray T3E, Sun Solaris, Pentium Linux and Windows, IA-64, HP PA-RISC, Hitachi, Fujitsu, NEC
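A portable layer over these vendor APIs is exactly what PAPI, developed at the University of Tennessee, provides; the slide does not name it, so take this as an illustrative aside rather than part of the talk. A minimal PAPI usage sketch in C:

    #include <stdio.h>
    #include <papi.h>

    static volatile double sink;

    static void kernel(void)           /* stand-in for the code under test */
    {
        double s = 0.0;
        for (int i = 0; i < 1000000; i++)
            s += i * 0.5;
        sink = s;
    }

    int main(void)
    {
        int es = PAPI_NULL;
        long long counts[2];

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            return 1;
        PAPI_create_eventset(&es);
        PAPI_add_event(es, PAPI_TOT_CYC);   /* total cycles */
        PAPI_add_event(es, PAPI_FP_OPS);    /* floating-point operations */

        PAPI_start(es);
        kernel();
        PAPI_stop(es, counts);

        printf("cycles = %lld, flops = %lld\n", counts[0], counts[1]);
        return 0;
    }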

18 Directions
- Need tools that let us examine performance and identify problems
- Should be simple to use, perhaps even automatic
- Machine-assisted optimization of key components: think of it as a higher-level compiler, done via experimentation