Presentation transcript:

High Performance Computing: The GotoBLAS Library

HPC: numerical libraries

- Many numerically intensive applications make use of specialty libraries to perform common operations:
  - Linear algebra operators (e.g., dot products, matrix-vector multiplies)
  - Fast Fourier transforms
  - Linear solvers
- To maximize application performance (and throughput), we want these libraries to be highly optimized for each computer architecture.
- One commonly used numerical library is the BLAS (Basic Linear Algebra Subprograms):
  - Contains routines that provide standard building blocks for basic vector and matrix operations (see the calling sketch after this list)
  - Commonly used in scientific and engineering software and in graphics processing
  - "High-profile" because it is used by the Linpack benchmark, which ranks the fastest supercomputers in the world (Top 500 list)
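To make the idea of BLAS building blocks concrete, the minimal sketch below calls a dot product (ddot) and a matrix-vector multiply (dgemv) from C through the common Fortran-style interface. The trailing-underscore symbol names and by-reference argument passing are the usual convention, but they are assumptions that depend on the compiler and BLAS build being used.

```c
/* Minimal sketch: calling two BLAS building blocks from C through the
 * common Fortran interface (trailing-underscore symbols, arguments passed
 * by reference).  The exact symbol decoration is an assumption and can
 * differ between compilers and BLAS builds. */
#include <stdio.h>

extern double ddot_(const int *n, const double *x, const int *incx,
                    const double *y, const int *incy);
extern void dgemv_(const char *trans, const int *m, const int *n,
                   const double *alpha, const double *a, const int *lda,
                   const double *x, const int *incx,
                   const double *beta, double *y, const int *incy);

int main(void)
{
    int n = 3, inc = 1;
    double x[3] = {1.0, 2.0, 3.0};
    double y[3] = {4.0, 5.0, 6.0};

    /* Dot product: x . y = 1*4 + 2*5 + 3*6 = 32 */
    double dot = ddot_(&n, x, &inc, y, &inc);

    /* Matrix-vector multiply: y := 1.0*A*x + 0.0*y, with A = diag(1,2,3)
     * stored column-major, as the Fortran interface expects. */
    double A[9] = {1.0, 0.0, 0.0,   /* column 1 */
                   0.0, 2.0, 0.0,   /* column 2 */
                   0.0, 0.0, 3.0};  /* column 3 */
    double alpha = 1.0, beta = 0.0;
    dgemv_("N", &n, &n, &alpha, A, &n, x, &inc, &beta, y, &inc);

    printf("dot = %g, y = (%g, %g, %g)\n", dot, y[0], y[1], y[2]);
    return 0;
}
```

The same source links unchanged against GotoBLAS, a vendor BLAS, or the reference BLAS; only the library supplied at link time changes, which is why a single highly tuned implementation benefits many applications at once.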

HPC: GotoBLAS  GotoBLAS is an implementation of the BLAS library developed by TACC researcher Kazushige Goto.  Kazushige has been called “the Michael Jordan of high- performance linear algebra kernels.”  Software is designed for all common chipset architectures, including: Power 4, Power 5 Opteron Blue Gene/L Pentium 4/Xeon (32-bit and 64-bit) Itanium 2

HPC: GotoBLAS  Most vendors provide their own BLAS implementation: Significant development overhead incurred for new architectures Large code base with many switching branches based on input sizing  Kazushige’s approach uses a simplified model No major context switching Functions separated based on performance impact  Non-performance bits written in C  Crucial performance kernels written in assembly  GotoBLAS tries to minimize assembler codes Actual assembler code is really small Easy to improve and debug  Benefit: It takes only 3 to 7 days to develop a tuned BLAS for a new architecture

GotoBLAS DGEMM performance

  Architecture   Efficiency
  Itanium 2      98.9%
  PPC440 FP2     98.2%
  Alpha
  POWER5         96.2%
  Pentium 4      95.7%
  Opteron        92.8%
  PPC970MP       92.0%
  SPARC IV       92.0%

Efficiency indicates the ratio of observed performance to the maximum theoretical value (a worked example of this calculation follows). DGEMM is one of the most widely used BLAS functions; it performs matrix-matrix multiplies.
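For concreteness, an efficiency figure like those above comes from dividing the observed DGEMM rate (roughly 2*m*n*k floating-point operations over the measured run time) by the processor's theoretical peak rate. The sketch below walks through that arithmetic; the run time, clock frequency, and flops-per-cycle values are placeholder assumptions to be replaced with measured and documented figures for the machine at hand.

```c
/* Minimal sketch of how a DGEMM efficiency figure is derived.
 * The run time, clock rate, and flops/cycle below are placeholder
 * assumptions; substitute the values for the processor being measured. */
#include <stdio.h>

int main(void)
{
    /* Problem size and measured wall-clock time (example numbers). */
    double m = 2000, n = 2000, k = 2000;
    double seconds = 2.2;                 /* assumed measured run time */

    /* Theoretical peak: clock (GHz) * flops completed per cycle. */
    double ghz = 1.9;                     /* assumption, e.g. a 1.9 GHz part */
    double flops_per_cycle = 4.0;         /* assumption; depends on the FPU design */
    double peak_mflops = ghz * flops_per_cycle * 1000.0;

    /* DGEMM performs roughly 2*m*n*k floating-point operations. */
    double mflops = 2.0 * m * n * k / seconds / 1.0e6;

    printf("observed: %.0f MFlops, peak: %.0f MFlops, efficiency: %.1f%%\n",
           mflops, peak_mflops, 100.0 * mflops / peak_mflops);
    return 0;
}
```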

Example GotoBLAS comparisons

[Chart: DGEMM on a 1.9 GHz POWER5, plotting MFlops versus matrix size for GOTO, ESSL, and ATLAS]

HPC: GotoBLAS  In April 2006, TACC released the latest version of GotoBLAS: Free to use for academic and research purposes Supports a wide range of Fortran compiler interfaces Available to commercial users through UT’s Office of Technology Commercialization  Source code for the library is now available.  Redistribution rights are also available.

Thanks for your time!
Karl W. Schulz, Kazushige Goto