Statistical Modeling of Feedback Data in an Automatic Tuning System
Richard Vuduc, James Demmel (U.C. Berkeley, EECS)
Jeff Bilmes (Univ. of Washington, EE)
December 10, 2000
Workshop on Feedback-Directed Dynamic Optimization

Context: High Performance Libraries
• Libraries can isolate performance issues
  – BLAS/LAPACK/ScaLAPACK (linear algebra)
  – VSIPL (signal and image processing)
  – MPI (distributed parallel communications)
• Can we implement libraries...
  – automatically and portably
  – incorporating machine-dependent features
  – matching the performance of hand-tuned implementations
  – leveraging compiler technology
  – using domain-specific knowledge

Generate and Search: An Automatic Tuning Methodology
• Given a library routine
• Write parameterized code generators
  – parameters:
    · machine (e.g., registers, cache, pipeline)
    · input (e.g., problem size)
    · problem-specific transformations
  – output: high-level source (e.g., C code)
• Search the parameter space
  – generate an implementation
  – compile using the native compiler
  – measure performance (feedback)
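
A minimal sketch of this generate-compile-measure loop, in Python and not from the talk: here `measure` merely simulates the feedback step, standing in for generating C source for a given register tile, compiling it with the native compiler, and timing the binary.

    import itertools
    import random

    def measure(m0, k0, n0):
        """Stand-in for: generate C source with an m0 x k0 x n0 register
        tile, compile with the native compiler, run, report Mflop/s.
        Here we just simulate a noisy performance surface."""
        peak = 500.0
        penalty = abs(m0 * k0 * n0 - 32) / 32.0  # pretend 32 registers is ideal
        return peak / (1.0 + penalty) * random.uniform(0.9, 1.0)

    best_perf, best_tile = 0.0, None
    for m0, k0, n0 in itertools.product(range(1, 9), repeat=3):
        perf = measure(m0, k0, n0)  # generate -> compile -> run (feedback)
        if perf > best_perf:
            best_perf, best_tile = perf, (m0, k0, n0)
    print(f"best register tile {best_tile}: {best_perf:.0f} Mflop/s")

In a real tuner each loop iteration is a full compile and benchmark run, which is what motivates the early-stopping question taken up below.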

Tuning System Examples
• Linear algebra
  – PHiPAC (Bilmes, et al., 1997)
  – ATLAS (Whaley and Dongarra, 1998)
  – Sparsity (Im and Yelick, 1999)
• Signal processing
  – FFTW (Frigo and Johnson, 1998)
  – SPIRAL (Moura, et al., 2000)
• Parallel communications
  – Automatically tuned MPI collective operations (Vadhiyar, et al., 2000)
• Related: iterative compilation (Bodin, et al., 1998)

Road Map
• Context
• The Search Problem
• Problem 1: Stopping searches early
• Problem 2: High-level run-time selection
• Summary

The Search Problem in PHiPAC
• PHiPAC (Bilmes, et al., 1997)
  – produces dense matrix multiply (matmul) implementations
  – generator parameters include:
    · size and depth of the fully unrolled core matmul
    · rectangular, multi-level cache tile sizes
    · 6 flavors of software pipelining
    · scaling constants, transpose options, precisions, etc.
• An experiment
  – fix the software pipelining method
  – vary register tile sizes
  – 500 to 2500 reasonable implementations on 6 platforms

A Needle in a Haystack, Part I

Road Map
• Context
• The Search Problem
• Problem 1: Stopping searches early
• Problem 2: High-level run-time selection
• Summary

Problem 1: Stopping Searches Early
• Assume
  – dedicated resources are limited
  – a near-optimal implementation is okay
• Recall the search procedure
  – generate implementations at random
  – measure performance
• Can we stop the search early?
  – how early is early?
  – guarantees on quality?

An Early Stopping Criterion
• Performance scaled from 0 (worst) to 1 (best)
• Goal: stop after t implementations when Prob[ M_t <= 1 - ε ] < α
  – M_t: maximum observed performance after t implementations
  – ε: proximity to best
  – α: error tolerance
  – example: find within the top 5% with error 10%: ε = .05, α = .1
• Can show this probability depends only on F(x) = Prob[ performance <= x ]
• Idea: estimate F(x) using the observed samples
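
A hedged sketch of how such a criterion can be implemented, assuming implementations are sampled i.i.d. (so Prob[ M_t <= 1 - ε ] = F(1 - ε)^t) and estimating F by the empirical distribution of the samples seen so far; the `measure` callback is a hypothetical benchmarking hook returning performance scaled to [0, 1].

    def search_with_early_stopping(measure, eps=0.05, alpha=0.10,
                                   max_trials=2000):
        """Sample implementations at random; stop once the estimated
        probability that the best-so-far is NOT within eps of the best
        falls below alpha.  measure() returns performance in [0, 1]."""
        samples = []
        best = 0.0
        for t in range(1, max_trials + 1):
            perf = measure()          # benchmark one random implementation
            samples.append(perf)
            best = max(best, perf)
            # Empirical CDF: F_hat(x) = fraction of samples <= x.
            f_hat = sum(1 for s in samples if s <= 1.0 - eps) / t
            # Under i.i.d. sampling, Prob[ M_t <= 1 - eps ] = F(1 - eps)^t.
            # Require a few samples before trusting the plug-in estimate.
            if t >= 10 and f_hat ** t < alpha:
                return best, t        # stop early
        return best, max_trials

With eps = .05 and alpha = .1, the loop halts as soon as it is at least 90% confident that the best implementation seen so far is within the top 5%.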

Stopping Time (300 MHz Pentium-II)

Stopping Time (Cray T3E Node)

Road Map
• Context
• The Search Problem
• Problem 1: Stopping searches early
• Problem 2: High-level run-time selection
• Summary

Problem 2: Run-Time Selection
• Assume
  – one implementation is not best for all inputs
  – a few good implementations are known
  – we can benchmark them
• How do we choose the best implementation at run-time?
• Example: matrix multiply C = C + A*B, with A of size M x K and B of size K x N, tuned for small (L1), medium (L2), and large workloads

Truth Map (Sun Ultra-I/170)

A Formal Framework
• Given
  – m implementations
  – n sample inputs (training set)
  – execution times
• Find
  – a decision function f(s)
  – f(s) returns the best implementation on input s
  – f(s) is cheap to evaluate
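
Concretely, the training data can be pictured as below. This is an illustrative sketch with synthetic numbers, not the talk's data: in practice `times` would hold measured run-times of each implementation on each sample input.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical training set: n sample inputs with d features each
    # (e.g., matmul operand sizes M, K, N) and m candidate implementations.
    n, d, m = 200, 3, 3
    S = rng.integers(1, 801, size=(n, d)).astype(float)  # inputs s_1..s_n
    times = rng.random((n, m))  # times[i, a] = run-time of impl. a on s_i

    # The "truth" that a decision function f(s) must approximate: the
    # index of the fastest implementation on each sample.  f must be
    # cheap: it cannot benchmark all m implementations at run-time.
    best = times.argmin(axis=1)

The three methods below differ in how they turn (S, times) into such an f.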

Solution Techniques (Overview)
• Method 1: Cost Minimization
  – minimize overall execution time on the samples (boundary modeling)
  – pro: intuitive, f(s) cheap
  – con: ad hoc, geometric assumptions
• Method 2: Regression (Brewer, 1995)
  – model the run-time of each implementation, e.g., T_a(N) = b3*N^3 + b2*N^2 + b1*N + b0
  – pro: simple, standard
  – con: user must define the model
• Method 3: Support Vector Machines
  – statistical classification
  – pro: solid theory, many successful applications
  – con: heavy training and prediction machinery

Results 1: Cost Minimization

Results 2: Regression

Results 3: Classification

Quantitative Comparison
• Note: cost of regression and cost-minimization prediction ~ O(3x3 matmul); cost of SVM prediction ~ O(32x32 matmul)

Road Map
• Context
• The Search Problem
• Problem 1: Stopping searches early
• Problem 2: High-level run-time selection
• Summary

Conclusions
• Search is beneficial
• Early stopping
  – simple (random search plus a little statistics)
  – informative criteria
• High-level run-time selection
  – formal framework
  – error metrics
• To do
  – other stopping models (cost-based)
  – explore the large design space for run-time selection

Extra Slides
More detail (time and/or questions permitting)

PHiPAC Performance (Pentium-II)

PHiPAC Performance (Ultra-I/170)

PHiPAC Performance (IBM RS/6000)

PHiPAC Performance (MIPS R10K)

A Needle in a Haystack, Part II

Performance Distribution (IBM RS/6000)

Performance Distribution (Pentium II)

Performance Distribution (Cray T3E Node)

Performance Distribution (Sun Ultra-I)

Cost Minimization
• Decision function
• Minimize overall execution time on the samples
• Softmax weight (boundary) functions
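
The equations on this slide were figures; what follows is a hedged Python reconstruction of the idea, not the talk's exact formulation. One weight vector per implementation defines linear boundaries; a softmax over the scores acts as a differentiable selection, so gradient descent can directly minimize the expected total execution time over the training samples (S and times as in the framework sketch above; features are assumed pre-scaled to roughly unit magnitude).

    import numpy as np

    def train_cost_min(S, times, steps=2000, lr=0.1):
        """S: n x d sample inputs; times: n x m measured run-times.
        Learn W (m x d) so that argmax_a W[a] @ s picks a fast
        implementation.  Assumes features in S are roughly unit scale."""
        n, d = S.shape
        m = times.shape[1]
        W = np.zeros((m, d))
        for _ in range(steps):
            Z = S @ W.T                        # n x m scores
            Z -= Z.max(axis=1, keepdims=True)  # numerical stability
            P = np.exp(Z)
            P /= P.sum(axis=1, keepdims=True)  # softmax selection probabilities
            # Objective: sum_i sum_a P[i, a] * times[i, a] (expected time).
            ET = (P * times).sum(axis=1, keepdims=True)
            G = P * (times - ET)               # d(objective)/d(scores)
            W -= lr * (G.T @ S) / n
        return W

    def f(W, s):
        """The learned decision function: cheap, just m dot products."""
        return int(np.argmax(W @ s))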

Regression
• Decision function
• Model each implementation's running time, e.g., for square matmul of dimension N: T_a(N) = b3*N^3 + b2*N^2 + b1*N + b0
• For general matmul with operand sizes (M, K, N), generalize the above to include all product terms
  – MKN, MK, KN, MN, M, K, N
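
A sketch of the regression approach (the least-squares fitting is an assumption; the slides give only the model form). Each implementation gets its own coefficient vector over the product-term features, and the decision function predicts every T_a and takes the argmin:

    import numpy as np

    def features(M, K, N):
        # All product terms of the generalized cubic model, plus a constant.
        return [M*K*N, M*K, K*N, M*N, M, K, N, 1.0]

    def fit_runtime_model(sizes, measured_times):
        """Least-squares fit for one implementation.
        sizes: list of (M, K, N) tuples; measured_times: seconds per size."""
        X = np.array([features(*s) for s in sizes])
        b, *_ = np.linalg.lstsq(X, np.array(measured_times), rcond=None)
        return b  # coefficients (b_MKN, b_MK, ..., b_0)

    def f(models, M, K, N):
        """Decision function: predict each T_a, return the argmin index."""
        x = np.array(features(M, K, N))
        return min(range(len(models)), key=lambda a: models[a] @ x)

Prediction costs one small dot product per implementation, consistent with the ~O(3x3 matmul) figure quoted in the comparison slide.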

Support Vector Machines
• Decision function
• Binary classifier
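
In the same hedged spirit, a sketch using a modern library (scikit-learn, which postdates this talk) rather than the paper's own SVM machinery: label each training input with its fastest implementation and train a classifier; SVC reduces the multi-class problem to binary classifiers internally.

    import numpy as np
    from sklearn.svm import SVC

    def train_svm_selector(S, times):
        """S: n x d sample inputs; times: n x m measured run-times."""
        labels = times.argmin(axis=1)   # fastest implementation per sample
        clf = SVC(kernel="rbf", gamma="scale")
        clf.fit(S, labels)
        return clf

    def f(clf, s):
        """Decision function: predict the best implementation for input s."""
        return int(clf.predict(np.asarray(s, dtype=float).reshape(1, -1))[0])

Prediction evaluates the kernel against the support vectors, which is the "heavy machinery" cost noted in the overview (~O(32x32 matmul)).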

Proximity to Best (300 MHz Pentium-II)

Proximity to Best (Cray T3E Node)