Performance Models for Application Optimization


Performance Models for Application Optimization. Walid Abu-Sufah (abusufah@illinois.edu), Visiting Scholar, University of Illinois; Associate Professor, University of Jordan

Outline: 1. Objective; 2. Overview (2.1 Roofline model, 2.2 Capacity model); 3. Relating roofline/capacity; 4. Open issues; 5. Discussion: How could PMUs help? www.upcrc.illinois.edu

1. Objective: Explore how a model of a target architecture could be used for application tuning (maybe in a compiler?). All approaches, including autotuning, have involved running experiments. Library guys may be willing to do that, but compiler guys don't want to make a compiler that 'searches'; 99% of developers wouldn't use it.

2.1 Roofline Model: For applications where off-chip memory bandwidth is the constraining resource (limit) on system performance. Relates processor performance to off-chip memory traffic. A bound-and-bottleneck model: good enough to understand which optimizations to try in order to reach the next level of performance. So far demonstrated for several HPC dwarfs and multicore systems.

Bounds: peak processing bandwidth (MFLOP/sec) and peak DRAM bandwidth (Mbytes/sec). “Operational Intensity”: average number of floating-point operations per byte of DRAM traffic (FLOPs/Byte). It varies by multicore design (cache organization) and by dwarf, so characterize each dwarf for a particular multicore design.

Performance Model Graph: Y-axis is GFLOPs/sec; X-axis is FLOPs/Byte (i.e., Operational Intensity). Peak DRAM BW can be plotted on the same graph, since (GFLOPs/sec) / (FLOPs/Byte) = GBytes/sec. The resulting bound is the “Roofline”.

Roofline Visual Performance Model: the “Ridge Point” is the minimum Operational Intensity needed to get peak performance. To the left of the ridge point a kernel is memory bound; to the right of it, compute bound.
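The roofline bound and ridge point described above can be sketched in a few lines. This is a minimal illustration, not part of the original slides: the function names and the two peak values are hypothetical placeholders, not measurements of any real machine.

```python
def attainable_gflops(peak_gflops, peak_dram_gbs, operational_intensity):
    """Roofline bound: the lower of the compute peak and
    (peak DRAM bandwidth x operational intensity)."""
    return min(peak_gflops, peak_dram_gbs * operational_intensity)


def ridge_point(peak_gflops, peak_dram_gbs):
    """Minimum operational intensity (FLOPs/Byte) at which the
    compute peak becomes reachable."""
    return peak_gflops / peak_dram_gbs


# Hypothetical machine, chosen only to make the arithmetic easy to follow.
PEAK_GFLOPS = 16.0    # assumed peak processing bandwidth
PEAK_DRAM_GBS = 8.0   # assumed peak DRAM bandwidth

ridge = ridge_point(PEAK_GFLOPS, PEAK_DRAM_GBS)
for oi in (0.5, 2.0, 8.0):
    bound = attainable_gflops(PEAK_GFLOPS, PEAK_DRAM_GBS, oi)
    regime = "memory bound" if oi < ridge else "compute bound"
    print(f"OI={oi} FLOPs/Byte: bound={bound} GFLOPs/sec ({regime})")
```

A kernel whose operational intensity lies below the ridge point cannot reach the compute peak no matter how well its arithmetic is tuned, which is why the model points such kernels toward memory optimizations first.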

Roofline model for AMD Opteron X2

Roofline model for Opteron X2 vs. Opteron X4

Roofline model with ceilings for Opteron X2

Roofline model with ceilings for Opteron X2

Roofline model with ceilings for Opteron X2
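The figures on these slides are not preserved in the transcript. As a rough sketch of what "ceilings" mean in the Roofline model (lower rooflines that a kernel cannot break through until the corresponding optimization is applied), the following computes the bound for each pair of compute and bandwidth ceilings. The ceiling names and all numbers are hypothetical placeholders, not Opteron X2 data.

```python
# All values are assumed for illustration; they are not Opteron X2 measurements.
COMPUTE_CEILINGS_GFLOPS = {
    "no ILP or SIMD tuning": 2.0,
    "with ILP": 4.0,
    "with ILP and SIMD (peak)": 8.0,
}
BANDWIDTH_CEILINGS_GBS = {
    "no prefetch or affinity tuning": 4.0,
    "with software prefetch (peak)": 8.0,
}


def bound_gflops(oi, compute_ceiling, bandwidth_ceiling):
    """Roofline bound under one compute ceiling and one bandwidth ceiling."""
    return min(compute_ceiling, bandwidth_ceiling * oi)


def limiting_side(oi, compute_ceiling, bandwidth_ceiling):
    """Report which ceiling is the binding one at this operational intensity."""
    return "memory" if bandwidth_ceiling * oi < compute_ceiling else "compute"


def ceiling_table(oi):
    """List (compute ceiling, bandwidth ceiling, bound, limiting side) rows."""
    rows = []
    for c_name, c in COMPUTE_CEILINGS_GFLOPS.items():
        for b_name, b in BANDWIDTH_CEILINGS_GBS.items():
            rows.append((c_name, b_name,
                         bound_gflops(oi, c, b),
                         limiting_side(oi, c, b)))
    return rows


for row in ceiling_table(oi=0.5):
    print(row)
```

Reading such a table for a kernel's operational intensity suggests which optimization to try next: if the binding ceiling is on the memory side, raising a compute ceiling does not change the bound.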

What is next for Roofline? Non-floating-point kernels would be interesting, e.g., Sort (potential exchanges/sec vs. GB/s) and Graph Traversal (nodes traversed/sec vs. GB/s). Opportunities for others to help investigate: many kernels, multicores, metrics, … For example, Jike Chong ported two financial PDE solvers to four other multicore computers: the Intel Penryn and Larrabee and the NVIDIA G80 and GTX280.[9] He used the Roofline model to keep track of the platforms' peak arithmetic throughput and L1, L2, and DRAM bandwidths. By analyzing an algorithm's working set and operational intensity, he was able to use the Roofline model to quickly estimate the need for algorithmic improvements. Specifically, for the option-pricing problem with an implicit PDE solver, the working set is small enough to fit into L1 and the L1 bandwidth is sufficient to support peak arithmetic throughput, so the Roofline model indicates that no optimization is necessary. For option pricing with an explicit PDE formulation, the working set is too large to fit into cache, and the Roofline model helps to indicate the extent to which cache blocking is necessary to extract peak arithmetic performance.
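The working-set reasoning in the example above can be sketched as follows: pick the smallest memory level that holds the working set, then apply the roofline bound with that level's bandwidth. The capacities, bandwidths, and peak throughput are hypothetical placeholders and do not describe any of the platforms named above; the operational intensity is assumed to be measured against traffic at the chosen level.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryLevel:
    name: str
    capacity_bytes: int
    bandwidth_gbs: float


# Assumed hierarchy, ordered from smallest/fastest to largest/slowest.
LEVELS = [
    MemoryLevel("L1", 32 * 1024, 200.0),
    MemoryLevel("L2", 4 * 1024 * 1024, 80.0),
    MemoryLevel("DRAM", 1 << 40, 10.0),
]


def serving_level(working_set_bytes, levels=LEVELS):
    """Return the first (smallest) level whose capacity holds the working set."""
    for level in levels:
        if working_set_bytes <= level.capacity_bytes:
            return level
    return levels[-1]


def estimate(peak_gflops, working_set_bytes, flops_per_byte):
    """Return (level name, roofline bound in GFLOPs/sec) for a kernel."""
    level = serving_level(working_set_bytes)
    bound = min(peak_gflops, level.bandwidth_gbs * flops_per_byte)
    return level.name, bound


PEAK_GFLOPS = 40.0  # assumed peak arithmetic throughput

# Small working set: fits in L1, and L1 bandwidth sustains the compute peak.
print(estimate(PEAK_GFLOPS, 16 * 1024, 0.5))
# Large working set: spills to DRAM, so the bound falls well below the peak.
print(estimate(PEAK_GFLOPS, 64 * 1024 * 1024, 0.5))
```

The gap between the two bounds is a rough indication of how much cache blocking would have to shrink the working set before peak arithmetic performance is within reach.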

2.2 Capacity Model: HW is represented as nodes, each with a “peak” BW, and the system is represented as a graph of HW nodes. In this talk, for illustration purposes, we assume only two nodes: a memory node and a processing node, each with its own peak BW.

Performance depends on: System characteristics: peak BWs of nodes; memory hierarchy (cache) organization/size; operational overlap. Application characteristics: relative demands on BWs; overheads.

Definitions: ratio of peak BWs; BW used per node; ratio of BWs used; ratio of BW used per node to system BW used.

Capacity of a Node: the average node BW utilized by an application. It is a function of application characteristics and node BW.
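Reading the definition above literally, a node's capacity can be estimated from the work an application pushed through the node and the elapsed time. The sketch below is only that literal reading, with hypothetical function names and made-up counts; the formal definition in Kuck's capacity model may differ.

```python
def node_capacity(work_done, elapsed_seconds):
    """Average bandwidth the application actually used on a node
    (operations/sec for a processing node, bytes/sec for a memory node)."""
    return work_done / elapsed_seconds


def utilization(capacity, peak_bw):
    """Fraction of the node's peak bandwidth that the application used."""
    return capacity / peak_bw


# Assumed counts for one application run; not measured data.
FLOPS_EXECUTED = 2e9
BYTES_MOVED = 8e9
ELAPSED_SECONDS = 2.0

# Assumed peak bandwidths of the two nodes.
PEAK_PROCESSING_BW = 16e9  # FLOP/sec
PEAK_MEMORY_BW = 8e9       # bytes/sec

processor_capacity = node_capacity(FLOPS_EXECUTED, ELAPSED_SECONDS)
memory_capacity = node_capacity(BYTES_MOVED, ELAPSED_SECONDS)

print(processor_capacity, utilization(processor_capacity, PEAK_PROCESSING_BW))
print(memory_capacity, utilization(memory_capacity, PEAK_MEMORY_BW))
```

Counts like these are the kind of data that hardware performance counters could supply, which connects to the later discussion item on PMUs.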

Saturated Node Capacity: Assume that at least one of the nodes is saturated; then processor capacity is given by [expression not preserved]. A similar expression applies for memory capacity and for system capacity. A similar argument holds for an unsaturated node pair.

Saturated Node Capacity Expression: Example for αp,m = ½

Processor, Memory, and System Capacity Curves

3. Relating Roofline/Capacity: A processing optimization ceiling, x, in Roofline corresponds to a used processing BW. A memory optimization ceiling, y, in Roofline corresponds to a used memory BW. If an application is optimized using optimizations x and y, then [expression not preserved].
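One way to read the correspondence above, offered as an assumption rather than as the slide's own expression (which is not preserved): take the roofline bound under ceilings x and y as the used processing BW, and divide it by the operational intensity to get the used memory BW. The function name and numbers are hypothetical.

```python
def used_bandwidths(oi, compute_ceiling_gflops, memory_ceiling_gbs):
    """Return (used processing BW in GFLOPs/sec, used memory BW in GBytes/sec)
    for a kernel with operational intensity `oi` running under a processing
    ceiling x and a memory ceiling y. Assumes the kernel reaches the bound."""
    performance = min(compute_ceiling_gflops, memory_ceiling_gbs * oi)
    used_processing_bw = performance
    used_memory_bw = performance / oi
    return used_processing_bw, used_memory_bw


# Memory-side ceiling binds: the memory node is saturated at y.
print(used_bandwidths(oi=0.5, compute_ceiling_gflops=8.0, memory_ceiling_gbs=6.0))
# Compute-side ceiling binds: the processing node is saturated at x.
print(used_bandwidths(oi=4.0, compute_ceiling_gflops=8.0, memory_ceiling_gbs=6.0))
```

Under this reading, at least one of the two nodes always runs at its ceiling, which matches the "at least one node is saturated" assumption used in the capacity model.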

Roofline model with ceilings for Opteron X2

4. Open Issues: Modeling with different performance-limiting factors: cache-resident client applications (i.e., memory BW is not the limit); additional bounds such as network BW and I/O BW. Development of tools based on the models for use in application optimization.

5. Discussion: How could PMUs help?

References: Roofline Model. S. Williams, A. Waterman, D. Patterson, “Roofline: An Insightful Visual Performance Model for Multicore Architectures,” Communications of the ACM, Vol. 52, Issue 4 (April 2009), pp. 65-76. D. Patterson, “The Parallel Revolution Has Started: Are You Part of the Solution or Part of the Problem?”, April 8, 2009 lecture in the Parallel@Illinois Distinguished Lecture Series (http://www.parallel.illinois.edu/dls_archive.html).

References: Capacity Model. D. J. Kuck, “Computer System Capacity Fundamentals,” National Bureau of Standards, Technical Note 851, Oct. 1974. D. J. Kuck, B. Kumar, “A System Model for Computer Performance Evaluation,” SIGMETRICS '76: Proceedings of the 1976 ACM SIGMETRICS Conference on Computer Performance Modeling Measurement and Evaluation, March 1976. D. J. Kuck, The Structure of Computers and Computations, Vol. I, John Wiley & Sons, Inc., 1978.

D. J. Kuck, “Capacity-based Codesign of Computer HW and SW,” January 26, 2009 lecture in the Parallel@Illinois Distinguished Lecture Series (http://www.parallel.illinois.edu/dls_archive.html).