1 SciDAC High-End Computer System Performance: Science and Engineering
Jack Dongarra, Innovative Computing Laboratory, University of Tennessee
http://www.cs.utk.edu/~dongarra/

2 Four Components for the University of Tennessee
- Performance capturing tools
  - PAPI
- Self-adapting numerical software: automatic performance enhancement
  - SANS/AEOS/ATLAS
- Performance repository for applications, kernels, machines, etc.
  - NETLIB, Repository in a Box (RIB)
- Modeling, predictability

3 Tools for Performance Evaluation
- Timing and performance evaluation has been an art
  - Resolution of the clock
  - Cache effects
  - Differences between systems
- Can be cumbersome and inefficient with traditional tools
- The situation is about to change
  - Today's processors have internal counters
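The clock-resolution and cache issues above can be illustrated with a minimal Python sketch that relies only on the standard library's `time.perf_counter` (an assumption: any monotonic high-resolution timer would serve). It estimates the timer's granularity empirically, then repeats a kernel until that granularity no longer dominates the measurement. The function names are hypothetical.

```python
import time


def estimate_clock_resolution(clock=time.perf_counter, samples=1000):
    """Smallest observed increment of `clock`, in seconds.

    Spins until two consecutive readings differ. The result includes the
    cost of calling the clock, so it is an upper bound on the true tick.
    """
    smallest = float("inf")
    for _ in range(samples):
        t0 = clock()
        t1 = clock()
        while t1 == t0:              # wait for the clock to advance
            t1 = clock()
        smallest = min(smallest, t1 - t0)
    return smallest


def time_kernel(kernel, resolution, min_ticks=1000):
    """Average seconds per call of `kernel`, measured with time.perf_counter.

    The repetition count doubles until the timed interval spans at least
    `min_ticks` clock ticks, keeping the quantization error below roughly
    1/min_ticks of the measurement.
    """
    kernel()                         # warm-up call: load code and data into cache
    reps = 1
    while True:
        t0 = time.perf_counter()
        for _ in range(reps):
            kernel()
        elapsed = time.perf_counter() - t0
        if elapsed >= min_ticks * resolution:
            return elapsed / reps
        reps *= 2
```

For example, `time_kernel(lambda: sum(range(1000)), estimate_clock_resolution())` reports a warm-cache time per call. Measuring cold-cache behavior would require evicting the data between runs, which this sketch does not attempt.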

4 Performance Counters
- Almost all high-performance processors include hardware performance counters.
- Some are easy to access; others are not available to users.
- On most platforms the APIs, if they exist, are neither appropriate for the end user nor well documented.
- Existing performance counter APIs:
  - Compaq Alpha EV6 & EV67
  - SGI MIPS R10000
  - IBM Power Series
  - Cray T3E
  - Sun Solaris
  - Pentium (Linux and Windows)
  - IA-64
  - HP PA-RISC
  - Hitachi
  - Fujitsu
  - NEC

5 Overview of PAPI
- Performance Application Programming Interface
- The purpose of the PAPI project is to design, standardize, and implement a portable and efficient API to access the hardware performance monitor counters found on most modern microprocessors.
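One way to picture the portability goal is a table that maps each portable event name to the native events of a particular platform, possibly combining several native counters into one derived value. `PAPI_TOT_CYC` and `PAPI_L1_DCM` are real PAPI preset names. The platform names, native event names, and helper functions below are invented for illustration, so this is a sketch of the idea rather than PAPI's internal design.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical per-platform tables. Each portable preset name maps to a list of
# (native_event, sign) terms, and the preset's value is the signed sum of those
# native counts. Platform and native event names are invented for illustration.
Terms = List[Tuple[str, int]]
PLATFORM_TABLES: Dict[str, Dict[str, Terms]] = {
    "platform_a": {
        "PAPI_TOT_CYC": [("CYCLES", 1)],
        "PAPI_L1_DCM": [("L1D_MISS", 1)],
    },
    "platform_b": {
        "PAPI_TOT_CYC": [("CPU_CLK", 1)],
        # No direct miss counter here: derive misses as accesses minus hits.
        "PAPI_L1_DCM": [("L1D_ACCESS", 1), ("L1D_HIT", -1)],
    },
}


def resolve(platform: str, preset: str) -> Optional[Terms]:
    """Native terms implementing `preset` on `platform`, or None if unsupported."""
    return PLATFORM_TABLES.get(platform, {}).get(preset)


def preset_value(platform: str, preset: str, native_counts: Dict[str, int]) -> int:
    """Combine raw native counts into the value of a portable preset."""
    terms = resolve(platform, preset)
    if terms is None:
        raise KeyError(f"{preset} is not available on {platform}")
    return sum(sign * native_counts[name] for name, sign in terms)
```

Application code asks only for `PAPI_L1_DCM`. The table absorbs the difference between a platform that counts misses directly and one where misses must be derived.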

6 Performance Data from PAPI
- Execution rate (MIPS, Flop/s)
- Bandwidth utilization
  - Main memory
  - L2 cache
  - L1 cache
- Cache miss statistics: Icache, Dcache, and L2 cache
- TLB misses
- Mispredicted branches
- Instruction mix (FP, branch, LD/ST, other)
- Load/store instruction issue rate
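Most of the quantities listed above are rates or ratios of raw event counts. The sketch below assumes the counts have already been collected and are keyed by PAPI preset names (`PAPI_TOT_INS`, `PAPI_FP_OPS`, `PAPI_L1_DCA`, and so on). Which presets are actually available varies from processor to processor. The function name and the particular metrics chosen are illustrative.

```python
def derived_metrics(counts, seconds):
    """Turn raw event counts into rates and ratios.

    `counts` maps PAPI preset names to totals collected over `seconds`
    of wall-clock time.
    """
    ins = counts["PAPI_TOT_INS"]
    return {
        "MIPS": ins / seconds / 1e6,
        "MFLOP/s": counts["PAPI_FP_OPS"] / seconds / 1e6,
        "IPC": ins / counts["PAPI_TOT_CYC"],               # instructions per cycle
        "L1 data miss rate": counts["PAPI_L1_DCM"] / counts["PAPI_L1_DCA"],
        "branch mispredict rate": counts["PAPI_BR_MSP"] / counts["PAPI_BR_INS"],
        "load/store fraction": (counts["PAPI_LD_INS"] + counts["PAPI_SR_INS"]) / ins,
    }
```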

7 Implementation
- Counters exist as a small set of registers that count events.
- PAPI provides three interfaces to the underlying counter hardware:
  1. A low-level interface that manages hardware events in user-defined groups called EventSets.
  2. A high-level interface that simply provides the ability to start, stop, and read the counters for a specified list of events.
  3. Graphical tools to visualize the information.
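The difference between the low-level and high-level interfaces can be modelled without real hardware. In the sketch below, a hypothetical `SimulatedCounters` object stands in for the counter registers, `EventSet` mimics the group-based low-level style (add events, start, read, stop), and `count_events` wraps it in a one-call high-level style. In the actual C library the corresponding low-level calls include `PAPI_create_eventset`, `PAPI_add_event`, `PAPI_start`, `PAPI_read`, and `PAPI_stop`. The Python classes are only an analogy, and the two-register default is an assumption chosen to mimic a small register file.

```python
from typing import Dict, List


class SimulatedCounters:
    """Stand-in for the counter hardware: a few registers that count events.
    A real implementation reads CPU registers; here callers drive `tick`."""

    def __init__(self):
        self.registers: Dict[str, int] = {}

    def tick(self, event: str, n: int = 1) -> None:
        self.registers[event] = self.registers.get(event, 0) + n

    def value(self, event: str) -> int:
        return self.registers.get(event, 0)


class EventSet:
    """Low-level style: a user-defined group of events started, read and stopped together."""

    def __init__(self, hw: SimulatedCounters, max_counters: int = 2):
        self.hw = hw
        self.max_counters = max_counters  # assumed size of the register file
        self.events: List[str] = []
        self.baseline: Dict[str, int] = {}
        self.running = False

    def add(self, event: str) -> None:
        if self.running:
            raise RuntimeError("cannot add events while counting")
        if len(self.events) >= self.max_counters:
            raise RuntimeError("no free counter register")
        self.events.append(event)

    def start(self) -> None:
        self.baseline = {e: self.hw.value(e) for e in self.events}
        self.running = True

    def read(self) -> List[int]:
        """Counts accumulated since start(), without stopping."""
        return [self.hw.value(e) - self.baseline[e] for e in self.events]

    def stop(self) -> List[int]:
        values = self.read()
        self.running = False
        return values


def count_events(hw: SimulatedCounters, events: List[str], work) -> List[int]:
    """High-level style: count a list of events around a single call to `work`."""
    es = EventSet(hw, max_counters=len(events))
    for e in events:
        es.add(e)
    es.start()
    work()
    return es.stop()
```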

8 PAPI - Supported Processors
- Intel Pentium, Pentium Pro, II, III, 4
  - Linux 2.4, 2.2, 2.0 and perf kernel patch
- IBM POWER3, 604, 604e
  - AIX 4.3 and pmtoolkit (if available)
- Sun UltraSPARC I, II, and III
  - Solaris 2.8
- MIPS R10K, R12K
- AMD Athlon
  - Linux 2.4 and perf kernel patch
- Cray T3E, SV1, SV2
- Soon: Windows 2000, Compaq Alpha EV6 & EV67, and Intel IA-64

9 Go To Demo

10 PAPI’s Parallel Interface

11 PAPI Development
- Extensions to PAPI to support collection and analysis of hardware performance counter data in the context of shared- and distributed-memory parallel programs
  - Allowing straightforward instrumentation of multithreaded and multiprocessor applications
- Tools will include graphical tools extended with dynamic instrumentation capabilities
  - Framework for using Dyninst with parallel programs, the Free Probe Class Server (FPCS), and IBM's Dynamic Probe Class Library (DPCL)
- Port PAPI to Compaq Alpha and HP machines
- Summary information on problem spots within applications
- Integration with other tools: SvPablo, Dyninst, etc.
- Help with setting up PAPI at various sites
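For multithreaded applications, one common pattern is to keep counts per thread, merge them afterwards, and then summarize where cycles and misses concentrate. The sketch below is a pure-Python illustration of that pattern. Event values are supplied by the caller rather than read from hardware, and the class and function names, the `(function, event)` key layout, and the misses-per-1000-cycles metric are all hypothetical choices.

```python
import threading
from collections import Counter, defaultdict


class PerThreadCounts:
    """Each thread accumulates into its own Counter, so the hot path takes no
    shared lock; the per-thread Counters are merged after the threads join."""

    def __init__(self):
        self._local = threading.local()
        self._all = []                    # every per-thread Counter created
        self._lock = threading.Lock()

    def _mine(self):
        counts = getattr(self._local, "counts", None)
        if counts is None:
            counts = self._local.counts = Counter()
            with self._lock:              # taken once per thread, at registration
                self._all.append(counts)
        return counts

    def add(self, function, event, n):
        self._mine()[(function, event)] += n

    def merged(self):
        """Sum over all threads; call only after the worker threads have joined."""
        total = Counter()
        for counts in self._all:
            total.update(counts)          # Counter.update adds counts
        return total


def problem_spots(totals, top=3):
    """Rank functions by PAPI_TOT_CYC and report PAPI_L1_DCM per 1000 cycles."""
    per_function = defaultdict(dict)
    for (function, event), n in totals.items():
        per_function[function][event] = n
    ranked = sorted(per_function.items(),
                    key=lambda item: item[1].get("PAPI_TOT_CYC", 0), reverse=True)
    report = []
    for function, events in ranked[:top]:
        cycles = events.get("PAPI_TOT_CYC", 0)
        misses = events.get("PAPI_L1_DCM", 0)
        report.append((function, cycles, 1000 * misses / max(cycles, 1)))
    return report
```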

12 Repository Development
- Repository of tools and data on performance evaluation
  - A network-based catalog that will serve as a "road map" to important performance-evaluation enabling technologies
  - A methodology for evaluating and measuring the success of the tools
- SciDAC outreach: start a community effort for the collection and dissemination of performance data

13 Self-Adapting Numerical Software (SANS)
- Today's processors can achieve high performance, but this requires extensive machine-specific hand tuning.
  - Simple operations like matrix-vector ops require many man-hours per platform
  - Software lags far behind hardware introduction
  - Tuning is done only if the financial incentive is there
- Compilers are not up to the optimization challenge
- Hardware, compilers, and software have a large design space with many parameters
  - Blocking sizes, loop nesting permutations, loop unrolling depths, software pipelining strategies, register allocations, and instruction schedules
  - Complicated interactions with the increasingly sophisticated micro-architectures of new microprocessors
- Need for quick/dynamic deployment of optimized routines
- ATLAS: Automatically Tuned Linear Algebra Software
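The empirical search that ATLAS automates can be sketched for a single parameter, the blocking size of a matrix multiply: run each candidate on the target machine, time it, and keep the fastest. This is a simplified illustration, not ATLAS's actual code generator, which searches over many more parameters and emits tuned C kernels. In pure Python the interpreter overhead largely hides cache effects, so the sketch demonstrates the search procedure rather than realistic speedups. The function names and candidate list are assumptions.

```python
import random
import time


def blocked_matmul(a, b, n, nb):
    """C = A B for n x n matrices stored as lists of rows, tiled into nb x nb blocks."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, nb):
        for kk in range(0, n, nb):
            for jj in range(0, n, nb):
                for i in range(ii, min(ii + nb, n)):
                    ci, ai = c[i], a[i]
                    for k in range(kk, min(kk + nb, n)):
                        aik, bk = ai[k], b[k]
                        for j in range(jj, min(jj + nb, n)):
                            ci[j] += aik * bk[j]
    return c


def time_once(a, b, n, nb):
    t0 = time.perf_counter()
    blocked_matmul(a, b, n, nb)
    return time.perf_counter() - t0


def tune_block_size(n=64, candidates=(8, 16, 32, 64), trials=3, seed=0):
    """Time every candidate block size on this machine and return the fastest.

    Taking the minimum over several trials filters out timing noise.
    """
    rng = random.Random(seed)
    a = [[rng.random() for _ in range(n)] for _ in range(n)]
    b = [[rng.random() for _ in range(n)] for _ in range(n)]
    best_nb, best_t = None, float("inf")
    for nb in candidates:
        t = min(time_once(a, b, n, nb) for _ in range(trials))
        if t < best_t:
            best_nb, best_t = nb, t
    return best_nb, best_t
```

ATLAS performs its search once, at installation time, so the cost of tuning is not paid on every call.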

14 SANS Extensions
- BLAS
- Sparse matrix operations
- Message passing
- Algorithm selection at a higher level
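Algorithm selection at a higher level can be illustrated with a matrix-vector product: inspect the matrix, then choose between a dense kernel and a compressed sparse row (CSR) kernel. The density threshold below is an arbitrary placeholder. A self-adapting system would instead determine it by measurement on the target machine. All names are hypothetical.

```python
def to_csr(dense):
    """Compressed sparse row form: nonzero values, their column indices, row pointers."""
    values, columns, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                columns.append(j)
        row_ptr.append(len(values))
    return values, columns, row_ptr


def csr_matvec(csr, x):
    values, columns, row_ptr = csr
    return [sum(values[p] * x[columns[p]] for p in range(row_ptr[i], row_ptr[i + 1]))
            for i in range(len(row_ptr) - 1)]


def dense_matvec(a, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def select_matvec(a, density_threshold=0.1):
    """Pick a kernel from the matrix's density; returns (name, function computing A x)."""
    nnz = sum(1 for row in a for v in row if v != 0)
    density = nnz / (len(a) * len(a[0]))
    if density < density_threshold:
        csr = to_csr(a)                  # convert once, reuse for every product
        return "csr", lambda x: csr_matvec(csr, x)
    return "dense", lambda x: dense_matvec(a, x)
```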

15 Repository in a Box (RIB)
- Metadata objects are stored in repositories.
- A repository automatically generates a web site for displaying customizable views of its metadata: search, browse, join, etc.
- Metadata objects are also made available to network applications via the RIB API.
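A minimal in-memory model of the behavior described above: metadata objects stored in a repository, views for browsing by kind and for free-text search, and a generated HTML table standing in for the automatically generated web site. This is a hypothetical sketch, not the RIB API, and the class names and fields are assumptions.

```python
import html
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MetadataObject:
    name: str
    kind: str                                   # e.g. "tool", "kernel", "machine"
    attrs: Dict[str, str] = field(default_factory=dict)


class Repository:
    """Hypothetical in-memory repository of metadata objects."""

    def __init__(self, name: str):
        self.name = name
        self._objects: Dict[str, MetadataObject] = {}

    def add(self, obj: MetadataObject) -> None:
        self._objects[obj.name] = obj

    def browse(self, kind: str) -> List[str]:
        return sorted(o.name for o in self._objects.values() if o.kind == kind)

    def search(self, text: str) -> List[str]:
        """Case-insensitive match against object names and attribute values."""
        text = text.lower()
        return sorted(o.name for o in self._objects.values()
                      if text in o.name.lower()
                      or any(text in v.lower() for v in o.attrs.values()))

    def to_html(self) -> str:
        """Render one table row per object, standing in for the generated web view."""
        rows = "".join(
            f"<tr><td>{html.escape(o.name)}</td><td>{html.escape(o.kind)}</td></tr>"
            for o in sorted(self._objects.values(), key=lambda o: o.name))
        return f"<table>{rows}</table>"
```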

16 Repository Interoperation
[Diagram labels: My Repository, Your Repository (each with metadata objects), Our Virtual Repository, HTML Catalog]
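Assuming the diagram depicts two repositories whose metadata objects are combined into one virtual repository and published as an HTML catalog, the merge can be sketched as follows. Each repository is modelled as a plain dictionary. The precedence rule (the alphabetically first repository wins when two define the same field) is an arbitrary assumption, and the function names are hypothetical.

```python
import html
from typing import Dict

# A repository is modelled as: object name -> {field: value} (hypothetical shape).
Repo = Dict[str, Dict[str, str]]


def virtual_repository(repos: Dict[str, Repo]) -> Dict[str, dict]:
    """Merge several repositories into one view without modifying the originals.

    Each merged entry records which repositories hold the object.
    """
    merged: Dict[str, dict] = {}
    for repo_name in sorted(repos):               # deterministic precedence
        for obj_name, meta in repos[repo_name].items():
            entry = merged.setdefault(obj_name, {"meta": {}, "sources": []})
            for key, value in meta.items():
                entry["meta"].setdefault(key, value)  # first repository to define a field wins
            entry["sources"].append(repo_name)
    return merged


def html_catalog(view: Dict[str, dict]) -> str:
    """One list item per object, naming the repositories that contribute it."""
    items = []
    for name in sorted(view):
        sources = ", ".join(view[name]["sources"])
        items.append(f"<li>{html.escape(name)} ({html.escape(sources)})</li>")
    return "<ul>" + "".join(items) + "</ul>"
```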

17 Tools Integration
- PAPI, Dyninst, SvPablo
- Intelligent adaptation
  - Rose and SANS (ATLAS)
- The Repository in a Box effort provides a toolkit for building and maintaining metadata repositories

18 Interaction with Other Efforts
- SciDAC: TOPS
  - David Keyes, ICASE/ODU/LLNL
- SciDAC: Astrophysics
  - Tony Mezzacappa, ORNL
- DOE: Cross-Platform Infrastructure for Scalable Runtime Application Performance Analysis
  - Bart Miller, U Wisc
  - Jeff H., U of Maryland

19 High-End Computer System Performance: Science and Engineering
- Activities for UTennessee
  - Performance capturing tools
    - PAPI
  - Automatic performance enhancement
    - SANS/AEOS/ATLAS
  - Performance repository for applications, kernels, machines, etc.
    - NETLIB, RIB
  - Modeling, predictability