Productive Performance Tools for Heterogeneous Parallel Computing

Productive Performance Tools for Heterogeneous Parallel Computing
Allen D. Malony
Department of Computer and Information Science
University of Oregon
(Slide image: Shigeo Fukuda, "Lunch with a Helmet On")

Parallel Performance Engineering and Productivity
- Scalable, optimized applications deliver on the promise of HPC
- Optimization happens through a performance engineering process
  - Understand performance complexity and inefficiencies
  - Tune the application to run optimally at scale
- How can the process be made more effective and productive?
- What performance technology should be used?
  - Performance technology is part of a larger environment: programmability, reusability, portability, robustness
  - Application development and optimization productivity
- The process, the performance technology, and its use will change as parallel systems evolve and grow to extreme scale
- Goal: deliver effective performance with high productivity value, now and in the future

Heterogeneous Parallel Systems and Performance
- What does it mean to be heterogeneous?
  - "Diverse in character or content" (New Oxford American Dictionary)
  - "Not homogeneous" (Anonymous)
- Diversity in what?
  - Hardware: processors/cores, memory, interconnect, ...; differences in the computing elements and how they are used
  - Software: how the heterogeneous hardware is programmed; different software models, libraries, frameworks, ...
- Diversity how? When? In space? In time?
  - Variation in operation, interaction, concurrency, ...

Why Do We Care?
- Heterogeneity has been around for a long time
  - Different programmable components in computer systems
  - Long history of specialized hardware (e.g., ASPs)
- Heterogeneous (computing) technology is more accessible
  - Multicore processors (e.g., 4-core, 6-core, 8-core, ...)
  - Manycore accelerators (e.g., Tesla, Fermi, Larrabee)
  - High-performance processing engines (e.g., IBM Cell BE)
- Performance is the main driving concern
  - Heterogeneity is arguably the only path to extreme scale
- Heterogeneous software technology is required
  - Greater performance enables more powerful software
  - Will give rise to more sophisticated software environments

Implications for Productive Performance Tools
- The current status quo is comfortable
  - Mostly homogeneous parallel systems and software
    - Shared-memory multithreading: OpenMP
    - Distributed-memory message passing: MPI
  - Parallel computational models are relatively stable (simple)
  - Corresponding performance models are relatively tractable
  - Parallel performance tools are just keeping up
- Heterogeneity creates richer computational potential
  - Results in greater performance diversity and complexity
- Heterogeneous systems will utilize more sophisticated programming and runtime environments
- Performance tools have to support richer computation models and more versatile performance perspectives

Heterogeneous Performance Views
- Want performance views that capture heterogeneous concurrency and execution behavior
  - Reflect execution logic beyond standard actions
  - Capture performance semantics at multiple levels
  - Allow compatible perspectives that do not conflict
  - Performance views may have to work with reduced data
- Consider the "host" path and the "external" paths
  - Want to capture performance for all execution paths
  - External execution may be more difficult to measure
- What perspective does the host have of external entities? (see the sketch below)
  - Determines the semantics of the measurement data
  - Determines assumptions about entity behavior
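
The host's choice of perspective changes what its measurements mean. The following is not a TAU mechanism, just a minimal CUDA C++ sketch (the `busy` kernel and the timing helper are hypothetical) contrasting two host views of the same GPU activity:

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for some external GPU work.
__global__ void busy(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 1000; k++)
            d[i] = d[i] * 1.000001f + 0.5f;
}

static double now_ms() {
    using namespace std::chrono;
    return duration<double, std::milli>(
        steady_clock::now().time_since_epoch()).count();
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    // Perspective 1: time only the launch. The kernel is asynchronous,
    // so this measures host-side launch cost, not GPU execution.
    double t0 = now_ms();
    busy<<<(n + 255) / 256, 256>>>(d, n);
    double launch_ms = now_ms() - t0;

    // Perspective 2: time launch plus synchronization. This includes
    // the external (GPU) execution, but attributes it to host waiting.
    cudaDeviceSynchronize();
    double inclusive_ms = now_ms() - t0;

    printf("launch-only view: %.3f ms; inclusive view: %.3f ms\n",
           launch_ms, inclusive_ms);
    cudaFree(d);
    return 0;
}
```

Neither view is wrong; they encode different measurement semantics and different assumptions about what the external entity is doing.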

Task-based Performance View
- Consider the "task" abstraction for an accelerator scenario
  - The host regards external execution as a task
  - Tasks operate concurrently with respect to the host
  - Requires support for tracking asynchronous execution
- The host creates the measurement perspective for the external task
  - Maintains local and remote performance data
- Tasks may have limited measurement support
  - May depend on the host for performance data I/O
  - Performance data might be received from the external task
- How to create a view of external performance? (see the sketch below)
(Figure: host and GPU)
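
As one way the host can track an asynchronous GPU task, here is a minimal CUDA C++ sketch (not TAUcuda itself; the `saxpy` kernel is a stand-in) that brackets the task with events and collects its remote timing after the fact:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the external "task".
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    // Events bracket the asynchronous GPU task without blocking the
    // host; the timing is resolved later, once the task completes.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);                         // task begins (GPU timeline)
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaEventRecord(stop);                          // task ends (GPU timeline)

    // The host continues concurrently; only here does it wait and read
    // back the remote performance data for the task.
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("GPU task (saxpy) ran for %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```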

Simple Master-Worker Scenario
- The master sends tasks to N workers
- Workers report their performance back to the master
  - Done for each piece of work
- In general, the external performance data can be arbitrary
  - A single time value
  - A more complete representation of external performance
- Build a worker performance perspective in the master (see the sketch below)
  - Each work task will appear as a separate "thread"
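
A minimal MPI sketch of this scenario, assuming each worker reports a single time value per task (the tags, task count, and work distribution below are made up for illustration):

```cpp
#include <cstdio>
#include <mpi.h>

// Hypothetical message tags for this sketch.
enum { TAG_WORK = 1, TAG_PERF = 2 };

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);   // assumes at least 2 ranks

    const int ntasks = 8;
    if (rank == 0) {
        // Master: hand out task ids round-robin, then collect one
        // timing report per task.
        for (int t = 0; t < ntasks; t++) {
            int worker = 1 + t % (size - 1);
            MPI_Send(&t, 1, MPI_INT, worker, TAG_WORK, MPI_COMM_WORLD);
        }
        for (int t = 0; t < ntasks; t++) {
            double elapsed;
            MPI_Status st;
            MPI_Recv(&elapsed, 1, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_PERF,
                     MPI_COMM_WORLD, &st);
            // Each reported task becomes one row (one logical "thread")
            // in the master's view of worker performance.
            printf("worker %d: task time %.6f s\n", st.MPI_SOURCE, elapsed);
        }
    } else {
        // Worker: receive each assigned task, time it, report back.
        for (int t = rank - 1; t < ntasks; t += size - 1) {
            int task;
            MPI_Recv(&task, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            double t0 = MPI_Wtime();
            // ... perform the work for 'task' here ...
            double elapsed = MPI_Wtime() - t0;
            MPI_Send(&elapsed, 1, MPI_DOUBLE, 0, TAG_PERF, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}
```

With richer payloads, the same pattern lets workers ship back full profiles, which the master files under one logical "thread" per task.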

CPU – GPU Execution Scenarios

Heterogeneous Complexity Issues
- Asynchronous execution (concurrency)
- Memory transfer and memory sharing
- Interactions between heterogeneous components
- Interference between heterogeneous components
- Different programming languages/libraries/tools
- Availability of performance information
- Performance measurement and analysis models
- ...
(The sketch below illustrates the first two issues.)
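
Even a minimal CUDA C++ sketch (with a hypothetical `scale` kernel) shows how asynchrony and memory transfer complicate measurement; every call below returns before the GPU work actually happens:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float));  // pinned memory, required for async copies
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t s;
    cudaStreamCreate(&s);

    // All three calls return immediately; the transfers and the kernel
    // execute later, overlapped with whatever the host does next. A
    // host-side timer around this block would measure launch overhead,
    // not GPU work; that is the core measurement problem here.
    cudaMemcpyAsync(d, h, n * sizeof(float), cudaMemcpyHostToDevice, s);
    scale<<<(n + 255) / 256, 256, 0, s>>>(d, n);
    cudaMemcpyAsync(h, d, n * sizeof(float), cudaMemcpyDeviceToHost, s);

    // ... host work proceeds concurrently here ...

    cudaStreamSynchronize(s);   // only now is the GPU activity complete
    cudaStreamDestroy(s);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```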

Trace of LINPACK with GPUs (TAUcuda, Jumpshot)

Trace of LINPACK with GPUs (TAUcuda, Vampir)

NAMD with CUDA
- NAMD is a molecular dynamics application (Charm++)
- NAMD has been accelerated with CUDA
- TAU is integrated in Charm++
- Apply TAUcuda to NAMD
  - Four processes, with one Tesla GPU each

NAMD with CUDA (4 processes)
(Figure: performance view of the four processes, with the GPU kernel labeled)

Scaling NAMD with CUDA
(Figure: scaling results showing good GPU performance)

HMPP Workbench with TAUcuda
(Figure: trace showing the host process, the compute kernel, and the transfer kernel)