Evaluation of Modern Parallel Vector Architectures
Leonid Oliker
Future Technologies Group, Computational Research Division, LBNL
www.nersc.gov/~oliker

Previous Research
 Examined complex interactions between high-level algorithms, leading programming paradigms, and modern architectural platforms
 Evaluated three parallelization strategies of a dynamic unstructured mesh adaptation algorithm
 Examined two major classes of adaptive applications (UMA and N-body) under three parallel programming models
 Investigated the effects of algorithmic orderings on sparse matrix computations
 Evaluated the performance of shared-virtual-memory systems on PC-SMP clusters using six application kernels (structured and unstructured)
 Architectures examined: T3E, Origin2000, SP, PC cluster, MTA
 Examined scientific kernels on emerging microarchitectures: VIRAM (Berkeley PIM) and Imagine (Stanford stream architecture)
 Programming paradigms: MPI, OpenMP, hybrid, SHMEM, shared memory, multithreading, vectorization, streaming

New Evaluation Project: Modern Parallel Vector Systems
 Vector architectures: SX6, X1, and the Earth Simulator (ES)
 Plan to study the key factors of modern parallel vector systems (runtime, scalability, programmability, portability, and memory overhead) while identifying potential bottlenecks
 Examine microbenchmarks, kernels, and application codes
 Open questions: What fraction of scientific codes is suitable for these architectures? What is the best programming paradigm? What algorithmic modifications are required? What are the scalability-limiting factors? What are the migration issues in terms of performance portability?

Microbenchmark and Kernel Codes
 Examine memory bandwidth within a node for simple and complex array addressing (see the sketch below)
 Examine low-level message-passing characteristics: point-to-point, intra-node, extra-node, aggregate operations, and one-sided performance, as well as I/O
 Task and thread performance: thread creation, task management, locks, semaphores, and barriers; explicit threads vs. implicit OpenMP
 Evaluate the NAS Parallel Benchmarks using MPI, OpenMP, and hybrid programming; new Class D and E problem sizes are being developed by Rob Van der Wijngaart at NASA Ames
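The memory-bandwidth study in the first bullet contrasts simple (unit-stride) with complex (strided and indirect) array addressing. The following is a minimal sketch of such a microbenchmark in C; the array size, stride, gather pattern, and timing method are illustrative assumptions, not the project's actual benchmark code.

```c
/* Minimal single-node memory-bandwidth sketch: times unit-stride,
 * strided, and indirectly addressed (gather) accesses. Sizes and
 * patterns are illustrative assumptions. Counts one read and one
 * write per element when reporting bandwidth. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1L << 24)   /* 16M doubles: large enough to defeat caches */

static double elapsed(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + 1e-9 * (b.tv_nsec - a.tv_nsec);
}

int main(void)
{
    double *x = malloc(N * sizeof *x);
    double *y = malloc(N * sizeof *y);
    long *idx = malloc(N * sizeof *idx);
    struct timespec t0, t1;

    for (long i = 0; i < N; i++) {
        x[i] = 1.0;
        idx[i] = (i * 1021) % N;   /* pseudo-random gather pattern */
    }

    /* Unit stride: the best case for both caches and vector pipes. */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++) y[i] = x[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("unit stride: %.1f MB/s\n",
           2.0 * N * sizeof(double) / elapsed(t0, t1) / 1e6);

    /* Stride 8: wastes cache lines on superscalar nodes, but is
     * handled directly by vector load/store units. */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long s = 0; s < 8; s++)
        for (long i = s; i < N; i += 8) y[i] = x[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("stride 8:    %.1f MB/s\n",
           2.0 * N * sizeof(double) / elapsed(t0, t1) / 1e6);

    /* Indexed gather: the "complex addressing" case. */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++) y[i] = x[idx[i]];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("gather:      %.1f MB/s\n",
           2.0 * N * sizeof(double) / elapsed(t0, t1) / 1e6);

    free(x); free(y); free(idx);
    return 0;
}
```

Vector machines typically sustain a much larger fraction of their peak bandwidth on the strided and gather cases than cache-based superscalar nodes do, which is exactly the contrast such a microbenchmark is meant to expose.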

Application Codes
 Astrophysics:
MADCAP: Microwave Anisotropy Dataset Computational Analysis Package. Analyzes cosmic microwave background radiation datasets to extract the maximum-likelihood angular power spectrum. (Julian Borrill, LBNL)
CACTUS: Direct evolution of Einstein's equations, involving a coupled set of non-linear hyperbolic and elliptic equations with thousands of terms (see the stencil sketch below). (John Shalf, LBNL)
 Climate:
CCM3: Community Climate Model. (Michael Wehner, LBNL)
 Fluid Dynamics:
OverflowD: Overset-grid Navier-Stokes solver. Simulates complex rotorcraft vortex dynamics problems. (Mohammad Djomehri, NASA)
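Grid-based solvers such as CACTUS illustrate why several codes in this list are promising vector candidates: their inner loops are long, unit-stride sweeps over regular arrays. Below is a minimal sketch, assuming an illustrative 7-point stencil rather than CACTUS's actual Einstein-equation update; grid size and coefficients are arbitrary.

```c
/* Illustrative 7-point stencil sweep on a regular 3D grid, the access
 * pattern at the heart of CACTUS-style hyperbolic solvers. The long,
 * stride-1 innermost loop vectorizes directly. */
#define NX 128
#define NY 128
#define NZ 128
#define IDX(i, j, k) ((((i) * NY) + (j)) * NZ + (k))

void sweep(const double *u, double *unew)
{
    for (int i = 1; i < NX - 1; i++)
        for (int j = 1; j < NY - 1; j++)
            /* innermost loop: unit stride, trip count NZ-2 */
            for (int k = 1; k < NZ - 1; k++)
                unew[IDX(i, j, k)] = u[IDX(i, j, k)]
                    + 0.1 * (u[IDX(i + 1, j, k)] + u[IDX(i - 1, j, k)]
                           + u[IDX(i, j + 1, k)] + u[IDX(i, j - 1, k)]
                           + u[IDX(i, j, k + 1)] + u[IDX(i, j, k - 1)]
                           - 6.0 * u[IDX(i, j, k)]);
}
```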

Application Codes (cont.)
 Fusion:
GTC: Gyrokinetic Toroidal Code. 3D particle-in-cell code to study microturbulence in magnetic-confinement fusion (see the scatter sketch below). (Stephane Ethier, Princeton Plasma Physics Laboratory)
TLBE: Thermal lattice Boltzmann equation solver for modeling turbulence and collisions in plasma. (Jonathan Carter, LBNL)
 Material Science:
PARATEC: PARAllel Total Energy Code. Electronic structure code that performs ab-initio quantum-mechanical total-energy calculations. (Andrew Canning, LBNL)
 Molecular Dynamics:
NAMD: Object-oriented molecular dynamics code designed for the simulation of large biomolecular systems. (David Skinner, LBNL)
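GTC also exposes the main hazard for vectorization in this set: the particle-in-cell charge-deposition phase scatters through an index array, so different loop iterations may update the same grid element. A reduced 1D sketch follows; the names and data layout are assumptions for illustration, not GTC's actual code.

```c
/* Sketch of the charge-deposition (scatter) phase of a particle-in-cell
 * code such as GTC, reduced to 1D with linear weighting. The indirect
 * stores into rho[] may hit the same element for different particles p,
 * so the loop cannot be vectorized as written. */
void deposit(int np, const double *xp, double *rho, int ncell, double dx)
{
    for (int p = 0; p < np; p++) {
        int cell = (int)(xp[p] / dx) % ncell;      /* cell holding particle p */
        double frac = xp[p] / dx - (int)(xp[p] / dx);
        /* potential store conflict: two particles may share a cell */
        rho[cell]               += 1.0 - frac;
        rho[(cell + 1) % ncell] += frac;
    }
}
```

Vector ports of such codes typically work around the conflict by giving each vector element a private copy of the grid array and reducing afterwards, trading memory for vectorizability.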

Benchmarking Timeline and Evaluation Goals
 Currently porting codes to a single-node SX6 (USA)
 Will soon have multi-node SX6 access through DKRZ (Germany)
 Early access to the Cray X1 expected in early February (ORNL)
 Hope to gain Earth Simulator access in summer 2003
 This opportunity will allow us to compare performance and programmability with leading conventional architectures (Power4, Alpha EV67)
 It will also allow comparison with the significantly different X1 system:
X1 vector pipes are "distributed" within the X1 multistreaming processor
Cache-based architecture and support for globally addressable memory
The compiler must identify both streaming (microtasking) and vectorization, while maximizing cache reuse (see the sketch below)
Is the same programming style effective on both the X1 and the ES?
 Help guide future system acquisitions and scientific code development
 Potential to run applications at unprecedented scale
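The compiler challenge noted above corresponds to a two-level loop structure: distribute an outer loop across the four single-streaming processors (SSPs) of an X1 multistreaming processor while vectorizing a unit-stride inner loop on each. A hedged C sketch follows, using an OpenMP directive as a stand-in for the Cray streaming mechanism (the actual X1 directives differ); array names and sizes are assumptions.

```c
/* Two-level parallel loop nest of the kind the X1 compiler must exploit.
 * Outer loop: distributed across SSPs (OpenMP shown as a stand-in for
 * Cray streaming; compile with -fopenmp or equivalent).
 * Inner loop: unit stride, filling each SSP's vector pipes. */
#define NROW 4096
#define NCOL 4096

void scale_rows(double a[NROW][NCOL], const double *s)
{
    /* outer loop: candidate for multistreaming across SSPs */
    #pragma omp parallel for
    for (int i = 0; i < NROW; i++)
        /* inner loop: stride-1, vectorized within one SSP */
        for (int j = 0; j < NCOL; j++)
            a[i][j] *= s[i];
}
```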