Towards Petascale Computing for Science
Horst Simon, Lenny Oliker, David Skinner, and Erich Strohmaier
Lawrence Berkeley National Laboratory
The Salishan Conference on High-Speed Computing, April 19, 2005

Outline
– Science-driven architecture
– Performance on today's platforms
– Challenges with scaling to the Petaflop/s level
– Two tools that can help: IPM and APEX-Map

Scientific Applications and Underlying Algorithms Drive Architectural Design
50 Tflop/s sustained performance on applications of national importance.
Process:
– identify applications
– identify computational methods used in these applications
– identify architectural features most important for performance of these computational methods
Reference: Creating Science-Driven Computer Architecture: A New Path to Scientific Leadership (Horst D. Simon, C. William McCurdy, William T. C. Kramer, Rick Stevens, Mike McCoy, Mark Seager, Thomas Zacharia, Jeff Nichols, Ray Bair, Scott Studham, William Camp, Robert Leland, John Morrison, Bill Feiereisen), Report LBNL-52713, May.

Capability Computing Applications in DOE/SC
– Accelerator modeling
– Astrophysics
– Biology
– Chemistry
– Climate and Earth Science
– Combustion
– Materials and Nanoscience
– Plasma Science/Fusion
– QCD
– Subsurface Transport

Capability Computing Applications in DOE/SC (cont.)
These applications and their computing needs have been well studied in recent years:
– “A Science-Based Case for Large-scale Simulation”, David Keyes, Sept.
– “Validating DOE’s Office of Science ‘Capability’ Computing Needs”, E. Barsis, P. Mattern, W. Camp, R. Leland, SAND report, July 2004.

Science Breakthroughs Enabled by Leadership Computing Capability

Nanoscience
– Goal: Simulate the synthesis and predict the properties of multi-component nanosystems
– Computational methods: Quantum molecular dynamics; quantum Monte Carlo; iterative eigensolvers; dense linear algebra; parallel 3D FFTs
– Breakthrough target: Simulate nanostructures with hundreds to thousands of atoms, as well as transport and optical properties and other parameters

Combustion
– Goal: Predict combustion processes to provide efficient, clean and sustainable energy
– Computational methods: Explicit finite difference; implicit finite difference; zero-dimensional physics; adaptive mesh refinement; Lagrangian particle methods
– Breakthrough target: Simulate laboratory scale flames with high fidelity representations of governing physical processes

Fusion
– Goal: Understand high-energy density plasmas and develop an integrated simulation of a fusion reactor
– Computational methods: Multi-physics, multi-scale; particle methods; regular and irregular access; nonlinear solvers; adaptive mesh refinement
– Breakthrough target: Simulate the ITER reactor

Climate
– Goal: Accurately detect and attribute climate change, predict future climate and engineer mitigation strategies
– Computational methods: Finite difference methods; FFTs; regular and irregular access; simulation ensembles
– Breakthrough target: Perform a full ocean/atmosphere climate model with degree spacing, with an ensemble of 8-10 runs

Astrophysics
– Goal: Determine through simulations and analysis of observational data the origin, evolution and fate of the universe, the nature of matter and energy, galaxy and stellar evolution
– Computational methods: Multi-physics, multi-scale; dense linear algebra; parallel 3D FFTs; spherical transforms; particle methods; adaptive mesh refinement
– Breakthrough target: Simulate the explosion of a supernova with a full 3D model

Opinion Slide
One reason why we have failed so far to make a good case for increased funding in supercomputing is that we have not yet made a compelling science case.
A better example: “The Quantum Universe” – “It describes a revolution in particle physics and a quantum leap in our understanding of the mystery and beauty of the universe.”

How Science Drives Architecture
State-of-the-art computational science requires increasingly diverse and complex algorithms. Only balanced systems that can perform well on a variety of problems will meet future scientists’ needs! Data-parallel and scalar performance are both important.
[Table: science areas (Nanoscience, Combustion, Fusion, Climate, Astrophysics) marked against the algorithm classes each requires – multi-physics and multi-scale, dense linear algebra, FFTs, particle methods, AMR, data parallelism, and irregular control flow; every area requires most of these classes, and Astrophysics requires all seven.]

Phil Colella’s “Seven Dwarfs”
Algorithms that consume the bulk of the cycles of current high-end systems in DOE:
– Structured Grids
– Unstructured Grids
– Fast Fourier Transform
– Dense Linear Algebra
– Sparse Linear Algebra
– Particles
– Monte Carlo
(Should also include optimization/solution of nonlinear systems, which at the high end is something one uses mainly in conjunction with the other seven.)
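
For readers less familiar with these categories, here is a minimal illustration of two of the dwarfs – a structured-grid stencil sweep and a sparse matrix-vector product – written in Python/NumPy rather than the Fortran/MPI used by the applications discussed later. It is a sketch of the access patterns only, not code from any application in this talk.

```python
import numpy as np
from scipy.sparse import random as sparse_random

# "Structured grids" dwarf: a 2D 5-point Jacobi relaxation sweep.
def jacobi_sweep(u):
    """One Jacobi iteration on the interior of a 2D grid (fixed boundaries)."""
    v = u.copy()
    v[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                            u[1:-1, :-2] + u[1:-1, 2:])
    return v

u = np.zeros((64, 64))
u[0, :] = 1.0              # heated top boundary
u = jacobi_sweep(u)        # regular, data-parallel memory access

# "Sparse linear algebra" dwarf: a sparse matrix-vector product.
A = sparse_random(1000, 1000, density=0.01, format="csr")
x = np.ones(1000)
y = A @ x                  # SpMV: irregular, memory-bound access pattern
```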

“Evaluation of Leading Superscalar and Vector Architectures for Scientific Computations”
Leonid Oliker, Andrew Canning, Jonathan Carter (LBNL); Stephane Ethier (PPPL)
(see the SC04 paper)

Materials Science: PARATEC
– PARATEC performs first-principles quantum mechanical total energy calculations using pseudopotentials and a plane wave basis set
– Density Functional Theory (DFT) is used to calculate the structure and electronic properties of new materials
– DFT calculations are one of the largest consumers of supercomputer cycles in the world
– PARATEC uses an all-band conjugate gradient (CG) approach to obtain the electron wavefunctions
– Part of the calculation is done in real space and part in Fourier space, using a specialized 3D FFT to transform the wavefunctions
– Generally obtains a high percentage of peak performance on different platforms
– Developed with Louie’s and Cohen’s groups (UCB, LBNL) and Raczkowski

PARATEC: Code Details
– Code written in Fortran 90 and MPI (~50,000 lines)
– Runtime is roughly 33% 3D FFT, 33% BLAS3, 33% hand-coded F90
– Global communication occurs in the 3D FFT (transpose)
– The 3D FFT is handwritten to minimize communication and reduce latency (written on top of the vendor-supplied 1D complex FFT)
– The code has a setup phase, then performs many (~50) CG steps to converge the charge density of the system (the performance data shown is for 5 CG steps and does not include setup)

PARATEC: 3D FFT
– 3D FFT done via 3 sets of 1D FFTs and 2 transposes
– Most communication is in the global transpose from (b) to (c); little communication from (d) to (e)
– Many FFTs are done at the same time to avoid latency issues
– Only non-zero elements are communicated/calculated
– Much faster than the vendor-supplied 3D FFT
[Figure: data layout at stages (a)–(f) of the parallel 3D FFT]
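
As an illustration of the decomposition described above, the NumPy sketch below assembles a 3D FFT from three sets of 1D FFTs separated by transposes. This shows the structure only, not PARATEC's handwritten Fortran/MPI routine: in the parallel code each transpose is the global communication step, and only the non-zero columns of the wavefunction grid are transformed and communicated.

```python
import numpy as np

def fft3d_by_1d(a):
    """3D FFT built from three sets of 1D FFTs and two transposes."""
    a = np.fft.fft(a, axis=2)      # 1D FFTs along z
    a = a.transpose(0, 2, 1)       # transpose: (x, y, z) -> (x, z, y)
    a = np.fft.fft(a, axis=2)      # 1D FFTs along y
    a = a.transpose(2, 1, 0)       # transpose: (x, z, y) -> (y, z, x)
    a = np.fft.fft(a, axis=2)      # 1D FFTs along x
    return a.transpose(2, 0, 1)    # back to the original (x, y, z) ordering

grid = np.random.rand(16, 16, 16).astype(complex)
assert np.allclose(fft3d_by_1d(grid), np.fft.fftn(grid))
```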

PARATEC: Performance
[Table: PARATEC performance for a 432-atom system and a larger system – Gflops per processor and percentage of peak on Power3, Power4, Altix, the Earth Simulator (ES), and the X1 at increasing processor counts. At the smaller concurrencies the ES sustains about 4.7 Gflops/processor (roughly 60% of peak), the Altix about 3.7 Gflops/processor (62%), the X1 about 3.0 Gflops/processor (24%), and the Power4 about 2.0 Gflops/processor (39%); per-processor performance declines as the processor count grows.]

Magnetic Fusion: GTC
– Gyrokinetic Toroidal Code: transport of thermal energy (plasma microturbulence)
– The goal of magnetic fusion is a burning-plasma power plant producing cleaner energy
– GTC solves the gyroaveraged gyrokinetic system with a particle-in-cell (PIC) approach
– PIC scales as N instead of N²: particles interact with the electromagnetic field on a grid
– This allows solving the equations of particle motion as ODEs (instead of nonlinear PDEs)
– Main computational tasks (sketched below):
  – Scatter: deposit particle charge to the nearest grid points
  – Solve the Poisson equation to get the potential at each grid point
  – Gather: calculate the force on each particle from the neighboring potential values
  – Move particles by solving the equations of motion along the characteristics
  – Find particles that moved outside the local domain and update
– Developed at Princeton Plasma Physics Laboratory; vectorized by Stephane Ethier
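
The scatter and gather steps listed above are the core of any particle-in-cell code. Below is a minimal 1D sketch in Python/NumPy, with simple linear weighting and a periodic spectral Poisson solve standing in for GTC's gyro-averaged charge deposition and gyrokinetic field solve; it is illustrative only, not GTC's actual Fortran implementation.

```python
import numpy as np

ncell = 64                     # grid cells on a periodic domain of length 1.0
npart = 64 * 10                # 10 particles per cell, as in the smaller GTC case
dx = 1.0 / ncell
dt = 0.1

x = np.random.rand(npart)      # particle positions
v = np.zeros(npart)            # particle velocities

def scatter(x):
    """Deposit unit charge to the two nearest grid points (linear weighting)."""
    rho = np.zeros(ncell)
    left = np.floor(x / dx).astype(int) % ncell
    w = x / dx - np.floor(x / dx)           # fractional distance past the left point
    np.add.at(rho, left, 1.0 - w)           # indirect (irregular) memory access
    np.add.at(rho, (left + 1) % ncell, w)
    return rho

def solve_poisson(rho):
    """Periodic 1D Poisson solve in Fourier space: -phi'' = rho - <rho>."""
    k = 2.0 * np.pi * np.fft.fftfreq(ncell, d=dx)
    rho_hat = np.fft.fft(rho - rho.mean())
    phi_hat = np.zeros_like(rho_hat)
    phi_hat[1:] = rho_hat[1:] / k[1:] ** 2
    return np.fft.ifft(phi_hat).real

def gather(phi, x):
    """Interpolate the field E = -dphi/dx back to the particle positions."""
    E = -(np.roll(phi, -1) - np.roll(phi, 1)) / (2.0 * dx)
    left = np.floor(x / dx).astype(int) % ncell
    w = x / dx - np.floor(x / dx)
    return (1.0 - w) * E[left] + w * E[(left + 1) % ncell]

# one PIC step: scatter -> field solve -> gather -> move
rho = scatter(x)
phi = solve_poisson(rho)
v += dt * gather(phi, x)
x = (x + dt * v) % 1.0         # periodic "move"; GTC also migrates particles between domains here
```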

GTC: Performance
[Table: GTC performance – Gflops per processor and percentage of peak on Power3, Power4, Altix, the Earth Simulator (ES), and the X1 for 10 particles per cell (20M particles) and 100 particles per cell (200M particles). On the larger case the ES reaches about 1.6 Gflops/processor (roughly 20% of peak) and the X1 about 1.5 Gflops/processor (12%), while the superscalar platforms stay near 0.3 Gflops/processor (5-6% of peak).]
GTC is now scaling to 2048 processors on the ES for a total of 3.7 Tflop/s.

Issues in Applications Scaling
Status of applications in 2005:
– a few Teraflop/s sustained performance
– scaled to, at most, a few thousand processors
Applications on petascale systems will need to deal with:
– 100,000 processors (assume a nominal Petaflop/s system with 100,000 processors of 10 Gflop/s each)
– multi-core processors
– a topology-sensitive interconnection network

Integrated Performance Monitoring (IPM)
– brings together multiple sources of performance metrics into a single profile that characterizes the overall performance and resource usage of the application
– maintains low overhead by using a unique hashing approach which allows a fixed memory footprint and minimal CPU usage
– open source, relies on portable software technologies, and is scalable to thousands of tasks
– developed by David Skinner at NERSC
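
A rough sketch of the fixed-footprint hashing idea follows. This is illustrative only (IPM itself is a C library; the names below are invented for this example): each event is keyed by a small signature such as (call name, message size), hashed into a fixed-size table, and only counts, times, and byte totals are accumulated, so memory use stays bounded no matter how long the application runs.

```python
import time
from collections import defaultdict

TABLE_SIZE = 4096   # fixed number of slots -> fixed memory footprint

class EventProfile:
    """Accumulate (count, total time, total bytes) per hashed event signature."""

    def __init__(self):
        # at most TABLE_SIZE small records, regardless of how many events occur
        self.table = defaultdict(lambda: [0, 0.0, 0])

    def record(self, call_name, nbytes, elapsed):
        key = hash((call_name, nbytes)) % TABLE_SIZE
        slot = self.table[key]
        slot[0] += 1          # event count
        slot[1] += elapsed    # accumulated wall-clock time
        slot[2] += nbytes     # accumulated message volume

profile = EventProfile()

def timed_call(call_name, nbytes, fn, *args, **kwargs):
    """Wrap a communication call the way a PMPI-style interposer would."""
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    profile.record(call_name, nbytes, time.perf_counter() - t0)
    return result

# tiny self-contained demo with a dummy "communication" call;
# hypothetical mpi4py usage: timed_call("MPI_Send", buf.nbytes, comm.Send, buf, dest)
timed_call("dummy_send", 1024, sum, range(1000))
print(dict(profile.table))
```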

Scaling Portability: Profoundly Interesting
A high-level description of the performance of a well-known cosmology code on four well-known architectures.

16-way for 4 seconds (about 20 timestamps per second per task × 1…4 contextual variables)

64-way for 12 seconds

256-way for 36 seconds

Application Topology
– 1024-way MILC
– 1024-way MADCAP
– 336-way FVCAM
If the interconnect is topology-sensitive, mapping will become an issue (again).

Interconnect Topology

DARPA HPCS will characterize applications

APEX-Map: A Synthetic Benchmark to Explore the Space of Application Performances
Erich Strohmaier, Hongzhang Shan
Future Technologies Group, LBNL
Co-sponsored by DOE/SC and NSA

APEX-Map characterizes architectures through a synthetic benchmark

APEX-Map Sequential
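
The APEX-Map slides are figures in the original talk. As a rough stand-in, the sketch below shows the kind of kernel such a locality-probing synthetic benchmark runs: memory is accessed in contiguous blocks of length L (spatial locality) whose starting addresses follow a power-law distribution with exponent alpha (temporal reuse). The parameter names and the exact kernel are assumptions made for this illustration, not APEX-Map's actual source code.

```python
import time
import numpy as np

def locality_kernel(M=1 << 22, L=16, alpha=0.5, n_accesses=1 << 16, seed=0):
    """Time a stream of block accesses with tunable spatial/temporal locality.

    M           data set size in 8-byte words
    L           contiguous block length per access (spatial locality)
    alpha       power-law exponent; small alpha -> accesses concentrated near the
                start of the array (high reuse), alpha ~ 1 -> near-uniform access
    """
    data = np.ones(M)
    rng = np.random.default_rng(seed)
    # power-law distributed start indices in [0, M - L)
    starts = ((M - L) * rng.random(n_accesses) ** (1.0 / alpha)).astype(np.int64)
    t0 = time.perf_counter()
    acc = 0.0
    for s in starts:
        acc += data[s:s + L].sum()      # one block access of length L
    elapsed = time.perf_counter() - t0
    return n_accesses * L / elapsed     # accessed words per second

# sweep the two locality parameters, as the APEX-Map plots do
for L in (1, 16, 256):
    for alpha in (0.05, 0.5, 1.0):
        rate = locality_kernel(L=L, alpha=alpha)
        print(f"L={L:4d} alpha={alpha:4.2f} rate={rate:.3e} words/s")
```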

Parallel APEX-Map

Summary
– Three sets of tools (application benchmarks, performance monitoring, and quantitative architecture characterization) have been shown to provide critical insight into application performance.
– We need better quantitative data and measurements (like the ones discussed here) to help applications scale to the next generation of platforms.