Global Address Space Applications Kathy Yelick NERSC/LBNL and U.C. Berkeley

Algorithm Space
Algorithms arranged along two axes, regularity and reuse:
- Two-sided dense linear algebra
- One-sided dense linear algebra
- FFTs
- Sparse iterative solvers
- Sparse direct solvers
- Asynchronous discrete event simulation
- Gröbner basis (“Symbolic LU”)
- Search
- Sorting

Scaling Applications
Machine Parameters
- Floating point performance
  - Application dependent, not theoretical peak
- Amount of memory per processor
  - Use 1/10th for algorithm data
- Communication overhead
  - Time the processor is busy sending a message
  - Cannot be overlapped
- Communication latency
  - Time across the network (can be overlapped)
- Communication bandwidth
  - Single node and bisection
Back-of-the-envelope calculations! (see the sketch below)
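As one illustration of such a back-of-the-envelope calculation, the sketch below combines these machine parameters into a crude per-step time estimate. The parameter values and the helper name estimate_time are assumptions made up for the example, not numbers from the talk.

    #include <stdio.h>

    /* Illustrative machine parameters (assumed values, not from the talk). */
    #define FLOP_RATE  1e9     /* sustained flop/s per processor                   */
    #define OVERHEAD   2e-6    /* seconds the CPU is busy per message (no overlap) */
    #define BANDWIDTH  1e8     /* bytes/s per node                                 */

    /* Time = computation + non-overlappable communication cost;
     * network latency is assumed to be hidden by overlap. */
    static double estimate_time(double flops, double messages, double bytes)
    {
        return flops / FLOP_RATE + messages * OVERHEAD + bytes / BANDWIDTH;
    }

    int main(void)
    {
        /* e.g. one solver step: 2 Mflop, 1000 messages, 400 KB moved */
        printf("estimated step time: %g s\n", estimate_time(2e6, 1e3, 4e5));
        return 0;
    }

Plugging in larger message counts quickly shows the overhead term dominating, which is why overhead (not just bandwidth) appears as a first-class machine parameter here.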

Running Sparse MVM on a Pflop
- 1 GHz × 8 pipes × 8 ALUs/pipe = 64 GFLOPS/node peak
- 8 address generators limit performance to 16 GFLOPS (see the kernel sketch below)
- 500 ns latency, 1-cycle put/get overhead, 100-cycle message-passing overhead
- Programmability differences too: packing vs. global address space
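The peak figure is just the product 1 GHz × 8 pipes × 8 ALUs/pipe = 64 GFLOPS per node; the 8 address generators cap the achieved rate far lower because sparse MVM is dominated by indexed loads rather than arithmetic. A generic CSR sparse matrix-vector multiply kernel is sketched below as a reminder of where those addresses come from; it is an illustration, not code from the talk.

    #include <stdio.h>

    /* Generic CSR sparse matrix-vector multiply, y = A*x.
     * Every multiply-add needs indexed loads of val[k], col[k], and x[col[k]],
     * so address generation rather than floating point is the bottleneck. */
    static void spmv_csr(int n, const int *rowptr, const int *col,
                         const double *val, const double *x, double *y)
    {
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
                sum += val[k] * x[col[k]];   /* indirect access to x */
            y[i] = sum;
        }
    }

    int main(void)
    {
        /* 2x2 example: A = [[4, 1], [0, 3]], x = [1, 2]  =>  y = [6, 6] */
        int    rowptr[] = {0, 2, 3};
        int    col[]    = {0, 1, 1};
        double val[]    = {4, 1, 3};
        double x[]      = {1, 2}, y[2];
        spmv_csr(2, rowptr, col, val, x, y);
        printf("y = [%g, %g]\n", y[0], y[1]);
        return 0;
    }

In the distributed case x is spread across processors, so the x[col[k]] accesses become the fine-grained remote reads (packed into messages under MPI, or issued as one-sided gets in a global address space) that the last bullet refers to.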

Effect of Memory Size
Low overhead is important for
- Small-memory nodes or smaller problem sizes (see the estimate below)
- Programmability
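One way to see why (an illustrative argument, not spelled out on the slide): assume a 3D grid computation in which each of P processors owns n^3/P points of an n^3 global grid and exchanges only its subdomain surface each step. Then, in LaTeX notation,

    \frac{T_{\text{comm}}}{T_{\text{comp}}} \;\propto\; \frac{(n^3/P)^{2/3}}{n^3/P} \;=\; \left(\frac{P}{n^3}\right)^{1/3}

so shrinking the per-processor problem (less memory per node, or a smaller problem on the same machine) raises the communication-to-computation ratio, and the fixed per-message overhead becomes a larger share of the total time.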

Parallel Applications in Titanium
- Genome application
- Heart simulation
- AMR elliptic and hyperbolic solvers
- Scalable Poisson solver for infinite domains
- Several smaller benchmarks: EM3D, MatMul, LU, FFT, Join

MOOSE Application
Problem: microarray construction
- Used for genome experiments
- Possible medical applications long-term
Microarray Optimal Oligo Selection Engine (MOOSE)
- A parallel engine for selecting the best oligonucleotide sequences for genetic microarray testing
- Uses dynamic load balancing within Titanium

Heart Simulation
Problem: compute blood flow in the heart
- Model as an elastic structure in an incompressible fluid
- “Immersed Boundary Method” [Peskin and McQueen]
- Particle/mesh method stresses communication performance (see the sketch after this slide)
- 20 years of development in the model
- Many other applications: blood clotting, inner ear, insect flight, embryo growth, …
Can be used for design of prosthetics
- Artificial heart valves
- Cochlear implants
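To see why the particle/mesh structure stresses communication, consider the force-spreading step of an immersed-boundary-style method: each fiber point adds its force to nearby fluid grid cells through a smoothed delta function, producing many small, irregular updates to grid locations that, in a distributed run, often live on other processors. The sketch below is a minimal serial illustration with assumed names (spread_force, trilinear weights standing in for Peskin's delta function); it is not the Titanium implementation from the talk.

    #include <math.h>

    #define N 32   /* fluid grid is N x N x N with spacing h = 1/N, periodic */

    /* Spread one fiber point's force fp at position xp onto the grid force
     * field f, using trilinear weights as a simplified smoothed delta. */
    static void spread_force(double f[N][N][N][3], const double xp[3], const double fp[3])
    {
        double h = 1.0 / N, w[3][2];
        int    i0[3];

        for (int d = 0; d < 3; d++) {
            double s = xp[d] / h;          /* position in grid units */
            i0[d]   = (int)floor(s);
            w[d][1] = s - i0[d];           /* weight of upper neighbor */
            w[d][0] = 1.0 - w[d][1];       /* weight of lower neighbor */
        }
        for (int a = 0; a < 2; a++)
            for (int b = 0; b < 2; b++)
                for (int c = 0; c < 2; c++) {
                    int i = (i0[0] + a) % N, j = (i0[1] + b) % N, k = (i0[2] + c) % N;
                    double wt = w[0][a] * w[1][b] * w[2][c];
                    for (int d = 0; d < 3; d++)
                        f[i][j][k][d] += wt * fp[d];   /* scattered grid update */
                }
    }

    int main(void)
    {
        static double f[N][N][N][3];       /* static: too large for the stack */
        double xp[3] = {0.51, 0.25, 0.75}, fp[3] = {0.0, -1.0, 0.0};
        spread_force(f, xp, fp);
        return 0;
    }

Each of those scattered updates may target a grid slab owned by another processor, so the natural communication pattern is many small remote read-modify-writes; this fine-grained, asynchronous traffic is exactly what one-sided put/get in a global address space handles more gracefully than two-sided message passing.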

Scalable Poisson Solver
MLC (Method of Local Corrections) for finite differences, by Balls and Colella
Poisson equation with infinite-domain boundaries
- Arises in astrophysics, some biological systems, etc.
Method is scalable
- Low communication
Performance on
- SP2 (shown) and T3E
- Scaled speedups
- Nearly ideal (flat)
Currently 2D and non-adaptive
Point charge example shown
- Rings & star charges
- Relative error shown: 6.47×10^-9

AMR Gas Dynamics
Developed by McCorquodale and Colella
2D example (3D supported)
- Mach-10 shock on a solid surface at an oblique angle
Future: self-gravitating gas dynamics package

UPC Application Investigations
Pyramid
- 3D mesh generation [Shewchuk]
- 2D version (Triangle) critical in the Quake project
- Written in C, a challenge to parallelize
SuperLU
- Sparse direct solver [Li, Demmel]
- Written in C + MPI or threads
- UPC may enable new algorithmic techniques
N-body simulation
- “Simulating the Universe”

Summary
UPC Killer App should
- Leverage programmability: hard in MPI
- Use fine-grained, irregular, asynchronous communication
Libraries
- Must allow for interfacing with libraries
- MPI libraries, multithreaded libraries, serial libraries
Compilation needs
- High performance on at least one machine
- Portability across many machines