Course Outline
- Introduction to algorithms and applications
- Parallel machines and architectures
  - Overview of parallel machines, trends in the top-500, clusters, many-cores
- Programming methods, languages, and environments
  - Message passing (SR, MPI, Java)
  - Higher-level language: HPF
- Applications
  - N-body problems, search algorithms
- Many-core (GPU) programming (Rob van Nieuwpoort)

N-Body Methods
Source: "Load Balancing and Data Locality in Adaptive Hierarchical N-Body Methods: Barnes-Hut, Fast Multipole, and Radiosity" by Singh, Holt, Totsuka, Gupta, and Hennessy (except Sections , 4.2, 9, and 10)

N-body problems
- Given are N bodies (molecules, stars, ...)
- The bodies exert forces on each other (Coulomb, gravity, ...)
- Problem: simulate the behavior of the system over time
- Many applications:
  - Astrophysics (stars in a galaxy)
  - Plasma physics (ions/electrons)
  - Molecular dynamics (atoms/molecules)
  - Computer graphics (radiosity)

Example from current practice
- Leiden astrophysics group (Simon Portegies-Zwart)
- AMUSE: Astrophysical Multipurpose Software Environment
- Couple different independent simulations into one multi-model simulation
  - Gravity, radiative transport, stellar evolution, ...
  - Fortran code written in 1960
  - CUDA code written yesterday

Basic N-body algorithm

  for each timestep do
    Compute forces between all bodies
    Compute new positions and velocities
  od

- O(N^2) compute time per timestep
- Too expensive for realistic problems (e.g., galaxies)
- Barnes-Hut is an O(N log N) algorithm for hierarchical N-body problems
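The basic loop can be sketched in plain Python. This is an illustration, not from the slides: it assumes gravity with G = 1 by default, a simple list-of-dicts body representation, and a basic Euler-style position/velocity update.

```python
import math

def step_direct(bodies, dt, G=1.0):
    """One timestep of the basic O(N^2) algorithm: compute all pairwise
    forces, then update velocities and positions. Each body is a dict
    with 'pos', 'vel' (3-element lists) and 'mass'."""
    n = len(bodies)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Vector from body i to body j, and its length
            d = [bodies[j]['pos'][k] - bodies[i]['pos'][k] for k in range(3)]
            r = math.sqrt(sum(c * c for c in d)) or 1e-12  # guard coincident bodies
            # Gravitational attraction F = G * m_i * m_j / r^2, directed along d
            f = G * bodies[i]['mass'] * bodies[j]['mass'] / (r * r)
            for k in range(3):
                forces[i][k] += f * d[k] / r
    # Simple Euler-style update of velocities and positions
    for i, b in enumerate(bodies):
        for k in range(3):
            b['vel'][k] += dt * forces[i][k] / b['mass']
            b['pos'][k] += dt * b['vel'][k]
```

The two nested loops over all bodies are exactly where the O(N^2) cost per timestep comes from.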

Hierarchical N-body problems
- Exploit the physics of many applications:
  - Forces fall off very rapidly with the distance between bodies
  - Long-range interactions can be approximated
- Key idea: a group of distant bodies is approximated by a single body with the same mass and center-of-mass

Data structure
- Octree (3D) or quadtree (2D): hierarchical representation of physical space
- Building the tree:
  - Start with one cell containing all bodies (bounding box)
  - Recursively split cells with multiple bodies into sub-cells
- Example (Fig. 5 from the paper)
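A minimal sketch of the recursive splitting, using the 2D quadtree for brevity. The tuple-based body representation and the capacity-1 split rule are illustrative assumptions; note that in this sketch a body lying exactly on a cell's upper boundary falls into the neighbouring cell.

```python
def build_quadtree(bodies, box, capacity=1):
    """Recursively build a quadtree (the 2D analogue of the octree):
    start with one cell (bounding box) holding all bodies, and split any
    cell with more than `capacity` bodies into four sub-cells.
    `box` is (xmin, ymin, size); bodies are (x, y, mass) tuples."""
    xmin, ymin, size = box
    if len(bodies) <= capacity:
        return {'box': box, 'bodies': bodies, 'children': []}
    half = size / 2.0
    children = []
    for dx in (0.0, half):
        for dy in (0.0, half):
            sub = (xmin + dx, ymin + dy, half)
            # Keep only the bodies that fall inside this sub-cell
            inside = [b for b in bodies
                      if sub[0] <= b[0] < sub[0] + half
                      and sub[1] <= b[1] < sub[1] + half]
            if inside:  # empty sub-cells are not created
                children.append(build_quadtree(inside, sub, capacity))
    return {'box': box, 'bodies': bodies, 'children': children}
```

For the 3D octree the recursion is identical, except that each split produces up to eight sub-cells.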

Barnes-Hut algorithm

  for each timestep do
    Build tree
    Compute center-of-mass for each cell
    Compute forces between all bodies
    Compute new positions and velocities
  od

- Building the tree: recursive algorithm (can be parallelized)
- Center-of-mass: upward pass through the tree
- Compute forces: 90% of the time
- Update positions and velocities: simple (given the forces)
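The center-of-mass upward pass might look like the following Python sketch. The dict-based cell layout (a 'children' list, with (x, y, mass) 'bodies' at the leaves) is an assumption for illustration, not the paper's data structure.

```python
def compute_com(cell):
    """Upward pass through the tree: a cell's total mass is the sum of
    its children's masses, and its center-of-mass is the mass-weighted
    average of their centers. Leaf cells use their bodies directly.
    Returns (mass, (cx, cy)) and also stores them on the cell."""
    if cell.get('children'):
        # Interior cell: recurse first, so children are done before the parent
        parts = [compute_com(c) for c in cell['children']]
    else:
        # Leaf cell: each body contributes its own mass and position
        parts = [(m, (x, y)) for (x, y, m) in cell['bodies']]
    mass = sum(m for m, _ in parts)
    cx = sum(m * p[0] for m, p in parts) / mass
    cy = sum(m * p[1] for m, p in parts) / mass
    cell['mass'], cell['com'] = mass, (cx, cy)
    return mass, (cx, cy)
```

Because each cell only needs its children's results, the pass visits every cell once, bottom-up.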

Force computation of Barnes-Hut

  for each body B do
    B.force := ComputeForce(tree.root, B)
  od

  function ComputeForce(cell, B): float;
    if distance(B, cell.CenterOfMass) > threshold then
      return DirectForce(B.position, B.Mass, cell.CenterOfMass, cell.Mass)
    else
      sum := 0.0
      for each subcell C in cell do
        sum +:= ComputeForce(C, B)
      return sum
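A runnable Python version of the pseudocode, for 2D with G = 1. The dict-based cell layout, the `direct_force` helper, and the extra check for childless cells are illustrative additions; the slide's fixed-distance threshold stands in for the usual cell-size/distance opening criterion used in practice.

```python
import math

def direct_force(body, com, cell_mass):
    """Force on `body` from a point mass `cell_mass` at `com` (G = 1)."""
    bx, by, m = body
    dx, dy = com[0] - bx, com[1] - by
    r = math.hypot(dx, dy)
    f = m * cell_mass / (r * r)
    return (f * dx / r, f * dy / r)

def compute_force(cell, body, threshold):
    """ComputeForce from the slide: if the body is far from the cell's
    center-of-mass, approximate the whole cell by a single point mass;
    otherwise recurse into the sub-cells and sum their contributions.
    A cell is a dict with 'mass', 'com', and a 'children' list;
    `body` is (x, y, mass)."""
    bx, by, _ = body
    d = math.hypot(cell['com'][0] - bx, cell['com'][1] - by)
    if d > threshold or not cell['children']:
        # Far away (or a leaf): treat the cell as one body
        return direct_force(body, cell['com'], cell['mass'])
    # Nearby: open the cell and sum over its sub-cells
    fx = fy = 0.0
    for sub in cell['children']:
        sfx, sfy = compute_force(sub, body, threshold)
        fx, fy = fx + sfx, fy + sfy
    return (fx, fy)
```

Distant cells are never opened, which is what brings the total cost down to O(N log N).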

Parallelizing Barnes-Hut
- Distribute the bodies over all processors
- In each timestep, processors work on different bodies
- Communication/synchronization is needed during:
  - Tree building
  - Center-of-mass computation
  - Force computation
- The key problem is efficient parallelization of the force computation
- Issues:
  - Load balancing
  - Data locality

Load balancing
- Goal: each processor must get the same amount of work
- Problem: the amount of work per body differs widely

Data locality
- Goal:
  - Each CPU must access a few bodies many times
  - This reduces communication overhead
- Problems:
  - Access patterns to bodies are not known in advance
  - The distribution of bodies in space changes (slowly)

Simple distribution strategies
- Distribute iterations (iteration = computations on 1 body in 1 timestep)
- Strategy 1: static distribution
  - Each processor gets an equal number of iterations
- Strategy 2: dynamic distribution
  - Distribute iterations dynamically
- Problems:
  - Distributing iterations does not take locality into account
  - Static distribution leads to load imbalances

More advanced distribution strategies
- Load balancing: cost model
  - Associate a computational cost with each body
  - Cost = amount of work (number of interactions) in the previous timestep
  - Each processor gets the same total cost
  - Works well because the system changes slowly
- Data locality: costzones
  - Observation: the octree more or less represents the spatial (physical) distribution of the bodies
  - Thus: partition the tree, not the iterations
  - Costzone: a contiguous zone of costs
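The cost-model idea can be sketched as a single pass over the bodies in costzone (tree) order, cutting the list into contiguous zones of roughly equal total cost. This is a simplification of the paper's scheme; the function name and greedy cutting rule are illustrative.

```python
def costzone_partition(costs, nprocs):
    """Cut a list of per-body costs (interaction counts from the previous
    timestep, listed in tree order) into `nprocs` contiguous zones of
    roughly equal total cost. Returns one list of body indices per zone."""
    total = sum(costs)
    target = total / nprocs  # ideal cost per processor
    zones, zone, acc = [], [], 0.0
    for i, c in enumerate(costs):
        zone.append(i)
        acc += c
        # Close the zone once it reaches the target, keeping the last
        # processor as the catch-all for the remaining bodies
        if acc >= target and len(zones) < nprocs - 1:
            zones.append(zone)
            zone, acc = [], 0.0
    zones.append(zone)
    return zones
```

Because the zones are contiguous in tree order, each processor's bodies are also close together in space, which is what gives costzones their locality.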

Example costzones
- Optimization: improve locality using a clever child-numbering scheme

Experimental system: DASH
- DASH multiprocessor
  - Designed at Stanford University
  - One of the first NUMA (Non-Uniform Memory Access) machines
- DASH architecture
  - Memory is physically distributed
  - The programmer sees a shared address space
  - Hardware moves data between processors and caches it
  - Uses a directory-based cache coherence protocol

DASH prototype
- 48-node DASH system
  - 12 clusters of 4 processors (MIPS R3000) each
  - Shared bus within each cluster
  - Mesh network between clusters
  - Remote reads are 4x more expensive than local reads
- Also built a simulator
  - More flexible than the real hardware
  - Much slower

Performance results on DASH
- Costzones reduce load imbalance and communication overhead
- Moderate improvement in speedups on DASH
  - Low communication/computation ratio

Speedups measured on DASH (fig. 17)

Different versions of Barnes-Hut
- Static
  - Each CPU gets an equal number of bodies, partitioned arbitrarily
- Load balance
  - Use replicated workers (job queue, as with TSP) to do dynamic load balancing, sending 1 body at a time
- Space locality
  - Each CPU gets part of the physical space
- Costzones
  - As discussed above
- (Ignore ORB)

Simulator statistics (figure 18)

Conclusions
- Parallelizing an efficient O(N log N) algorithm is much harder than parallelizing an O(N^2) algorithm
- Barnes-Hut has nonuniform, dynamically changing behavior
- Key issues to obtain good speedups for Barnes-Hut:
  - Load balancing -> cost model
  - Data locality -> costzones
- Optimizations exploit physical properties of the application