CS267 L10 Sources of Parallelism.1 Demmel Sp 1999 CS 267 Applications of Parallel Computers Lecture 10: Sources of Parallelism and Locality James Demmel.

Slides:

Advertisements

Similar presentations

Sharks and Fishes – The problem

Advertisements

Parallelizing stencil computations Based on slides from David Culler, Jim Demmel, Bob Lucas, Horst Simon, Kathy Yelick, et al., UCB CS267.

Course Outline Introduction in algorithms and applications Parallel machines and architectures Overview of parallel machines, trends in top-500 Cluster.

Game of Life Courtesy: Dr. David Walker, Cardiff University.

CS 584. Review n Systems of equations and finite element methods are related.

ECE669 L4: Parallel Applications February 10, 2004 ECE 669 Parallel Computer Architecture Lecture 4 Parallel Applications.

CSE351/ IT351 Modeling And Simulation Choosing a Mesh Model Dr. Jim Holten.

ECE669 L5: Grid Computations February 12, 2004 ECE 669 Parallel Computer Architecture Lecture 5 Grid Computations.

Avoiding Communication in Sparse Iterative Solvers Erin Carson Nick Knight CS294, Fall 2011.

CS267 L12 Sources of Parallelism(3).1 Demmel Sp 1999 CS 267 Applications of Parallel Computers Lecture 12: Sources of Parallelism and Locality (Part 3)

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display. Parallel Programming in C with MPI and OpenMP Michael J. Quinn.

CS267 L10 Sources of Parallelism.1 Demmel Sp 1999 CS 267 Applications of Parallel Computers Lecture 10: Sources of Parallelism and Locality James Demmel.

CS 240A: Sources of Parallelism in Physical Simulation Based on slides from David Culler, Jim Demmel, Kathy Yelick, et al., UCB CS267.

High Performance Computing 1 Parallelization Strategies and Load Balancing Some material borrowed from lectures of J. Demmel, UC Berkeley.

San Diego Supercomputer Center Performance Modeling and Characterization Lab PMaC Why parallel computing matters (and what makes it hard)  Challenges.

Multiscalar processors

CS267 L11 Sources of Parallelism(2).1 Demmel Sp 1999 CS 267 Applications of Parallel Computers Lecture 11: Sources of Parallelism and Locality (Part 2)

CS240A: Conjugate Gradients and the Model Problem.

1 Sources of Parallelism and Locality in Simulation.

CS267 L24 Solving PDEs.1 Demmel Sp 1999 CS 267 Applications of Parallel Computers Lecture 24: Solving Linear Systems arising from PDEs - I James Demmel.

02/14/2007CS267 Lecture 91 CS 267 Sources of Parallelism and Locality in Simulation Kathy Yelick

Chapter 13 Finite Difference Methods: Outline Solving ordinary and partial differential equations Finite difference methods (FDM) vs Finite Element Methods.

Module on Computational Astrophysics Jim Stone Department of Astrophysical Sciences 125 Peyton Hall : ph :

Parallel Programming: Case Studies Todd C. Mowry CS 495 September 12, 2002.

CS267 L10 Sources of Parallelism.1 Demmel Sp 1999 Sources of Parallelism and Locality Slide source

CS 267 Sources of Parallelism and Locality in Simulation Lecture 4

Domain decomposition in parallel computing Ashok Srinivasan Florida State University COT 5410 – Spring 2004.

Conjugate gradients, sparse matrix-vector multiplication, graphs, and meshes Thanks to Aydin Buluc, Umit Catalyurek, Alan Edelman, and Kathy Yelick for.

Parallel Adaptive Mesh Refinement Combined With Multigrid for a Poisson Equation CRTI RD Project Review Meeting Canadian Meteorological Centre August.

An approach for solving the Helmholtz Equation on heterogeneous platforms An approach for solving the Helmholtz Equation on heterogeneous platforms G.

CS 240A: Parallelism in Physical Simulation

1 Interconnects Shared address space and message passing computers can be constructed by connecting processors and memory unit using a variety of interconnection.

Wim Schoenmaker ©magwel2005 Electromagnetic Modeling of Back-End Structures on Semiconductors Wim Schoenmaker.

Molecular Dynamics Sathish Vadhiyar Courtesy: Dr. David Walker, Cardiff University.

Chapter 3 Parallel Algorithm Design. Outline Task/channel model Task/channel model Algorithm design methodology Algorithm design methodology Case studies.

Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display. Parallel Programming in C with MPI and OpenMP Michael J. Quinn.

High Performance Computing 1 Load-Balancing. High Performance Computing 1 Load-Balancing What is load-balancing? –Dividing up the total work between processes.

1 Introduction to Model Order Reduction Thanks to Jacob White, Kin Sou, Deepak Ramaswamy, Michal Rewienski, and Karen Veroy I.2.a – Assembling Models from.

Computation on meshes, sparse matrices, and graphs Some slides are from David Culler, Jim Demmel, Bob Lucas, Horst Simon, Kathy Yelick, et al., UCB CS267.

Parallel Simulation of Continuous Systems: A Brief Introduction

Lecture 7: Design of Parallel Programs Part II Lecturer: Simon Winberg.

CO1301: Games Concepts Dr Nick Mitchell (Room CM 226) Material originally prepared by Gareth Bellaby.

Elliptic PDEs and the Finite Difference Method

PARLab Parallel Boot Camp Sources of Parallelism and Locality in Simulation Jim Demmel EECS and Mathematics University of California, Berkeley.

1 CS240A: Parallelism in CSE Applications Tao Yang Slides revised from James Demmel and Kathy Yelick

Lecture 4 TTH 03:30AM-04:45PM Dr. Jianjun Hu CSCE569 Parallel Computing University of South Carolina Department of.

CS 484 Designing Parallel Algorithms Designing a parallel algorithm is not easy. There is no recipe or magical ingredient Except creativity We can benefit.

Determine the mathematical models that capture the behavior of an electrical system 1.Elements making up an electrical system 2.First-principles modeling.

CS 484 Load Balancing. Goal: All processors working all the time Efficiency of 1 Distribute the load (work) to meet the goal Two types of load balancing.

Discretization Methods Chapter 2. Training Manual May 15, 2001 Inventory # Discretization Methods Topics Equations and The Goal Brief overview.

CS267 Lecture 41 CS 267 Sources of Parallelism and Locality in Simulation James Demmel and Kathy Yelick

ECE 1747H: Parallel Programming Lecture 2-3: More on parallelism and dependences -- synchronization.

Data Structures and Algorithms in Parallel Computing Lecture 10.

1 Data Structures for Scientific Computing Orion Sky Lawlor /04/14.

CS267 Lecture 41 CS 267 Sources of Parallelism and Locality in Simulation Lecture 4 James Demmel

Parallel Molecular Dynamics A case study : Programming for performance Laxmikant Kale

CS 140: Sparse Matrix-Vector Multiplication and Graph Partitioning

Auburn University

Courtesy: Dr. David Walker, Cardiff University

High Altitude Low Opening?

Auburn University COMP7330/7336 Advanced Parallel and Distributed Computing Mapping Techniques Dr. Xiao Qin Auburn University.

Lecture 19 MA471 Fall 2003.

Finite Element Method To be added later 9/18/2018 ELEN 689.

Jim Demmel EECS and Mathematics University of California, Berkeley

Course Outline Introduction in algorithms and applications

Sathish Vadhiyar Courtesy: Dr. David Walker, Cardiff University

CS6068 Applications: Numerical Methods

Stencil Pattern ITCS 4/5145 Parallel computing, UNC-Charlotte, B. Wilkinson Oct 14, 2014 slides6b.ppt 1.

Stencil Pattern ITCS 4/5145 Parallel computing, UNC-Charlotte, B. Wilkinson StencilPattern.ppt Oct 14,

Parallel Programming in C with MPI and OpenMP

Presentation transcript:

CS267 L10 Sources of Parallelism.1 Demmel Sp 1999 CS 267 Applications of Parallel Computers Lecture 10: Sources of Parallelism and Locality James Demmel

CS267 L10 Sources of Parallelism.2 Demmel Sp 1999 Recap: Parallel Models and Machines °Machine modelsProgramming models shared memory - threads distributed memory - message passing SIMD - data parallel - shared address space °Steps in creating a parallel program decomposition assignment orchestration mapping °Performance in parallel programs try to minimize performance loss from -load imbalance -communication -synchronization -extra work

CS267 L10 Sources of Parallelism.3 Demmel Sp 1999 Outline °Simulation models °A model problem: sharks and fish °Discrete event systems °Particle systems °Lumped systems (Ordinary Differential Equations, ODEs) °(Next time: Partial Different Equations, PDEs)

CS267 L10 Sources of Parallelism.4 Demmel Sp 1999 Simulation Models and A Simple Example

CS267 L10 Sources of Parallelism.5 Demmel Sp 1999 Sources of Parallelism and Locality in Simulation °Real world problems have parallelism and locality Many objects do not depend on other objects Objects often depend more on nearby than distant objects Dependence on distant objects often “simplifies” °Scientific models may introduce more parallelism When continuous problem is discretized, may limit effects to timesteps Far-field effects may be ignored or approximated, if they have little effect °Many problems exhibit parallelism at multiple levels e.g., circuits can be simulated at many levels and within each there may be parallelism within and between subcircuits

CS267 L10 Sources of Parallelism.6 Demmel Sp 1999 Basic Kinds of Simulation °Discrete event systems e.g.,”Game of Life”, timing level simulation for circuits °Particle systems e.g., billiard balls, semiconductor device simulation, galaxies °Lumped variables depending on continuous parameters ODEs, e.g., circuit simulation (Spice), structural mechanics, chemical kinetics °Continuous variables depending on continuous parameters PDEs, e.g., heat, elasticity, electrostatics °A given phenomenon can be modeled at multiple levels °Many simulations combine these modeling techniques

CS267 L10 Sources of Parallelism.7 Demmel Sp 1999 A Model Problem: Sharks and Fish °Illustration of parallel programming Original version (discrete event only) proposed by Geoffrey Fox Called WATOR °Basic idea: sharks and fish living in an ocean rules for movement (discrete and continuous) breeding, eating, and death forces in the ocean forces between sea creatures °6 problems (S&F1 - S&F6) Different sets of rule, to illustrate different phenomena °Available in Matlab, Threads, MPI, Split-C, Titanium, CMF, CMMD, pSather (not all problems in all languages) ° (being updated)

CS267 L10 Sources of Parallelism.8 Demmel Sp 1999 Discrete Event Systems

CS267 L10 Sources of Parallelism.9 Demmel Sp 1999 Discrete Event Systems °Systems are represented as finite set of variables each variable can take on a finite number of values the set of all variable values at a given time is called the state each variable is updated by computing a transition function depending on the other variables °System may be synchronous: at each discrete timestep evaluate all transition functions; also called a finite state machine asynchronous: transition functions are evaluated only if the inputs change, based on an “event” from another part of the system; also called event driven simulation °E.g., functional level circuit simulation state is represented by a set of boolean variables (high & low voltages) set of logical rules defining state transitions (and, or, etc.) synchronous: only interested in state at clock ticks

CS267 L10 Sources of Parallelism.10 Demmel Sp 1999 Sharks and Fish as Discrete Event System °Ocean modeled as a 2D toroidal grid °Each cell occupied by at most one sea creature °S&F3, 4 and 5 are variations on this

CS267 L10 Sources of Parallelism.11 Demmel Sp 1999 The Game of Life (Sharks and Fish 3) °Fish only, no sharks °An new fish is born if a cell is empty exactly 3 (of 8) neighbors contain fish °A fish dies (of overcrowding) if cell contains a fish 4 or more neighboring cells are full °A fish dies (of loneliness) if cell contains a fish less than 2 neighboring cells are full °Other configurations are stable

CS267 L10 Sources of Parallelism.12 Demmel Sp 1999 Parallelism in Sharks and Fish °The simulation is synchronous use two copies of the grid (old and new) the value of each new grid cell depends only on 9 cells (itself plus 8 neighbors) in old grid simulation proceeds in timesteps, where each cell is updated at every timestep °Easy to parallelize using domain decomposition °Locality is achieved by using large patches of the ocean boundary values from neighboring patches are needed °If only cells next to occupied ones are visited (an optimization), then load balance is more difficult. The activities in this system are discrete events P4 P1P2P3 P5P6 P7P8P9

CS267 L10 Sources of Parallelism.13 Demmel Sp 1999 Parallelism in Synchronous Circuit Simulation °Circuit is a graph made up of subcircuits connected by wires component simulations need to interact if they share a wire data structure is irregular (graph), but simulation may be synchronous parallel algorithm is bulk-synchronous -compute subcircuit outputs -propagate outputs to other circuits °Graph partitioning assigns subgraphs to processors Determines parallelism and locality Want even distribution of nodes (load balance) With minimum edge crossing (minimize communication) Nodes and edges may both be weighted by cost NP-complete to partition optimally, but many good heuristics (later lectures) edge crossings = 6 edge crossings = 10

CS267 L10 Sources of Parallelism.14 Demmel Sp 1999 Parallelism in Asynchronous Circuit Simulation °Synchronous simulations may waste time simulate even when the inputs do not change, little internal activity activity varies across circuit °Asynchronous simulations update only when an event arrives from another component no global timesteps, but events contain time stamp Ex: Circuit simulation with delays (events are gates changing) Ex: Traffic simulation (events are cars changing lanes, etc.)

CS267 L10 Sources of Parallelism.15 Demmel Sp 1999 Scheduling Asynchronous Circuit Simulation °Conservative: Only simulate up to (and including) the minimum time stamp of inputs May need deadlock detection if there are cycles in graph, or else “null messages” Ex: Pthor circuit simulator in Splash1 from Stanford °Speculative: Assume no new inputs will arrive and keep simulating, instead of waiting May need to backup if assumption wrong Ex: Parswec circuit simulator of Yelick/Wen Ex: Standard technique for CPUs to execute instructions °Optiziming load balance and locality is difficult Locality means putting tightly coupled subcircuit on one processor Since “active” part of circuit likely to be in a tightly coupled subcircuit, this may be bad for load balance

CS267 L10 Sources of Parallelism.16 Demmel Sp 1999 Particle Systems

CS267 L10 Sources of Parallelism.17 Demmel Sp 1999 Particle Systems °A particle system has a finite number of particles moving in space according to Newton’s Laws (i.e. F = ma ) time is continuous °Examples stars in space with laws of gravity electron beam semiconductor manufacturing atoms in a molecule with electrostatic forces neutrons in a fission reactor cars on a freeway with Newton’s laws plus model of driver and engine °Reminder: many simulations combine techniques such as particle simulations with some discrete events

CS267 L10 Sources of Parallelism.18 Demmel Sp 1999 Forces in Particle Systems °Force on each particle can be subdivided °External force ocean current to sharks and fish world (S&F 1) externally imposed electric field in electron beam °Nearby force sharks attracted to eat nearby fish (S&F 5) balls on a billiard table bounce off of each other Van der Wals forces in fluid °Far-field force fish attract other fish by gravity-like force (S&F 2) gravity, electrostatics, radiosity forces governed by elliptic PDE force = external_force + nearby_force + far_field_force

CS267 L10 Sources of Parallelism.19 Demmel Sp 1999 Parallelism in External Forces °These are the simplest °The force on each particle is independent °Known as “embarrassingly parallel” °Evenly distribute particles on processors Any distribution works Locality is not an issue, no communication °For each particle on processor, apply the external force

CS267 L10 Sources of Parallelism.20 Demmel Sp 1999 Parallelism in Nearby Forces °Nearby forces require interaction => communication °Force may depend on other particles example is collisions simplest algorithm is O(n^2), for all pairs see if they collide °Usual parallel model is to decompose space O(n^2/p) particles per processor if evenly distributed °Challenge 1: interactions of particles near processor boundary need to communicate particles near boundary to neighboring processors surface to volume effect means low communication Which communicates less: squares (as below) or slabs? °Challenge 2: load imbalance, if particles cluster galaxies, electrons hitting a device wall Need to check for collisions between regions

CS267 L10 Sources of Parallelism.21 Demmel Sp 1999 Parallelism in Nearby Forces: Tree Decomposition °To reduce load imbalance, divide space unevenly °Each region contains roughly equal number of particles °Quad tree in 2D, Oct-tree in 3D Example: each square contains at most 3 particles

CS267 L10 Sources of Parallelism.22 Demmel Sp 1999 Parallelism in Far-Field Forces °Far-field forces involve all-to-all interaction => communication °Force depends on all other particles Example is gravity Simplest algorithm is O(n^2) Cannot just decompose space, since far-away particles still have effect °Use more clever algorithms

CS267 L10 Sources of Parallelism.23 Demmel Sp 1999 Far-field forces: Particle-Mesh Methods °One technique for computing far-field effects Superimpose a regular mesh “Move” particles to nearest grid point Use divide-and-conquer algorithm on regular mesh, e.g., -FFT -Multigrid Accuracy depends on how fine the grid is and uniformity of particles

CS267 L10 Sources of Parallelism.24 Demmel Sp 1999 Far-field Forces: Tree Decomposition °Based on approximation a group of far-away particles act like a larger single particle use tree; each node contains approximation of descendents to compute force on some particle -walk over parts of tree -siblings for nearby, (great) aunts/uncles for far away two standard algorithms -Barnes-Hut -Fast Multipole

CS267 L10 Sources of Parallelism.25 Demmel Sp 1999 Lumped Systems ODEs

CS267 L10 Sources of Parallelism.26 Demmel Sp 1999 System of Lumped Variables °Many systems approximated by system of lumped variables each depends on continue parameters °Example: circuit approximate as graph wires are edges nodes are connections between 2 or more wires each edge has resistor, capacitor, inductor or voltage source system is “lumped” because we are not computing the voltage/current at every point in space along a wire, just endpoint (discrete) °Form a system of Ordinary Differential Equations, ODEs

CS267 L10 Sources of Parallelism.27 Demmel Sp 1999 Circuit Example °State of the system is represented by v_n(t) node voltages i_b(t) branch currentsall at time t v_b(t) branch voltages °Equations include Kirchoff’s current Kirchoff’s voltage Ohm’s law Capacitance Inductance °Write as 1 large system of equations 0A0v_n0 A’0-I *i_b =S 0R-Iv_b0 0-I C*d/dt0 0L*d/dtI0

CS267 L10 Sources of Parallelism.28 Demmel Sp 1999 Systems of Lumped Variables °Another example is structural analysis in Civil Eng. Variables are displacement of points in a building Newton and Hook’s (spring) laws apply Static modeling: extert force and determine displacement Dynamic modeling: apply continuous force (earthquake) °The system in these case (and many) will be sparse i.e., most array elements are 0 neither store nor compute on these 0’s

CS267 L10 Sources of Parallelism.29 Demmel Sp 1999 Solving ODEs °Standard techniques are Euler’s method -simple algorithm: sparse matrix vector multiply -need to take very small timesteps Backward Euler’s Method -larger timesteps -more difficult algorithm: solve sparse system °Both cases reduce to sparse matrix problems direct solvers, LU factorization iterative solvers, use sparse matrix-vector multiply

CS267 L10 Sources of Parallelism.30 Demmel Sp 1999 Parallelism in Sparse Matrices °Consider simpler problem of matrix-vector multiply y = A*x, where A is sparse and nxn °Questions which processor store -y[i], x[i], and A[i,j] which processors compute -x[i] = sum from 1 to n or A[i,j] * x[j] °Constraints balance load balance storage minimize communication °Graph partitioning

CS267 L10 Sources of Parallelism.31 Demmel Sp 1999 Partitioning °Relationship between matrix and graph °A “good” partition of the graph has equal number of nodes in each part minimum number of edges crossing between °Can reorder the rows/columns of the matrix by putting all the nodes in one partition together

CS267 L10 Sources of Parallelism.32 Demmel Sp 1999 Matrix Reordering °Rows and columns can be reordered to improve locality parallelism °“Ideal” matrix structure for parallelism: block diagonal p (number of processors) blocks few non-zeros outside these blocks, since these require communication = * P0 P1 P2 P3 P4