
Programming with Tiles
Jia Guo, Ganesh Bikshandi*, Basilio B. Fraguela+, Maria J. Garzaran, David Padua
University of Illinois at Urbana-Champaign
*IBM, India    +Universidade da Coruña, Spain

2 Motivation
The importance of tiles:
- A natural way to express many algorithms
- Partitioning data is an effective way to enhance locality
- Pervasive in parallel computing
Limitations of today's programming languages:
- No programming construct to express tiles directly
- Complicated indexing and loop structure to traverse an array by tiles (see the sketch below)
- No support for storage by tile
- The high-level design is mixed with the detailed implementation
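The "complicated indexing and loop structure" is easy to see in a conventional language. A minimal sketch in plain C++, written for this transcript (the row-major layout, tile size B, and function name are illustrative, not HTA code):

    #include <vector>

    // Visit an N x N row-major matrix tile by tile: two loops select the
    // tile, two more walk its elements, and the flat index must be
    // recomputed at every access.
    void scale_by_tiles(std::vector<double>& a, int N, int B, double s) {
        for (int ti = 0; ti < N; ti += B)                      // tile row
            for (int tj = 0; tj < N; tj += B)                  // tile column
                for (int i = ti; i < ti + B && i < N; ++i)     // row in tile
                    for (int j = tj; j < tj + B && j < N; ++j) // column in tile
                        a[i * N + j] *= s;
    }

HTA collapses the two outer loops and the index arithmetic into direct tile indexing.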

3 Contributions
- Our group developed Hierarchically Tiled Arrays (HTAs) to support tiles that enhance locality and parallelism (PPoPP'06)
- Designed new language constructs based on our experience:
  - Dynamic partitioning
  - Overlapped tiling
- Evaluated both productivity and performance

4 Outline
- Introduction to HTA
- Dynamic partitioning
- Overlapped tiling
- Impact on programming productivity
- Related work
- Conclusions

5 HTA overview
(Figure: indexing examples on a 2 x 2 tiled array A.)
- A(0,0): select a single tile
- A(1, 0:1): select a range of tiles (tile row 1, tile columns 0 to 1)
- A(0,1)[0,1]: select a scalar element inside tile (0,1)
- A op B, e.g. A + B: element-by-element operation on conforming HTAs
- A(0, 0:1) = B(1, 0:1): assignment between tile ranges
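To make the two indexing levels concrete, here is a toy analogue in plain C++ (not the HTA library; the class and method names are made up for illustration):

    #include <array>

    // A fixed two-level array: TR x TC tiles, each holding R x C elements.
    template <typename T, int TR, int TC, int R, int C>
    struct TiledArray {
        std::array<std::array<std::array<std::array<T, C>, R>, TC>, TR> t;

        // Tile-level access, the analogue of A(i, j):
        std::array<std::array<T, C>, R>& tile(int i, int j) { return t[i][j]; }

        // Element access within a tile, the analogue of A(i, j)[k, l]:
        T& elem(int i, int j, int k, int l) { return t[i][j][k][l]; }
    };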

6 HTA library implementation
- Implemented in MATLAB and in C++
- Supports sequential and parallel execution, on top of MPI and TBB
- Supports linear and non-linear data layouts
- Execution model: SPMD
- Programming model: single-threaded

7 Outline
- Introduction to HTA
- Dynamic partitioning
- Overlapped tiling
- Impact on programming productivity
- Related work
- Conclusions

8 Why dynamic partitioning?
- Some linear algebra algorithms add and remove partitions to sweep through matrices
- Cache-oblivious algorithms create partitions following a divide-and-conquer strategy (see the sketch below)
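As an illustration of the divide-and-conquer case, a cache-oblivious matrix-matrix multiply sketched in plain C++ (names and the base-case size are illustrative; it assumes n and the initial block size m are powers of two, and it accumulates into a zero-initialized C):

    // Multiply the m x m blocks of row-major n x n matrices A and B whose
    // top-left corners are (ai,aj) and (bi,bj), accumulating into C at (ci,cj).
    void mmm(const double* A, const double* B, double* C, int n,
             int ai, int aj, int bi, int bj, int ci, int cj, int m) {
        if (m <= 32) {                      // base case: small enough to cache
            for (int i = 0; i < m; ++i)
                for (int k = 0; k < m; ++k)
                    for (int j = 0; j < m; ++j)
                        C[(ci + i) * n + (cj + j)] +=
                            A[(ai + i) * n + (aj + k)] * B[(bi + k) * n + (bj + j)];
            return;
        }
        int h = m / 2;                      // split every operand into quadrants
        for (int p = 0; p < 2; ++p)         // quadrant row of C
            for (int q = 0; q < 2; ++q)     // quadrant column of C
                for (int r = 0; r < 2; ++r) // inner dimension
                    mmm(A, B, C, n, ai + p * h, aj + r * h,
                        bi + r * h, bj + q * h, ci + p * h, cj + q * h, h);
    }

With HTA, the same recursion is expressed by dynamically partitioning the arrays instead of threading six offsets through every call.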

9 Syntax and semantics
Partitions are defined by partition lines. Two methods manipulate them:
  void part(Tuple sourcePartition, Tuple offset);
  void rmPart(Tuple partition);
For example, A.part((1,1), (2,2)) adds a new partition at offset (2,2) from partition line (1,1), and A.rmPart((1,1)) removes that partition. NONE marks a dimension that is left unpartitioned.

10 LU algorithm with dynamic partitioning
(Figure: the matrix is split into tiles A11, A12, A21, A22; the factored part is marked "done", the rest "partially updated".)
In the loop, at the beginning of each iteration A.part((1,1), (nb, nb)) carves out the next nb x nb diagonal block; the tiles are then updated; at the end of the iteration A.rmPart((1,1)) removes the partition. The partitioning thus sweeps through the matrix.

11 LU algorithm represented in FLAME [van de Geijn et al.]
(Figure: the FLAME algorithm, which repartitions the matrix and applies the update UPD in the loop.)

12 From algorithm to HTA program
The FLAME algorithm and its HTA implementation:

    void lu(HTA A, HTA p, int nb) {
      A.part((0,0), (0,0));
      p.part((0), (0));
      while (A(0,0).lsize(1) < A.lsize(1)) {
        int b = min(A(1,1).lsize(0), nb);
        A.part((1,1), (b,b));
        p.part((1), (b));
        dgetf2(A(1:2,1), p(1));
        dlaswp(A(1:2,0), p(1));
        dlaswp(A(1:2,2), p(1));
        trsm(HtaRight, HtaUpper, HtaNoTrans, HtaUnit, One, A(1,1), A(1,2));
        gemm(HtaNoTrans, HtaNoTrans, MinusOne, A(2,1), A(1,2), One, A(2,2));
        A.rmPart((1,1));
        p.rmPart((1));
      }
    }

13 FLAME API vs. HTA
The HTA operations corresponding to the FLAME repartitioning:
  A.part((0,0), (0,0));
  A.part((1,1), (b,b));
  A.rmPart((1,1));
  A(1:2, 1);
Advantages of HTA: (1) more general, (2) fewer variables, (3) simpler indexing, (4) flexible range selection.

14 A cache-oblivious algorithm: parallel merge
Given sorted inputs in1 and in2 and output out:
1. Take the middle element of in1
2. Find its lower bound in in2
3. Calculate the corresponding partition point in out
4. Partition the three arrays (logically)
5. Merge the two halves in parallel (see the sketch below)
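The same algorithm as a compact recursive sketch in plain C++ (this is not the HTA or TBB code of the next slide; out must be a random-access iterator):

    #include <algorithm>

    template <typename It, typename Out>
    void pmerge(It f1, It l1, It f2, It l2, Out out) {
        if (f1 == l1) { std::copy(f2, l2, out); return; }
        if (f2 == l2) { std::copy(f1, l1, out); return; }
        It m1 = f1 + (l1 - f1) / 2;               // 1. middle element of in1
        It m2 = std::lower_bound(f2, l2, *m1);    // 2. its lower bound in in2
        Out mo = out + (m1 - f1) + (m2 - f2);     // 3. partition point in out
        *mo = *m1;                                // 4. logical partition
        pmerge(f1, m1, f2, m2, out);              // 5. the two halves are
        pmerge(m1 + 1, l1, m2, l2, mo + 1);       //    independent, so they
    }                                             //    can merge in parallel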

15 HTA vs. Threading Building Blocks (TBB)
HTA code:
  map(PMerge(), out, in1, in2);
TBB code:
  parallel_for(PMergeRange(begin1, end1, begin2, end2, out), PMergeBody());
HTA partitions the tiles and operates on them in parallel; TBB recursively splits the merge range and operates on the smaller ranges in parallel (see the sketch below).
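For the TBB side, a sketch of what PMergeRange can look like, assuming it follows TBB's splittable-Range concept (it mirrors the parallel-merge range from the TBB documentation; the threshold and bodies are illustrative):

    #include <algorithm>
    #include <tbb/tbb.h>

    struct PMergeRange {
        const int *b1, *e1, *b2, *e2;  // remaining pieces of the sorted inputs
        int *out;                      // destination of this piece of the merge

        PMergeRange(const int* b1_, const int* e1_, const int* b2_,
                    const int* e2_, int* out_)
            : b1(b1_), e1(e1_), b2(b2_), e2(e2_), out(out_) {}

        bool empty() const { return b1 == e1 && b2 == e2; }
        bool is_divisible() const { return (e1 - b1) + (e2 - b2) > 1024; }

        // Splitting constructor: performs steps 1-3 of the previous slide.
        // r keeps the lower halves; the new range takes the upper halves.
        PMergeRange(PMergeRange& r, tbb::split) {
            if (r.e1 - r.b1 < r.e2 - r.b2) {      // always split the longer input
                std::swap(r.b1, r.b2);
                std::swap(r.e1, r.e2);
            }
            const int* m1 = r.b1 + (r.e1 - r.b1) / 2;
            const int* m2 = std::lower_bound(r.b2, r.e2, *m1);
            b1 = m1; e1 = r.e1;
            b2 = m2; e2 = r.e2;
            out = r.out + (m1 - r.b1) + (m2 - r.b2);
            r.e1 = m1; r.e2 = m2;
        }
    };

    struct PMergeBody {
        void operator()(const PMergeRange& r) const {
            std::merge(r.b1, r.e1, r.b2, r.e2, r.out);  // leaf-level merge
        }
    };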

16 Evaluation

Benchmark       Category                   Execution mode  Platform
MMM             Cache-oblivious algorithm  Sequential      3.0 GHz Intel Pentium 4,
Recursive LU    Cache-oblivious algorithm  Sequential      16 KB L1, 1 MB L2,
Dynamic LU      FLAME's algorithm          Sequential      1 GB RAM
Sylvester       FLAME's algorithm          Sequential
Parallel merge  Cache-oblivious algorithm  Parallel        Two quad-core 2.66 GHz Xeon processors

17 (Figures: performance results for recursive MMM, recursive LU, dynamic LU, and Sylvester.)

18 (Figure: performance results for parallel merge.)

19 Outline
- Introduction to HTA
- Dynamic partitioning
- Overlapped tiling
- Impact on programming productivity
- Related work
- Conclusions

20 Motivation
- Programs whose computations are based on neighboring points (e.g., iterative PDE solvers) benefit from tiling
- Tiling them requires shadow regions around the tiles
- Problem with the current approaches: the allocation and update of the shadow regions is explicit
- Example: lines of code of NAS MG (table comparing the MPI, CAF, OpenMP, and HTA versions of comm3, residual, inversion, and projection)

21 Overlapped tiling
Objectives:
- Automate the allocation and update of shadow regions
- Allow access to neighboring points across tile boundaries
The programmer specifies the overlapped region at creation time (see the sketch below):
  Overlap ol = Overlap(Tuple negativeDir, Tuple positiveDir, BoundaryMode mode, bool autoUpdate = true);
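What the library automates can be spelled out by hand. A 1D sketch in plain C++ (illustrative names, not HTA code; one shadow cell per side, as in an Overlap of (1), (1)):

    #include <vector>

    struct Tile {
        std::vector<double> cells;                    // b owned cells + 2 shadows
        explicit Tile(int b) : cells(b + 2, 0.0) {}
        double& owned(int i) { return cells[i + 1]; } // i in [0, b)
    };

    // Manual shadow-region update: the step that autoUpdate removes.
    // The outer shadows of the first and last tiles would be filled
    // according to the BoundaryMode (e.g., zero or periodic).
    void update_shadows(std::vector<Tile>& t, int b) {
        for (std::size_t k = 0; k < t.size(); ++k) {
            if (k > 0)            t[k].cells[0]     = t[k - 1].owned(b - 1);
            if (k + 1 < t.size()) t[k].cells[b + 1] = t[k + 1].owned(0);
        }
    }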

22 Example of overlapped tiling
  Tuple::seq tiling = ((4), (3));
  Overlap ol((1), (1), zero);
  A = HTA<double,1>::alloc(1, tiling, array, ROW, ol);
(Figure: each tile is allocated with one overlap cell on each side; the boundary cells are filled with zeros.)

23 Indexing and operations
Within an overlapped tile T of four owned elements:
- T[0:3] = T[ALL]: the owned region
- T[-1] and T[4]: shadow cells beyond the owned region
Operations (+, -, map, =, etc.): the conformability rules apply only to the owned regions, which enables operations with non-overlapped HTAs.

24 Shadow region consistency
- Ensures that shadow regions are properly updated and consistent with the corresponding data in the owned tiles
- Uses an update-on-read policy, bookkeeping the status of each tile (see the sketch below)
- In SPMD mode, no communication is needed
- Allows both manual and automatic updates
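A toy illustration of the bookkeeping in plain C++ (all names are made up; the real mechanism is internal to the HTA library):

    struct OverlappedTile {
        OverlappedTile* left = nullptr;   // neighbor links (1D for simplicity)
        OverlappedTile* right = nullptr;
        bool shadows_stale = true;        // my shadow copies may be out of date

        void write_owned() {
            // Writing my owned cells invalidates the neighbors' shadow copies.
            if (left)  left->shadows_stale = true;
            if (right) right->shadows_stale = true;
        }

        void read_with_shadows() {
            if (shadows_stale) {          // update on read: refresh lazily,
                // ...copy boundary cells from the neighbors' owned regions...
                shadows_stale = false;    // at most once per run of reads
            }
            // ...consume owned + shadow cells...
        }
    };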

25 Evaluation

Benchmark             Description                         Execution mode
Sequential 3D Jacobi  Stencil computation on 6 neighbors  Sequential
NAS MG                Multigrid V-cycle algorithm         Parallel
NAS LU                Navier-Stokes equation solver       Parallel

Platform: a cluster of 128 nodes, each with two 2 GHz G5 processors and 4 GB of RAM, connected by Myrinet; we used one processor per node. The NAS code (Fortran + MPI) was compiled with g77 and the HTA code with g++ 3.3; the -O3 flag was used in both cases.

26 MG comm3 in HTA

Without overlapped tiling:

    void comm3(Grid& u) {
      int NX = u.shape().size()[0] - 1;
      int NY = u.shape().size()[1] - 1;
      int NZ = u.shape().size()[2] - 1;
      int nx = u(T(0,0,0)).shape().size()[0] - 1;
      int ny = u(T(0,0,0)).shape().size()[1] - 1;
      int nz = u(T(0,0,0)).shape().size()[2] - 1;

      // north-south
      Traits::Default::async();
      if (NX > 0)
        u((R(0, NX-1), R(0, NY), R(0, NZ)))((R(nx, nx), R(1, ny-1), R(1, nz-1))) =
          u((R(1, NX), R(0, NY), R(0, NZ)))((R(1, 1), R(1, ny-1), R(1, nz-1)));
      u((R(NX, NX), R(0, NY), R(0, NZ)))((R(nx, nx), R(1, ny-1), R(1, nz-1))) =
          u((R(0, 0), R(0, NY), R(0, NZ)))((R(1, 1), R(1, ny-1), R(1, nz-1)));
      if (NX > 0)
        u((R(1, NX), R(0, NY), R(0, NZ)))((R(0, 0), R(1, ny-1), R(1, nz-1))) =
          u((R(0, NX-1), R(0, NY), R(0, NZ)))((R(nx-1, nx-1), R(1, ny-1), R(1, nz-1)));
      u((R(0, 0), R(0, NY), R(0, NZ)))((R(0, 0), R(1, ny-1), R(1, nz-1))) =
          u((R(NX, NX), R(0, NY), R(0, NZ)))((R(nx-1, nx-1), R(1, ny-1), R(1, nz-1)));
      Traits::Default::sync();

      // east-west
      Traits::Default::async();
      if (NY > 0)
        u((R(0, NX), R(0, NY-1), R(0, NZ)))((R(0, nx), R(ny, ny), R(1, nz-1))) =
          u((R(0, NX), R(1, NY), R(0, NZ)))((R(0, nx), R(1, 1), R(1, nz-1)));
      u((R(0, NX), R(NY, NY), R(0, NZ)))((R(0, nx), R(ny, ny), R(1, nz-1))) =
          u((R(0, NX), R(0, 0), R(0, NZ)))((R(0, nx), R(1, 1), R(1, nz-1)));
      if (NY > 0)
        u((R(0, NX), R(1, NY), R(0, NZ)))((R(0, nx), R(0, 0), R(1, nz-1))) =
          u((R(0, NX), R(0, NY-1), R(0, NZ)))((R(0, nx), R(ny-1, ny-1), R(1, nz-1)));
      u((R(0, NX), R(0, 0), R(0, NZ)))((R(0, nx), R(0, 0), R(1, nz-1))) =
          u((R(0, NX), R(NY, NY), R(0, NZ)))((R(0, nx), R(ny-1, ny-1), R(1, nz-1)));
      Traits::Default::sync();

      // front-back
      Traits::Default::async();
      if (NZ > 0)
        u((R(0, NX), R(0, NY), R(0, NZ-1)))((R(0, nx), R(0, ny), R(nz, nz))) =
          u((R(0, NX), R(0, NY), R(1, NZ)))((R(0, nx), R(0, ny), R(1, 1)));
      u((R(0, NX), R(0, NY), R(NZ, NZ)))((R(0, nx), R(0, ny), R(nz, nz))) =
          u((R(0, NX), R(0, NY), R(0, 0)))((R(0, nx), R(0, ny), R(1, 1)));
      if (NZ > 0)
        u((R(0, NX), R(0, NY), R(1, NZ)))((R(0, nx), R(0, ny), R(0, 0))) =
          u((R(0, NX), R(0, NY), R(0, NZ-1)))((R(0, nx), R(0, ny), R(nz-1, nz-1)));
      u((R(0, NX), R(0, NY), R(0, 0)))((R(0, nx), R(0, ny), R(0, 0))) =
          u((R(0, NX), R(0, NY), R(NZ, NZ)))((R(0, nx), R(0, ny), R(nz-1, nz-1)));
      Traits::Default::sync();
    }

With overlapped tiling, the whole routine reduces to declaring the overlap once:

    Overlap<3>* ol = new Overlap<3>(T(1,1,1), T(1,1,1), PERIODIC);

27 MG comm3 in NAS (Fortran + MPI)

          subroutine comm3(u,n1,n2,n3,kk)
          implicit none
          include 'mpinpb.h'
          include 'globals.h'
          integer n1, n2, n3, kk
          double precision u(n1,n2,n3)
          integer axis
          if( .not. dead(kk) )then
             do axis = 1, 3
                if( nprocs .ne. 1) then
                   call ready( axis, -1, kk )
                   call ready( axis, +1, kk )
                   call give3( axis, +1, u, n1, n2, n3, kk )
                   call give3( axis, -1, u, n1, n2, n3, kk )
                   call take3( axis, -1, u, n1, n2, n3 )
                   call take3( axis, +1, u, n1, n2, n3 )
                else
                   call comm1p( axis, u, n1, n2, n3, kk )
                endif
             enddo
          else
             call zero3(u,n1,n2,n3)
          endif
          return
          end

          subroutine give3( axis, dir, u, n1, n2, n3, k )
          implicit none
          include 'mpinpb.h'
          include 'globals.h'
          integer axis, dir, n1, n2, n3, k, ierr
          double precision u( n1, n2, n3 )
          integer i3, i2, i1, buff_len, buff_id
          buff_id = 2 + dir
          buff_len = 0
          if( axis .eq. 1 )then
             if( dir .eq. -1 )then
                do i3=2,n3-1
                   do i2=2,n2-1
                      buff_len = buff_len + 1
                      buff(buff_len,buff_id) = u( 2, i2, i3)
                   enddo
                enddo
                call mpi_send(
         >           buff(1, buff_id), buff_len, dp_type,
         >           nbr( axis, dir, k ), msg_type(axis,dir),
         >           mpi_comm_world, ierr)
             else if( dir .eq. +1 ) then
                do i3=2,n3-1
                   do i2=2,n2-1
                      buff_len = buff_len + 1
                      buff(buff_len, buff_id) = u( n1-1, i2, i3)
                   enddo
                enddo
                call mpi_send(
         >           buff(1, buff_id), buff_len, dp_type,
         >           nbr( axis, dir, k ), msg_type(axis,dir),
         >           mpi_comm_world, ierr)
             endif
          endif
          if( axis .eq. 2 )then
             if( dir .eq. -1 )then
                do i3=2,n3-1
                   do i1=1,n1
                      buff_len = buff_len + 1
                      buff(buff_len, buff_id) = u( i1, 2, i3)
                   enddo
                enddo
                call mpi_send(
         >           buff(1, buff_id), buff_len, dp_type,
         >           nbr( axis, dir, k ), msg_type(axis,dir),
         >           mpi_comm_world, ierr)
             else if( dir .eq. +1 ) then
                do i3=2,n3-1
                   do i1=1,n1
                      buff_len = buff_len + 1
                      buff(buff_len, buff_id) = u( i1, n2-1, i3)
                   enddo
                enddo
                call mpi_send(
         >           buff(1, buff_id), buff_len, dp_type,
         >           nbr( axis, dir, k ), msg_type(axis,dir),
         >           mpi_comm_world, ierr)
             endif
          endif
          if( axis .eq. 3 )then
             if( dir .eq. -1 )then
                do i2=1,n2
                   do i1=1,n1
                      buff_len = buff_len + 1
                      buff(buff_len, buff_id) = u( i1, i2, 2)
                   enddo
                enddo
                call mpi_send(
         >           buff(1, buff_id), buff_len, dp_type,
         >           nbr( axis, dir, k ), msg_type(axis,dir),
         >           mpi_comm_world, ierr)
             else if( dir .eq. +1 ) then
                do i2=1,n2
                   do i1=1,n1
                      buff_len = buff_len + 1
                      buff(buff_len, buff_id) = u( i1, i2, n3-1)
                   enddo
                enddo
                call mpi_send(
         >           buff(1, buff_id), buff_len, dp_type,
         >           nbr( axis, dir, k ), msg_type(axis,dir),
         >           mpi_comm_world, ierr)
             endif
          endif
          return
          end

          subroutine take3( axis, dir, u, n1, n2, n3 )
          implicit none
          include 'mpinpb.h'
          include 'globals.h'
          integer axis, dir, n1, n2, n3
          double precision u( n1, n2, n3 )
          integer buff_id, indx
          integer status(mpi_status_size), ierr
          integer i3, i2, i1
          call mpi_wait( msg_id( axis, dir, 1 ), status, ierr)
          buff_id = 3 + dir
          indx = 0
          if( axis .eq. 1 )then
             if( dir .eq. -1 )then
                do i3=2,n3-1
                   do i2=2,n2-1
                      indx = indx + 1
                      u(n1,i2,i3) = buff(indx, buff_id)
                   enddo
                enddo
             else if( dir .eq. +1 ) then
                do i3=2,n3-1
                   do i2=2,n2-1
                      indx = indx + 1
                      u(1,i2,i3) = buff(indx, buff_id)
                   enddo
                enddo
             endif
          endif
          if( axis .eq. 2 )then
             if( dir .eq. -1 )then
                do i3=2,n3-1
                   do i1=1,n1
                      indx = indx + 1
                      u(i1,n2,i3) = buff(indx, buff_id)
                   enddo
                enddo
             else if( dir .eq. +1 ) then
                do i3=2,n3-1
                   do i1=1,n1
                      indx = indx + 1
                      u(i1,1,i3) = buff(indx, buff_id)
                   enddo
                enddo
             endif
          endif
          if( axis .eq. 3 )then
             if( dir .eq. -1 )then
                do i2=1,n2
                   do i1=1,n1
                      indx = indx + 1
                      u(i1,i2,n3) = buff(indx, buff_id)
                   enddo
                enddo
             else if( dir .eq. +1 ) then
                do i2=1,n2
                   do i1=1,n1
                      indx = indx + 1
                      u(i1,i2,1) = buff(indx, buff_id)
                   enddo
                enddo
             endif
          endif
          return
          end

          subroutine ready( axis, dir, k )
          implicit none
          include 'mpinpb.h'
          include 'globals.h'
          integer axis, dir, k
          integer buff_id, buff_len, i, ierr
          buff_id = 3 + dir
          buff_len = nm2
          do i=1,nm2
             buff(i,buff_id) = 0.0D0
          enddo
          msg_id(axis,dir,1) = msg_type(axis,dir) + 1000*me
          call mpi_irecv( buff(1,buff_id), buff_len,
         >     dp_type, nbr(axis,-dir,k), msg_type(axis,dir),
         >     mpi_comm_world, msg_id(axis,dir,1), ierr)
          return
          end

28 (Figures: performance results for 3D Jacobi, NAS MG class C, NAS LU class B, and NAS LU class C.)

29 Outline
- Introduction to HTA
- Dynamic partitioning
- Overlapped tiling
- Impact on programming productivity
- Related work
- Conclusions

30 Three metrics
- Programming effort [Halstead, 1977]
  - Program volume V: a function of the number of distinct operators and operands and of their total numbers of occurrences
  - Potential volume V*: the volume of the most succinct form in a language that has the required operations already defined or implemented
- Program complexity [McCabe, 1976]: C = P + 1, where P is the number of decision points in the program
- Source lines of code L
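For reference, Halstead's standard definitions (background, not spelled out on the slide): with n1 distinct operators, n2 distinct operands, and N1, N2 their total numbers of occurrences,

    V = (N1 + N2) * log2(n1 + n2)

and the programming effort is E = V / L_h = V^2 / V*, where L_h = V*/V is Halstead's program level (unrelated to the line count L above).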

31 Evaluation
(Table: programming effort, program complexity, and lines of code for each program version. The HTA version has the lowest programming effort in every comparison:)
  staticLU:  HTA 61,… | NDS 208,… | LAPACK 160,…
  dynamicLU: HTA 51,… | FLAME 170,…
  recLU:     HTA 85,… | ATLAS 186,…
  Sylvester: HTA 423,… | FLAME 700,629

32 Related work
- FLAME API [Bientinesi et al., 2005]: ad-hoc notations
- Sequoia [Fatahalian et al., 2006]: the principal construct is the task
- HPF [Hiranandani et al., 1992] and Co-Array Fortran [Numrich and Reid, 1998]: tiles are used for distribution; the different levels of the memory hierarchy are not addressed
- POOMA [Reynders, 1996]: tiles and shadow regions are accessed as a whole
- Global Arrays [Nieplocha et al., 2006]: SPMD programming model

33 Conclusion
- HTA makes tiles part of a language and provides a generalized framework to express tiles
- It increases productivity: less index calculation, fewer variables and loops, simpler function interfaces
- Dynamic partitioning and overlapped tiling extend this support
- Little performance degradation


35 Code example: 2D Jacobi
(Figure: the same 2D Jacobi kernel written without and with overlapped tiling; see the sketch below.)
With overlapped tiling, HTA takes care of the allocation of the shadow regions and of data consistency, and the indexing syntax stays clean.
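For the "without" side, a plain C++ sketch of what must be written by hand for a 1D strip decomposition (illustrative names, not HTA code; each strip stores its owned rows plus halo rows 0 and rows+1):

    #include <vector>

    using Strip = std::vector<std::vector<double>>;  // (rows + 2) x cols

    void jacobi_step(std::vector<Strip>& strips, int rows, int cols) {
        // 1. Manual shadow-region update: exchange boundary rows.
        for (std::size_t s = 0; s + 1 < strips.size(); ++s) {
            strips[s][rows + 1] = strips[s + 1][1]; // lower halo <- neighbor top
            strips[s + 1][0]    = strips[s][rows];  // upper halo <- neighbor bottom
        }
        // 2. Stencil update on owned rows only; halos are read, never written.
        for (Strip& u : strips) {
            Strip v = u;
            for (int i = 1; i <= rows; ++i)
                for (int j = 1; j + 1 < cols; ++j)
                    v[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1]);
            u = v;
        }
    }

With overlapped tiling, step 1 disappears (the library refreshes the shadows under the update-on-read policy) and only the stencil update remains.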

36 Two types of stencil computation
- Concurrent computation: every tile can be executed independently (e.g., Jacobi, MG); the shadow regions are updated in every iteration
- Wavefront computation: the tiles must be executed in a certain order (e.g., LU, SSOR); with the update-on-read policy, a minimal number of communications is achieved, e.g., a shadow region update only in the second iteration (see the sketch below)
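For the wavefront case, the tile-level ordering can be sketched in plain C++ (illustrative, not HTA API): if tile (i, j) depends on (i-1, j) and (i, j-1), the tiles on the same anti-diagonal are independent.

    #include <algorithm>

    template <typename F>
    void wavefront(int nti, int ntj, F process_tile) {
        for (int d = 0; d < nti + ntj - 1; ++d) {  // sweep the anti-diagonals
            int lo = std::max(0, d - (ntj - 1));
            int hi = std::min(nti - 1, d);
            for (int i = lo; i <= hi; ++i)         // these tiles have no mutual
                process_tile(i, d - i);            // dependences, so this loop
        }                                          // may run in parallel
    }

Usage: wavefront(4, 4, [](int i, int j) { /* update tile (i, j) */ });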