Multidimensional Blocking in UPC
Christopher Barton, Călin Caşcaval, George Almási, Rahul Garg, José Nelson Amaral, Montse Farreras
University of Alberta, IBM, Universitat Politècnica de Catalunya
October 11, 2007. © 2007 IBM Corporation

Overview
 Current data distribution in UPC
 Multiblock extension to UPC
 Locality analysis
 Results
 Conclusions and future work

Unified Parallel C
Parallel extension to C
 – Allows the declaration of shared variables
 – Programmer specifies the shared data distribution
 – Work-sharing construct used to distribute loop iterations
 – Synchronization primitives: barriers, fences, and locks
Data can be private, shared local, or shared remote
 – Shared data can be accessed by all threads
 – The shared space is divided into portions, each with affinity to one thread
 – Latency to access local shared data is usually lower than latency to access remote shared data
All threads execute the same program (SPMD style)
Convenient programming model for clusters of SMPs
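As a point of reference (an illustration added here, not part of the slides), a minimal UPC fragment that touches these pieces, a shared array, the upc_forall work-sharing construct, and a barrier, might look like the following; the array name and size are arbitrary.

#include <upc.h>

shared int A[100*THREADS];   /* shared array; default blocking: one element per thread, cyclic */

int main(void) {
    /* all threads run this same code (SPMD); the affinity expression
       &A[i] assigns iteration i to the thread that owns element A[i] */
    upc_forall (int i = 0; i < 100*THREADS; i++; &A[i])
        A[i] = MYTHREAD;

    upc_barrier;             /* synchronization primitive */
    return 0;
}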

Data Distributions in UPC
Data distributions are an essential part of a PGAS language
Affinity directives are already in the language:
 – shared [B] int A[N]
This distribution is not particularly well-suited for applications in specific domains
 – Does not allow an easy interface with existing high-performance libraries
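For contrast with the multidimensional extension introduced next, the sketch below (illustrative only; assumes the program runs with THREADS = 4) spells out what the existing one-dimensional blocking factor expresses.

/* Existing UPC blocking: a single blocking factor.
   With THREADS = 4, blocks of 4 consecutive elements are dealt round-robin:
     A[0..3]  -> thread 0,  A[4..7]   -> thread 1,
     A[8..11] -> thread 2,  A[12..15] -> thread 3,
     A[16..19] -> thread 0, and so on.
   There is no way to declare a true two-dimensional tile of a 2-D array;
   that is what the multiblocking extension adds. */
shared [4] int A[8*THREADS];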

Multidimensional Blocking in UPC
Extend the shared data layout directive to support multidimensional distributions
Programmer specifies n-dimensional tiles for shared data:

shared [2][2] int X[10][10];   /* [2][2]: new blocking factor; [10][10]: array size */

Blocked matrix multiply written with the extension:

#define BLK_SIZE (sizeof(double)*b*b)
shared [b][b] double A[M][P], B[P][N], C[M][N];

upc_forall (int ii = 0; ii < M; ii += b; continue)
  upc_forall (int jj = 0; jj < N; jj += b; &C[ii][jj]) {
    double scratchA[b][b], scratchB[b][b];
    upc_memget(scratchA, &A[ii][jj], BLK_SIZE);
    upc_memget(scratchB, &B[ii][jj], BLK_SIZE);
    dgemm(&C[ii][jj], &scratchA, &scratchB, b, b, b);
  }

Drawbacks to Multidimensional Blocking
Accessing shared data becomes more expensive
 – Computing indices requires sums and products across each shared dimension
The compiler can identify accesses to local shared data and privatize them
 – Removes the overhead of calls to the runtime library
 – Exposes the indexing computations to the compiler so they can be optimized
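To make the cost concrete, here is a hypothetical sketch of the arithmetic a runtime would need to locate an element of a two-dimensionally blocked shared array. It assumes tiles are dealt round-robin across the threads and stored row-major inside a tile; this models the cost only, and the XL UPC runtime's actual layout and code may differ.

#include <stdio.h>

/* Hypothetical address computation for element A[i][j] of
   shared [b0][b1] double A[M][N]. */
typedef struct { int thread; long offset; } upc_addr_t;

static upc_addr_t locate(long i, long j, long N,
                         long b0, long b1, int threads) {
    long bi = i / b0, bj = j / b1;                /* tile coordinates        */
    long tiles_per_row = (N + b1 - 1) / b1;
    long tile_id = bi * tiles_per_row + bj;       /* linearized tile number  */

    upc_addr_t a;
    a.thread = (int)(tile_id % threads);          /* owning thread           */
    a.offset = (tile_id / threads) * (b0 * b1)    /* earlier tiles on owner  */
             + (i % b0) * b1 + (j % b1);          /* position inside tile    */
    return a;
}

int main(void) {
    /* element X[3][7] of shared [2][2] int X[10][10] on 4 threads */
    upc_addr_t a = locate(3, 7, 10, 2, 2, 4);
    printf("thread %d, offset %ld\n", a.thread, a.offset);
    return 0;
}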

Locality Analysis
Goal: compute symbolically the node ID of each shared reference in the upc_forall loop and compare it to the node ID of the affinity expression.
A shared reference inside the loop has a displacement with respect to the affinity expression
 – Specified by the distance vector k = [k_0, k_1, ..., k_{n-1}]

upc_forall (int i = 0; i < N; i++; &A[i])
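For example (using the code that appears on the following slides): with affinity expression &A[v0][v1] and the reference A[v0+1][v1+2] inside the loop body, the displacement is the distance vector k = [1,2].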

Locality Analysis
Observation 1
 – A block shifted by a distance vector k can span at most two threads in each dimension
 – Locality can change in only one place in that dimension, which we call the cut
Observation 2
 – Only need to test the locality at the 2^n corners of the block
 – All elements from a corner to the cut have the same owner

Identifying Local Shared Accesses
For a given upc_forall loop
 – Split the loop into multiple iteration spaces where the shared reference is either local or remote
 – Use the cuts to determine the split points
 – For each iteration space, keep a position relative to the affinity test
Local shared references are privatized
Remote shared references are transformed into calls to the UPC runtime

Computing Cuts
[Figure: layout of shared [2][3] int A[8][9] across 8 threads (2 threads per node), with the blocks touched by the reference shifted by k = [1,2] highlighted]
k = [1,2]

/* 8 threads and 2 threads/node */
shared [2][3] int A[8][9];
for (v0 = 0; v0 < 7; v0++) {
  upc_forall (v1 = 0; v1 < 6; v1++; &A[v0][v1]) {
    A[v0+1][v1+2] = v0*v1;
  }
}

Computing Cuts (dimensions 0 through n-2)
For every dimension except the last:
 cut_i = b_i - k_i % b_i
Example (k = [1,2], shared [2][3] int A[8][9], 8 threads, 2 threads/node):
 cut_0 = 2 - 1 % 2 = 1
(Code and thread layout as on the previous slide.)

Computing Cuts (dimension n-1)
For the last dimension, two cuts are needed:
 cut_upper = (tpn - course(k))  * b_{n-1} - k_{n-1} % b_{n-1}
 cut_lower = (tpn - course(k')) * b_{n-1} - k_{n-1} % b_{n-1}
where tpn is the number of threads per node and k' = [k_0 + b_0 - 1, k_1].
Example (k = [1,2], so k' = [2,2]; b = [2,3], tpn = 2):
 cut_upper = (2 - 0) * 3 - 2 % 3 = 4
 cut_lower = (2 - 1) * 3 - 2 % 3 = 1
(Code and thread layout as on the previous slides.)
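The following compact sketch (my own illustration, not from the slides) plugs the worked numbers into the formulas above. The course() term is modeled here as k_0 / b_0, which is consistent with the example (course([1,2]) = 0 and course([2,2]) = 1); the compiler's exact definition may differ.

#include <stdio.h>

static long course2d(long k0, long b0) { return k0 / b0; }

int main(void) {
    long b0 = 2, b1 = 3;       /* blocking factors of shared [2][3] int A[8][9] */
    long k0 = 1, k1 = 2;       /* distance vector k = [1,2]                     */
    long tpn = 2;              /* threads per node                              */

    long cut0      = b0 - k0 % b0;                              /* dimension 0   */
    long k0p       = k0 + b0 - 1;                               /* k' = [2,2]    */
    long cut_upper = (tpn - course2d(k0, b0))  * b1 - k1 % b1;  /* dimension n-1 */
    long cut_lower = (tpn - course2d(k0p, b0)) * b1 - k1 % b1;

    printf("cut0=%ld cut_upper=%ld cut_lower=%ld\n", cut0, cut_upper, cut_lower);
    /* prints: cut0=1 cut_upper=4 cut_lower=1, matching the slide */
    return 0;
}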

Algorithm – Phase 1: Collection
Collect the shared references in the upc_forall loop and compute the distance vector for each.

shared [5][5] int A[20][20];
for (i = 0; i < 19; i++)
  upc_forall (j = 0; j < 20; j++; &A[i][j]) {
    A[i+1][j] = MYTHREAD;
  }

Configuration: 8 UPC threads, 2 nodes, 4 threads per node.
[Figure: layout of A across the 8 threads]
Distance vector of A[i+1][j] relative to &A[i][j]: [1,0]

Phase 2: Restructure Loop
Starting at the outer loop, iteratively compute the cuts for each loop.
Outer loop: cut_i = b_i - k_i % b_i, so cut_0 = 5 - 1 % 5 = 4
Distance vector: [1,0]; 8 UPC threads, 2 nodes, 4 threads per node.
(Code and layout as on the previous slide.)

Phase 2: Restructure Loop
Inner loops: the offset is 0, so cut_upper and cut_lower are 0.

shared [5][5] int A[20][20];
for (i = 0; i < 19; i++)
  if ((i % 5) < 4) {
    upc_forall (j = 0; j < 20; j++; &A[i][j]) {
      A[i+1][j] = MYTHREAD;
    }
  }
  else {
    upc_forall (j = 0; j < 20; j++; &A[i][j]) {
      A[i+1][j] = MYTHREAD;
    }
  }

8 UPC threads, 2 nodes, 4 threads per node. Distance vector: [1,0].

Phase 3: Privatize Local Accesses
Determine, for each region of the restructured loop, whether the shared reference is local or remote:
 – Region position [0,0]: reference position [1,0], node ID 0 (local)
 – Region position [4,0]: reference position [5,0], node ID 1 (remote)
8 UPC threads, 2 nodes, 4 threads per node. Distance vector: [1,0].
(Code as on the previous slide.)

Phase 3: Privatize Local Accesses
The local reference is privatized; the remote reference is left to the runtime.

shared [5][5] int A[20][20];
for (i = 0; i < 19; i++)
  if ((i % 5) < 4) {
    upc_forall (j = 0; j < 20; j++; &A[i][j]) {
      offset = ComputeOffset(i, j);
      *(base_A + offset) = MYTHREAD;   /* privatized local access */
    }
  }
  else {
    upc_forall (j = 0; j < 20; j++; &A[i][j]) {
      A[i+1][j] = MYTHREAD;            /* remains a shared (remote) access */
    }
  }

8 UPC threads, 2 nodes, 4 threads per node. Distance vector: [1,0].
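ComputeOffset and base_A are abbreviations on the slide; the sketch below is one hypothetical expansion for this particular [5][5] layout, assuming the element is owned by the accessing thread and using the same round-robin tile model as the earlier sketch. The compiler's actual node-level computation may differ.

/* Hypothetical expansion of the slide's ComputeOffset for
   shared [5][5] int A[20][20] and the reference A[i+1][j].
   Assumes the element is owned by the accessing thread and that tiles
   are dealt round-robin across THREADS threads, row-major inside a tile. */
long ComputeOffset(long i, long j) {
    long bi = (i + 1) / 5, bj = j / 5;     /* tile coordinates of A[i+1][j]      */
    long tile_id = bi * (20 / 5) + bj;     /* 4 tiles per tile-row               */
    return (tile_id / THREADS) * 25        /* earlier tiles owned by this thread */
         + ((i + 1) % 5) * 5 + (j % 5);    /* element position within the tile   */
}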

Preliminary Results
Machine
 – 4 nodes of a development IBM Squadron cluster
 – Each node: 8 POWER5 SMP processors at 1.9 GHz, 16 GB of memory
Benchmarks
 – Cholesky factorization and matrix multiply
   Showcase multiblocking for shared arrays; the implementation uses the ESSL library for multiplying matrices
 – Dense matrix-vector multiply and 5-point stencil
   Demonstrate the performance impact of locality analysis and privatization

Cholesky Factorization
[Performance chart] Theoretical peak: 6.9 GFlops x TPN x nodes
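As a worked example of this formula, at the machine's maximum configuration (TPN = 8 threads per node on 4 nodes) the theoretical peak is 6.9 x 8 x 4 = 220.8 GFlops.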

Matrix Multiplication
[Performance chart] Theoretical peak: 6.9 GFlops x TPN x nodes

Locality Analysis and Privatization
[Performance chart] Array: shared [*] double A[14400][14400]; sequential C time: 0.47 s

Locality Analysis and Privatization
[Performance chart] Array: shared [100][100] double A[6000][6000]; sequential C time: 0.17 s

Conclusions
Multiblocked arrays give fine control over data layout
 – Allow the programmer to obtain better performance while simplifying the expression of computations
 – Provide the ability to integrate with existing libraries (C and Fortran) that require specific memory layouts
Locality analysis
 – The compiler is able to reduce the overheads introduced by accessing shared arrays

Future Work
Multiblocked arrays
 – Add processor tiling to increase the programmer's ability to write code that scales to a large number of processors
 – Define a set of collectives, optimized for the UPC programming model, to address scalability issues
Compiler
 – Reduce the overhead of computing indices for privatized accesses
 – Investigate how traditional loop optimizations can be applied to upc_forall loops to further improve performance

Questions?