Communication Optimizations in Titanium Programs
Jimmy Su

Study of Communication Optimizations

Benchmarks:
- Gups
- Sparse Matvec
- Jacobi
- Particle in Cell

Machines used in experiments:
- Seaborg (IBM SP)
- Millennium

Hand Optimizations

- Prefetching (moving reads up)
- Moving syncs down

The C code generated by the Titanium compiler is modified manually to apply these optimizations; a sketch of both transformations follows below.
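A minimal sketch of the two transformations, written in plain Java for readability rather than in the generated C: CompletableFuture stands in for the runtime's nonblocking communication handles, and remoteReadAsync is a hypothetical stand-in for a remote Titanium array read.

```java
import java.util.Arrays;
import java.util.concurrent.CompletableFuture;

public class PrefetchSketch {
    // Stands in for data living on a remote processor.
    static final double[] remoteData = {10.0, 20.0, 30.0, 40.0};

    // Hypothetical nonblocking remote read: returns a handle immediately.
    static CompletableFuture<Double> remoteReadAsync(int index) {
        return CompletableFuture.supplyAsync(() -> remoteData[index]);
    }

    public static void main(String[] args) {
        int[] idx = {3, 1, 0, 2};
        double[] local = new double[idx.length];

        // Unoptimized: each remote read blocks at its point of use.
        //   for (int i = 0; i < idx.length; i++)
        //       local[i] = remoteReadAsync(idx[i]).join();

        // Prefetching: issue all reads up front (reads moved up) ...
        @SuppressWarnings("unchecked")
        CompletableFuture<Double>[] pending = new CompletableFuture[idx.length];
        for (int i = 0; i < idx.length; i++)
            pending[i] = remoteReadAsync(idx[i]);

        // ... so that independent local work overlaps with communication ...
        double localWork = 0.0;
        for (int i = 0; i < 1000; i++) localWork += i;

        // ... and each wait happens only at the point of use (syncs moved down).
        for (int i = 0; i < idx.length; i++)
            local[i] = pending[i].join();

        System.out.println(Arrays.toString(local) + " localWork=" + localWork);
    }
}
```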

Characteristics of the Benchmarks

- Source code was not optimized
- There are more remote reads than remote writes
- Source code uses small messages instead of pack/unpack

Observations

Pros:
- Hand optimization does pay off
  - Gups: 14% speedup
  - Jacobi: 5% speedup
  - Sparse Matvec: 45% speedup

Cons:
- The optimizations can only be done automatically on regular problems
  - Alias analysis is too conservative
- An alternative solution for regular problems uses array copy
  - Titanium has highly optimized array copy routines (see the sketch below)
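As an illustration of why array copy helps on regular problems, here is a minimal sketch in plain Java: System.arraycopy stands in for Titanium's optimized array copy routines, and the sizes and offsets are made-up values, not figures from the benchmarks.

```java
import java.util.Arrays;

public class BulkCopySketch {
    public static void main(String[] args) {
        double[] remote = new double[1024];   // stands in for a remote array
        Arrays.fill(remote, 1.0);
        double[] ghost = new double[16];      // local buffer for the needed region

        // Fine-grained version: one small message per element.
        //   for (int i = 0; i < ghost.length; i++)
        //       ghost[i] = remote[512 + i];

        // Bulk version: one large transfer for the whole region. On a real
        // network this replaces many round trips with a single message.
        System.arraycopy(remote, 512, ghost, 0, ghost.length);

        System.out.println(ghost.length + " elements moved in one copy");
    }
}
```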

Inspector Executor

- Developed by Joel Saltz and others at the University of Maryland in the early 90's
- The goal is to hide latency for problems with irregular accesses
- A loop is compiled into two phases, an inspector and an executor:
  - The inspector examines the data access pattern in the loop body and creates a schedule for fetching the remote values
  - The executor retrieves remote values according to the schedule and executes the loop
- A schedule may be reused if the access pattern is the same for multiple iterations

Inspector Executor Example
[figure: arrays a, b, and c, with each processor's local index range marked by myStartIndex and myEndIndex]

Inspector Executor Pseudocode

Original loop:

  for iteration = 1 to n
    for i = myStartIndex to myEndIndex
      a[i] = b[i] + c[ia[i]]
    end
    c.copy(a)
  end

Transformed loop:

  // inspector phase
  for i = myStartIndex to myEndIndex
    a[i] = b[i] + c.inspect(ia[i])
  end
  // create the communication schedule
  c.schedule()

  for iteration = 1 to n
    // fetch the remote values according to the communication schedule
    c.fetch()
    for i = myStartIndex to myEndIndex
      a[i] = b[i] + c.execute(ia[i])
    end
    c.copy(a)
  end
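To make the transformation concrete, here is a minimal runnable sketch in plain Java. Ordinary arrays stand in for Titanium distributed arrays, the isLocal test and the cache map are illustrative stand-ins for the runtime's ownership check and batched fetch, and the index values are made up.

```java
import java.util.*;

public class InspectorExecutorSketch {
    static final int N = 8;                       // size of this processor's block
    static final double[] c = new double[2 * N];  // conceptually distributed array
    static final int myStart = 0, myEnd = N - 1;  // this processor owns c[0..N-1]

    static boolean isLocal(int i) { return i >= myStart && i <= myEnd; }

    public static void main(String[] args) {
        for (int i = 0; i < c.length; i++) c[i] = i;
        double[] a = new double[N];
        double[] b = new double[N];
        int[] ia = {0, 9, 3, 15, 9, 2, 15, 1};    // mixed local/remote indices

        // Inspector: scan the access pattern once; the set of distinct
        // remote indices is the communication schedule.
        Set<Integer> schedule = new LinkedHashSet<>();
        for (int i = myStart; i <= myEnd; i++)
            if (!isLocal(ia[i])) schedule.add(ia[i]);

        // Fetch: one batched transfer per iteration rather than one small
        // message per remote element (simulated here by a local cache).
        Map<Integer, Double> cache = new HashMap<>();
        for (int idx : schedule) cache.put(idx, c[idx]);

        // Executor: local accesses read c directly; remote accesses read
        // the prefetched values from the cache.
        for (int i = myStart; i <= myEnd; i++)
            a[i] = b[i] + (isLocal(ia[i]) ? c[ia[i]] : cache.get(ia[i]));

        System.out.println(Arrays.toString(a));
    }
}
```

Because the schedule depends only on ia, it can be reused across the outer iterations, which is where the latency hiding comes from.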

Roadmap

- Introduced distributed array type
- First implemented by hand
- Currently working on a prototype in the compiler

Conjugate Gradient

- 4096x4096 matrices; 0.07% of matrix entries are non-zeros
- The percentage of non-local accesses varies from 0% to 64%
- 8 processors on 2 nodes, with 4 processors on each node
- Only the sparse matvec part is modified to use inspector executor
- The running time of 500 iterations was measured
- Machine: Seaborg (IBM SP)

Synthetic Matrices for the Benchmark

Description of the Benchmark

Compiler generated:
- Block copy broadcast
- Compiler inspector executor
- One at a time, blocking

Hand edited:
- Hand-written inspector executor
- One at a time, non-blocking

Sparse Matvec
[performance chart; lower is better]
Problem size: 4096x4096 matrix, 0.07% fill rate

Full Conjugate Gradient
[performance chart; lower is better]
Problem size: 4096x4096 matrix, 0.07% fill rate

Future Work

- Analysis of when the inspector executor transformation is legal
- Investigate the uniprocessor performance of sparse matvec
- Apply inspector executor in UPC
- Run the benchmark on matrices with different structures
- Automatically find a location to place the communication code
- More benchmarks that utilize inspector executor
- Alternative scheduling strategies