Tile Reduction: the first step towards tile aware parallelization in OpenMP
Ge Gan, Department of Electrical and Computer Engineering, University of Delaware

Overview
- Background
- Motivation
- A new idea: Tile Reduction
- Experimental Results
- Conclusion
- Related Work
- Future Work

Tile/Tiling
- A natural representation of the data objects that are heavily used in scientific algorithms
- Tiling improves data locality
- Tiling can increase parallelism and reduce synchronization in parallel programs
- An effective compiler optimization technique
- Essentially a program design paradigm
- Supported in many parallel programming languages: ZPL, CAF, HTA, etc.
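
To make the locality benefit concrete, here is a minimal sketch (not taken from the slides) of a tiled matrix transpose in C; N, TILE, and the array names are assumed for illustration only:

    /* Minimal illustration, not from the slides: a tiled matrix transpose.
       Without tiling, either A or B is accessed with a large stride on every
       iteration; walking the matrices one TILE x TILE block at a time keeps
       both blocks in cache while they are being used. */
    #define N    1024
    #define TILE 32

    void tiled_transpose(double A[N][N], double B[N][N])
    {
        for (int ii = 0; ii < N; ii += TILE)
            for (int jj = 0; jj < N; jj += TILE)
                for (int i = ii; i < ii + TILE; i++)
                    for (int j = jj; j < jj + TILE; j++)
                        B[j][i] = A[i][j];
    }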

OpenMP
- The de facto standard for shared-memory parallel programming
- Provides a simple and flexible interface for developing portable and scalable parallel applications
- Supports incremental parallelization
- Maintains sequential consistency
- But it is "tile oblivious": no directive or clause can be used to annotate a data tile and carry that information to the compiler

A Motivating Example
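
The example itself appears only as a figure in the original slides and is not preserved in this transcript. Judging from the later slides, which treat the innermost two loops as a macro operation on 2x2 data tiles, the loop nest presumably has roughly the following shape (all names here are illustrative assumptions):

    /* Assumed shape of the motivating loop nest; the slide's actual code is
       not in the transcript. A 2x2 accumulator tile sums over M input tiles.
       M, data, and sum are illustrative names. */
    void accumulate_tiles(int M, double data[][2][2], double sum[2][2])
    {
        for (int k = 0; k < M; k++)          /* loop over the input tiles      */
            for (int i = 0; i < 2; i++)      /* the innermost two loops touch  */
                for (int j = 0; j < 2; j++)  /* one 2x2 tile as a single unit  */
                    sum[i][j] += data[k][i][j];
    }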

Parallelizing: the traditional way (1)

Parallelizing: the traditional way (2)
- Can only leverage the traditional scalar reduction in OpenMP
- Parallelism is trivial
- Data locality is not bad
- Not natural or intuitive
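
The slide's code is not preserved in the transcript. One possible shape of such a "traditional" version, continuing the assumed example above and using only OpenMP's scalar reduction clause, is:

    /* A sketch only, not the slide's actual code: with scalar reduction each
       of the four tile elements needs its own named reduction variable, which
       is why this style does not extend naturally to larger tiles. */
    void accumulate_tiles_scalar(int M, double data[][2][2], double sum[2][2])
    {
        double s00 = 0.0, s01 = 0.0, s10 = 0.0, s11 = 0.0;

        #pragma omp parallel for reduction(+:s00,s01,s10,s11)
        for (int k = 0; k < M; k++) {
            s00 += data[k][0][0];
            s01 += data[k][0][1];
            s10 += data[k][1][0];
            s11 += data[k][1][1];
        }

        sum[0][0] += s00;  sum[0][1] += s01;
        sum[1][0] += s10;  sum[1][1] += s11;
    }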

The Expected Parallelization
- View the innermost two loops as a macro operation performed on the 2x2 data tiles
- Aggregate the data tiles in parallel
- More parallelism
- Better data locality

Tile Reduction Interface

Terms
- Reduction tile: the data tile under reduction
- Tile descriptor: the "multi-dimensional array" in the list construct
- Reduction kernel loops: the loops involved in performing "one" recursive calculation
- Tile name
- Dimension descriptor: the tuples following the tile name
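
The exact clause syntax appears only in the figure on the interface slide and is not preserved here. A plausible sketch, applied to the assumed example above, with the dimension descriptors written as [lower:upper] tuples (the tuple format itself is an assumption), would look like this:

    /* Sketch of the proposed tile reduction clause; this is the paper's
       proposed extension, not standard OpenMP of the time, and the
       [lower:upper] descriptor format is an assumption. Here sum is the
       reduction tile, sum[0:1][0:1] is the tile descriptor, and the i/j
       loops are the reduction kernel loops. */
    void accumulate_tiles_tilered(int M, double data[][2][2], double sum[2][2])
    {
        #pragma omp parallel for reduction(+: sum[0:1][0:1])
        for (int k = 0; k < M; k++)
            for (int i = 0; i < 2; i++)
                for (int j = 0; j < 2; j++)
                    sum[i][j] += data[k][i][j];
    }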

A Use Case
- Tiled matrix multiplication
- Tile reduction applied to the tiled matrix multiplication code
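
The code on this slide is not preserved in the transcript. A hedged sketch of how the clause above might be applied to tiled matrix multiplication follows; N, T, and the array names are assumptions, and the descriptor uses the same assumed [lower:upper] form:

    #define N 1024
    #define T 32

    /* Sketch only, not the slide's actual code: for each T x T tile of C,
       the loop over kk accumulates partial tile products in parallel, with
       the C tile named as the reduction tile. */
    void tiled_matmul(double A[N][N], double B[N][N], double C[N][N])
    {
        for (int ii = 0; ii < N; ii += T)
            for (int jj = 0; jj < N; jj += T) {
                #pragma omp parallel for reduction(+: C[ii:ii+T-1][jj:jj+T-1])
                for (int kk = 0; kk < N; kk += T)
                    for (int i = ii; i < ii + T; i++)
                        for (int j = jj; j < jj + T; j++)
                            for (int k = kk; k < kk + T; k++)
                                C[i][j] += A[i][k] * B[k][j];
            }
    }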

Code Generation (1)
1. Distribute the iterations of the parallelized loop among the threads
2. Allocate memory for the private copy of the tile used in the local recursive calculation
3. Perform the local recursive calculation, which is specified by the reduction kernel loops
4. Update the global copy of the reduction tile
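
The generated code itself appears in the figure on the next slide and is not preserved here. A hedged sketch of what the four steps above might expand to for the assumed 2x2 example (the critical-section merge in step 4 is an assumption; the actual runtime mechanism may differ):

    /* Hedged sketch of compiler-generated code for the assumed 2x2 example;
       not the paper's actual output. */
    void accumulate_tiles_generated(int M, double data[][2][2], double sum[2][2])
    {
        #pragma omp parallel
        {
            /* Step 2: allocate and zero a private copy of the reduction tile. */
            double priv[2][2] = {{0.0, 0.0}, {0.0, 0.0}};

            /* Steps 1 and 3: the iterations of the parallelized loop are
               distributed among the threads, and each thread runs the
               reduction kernel loops on its private tile. */
            #pragma omp for nowait
            for (int k = 0; k < M; k++)
                for (int i = 0; i < 2; i++)
                    for (int j = 0; j < 2; j++)
                        priv[i][j] += data[k][i][j];

            /* Step 4: merge the private tile into the global reduction tile. */
            #pragma omp critical
            for (int i = 0; i < 2; i++)
                for (int j = 0; j < 2; j++)
                    sum[i][j] += priv[i][j];
        }
    }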

Code Generation (2)

Experimental Results (1): 2D Histogram Reduction

Experimental Results (2): Matrix-Matrix Multiplication

Experimental Results (3): Matrix-Vector Multiplication

Conclusions
- As one of the building blocks of tile aware parallelization, tile reduction brings more opportunities to parallelize dense matrix applications
- For some benchmarks, tile reduction is a more natural and intuitive way to reason about the best parallelization decision
- For some benchmarks, tile reduction not only improves data locality but also exposes more parallelism
- Friendly to programmers
- Code generation is as simple as for the scalar reduction in current OpenMP
- Runtime overhead is trivial

Similar Works
Parallel reduction is supported in:
- C**: Viswanathan, G., Larus, J.R.: User-defined reductions for efficient communication in data-parallel languages. Technical Report 1293, University of Wisconsin-Madison (Jan 1996)
- SAC: Scholz, S.B.: On defining application-specific high-level array operations by means of shape invariant programming facilities. In: APL '98: Proceedings of the APL98 Conference on Array Processing Languages, New York, NY, USA, ACM (1998) 32–38
- ZPL: Deitz, S.J., Chamberlain, B.L., Snyder, L.: High-level language support for user-defined reductions. J. Supercomput. 23(1) (2002) 23–37
- UPC: UPC Consortium: UPC Collective Operations Specifications V1.0. A publication of the UPC Consortium (2003)
- MPI: Message Passing Interface Forum: MPI: A message-passing interface standard (version 1.0). Technical report (May 1994)
- OpenMP: Kambadur, P., Gregor, D., Lumsdaine, A.: OpenMP extensions for generic libraries. In: Lecture Notes in Computer Science: OpenMP in a New Era of Parallelism, IWOMP '08, International Workshop on OpenMP, Volume 5004/2008, Springer Berlin/Heidelberg (2008) 123–133

Future Work
- Design and develop OpenMP pragma directives that can be used to help the compiler generate efficient data-movement code for parallel applications running on many-core platforms with highly non-uniform memory systems, such as the Cyclops-64 processor