AUTOMATICALLY TUNING PARALLEL AND PARALLELIZED PROGRAMS
Chirag Dave and Rudolf Eigenmann, Purdue University

GOALS
Automatic parallelization without loss of performance
– Use automatic detection of parallelism
– Automatic parallelization tends to be overzealous: remove overhead-inducing parallelism
– Ensure no performance loss over the original program
Generic tuning framework
– Empirical approach: use actual program execution to measure benefits
– Offline tuning

AUTO vs. MANUAL PARALLELIZATION
Manual: source program → hand-parallelized parallel program; significant development time, and the user still tunes the program for performance.
Automatic: source program → parallelizing compiler → parallel program; state-of-the-art auto-parallelization runs in the order of minutes.

AUTO-PARALLELISM OVERHEAD
Loop-level parallelism: only the inner loop is parallel, so every outer iteration forks and joins a thread team.

int foo() {
  for (i = 0; i < 10; i++) {
    a[i] = c;
    #pragma omp parallel for private(j, t)   /* fork */
    for (j = 0; j < 10; j++) {
      t = a[i-1];
      b[j] = (t * b[j]) / 2.0;
    }                                        /* join */
  }
}

The benefit of the work in the parallel section must be weighed against the fork/join overheads and load-balancing costs.
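A hedged sketch (not from the slides) of one standard way to shrink this repeated fork/join cost: open the parallel region once, outside the outer loop, and only work-share the inner loop. The function name foo_amortized is invented, and a, b, and c are assumed to be declared elsewhere, as in the snippet above; whether this actually pays off depends on the amount of work per iteration, which is exactly what the tuner measures.

void foo_amortized(void) {
  int i, j;
  double t;
  #pragma omp parallel private(i, j, t)
  for (i = 0; i < 10; i++) {        /* every thread executes the outer loop */
    #pragma omp single
    a[i] = c;                       /* one thread writes; an implicit barrier follows */
    #pragma omp for
    for (j = 0; j < 10; j++) {      /* iterations are shared by the existing team */
      t = a[i-1];
      b[j] = (t * b[j]) / 2.0;
    }                               /* the implicit barrier preserves the a[] dependence */
  }
}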

NEED FOR AUTOMATIC TUNING
Identify, at compile time, the optimization strategy that gives maximum performance.
Beneficial parallelism
– Which loops to parallelize
– Parallel loop coverage
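For illustration (not taken from the slides), this is the kind of loop an automatic parallelizer can legally parallelize but that is rarely worth parallelizing: with a tiny trip count and one multiply-add per iteration, the fork/join and scheduling cost usually exceeds the serial runtime, so the tuner should leave it serial. The array names and sizes are made up.

double a64[64], b64[64], c64[64];

void small_loop(void) {
  int i;
  /* Dependence-free, but far too little work: on most machines the
     parallel version of this loop is slower than the serial one. */
  #pragma omp parallel for
  for (i = 0; i < 64; i++)
    c64[i] = a64[i] * b64[i] + 1.0;
}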

OUR APPROACH
– Find the best combination of loops to parallelize
– Offline tuning
– Decisions based on actual execution time

CETUS: VERSION GENERATION
The Cetus version generator builds on:
– Symbolic data dependence analysis
– Induction variable substitution
– Scalar and array privatization
– Reduction recognition
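To make the listed passes concrete, here is a small invented example of what scalar privatization and reduction recognition enable; the variable names are hypothetical and the OpenMP clauses only approximate the annotations Cetus would produce.

double reduce(const double *a, int n) {
  double sum = 0.0, t;
  int i;
  /* Without analysis, t and sum look like cross-iteration dependences.
     Privatization proves t is written before it is read in every
     iteration; reduction recognition proves sum is only updated as
     sum = sum + ..., so the loop can be marked parallel. */
  #pragma omp parallel for private(t) reduction(+:sum)
  for (i = 0; i < n; i++) {
    t = 2.0 * a[i];
    sum = sum + t;
  }
  return sum;
}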

SEARCH SPACE NAVIGATION
Search space: the set of parallelizable loops.
Generic tuning algorithm
– Capture interactions between loops
– Use program execution time as the decision metric
Combined Elimination
– Each loop is an on/off optimization
– Selective parallelization
Pan, Z., Eigenmann, R.: Fast and effective orchestration of compiler optimizations for automatic performance tuning. In: The 4th Annual International Symposium on Code Generation and Optimization (CGO), March 2006, pp. 319–330.

TUNING ALGORITHM
Batch Elimination
– Considers the effect of each optimization separately
– Instant elimination
Iterative Elimination
– Considers interactions
– More tuning time; each elimination establishes a new base case
Combined Elimination
– Considers interactions amongst a subset
– Iterates over the smaller subset and performs batch elimination
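Below is a rough, simplified sketch of the idea behind Combined Elimination, specialized to on/off loop parallelization. The function measure() is a placeholder the reader would have to supply: regenerate the code for the given on/off vector (for example via cetus -tune-ompGen=...), compile it, run it on the training input, and return the wall-clock time. The published CGO'06 algorithm batches the probing of harmful optimizations in each round rather than accepting each elimination immediately, so this is only an approximation of it.

#define NLOOPS 4     /* number of parallelizable loops in the search space */

/* Placeholder: build and run the program version described by cfg
   (cfg[i] == 1 means "parallelize loop i") and return its runtime in seconds. */
double measure(const int cfg[NLOOPS]);

void tune(int cfg[NLOOPS])
{
    int i, changed;

    for (i = 0; i < NLOOPS; i++)
        cfg[i] = 1;                      /* base case: all loops parallel */
    double base = measure(cfg);

    do {
        changed = 0;
        for (i = 0; i < NLOOPS; i++) {
            if (!cfg[i])
                continue;                /* already serialized */
            cfg[i] = 0;                  /* probe: serialize loop i */
            double t = measure(cfg);
            if (t < base) {
                base = t;                /* serializing helped: new base case */
                changed = 1;
            } else {
                cfg[i] = 1;              /* it hurt: keep the loop parallel */
            }
        }
    } while (changed);                   /* stop when no elimination improves the base */
}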

CETUNE INTERFACE

int foo() {
  #pragma cetus parallel ...
  for (i = 0; i < 50; i++) {
    t = a[i];
    a[i+50] = t + (a[i+50] + b[i]) / 2.0;
  }
  for (i = 0; i < 10; i++) {
    a[i] = c;
    #pragma cetus parallel ...
    for (j = 0; j < 10; j++) {
      t = a[i-1];
      b[j] = (t * b[j]) / 2.0;
    }
  }
}

cetus -ompGen -tune-ompGen=1,1   Parallelize both loops
cetus -ompGen -tune-ompGen=1,0   Parallelize the first loop, serialize the second
cetus -ompGen -tune-ompGen=0,1   Serialize the first loop, parallelize the second
cetus -ompGen -tune-ompGen=0,0   Serialize both loops
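For illustration, the version built for -tune-ompGen=0,1 might look roughly like the following (first loop left serial, second loop turned into an OpenMP loop); the exact pragmas and clauses Cetus emits may differ, and a, b, c, i, j, t are assumed to be declared elsewhere, as in the snippet above.

int foo() {
  for (i = 0; i < 50; i++) {                    /* bit 0 = 0: left serial */
    t = a[i];
    a[i+50] = t + (a[i+50] + b[i]) / 2.0;
  }
  for (i = 0; i < 10; i++) {
    a[i] = c;
    #pragma omp parallel for private(j, t)      /* bit 1 = 1: parallelized */
    for (j = 0; j < 10; j++) {
      t = a[i-1];
      b[j] = (t * b[j]) / 2.0;
    }
  }
  return 0;
}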

EMPIRICAL MEASUREMENT
Tuning loop over the training data set:
– Input source code (train data set)
– Automatic parallelization using Cetus; start configuration
– Version generation using tuner input
– Back-end code generation (ICC)
– Runtime performance measurement (Intel Xeon, dual quad-core)
– Decision based on RIP; next point in the search space, repeating until the final configuration is reached
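A minimal sketch of the measurement step, assuming whole-program wall-clock timing; tuned_region and the output format are illustrative only. The tuning driver would read the reported time, compute the relative improvement (RIP) over the current base configuration, and pick the next point in the search space.

#include <stdio.h>
#include <omp.h>

extern void tuned_region(void);   /* the auto-parallelized code under test */

int main(void)
{
    double start = omp_get_wtime();
    tuned_region();
    double elapsed = omp_get_wtime() - start;

    printf("%.6f\n", elapsed);    /* consumed by the tuning driver */
    return 0;
}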

RESULTS

CONTRIBUTIONS
– Described a compiler + empirical system that detects parallel loops in serial and parallel programs and selects the combination of parallel loops that gives the highest performance.
– Finding profitable parallelism can be done using a generic tuning method.
– The method can be applied on a section-by-section basis, allowing fine-grained tuning of program sections.
– Using a set of NAS and SPEC OMP2001 benchmarks, we show that the auto-parallelized and tuned version nearly matches or improves on the performance of the original serial or parallel program.

THANK YOU!