Prospector: A Toolchain to Help Parallel Programming. Minjang Kim and Hyesoon Kim, HPArch Lab; Chi-Keung Luk, Intel. This work is also supported by Samsung.

2 Motivation (1/2)
- Parallel programming is hard
- What if there were a tool that helps with parallel programming?
  - We already have some tools, such as race detectors
  - However, few tools guide the parallelization process itself
- A programmer wants to parallelize serial code
  - Where to parallelize?
  - How to parallelize?

3 Motivation (2/2)
- We propose Prospector
  - A set of dynamic program analyzers that help parallelize serial code
- Goals
  - Provide the information needed to find the right parallelization targets
  - Provide advice on writing correct and well-optimized parallel code

4 Overview of Prospector
[Figure: Prospector's workflow. Source code or a binary, together with a representative input, is analyzed to produce: per-loop execution statistics (e.g., Loop1: 8 invocations, 5,000 iterations, max 1,600 and min 40 iterations per invocation), predicted speedup for 2, 4, and 8 cores, a CPU-versus-GPU comparison, and annotated code regions such as loops, functions, and lock/unlock sections.]

5 Prospector: Loop-Centric Profiler
- Q: Which code sections would be good for parallelization?
  - Usually the most frequently executed loops
  - Legacy profilers report only hot functions and instructions
- We provide details of loop execution (see the sketch below)
  - Trip count: is there sufficient work?
  - Number of invocations: is the fork/join overhead low?
  - Statistics on iteration length (min, max, stdev): is the work balanced?
- Example report: Loop1 had 8 invocations and 5,000 total iterations, with at most 1,600 and at least 40 iterations per invocation
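To make these statistics concrete, below is a minimal hand-written sketch (not Prospector's implementation, which inserts instrumentation automatically) showing how counters at loop entry and on the back edge yield the invocation count, total trip count, and per-invocation extremes:

    #include <algorithm>
    #include <climits>
    #include <cstdio>

    // Hypothetical per-loop counters; a real tool inserts these via
    // source or binary instrumentation rather than by hand.
    struct LoopStats {
        long invocations = 0;      // times the loop was entered
        long iterations  = 0;      // total trip count over all invocations
        long min_trip = LONG_MAX;  // shortest single invocation
        long max_trip = 0;         // longest single invocation
    };

    int main() {
        LoopStats s;
        for (int inv = 0; inv < 8; ++inv) {            // 8 invocations
            long trip = 0;
            ++s.invocations;                           // loop-entry counter
            for (int i = 0; i < 40 + inv * 200; ++i)   // trip count varies per invocation
                ++trip;                                // back-edge counter (body elided)
            s.iterations += trip;
            s.min_trip = std::min(s.min_trip, trip);
            s.max_trip = std::max(s.max_trip, trip);
        }
        std::printf("invocations=%ld iterations=%ld min=%ld max=%ld\n",
                    s.invocations, s.iterations, s.min_trip, s.max_trip);
        return 0;
    }

A large gap between min and max, as this program prints, would flag the loop as poorly balanced for static work partitioning.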

6 Prospector: Parallel Speedup Predictor (1/2)
- Q: What speedup can we expect?
- Analytical models (e.g., Amdahl's Law, shown below) are impractical for predicting speedup in the presence of locks
- Our approach
  - Dynamically predict speedup based on lightweight profiling
- Challenges
  - How do we model architectural factors (e.g., caches, memory)?
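For reference, Amdahl's Law gives the ideal speedup when a fraction p of the execution is perfectly parallelized across n cores:

    \[ S(n) = \frac{1}{(1 - p) + p/n} \]

For example, with p = 0.9 and n = 8, S(8) = 1 / (0.1 + 0.9/8) ≈ 4.7, already well below 8. Locks serialize part of the nominally parallel fraction in input- and timing-dependent ways that this closed form cannot express, which is why a profile-driven prediction is used instead.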

7 Prospector: Parallel Speedup Predictor (2/2)
- Mechanisms (a hypothetical annotation example follows)
  - Programmers annotate the serial code
    - Annotations describe the intended parallel execution, including locks
  - Fast and lightweight profiling
    - Measure the time between annotations
  - Emulation
    - Derive an estimated parallel execution time, and hence the speedup
- Modeling architectural parameters
  - Sample memory accesses
  - Use an analytical model to predict cache hits and misses
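A minimal sketch of what such annotations might look like in C++; the PROSPECTOR_* macros are illustrative names invented here, not Prospector's actual API. They compile to nothing and merely mark, in the serial code, where the parallel region and its lock-protected section would be:

    #include <cstddef>
    #include <cstdio>

    // Illustrative no-op annotation macros (hypothetical names). The
    // profiler would time the code between these markers in the serial
    // run, then emulate how the timed segments interleave on the
    // requested number of threads, including lock serialization.
    #define PROSPECTOR_PARALLEL_BEGIN(nthreads)  /* parallel region starts */
    #define PROSPECTOR_PARALLEL_END()            /* parallel region ends   */
    #define PROSPECTOR_LOCK(l)                   /* future lock acquire    */
    #define PROSPECTOR_UNLOCK(l)                 /* future lock release    */

    static double heavy_compute(double x) { return x * x; }  // stand-in workload

    double annotated_sum(const double* a, std::size_t n) {
        double total = 0.0;
        PROSPECTOR_PARALLEL_BEGIN(8);        // intended: 8 worker threads
        for (std::size_t i = 0; i < n; ++i) {
            double v = heavy_compute(a[i]);  // independent per-iteration work
            PROSPECTOR_LOCK(sum_lock);       // this update would be serialized
            total += v;
            PROSPECTOR_UNLOCK(sum_lock);
        }
        PROSPECTOR_PARALLEL_END();
        return total;
    }

    int main() {
        double a[4] = {1.0, 2.0, 3.0, 4.0};
        std::printf("%f\n", annotated_sum(a, 4));
        return 0;
    }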

8 Prospector: Parallelizable Section Finder (1/3)
- Q: Is this code section parallelizable?
  - Data dependences determine parallelizability
  - Compilers are often not good at this because of pointers and complex control flow (see the example below)
- Our approach
  - Dynamic data-dependence profiling
  - Provides detailed dependence information for a given input
- Challenges
  - Naive profiling has too much overhead; a smart algorithm is needed
[Figure: the loops inside Func1 are marked "Parallelizable!"]
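An illustrative example (not from the slides) of why static analysis must be conservative while dynamic profiling is not: the compiler cannot prove that out and in never overlap, so it has to assume a possible loop-carried dependence, whereas a dependence profiler records the addresses actually accessed for the given input and can report that none conflict:

    // If 'out' and 'in' may alias, a compiler must assume a loop-carried
    // dependence. A data-dependence profiler instead observes the concrete
    // addresses at runtime and finds the iterations independent.
    void scale(float* out, const float* in, int n, float k) {
        for (int i = 0; i < n; ++i)
            out[i] = k * in[i];
    }

    int main() {
        float in[4] = {1, 2, 3, 4}, out[4];
        scale(out, in, 4, 2.0f);   // disjoint arrays: profiled as parallelizable
        return 0;
    }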

9 Prospector: Parallelizable Section Finder (2/3)
- Mechanisms (a runtime sketch follows)
  - A dynamic profiler based on instrumentation
    - Instrumentation can be at either the binary or the source level
  - At instrumentation time (i.e., statically)
    - Analyze control-flow graphs and loop structures
  - At runtime
    - Observe the actual memory addresses (no points-to analysis required)
    - Store and analyze these addresses to discover data dependences
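A minimal sketch of the runtime side, simplified to detecting only loop-carried flow (read-after-write) dependences within a single loop; a real dependence profiler also tracks anti and output dependences and handles loop nesting:

    #include <cstdint>
    #include <cstdio>
    #include <unordered_map>

    // Maps each address to the iteration that last wrote it.
    static std::unordered_map<std::uintptr_t, long> last_writer;
    static bool loop_carried = false;

    static void on_write(std::uintptr_t addr, long iter) { last_writer[addr] = iter; }

    static void on_read(std::uintptr_t addr, long iter) {
        auto it = last_writer.find(addr);
        if (it != last_writer.end() && it->second != iter)
            loop_carried = true;   // a value flows across iterations
    }

    int main() {
        int a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        for (long i = 1; i < 8; ++i) {
            // Callbacks a real tool would insert automatically:
            on_read(reinterpret_cast<std::uintptr_t>(&a[i - 1]), i);
            on_write(reinterpret_cast<std::uintptr_t>(&a[i]), i);
            a[i] += a[i - 1];      // prefix sum: genuinely loop-carried
        }
        std::printf("loop-carried dependence: %s\n", loop_carried ? "yes" : "no");
        return 0;
    }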

10 Prospector: Parallelizable Section Finder (3/3)
- Mechanisms: scalability
  - Existing tools require too much memory and time to analyze data dependences
  - Prospector implements a new, scalable data-dependence profiling algorithm
- Key ideas
  - Compression of the address streams and parallelization of the profiler itself (MICRO '10; a simplified illustration of the compression idea follows)
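A simplified illustration of why compression helps (the actual MICRO '10 algorithm, SD3, is considerably more sophisticated): a regular access stream such as a[0], a[1], ..., a[n-1] is summarized as a stride descriptor instead of n individual addresses, so both memory use and pairwise dependence checks shrink from the number of accesses toward the number of descriptors:

    #include <cstdint>
    #include <cstdio>

    // A regular address stream summarized as (base, stride, count)
    // instead of 'count' separate addresses.
    struct StrideRun {
        std::uintptr_t base;
        long stride;
        long count;
        std::uintptr_t last() const { return base + stride * (count - 1); }
    };

    // Conservative O(1) overlap test between two compressed streams.
    // (A precise test would also reason about the stride arithmetic;
    // this sketch only shows the effect of compression.)
    bool may_conflict(const StrideRun& a, const StrideRun& b) {
        return a.base <= b.last() && b.base <= a.last();
    }

    int main() {
        StrideRun writes = {0x1000, 4, 1000};  // e.g., a[0..999] written
        StrideRun reads  = {0x2000, 4, 1000};  // e.g., b[0..999] read
        std::printf("conflict possible: %s\n",
                    may_conflict(writes, reads) ? "yes" : "no");
        return 0;
    }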

11 Prospector: Parallelism Pattern Advisor
- Q: How can I transform the serial code?
- If the dependences are easily removable
  - i.e., embarrassingly parallel loops, possibly with some reductions
  - Guide the parallelization strategy directly
  - e.g., "use an OpenMP pragma here" (see the sketch below)
- If severe dependences exist
  - Can we give advice on avoiding these dependences?
  - General solutions are extremely hard
  - Instead, we analyze data-dependence patterns
  - e.g., pipeline parallelism, or a certain form of locking
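For the easy case, the advice can be as concrete as the following transformation (an illustration, assuming the profiler has confirmed that the accumulation into sum is the only cross-iteration dependence):

    #include <cstdio>

    int main() {
        const int n = 1000000;
        double sum = 0.0;
        // The accumulation into 'sum' is the only loop-carried dependence,
        // so OpenMP's reduction clause removes it safely; every other part
        // of the iteration is independent.
        #pragma omp parallel for reduction(+ : sum)
        for (int i = 0; i < n; ++i)
            sum += 1.0 / (1.0 + i);
        std::printf("sum = %f\n", sum);
        return 0;
    }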

12 Prospector: Parallel Architecture Advisor
- Q: Which parallel hardware would be better?
  - Can we predict performance on different hardware?
  - e.g., speedups on a multicore CPU versus a GPGPU
- Challenges
  - Need to model even more architectural factors
[Figure: predicted speedup on CPU vs. GPU]

13 Prospector: Parallel Performance Analyzer
- Q: Why is the measured speedup poor?
- A few profilers already exist for this purpose
  - They analyze the degree of concurrency
  - They profile lock contention (wait time; a hand-instrumented sketch follows)
  - However, the information is too low-level to understand the underlying problem
- Our alternative
  - Macroscopic profiling of parallelized programs
  - An alternative form of visualization
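As a concrete instance of the "lock wait time" statistic such profilers report, here is a hand-instrumented sketch using only the standard library (a real profiler gathers this without modifying the program):

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <mutex>
    #include <thread>
    #include <vector>

    std::mutex m;
    std::atomic<long long> wait_ns{0};   // total time threads spent blocked on 'm'

    void worker(int iters) {
        for (int i = 0; i < iters; ++i) {
            auto t0 = std::chrono::steady_clock::now();
            m.lock();                    // contended acquire
            auto t1 = std::chrono::steady_clock::now();
            wait_ns += std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
            // ... short critical section ...
            m.unlock();
        }
    }

    int main() {
        std::vector<std::thread> ts;
        for (int t = 0; t < 4; ++t) ts.emplace_back(worker, 100000);
        for (auto& th : ts) th.join();
        std::printf("total lock wait: %lld ns\n", wait_ns.load());
        return 0;
    }

The raw total says contention exists but not why, which is exactly the gap the macroscopic view above targets.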

14 Related Work
- State-of-the-art tools
  - Parallel Advisor from Intel Parallel Studio 2011
    - Its speedup predictor cannot model architectures
    - Its parallelizable-section finder has scalability issues
  - vfAnalyst from Vector Fabrics
    - Its parallelizable-section finder has scalability issues

15 Current Status and Timeline
- June 2010
  - Prospector's initial idea presented at HotPar '10
- Dec 2010
  - The scalable data-dependence profiling algorithm (for the Parallelizable Section Finder and the Pattern Advisor) will be presented at MICRO '10
  - A beta version will be released as open source, including:
    - The Loop-Centric Profiler
    - The Parallelizable Section Finder (i.e., the data-dependence profiler)
    - The Parallel Speedup Predictor
- Mar 2011
  - The Parallel Speedup Predictor will be released
- Aug 2011
  - The first Parallelism Pattern Advisor will be released

16 Conclusion
- We need a new type of tool to help with parallel programming
- Prospector is a set of parallel-programming advisors based on dynamic program analysis
  - Finds good parallelization targets
  - Analyzes serial code to understand its behavior
  - Predicts speedup
  - Provides advice on code changes

17 Thank you!
- Q&A
- References
  - Overall tool architecture: Minjang Kim, Hyesoon Kim, and Chi-Keung Luk, "Prospector: Helping Parallel Programming by a Data-Dependence Profiler," 2nd USENIX Workshop on Hot Topics in Parallelism (HotPar '10), June 2010.
  - Scalable data-dependence profiling: Minjang Kim, Hyesoon Kim, and Chi-Keung Luk, "SD3: A Scalable Approach to Dynamic Data-Dependence Profiling," Proceedings of the 43rd IEEE/ACM International Symposium on Microarchitecture (MICRO), December 2010.