Beyond Auto-Parallelization: Compilers for Many-Core Systems
Marcelo Cintra, University of Edinburgh
(Moore for Less Keynote, September)

Compilers for Parallel Computers (Today)
- Auto-parallelizing compilers
  – "Holy grail": convert sequential programs into parallel programs with little or no user intervention
  – Only partial success, despite decades of work
  – No performance debugging tools
- For explicitly parallel languages/annotations (e.g., OpenMP, Java Threads)
  – Main goal: correctly map high-level data and control flow to hardware/OS threads and communication
  – Secondary goal: perform simple optimizations specific to parallel execution
  – Simple correctness and performance debugging tools
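As a point of reference for the second category, a minimal OpenMP annotation in C might look like the sketch below (a generic, hypothetical example, not taken from the talk); the compiler's and runtime's job is to map the annotated loop onto hardware/OS threads:

    #include <stdio.h>

    #define N 1000

    int main(void) {
        double a[N];
        /* The pragma asks the compiler/runtime to split the iterations
           across threads; thread creation and the mapping to OS threads
           are handled for the programmer. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            a[i] = 2.0 * i;
        }
        printf("a[%d] = %f\n", N - 1, a[N - 1]);
        return 0;
    }

(Compiled with OpenMP support, e.g. gcc -fopenmp, the loop runs in parallel; without it, the pragma is ignored and the program stays sequential.)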

Compilers for Parallel Computers (Future)
- Data flow/dependence analysis tools – unsafe/speculative
  – Probabilistic approaches
  – Profile-based approaches
- Multithreading-specific optimization toolbox
  – Including alternative/speculative parallel programming models (e.g., Transactional Memory (TM))
- Auto-parallelizing compilers – with speculation
  – Thread-level speculation (TLS)
  – Helper threads
Holistic parallelizing tool chain.

Why Be Speculative?
- Performance of programs ultimately limited by control and data flows
- Most compiler optimizations exploit knowledge of control and data flows
- Techniques based on complete/accurate knowledge of control and data flows are reaching their limit
  – True for both sequential and parallel optimizations
Future compiler optimizations must rely on incomplete knowledge: speculative execution.

Compilers for Parallel Computers (Future)
[Tool-chain diagram relating a Dependence/Flow Analysis Tool, a Parallelizing Compiler, and an Auto-TLS Compiler: sequential code becomes P-way parallel code, or unsafe/<P-way parallel code executed with TLS/TM support.]

Outline
- Context and Motivation
- History and status quo of auto-parallelizing compilers
  – Data dependence analysis for array-based programs
  – Data dependence analysis for irregular programs
- Auto-parallelizing compilers for TLS
  – TLS execution model (speculative parallelization)
  – Static compiler cost model (PACT'04, TACO'07)

Moore for Less Keynote - September Data Dependence Analysis for Arrays  Based on mathematical evaluation of array index expressions within loop nests  Progressively more capable analyses (e.g., GCD test, Banerjee test), but still restricted to affine loop index expressions  Coupled with mathematical framework to represent loop transformations (e.g., loop interchange, skewing) that can help expose more parallelism

Moore for Less Keynote - September Data Dependence Analysis for Arrays  What’s wrong with traditional data dependence? –Not all index expressions are affine or even statically defined (e.g., subscripted subscripts) –Not all loops are well structured (e.g., conditional exits, control flow) –Not all procedures are analyzable (e.g., unavailable code, aliasing, global data access) –Not all applications make intense use of arrays (e.g., trees, hash tables, linked lists, etc) and loop nests

Data Dependence Analysis for Irregular Programs
- Based on ad-hoc analyses (e.g., pointer analysis, shape analysis, task graph analysis)
There isn't a comprehensive data dependence analysis framework for irregular applications.

Outline
- Context and Motivation
- History and status quo of auto-parallelizing compilers
  – Data dependence analysis for array-based programs
  – Data dependence analysis for irregular programs
- Auto-parallelizing compilers for TLS
  – TLS execution model (speculative parallelization)
  – Static compiler cost model (PACT'04, TACO'07)

Thread Level Speculation (TLS)
- Assume no dependences and execute threads in parallel
- While speculating, buffer speculative data separately
- Track data accesses and monitor cross-thread violations
- Squash offending threads and restart them
- All this can be done in hardware, software, or a combination

Example loop:

    for (i = 0; i < 100; i++) {
        ... = A[L[i]] + ...
        A[K[i]] = ...
    }

With iterations J, J+1, and J+2 running speculatively in parallel (iteration J reads A[4] and writes A[5]; J+1 reads and writes A[2]; J+2 reads A[5] and writes A[6]), the write to A[5] in iteration J arrives after the read of A[5] in iteration J+2: a cross-thread RAW violation, so iteration J+2 must be squashed and restarted.
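A toy software sketch of the tracking and violation-detection step above (purely illustrative; as the slide notes, real systems implement this in hardware, software, or a combination, and with far more machinery):

    #include <stdio.h>
    #include <stdbool.h>

    #define NELEM 100   /* size of the speculatively accessed array (illustrative) */

    /* Per-thread access tracking: which elements of A a speculative
       thread has read or written while its results are still buffered.
       (written[] would be used to write buffered data back at commit
       time; that part is omitted here.) */
    typedef struct {
        bool read[NELEM];
        bool written[NELEM];
    } AccessSet;

    static void track_read(AccessSet *t, int idx)  { t->read[idx] = true; }
    static void track_write(AccessSet *t, int idx) { t->written[idx] = true; }

    /* When a logically earlier thread writes element `idx`, any logically
       later thread that has already read `idx` consumed a stale value
       (a RAW violation) and must be squashed and restarted. */
    static bool must_squash(const AccessSet *later, int idx) {
        return later->read[idx];
    }

    int main(void) {
        AccessSet iter_j = {0}, iter_j2 = {0};
        track_read(&iter_j2, 5);   /* iteration J+2 speculatively reads A[5] */
        track_write(&iter_j, 5);   /* iteration J (earlier) then writes A[5] */
        printf("squash iteration J+2? %s\n",
               must_squash(&iter_j2, 5) ? "yes" : "no");
        return 0;
    }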

Moore for Less Keynote - September  Squash & restart: re-executing the threads  Speculative buffer overflow: speculative buffer is full, thread stalls until becomes non-speculative  Dispatch & commit: writing back speculative data into memory and starting next speculative thread  Load imbalance: processor waiting for thread to become non-speculative to commit TLS Overheads

Coping with Overheads: Cost Model!
- Compiler cost models are key to guiding optimizations, but no such cost model exists for TLS
- Speculative parallelization can deliver significant speedup or slowdown
  – Several speculation overheads
  – Overheads are hard to estimate (e.g., will a squash occur?)
- A prediction of the speedup value can be useful
  – e.g., in a multi-tasking environment: program A wants to run speculatively in parallel on 4 cores (predicted speedup 1.8), other programs are waiting to be scheduled, and the OS decides it does not pay off
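The decision in that scenario can be made concrete with a hypothetical throughput comparison (not the talk's model): on 4 cores, program A alone would deliver about 1.8 programs' worth of work per unit time, whereas running A sequentially and giving the other three cores to waiting jobs delivers about 4, so the OS declines to parallelize:

    #include <stdio.h>
    #include <stdbool.h>

    static int min_int(int a, int b) { return a < b ? a : b; }

    /* Hypothetical heuristic: compare aggregate throughput (in "programs'
       worth of work per unit time") of the two options.
       - Parallel: program A alone occupies all cores -> throughput = predicted speedup.
       - Sequential: A uses one core, waiting jobs fill the rest. */
    static bool worth_parallelizing(double predicted_speedup, int cores,
                                    int jobs_waiting) {
        double parallel_throughput   = predicted_speedup;
        double sequential_throughput = 1.0 + min_int(jobs_waiting, cores - 1);
        return parallel_throughput > sequential_throughput;
    }

    int main(void) {
        /* The slide's scenario: predicted speedup 1.8 on 4 cores with other
           programs waiting -> 1.8 < 1 + 3 = 4, so run A sequentially. */
        printf("%s\n", worth_parallelizing(1.8, 4, 3)
                           ? "run in parallel" : "run sequentially");
        return 0;
    }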

Moore for Less Keynote - September  Squash & restart: re-executing the threads –Hard because violations are highly unpredictable  Speculative buffer overflow: speculative buffer is full, thread stalls until becomes non-speculative –Hard because write-sets are somewhat unpredictable  Dispatch & commit: writing back speculative data into memory and starting next speculative thread –Hard because write-sets are somewhat unpredictable  Load imbalance: processor waiting for thread to become non-speculative to commit –Hard because workloads are very unpredictable and order does matter due to in-order commit requirement TLS Overheads

Our Compiler Cost Model: Highlights
- The first fully static compiler cost model for TLS
- Handles all TLS overheads in a single framework
  – Including load imbalance, which is not handled by any other cost model
- Produces not just a qualitative ("good" or "bad") assessment of the TLS benefits but a quantitative value (i.e., expected speedup/slowdown)
- Can be easily integrated into most compilers at the intermediate representation level
- Simple and fast to compute
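To give a feel for what such a quantitative estimate looks like, here is a hypothetical back-of-the-envelope formula combining the overhead terms listed earlier (the actual model in the PACT'04/TACO'07 papers is considerably more detailed):

    #include <stdio.h>

    /* Hypothetical, simplified shape of a static TLS speedup estimate:
       sequential time divided by parallel time, where each speculative
       thread pays for its work, dispatch/commit, expected re-execution
       after squashes, and commit-order waiting. */
    static double estimate_tls_speedup(double t_iter,      /* avg sequential time per iteration  */
                                       int    iters,       /* loop trip count                    */
                                       int    cores,       /* number of speculative threads      */
                                       double t_dispatch,  /* dispatch + commit cost per thread  */
                                       double p_squash,    /* estimated squash probability       */
                                       double t_imbalance) /* estimated wait for in-order commit */
    {
        double t_seq    = t_iter * iters;
        double t_thread = t_iter + t_dispatch + p_squash * t_iter + t_imbalance;
        double t_par    = (t_thread * iters) / cores;
        return t_seq / t_par;
    }

    int main(void) {
        /* Illustrative numbers only. */
        printf("estimated speedup: %.2f\n",
               estimate_tls_speedup(100.0, 1000, 4, 10.0, 0.2, 20.0));
        return 0;
    }

With, say, 100 time units of work per iteration, 10 units of dispatch/commit, a 20% squash probability, and 20 units of commit wait on 4 cores, this estimate comes out to roughly 2.7x.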

Speedup Distribution
[Chart omitted.]
Very varied speedup/slowdown behavior.

Model Accuracy (I): Outcomes
- Only 17% false positives (performance degradation)
- Negligible false negatives (missed opportunities)
- Most speedups/slowdowns correctly predicted by the model

Current Developments
- Done:
  – Completed implementation of TLS code generator in GCC
- Doing:
  – Implementing cost model in this TLS GCC
  – Profiling TLS program behavior (with IBM and U. of Manchester)
- To do:
  – Develop hybrid cost models based on static and profile information
  – Develop "intelligent" cost models based on Machine Learning (with U. of Manchester)

Summary
- Paraphrasing M. Snir† (UIUC): "parallel programming will have to become synonymous with programming"
- However, we will need:
  – Better (and unsafe) data dependence analysis tools
  – Explicit (and speculative) parallel models
  – Auto-parallelizing (speculative) compilers
- Much work still needs to be done.
- At U. of Edinburgh:
  – Auto-parallelizing TLS compilers
  – TLS hardware
  – STM (software TM)

† Director of Intel+Microsoft's UPCRC

Acknowledgments
- Research Team and Collaborators
  – Jialin Dou
  – Salman Khan
  – Polychronis Xekalakis
  – Nikolas Ioannou
  – Fabricio Goes
  – Constantino Ribeiro
  – Dr. G. Brown, Dr. M. Lujan, Prof. I. Watson (U. of Manchester)
  – Prof. Diego Llanos (U. of Valladolid)
- Funding
  – UK EPSRC: GR/R65169/01, EP/G000697/1
