Toward a Compiler Framework for Thread-Level Speculation Marcelo Cintra University of Edinburgh



University of Karlsruhe - January

Why Speculation?
- Performance of programs is ultimately limited by control and data flows
- Most compiler and architectural optimizations exploit knowledge of control and data flows
- Techniques based on complete knowledge of control and data flows are reaching their limits
- Future compiler and architectural optimizations must rely on incomplete knowledge: speculative execution

Example: Loop Fusion

Original code:
    for (i=0; i<100; i++) {
      A[i] = …
    }
    for (i=0; i<100; i++) {
      … = A[i] + …
    }

Optimized (fused) code:
    for (i=0; i<100; i++) {
      A[i] = …
      … = A[i] + …
    }

With a conditional indirect store "if (cond) A[B[i]] = …" in the loops, fusion is unsafe: is B[i] > i for some i? If so, the fused code is incorrect.
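This hazard can be made concrete with a small simulation. The Python sketch below is illustrative only (the loop bodies, the `i % 2 == 0` test standing in for `cond`, and the index vectors are made up); it places the conditional indirect store in the reading loop, one of the ways the hazard on the slide can arise, and shows that fusion preserves the result when B[i] == i but changes it when B[i] > i:

```python
N = 8

def original(B):
    # two separate loops: every A[i] is written before any A[i] is read
    A = [0] * N
    for i in range(N):
        A[i] = i * 10
    out = [0] * N
    for i in range(N):
        out[i] = A[i] + 1            # read A[i]
        if i % 2 == 0:               # "cond"
            A[B[i]] = -1             # indirect store A[B[i]] = ...
    return out

def fused(B):
    # fused loop: the write, the read, and the indirect store share one iteration
    A = [0] * N
    out = [0] * N
    for i in range(N):
        A[i] = i * 10
        out[i] = A[i] + 1
        if i % 2 == 0:
            A[B[i]] = -1
    return out

B_safe = list(range(N))                     # B[i] == i: no cross-iteration flow
B_unsafe = [(i + 1) % N for i in range(N)]  # B[i] > i for every executed store

print(original(B_safe) == fused(B_safe))      # True
print(original(B_unsafe) == fused(B_unsafe))  # False
```

In the unsafe case the original program's read of A[i+1] sees the indirect store from iteration i, while the fused program overwrites A[i+1] with its own regular store first, so the results diverge.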

Example: Out-of-order Execution

Original execution:
    MUL R1, R2, R3
    stall … stall
    ADD R5, R1, R4
    ST  1000(R1), R5
    stall … stall
    LD  500(R7), R6
    stall … stall

Optimized execution:
    MUL R1, R2, R3
    LD  500(R7), R6
    stall … stall
    ADD R5, R1, R4
    ST  1000(R1), R5
    stall … stall

Hoisting the load above the store is unsafe: does 500+R7 == 1000+R1?

Solution: Speculative Execution
- Identify potential optimization opportunities
- Assume no data dependences and perform the optimization
- While speculating, buffer unsafe data separately
- Monitor actual data accesses at run time
- Detect violations
- Squash the offending execution, discard the speculative data, and re-execute; or commit the speculative execution and data
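These steps can be sketched in a few lines. The following minimal Python simulation is not from the talk (the names `speculate`, `task`, `validate`, and the simulated conflict are made up for illustration); it buffers speculative stores apart from safe state, records the values the task assumed when loading, and either commits the buffer or squashes and re-executes:

```python
def speculate(memory, task, validate):
    """Run task speculatively: buffer stores, then commit or squash and retry."""
    while True:
        buffer = {}     # speculative data, kept apart from safe state
        read_set = {}   # what the task assumed each address held

        def load(addr):
            val = buffer.get(addr, memory[addr])
            read_set.setdefault(addr, val)
            return val

        def store(addr, val):
            buffer[addr] = val          # never touches memory while speculative

        task(load, store)
        if validate(read_set):          # no violation detected?
            memory.update(buffer)       # commit speculative data
            return
        # violation: discard the buffer (squash) and re-execute the task

memory = {"x": 1, "y": 0}
attempts = []

def task(load, store):
    attempts.append(1)
    store("y", load("x") + 10)

conflicted = {"done": False}

def validate(read_set):
    # simulate a predecessor thread writing x during our first attempt
    if not conflicted["done"]:
        conflicted["done"] = True
        memory["x"] = 2
        return False
    return all(memory[a] == v for a, v in read_set.items())

speculate(memory, task, validate)
print(memory["y"], len(attempts))   # 12 2
```

The first attempt reads the stale x, is squashed, and the re-execution commits y computed from the up-to-date value.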

Why Speculation at Thread Level?
- Modern architectures support instruction-level speculation, but:
  – the depth of speculative execution only spans a few dozen instructions (the instruction window)
  – there is no support for speculative memory operations (especially stores)
  – speculation is not exposed to the compiler
- We must support speculative execution across much larger blocks of instructions (“threads”), and with compiler assistance

Outline
- Motivation
- Speculative Parallelization
- Compiler Cost Model
  – Evaluation
  – Related Work
  – Conclusions
- Current and Future Directions

Speculative Parallelization
- Assume no dependences and execute threads in parallel
- Track data accesses at run time
- Detect cross-thread violations
- Squash offending threads and restart them

    for (i=0; i<100; i++) {
      … = A[L[i]] + …
      A[K[i]] = …
    }

Iteration J:   … = A[4] + …;  A[5] = …
Iteration J+1: … = A[2] + …;  A[2] = …
Iteration J+2: … = A[5] + …;  A[6] = …    ← RAW violation between iteration J’s store to A[5] and iteration J+2’s load of A[5]
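The mechanism can be simulated on the loop above. The Python sketch below is a toy model, not the actual runtime system (the `+ 1` loop body and the concrete index vectors are made up); it runs all iterations optimistically against the pre-loop state, detects cross-thread RAW violations, and squashes and re-executes the offenders. The L/K vectors reproduce the slide's dependence on A[5]:

```python
def run_sequential(A, L, K):
    # reference semantics of: for i { ... = A[L[i]] + ...; A[K[i]] = ... }
    A = list(A)
    for i in range(len(L)):
        A[K[i]] = A[L[i]] + 1
    return A

def run_speculative(A, L, K):
    A = list(A)
    n = len(L)
    # optimistic pass: every thread reads from the pre-loop state "in parallel"
    reads = [A[L[i]] for i in range(n)]
    for i in range(n):
        A[K[i]] = reads[i] + 1
    # RAW violation: an earlier thread i stored to an address a later thread j read
    squashed = [j for j in range(n) if any(K[i] == L[j] for i in range(j))]
    # squash & restart the offending threads, in sequential order
    for j in squashed:
        A[K[j]] = A[L[j]] + 1
    return A, squashed

# Slide's pattern: iteration J reads A[4], writes A[5]; J+1 reads/writes A[2];
# J+2 reads A[5] (written by J) and writes A[6]  =>  RAW violation, squash J+2
A0 = [0] * 7
L = [4, 2, 5]
K = [5, 2, 6]

par, squashed = run_speculative(A0, L, K)
print(par == run_sequential(A0, L, K), squashed)   # True [2]
```

After the squashed third thread re-executes with up-to-date values, the speculative result matches sequential execution.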

Speculative Parallelization Overheads
- Squash & restart: re-executing the threads
- Speculative buffer overflow: the speculative buffer is full; the thread stalls until it becomes non-speculative
- Inter-thread communication: waiting for a value from a predecessor thread
- Dispatch & commit: writing back speculative data into memory
- Load imbalance: a processor waits for its thread to become non-speculative in order to commit

Squash Overhead
- A particular problem in speculative parallelization
- Data dependences cannot be violated
- A store appearing after a “later” load (in sequential order) causes a squash
- A squashed thread must restart from the beginning
(Figure: a store on one processor squashes a successor thread that has already performed the load.)

Speculative Buffer Overflow Overhead
- A particular problem in speculative parallelization
- Speculatively modified state cannot be allowed to overwrite safe (non-speculative) state; it must be buffered instead
- On a buffer overflow, the thread remains idle waiting for its predecessor to commit
(Figure: a store overflows the speculative buffer and the thread stalls until it can commit.)

Load Imbalance Overhead
- A different problem in speculative parallelization
- Due to the in-order-commit requirement, a processor cannot start a new thread before its current thread commits
- It remains idle waiting for the predecessor to commit
(Figure: processors sit idle until the non-speculative thread commits.)

Factors Causing Load Imbalance
- Differences in thread workload:
  – different control paths (intrinsic load imbalance)
  – different data sizes
  – influence from other overheads, e.g. a speculative buffer overflow on one thread leads to longer waiting times on its successor threads

    for () {
      if () {
        …        // workload 1 (W1)
      } else {
        …        // workload 2 (W2)
      }
    }

Factors Causing Load Imbalance (contd.)
- The assignment (locations) of the threads on the processors
(Figure: the same threads assigned to processors in two different orders lead to different commit times.)

Outline
- Motivation
- Speculative Parallelization
- Compiler Cost Model
  – Evaluation
  – Related Work
  – Conclusions
- Current and Future Directions

Why a Compiler Cost Model?
- Speculative parallelization can deliver significant speedup or slowdown:
  – several speculation overheads
  – some code segments could slow down the program
  – we need a smart compiler that chooses which program regions to run speculatively based on the expected outcome
- A prediction of the value of the speedup can also be useful, e.g. in a multi-tasking environment:
  – program A wants to run speculatively in parallel (predicted speedup 1.2)
  – other programs are waiting to be scheduled
  – the OS decides speculation does not pay off

Proposed Compiler Model
Idea (extended from [Dou and Cintra, PACT’04]):
1. Compute a table of thread sizes based on all possible execution paths (base thread sizes)
2. Generate new thread sizes for those execution paths that incur speculation overheads (overheaded thread sizes)
3. Consider all possible assignments of the above sizes to P processors, each weighted by its probability
4. Remove impossible assignments and adjust the probabilities
5. Compute the expected sequential (T_seq^est) and parallel (T_par^est) execution times
6. S^est = T_seq^est / T_par^est

Compute Thread Sizes Based on Execution Paths

    for () {
      …                 // w1
      if () {
        …
        … = X[A[i]]
        …
        X[B[i]] = …
        …               // w2
      } else {
        …
        Y[C[i]] = …
        …               // w3
      }
    }

Path through the if-branch (probability p1): W1 = w1 + w2
Path through the else-branch (probability p2): W2 = w1 + w3

Generating New Thread Sizes for Speculative Overheads
- For execution paths whose speculative loads and stores can incur overheads (e.g. a squash), generate overheaded thread sizes by adding the overhead time w: W1 = W1 + w, W2 = W2 + w, W3 = W3 + w
- The probabilities are adjusted accordingly (p1’, p2, p3)

Consider All Assignments: the Thread Tuple Model
(Figure: example assignments of thread sizes to processors PE0–PE3.)

Consider All Assignments: the Thread Tuple Model (contd.)
- Three thread sizes W1, W2, and W3, assigned onto 4 processors: 81 variations, each called a tuple
- In general: N thread sizes and P processors → N^P tuples

    Tuple | Assignment   | Probability
    1     | W1 W1 W1 W1  | p1·p1·p1·p1
    2     | W1 W1 W1 W2  | p1·p1·p1·p2
    3     | W1 W1 W1 W3  | p1·p1·p1·p3
    …     | …            | …
    80    | W3 W3 W3 W2  | p3·p3·p3·p2
    81    | W3 W3 W3 W3  | p3·p3·p3·p3
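For the 3-sizes/4-processors case the tuple space is small enough to enumerate directly. The Python sketch below uses made-up sizes and probabilities; it builds all N^P = 81 tuples and confirms that the tuple probabilities, taken as products of independent per-thread probabilities, add up to 1:

```python
from itertools import product

# Made-up thread sizes and probabilities; P processors => N**P tuples
W = [10.0, 14.0, 23.0]   # W1 < W2 < W3
p = [0.5, 0.3, 0.2]      # p1, p2, p3
P = 4

tuples = list(product(range(len(W)), repeat=P))
print(len(tuples))       # 81  (= 3**4)

def tuple_prob(t):
    # independent threads: a tuple's probability is the product of size probabilities
    pr = 1.0
    for i in t:
        pr *= p[i]
    return pr

total = sum(tuple_prob(t) for t in tuples)
print(round(total, 10))  # 1.0
```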

Remove Improper Assignments and Adjust Probabilities
- Some assignments can never happen:
  – e.g., squashed and overflowed threads cannot appear on PE0
  – e.g., a squashed thread can only appear if the “producer” thread appears on a predecessor processor
  – e.g., an overflowed thread can only appear if a thread longer than the time of the stalling store appears on a predecessor processor
- Probabilities vary across processors:
  – e.g., the probability of a squashed thread appearing increases from PE1 to the last PE (increased chance that the producer appears on a predecessor processor)

Remove Improper Assignments and Adjust Probabilities (contd.)
- Tuples containing an impossible assignment are removed (they cannot appear)
- Probabilities become per-processor quantities p_{i,j} (size i on processor j); e.g. the remaining tuples have probabilities p_{1,0}·p_{1,1}·p_{1,2}·p_{1,3}, p_{1,0}·p_{1,1}·p_{1,2}·p_{2,3}, p_{1,0}·p_{1,1}·p_{1,2}·p_{3,3}, …
- The adjusted probabilities add up to 1

Compute Sequential and Parallel Execution Times
- Within a tuple: T_seq^tuple = Σ_i W_i (sum over the threads in the tuple); T_par^tuple = max_i(W_i)

    Tuple | Assignment   | Probability  | T_seq^tuple | T_par^tuple
    1     | W1 W1 W1 W1  | p1·p1·p1·p1  | 4·W1        | W1
    2     | W1 W1 W1 W2  | p1·p1·p1·p2  | 3·W1 + W2   | W2
    3     | W1 W1 W1 W3  | p1·p1·p1·p3  | 3·W1 + W3   | W3
    …     | …            | …            | …           | …
    80    | W3 W3 W3 W2  | p3·p3·p3·p2  | W2 + 3·W3   | W3
    81    | W3 W3 W3 W3  | p3·p3·p3·p3  | 4·W3        | W3

Compute Sequential and Parallel Execution Times (contd.)
- Estimated sequential execution time T_seq^est: the expectation of T_seq^tuple over all tuples
- Estimated parallel execution time T_par^est: the expectation of T_par^tuple over all tuples

Compute Sequential and Parallel Execution Times (contd.)
- Estimated sequential execution time:

    T_seq^est = P · Σ_{i=1..NB} W_i · p_i

  O(NB) << enumeration (NB = number of base thread sizes)
- Estimated parallel execution time (with sizes sorted so that W_1 ≤ … ≤ W_N):

    T_par^est = Σ_{i=1..N} p(T_par^tuple = W_i) · W_i

    p(T_par^tuple = W_i) = ( Π_{k=0..P−1} Σ_{l=1..i} p_{l,k} ) − Σ_{m=1..i−1} p(T_par^tuple = W_m)

  O(N·P + O(p_{i,j})), where O(p_{i,j}) is the complexity of computing the p_{i,j}’s
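The closed form can be checked against brute-force enumeration. The Python sketch below uses made-up sizes and probabilities and the simplified case where p_{i,j} = p_i on every processor, so the general formula reduces to P(T_par ≤ W_i) = (p_1 + … + p_i)^P; it computes the expected parallel time both ways and forms the estimated speedup:

```python
from itertools import product

W = [10.0, 14.0, 23.0]   # made-up thread sizes, sorted ascending
p = [0.5, 0.3, 0.2]      # made-up probabilities (here p_{i,j} == p_i on every PE)
P = 4
N = len(W)

# brute force: enumerate all N**P tuples, weight each tuple's max size by its probability
T_par_enum = 0.0
for t in product(range(N), repeat=P):
    pr = 1.0
    for i in t:
        pr *= p[i]
    T_par_enum += pr * max(W[i] for i in t)

# closed form: P(T_par <= W_i) = (p_1 + ... + p_i)**P, so
# p(T_par = W_i) = cum_i**P - cum_{i-1}**P   -- O(N*P) instead of O(N**P)
T_par_closed = 0.0
cum_prev = 0.0
for i in range(N):
    cum = cum_prev + p[i]
    T_par_closed += (cum ** P - cum_prev ** P) * W[i]
    cum_prev = cum

T_seq = P * sum(W[i] * p[i] for i in range(N))   # expected sequential time
S_est = T_seq / T_par_closed

print(abs(T_par_enum - T_par_closed) < 1e-9)     # True
print(round(S_est, 2))                           # 2.9
```

The two computations agree, and the closed form avoids touching the exponentially many tuples.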

Compute Sequential and Parallel Execution Times (contd.)
- p_{i,j} is the probability that thread size i appears on processor j; it is either equal to p_i or computed for every pair of threads involved in an overhead
- Each p_{i,j} is computed in:
  – O(1) for the squash overhead
  – O(NB) for the overflow overhead
- Thus, all p_{i,j}’s are computed in O(NB·N·P) = O(N²·P)
- Thus, all p(T_par^tuple = W_i) are computed in O(N²·P)
- Finally, T_par^est is computed in O(N²·P) << enumeration

Computing the Estimated Speedup
- S^est = T_seq^est / T_par^est
- Total cost O(N²·P): << enumeration (compare with the O(N) cost of the PACT’04 model)

Outline
- Motivation
- Speculative Parallelization
- Compiler Cost Model
  – Evaluation
  – Related Work
  – Conclusions
- Current and Future Directions

Evaluation Environment
- Implementation: on the IR of SUIF1
  – high-level control structure retained
  – instructions within basic blocks dismantled
- Simulation: trace-driven, with Simics
- Architecture: Stanford Hydra CMP
  – 4 single-issue processors
  – private 16KB L1 caches
  – private fully associative 2KB speculative buffers
  – shared on-chip 2MB L2 cache

Applications
- Subset of the SPEC2000 benchmarks: 4 floating-point and 5 integer
- MinneSPEC reduced input sets:
  – input size: 2–3 billion instructions
  – simulated instructions: 100M to 600M
- Focus on load imbalance and squash overheads: none of the loops suffered from overflow
- 190 loops in total, collectively accounting for about 50% to 100% of the sequential execution time of most applications

Speedup Distribution
(Figure: per-loop speedup distribution.)
- Very varied speedup/slowdown behavior

Model Accuracy (I): Outcomes
(Figure: predicted vs. measured outcomes per loop.)
- Only 23% false positives (performance degradation)
- Negligible false negatives (missed opportunities)
- Most speedups/slowdowns are correctly predicted by the model

University of Karlsruhe - January Error less than 50% for 77% of the loops Model Accuracy (II): Cumulative Errors Distribution Acceptable errors, but room for improvement


Performance Improvements
(Figure: speedups of the model-driven selection policy vs. alternatives.)
- Mostly better performance than previous policies
- Close to the performance of the oracle
- Curbs the performance degradation of the naive policy

Outline
- Motivation
- Speculative Parallelization
- Compiler Cost Model
  – Evaluation
  – Related Work
  – Conclusions
- Current and Future Directions

Related Work
- Architectures supporting speculative parallelization:
  – Multiscalar processor (Wisconsin)
  – Hydra (Stanford)
  – Clustered speculative multithreaded processor (UPC)
  – Thread-Level Data Speculation (CMU)
  – MAJC (Sun)
  – Superthreaded processor (Minnesota)
  – Multiplex (Purdue)
  – CMP with speculative multithreading (Illinois)

Related Work
- Compiler support for speculative parallelization:
  – most of the above projects have a compiler branch
  – thread partitioning and optimizations based on simple heuristics and/or profiling
- Recent publications on compiler cost models:
  – Chen et al. (PPoPP’03): a mathematical model, concentrating on probabilistic points-to analysis
  – Du et al. (PLDI’04): a cost model of squash overhead based on the probability of dependences; only intended for a CMP with 2 processors
- No literature found on a cost model that includes load imbalance and the other overheads, and that handles several processors

Conclusions
- A compiler cost model of speculative multithreaded execution
- Fairly accurate quantitative predictions of speedup:
  – correctly identifies speedup/slowdown in 73% of cases
  – errors of less than 50% in 77% of cases
- Good model-driven selection policy:
  – usually faster than other policies and within 11% of the oracle
  – curbs the performance degradation of the naive policy
- Can accommodate all other speculative execution overheads
  – accuracy not as high as the PACT’04 results, but still good for a static scheme

Outline
- Motivation
- Speculative Parallelization
- Compiler Cost Model
  – Evaluation
  – Related Work
  – Conclusions
- Current and Future Directions

Current and Future Directions
- Software-only speculative parallelization:
  – speculatively parallelized the Convex Hull problem (CG) (CGA’04)
  – parallelizing the Minimum Enclosing Circle (CG) and Simultaneous Multiple Sequence Alignment (Bioinformatics) problems
  – extending the scheme to reduce/eliminate overheads
- Complete compiler model of overheads:
  – use probabilistic memory disambiguation analysis to factor the squash overhead into the model
  – use probabilistic cache miss models to factor the speculative buffer overflow overhead into the model

Current and Future Directions
- Probabilistic memory disambiguation:
  – extend current points-to, alias, and data-flow analyses to generate probabilities of occurrence for these relations
  – necessary infrastructure for all quantitative cost models for data speculation
- Other speculative multithreading models:
  – combining speculative parallelization with speculative helper threads

Current and Future Directions
- Speculative compiler optimizations:
  – perform traditional compiler optimizations in the presence of potential data-flow relations (e.g., loop distribution/fusion; hoisting; hyperblock instruction scheduling)
  – use spare contexts in SMT (Hyperthreading) processors to run/verify speculative optimizations (“helper threads”)
  – add TLS support for deep speculation

Acknowledgments
- Research team and collaborators:
  – Prof. Diego Llanos (University of Valladolid, Spain)
  – Prof. Belen Palop (University of Valladolid, Spain)
  – Jialin Dou
  – Constantino Ribeiro
  – Salman Khan
  – Syamsul Bin Hussin
- Funding:
  – UK: EPSRC
  – EC: TRACS
  – EC: HPC Europa


Squashing
(Figure: timeline of a producer thread i and consumer threads i+j, i+j+1, i+j+2; a write (Wr) by the producer after the consumer’s read (Rd) discards useful work, possibly correct work, and wasted correct work, and adds the squash overhead itself.)
- Squashing is very costly

Compute Sequential and Parallel Execution Times (contd.)
- Each p_{i,j} is computed as follows.
- Squash overheads:

    p_{i,j} =
      p_i,  for the original size and j = 0
      (1 − p_producer)^j · p_base + (1 − (1 − p_producer)^j) · p_base · (1 − p_dep),  for the original size and j ≠ 0
      (1 − (1 − p_producer)^j) · p_base · p_dep,  for the squashed size

- Overflow overheads:

    p_{i,j} =
      p_i,  for the original size and j = 0
      p_i · (1 − p_ovflow) + p_i · p_ovflow · (1 − Σ_{k ∈ longer} p_k)^j,  for the original size and j ≠ 0
      ( (Σ_{k=1..wait} p_k)^j − (Σ_{k=1..wait−1} p_k)^j ) · p_ovflow · p_base,  for the overflowed size

Model Accuracy (III): Squash Prediction
(Figure: predicted vs. measured squash overhead.)

University of Karlsruhe - January Sources of largest errors (top 10%) Source of errorNumberError (%) Incorrect IR workload estimation 454~116 Unknown iteration count(i<P) 354~61 Unknown inner loop iteration count 298~161 Biased conditional 1136