Compiling for IA-64
Carol Thompson, Optimization Architect, Hewlett Packard


History of ILP Compilers
CISC era: no significant ILP
–Compiler is merely a tool to enable use of high-level language, at some performance cost
RISC era: advent of ILP
–Compiler-influenced architecture
–Instruction scheduling becomes important
EPIC era: ILP as driving force
–Compiler-specified ILP

Increasing Scope for ILP Compilation
Early RISC compilers
–Basic block scope (delimited by branches & branch targets)
Superscalar RISC and early VLIW compilers
–Trace scope (single entry, single path)
–Superblocks & Hyperblocks (single entry, multiple path)
EPIC compilers
–Composite regions: multiple entry, multiple path
[Diagram: basic blocks nested within traces and superblocks, nested within composite regions]

Unbalanced and Unbiased Control Flow
Most code is not well balanced
–Many very small blocks, some very large
–Then and else clauses are frequently unbalanced, in both number of instructions and pathlength
Many branches are highly biased
–But some are not
–Compiler can obtain frequency information from profiling, or derive it heuristically

Basic Blocks
Basic blocks are simple
–No issues with executing unnecessary instructions
–No speculation or predication support required
But, very limited ILP
–Short blocks offer very little opportunity for parallelism
–Long-latency code is unable to take advantage of issue bandwidth in an earlier block

Traces
Traces allow scheduling of multiple blocks together
–Increases available ILP
–Long-latency operations can be moved up, as long as they are on the same trace
But, unbiased branches are a problem
–Long-latency code on slightly less frequent paths can't move up
–Issue bandwidth may go unused (not enough concurrent instructions to fill available execution units)

Superblocks and Hyperblocks
Superblocks and hyperblocks allow inclusion of multiple important paths
–Long-latency code may migrate up from multiple paths
–Hyperblocks may be fully predicated
–More effective utilization of issue bandwidth
But, requires code duplication
Wholesale predication may lengthen important paths

Composite Regions
Allow rejoin from non-region code
–Wholesale code duplication is not required
–Support full code motion across the region
–Allow all interesting paths to be scheduled concurrently
Nested, less important regions bear the burden of the rejoin
–Compensation code, as needed

Predication Approaches
Full predication of entire region
–Penalizes short paths

On-Demand Predication
Predicate (and speculate) as needed
–Reduce critical path(s)
–Fully utilize issue bandwidth
Retain control flow to accommodate unbalanced paths

Predicate Analysis
Instruction scheduler requires knowledge of predicate relationships
–For dependence analysis
–For code motion
–…
Predicate Query System
–Graphical representation of predicate relationships
–Superset, subset, disjoint, …

Predicate Computation
Compute all predicates possibly needed
Optimize
–To share predicates where possible
–To utilize parallel compares
–To fully utilize dual targets

Predication and Branch Counts
Predication reduces branches
–At both moderate and aggressive optimization levels

Predication & Branch Prediction
Comparable misprediction rate with predication
–Despite significantly fewer branches
⇒ Increased mean time between mispredicted branches

Register Allocation
Modeled as a graph-coloring problem
–Nodes in the graph represent live ranges of variables
–Edges represent a temporal overlap of the live ranges
–Nodes sharing an edge must be assigned different colors (registers)
Example:
  x = ...
  y = ...
  ... = x
  z = ...
  ... = y
  ... = z
[Interference graph: edges x–y and y–z; x is dead before z is defined, so the graph requires two colors]

Register Allocation, With Control Flow
[Diagram: x and y are defined before a branch; z is defined and used on one path while x is defined and used on the other. x and z are never simultaneously live, so the graph still requires only two colors.]

Register Allocation, With Predication
[Diagram: the same code after if-conversion, with the x and z definitions guarded by predicates in a single straight-line block. Without predicate information, the live ranges of x, y, and z all appear to overlap: now requires three colors.]

Predicate Analysis
[Diagram: predicate graph with p0 at the root and p1, p2 as children, guarding the x and z definitions]
p1 and p2 are disjoint
–If p1 is TRUE, p2 is FALSE, and vice versa

Register Allocation, With Predicate Analysis
[Diagram: with p1 and p2 known to be disjoint, x (guarded by p1) and z (guarded by p2) are never live simultaneously, so their interference edge disappears: now back to two colors]

Effect of Predicate-Aware Register Allocation
Reduces register requirements for individual procedures by 0% to 75%
–Depends upon how aggressively predication is applied
Average dynamic reduction in register stack allocation for gcc is 4.7%

Object-Oriented Code
Challenges
–Small procedures, many of them indirect (virtual): limits size of regions and scope for ILP
–Exception handling
–Bounds checking (Java): inherently serial, must check before executing load or store
Solutions
–Inlining for non-virtual functions or provably unique virtual functions; speculative inlining for the most common variant
–Liveness analysis of handlers; speculative execution guarantees correct exception behavior
–Architectural support for speculation ensures recoverability
–Dynamic optimization (e.g. Java): make use of the dynamic profile

Method Calls
Barrier between execution streams
Often, the location of the called method must be determined at runtime
–Costly "identity check" on the object must complete before the method may begin
–Even if the call nearly always goes to the same place
–Little ILP
[Diagram: resolve target method, then call-dependent code, branching to one of several possible targets]

Speculating Across Method Calls
Compiler predicts the target method
–From profiling
–From the current state of the class hierarchy
Predicted method is inlined
–Full or partial
Speculative execution of the called method begins while the actual target is determined

Speculation Across Method Calls
[Diagram: the dominant called method is inlined and begins executing speculatively while the target method is resolved; the other target methods are reached through a fallback call if needed]

Bounds & Null Checks
Checks inhibit code motion
Null checks:
  x = y.foo;
becomes
  if( y == null ) throw NullPointerException;
  x = y.foo;
Bounds checks:
  x = a[i];
becomes
  if( a == null ) throw NullPointerException;
  if( i >= a.length ) throw ArrayIndexOutOfBoundsException;
  x = a[i];

Speculating Across Bounds Checks
Bounds checks rarely fail
  x = a[i];
becomes
  ld.s t = a[i];
  if( a == null ) throw NullPointerException;
  if( i >= a.length ) throw ArrayIndexOutOfBoundsException;
  chk.s t;
  x = t;
Long-latency load can begin before the checks

Exception Handling
Exception handling inhibits motion of subsequent code
  if( y.foo ) throw MyException;
  x = y.bar + z.baz;

Speculation in the Presence of Exception Handling
Execution of subsequent instructions may begin before the exception is resolved
  if( y.foo ) throw MyException;
  x = y.bar + z.baz;
becomes
  ld    t1 = y.foo
  ld.s  t2 = y.bar
  ld.s  t3 = z.baz
  add   x = t2 + t3
  if( t1 ) throw MyException;
  chk.s x

Dependence Graph for Instruction Scheduling
Source:
  if( n >= count ) {
    (*log)++;
    return p->x[n];
  } else {
    return 0;
  }
[Dependence graph nodes, as on the slide:]
  add t1 = 8,p
  ld4 count = [t1]
  cmp4.ge p1,p2 = n,count
  (p1) ld4 t3 = [log]
  (p1) add t2 = 1,t2
  (p1) st4 [log] = t2
  (p1) ld4 t3 = [p]
  shladd t4 = n,4,t3
  (p1) ld4 out0 = [t4]
  mov out0 = 0
  br.ret rp

Dependence Graph with Predication & Speculation
During dependence graph construction, potentially control- and data-speculative edges and nodes are identified
Check nodes are added where possibly needed (note that only data speculation checks, chk.a t4 and chk.a p, are shown here)

Dependence Graph with Predication & Speculation (continued)
Speculative edges may be violated; here the graph is redrawn to show the enhanced parallelism
Note that speculating both writes to the out0 register would require insertion of a copy; the scheduler must consider this in its scheduling
Nodes with sufficient slack (e.g. the writes to out0) will not be speculated

Conclusions
IA-64 compilers push the complexity of the compiler
–However, the technology is a logical progression from today's
–Today's RISC compilers are more complex, more reliable, and deliver more performance than those of the early days
Complexity trend is mirrored in both hardware and applications
–Need a balance to maximize benefits from each