Chapter 4 Itanium EPIC Processor Architecture

ECE 371 Microprocessors Chapter 4 Itanium EPIC Processor Architecture Herbert G. Mayer, PSU Status 11/5/2015 For use at CCUT Fall 2015

Syllabus
Introduction
Intel® Itanium® Architecture
Data and Memory
Itanium Registers
Instruction Set Architecture ISA
Assembler Source Program
Appendix
Bibliography

Photo of Itanium 2 Processor

Itanium Processor Block Diagram

Introduction
The Itanium® processor is Intel’s first published, commercial 64-bit computer product, launched in 2001 and co-developed with HP Corp. IPF stands for Itanium Processor Family.
“Published” means: Intel was at the same time developing a competing 64-bit processor, the extended version of its old x86 architecture, as a secret backup to hedge its risk.
“64-bit” means that the logical address range spans 2^64 different memory bytes, and that natural integer objects are 64 bits wide. The exact format of data objects is described in section Data and Memory.
During its development at Intel, the first generation of Itanium processors was internally code-named Merced.
The family is now officially called IPF. Early in its development it was referred to as IA-64, for Intel 64-bit architecture, a name that later invited confusion with the 64-bit version of x86.

Introduction
Intel’s Itanium architecture is radically different from the widely used 32-bit IA-32 architecture.
IA-32 should be referred to as the x86 architecture, lest one incorrectly infer today that it is restricted to 32-bit addresses and 32-bit integer types. That limitation no longer exists since Intel introduced 64-bit versions, about half a year after AMD’s extension of IA-32 to 64 bits; see also EM64T.
Imagine how Intel felt when AMD, the company that had produced CPUs compatible with Intel’s chips, suddenly had a more advanced, more attractive x86 CPU!

Intel® Itanium® Architecture
Interestingly, IA-32 object code is executable on Itanium processors. More interesting yet, even Hewlett-Packard PA-RISC code is executable on this novel 64-bit IPF processor.
HP and Intel were strategic partners in the definition, development, and cost sharing of the IPF, with HP having initiated the development.
Be cautious about performance inferences! Just because IA-32 object code is executable on IPF, one should not deduce that such code executes on IPF as fast as on an x86 processor.

IPF is Intel’s and HP’s first instance of the novel EPIC architecture. EPIC stands for Explicitly Parallel Instruction Computing.
It is Intel’s first launched 64-bit architecture; the second was launched later (first quarter of 2004) with EM64T, the first 64-bit version of the old x86 architecture. HP already had a 64-bit version of its Precision Architecture (PA) RISC processor at the time of the Itanium launch.
Explicit means that the assembly language programmer (or the smart compiler) bears the intellectual burden of taking advantage of the parallelism in the architecture; see ref [8]. It is not the processor that automatically exploits the numerous parallel computing modules; the microprocessor needs to be told!

As a consequence, compilers for IPF are highly complex; see Donald Knuth’s comment, ref [7]. Compiler complexity is not desirable, as it means more errors and decreased object code quality, something a new architecture should avoid.
On the other hand, the IPF provides explicit architectural features that enable implementing highly optimizing compilers. A case in point is architectural support for software pipelined loops (SW PL). Certain source constructs let the compiler emit SW PL loops that need no prologue and no epilogue. Absence of prologue and epilogue renders the object code not only more compact, but also faster.

Parallel means that an Itanium processor gains speed not solely via high clock rates, but via simultaneous execution of multiple operations in one clock cycle.
Key concepts refined, or newly introduced, in IPF include: predication, branch prediction, branch elimination, conditional move, speculation, parallel comparisons, and a large register file.
The first implementation of the new 64-bit Intel + HP Itanium architecture implemented only 44 physical address bits of the 64 logical address bits.

With 44 bits, the total initial address range of the first Itanium HW was only about a millionth of the logical address range (2^44 / 2^64 = 2^-20), but still about 4000 times larger than that of the earlier 32-bit architecture (2^44 / 2^32 = 4096).
In its second generation, 56 physical bits of the 64-bit logical address space were implemented in HW. Product name of that new version: Itanium® 2.
Short-term, no severe limitations were expected from the restricted 56-bit addresses: that space is still about 16 million times larger than the 32-bit address space. Integer type operands are of course a full 64 bits wide.
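The ratios quoted above follow directly from powers of two. A minimal sketch that recomputes them, using only the bit widths named in the text (the constant and function names are chosen here for illustration):

```python
# Address widths named in the text, in bits.
LOGICAL_BITS = 64     # logical address width of IPF
MERCED_BITS = 44      # first-generation Itanium, per the text
ITANIUM2_BITS = 56    # second generation, per the text
IA32_BITS = 32        # earlier 32-bit architecture


def space(bits: int) -> int:
    """Number of distinct byte addresses reachable with `bits` address bits."""
    return 1 << bits


# How much larger is one address space than another?
logical_vs_merced = space(LOGICAL_BITS) // space(MERCED_BITS)   # "a millionth"
merced_vs_ia32 = space(MERCED_BITS) // space(IA32_BITS)         # "4000 times"
itanium2_vs_ia32 = space(ITANIUM2_BITS) // space(IA32_BITS)     # "16 million times"

print(logical_vs_merced)   # 1048576
print(merced_vs_ia32)      # 4096
print(itanium2_vs_ia32)    # 16777216
```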

Unlike earlier parallel VLIW architectures, EPIC does not use a fixed width instruction encoding. Instead, operational functions can be combined to operate in parallel, from a single instruction to as many as desired.
What is critical in EPIC is that all code is written assuming parallel semantics within a group (to be explained later) and sequential semantics across groups.
To be able to run in parallel, the machine is built with multiple execution modules that can all work at the same time. This allows a natural architecture migration from, say, 6 HW modules executing on today’s Itanium to as many as can be crammed into a future silicon microprocessor a few years from now.

To illustrate with a sample taken from ref [1], consider 2 memory operands a and b to be swapped:
temp := a; // a, b, temp are memory locs
a := b;
b := temp;
The semicolon operator ‘;’ implies sequential semantics. On a machine with parallel semantics, it would be sufficient to write:
a := b, // operand latching needed
b := a; // operand latching needed
The comma operator ‘,’ implies parallel semantics, similar to syntactic conventions in the programming language Algol-68.
This source snippet is just a generic example, NOT a sample of the Itanium assembly language.
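The difference between sequential and parallel semantics can be tried out in any language that offers a simultaneous assignment. The sketch below uses Python's tuple assignment as a stand-in for the comma operator: all right-hand sides are read ("latched") before any target is written. The function names are illustrative only.

```python
def swap_sequential(a, b):
    """Three sequential assignments with a temporary, as in the first snippet."""
    temp = a
    a = b
    b = temp
    return a, b


def swap_parallel(a, b):
    """Parallel semantics: both operands are read before either is written."""
    a, b = b, a
    return a, b


def swap_without_latching(a, b):
    """The two assignments executed sequentially, with no temporary.

    The old value of a is lost before it is read, so this is NOT a swap.
    """
    a = b
    b = a
    return a, b


print(swap_sequential(1, 2))        # (2, 1)
print(swap_parallel(1, 2))          # (2, 1)
print(swap_without_latching(1, 2))  # (2, 2)
```

The last function shows why operand latching is needed: without it, the two-statement form silently duplicates b.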

Data & Memory

Data and Memory
Native data types of IPF resemble those of conventional 32-bit architectures, except for the longer 64-bit integer and unsigned formats. An extension over IA-32 object code is the IPF bundle.
Data types include integer, unsigned, floating-point, and pointer. Integers come in different widths: byte, word, double-word, or quad-word precision. Length in bits as well as min and max values are listed below:

Data and Memory, Min Max
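The lengths and value ranges of the four integer widths can be recomputed from the two's complement definition. A small sketch, assuming the Intel naming used in this chapter (word = 2 bytes); the helper names are illustrative:

```python
# Integer widths in bits, using the Intel convention that a word is 2 bytes.
WIDTHS = {"byte": 8, "word": 16, "double-word": 32, "quad-word": 64}


def signed_range(bits: int) -> tuple:
    """Min and max of a two's complement integer that is `bits` wide."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def unsigned_range(bits: int) -> tuple:
    """Min and max of an unsigned integer that is `bits` wide."""
    return 0, (1 << bits) - 1


for name, bits in WIDTHS.items():
    smin, smax = signed_range(bits)
    _, umax = unsigned_range(bits)
    print(f"{name:12} {bits:2} bits  signed {smin} .. {smax}  unsigned 0 .. {umax}")
```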

Data and Memory
Negative numbers are represented in two’s complement format, with the sign bit in the most significant position. Floating-point data use the IEEE 754 standard.
Bits representing integer values are numbered from 0 in the least significant (rightmost) position to higher values. For example, the most significant bit in a double word is in position 31. (Note the unusual word definition on Intel architectures: 2 bytes.)
The maximum address on the first-generation Itanium processor (Merced) was only 2^44 - 1 = 17,592,186,044,415. The address width grew to 56 bits in the second generation, and is now a full 64 bits.

Data and Memory
Bytes are stored in little-endian order by default.
It is possible to select little- or big-endian order programmatically, by setting the be bit in the user mask, a special status register. The be bit (for big-endian) does not affect how instructions are stored in or fetched from memory: object code is always represented in little-endian order; programmer-selected endianness only impacts data.
In little-endian order, the least significant byte of a data object is stored at the lowest address; in big-endian order, the most significant byte is stored at the lowest address.

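Data and Memory
Data quad-word 0x1102030455060708 would be stored as follows.
Data stored in 8 adjacent bytes in memory, listed from the lowest to the highest address, in little-endian order: 08 07 06 55 04 03 02 11
Same int value 0x1102030455060708 stored in big-endian order: 11 02 03 04 55 06 07 08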
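The two byte orders of the quad-word from the slide can be checked with Python's built-in integer conversion. This is only an illustration of the storage convention, not Itanium code:

```python
VALUE = 0x1102030455060708   # the quad-word from the slide

# Index 0 of each bytes object corresponds to the lowest memory address.
little = VALUE.to_bytes(8, byteorder="little")
big = VALUE.to_bytes(8, byteorder="big")

print(little.hex(" "))   # 08 07 06 55 04 03 02 11
print(big.hex(" "))      # 11 02 03 04 55 06 07 08

# Reading the bytes back with the matching convention restores the value.
assert int.from_bytes(little, byteorder="little") == VALUE
assert int.from_bytes(big, byteorder="big") == VALUE
```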

Itanium Registers
The Itanium processor has 128 general registers (GR), 128 floating-point registers (FR), 64 single-bit predicate registers (PR), 8 branch registers (BR), and 128 application registers (AR).
In addition, there are Performance Monitor Data registers (PMD), processor identifiers (CPUID), a Current Frame Marker register (CFM), a user mask (UM), and an instruction pointer register (IP).
GRs, FRs, BRs, ARs, CPUIDs, IP, and PMDs are 64 bits wide. PRs are 1 bit wide, while the UM holds 6 bits and the CFM 38 bits; depicted below:

Itanium Register File

Itanium Registers GR
The 128 GR registers are the common workhorses during computation. They contain the integer values being computed. It is possible to use these integer values as machine addresses; thus GRs can be used as pointers in load and store operations.
All machine instructions can refer to these registers, for reading and writing values.
In addition to the 64 data bits, each GR has an associated NaT bit, which stands for Not a Thing. NaT is 1 if the associated register does not hold valid data.

Itanium Registers GR
NaT bits support speculation. For example, if a speculative load is issued but aborted before the value arrives in its destined GR, the NaT state records that fact. This preserves the integrity of the machine’s exception process.
There are 2 groups of GR registers. The first 32, GR0 through GR31, are visible to all software, and are used to hold globally computed, intermediate values. However, GR0 is read-only, providing the constant 0, 64 bits long.

Itanium Registers GR
The next 96, GR32 to GR127, are used to implement a small but frequently used portion of the top of the run-time stack; i.e. they work like a special-purpose top-of-stack cache.
These stack registers are made available to SW by allocation of a register stack frame, which includes from 0 to 96 registers. Registers of this subset that are not part of the current frame are inaccessible to general SW.
The stack frame portion implemented via GRs is further partitioned into subsections: one meant to hold local registers, the other output registers, i.e. values the current function passes on to the functions it calls.

Sample Stack Frame, Generic

Itanium Predicate Registers PR
Execution of most IPF instructions can be predicated by one of the PRs.
Value 1 in the PR means: the operation can be completed normally. PR value 0 means the result will not be posted (committed), even if it has been computed already; i.e. there will be no stores and no impact on any register of the machine.
An exception, i.e. an instruction that cannot be predicated, is the loop operation.

Itanium Predicate Registers
The PRs are also partitioned into 2 sections: PR0 through PR15 are static PRs; the other 48 are so-called rotating PRs.
PR0 is an exceptional register: it can only be read, and its value is always 1, meaning the predicate is true; thus PR0 denotes unconditional execution.
The rotating 48 PRs are used to hold stage predicates, used during software pipelining. SW PL is to be discussed in advanced computer architecture.

Branch Registers BR
IPF instructions are grouped in bundles, which are 16-byte aligned byte sequences holding executable code. Hence the rightmost 4 address bits of a bundle are always 0 due to alignment; these 4 address bits do not need to be stored explicitly.
Execution of an indirect branch requires an explicit operand. On the Itanium architecture this operand is a branch register; a branch register BR holds the branch destination. The machine then loads the value of the referenced BR into the IP register and execution continues from there; IP stands for Instruction Pointer.
Executing branch-related instructions is about the only way to directly affect the value in the instruction pointer, the register that holds the address of the next bundle to be executed.

Current Frame Marker Register CFM
Note: Frame Marker is often referred to as Stack Frame, and its fixed portion as the Stack Marker.
Each function has a specific stack frame associated with it, which is created at function invocation and cleared at function return.
If all the relevant data of a function’s stack frame fit, they are placed in the stack of general registers; else the overflowing data must reside in memory. Either way, the current frame marker (CFM) holds the frame marker for the function that is currently active.
Generally, most functions have small stack frames.

Current Frame Marker Register CFM
Layout of the CFM: Meaning of Bits in CFM:

Application Registers AR
Application Registers – t.b.d.:

Instruction Pointer IP
IPF instructions are fetched in units of bundles, which are chunks of 16 bytes, or 128 bits. Bundles are stored bundle-aligned.
The ip can address 2^64 = 18,446,744,073,709,551,616 different bytes (but only at bundle addresses).
The rightmost 4 bits of the ip thus will always be zero, due to the bundle alignment. Hence these 4 bits do not need to be stored on the microprocessor silicon.
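The alignment argument can be made concrete with a few bit operations. A minimal sketch (function names are illustrative): an address is bundle-aligned exactly when its low 4 bits are zero, so only the upper 60 bits carry information.

```python
BUNDLE_BYTES = 16                         # one bundle = 16 bytes = 128 bits
LOW_BITS = BUNDLE_BYTES.bit_length() - 1  # 4 address bits implied by alignment


def is_bundle_aligned(address: int) -> bool:
    """True if the low 4 address bits are all zero."""
    return address & (BUNDLE_BYTES - 1) == 0


def bundle_index(address: int) -> int:
    """Drop the 4 always-zero bits of a bundle address."""
    if not is_bundle_aligned(address):
        raise ValueError(f"{address:#x} is not a bundle address")
    return address >> LOW_BITS


def bundle_address(index: int) -> int:
    """Recover the full byte address from the stored upper bits."""
    return index << LOW_BITS


print(LOW_BITS)                          # 4
print(is_bundle_aligned(0x4000))         # True
print(is_bundle_aligned(0x4008))         # False
print(hex(bundle_address(bundle_index(0x4010))))  # 0x4010
print(2 ** 64 // BUNDLE_BYTES == 2 ** 60)         # True: 2^60 bundle addresses
```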

Performance Monitor Data Register
These are architecture-provided resources that record the use of hardware modules. Their contents are read-only for SW. But contrary to the performance monitor registers on Intel Pentium architectures, they are user visible on Itanium.

Instruction Set Architecture
Itanium ISA Instruction Set Architecture

Instruction Set Architecture ISA
Parallelism, Dependences, and Groups
Itanium instructions packaged in groups can execute in parallel; this allows fast execution, if HW is available!
The assembly programmer or compiler may craft groups as large as desired. The performance consequence is: all operations embedded in a single group can be executed simultaneously, in parallel, saving time over the equivalent sequential execution.
The physical silicon angle of this is: of all operations that could be executed in parallel, only those are actually performed in parallel for which HW resources exist. E.g. on an Itanium® 2 implementation of IPF, there are 6 units available to operate in parallel.

Parallelism, Dependences, and Groups
If a group holds fewer actions than there are HW units, some HW will idle. If a group holds more actions than there are HW units, then all HW elements are active, yet some of the possible parallelism is lost on this implementation; future HW implementations may execute that same object code faster due to their higher degree of parallelism.
Parallel execution is not feasible if dependences exist between instructions. On Itanium these dependences are not resolved by the machine. It is the human programmer or the optimizer that explicitly tracks what can be done in parallel, and what must be done in sequence. The machine just runs it; goal: TO BE FAST!

Parallelism, Dependences, and Groups
If a result has to be computed first before it can be read somewhere else (memory or register), a true dependence exists, AKA data dependence; it is conventional to just say “dependence”. On Itanium we call this a RAW (Read after Write) dependence.
If a result has to be read first before it can be re-computed, a false dependence is created, AKA anti-dependence. On Itanium this is named WAR (Write after Read) dependence.
If a result has to be computed first before it can be computed again, assuming that an intermediate reference is possible, an output dependence is created. Itanium calls this third dependence WAW (Write after Write) dependence.

Parallelism, Dependences, and Groups
In all these cases, the prior operation has to complete before the dependent one can be started; e.g.:
ld8 r14 = [r3] ;;   // load GR14 with 8 bytes addressed by GR3; stop ends the group
add r15 = r14, r16  // integer sum into GR15, RAW dependence on r14
This is an example of RAW dependence, AKA true dependence. The loading of an 8-byte value into (8-byte) register GR14 must complete first, before the addition of the 2 long integer values, held in GR14 and GR16, can be started.
Note the assembler register names: r14, and not gr14. This is the Intel and HP assembly language convention! Another assembler may use different conventions.
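The three dependence kinds can be told apart mechanically from the registers an instruction writes and reads. The following sketch is a simplified model, not an Itanium tool: each instruction is reduced to one destination and a tuple of sources (the `Instr` type and `dependences` function are names invented for this example).

```python
from typing import NamedTuple


class Instr(NamedTuple):
    """Simplified instruction: one destination register, several sources."""
    dest: str
    srcs: tuple


def dependences(first: Instr, second: Instr) -> set:
    """Dependences of `second` on `first`, where `first` is earlier in program order."""
    kinds = set()
    if first.dest in second.srcs:
        kinds.add("RAW")   # second reads what first writes: true dependence
    if second.dest in first.srcs:
        kinds.add("WAR")   # second overwrites what first reads: anti-dependence
    if first.dest == second.dest:
        kinds.add("WAW")   # both write the same register: output dependence
    return kinds


# The example from the slide: ld8 r14 = [r3] followed by add r15 = r14, r16
ld8 = Instr(dest="r14", srcs=("r3",))
add = Instr(dest="r15", srcs=("r14", "r16"))
print(dependences(ld8, add))   # {'RAW'}
```

An empty result set means the two instructions could be placed in the same group, as far as register dependences are concerned.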

Assembly Language Format
Format of an Itanium assembler instruction:
[(pr)] mnemonic[.comp] dest = src1 [, src2 [, src3 ] ]
In the meta-syntax, [ and ] brackets mean that the bracketed portion of the instruction is optional. In assembly syntax, square bracket pairs [] express indirection. Be careful not to confuse the 2 different contexts!
Meaning of the various assembly language fields:

Assembly Language Format
A sample assembly language instruction is shown next:
(p0) add r5 = r4, r3, 1 // (p0) can be skipped
This is an integer add instruction that sums up the integer values in GR4 and GR3, also adds the integer literal 1, and assigns the sum to register GR5. Since the predicate register used is PR0, which is always true, the commit of the sum to register GR5 is unconditional, as if no predicate qualifier had been given.
Predicate registers, when listed, are enclosed in ( ) parentheses.
Not all instructions allow or need a completer. Typical completers are shown below. Some instructions allow multiple completers, notably the memory access instructions and branch instructions.

Itanium Bundle Format
Executable code on Itanium comes in units of bundles. A bundle consists of 3 instructions, all grouped with an associated template.
The template completes the instruction specification and, above all, defines group boundaries. A boundary is also known as a stop. A stop defines where one group ends and another group starts.
If no stop is included in a template, the bundle will be part of a larger group that continues with instructions in the next bundle.

Itanium Bundle Format
Each instruction is 41 bits long; a template consumes 5 bits, one template per bundle. With 3 instructions per bundle, the overall bundle length is 3 * 41 + 5 = 128 bits, fitting into 16 bytes.
All bundles are bundle-aligned, which is easily accomplished by placing the first bundle on a mod-16 memory boundary; from then on all will be aligned on 16-byte boundaries.
With the memory bus being 128 bits wide (or wider on future IPF implementations) and bundles being bundle-aligned, fetching instruction memory is fast, requiring one single transfer on the bus.

Itanium Bundle Format
The general layout of a bundle is shown next, with bits ordered from 0 through 127, increasing from right to left.
The template serves as a means for the compiler to communicate additional information about instructions 1, 2, and 3, without which they could be ambiguous. One such key piece of information is the placement of an instruction group stop, written ;; in assembler.
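The bit budget of a bundle can be exercised with a small pack/unpack sketch. The field sizes (5-bit template, three 41-bit slots) come from the text; the exact placement used here, template in bits 0..4 followed by the three slots in ascending bit order, is an assumption of this example, since the layout figure is not reproduced in the transcript. Slot contents are arbitrary integers, not real instruction encodings.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOTS_PER_BUNDLE = 3
SLOT_MASK = (1 << SLOT_BITS) - 1

assert SLOTS_PER_BUNDLE * SLOT_BITS + TEMPLATE_BITS == 128


def pack_bundle(template: int, slots) -> int:
    """Combine a template and 3 slot values into one 128-bit integer.

    Assumed layout: template in bits 0..4, slot k in the next 41 bits each.
    """
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit into 5 bits")
    if len(slots) != SLOTS_PER_BUNDLE:
        raise ValueError("a bundle holds exactly 3 instruction slots")
    bundle = template
    for k, slot in enumerate(slots):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each slot must fit into 41 bits")
        bundle |= slot << (TEMPLATE_BITS + k * SLOT_BITS)
    return bundle


def unpack_bundle(bundle: int):
    """Split a 128-bit bundle back into (template, (slot0, slot1, slot2))."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = tuple(
        (bundle >> (TEMPLATE_BITS + k * SLOT_BITS)) & SLOT_MASK
        for k in range(SLOTS_PER_BUNDLE)
    )
    return template, slots


packed = pack_bundle(0x01, (0x123, 0x456, SLOT_MASK))
print(packed.bit_length() <= 128)   # True
print(unpack_bundle(packed) == (0x01, (0x123, 0x456, SLOT_MASK)))   # True
```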

Itanium Bundle Format
A group stop can occur after instruction 2, or 1, or 0, indicating that an earlier group must complete execution before another starts. But an Itanium bundle allows at most 2 stops.
If 3 stops are needed, a NOOP must be packed into one of the instruction slots, to effectively create 2 physical groups, with the third being the NOOP, whose execution order does not matter. Compiler-generated code performs this work-around automatically.

Itanium Bundle Format The template specifies which types of instructions are assembled into slot 0, 1, and 2 IPF instructions are partitioned into the following 6 groups:

Itanium Bundle Format
Providing such information in the template speeds up instruction decoding, improving execution speed.
A list with the Instruction Set Architecture (ISA) templates and embedded stops is shown next. Note that there are at most 2 stops in any of the formats. On an architecture that aims to have large groups, it seems logical to have few stops (max 2) per bundle.

Itanium Bundle Format
The difference between the above templates 0x00 and 0x01, both being MII type bundles, is: after instruction 2 in template 0x01 there is a stop, while in template 0x00 there is none.
In other words, the bundle following one with template 0x00 will belong to the same group, and a higher degree of parallelism will be possible there.

Itanium Assembly Code
A group is a sequence of 1 or more instructions delimited by a stop. The first instruction in a whole program is thought to be preceded by a stop. Similarly, the last instruction of a complete program is thought to be followed by a stop.
All instructions placed into a single group can be executed in parallel. Whether or not they will be depends on the number of hardware resources available. In the initial Itanium implementation only 6 resources were available. In a later implementation, more HW resources may become available, thus potentially speeding up execution of the same old, unchanged Itanium code on a future generation.
The ;; indicates to the assembler where one group ends and thus the next group starts.
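How stops carve an instruction stream into groups, and how the number of HW units bounds the achievable parallelism, can be sketched with a toy model. The model is deliberately idealized and is an assumption of this example: every instruction takes one cycle and any unit can execute any instruction, which real Itanium hardware does not guarantee. Instruction names such as "i1" are placeholders.

```python
import math

STOP = ";;"


def split_groups(stream):
    """Split a flat list of instruction names and ';;' stops into groups."""
    groups, current = [], []
    for item in stream:
        if item == STOP:
            if current:              # a stop closes the current group
                groups.append(current)
                current = []
        else:
            current.append(item)
    if current:                      # the program end acts as a final stop
        groups.append(current)
    return groups


def min_cycles(groups, units=6):
    """Idealized cycle count: a group of n instructions needs ceil(n / units) cycles."""
    return sum(math.ceil(len(group) / units) for group in groups)


stream = ["i1", "i2", "i3", ";;", "i4", ";;",
          "i5", "i6", "i7", "i8", "i9", "i10", "i11", "i12"]
groups = split_groups(stream)
print([len(g) for g in groups])      # [3, 1, 8]
print(min_cycles(groups, units=6))   # 4
print(min_cycles(groups, units=8))   # 3
```

The last two lines mirror the point made on the slide: the unchanged code needs fewer cycles once a hypothetical later implementation offers 8 units instead of 6.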

Itanium Assembly Code
Some assembly language instructions follow:
cmp.eq p1, p2 = r33, r34
This checks general purpose registers 33 and 34 for equality; if equal, predicate register 1 is set to true and predicate register 2 to false. Otherwise p1 is set to false and p2 to true.
A more complicated case is:
(p3) cmp.eq.unc p1, p2 = r33, r34
This first checks whether predicate register 3 is true. If so, and if registers GR33 and GR34 are equal, p1 is set to true and p2 to false; if they are unequal, the reverse. Otherwise, i.e. if p3 is false a priori, predicate registers 1 and 2 are both set to false.
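The two compare forms can be modeled on a list of 64 booleans standing in for the predicate registers. This is a behavioral sketch only; the function names are invented for the example. One detail not spelled out on the slide is included as stated in the code comment: for the plain form, a false qualifying predicate leaves both targets unchanged.

```python
def new_predicates():
    """64 predicate registers; PR0 is hard-wired to true."""
    pr = [False] * 64
    pr[0] = True
    return pr


def cmp_eq(pr, qp, p_true, p_false, a, b, unc=False):
    """Model of: (qp) cmp.eq[.unc] p_true, p_false = a, b

    pr      : list of 64 booleans (the predicate register file)
    qp      : index of the qualifying predicate (0 means unconditional)
    a, b    : the two register values being compared
    unc     : True for the .unc form
    """
    if pr[qp]:
        equal = (a == b)
        pr[p_true] = equal
        pr[p_false] = not equal
    elif unc:
        # .unc form: a false qualifying predicate clears both targets
        pr[p_true] = False
        pr[p_false] = False
    # plain form with a false qualifying predicate: targets stay unchanged


pr = new_predicates()
cmp_eq(pr, 0, 1, 2, a=7, b=7)            # cmp.eq p1, p2 = r33, r34 with equal values
print(pr[1], pr[2])                      # True False

pr[3] = False
cmp_eq(pr, 3, 1, 2, a=7, b=7, unc=True)  # (p3) cmp.eq.unc with p3 false
print(pr[1], pr[2])                      # False False
```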

Assembler Source Program Stack Unwind Operations
With & Without Stack Unwind Operations From ref [8]

Assembler for Hello World, With
// hello_world.c assembly with unwind directive
// sample taken from ref [8]
        .file   "hello.c"
        .pred.safe_across_calls p1-p5, p16-p63
        .section .rdata, "a", "progbits"
        .align  8
.STRING1:
        stringz "Hello World!!!\n"
        .text
        .align  16
        .global hello#
        .proc   hello#
hello:
        .prologue
        .save   ar.pfs, r34
        alloc   r34 = ar.pfs, 0, 4, 1, 0
        .vframe r35
        mov     r35 = r12
        .save   rp, r33
        mov     r33 = b0                  // save branch register b0 in GR33
        .body
        addl    r36 gp ;;
        ld8     r36 = [r36]
        mov     r32 = r1
        br.call.sptk.many b0 = printf#    // b0!
        mov     r1 = r32
        mov     ar.pfs = r34
        mov     b0 = r33                  // restore branch register
        .restore sp
        mov     r12 = r35
        br.ret.sptk.many b0
        .endp   hello#
        .global printf#
        .type

Assembler for Hello World, Without
// hello_world.c assembly without unwind directive
// sample taken from ref [8]
// The string is defined in the read only data section
        .section .rdata, "a", "progbits"
        .align  8
.STRING1:
        stringz "Hello World!!!\n"
// definition of function hello is in text section
// Registers to be saved in local registers:
//   gp = r1  - loc0 = r32
//   rp = b0  - loc1 = r33
//   ar.pfs   - loc2 = r34
//   sp = r12 - loc3 = r35
        .text
        .global hello
        .proc   hello
hello:
        alloc   loc2 = ar.pfs, 0, 4, 1, 0
        mov     loc3 = sp
        mov     loc1 = b0                 // save branch register b0
        addl    out0 gp ;;
        ld8     out0 = [out0]             // group of 3 instructions
        mov     loc0 = gp
        br.call.sptk.many b0 = printf
        mov     gp = loc0
        mov     ar.pfs = loc2
        mov     b0 = loc1
        mov     sp = loc3
        br.ret.sptk.many b0
        .endp   hello
        .global printf
        .type
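The register names in the two listings are related through the alloc instruction. The sketch below recomputes the mapping under one assumption that the slides do not state explicitly: the first three numeric operands of alloc are taken to be the counts of input, local, and output registers, in that order, and the stacked registers are handed out consecutively starting at r32. The function name is illustrative.

```python
FIRST_STACKED = 32     # GR32 is the first stacked general register
MAX_STACKED = 96       # GR32 .. GR127


def frame_registers(inputs: int, locals_: int, outputs: int) -> dict:
    """Map symbolic names (in*, loc*, out*) to physical names r32, r33, ...

    Assumes alloc's operands are counts of inputs, locals, outputs in that order.
    """
    if min(inputs, locals_, outputs) < 0:
        raise ValueError("register counts must not be negative")
    if inputs + locals_ + outputs > MAX_STACKED:
        raise ValueError("a register stack frame holds at most 96 registers")
    names = {}
    reg = FIRST_STACKED
    for prefix, count in (("in", inputs), ("loc", locals_), ("out", outputs)):
        for k in range(count):
            names[f"{prefix}{k}"] = f"r{reg}"
            reg += 1
    return names


# alloc loc2 = ar.pfs, 0, 4, 1, 0   (the trailing 0, rotating registers, is ignored here)
frame = frame_registers(inputs=0, locals_=4, outputs=1)
print(frame)
# {'loc0': 'r32', 'loc1': 'r33', 'loc2': 'r34', 'loc3': 'r35', 'out0': 'r36'}
```

Under that assumption the result agrees with the comment table in the second listing (loc0 = r32 through loc3 = r35) and with the use of r36 as the printf argument register in the first listing.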

Appendix: Some Definitions

Definitions Branch Elimination
Replacing object code that has conditional branches with code that has a straightforward execution path, lacking branches.
The second version, with branches eliminated, must be semantically equivalent to the original code with branches.
Everything else being equal, the version without branches generally executes faster due to fewer cache misses.

Definitions Bundle
Group of 3 instructions plus a template, which all fit into a 16-byte long, 16-byte aligned section of instruction memory on Itanium. Total number of bits = 128.

Definitions Conditional Move
Move instruction that transfers bits from source to destination, but only if an associated condition is true. Otherwise the instruction operates like a noop.
Such a move can serve as a special case of branch elimination. For example, the C source construct:
if ( a > 0 ) x = 99; // HL source program
could be mapped into the conditional move:
cmov x, #99, a, #0, gt // hypothetical asm
which has no branches. Source operand #99 is moved into memory location x only if the > condition holds between operand a and integer literal 0.

Definitions Endian, Endianness
A convention that defines in which order the bytes of a multi-byte data object are addressed. It can be programmed on Itanium with the be bit.
If the byte at the higher address holds the higher-valued part of the number, we call this little-endian; typical on the Intel x86 architecture.
The other way around we call big-endian ordering; typical on the IBM 370 architecture.

Definitions EPIC
Explicitly Parallel Instruction Computing, with IPF being the first commercial architecture that implements EPIC. Note IPF’s ability to also execute old Intel x86 and old HP PA object code.

Definitions Epilogue
When the steady state of a software pipelined loop completes, there may be operands yet to be used and operations yet to be computed that would not fit into the steady state.
These last operands must be consumed, some even generated, during the epilogue, and ultimately the pipeline must be drained. This is accomplished in the object code after the steady state, and that portion of code is called the epilogue.
See also prologue.

Definitions Group
A sequence of one or more instructions that ends at a defined stop; the stops are encoded in the templates of the bundles holding the instructions. A group may extend over one bundle or more.
The stop means that the hardware cannot start executing any subsequent group until the current group has completed.
Syntax notation for a stop in Itanium assembler is the double semicolon ;;

Definitions Parallel Comparison
A composite source program condition of the form:
( ( a > b ) && ( c <= d ) )
requires multiple steps to compute a boolean predicate.
Generally, on a sequential architecture these multiple steps are combined via explicit instructions for anding and oring, or else the flow of control selects a matching true label. All this takes time.
The Itanium processor allows parallel evaluation of certain composite boolean expressions in one single step. The result can be used as a predicate in subsequent instructions.
Notice that such combined boolean expressions must be side-effect free. This is not equivalent to C’s short-circuit evaluation of complex boolean expressions!

Definitions Parallel Comparison, Cont’d
For example, another complex boolean expression:
( fun( j, k ) && ( i < MAX ) )
cannot be mapped into a parallel EPIC comparison, since one operand is a function call fun( j, k ), with a possibly large number of parameters, which may have a side effect on one of the other operands, for example “i”, which is yet to be compared.
This type of boolean expression is mapped into sequential code.

Definitions Predication
Is the association of a boolean condition with the execution of an instruction sequence. This allows the following:
Two instruction streams can be executed in parallel, clearly requiring multiple hardware modules; these are provided on EPIC.
Both streams have a predicate associated with their operations. Only the stream with the true predicate is actually retired; the other is aborted and ignored.
The abort can happen as soon as the predicate is known. This means the computation of the predicate can proceed in parallel with the execution of the two code streams, but must complete by the time these 2 code streams wait to learn which of them is the winner.
An ISA with predication requires instruction bits to name the predicate to use, and which direction (true or false) to select.
Also, the discarded code path must not contain any side effect, such as a write to memory!

Definitions Prologue
Before a software pipelined loop body can be initiated, hardware resources (e.g. registers) must be initialized; we say the loop must be primed.
This is accomplished in the object code before the steady state, called the prologue.
See also epilogue.

Definitions Register File
The IPF has a rich set of registers. This includes 128 general purpose registers (for integer operations), 128 floating-point, 64 predicate, 8 branch, and 128 so-called application registers.
Also a variety of special purpose registers is visible; visible means accessible by the assembly language program. These include a user mask, stack marker (frame marker), ip, processor id, and performance monitoring registers.

Definitions Speculation
If it is suspected, but not certain, that operand o will be used in the future, and this operand is not readily available (not yet in a high-speed register), and it takes long to fetch o, a processor may initiate the fetch well before o is actually used.
Advantage: by the time o is needed, it is already available without delay.
Disadvantage: if the flow of control never reaches the place where o was thought to be needed, then the speculative fetch was superfluous.
It may still be meaningful, if a) no side effects occurred that are harmful to program correctness, and b) the hardware resource required to fetch o was idle anyway; then nothing is lost!

Definitions Steady State
The software pipelined object code executed repeatedly, after the prologue has completed and before the epilogue becomes active, is called the steady state.
Each iteration of the steady state makes some progress toward multiple iterations of the original source loop.
See also prologue and epilogue.
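The relation between prologue, steady state, and epilogue can be made visible with a schedule of a generic software pipelined loop. The sketch is a textbook-style model, not Itanium-specific: the loop body is assumed to be split into `stages` stages of one cycle each, a new source iteration starts every cycle, and `iterations >= stages`. All names are chosen for this example.

```python
def pipeline_schedule(iterations: int, stages: int):
    """For each cycle, list the (stage, source_iteration) pairs that are active."""
    cycles = []
    for t in range(iterations + stages - 1):
        active = [(s, t - s) for s in range(stages) if 0 <= t - s < iterations]
        cycles.append(active)
    return cycles


def phase(t: int, iterations: int, stages: int) -> str:
    """Classify cycle t, assuming iterations >= stages."""
    if t < stages - 1:
        return "prologue"       # pipeline is filling: not all stages are busy yet
    if t < iterations:
        return "steady state"   # every stage works on a different source iteration
    return "epilogue"           # pipeline is draining


N, S = 5, 3                     # 5 source iterations, 3 pipeline stages
for t, active in enumerate(pipeline_schedule(N, S)):
    print(t, phase(t, N, S), active)
```

In the steady-state cycles each line lists all 3 stages, each working on a different source iteration, which is the sense in which one steady-state iteration advances several iterations of the original loop.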

Definitions Syllable
Is the instruction-only portion of a bundle.
A bundle always holds 3 instructions plus a template, the template specifying additional necessary information about the instructions. The instruction alone, without the needed template information, is a syllable.

Bibliography
Triebel, Walter: “IA-64 Architecture for Software Developers”, Intel Press © 2000, 308 pages
iid=c2d2e0aecd2b7110VgnVCM d6e10RCRD&ciid=ce1f d701521c7110VgnVCM d6e10RCRD
Donald Knuth: “Interview with Donald Knuth”
Intel® Itanium® Architecture Assembly Reference Guide, © 2002, Intel order number , at
