EPIC 64-bit Architecture: The Itanium and Itanium 2 CSCE 380 5/10/2000, 12/11/2003, 5/7/2004 Draft
2 Disclaimer w This slide show is intended to explain the EPIC architecture in a general way. It is not intended to give a precise description of the architecture.
3 EPIC w Explicitly Parallel Instruction Computing w Earlier known as the IA-64 architecture w Originally defined jointly by Intel and Hewlett-Packard, but it now appears to be an Intel product. w 64-bit w Designed to be used in clusters of up to 128 processors (or more). A machine with over 1900 processors exists in Richland, WA. w Will run binary IA-32 programs.
4 Current state of affairs w Compilers typically write sequential, in-order code. w Advanced CISC chips use lots of hardware logic to try to execute the code in parallel. w Advanced CISC chips use lots of hardware logic to try to execute the code “out of order.” w RISC chips are not properly designed for parallel execution.
5 EPIC: Some design principles w Intel’s conclusion: Let the compiler produce parallel, out-of-order code. This simplifies chip logic. The space saved can be used for things like more registers. w EPIC has 128 general purpose integer registers, of which 96 can be stacked and renamed w It has 128 floating point registers, of which 96 can be rotated or renamed w It has 64 1-bit predicate registers, of which 48 can be rotated w It has “128” Application Registers which have special purposes. 56 are reserved and 16 are ignored. w It has 8 Branch Registers
6 EPIC: Explicitly Parallel Instruction Computing w Has resources for parallel execution: many registers, many functional units, inherently scalable w Explicit parallelism w Features: predication, speculation, …
7 VLIW: Very Long Instruction Word w EPIC packs 3 instructions into its 128-bit-long instruction word. w The compiler specifies the parallelism. w The three instructions together are called a bundle. w Bundle layout, from bit 127 down to bit 0: Instruction 2 | Instruction 1 | Instruction 0 | Template
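The bundle layout can be sketched in Python as plain bit packing. The slide gives only the 128-bit total and the order of the fields; the widths used here (a 5-bit template in the low bits and three 41-bit instruction slots, 5 + 3 × 41 = 128) are not stated on the slide and should be checked against the Software Developer's Manual. The names `pack_bundle` and `unpack_bundle` are hypothetical.

```python
SLOT_BITS = 41       # width of one instruction slot (assumption, not on the slide)
TEMPLATE_BITS = 5    # width of the template field (assumption, not on the slide)
SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1


def pack_bundle(template, slot0, slot1, slot2):
    """Pack a template and three instruction slots into one 128-bit integer."""
    fields = ((template, TEMPLATE_MASK), (slot0, SLOT_MASK),
              (slot1, SLOT_MASK), (slot2, SLOT_MASK))
    for value, mask in fields:
        if not 0 <= value <= mask:
            raise ValueError("field does not fit in its bit width")
    # Template sits in the low bits; instruction 2 ends at bit 127.
    return (template
            | slot0 << TEMPLATE_BITS
            | slot1 << (TEMPLATE_BITS + SLOT_BITS)
            | slot2 << (TEMPLATE_BITS + 2 * SLOT_BITS))


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, slot0, slot1, slot2)."""
    template = bundle & TEMPLATE_MASK
    slot0 = (bundle >> TEMPLATE_BITS) & SLOT_MASK
    slot1 = (bundle >> (TEMPLATE_BITS + SLOT_BITS)) & SLOT_MASK
    slot2 = (bundle >> (TEMPLATE_BITS + 2 * SLOT_BITS)) & SLOT_MASK
    return template, slot0, slot1, slot2
```

The slot values here are opaque integers; encoding real instructions into a slot is outside the scope of this sketch.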
8 Basic instructions Up to 3 in a bundle w Basic Itanium instruction: [qp]mnemonic[.comp] dest = srcs w qp: optional predicate register. The result is committed only if qp = 1 w mnemonic: unique instruction name w comp: optional variation for the instruction w dest: destination of the result w srcs: one or more sources (Intro to programming, page 1:132)
9 Instruction Groups w Instruction groups: groups of instructions that do not have RAW or WAW register dependencies w Depending on the machine, 1, 2, …, or all of the instructions in the group can be issued in parallel w Hence, instructions in a single instruction group cannot have: RAW (Read after Write) dependencies: one instruction cannot read a register written by another instruction in the same group. WAW (Write after Write) dependencies: two instructions cannot write to the same register. (Software Developer’s Manual, page 1:133)
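The two rules can be turned into a small checker. This is a minimal Python sketch, not the architecture's full rule set: it looks only at register names, ignores memory dependencies, and ignores any special cases the manual may allow. The function name `group_hazards` and the `(dests, srcs)` representation are made up for illustration.

```python
def group_hazards(group):
    """Return (kind, register, earlier_index, later_index) for each hazard.

    `group` is a list of instructions in program order; each instruction is a
    (dests, srcs) pair of register-name collections.
    """
    hazards = []
    for j, (dests_j, srcs_j) in enumerate(group):
        for i in range(j):
            dests_i = set(group[i][0])
            # RAW: a later instruction reads what an earlier one writes.
            for reg in sorted(dests_i & set(srcs_j)):
                hazards.append(("RAW", reg, i, j))
            # WAW: two instructions write the same register.
            for reg in sorted(dests_i & set(dests_j)):
                hazards.append(("WAW", reg, i, j))
    return hazards


# ld8 r1 = [r5] and sub r6 = r8, r9 may share a group:
ok_group = [(["r1"], ["r5"]), (["r6"], ["r8", "r9"])]
# ld8 r1 = [r5] and add r3 = r1, r4 may not (r1 is read after being written):
bad_group = [(["r1"], ["r5"]), (["r3"], ["r1", "r4"])]
```

An empty result means the instructions could legally form one group under these two rules; any hazard means a stop is needed somewhere between the two instructions involved.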
10 Instruction groups: Example 1
    ld8 r1 = [r5] ;;     // first group
    add r3 = r1, r4      // second group
;; is a stop: it marks the end of an instruction group. Stops may appear inside or at the end of a bundle. There can be several bundles in a group. ld8 loads 8 bytes from memory; st8 stores 8 bytes to memory.
11 Instruction groups: Example 2
    ld8 r1 = [r5]
    sub r6 = r8, r9 ;;    // first group
    add r3 = r1, r4
    st8 [r6] = r12 ;;     // second group
    add r3 = 1, r6 ;;     // third group
The second group reads r1 and r6, which the first group writes (read after write). The third group writes r3, which the second group also writes (write after write). Goals: 1. Put as many instructions in a group as possible to allow as much parallelism as possible. 2. Load in advance to avoid waiting for memory. (Adapted from Software Developer’s Manual, page 1:133)
12 Instruction groups w Using pseudo code, break the following into instruction groups: w a = b - c; d = e + f + g; h = a * i; j = d;
    load r1 = b
    load r2 = c
    load r3 = e
    load r4 = f
    load r5 = g
    load r6 = i ;;
    sub r7 = r1, r2        // r7 = b - c
    add r8 = r3, r4 ;;     // r8 = e + f
    store a = r7
    add r9 = r8, r5        // r9 = e + f + g
    mul r10 = r7, r6 ;;    // r10 = a * i
    store d = r9
    store h = r10
    store j = r9 ;;
Hopefully, additional instructions can be added to the groups before forming bundles.
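One way to place the stops mechanically is a greedy pass: keep adding instructions to the current group until one reads or writes a register that the group has already written, then start a new group. The Python sketch below illustrates that idea only. It does not reorder instructions the way a real compiler would, and it tracks registers but not memory. `split_into_groups` and the `(text, dests, srcs)` tuples are hypothetical names.

```python
def split_into_groups(instructions):
    """Greedily split in-order instructions into groups free of RAW/WAW hazards."""
    groups, current, written = [], [], set()
    for text, dests, srcs in instructions:
        # A register already written in this group may be neither read (RAW)
        # nor written again (WAW) by a later instruction of the same group.
        if current and written & (set(dests) | set(srcs)):
            groups.append(current)          # this is where a ;; stop would go
            current, written = [], set()
        current.append(text)
        written |= set(dests)
    if current:
        groups.append(current)
    return groups


# The exercise from the slide: a = b - c; d = e + f + g; h = a * i; j = d;
PROGRAM = [
    ("load r1 = b", ["r1"], []),
    ("load r2 = c", ["r2"], []),
    ("load r3 = e", ["r3"], []),
    ("load r4 = f", ["r4"], []),
    ("load r5 = g", ["r5"], []),
    ("load r6 = i", ["r6"], []),
    ("sub r7 = r1, r2", ["r7"], ["r1", "r2"]),
    ("add r8 = r3, r4", ["r8"], ["r3", "r4"]),
    ("store a = r7", [], ["r7"]),
    ("add r9 = r8, r5", ["r9"], ["r8", "r5"]),
    ("mul r10 = r7, r6", ["r10"], ["r7", "r6"]),
    ("store d = r9", [], ["r9"]),
    ("store h = r10", [], ["r10"]),
    ("store j = r9", [], ["r9"]),
]

groups = split_into_groups(PROGRAM)
print([len(g) for g in groups])   # [6, 2, 3, 3]
```

With the instructions in the order shown on the slide, the greedy pass reproduces the same four groups.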
13 Instructions w Standard: add r1=r2,r3 w Predicated: (p4) add r1=r2,r3 (converted to a no-op if p4 is false) w With immediate: add r2=r3,1 w With completer (an instruction option): cmp.eq p3=r2,r4 w Memory operations: ld r1=[r4] (load), st [r4]=r1 (store)
14 Minimizing Memory Latency w Latency: time wasted while waiting for memory w Problem: in normal CPUs, jumps make it difficult to schedule loads in advance, so the CPU stalls on cache misses
15 Jumps w Example: if (a > b) x = y; else x = z; w Pentium Pro type solution. CPU predicts which branch will be used and starts execution. If right: great. If not: the pipeline has to be cleared. Cycles are wasted.
16 Jumps w Intel claims that 5 to 10% mispredicts can cause a 30-40% performance cut! How could this be? w Assume instructions normally can be processed in 1 cycle w Assume 30% of instructions are branches w Assume 10% of branches are mispredicted w Then 3% of instructions cause mispredicted branches w Assume 13 cycles are lost for each mispredicted branch w For every 100 instructions (100 cycles), 3 mispredictions cost 3 × 13 = 39 extra cycles, so 39 of 139 cycles, or 28%, are wasted
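The arithmetic on this slide can be reproduced in a few lines of Python. This is only a restatement of the slide's assumptions (one cycle per instruction, 30% branches, 10% mispredicted, 13 cycles lost per mispredict); the function name `wasted_cycles` is made up for illustration. Exact fractions are used so that rounding does not blur the result.

```python
from fractions import Fraction


def wasted_cycles(instructions=100, branch_pct=30, mispredict_pct=10, penalty=13):
    """Return (wasted cycles, total cycles, wasted fraction) for the slide's model."""
    # Mispredicted branches = instructions * branch rate * mispredict rate.
    mispredicts = Fraction(instructions * branch_pct * mispredict_pct, 100 * 100)
    wasted = mispredicts * penalty          # cycles lost refilling the pipeline
    total = instructions + wasted           # one cycle per instruction plus the penalty
    return wasted, total, wasted / total


wasted, total, fraction = wasted_cycles()
print(wasted, total, round(float(fraction) * 100))   # 39 139 28
```

Changing `mispredict_pct` or `penalty` shows how sensitive the wasted fraction is to each assumption.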
17 Itanium solution: Predication w Source code: if (a > b) x = y; else x = z; w EPIC solution: both branches are executed, but results are stored only if the associated predicate is true (using pseudo code):
    p1 = a>b
    p2 = not(a>b)
    ...
    (p1) load y
    (p2) load z
    ...
    (p1) store x
    (p2) store x
    ...
w There are 64 predicate registers such as p1 and p2.
18 Predication w Sample C code:
    if (a) b = c + d;
    if (e) h = i - j;
w Pseudo code for Itanium using predication (no branching is needed):
    cmp.ne p1, p2 = a, r0   // p1 = (a != 0)
    cmp.ne p3, p4 = e, r0   // p3 = (e != 0)
    (p1) add b = c, d       // if a != 0 then add
    (p3) sub h = i, j       // if e != 0 then subtract
w Note: r0 is always 0 w (Software Developer’s Manual, page 1:135)
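The effect of predication can be imitated in Python. This is a rough model, not how the hardware works internally: every operation is computed unconditionally, and a helper decides whether the result is committed. The names `commit` and `run_predicated` are hypothetical, and the starting values of `b` and `h` are assumed to be 0.

```python
def commit(predicate, old_value, new_value):
    """Model a predicated instruction: keep the result only if the predicate is true."""
    return new_value if predicate else old_value


def run_predicated(a, c, d, e, i, j, b=0, h=0):
    """Emulate: if (a) b = c + d;  if (e) h = i - j;  without control flow."""
    p1 = a != 0                    # cmp.ne p1, p2 = a, r0
    p3 = e != 0                    # cmp.ne p3, p4 = e, r0
    b = commit(p1, b, c + d)       # (p1) add b = c, d
    h = commit(p3, h, i - j)       # (p3) sub h = i, j
    return b, h
```

Both the add and the subtract are always evaluated; only the commit depends on the predicate, which is why the two statements can be issued together without a jump.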
19 EPIC memory loads w Values are loaded as far in advance as possible w Code can verify variables are loaded before actual use
20 Jumps and loads w In an effort to reduce the latency problem, the compiler will try to load data as far in advance as possible. The code might be modified as follows (using pseudo code; XXXXX stands for other instructions):
    XXXXX
    XXXXX
    load r10, y
    XXXXX
    XXXXX
    load r11, z
    XXXXX
    XXXXX
    XXXXX
    p1 = a>b
    p2 = not(a>b)
    XXXXX
    (p1) store x, r10
    (p2) store x, r11
    XXXXX
w Preloading is called hoisting loads
21 Register Rotation w Consider for (i = 0; i <=n; i++) b[i] = a[i] + 1; w Traditional compilers might code the inner loop statement as
    load ax, a[i]
    inc ax
    store b[i], ax
w Even if multiple execution units are available, the loop is executed sequentially because register ax is specified.
22 Register Rotation w In loop structures, register rotation and renumbering allow a compiler to specify one register while in reality multiple registers are being used.
    Cycle    r32     r33       r34       Instruction
    1        A[0]                        load r32, a[i]
    2        A[1]    A[0]+1              inc r33
    3        A[2]    A[1]+1    A[0]+1    store b[i], r34
23 Register Rotation and Software Pipelining w The EPIC compiler can specify that registers should be used on a rotational basis.
    for (i = 0; i < n; i++) b[i] = a[i] + 1;

    load r32, a[i]       // value is rotated into r33
    add r34 = r33,1      // values rotated into r34 and r35
    store b[i] = r35     // values rotated into r35 and r36
w Predicate registers and prolog and epilog counters are used to start and stop loops
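The rotating-register loop can be simulated in Python to see how three stages overlap. This is a simplified sketch: a four-entry list stands in for r32 to r35, rotation is modeled by shifting the list once per iteration, and plain index comparisons take the place of the predicate registers and the loop and epilog counters. The names `rotate` and `pipelined_increment` are hypothetical.

```python
R32, R33, R34, R35 = range(4)      # positions in the toy rotating register window


def rotate(regs):
    """Rename registers upward by one: the value in r32 becomes visible as r33, etc."""
    return [None] + regs[:-1]


def pipelined_increment(a):
    """Compute b[i] = a[i] + 1 with load, add, and store overlapped across iterations."""
    n = len(a)
    b = [None] * n
    regs = [None] * 4
    for it in range(n + 2):        # n iterations plus 2 more to drain the pipeline
        if it >= 2:                # store stage works on element it - 2
            b[it - 2] = regs[R35]
        if 1 <= it <= n:           # add stage works on element it - 1
            regs[R34] = regs[R33] + 1
        if it < n:                 # load stage works on element it
            regs[R32] = a[it]
        regs = rotate(regs)        # done by the loop branch on real hardware
    return b


print(pipelined_increment([10, 20, 30]))   # [11, 21, 31]
```

Each pass through the loop body names the same three registers, yet every element gets its own storage because the names shift between passes. The first two passes act as the prolog (filling the pipeline) and the last two as the epilog (draining it).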
24 Loop count and epilog registers w Loop control requires branches and overhead in standard computers. w EPIC provides loop count and epilog registers which together with the predicate registers allow taking care of the overhead in loop setup and cleanup without branches
25 Procedure calls w Procedure calls are highly desirable but … w In traditional computers, procedure calls require pushes and pops (memory operations) to handle parameters w Memory is slow w Procedure calls are slow
26 Procedure calls w 96 integer registers can be used like a stack w Calling and called procedures can share some registers w If the register stack overflows, registers are sent to memory in the background w Register saves are unneeded w Example: suppose procedure A calls procedure B(x,y). [Figure: the integer register stack, showing the frames of Proc A and Proc B and the registers holding param x and param y.]
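A toy Python model can show how a caller and a callee share registers on a register stack. It is a sketch under simplifying assumptions: the callee's frame is assumed to begin at the caller's last `n_outputs` registers, so parameters are passed without copying, and overflow to memory is not modeled. The class `RegisterStack` and its methods are hypothetical names, not part of any real tool.

```python
class RegisterStack:
    """Toy model of a stacked integer register file (96 stacked registers)."""

    def __init__(self, size=96):
        self.physical = [0] * size
        self.frames = [(0, 0)]          # (base, frame size) per active procedure

    def alloc(self, frame_size):
        """Set the size of the current procedure's frame."""
        base, _ = self.frames[-1]
        if base + frame_size > len(self.physical):
            # Real hardware would send older registers to memory in the background.
            raise OverflowError("register stack overflow is not modeled")
        self.frames[-1] = (base, frame_size)

    def _slot(self, index):
        base, size = self.frames[-1]
        if not 0 <= index < size:
            raise IndexError("register outside the current frame")
        return base + index

    def read(self, index):
        return self.physical[self._slot(index)]

    def write(self, index, value):
        self.physical[self._slot(index)] = value

    def call(self, n_outputs):
        """Enter a callee whose frame starts at the caller's last n_outputs registers."""
        base, size = self.frames[-1]
        if n_outputs > size:
            raise ValueError("more outputs than registers in the frame")
        self.frames.append((base + size - n_outputs, n_outputs))

    def ret(self):
        """Return to the caller; its registers are intact without any save/restore."""
        self.frames.pop()
```

For example, procedure A could allocate four registers, write x and y into the last two, and call B with `call(2)`; B then sees x and y as its first two registers, and after `ret()` A's own registers are still in place.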
27 Comments w Itaniums use a flat 64-bit address space w They normally store data little endian but can support big endian operating systems
28 Floating point w Allows 32-bit, 64-bit, or 80-bit floating point values w Registers are 82 bits long w Uses software for division w Square roots use looping techniques
29 MMX semantic equivalence used when emulating an IA-32 chip w Integer registers can be treated as eight 8-bit, four 16-bit, or two 32-bit registers w Floating point registers can be treated as two 32-bit registers w This allows one instruction to process multiple data values (SIMD) w Provides MMX semantic equivalence
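Treating one 64-bit register as several small lanes can be sketched in Python with masks and shifts. This sketch assumes wraparound arithmetic inside each lane (a saturating variant would clamp instead) and shows only addition. The function name `packed_add` is hypothetical.

```python
def packed_add(x, y, lane_bits=8, total_bits=64):
    """Add two registers lane by lane, with no carry from one lane into the next."""
    mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, total_bits, lane_bits):
        lane = ((x >> shift) & mask) + ((y >> shift) & mask)
        result |= (lane & mask) << shift    # wrap around inside the lane
    return result


# Eight 8-bit lanes: the low lane overflows (0xFF + 0x01) but does not disturb its neighbor.
print(hex(packed_add(0x01FF, 0x0101)))                 # 0x200
# The same bits viewed as four 16-bit lanes give an ordinary 16-bit sum.
print(hex(packed_add(0x01FF, 0x0101, lane_bits=16)))   # 0x300
```

The same input bits produce different results depending on the lane width, which is the point of the "eight 8-bit, four 16-bit, or two 32-bit" view of a register.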
30 First implementation w Itanium was first released in 2001 after years of development and testing. w Initial applications: servers and high-powered workstations
31 Review of EPIC w 64-bit data path and registers. w The complexity of the logic needed for out-of-order and speculative execution is removed from the chip. w Order of calculation is up to the compiler w On-chip space can be used for registers.
32 Review of EPIC w Intel plans to continue development and production of 32-bit processors (IA-32) w Itanium 2 was released in 2002
33 Implementations
w Itanium versions:
    Speeds: 733 MHz, 800 MHz
    Cache: L3: 2 MB or 4 MB; L2: 96 KB; L1: 32 KB
w Itanium 2 versions (all have L2 cache: 256 KB, L1 cache: 32 KB):
    For multi-processor and dual-processor applications: speeds 1.5, 1.4, and 1.3 GHz; L3 cache 3 MB, 4 MB, or 6 MB
    For dual-processor applications (servers and workstations): speed 1.4 GHz; L3 cache 1.5 MB
    For lower-power, high-density dual-processor applications: speed 1 GHz (low power); L3 cache 1.5 MB
34 Pacific Northwest National Laboratories EMSL Supercomputer w Built by HP w Nearly GHz Itanium 2 processors, 2 per node w Uses Linux w Fifth-fastest unclassified computer in the world. w Speed: 11.8 teraflops theoretical; rating based on 8.63 teraflops measured running Linpack (solving dense linear equations) w One-half petabyte of disk space w Quadrics QSNet 2 interconnect that enables the processors to communicate in less than three microseconds.
36 Pacific Northwest National Laboratories EMSL Supercomputer (NWLinux description) (describes the computer)
37 References: w (IA 64 home page) * w (Real video) * w index.htm (Multiple media) * w architecture.htm * w index.htm (tutorials) In particular, "Introducing the IA-64 Architecture" * w *checked 5/10/00 but these links no longer work
38 References: w x.htm w dmanual.htm (Intel Itanium Architecture Software Developer’s Manual – October 2002) w htm (Intel Itanium 2 Processor Reference Manual for Software Development and Optimization) w /itanium2/demo/index.htm?iid=ipp_srvr_proc_itanium2+epic_animation& (Simplistic video)
39 CSCE 380 w Fall 1998, Spring 1999, Spring 2000, Fall 2003, Spring 2004 w James Brink