EPIC 64-bit Architecture: The Itanium and Itanium 2 CSCE 380 5/10/2000, 12/11/2003, 5/7/2004 Draft.

1 EPIC 64-bit Architecture: The Itanium and Itanium 2 CSCE 380 5/10/2000, 12/11/2003, 5/7/2004 Draft

2 Disclaimer
• This slide show is intended to explain the EPIC architecture in a general way. It is not intended to give a precise description of the architecture.

3 EPIC
• Explicitly Parallel Instruction Computing
• Earlier known as the IA-64 architecture
• Originally defined jointly by Intel and Hewlett-Packard, but it now appears to be an Intel product.
• 64 bit
• Designed to be used in clusters of up to 128 processors (or more). A machine with over 1900 processors exists in Richland, WA.
• Will run binary IA-32 programs.

4 Current state of affairs
• Compilers typically write sequential, in-order code.
• Advanced CISC chips use lots of hardware logic to try to execute the code in parallel.
• Advanced CISC chips use lots of hardware logic to try to execute the code “out of order.”
• RISC chips are not properly designed for parallel execution.

5 EPIC: Some design principles
• Intel’s conclusion: let the compiler produce parallel, out-of-order code. This simplifies chip logic. The space saved can be used for things like more registers.
• EPIC has 128 general purpose integer registers, of which 96 can be stacked and renamed
• It has 128 floating point registers, of which 96 can be rotated or renamed
• It has 64 1-bit predicate registers, of which 48 can be rotated
• It has “128” Application Registers, which have special purposes. 56 are reserved and 16 are ignored.
• It has 8 Branch Registers

6 EPIC: Explicitly Parallel Instruction Computing
• Has resources for parallel execution
  - Many registers
  - Many functional units
  - Inherently scalable
• Explicit parallelism
• Features
  - Predication
  - Speculation
  - …

7 VLIW: Very Long Instruction Word
• EPIC packs 3 instructions into its 128-bit instruction word.
• The compiler specifies the parallelism.
• The three instructions together are called a bundle.

  bit 127                                                bit 0
  | Instruction 2 | Instruction 1 | Instruction 0 | Template |
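The bundle layout can be sketched in a few lines of Python. The slide gives only the overall picture; the field widths used here (a 5-bit template in the low bits followed by three 41-bit instruction slots) come from the published Itanium architecture, and the function names are purely illustrative.

```python
SLOT_BITS = 41       # width of one instruction slot
TEMPLATE_BITS = 5    # width of the template field
SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1


def pack_bundle(template, slot0, slot1, slot2):
    """Pack a template and three instruction slots into one 128-bit integer."""
    if not 0 <= template <= TEMPLATE_MASK:
        raise ValueError("template does not fit in 5 bits")
    for slot in (slot0, slot1, slot2):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("instruction does not fit in 41 bits")
    return (template
            | slot0 << TEMPLATE_BITS
            | slot1 << (TEMPLATE_BITS + SLOT_BITS)
            | slot2 << (TEMPLATE_BITS + 2 * SLOT_BITS))


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, (slot0, slot1, slot2))."""
    template = bundle & TEMPLATE_MASK
    slots = tuple((bundle >> (TEMPLATE_BITS + k * SLOT_BITS)) & SLOT_MASK
                  for k in range(3))
    return template, slots
```

Note that 5 + 3 × 41 = 128, so the four fields fill the bundle exactly. The instruction encodings themselves are not modeled; each slot is treated as an opaque 41-bit number.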

8 Basic instructions
• Up to 3 in a bundle
• Basic Itanium instruction: [qp] mnemonic[.comp] dest = srcs
• qp: optional predicate register. The result is committed only if qp = 1
• mnemonic: unique instruction name
• comp: optional variation for the instruction
• dest: destination of the result
• srcs: one or more sources
(Intro to programming, page 1:132)

9 Instruction Groups
• Instruction groups: groups of instructions that do not have RAW or WAW register dependencies
• Depending on the machine, 1, 2, …, or all of the instructions in the group can be issued in parallel
• Hence, instructions in a single instruction group cannot have
  - RAW (Read after Write) dependencies: one instruction cannot read a register written by another instruction in the same group.
  - WAW (Write after Write) dependencies: two instructions cannot write to the same register.
(Software Developer’s Manual, page 1:133)
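The two rules above can be checked mechanically. The following is a minimal sketch, assuming each instruction is described only by the registers it writes and reads; the `Instr` type and `group_violations` function are hypothetical names, and the special cases that the real architecture allows are ignored.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    text: str      # the instruction as written
    dests: tuple   # registers written
    srcs: tuple    # registers read


def group_violations(group):
    """Return (kind, register, earlier, later) for each RAW or WAW
    dependency between instructions of one proposed group."""
    violations = []
    written = {}   # register -> text of the instruction that wrote it
    for instr in group:
        for reg in instr.srcs:
            if reg in written:
                violations.append(("RAW", reg, written[reg], instr.text))
        for reg in instr.dests:
            if reg in written:
                violations.append(("WAW", reg, written[reg], instr.text))
            written[reg] = instr.text
    return violations
```

For example, placing `ld8 r1 = [r5]` and `add r3 = r1, r4` in the same group reports a RAW dependency on r1, which is why a stop is needed between them.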

10 Instruction groups: Example 1
    ld8 r1 = [r5] ;;    // first group
    add r3 = r1, r4     // second group
• ;; is a stop: the end of an instruction group
• Stops may appear inside or at the end of a bundle. There can be several bundles in a group.
• ld8 loads 8 bytes from memory
• st8 stores 8 bytes to memory

11 Instruction groups: Example 2
    ld8 r1 = [r5]
    sub r6 = r8, r9 ;;    // first group
    add r3 = r1, r4
    st8 [r6] = r12 ;;     // second group
    add r3 = 1, r6 ;;     // third group
• Read after write: the second group reads r1 and r6, which are written in the first group.
• Write after write: the third group writes r3, which is also written in the second group.
• Goals:
  1. Put as many instructions in a group as possible to allow as much parallelism as possible.
  2. Load in advance to avoid waiting for memory.
(Adapted from Software Developer’s Manual, page 1:133)

12 Instruction groups
• Using pseudo code, break the following into instruction groups:
• a = b - c; d = e + f + g; h = a * i; j = d;
    load r1 = b
    load r2 = c
    load r3 = e
    load r4 = f
    load r5 = g
    load r6 = i ;;
    sub r7 = r1, r2       // r7 = b - c
    add r8 = r3, r4 ;;    // r8 = e + f
    store a = r7
    add r9 = r8, r5       // r9 = e + f + g
    mul r10 = r7, r6 ;;   // r10 = a * i
    store d = r9
    store h = r10
    store j = r9 ;;
• Hopefully, additional instructions can be added to the groups before forming bundles.
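The grouping in this exercise can be reproduced with a simple greedy pass, sketched below in Python. It keeps the instructions in the order given and starts a new group whenever an instruction reads or writes a register already written in the current group. This is only one possible strategy, not how a real compiler schedules; `split_into_groups` is a hypothetical name, and memory dependencies between loads and stores are ignored.

```python
def split_into_groups(instrs):
    """instrs: list of (text, dests, srcs) in program order.
    Returns a list of groups, each a list of instruction texts."""
    groups, current, written = [], [], set()
    for text, dests, srcs in instrs:
        if written & (set(dests) | set(srcs)):   # RAW or WAW inside the group
            groups.append(current)               # place a stop here
            current, written = [], set()
        current.append(text)
        written |= set(dests)
    if current:
        groups.append(current)
    return groups


program = [
    ("load r1 = b", ["r1"], []),
    ("load r2 = c", ["r2"], []),
    ("load r3 = e", ["r3"], []),
    ("load r4 = f", ["r4"], []),
    ("load r5 = g", ["r5"], []),
    ("load r6 = i", ["r6"], []),
    ("sub r7 = r1, r2", ["r7"], ["r1", "r2"]),
    ("add r8 = r3, r4", ["r8"], ["r3", "r4"]),
    ("store a = r7", [], ["r7"]),
    ("add r9 = r8, r5", ["r9"], ["r8", "r5"]),
    ("mul r10 = r7, r6", ["r10"], ["r7", "r6"]),
    ("store d = r9", [], ["r9"]),
    ("store h = r10", [], ["r10"]),
    ("store j = r9", [], ["r9"]),
]

for number, group in enumerate(split_into_groups(program), start=1):
    print(f"group {number}: " + "; ".join(group))
```

Run on the slide's instruction sequence, the pass yields four groups of 6, 2, 3, and 3 instructions, matching the stops shown above.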

13 Instructions
• Standard: add r1 = r2, r3
• Predicated: (p4) add r1 = r2, r3 (converted to a no-op if p4 is false)
• With immediate: add r2 = r3, 1
• With completer (an instruction option): cmp.eq p3 = r2, r4
• Memory operations: ld r1 = [r4] (load), st [r4] = r1 (store)

14 Minimizing Memory Latency
• Latency: time wasted while waiting for memory
• Problem: in normal CPUs, jumps make it difficult to schedule loads in advance, so the CPU stalls on cache misses

15 Jumps
• Example: if (a > b) x = y; else x = z;
• Pentium Pro type solution: the CPU predicts which branch will be taken and starts executing it. If the prediction is right, great. If not, the pipeline has to be cleared and cycles are wasted.

16 Jumps
• Intel claims that a 5 to 10% mispredict rate can cause a 30-40% performance cut! How could this be?
• Assume instructions can normally be processed in 1 cycle
• Assume 30% of instructions are branches
• Assume 10% of branches are mispredicted
• Then 3% of instructions cause mispredicted branches
• Assume 13 cycles are lost for each mispredicted branch
• For every 100 instructions (100 useful cycles), 3 × 13 = 39 cycles are lost, so 39 of 139 cycles, or 28%, are wasted
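The arithmetic on this slide can be replayed with a short Python sketch. The function name and parameters are illustrative; the numbers are the slide's own assumptions.

```python
def mispredict_cost(instructions, branch_fraction, mispredict_rate,
                    penalty_cycles, cycles_per_instruction=1):
    """Return (wasted cycles, total cycles, wasted fraction of total)."""
    mispredicts = instructions * branch_fraction * mispredict_rate
    wasted = mispredicts * penalty_cycles
    total = instructions * cycles_per_instruction + wasted
    return wasted, total, wasted / total


wasted, total, fraction = mispredict_cost(100, 0.30, 0.10, 13)
print(f"{wasted:.0f} of {total:.0f} cycles wasted ({fraction:.0%})")
# prints "39 of 139 cycles wasted (28%)"
```

Measured the other way, as extra run time relative to the 100 useful cycles, the same 39 lost cycles are a 39% slowdown, which is how these assumptions reach the 30-40% range quoted by Intel.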

17 Itanium solution: Predication
• Source code: if (a > b) x = y; else x = z;
• EPIC solution: both branches are executed, but results are stored only if the associated predicate is true (using pseudo code)
    p1 = a > b
    p2 = not(a > b)
    ...
    (p1) load y
    (p2) load z
    ...
    (p1) store x
    (p2) store x
    ...
• There are 64 predicate registers such as p1 and p2.

18 Predication
• Sample C code:
    if (a) b = c + d;
    if (e) h = i - j;
• Pseudo code for Itanium using predication. No branching is needed.
    cmp.ne p1, p2 = a, r0    // p1 ← a != 0
    cmp.ne p3, p4 = e, r0    // p3 ← e != 0
    (p1) add b = c, d        // if a != 0 then add
    (p3) sub h = i, j        // if e != 0 then subtract
• Note: r0 is always 0
(Software Developer’s Manual, page 1:135)
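The commit rule behind this pseudo code can be illustrated with a small Python model. It is a sketch of the semantics only: every instruction computes its result, and the qualifying predicate decides whether the result is written back. The function name `run_predicated`, the dictionary-based registers, and the choice to start b and h at 0 are assumptions made for the illustration.

```python
import operator


def run_predicated(a, c, d, e, i, j):
    """Model the slide's four instructions; b and h start at 0."""
    regs = {"b": 0, "c": c, "d": d, "h": 0, "i": i, "j": j}
    # cmp.ne p1, p2 = a, r0  and  cmp.ne p3, p4 = e, r0
    preds = {"p1": a != 0, "p2": a == 0, "p3": e != 0, "p4": e == 0}
    code = [
        ("p1", "b", operator.add, ("c", "d")),   # (p1) add b = c, d
        ("p3", "h", operator.sub, ("i", "j")),   # (p3) sub h = i, j
    ]
    for qp, dest, op, srcs in code:
        result = op(*(regs[s] for s in srcs))    # always computed
        if preds[qp]:                            # committed only if qp is 1
            regs[dest] = result
    return regs["b"], regs["h"]
```

With a = 1 and e = 0 only the add is committed; with a = 0 and e = 7 only the subtract is. The `if` in the loop stands for the hardware's commit decision, not for a branch in the program being modeled.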

19 EPIC memory loads
• Values are loaded as far in advance as possible
• Code can verify that variables are loaded before actual use

20 Jumps and loads
• In an effort to reduce the latency problem, the compiler will try to load data as far in advance as possible. The code might be modified as follows (using pseudo code, where XXXXX stands for unrelated instructions):
    XXXXX
    XXXXX
    load r10, y
    XXXXX
    XXXXX
    load r11, z
    XXXXX
    XXXXX
    XXXXX
    p1 = a > b
    p2 = not(a > b)
    XXXXX
    (p1) store x, r10
    (p2) store x, r11
    XXXXX
• Preloading is called hoisting loads

21 Register Rotation
• Consider for (i = 0; i <= n; i++) b[i] = a[i] + 1;
• Traditional compilers might code the inner loop statement as
    load ax, a[i]
    inc ax
    store b[i], ax
• Even if multiple execution units are available, the loop is executed sequentially because every iteration names the same register, ax.

22 Register Rotation
• In loop structures, register rotation and renumbering allow a compiler to specify one register while in reality multiple registers are being used.

  Cycle   Instruction        r32     r33       r34
  1       load r32, a[i]     A[0]
  2       inc r33            A[1]    A[0]+1
  3       store b[i], r34    A[2]    A[1]+1    A[0]+1

23 Register Rotation and Software Pipelining
• The EPIC compiler can specify that registers should be used on a rotational basis.
    for (i = 0; i < n; i++) b[i] = a[i] + 1;
    load r32, a[i]       // value is rotated into r33
    add r34 = r33, 1     // values rotated into r34 and r35
    store b[i] = r35     // values rotated into r35 and r36
• Predicate registers and prolog and epilog counters are used to start and stop loops
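The effect of rotation on this loop can be simulated in Python. The sketch below is a simplified model, not the hardware mechanism: a rotation offset is decremented once per cycle so that the value written to r32 is read as r33 on the next cycle, and plain `if` tests on the cycle number stand in for the stage predicates and the loop count and epilog registers. The function name and the size of the rotating register file are assumptions.

```python
def pipelined_increment(a, num_rotating=8):
    """Compute b[i] = a[i] + 1 with a three-stage software pipeline."""
    n = len(a)
    b = [None] * n
    phys = [None] * num_rotating   # physical rotating registers
    base = 0                       # rotation offset

    def r(number):
        """Map a logical register number (32, 33, ...) to a physical index."""
        return (base + number - 32) % num_rotating

    for cycle in range(n + 2):     # n iterations plus 2 cycles to drain
        if cycle < n:                      # stage 1: load r32, a[i]
            phys[r(32)] = a[cycle]
        if 1 <= cycle <= n:                # stage 2: add r34 = r33, 1
            phys[r(34)] = phys[r(33)] + 1
        if 2 <= cycle <= n + 1:            # stage 3: store b[i] = r35
            b[cycle - 2] = phys[r(35)]
        base = (base - 1) % num_rotating   # rotate: r32 becomes r33, and so on
    return b
```

Each cycle names the same three logical registers, yet three different elements of the array are in flight at once, which is the point of the slide. The first two cycles play the role of the prolog and the last two the role of the epilog.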

24 Loop count and epilog registers
• Loop control requires branches and overhead in standard computers.
• EPIC provides loop count and epilog registers which, together with the predicate registers, take care of the overhead of loop setup and cleanup without branches

25 Procedure calls
• Procedure calls are highly desirable but …
• In traditional computers, procedure calls require pushes and pops (memory operations) to handle parameters
• Memory is slow
• Procedure calls are slow

26 Procedure calls
• 96 integer registers can be used like a stack
• Calling and called procedures can share some registers
• If the register stack overflows, registers are sent to memory in the background
• Register saves are unneeded
• Suppose procedure A calls procedure B(x, y)
  [Diagram: integer register stack in which the registers holding param x and param y lie in the overlap between Proc A's frame and Proc B's frame]
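The sharing of registers between caller and callee can be modeled with overlapping frames, as in the Python sketch below. The `RegisterStack` class and its methods are hypothetical; the model is unbounded, so the background spilling to memory on overflow is not represented, and the frame sizes in the usage example are made up.

```python
class RegisterStack:
    """Toy model of a stacked register file with overlapping frames."""
    FIRST_STACKED = 32   # stacked registers are numbered from r32

    def __init__(self, n_local, n_out):
        self.phys = {}                         # physical index -> value
        self.frames = [(0, n_local, n_out)]    # (base, locals, outputs)

    def _index(self, r):
        base, n_local, n_out = self.frames[-1]
        offset = r - self.FIRST_STACKED
        if not 0 <= offset < n_local + n_out:
            raise IndexError(f"r{r} is outside the current frame")
        return base + offset

    def write(self, r, value):
        self.phys[self._index(r)] = value

    def read(self, r):
        return self.phys[self._index(r)]

    def call(self, n_local, n_out):
        """Enter a callee whose frame starts at the caller's first output."""
        base, caller_local, _ = self.frames[-1]
        self.frames.append((base + caller_local, n_local, n_out))

    def ret(self):
        self.frames.pop()


# Proc A: 2 locals (r32, r33) and 2 outputs (r34, r35).
stack = RegisterStack(n_local=2, n_out=2)
stack.write(32, "A's local")
stack.write(34, 7)            # param x
stack.write(35, 9)            # param y
stack.call(n_local=3, n_out=0)
print(stack.read(32), stack.read(33))   # Proc B sees x and y as its r32, r33
stack.ret()
print(stack.read(32))                   # Proc A's local is untouched
```

No parameter is copied and nothing is pushed to memory: the callee simply sees the caller's output registers under new numbers.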

27 Comments
• Itaniums use a flat 64 bit address space
• They normally store data little endian but can support big endian operating systems

28 Floating point
• Allows 32 bit, 64 bit, or 80 bit floating point values
• Registers are 82 bits long
• Uses software for division
• Square roots use looping techniques

29 MMX semantic equivalence used when emulating an IA-32 chip
• Integer registers can be treated as eight 8 bit, four 16 bit, or two 32 bit registers
• Floating point registers can be treated as two 32 bit registers
• This allows one instruction to process multiple data values (SIMD)
• Provides MMX semantic equivalence
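Treating one 64-bit register as several narrower lanes can be sketched in Python as follows. The function `packed_add` is a hypothetical illustration of a lane-wise add with wraparound; it is not a model of any particular Itanium or MMX instruction.

```python
def packed_add(x, y, lane_bits):
    """Add two 64-bit values lane by lane; carries never cross lanes."""
    if lane_bits not in (8, 16, 32):
        raise ValueError("lane width must be 8, 16, or 32 bits")
    mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, 64, lane_bits):
        lane_sum = ((x >> shift) & mask) + ((y >> shift) & mask)
        result |= (lane_sum & mask) << shift   # wrap around within the lane
    return result


print(hex(packed_add(0x01FF, 0x0101, 8)))    # 0x200: low lane wraps to 00
print(hex(packed_add(0x01FF, 0x0101, 16)))   # 0x300: one 16-bit lane, carry kept
```

The same two inputs give different results depending on the lane width, because with 8-bit lanes the carry out of the low byte is discarded instead of flowing into the next byte.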

30 First implementation
• Itanium was first released in 2001 after years of development and testing.
• Initial applications: servers and high powered workstations

31 Review of EPIC
• 64 bit data path and registers.
• Complexity of the logic needed for out-of-order and speculative execution is removed from the chip.
• Order of calculation is up to the compiler
• On-chip space can be used for registers.

32 Review of EPIC
• Intel plans to continue development and production of 32 bit processors (IA-32)
• Itanium 2 was released in 2002

33 Implementations
• Itanium versions:
  - Speeds: 733 MHz, 800 MHz
  - Cache: L3: 2 MB or 4 MB; L2: 96 KB; L1: 32 KB
• Itanium 2 versions (all have L2 cache: 256 KB, L1 cache: 32 KB):
  - For multi-processor and dual processor applications: speeds 1.5, 1.4, and 1.3 GHz; L3 cache 3 MB, 4 MB, or 6 MB
  - For dual processor applications (servers and workstations): speed 1.4 GHz; L3 cache 1.5 MB
  - For lower power, high density dual processor applications: speed 1 GHz (low power); L3 cache 1.5 MB

34 Pacific Northwest National Laboratory EMSL Supercomputer
• Built by HP
• Nearly 2000 1.5 GHz Itanium 2 processors, 2 per node
• Uses Linux
• Fifth fastest unclassified computer in the world.
• Speed: 11.8 teraflops theoretical; the ranking is based on 8.63 teraflops achieved running Linpack, which solves dense systems of linear equations
• One-half petabyte of disk space
• Quadrics QSNet 2 interconnect that enables the processors to communicate in less than three microseconds.
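The figures on this slide can be cross-checked with a few lines of Python. The peak and Linpack numbers and the clock speed come from the slide; the value of 4 floating point operations per cycle per processor is an assumption that is not stated in the slides.

```python
PEAK_TFLOPS = 11.8       # theoretical peak, from the slide
LINPACK_TFLOPS = 8.63    # measured Linpack performance, from the slide
CLOCK_GHZ = 1.5          # processor clock, from the slide
FLOPS_PER_CYCLE = 4      # ASSUMPTION: not given in the slides

efficiency = LINPACK_TFLOPS / PEAK_TFLOPS
gflops_per_processor = CLOCK_GHZ * FLOPS_PER_CYCLE
implied_processors = PEAK_TFLOPS * 1000 / gflops_per_processor

print(f"Linpack efficiency: {efficiency:.0%}")
print(f"Implied processor count: {implied_processors:.0f}")
```

Under that assumption the peak figure corresponds to a little under 2000 processors, in line with the "nearly 2000" quoted above, and Linpack reaches roughly three quarters of the theoretical peak.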

35 Pacific Northwest National Laboratory EMSL Supercomputer
http://mscf.emsl.pnl.gov/about/managers_report_2002.shtml

36 Pacific Northwest National Laboratory EMSL Supercomputer
• http://www.emsl.pnl.gov/new/spotlight/spotlight06.shtml
• http://mscf.emsl.pnl.gov/rank5.shtml
• http://mscf.emsl.pnl.gov/about/managers_report_2002.shtml
• http://www.emsl.pnl.gov/proj/nwlinux/system_details.html (NWLinux description)
• http://mscf.emsl.pnl.gov/hardware/config_mpp2.shtml (describes the computer)

37 References
• http://developer.Intel.com/design/ia64/index.htm (IA 64 home page) *
• http://intel.broadcast.com/intel/idf98/keynote1.htm (Real video) *
• http://www.intel.com/design/idf/archive/feb98/index.htm (Multiple media) *
• http://developer.intel.com/design/ia-64/architecture.htm *
• http://developer.intel.com/vtune/cbts/ia64/index.htm (tutorials), in particular "Introducing the IA-64 Architecture" *
* Checked 5/10/00, but these links no longer work

38 References
• http://www.intel.com/design/itanium2/
• http://www.intel.com/design/Itanium/itanium/index.htm
• http://www.intel.com/design/Itanium/manuals/iiasdmanual.htm (Intel Itanium Architecture Software Developer’s Manual, October 2002)
• http://www.intel.com/design/itanium2/manuals/251110.htm (Intel Itanium 2 Processor Reference Manual for Software Development and Optimization)
• http://www.intel.com/business/bss/products/server/itanium2/demo/index.htm?iid=ipp_srvr_proc_itanium2+epic_animation& (Simplistic video)

39 CSCE 380
• Fall 1998, Spring 1999, Spring 2000, Fall 2003, Spring 2004
• James Brink
