EPIC 64-bit Architecture: The Itanium and Itanium 2 CSCE 380 5/10/2000, 12/11/2003, 5/7/2004 Draft.



2 Disclaimer
- This slide show is intended to explain the EPIC architecture in a general way. It is not intended to give a precise description of the architecture.

3 EPIC
- Explicitly Parallel Instruction Computing
- Earlier known as the IA-64 architecture
- Originally defined jointly by Intel and Hewlett-Packard, but it now appears to be an Intel product.
- 64 bit
- Designed to be used in clusters of up to 128 processors (or more). A machine with over 1900 processors exists in Richland, WA.
- Will run binary IA-32 programs.

4 Current state-of-affairs
- Compilers typically write sequential, in-order code.
- Advanced CISC chips use lots of hardware logic to try to execute the code in parallel.
- Advanced CISC chips use lots of hardware logic to try to execute the code “out-of-order.”
- RISC chips are not designed properly for parallel execution.

5 EPIC: Some design principles
- Intel’s conclusion: Let the compiler produce parallel, out-of-order code. This simplifies chip logic. The space saved can be used for things like more registers.
- EPIC has 128 general purpose integer registers, of which 96 can be stacked and renamed.
- It has 128 floating point registers, of which 96 can be rotated or renamed.
- It has 64 1-bit predicate registers, of which 48 can be rotated.
- It has “128” Application Registers, which have special purposes. 56 are reserved and 16 are ignored.
- It has 8 Branch Registers.

6 EPIC: Explicitly Parallel Instruction Computing
- Has resources for parallel execution
  - Many registers
  - Many functional units
  - Inherently scalable
- Explicit parallelism
- Features
  - Predication
  - Speculation
  - …

7 VLIW: Very Long Instruction Word
- EPIC packs 3 instructions into its 128 bit long instruction.
- Compiler specifies parallelism
- The three instructions together are called a bundle

  bit 127                                                      bit 0
  | Instruction 2 | Instruction 1 | Instruction 0 | Template |
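
The bundle layout above can be illustrated with a little bit arithmetic. The following is a minimal sketch, not a description of any real tool: the field widths (a 5-bit template in the low bits and three 41-bit instruction slots) come from the Itanium architecture manuals rather than from these slides, and the function names `pack_bundle` and `unpack_bundle` are made up for illustration.

```python
SLOT_BITS = 41      # width of each instruction slot (assumption from the Itanium manuals)
TEMPLATE_BITS = 5   # width of the template field; 5 + 3 * 41 = 128


def pack_bundle(template, slot0, slot1, slot2):
    """Pack a template and three instruction slots into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template does not fit in 5 bits")
    bundle = template
    for index, slot in enumerate((slot0, slot1, slot2)):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("slot does not fit in 41 bits")
        # Slot 0 sits just above the template, slot 2 ends at bit 127.
        bundle |= slot << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit integer back into (template, (slot0, slot1, slot2))."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slot_mask = (1 << SLOT_BITS) - 1
    slots = tuple(
        (bundle >> (TEMPLATE_BITS + index * SLOT_BITS)) & slot_mask
        for index in range(3)
    )
    return template, slots
```

Calling `unpack_bundle(pack_bundle(t, s0, s1, s2))` returns the original fields, which is a quick way to check that the three slots and the template exactly fill 128 bits.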

8 Basic instructions
Up to 3 in a bundle
- Basic Itanium instruction: [(qp)] mnemonic[.comp] dest = srcs
- qp: optional predicate register. The result is committed only if qp = 1
- mnemonic: unique instruction name
- comp: optional variation for the instruction
- dest: destination of the result
- srcs: one or more sources
(Intro to programming, page 1:132)

9 Instruction Groups
- Instruction groups: groups of instructions that do not have RAW or WAW register dependencies
- Depending on the machine, 1, 2, …, or all of the instructions in the group can be issued in parallel
- Hence, instructions in a single instruction group cannot have
  - RAW (read after write) dependencies: one instruction cannot read a register written by another instruction in the same group.
  - WAW (write after write) dependencies: two instructions cannot write to the same register.
(Software Developer’s Manual, page 1:133)

10 Instruction groups: Example 1
    ld8 r1 = [r5] ;;     // first group
    add r3 = r1, r4      // second group
- ;; is a stop: the end of an instruction group
- Stops may appear inside or at the end of a bundle.
- There can be several bundles in a group
- ld8 loads 8 bytes from memory
- st8 stores 8 bytes to memory

11 Instruction groups: Example 2
    ld8 r1 = [r5]
    sub r6 = r8, r9 ;;   // first group
    add r3 = r1, r4
    st8 [r6] = r12 ;;    // second group
    add r3 = 1, r6 ;;    // third group
- Dependencies marked on the slide: read after write (the second group reads r1 and r6, which are written in the first group) and write after write (the third group writes r3, which is also written in the second group).
Goals:
1. Put as many instructions in a group as possible to allow as much parallelism as possible.
2. Load in advance so as to avoid waiting for memory.
(Adapted from Software Developer’s Manual, page 1:133)
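
The grouping rule can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it tracks register names only (memory dependencies are ignored), it places stops greedily, and the tuple format and the name `split_into_groups` are invented here rather than taken from any Intel tool.

```python
def split_into_groups(instructions):
    """Place a stop before any instruction with a RAW or WAW register
    dependency on something already written in the current group.

    instructions: list of (text, dest_registers, source_registers).
    Returns a list of groups, each a list of instruction texts.
    """
    groups = []
    current = []
    written = set()  # registers written so far in the current group
    for text, dests, srcs in instructions:
        raw = written & set(srcs)    # read after write
        waw = written & set(dests)   # write after write
        if current and (raw or waw):
            groups.append(current)   # emit a stop (;;) and start a new group
            current, written = [], set()
        current.append(text)
        written |= set(dests)
    if current:
        groups.append(current)
    return groups


# The five instructions of Example 2; a store writes memory, so it has no
# destination register in this simplified model.
EXAMPLE_2 = [
    ("ld8 r1 = [r5]",   ["r1"], ["r5"]),
    ("sub r6 = r8, r9", ["r6"], ["r8", "r9"]),
    ("add r3 = r1, r4", ["r3"], ["r1", "r4"]),
    ("st8 [r6] = r12",  [],     ["r6", "r12"]),
    ("add r3 = 1, r6",  ["r3"], ["r6"]),
]
```

Applied to `EXAMPLE_2`, the function yields the same three groups as the slide: the first stop is forced by the read of r1, the second by the second write to r3.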

12 Instruction groups
- Using pseudo code, break the following into instruction groups:
- a = b - c; d = e + f + g; h = a * i; j = d;
    load r1 = b
    load r2 = c
    load r3 = e
    load r4 = f
    load r5 = g
    load r6 = i ;;
    sub r7 = r1, r2       // r7 = b - c
    add r8 = r3, r4 ;;    // r8 = e + f
    store a = r7
    add r9 = r8, r5       // r9 = e + f + g
    mul r10 = r7, r6 ;;   // r10 = a * i
    store d = r9
    store h = r10
    store j = r9 ;;
- Hopefully, additional instructions can be added to the groups before forming bundles

13 Instructions
- Standard: add r1=r2,r3
- Predicated: (p4) add r1=r2,r3 (converted to a no-op if p4 is false)
- With immediate: add r2=r3,1
- With completer: cmp.eq p3=r2,r4 (the completer .eq is an instruction option)
- Memory operations: ld r1=[r4] (load), st [r4]=r1 (store)

14 Minimizing Memory Latency
- Latency: time wasted while waiting for memory
- Problem: In normal CPUs, jumps make it difficult to schedule loads in advance, so the CPU stalls at cache misses

15 Jumps
- Example: if (a > b) x = y; else x = z;
- Pentium Pro type solution: the CPU predicts which branch will be used and starts execution. If right: great. If not: the pipeline has to be cleared and cycles are wasted.

16 Jumps
- Intel claims that 5 to 10% mispredicts can cause a 30-40% performance cut! How could this be?
- Assume instructions normally can be processed in 1 cycle
- Assume 30% of instructions are branches
- Assume 10% of branches are mispredicted
- Then 3% of instructions cause mispredicted branches
- Assume 13 cycles lost for each mispredicted branch
- For every 100 instructions (100 useful cycles), 3 x 13 = 39 of 139 cycles, or 28%, are wasted
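
The arithmetic on this slide can be reproduced with a short function. This is a sketch of the slide's back-of-the-envelope model only (one cycle per instruction, a fixed penalty per mispredict); the function name `wasted_fraction` and its parameter names are chosen here for illustration.

```python
def wasted_fraction(branch_fraction, mispredict_rate, penalty_cycles,
                    instructions=100):
    """Fraction of all cycles lost to mispredicted branches.

    Assumes each instruction takes 1 cycle, so `instructions` is also the
    number of useful cycles.
    """
    mispredicts = instructions * branch_fraction * mispredict_rate
    wasted = mispredicts * penalty_cycles
    return wasted / (instructions + wasted)


# Slide values: 30% branches, 10% of them mispredicted, 13 cycles each.
slide_estimate = wasted_fraction(0.30, 0.10, 13)  # 39 / 139, about 0.28
```

Changing the mispredict rate or the penalty shows how quickly the loss grows on deeper pipelines.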

17 Itanium solution: Predication
- Source code: if (a > b) x = y; else x = z;
- EPIC solution: both branches are executed, but results are stored only if the associated predicate is true (using pseudo code)
    p1 = a>b
    p2 = not(a>b)
    ...
    (p1) load y
    (p2) load z
    ...
    (p1) store x
    (p2) store x
    ...
- There are 64 predicate registers such as p1 and p2.

18 Predication
- Sample C code:
    if (a) b = c + d;
    if (e) h = i - j;
- Pseudo code for Itanium using predication. No branching is needed.
    cmp.ne p1, p2 = a, r0    // p1 <- a != 0
    cmp.ne p3, p4 = e, r0    // p3 <- e != 0
    (p1) add b = c, d        // if a != 0 then add
    (p3) sub h = i, j        // if e != 0 then subtract
- Note: r0 is always 0
- (Software Developer’s Manual, page 1:135)
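
The effect of the predicated sequence can be mimicked in Python. This is a hedged sketch, not a model of the hardware: both operations are always computed, and the predicate only decides whether the result is committed. The function name `run_predicated` and the default values for `b` and `h` are assumptions made for the example.

```python
def run_predicated(a, e, c, d, i, j, b=0, h=0):
    """Mimic the four-instruction predicated sequence from the slide."""
    r0 = 0  # r0 always reads as zero

    # cmp.ne writes a predicate and its complement.
    p1, p2 = (a != r0), not (a != r0)
    p3, p4 = (e != r0), not (e != r0)

    # Both operations execute regardless of the predicates.
    add_result = c + d
    sub_result = i - j

    # The qualifying predicate only controls whether the result is committed.
    b = add_result if p1 else b   # (p1) add b = c, d
    h = sub_result if p3 else h   # (p3) sub h = i, j
    return b, h
```

For example, `run_predicated(a=1, e=0, c=2, d=3, i=10, j=4)` commits the add but not the subtract, so `b` becomes 5 and `h` keeps its old value.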

19 EPIC memory loads
- Values are loaded as far in advance as possible
- Code can verify variables are loaded before actual use

20 Jumps and loads
- In an effort to reduce the latency problem, the compiler will try to load data as far in advance as possible. The code might be modified as follows (using pseudo code; XXXXX stands for other instructions):
    XXXXX
    XXXXX
    load r10, y
    XXXXX
    XXXXX
    load r11, z
    XXXXX
    XXXXX
    XXXXX
    p1 = a>b
    p2 = not(a>b)
    XXXXX
    (p1) store x, r10
    (p2) store x, r11
    XXXXX
- Preloading is called hoisting loads

21 Register Rotation
- Consider for (i = 0; i <= n; i++) b[i] = a[i] + 1;
- Traditional compilers might code the inner loop statement as
    load ax, a[i]
    inc ax
    store b[i], ax
- Even if multiple execution units are available, the loop is executed sequentially because register ax is specified.

22 Register Rotation
- In loop structures, register rotation and renumbering allow a compiler to specify one register while in reality multiple registers are being used.

  Cycle  Instruction       r32    r33      r34
  1      load r32, a[i]    A[0]
  2      inc r33           A[1]   A[0]+1
  3      store b[i], r34   A[2]   A[1]+1   A[0]+1

23 Register Rotation and Software Pipelining
- The EPIC compiler can specify that registers should be used on a rotational basis.
    for (i = 0; i < n; i++) b[i] = a[i] + 1;
    load r32, a[i]       // value is rotated into r33
    add r34 = r33, 1     // values rotated into r34 and r35
    store b[i] = r35     // values rotated into r35 and r36
- Predicate registers and prolog and epilog counters are used to start and stop loops
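
A small simulation may make the rotation easier to follow. It is a simplified model built on assumptions: a four-entry list stands in for r32 to r35, one loop pass stands for one kernel iteration, and plain `if` tests on the iteration counter stand in for the stage predicates that the hardware derives from the loop and epilog counters. The name `pipelined_increment` is invented for this sketch.

```python
def pipelined_increment(a):
    """Compute b[i] = a[i] + 1 with a three-stage software pipeline."""
    n = len(a)
    b = [None] * n
    regs = [None] * 4  # regs[0] is r32, regs[1] is r33, and so on

    # n kernel iterations plus 2 extra ones to drain the pipeline (epilog).
    for k in range(n + 2):
        if k < n:                    # stage 1 active: load r32, a[k]
            regs[0] = a[k]
        if 1 <= k <= n:              # stage 2 active: add r34 = r33, 1
            regs[2] = regs[1] + 1
        if 2 <= k <= n + 1:          # stage 3 active: store b[k-2] = r35
            b[k - 2] = regs[3]
        # Rotation: the value written to r32 is seen as r33 next iteration.
        regs = [None] + regs[:-1]
    return b
```

In each steady-state iteration the load, add and store work on three different elements of the array, which is why they can be issued together even though the source names only one register per stage.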

24 Loop count and epilog registers
- Loop control requires branches and overhead in standard computers.
- EPIC provides loop count (LC) and epilog count (EC) registers which, together with the predicate registers, take care of the overhead of loop setup and cleanup without branches

25 Procedure calls
- Procedure calls are highly desirable but …
- In traditional computers, procedure calls require pushes and pops (memory operations) to handle parameters
- Memory is slow
- Procedure calls are slow

26 Procedure calls
- 96 integer registers can be used like a stack
- Calling and called procedures can share some registers
- If the register stack overflows, registers are sent to memory in the background
- Register saves are unneeded
- Suppose procedure A calls procedure B(x,y)
  [Diagram: integer register stack, with the frames of Proc A and Proc B sharing the registers that hold param x and param y]
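
The sharing of registers between caller and callee can be sketched as overlapping frames. This is a toy model under stated assumptions: the class `RegisterStack`, its methods and the frame sizes are hypothetical, register numbers are offsets within the current frame, and spilling to memory on overflow is not modelled.

```python
class RegisterStack:
    """Toy model of a stacked register file with overlapping frames."""

    def __init__(self, size=96):
        self.physical = [0] * size
        self.base = 0        # physical index of the current frame's first register
        self.n_locals = 0    # registers private to the current procedure
        self.saved = []      # (base, n_locals) of suspended callers

    def alloc(self, n_locals):
        """Declare how many registers of the frame are private (inputs and locals)."""
        self.n_locals = n_locals

    def read(self, offset):
        return self.physical[self.base + offset]

    def write(self, offset, value):
        self.physical[self.base + offset] = value

    def call(self):
        """Enter a callee: its frame starts at the caller's first output register."""
        self.saved.append((self.base, self.n_locals))
        self.base += self.n_locals
        self.n_locals = 0

    def ret(self):
        """Return to the caller: its frame reappears untouched."""
        self.base, self.n_locals = self.saved.pop()


def call_b_from_a(x, y):
    """Procedure A passes x and y to procedure B without touching memory."""
    stack = RegisterStack()
    stack.alloc(n_locals=3)      # A keeps 3 private registers
    stack.write(0, 42)           # some local value of A
    stack.write(3, x)            # A's first output register
    stack.write(4, y)            # A's second output register
    stack.call()                 # B's frame now starts at A's outputs
    seen_by_b = (stack.read(0), stack.read(1))
    stack.ret()
    return seen_by_b, stack.read(0)
```

Here `call_b_from_a(7, 9)` shows B reading 7 and 9 from its own first two registers while A's local value survives the call, with no push or pop involved.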

27 Comments
- Itaniums use a flat 64 bit addressing space
- They normally store little endian but can support big endian operating systems

28 Floating point
- Allows 32 bit, 64 bit, or 80 bit floating point values
- Registers are 82 bits long
- Uses software for division
- Square roots use looping techniques
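
One common way to divide in software is to refine an approximate reciprocal with Newton-Raphson steps that use only multiplies and adds. The sketch below illustrates that general technique and is not taken from the slides or from Intel's actual division sequences. The names `crude_seed`, `reciprocal` and `divide`, the linear seed formula and the iteration count are all assumptions, and the code handles positive divisors only.

```python
import math


def crude_seed(d):
    """Rough first guess for 1/d (d > 0), standing in for a hardware estimate."""
    mantissa, exponent = math.frexp(d)   # d = mantissa * 2**exponent, 0.5 <= mantissa < 1
    # Linear approximation of 1/mantissa on [0.5, 1), then undo the exponent.
    return math.ldexp(48.0 / 17.0 - (32.0 / 17.0) * mantissa, -exponent)


def reciprocal(d, iterations=4):
    """Refine the seed with Newton-Raphson: y <- y * (2 - d * y)."""
    y = crude_seed(d)
    for _ in range(iterations):
        y = y * (2.0 - d * y)   # each step roughly squares the relative error
    return y


def divide(n, d):
    """Approximate n / d using multiplication by the refined reciprocal."""
    return n * reciprocal(d)
```

Because the error is roughly squared at every step, a handful of iterations is enough to reach double precision from a seed that is only good to a few bits.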

29 MMX semantic equivalence used when emulating an IA-32 chip
- Integer registers can be treated as eight 8 bit, four 16 bit, or two 32 bit registers
- Floating point registers can be treated as two 32 bit registers
- This allows one instruction to process multiple data values (SIMD)
- Provides MMX semantic equivalence
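
Treating one 64-bit register as eight independent 8-bit values can be sketched as follows. This is an illustration only: the function name `packed_add_8` is invented, wraparound (modulo 256) arithmetic is assumed for each lane, and the Python loop handles the lanes one after another whereas the hardware would process them together in a single instruction.

```python
def packed_add_8(x, y):
    """Add two 64-bit values as eight unsigned 8-bit lanes with wraparound."""
    result = 0
    for lane in range(8):
        shift = 8 * lane
        a = (x >> shift) & 0xFF
        b = (y >> shift) & 0xFF
        # A carry out of one lane must not spill into the next lane.
        result |= ((a + b) & 0xFF) << shift
    return result
```

For instance, adding `0x0101010101010101` to `0x01020304050607FF` increments every byte, and the lowest byte wraps from `0xFF` to `0x00` without disturbing its neighbour.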

30 First implementation
- Itanium was first released in 2001 after years of development and testing.
- Initial applications: servers and high powered workstations

31 Review of EPIC
- 64 bit data path and registers.
- Complexity of the logic needed for out-of-order and speculative execution is removed from the chip.
- Order of calculation is up to the compiler
- On board chip space can be used for registers.

32 Review of EPIC
- Intel plans to continue development and production of 32 bit processors (IA-32)
- Itanium 2 was released in 2002

33 Implementations
- Itanium versions:
  - Speeds: 733 MHz, 800 MHz
  - Cache: L3: 2 MB, 4 MB; L2: 96 KB; L1: 32 KB
- Itanium 2 versions (all have L2 cache: 256 KB, L1 cache: 32 KB):
  - For multi-processor and dual processor applications
    - Speeds: 1.5, 1.4 and 1.3 GHz
    - Cache: L3: 3 MB, 4 MB, or 6 MB
  - For dual processor applications (servers and workstations)
    - Speed: 1.4 GHz
    - L3 cache: 1.5 MB
  - Lower power, high density dual processor applications
    - Speed: 1 GHz (low power)
    - L3 cache: 1.5 MB

34 Pacific Northwest National Laboratories EMSL Supercomputer
- Built by HP
- Nearly GHz Itanium 2 processors, 2 per node
- Uses Linux
- Fifth fastest unclassified computer in the world.
- Speed: 11.8 teraflops theoretical; rating based on 8.63 teraflops achieved running Linpack (solving dense linear equations)
- One-half petabyte of disk space
- Quadrics QSNet 2 interconnect that enables the processors to communicate in less than three microseconds.

35 Pacific Northwest National Laboratories EMSL Supercomputer

36 Pacific Northwest National Laboratories EMSL Supercomputer (NWLinux description) (describes the computer)

37 References
- (IA 64 home page) *
- (Real video) *
- index.htm (Multiple media) *
- architecture.htm *
- index.htm (tutorials), in particular "Introducing the IA-64 Architecture" *
- * Checked 5/10/00, but these links no longer work

38 References
- x.htm
- dmanual.htm (Intel Itanium Architecture Software Developer’s Manual, October 2002)
- htm (Intel Itanium 2 Processor Reference Manual for Software Development and Optimization)
- /itanium2/demo/index.htm?iid=ipp_srvr_proc_itanium2+epic_animation& (Simplistic video)

39 CSCE 380
- Fall 1998, Spring 1999, Spring 2000, Fall 2003, Spring 2004
- James Brink