Chapter 15 IA-64 Architecture. Reflection on Superscalar Machines Superscaler Machine: A Superscalar machine employs multiple independent pipelines to.

Slides:



Advertisements
Similar presentations
Topics Left Superscalar machines IA64 / EPIC architecture
Advertisements

1/1/ / faculty of Electrical Engineering eindhoven university of technology Speeding it up Part 3: Out-Of-Order and SuperScalar execution dr.ir. A.C. Verschueren.
Computer Organization and Architecture
CSCI 4717/5717 Computer Architecture
Chapter 14 Instruction Level Parallelism and Superscalar Processors
CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture VLIW Steve Ko Computer Sciences and Engineering University at Buffalo.
1 Advanced Computer Architecture Limits to ILP Lecture 3.
CPE432 Chapter 4C.1Dr. W. Abu-Sufah, UJ Chapter 4C: The Processor, Part C Read Section 4.10 Parallelism and Advanced Instruction-Level Parallelism Adapted.
Computer Organization and Architecture
PART 4: (2/2) Central Processing Unit (CPU) Basics CHAPTER 13: REDUCED INSTRUCTION SET COMPUTERS (RISC) 1.
Computer Organization and Architecture
Computer Organization and Architecture
Chapter 14 Superscalar Processors. What is Superscalar? “Common” instructions (arithmetic, load/store, conditional branch) can be executed independently.
Chapter 12 Pipelining Strategies Performance Hazards.
Chapter 15 IA-64 Architecture No HW, Concentrate on understanding these slides Next Monday we will talk about: Microprogramming of Computer Control units.
Microprocessors Introduction to ia64 Architecture Jan 31st, 2002 General Principles.
Choice for the rest of the semester New Plan –assembler and machine language –Operating systems Process scheduling Memory management File system Optimization.
RISC. Rational Behind RISC Few of the complex instructions were used –data movement – 45% –ALU ops – 25% –branching – 30% Cheaper memory VLSI technology.
Chapter 12 CPU Structure and Function. Example Register Organizations.
Chapter 21 IA-64 Architecture (Think Intel Itanium)
IA-64 Architecture (Think Intel Itanium) also known as (EPIC – Extremely Parallel Instruction Computing) a new kind of superscalar computer HW 5 - Due.
Chapter 14 Instruction Level Parallelism and Superscalar Processors
COMP381 by M. Hamdi 1 Commercial Superscalar and VLIW Processors.
Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)
Reduced Instruction Set Computers (RISC) Computer Organization and Architecture.
CH13 Reduced Instruction Set Computers {Make hardware Simpler, but quicker} Key features  Large number of general purpose registers  Use of compiler.
IA-64 ISA A Summary JinLin Yang Phil Varner Shuoqi Li.
IT253: Computer Organization Lecture 4: Instruction Set Architecture Tonga Institute of Higher Education.
Edited By Miss Sarwat Iqbal (FUUAST) Last updated:21/1/13
Anshul Kumar, CSE IITD CS718 : VLIW - Software Driven ILP Example Architectures 6th Apr, 2006.
Hardware Support for Compiler Speculation
Spring 2003CSE P5481 VLIW Processors VLIW (“very long instruction word”) processors instructions are scheduled by the compiler a fixed number of operations.
Introducing The IA-64 Architecture - Kalyan Gopavarapu - Kalyan Gopavarapu.
ECEG-3202 Computer Architecture and Organization Chapter 7 Reduced Instruction Set Computers.
Next Generation ISA Itanium / IA-64. Operating Environments IA-32 Protected Mode/Real Mode/Virtual Mode - if supported by the OS IA-64 Instruction Set.
Processor Structure and Function Chapter8:. CPU Structure  CPU must:  Fetch instructions –Read instruction from memory  Interpret instructions –Instruction.
Embedded Computer Architectures Hennessy & Patterson Chapter 4 Exploiting ILP with Software Approaches Gerard Smit (Zilverling 4102),
Superscalar - summary Superscalar machines have multiple functional units (FUs) eg 2 x integer ALU, 1 x FPU, 1 x branch, 1 x load/store Requires complex.
3/12/2013Computer Engg, IIT(BHU)1 CONCEPTS-1. Pipelining Pipelining is used to increase the speed of processing It uses temporal parallelism In pipelining,
PART 5: (1/2) Processor Internals CHAPTER 14: INSTRUCTION-LEVEL PARALLELISM AND SUPERSCALAR PROCESSORS 1.
COMPSYS 304 Computer Architecture Speculation & Branching Morning visitors - Paradise Bay, Bay of Islands.
Instruction Sets. Instruction set It is a list of all instructions that a processor can execute. It is a list of all instructions that a processor can.
Lecture 1: Introduction Instruction Level Parallelism & Processor Architectures.
Unit II Intel IA-64 and Itanium Processor By N.R.Rejin Paul Lecturer/VIT/CSE CS2354 Advanced Computer Architecture.
Computer Architecture Chapter (14): Processor Structure and Function
Instruction Level Parallelism
William Stallings Computer Organization and Architecture 8th Edition
William Stallings Computer Organization and Architecture 8th Edition
CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue
Chapter 9 a Instruction Level Parallelism and Superscalar Processors
Chapter 14 Instruction Level Parallelism and Superscalar Processors
Henk Corporaal TUEindhoven 2009
Instruction Level Parallelism and Superscalar Processors
Instruction Level Parallelism and Superscalar Processors
Henk Corporaal TUEindhoven 2011
Sampoorani, Sivakumar and Joshua
CC423: Advanced Computer Architecture ILP: Part V – Multiple Issue
Computer Architecture
Midterm 2 review Chapter
Created by Vivi Sahfitri
Conceptual execution on a processor which exploits ILP
Presentation transcript:

Chapter 15 IA-64 Architecture

Reflection on Superscalar Machines Superscaler Machine: A Superscalar machine employs multiple independent pipelines to executes multiple independent instructions in parallel. —Particularly common instructions (arithmetic, load/store, conditional branch) can be executed independently. Superpipelined machine: Superpiplined machines overlap pipe stages —Relies on stages being able to begin operations before the last is complete.

Reflecting on Superscalar Machines Example:

Reflecting on Superscalar Machines Superscalar vs Superpipelined

Reflection on Superscalar Machines Challenges: Data Dependencies —Requires reordering of instructions Procedural Dependencies —Requires reordering of fetch, execute, updating of registers —Requires register renaming —Requires “committing” or “retiring” instructions Resource Conflicts —Requires reordering of instructions Superscaling has “scaling” challenges: Control complexity increases exponentially Time delay increases exponentially

IA-64 : Background Explicitly Parallel Instruction Computing (EPIC) - Jointly developed by Intel & Hewlett-Packard (HP) New 64 bit architecture —Not extension of x86 —Not adaptation of HP 64bit RISC architecture To exploit increasing chip transistors and increasing speeds Utilizes systematic parallelism Departure from superscalar Note: Has become the architecture of the Intel Itanium

Why This New Architecture? Processor designers obvious choices for use of increasing number of transistors on chip and extra speed: Bigger Caches  diminishing returns Increase degree of Superscaling by adding more execution units  complexity wall: more logic, need improved branch prediction, more renaming registers, more complicated dependencies. Multiple Processors  challenge to use them effectively in general computing

Design Approach – Explicit Parallelism Compiler has vision of whole program and what is coming Increase the execution units and use them effectively Reduce dynamic reconfigurations Avoid exponentially increasing complex circuitry Compiler statically schedules instructions at compile time, rather than processor dynamically scheduling them at run time.

Basic Concepts for IA-64 Instruction level parallelism —Explicit in machine instruction rather than determined at run time by processor Long or very long instruction words (LIW/VLIW) —Fetch bigger chunks already “preprocessed” Branch predication (not the same as branch prediction) —Go ahead and fetch & decode instructions, but keep track of them so the decision to “issue” them, or not, can be practically made later Speculative loading —Go ahead and load data so it is ready when need, and have a practical way to recover is speculation proved wrong

Superscalar v IA-64

General Organization

Intel’s Itanium Implements the IA-64

IA-64 Key Features Large number of registers —IA-64 instruction format assumes 256 Registers –128 * 64 bit integer, logical & general purpose –128 * 82 bit floating point and graphic —64 predicated execution registers (To support high degree of parallelism) Multiple execution units —8 or more

Predicate Registers Used as a flag for instructions that may or may not be executed. A set of instructions is assigned a predicate register when it is uncertain whether the instruction sequence will actually be executed (think branch). Only instructions with a predicate value of true are executed. When it is known that the instruction is going to be executed, its predicate is set. All instructions with that predicate true can now be completed. Those instructions with predicate false are now candidates for cleanup.

IA-64 Execution Units I-Unit —Integer arithmetic —Shift and add —Logical —Compare —Integer multimedia ops M-Unit —Load and store –Between register and memory —Some integer ALU operations B-Unit —Branch instructions F-Unit —Floating point instructions

Relationship between Instruction Type & Execution Unit

Instruction Format 128 bit bundles Can fetch one or more bundles at a time Bundle holds three instructions plus template Instructions are usually 41 bit long —Have associated predicated execution registers Template contains info on which instructions can be executed in parallel —Not confined to single bundle —e.g. a stream of 8 instructions may be executed in parallel —Compiler will have re-ordered instructions to form contiguous bundles —Can mix dependent and independent instructions in same bundle

Instruction Format Diagram

Field Encoding & Instr Set Mapping Note: BAR indicates stops: Possible dependencies with Instructions after the stop

Assembly Language Format [qp] mnemonic [.comp] dest = srcs ;; // qp - predicate register –1 at execution  execute and commit result to hardware –0  result is discarded mnemonic - name of instruction comp – one or more instruction completers used to qualify mnemonic dest – one or more destination operands srcs – one or more source operands ;; - instruction groups stops (when appropriate) –Sequence without read after write or write after write –Do not need hardware register dependency checks // - comment follows

Assembly Example ld8 r1 = [r5] ;; //first group add r3 = r1, r4 //second group Second instruction depends on value in r1 —Changed by first instruction —Can not be in same group for parallel execution Note ;; ends the group of instructions that can be executed in parallel Register Dependency:

Assembly Example ld8 r1 = [r5] //first group sub r6 = r8, r9 ;; //first group add r3 = r1, r4 //second group st8 [r6] = r12 //second group Last instruction stores in the memory location whose address is in r6, which is established in the second instruction Multiple Register Dependencies:

Predication

Speculative Loading

Assembly Example – Predicated Code if (a&&b) j = j + 1; else if(c) k = k + 1; else k = k – 1; i = i + 1; Consider the Following program with branches:

Assembly Example – Predicated Code Source Code if (a&&b) j = j + 1; else if(c) k = k + 1; else k = k – 1; i = i + 1; Pentium Assembly Code cmp a, 0 ; compare with 0 je L1 ; branch to L1 if a = 0 cmp b, 0 je L1 add j, 1 ; j = j + 1 jmp L3 L1: cmp c, 0 je L2 add k, 1 ; k = k + 1 jmp L3 L2: sub k, 1 ; k = k – 1 L3: add i, 1 ; i = i + 1

Assembly Example – Predicated Code Source Code if (a&&b) j = j + 1; else if(c) k = k + 1; else k = k – 1; i = i + 1; Pentium Code cmp a, 0 je L1 cmp b, 0 je L1 add j, 1 jmp L3 L1: cmp c, 0 je L2 add k, 1 jmp L3 L2: sub k, 1 L3: add i, 1 IA-64 Code cmp. eq p1, p2 = 0, a ;; (p2) cmp. eq p1, p3 = 0, b (p3) add j = 1, j (p1) cmp. ne p4, p5 = 0, c (p4) add k = 1, k (p5) add k = -1, k add i = 1, i

Example of Prediction

Control & Data Speculation Control —AKA Speculative loading —Load data from memory before needed Data —Load moved before store that might alter memory location —Subsequent check in value

Assembly Example – Control Speculation (p1) br some_label // cycle 0 ld8 r1 = [r5] ;; // cycle 1 add r1 = r1, r3 // cycle 3 Consider the Following program:

Assembly Example – Control Speculation (p1) br some_label //cycle 0 ld8 r1 = [r5] ;; //cycle 1 add r1 = r1, r3 //cycle 3 Consider the Following program: Original code Speculated Code Original code Speculated Code ld8.s r1 = [r5] ;; //cycle -2 // other instructions (p1) br some_label //cycle 0 chk.s r1, recovery //cycle 0 add r2 = r1, r3 //cycle 0

Assembly Example – Data Speculation st8 [r4] = r12 //cycle 0 ld8 r6 = [r8] ;; //cycle 0 add r5 = r6, r7 ;; //cycle 2 st8 [r18] = r5 //cycle 3 Consider the Following program: What if r4 and r18 point to the same address?

Assembly Example – Data Speculation st8 [r4] = r12 //cycle 0 ld8 r6 = [r8] ;; //cycle 0 add r5 = r6, r7 ;; //cycle 2 st8 [r18] = r5 //cycle 3 Consider the Following program: Without Data Speculation With Data Speculation What if r4 and r18 point to the same address? ld8.a r6 = [r8] ;; //cycle -2, adv // other instructions st8 [r4] = r12 //cycle 0 ld8.c r6 = [r8] //cycle 0, check add r5 = r6, r7 ;; //cycle 0 st8 [r18] = r5 //cycle 1

Assembly Example – Data Speculation ld8.a r6 = [r8];; //cycle -3,adv ld // other instructions add r5 = r6, r7 //cycle -1,uses r6 // other instructions st8 [r4] = r12 //cycle 0 chk.a r6, recover //cycle 0, check back: //return pt st8 [r18] = r5 //cycle 0 recover: ld8 r6 = [r8] ;; //get r6 from [r8] add r5 = r6, r7;; //re-execute be back //jump back ld8.a r6 = [r8] ;; //cycle-2 // other instructions st8 [r4] = r12 //cycle 0 ld8.c r6 = [r8] //cycle 0 add r5 = r6, r7 ;; //cycle 0 st8 [r18] = r5 //cycle 1 Data Dependencies: Speculation Speculation with data dependency

Software Pipelining L1:ld4 r4=[r5],4 ;;//cycle 0 load postinc 4 add r7=r4,r9 ;;//cycle 2 st4 [r6]=r7,4 //cycle 3 store postinc 4 br.cloop L1 ;;//cycle 3 Adds constant to one vector and stores result in another No opportunity for instruction level parallelism Instruction in iteration x all executed before iteration x+1 begins If no address conflicts between loads and stores can move independent instructions from loop x+1 to loop x

Unrolled Loop ld4 r32=[r5],4;; //cycle 0 ld4 r33=[r5],4;; //cycle 1 ld4 r34=[r5],4 //cycle 2 add r36=r32,r9;; //cycle 2 ld4 r35=[r5],4 //cycle 3 add r37=r33,r9 //cycle 3 st4 [r6]=r36,4;; //cycle 3 ld4 r36=[r5],4 //cycle 3 add r38=r34,r9 //cycle 4 st4 [r6]=r37,4;; //cycle 4 add r39=r35,r9 //cycle 5 st4 [r6]=r38,4;; //cycle 5 add r40=r36,r9 //cycle 6 st4 [r6]=r39,4;; //cycle 6 st4 [r6]=r40,4;; //cycle 7

Unrolled Loop Detail Completes 5 iterations in 7 cycles —Compared with 20 cycles in original code Assumes two memory ports —Load and store can be done in parallel

Software Pipeline Example Diagram

Support For Software Pipelining Automatic register renaming —Fixed size are of predicate and fp register file (p16-P32, fr32-fr127) and programmable size area of gp register file (max r32-r127) capable of rotation —Loop using r32 on first iteration automatically uses r33 on second Predication —Each instruction in loop predicated on rotating predicate register –Determines whether pipeline is in prolog, kernel, or epilog Special loop termination instructions —Branch instructions that cause registers to rotate and loop counter to decrement

IA-64 Register Set

IA-64 Registers (1) General Registers —128 gp 64 bit registers —r0-r31 static –references interpreted literally —r32-r127 can be used as rotating registers for software pipeline or register stack –References are virtual –Hardware may rename dynamically Floating Point Registers —128 fp 82 bit registers —Will hold IEEE 745 double extended format —fr0-fr31 static, fr32-fr127 can be rotated for pipeline Predicate registers —64 1 bit registers used as predicates —pr0 always 1 to allow unpredicated instructions —pr1-pr15 static, pr16-pr63 can be rotated

IA-64 Registers (2) Branch registers —8 64 bit registers Instruction pointer —Bundle address of currently executing instruction Current frame marker —State info relating to current general register stack frame —Rotation info for fr and pr —User mask –Set of single bit values –Allignment traps, performance monitors, fp register usage monitoring Performance monitoring data registers —Support performance monitoring hardware Application registers —Special purpose registers

Register Stack Avoids unnecessary movement of data at procedure call & return Provides procedure with new frame up to 96 registers on entry —r32-r127 Compiler specifies required number —Local —Output Registers renamed so local registers from previous frame hidden Output registers from calling procedure now have numbers starting r32 Physical registers r32-r127 allocated in circular buffer to virtual registers Hardware moves register contents between registers and memory if more registers needed

Register Stack Behaviour

Register Formats