




Slide 1: Intel and FAER's Reach to Teach, A Program on Computer Architecture. Part 1: Pipelined Processors. R. Govindarajan and Matthew Jacob, SERC, Indian Institute of Science, Bangalore

Slide 2: Pipelined Processor Architecture
1. Terminology and assumptions
2. Review: computer organization; data representation
3. Pipelined processor architecture
4. ILP (Instruction Level Parallelism) processor architecture

Slide 3: What is Computer Architecture?
Architecture, in the English dictionary:
- The art and science of designing and building habitable structures (here, structures → computer systems; inhabitants → computer programs)
- A structure, or structures collectively
- A style and method of design and construction (e.g., Moghul architecture)
Computer architecture: the study of computer structures; their design, evaluation, and description.

Slide 4: Computer Architect vs. Computer Designer vs. Logic Designer
- Computer Architect: develops the Instruction Set Architecture (ISA: a description of the instructions that are allowed and the semantics of what each instruction does when executed) and the computer system architecture
- Computer Designer: develops the detailed machine organization (blocks, specifications, testing)
- Logic Designer: implements these blocks

Slide 5: Basics: Computer Organization
Figure: block diagram of a computer system. The CPU (containing the ALU, control unit (CU), registers, and MMU) connects through the cache to memory, and to I/O devices over the I/O bus.
Registers:
- General purpose: integer registers, FP registers
- Special purpose: program counter, stack pointer, link register, instruction register

Slide 6: Basics: Laws, Principles, Rules
Amdahl's Law: the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used. If a fraction f of execution time is sped up by a factor s, the overall speedup is
Speedup = 1 / ((1 − f) + f/s)
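Amdahl's Law is easy to check numerically. As a small illustrative sketch (not part of the slides; the function name is my own), with f the fraction of execution time that benefits and s the speedup of that fraction:

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of execution time is sped up by factor s."""
    return 1.0 / ((1.0 - f) + f / s)

# Even a 10x speedup on 90% of the time yields only ~5.3x overall,
# because the remaining 10% of the time is unimproved.
print(amdahl_speedup(0.9, 10))
```

As s grows without bound, the speedup approaches 1 / (1 − f), the limit imposed by the unimproved fraction.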

Slide 7: Principle of Locality of Reference
A program property: programs tend to reuse instructions and data.
- 90-10 rule: 90% of execution time is spent in 10% of the code
- Temporal locality: recently accessed items are likely to be accessed again in the near future
- Spatial locality: items whose addresses are close in space tend to be accessed close together in time

Slide 8: General Principle of Locality
(Denning, SJCC 1972; Blevin & Ramamurthy, IEEE Trans. Comp. 1976)
- During any interval of time, resource demands are non-uniformly distributed
- The correlation between immediate-past and immediate-future resource demand patterns tends to be high, and the correlation between disjoint resource demand patterns tends to 0 as the distance between them tends to infinity
(Correlation: the direction and strength of the linear relationship between two random variables.)

Slide 9: 'Moore's Law'
Figure: processor vs. DRAM performance, 1980-2000 (log scale). Microprocessor performance improves ~60%/yr (2x every 1.5 years); memory improves ~9%/yr (2x every 10 years); the processor-memory performance gap grows ~50% per year.

Slide 10: Background: Data Representation
Binary, bit, byte. Commonly used representations:
- Character data: ASCII code
- Signed integer data: 2s complement (also 1s complement, sign-magnitude)
- Real data: floating point (example: the IEEE single precision floating point standard)

Slide 11: 2s Complement Representation
The n-bit quantity b(n−1) b(n−2) … b(1) b(0), where b(0) is the least significant bit, represents the signed integer value
−b(n−1)·2^(n−1) + Σ_{i=0}^{n−2} b(i)·2^i
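The weighting rule (the most significant bit carries weight −2^(n−1)) can be verified with a short sketch; the function name is my own, not from the slides:

```python
def twos_complement_value(bits):
    """Value of an n-bit two's-complement bit string, most significant bit first."""
    n = len(bits)
    unsigned = int(bits, 2)
    # The MSB contributes -2^(n-1) instead of +2^(n-1),
    # i.e. subtract 2^n from the unsigned reading when the MSB is set.
    return unsigned - (1 << n) if bits[0] == "1" else unsigned

print(twos_complement_value("1111"))  # all ones is -1
```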

Slide 12: IEEE Floating Point Representation
A 32-bit value (s, f, e), where s is the sign bit, f is a 23-bit fraction, and e is an 8-bit biased exponent, evaluates in normalized form to
(−1)^s × 1.f × 2^(e − 127)
Special forms encode zero, infinity, NaN, and denormals.
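The field extraction and the normalized-form formula can be sketched directly (an illustrative aside; the special forms are deliberately not handled, and the function name is my own):

```python
def decode_normalized_ieee754(word):
    """Decode a 32-bit pattern as a normalized IEEE-754 single:
    (-1)^s * 1.f * 2^(e - 127)."""
    s = (word >> 31) & 0x1        # 1 sign bit
    e = (word >> 23) & 0xFF       # 8-bit biased exponent
    f = word & 0x7FFFFF           # 23-bit fraction
    assert 0 < e < 255, "special forms (zero, infinity, NaN, denormals) not handled"
    return (-1.0) ** s * (1 + f / 2 ** 23) * 2.0 ** (e - 127)

print(decode_normalized_ieee754(0x3F800000))  # the bit pattern for 1.0
```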

Slide 13: Instruction Set Architecture
A description of the machine from the view of the programmer/compiler (example: the Intel x86 ISA). It includes a specification of:
1. The different kinds of instructions available (the instruction set)
2. How operands are specified (addressing modes)
3. What each instruction looks like (instruction format)

Slide 14: Kinds of Instructions
1. Arithmetic/logical instructions
   - Add, subtract, multiply, divide, compare (int/fp)
   - Or, and, not, xor
   - Shift (left/right, arithmetic/logical), rotate
2. Data transfer instructions
   - Load (move a data value to a register from memory)
   - Store (move a data value to a memory location from a register)
   - Move
3. Control transfer instructions
   - Jump, conditional branch, function call, return
4. Other instructions (example: halt)

Slide 15: Operand Addressing Modes
Operands to an instruction:
- Source: an input value to the instruction
- Destination: where the result is to go
Addressing mode: how the location of an operand is specified. An operand can be either in a memory location or in a register.

Slide 16: Addressing Modes
How the location of operands is specified:
- Register direct: in a register, e.g., add R1, R2, R3
- Immediate: part of the instruction, e.g., add R1, R1, #4
- Register indirect: in memory, with a register specifying the memory address, e.g., add R1, R2, (R3)
- Base-displacement: the memory address is the sum of a base (register) and an offset, e.g., add R1, 8(R3)
- Absolute: the memory address is specified in the instruction
- Indexed: the address is the sum of base + index
- Others (auto increment/decrement, PC relative)
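To make the operand-fetch rules concrete, here is a hypothetical sketch; the register contents, memory layout, and function names are all invented for illustration:

```python
# Invented machine state: a register file and a word-addressed memory.
regs = {"R1": 0, "R2": 5, "R3": 1000}
mem = {1000: 42, 1008: 7}

def operand_register(r):
    """Register direct: the operand is the register's contents."""
    return regs[r]

def operand_immediate(value):
    """Immediate: the operand is encoded in the instruction itself."""
    return value

def operand_reg_indirect(r):
    """Register indirect: the register holds the memory address of the operand."""
    return mem[regs[r]]

def operand_base_disp(disp, r):
    """Base-displacement: the address is base register + offset, e.g. 8(R3)."""
    return mem[regs[r] + disp]

print(operand_base_disp(8, "R3"))  # reads mem[1000 + 8]
```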

Slide 17: Terms: Byte Addressable
- Memory: a sequence of locations, each containing some information, referenced by an address
- Address space: memory address space, register address space
- Addressability: how much data is in a location? Example: in byte-addressable memory, each location contains 8 bits (1 byte)
- Word: data in a set of contiguous locations
- Word length: the maximum data accessed in a single fetch

Slide 18: Terms: Byte Ordering, Alignment
Memory contents (hex), byte addresses 400-407: 1A C8 B2 46 F0 8C DF 1E
The 32-bit word at address 400:
- Big-endian byte ordering: 1AC8B246 (binary 0001 1010 1100 1000 1011 0010 0100 0110; decimal 449,360,454)
- Little-endian byte ordering: 46B2C81A (binary 0100 0110 1011 0010 1100 1000 0001 1010; decimal 1,186,121,754)
Word aligned: at a word boundary. The word at 400 is word aligned; the word at 402 is not, but it is short-word aligned.
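The standard library's struct module can reproduce the two byte orderings (an illustrative aside, not part of the lecture; the byte values follow the slide's example):

```python
import struct

word = 0x1AC8B246  # the word value from the slide's big-endian reading

big = struct.pack(">I", word)     # ">" = big-endian: MSB at the lowest address
little = struct.pack("<I", word)  # "<" = little-endian: LSB at the lowest address

print(big.hex(), little.hex())

# Reading the same four memory bytes both ways gives two different values:
memory_bytes = bytes([0x1A, 0xC8, 0xB2, 0x46])
print(struct.unpack(">I", memory_bytes)[0])  # big-endian interpretation
print(struct.unpack("<I", memory_bytes)[0])  # little-endian interpretation
```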

Slide 19: ISA Example: MIPS32 ISA
Registers: 32 integer GPRs (R0, R1, …, R31)
- R0 is hardwired to 0
- R31 is implicitly used by the jal instruction
- HI and LO: special purpose registers used implicitly by multiply and divide instructions
Addressing modes:
- Register direct
- Base-displacement (used by loads and stores)
- Immediate
- Absolute (used by jump instructions)
- PC relative (used by branch instructions)

Slide 20: MIPS32 ISA.
Shift: SLL, SLLV, SRA, SRL
- srl R1, R2, #4: R1 ← 0000 || (R2)[31..4]
Comparison: SLT, SLTI, SLTU
- slti R1, R2, #16: R1 ← 1 if R2 < SE(16), 0 otherwise (SE: sign extension)

Slide 21: MIPS32 ISA..
Notation we will use for instructions: Opcode Destination, Source1, Source2
Example: ADD R1, R2, R3, written as ADD R1 ← R2, R3

Slide 22: Steps in Instruction Processing
1. Fetch the instruction from memory: get the instruction whose address is in the PC (Program Counter) from memory into the IR (Instruction Register); increment the PC
2. Decode the instruction: understand the instruction, addressing modes, etc.; calculate the effective addresses of the operands to the instruction and fetch the operand values
3. Execute the instruction: do the required operation
4. Write back the result of the instruction

Slide 23: Timeline of Events
PC to memory → instruction in IR → PC++; decode → operand 1 effective address calculation → operand 1 fetched → operand 2 effective address calculation → operand 2 fetched → operation done → write result
Processor/memory speed disparity: 2-3 orders of magnitude

Slide 24: Assumptions
- Activity is overlapped in time where possible (PC increment and instruction fetch? Instruction decode and effective address calculation?)
- Load-store ISA: the only instructions that take operands from memory are loads and stores
- Main memory delays are not typically seen by the instruction processor, thanks to cache memories (more on this in a later lecture)
- Register file with 2 read ports and 1 write port

Slide 25: Processor cycle time: the time required to do
- a cache memory access
- a register access + some logic (like decode)
- an ALU operation
An instruction can be processed in 3-5 cycles:
- Jump: IFetch, Decode/OpFetch, DoOp
- ALU: IFetch, Decode/OpFetch, DoOp, WriteReg
- Load: IFetch, Decode, EffAddr, Cache, WriteReg

Slide 26: Performance of Processor
Which is more important?
- the execution time of an instruction, or
- the throughput of instruction execution (the number of instructions executed per unit time)?
Cycles per instruction (CPI): in our example, CPI is between 3 and 5.
Objective of pipelining: improve CPI; make it close to 1.
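Why CPI matters can be quantified with the classic performance equation, time = instruction count × CPI × cycle time. A small illustrative sketch (the instruction count and cycle time are made-up numbers, not from the slides):

```python
def execution_time_ns(instruction_count, cpi, cycle_time_ns):
    """Classic performance equation: time = instruction count x CPI x cycle time."""
    return instruction_count * cpi * cycle_time_ns

# One million instructions at a 1 ns cycle time:
unpipelined = execution_time_ns(1_000_000, 4, 1)  # CPI ~ 4 (our 3-5 cycle range)
pipelined = execution_time_ns(1_000_000, 1, 1)    # the pipelined ideal, CPI ~ 1

print(unpipelined / pipelined)  # speedup from reducing CPI alone
```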

Slide 27: Steps in Instruction Processing
1. Instruction Fetch (IF): the instruction is fetched from memory and the PC is incremented
2. Instruction Decode (ID): the instruction is decoded and register operands are fetched
3. Execute (EX): execute if an arithmetic operation; else, calculate the effective address
4. Memory operation (MEM): if load/store, do the memory access
5. Write Back (WB): write the computed value to the destination register

Slide 28: Pipelining
Four instructions overlapped in the 5-stage pipeline (time runs left to right):

cycle:   1   2   3   4   5   6   7   8
i:       IF  ID  EX  MEM WB
i+1:         IF  ID  EX  MEM WB
i+2:             IF  ID  EX  MEM WB
i+3:                 IF  ID  EX  MEM WB

Instruction execution time: 5 cycles. Instruction execution throughput: 1 instruction per cycle. It may not always be possible for instructions to progress through the pipeline in this way.
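The ideal, hazard-free timing can be sketched in a couple of helper functions (the names are invented for illustration):

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def stage_cycle(instruction, stage):
    """Cycle (counting from 0) in which a given instruction (0-based index)
    occupies a given stage, assuming no hazards: each instruction simply
    starts one cycle after its predecessor."""
    return instruction + STAGES.index(stage)

def total_cycles(n_instructions):
    """Cycles to run n instructions through the 5-stage pipeline with no stalls:
    5 cycles to fill, then one instruction completes per cycle."""
    return len(STAGES) + n_instructions - 1

print(total_cycles(4))  # the 4-instruction example from the slide
```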

Slide 29: Pipeline Hazards
Hazard: a situation that prevents the next instruction of the program from executing during its designated clock cycle.
1. Structural hazard: happens due to a request for the same hardware resource by 2 or more instructions at the same time
2. Data hazard: happens when one instruction depends on the result of a previous instruction that is still in the pipeline
3. Control hazard: happens due to control transfer instructions

Slide 30: 1. Structural Hazards

cycle:                    1   2   3   4   5   6   7   8   9
i:   LW R3 ← mem[8(R2)]   IF  ID  EX  MEM WB
i+1:                          IF  ID  EX  MEM WB
i+2:                              IF  ID  EX  MEM WB
i+3:                                  B   IF  ID  EX  MEM WB

MEM (of instruction i) and IF (of instruction i+3) both need to use memory in the same cycle, so the fetch of i+3 is delayed by one bubble (B).

Slide 31: 2. Data Hazards

cycle:                  1   2   3   4   5   6   7   8
i:   add R3 ← R1, R2    IF  ID  EX  MEM WB
i+1: sub R4 ← R3, R8        IF  B   B   ID  EX  MEM WB

The sub needs R3 in its ID stage, but the add does not write R3 until its WB stage, so the sub is stalled with bubbles (B) until the value is available.

Slide 32: A Data Hazard Solution
Interlock: hardware that detects the data dependency and stalls dependent instructions.

cycle:  0   1   2     3     4   5    6
ADD:    IF  ID  EX    MEM   WB
SUB:        IF  stall stall ID  EX   MEM
OR:             stall stall IF  ID   EX

Slide 33: Another Data Hazard Solution
Forwarding (or bypassing): forward the result to EX as soon as it is available.

cycle:               1   2   3   4   5   6   7
add R3 ← R1, R2      IF  ID  EX  MEM WB
sub R5 ← R3, R4          IF  ID  EX  MEM WB
or  R7 ← R3, R6              IF  ID  EX  MEM WB

The add's result is forwarded from its EX output directly to the EX inputs of the sub and the or, so no stalls are needed.

Slide 34: Other Data Hazard Solutions
- Delayed loads: require that the instruction that uses the load value be separated from the load instruction
- Instruction scheduling: reorder instructions so that dependent instructions are far enough apart (compile time vs. run time instruction scheduling)

Slide 35: Instruction Scheduling
Before scheduling (a stall follows each load):
LW R3 ← 0(R1)
ADDI R5 ← R3, #1 (stall: R3 not yet available)
ADD R2 ← R2, R3
LW R13 ← 0(R11)
ADD R12 ← R13, R3 (stall: R13 not yet available)
After scheduling (0 stalls):
LW R3 ← 0(R1)
LW R13 ← 0(R11)
ADDI R5 ← R3, #1
ADD R2 ← R2, R3
ADD R12 ← R13, R3
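A toy stall counter (my own simplification, assuming exactly one bubble whenever a load's result is used by the very next instruction) shows the benefit of the reordering:

```python
def count_load_use_stalls(program):
    """Count stalls in a list of (opcode, dest, sources) tuples, assuming
    one bubble per load whose result is needed by the next instruction."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, curr_srcs = curr
        if prev_op == "LW" and prev_dest in curr_srcs:
            stalls += 1
    return stalls

before = [("LW", "R3", ["R1"]), ("ADDI", "R5", ["R3"]), ("ADD", "R2", ["R2", "R3"]),
          ("LW", "R13", ["R11"]), ("ADD", "R12", ["R13", "R3"])]
after = [("LW", "R3", ["R1"]), ("LW", "R13", ["R11"]), ("ADDI", "R5", ["R3"]),
         ("ADD", "R2", ["R2", "R3"]), ("ADD", "R12", ["R13", "R3"])]

print(count_load_use_stalls(before), count_load_use_stalls(after))
```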

Slide 36: 3. Control Hazards

cycle:              1   2   3   4   5   6   7   8
BEQZ R3, out        IF  ID  EX  MEM WB
next instruction        B   B   B   IF  ID  EX  MEM …

While the branch is in the pipeline, the fetch unit does not know whether to fetch instruction i+1 or from the branch target. The branch condition and target are resolved late in the pipeline (here, by the MEM stage); only once the branch is resolved is the appropriate instruction correctly fetched.

Slide 37: Lecture Summary
- Computer architecture is the study of computer structures: design, evaluation, description
- It builds on a background of computer organization, the study of how data can be represented and manipulated
- Pipelined processors improve program execution time (instruction execution throughput) by overlapping in time the execution of many instructions

Slide 38: Next Week
Instruction Level Parallelism (ILP) and how it is exploited by current processors to improve program execution time even more.

