
1 1999 ©UCB CS 162 Computer Architecture Lecture 2: Introduction & Pipelining Instructor: L.N. Bhuyan www.cs.ucr.edu/~bhuyan/cs162

2 1999 ©UCB Review of Last Class
°MIPS Datapath
°Introduction to Pipelining
°Introduction to Instruction Level Parallelism (ILP)
°Introduction to VLIW

3 1999 ©UCB What is Multiprocessing?
°Parallelism at the instruction level is limited by data dependencies => speedup is limited!!
°Program-level parallelism is abundant, e.g. a loop such as Do I = 1, 1000 (loop-level parallelism). How about employing multiple processors to execute the loop iterations? => Parallel processing or multiprocessing
°With a billion transistors on a chip, we can put a few CPUs on one chip => Chip multiprocessor
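A minimal sketch (mine, not from the lecture) of loop-level parallelism in C with POSIX threads: the iterations of a "Do I = 1, 1000"-style loop are split across workers. NTHREADS, worker, and the arrays a/b/c are illustrative names, assuming the iterations are independent.

/* Split 1000 independent loop iterations across 4 threads. */
#include <pthread.h>
#include <stdio.h>

#define N        1000
#define NTHREADS 4

static double a[N], b[N], c[N];

struct range { int lo, hi; };            /* iteration range for one worker */

static void *worker(void *arg)
{
    struct range *r = (struct range *)arg;
    for (int i = r->lo; i < r->hi; i++)
        c[i] = a[i] + b[i];              /* no data dependency between iterations */
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    struct range ranges[NTHREADS];

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    for (int t = 0; t < NTHREADS; t++) {
        ranges[t].lo = t * (N / NTHREADS);
        ranges[t].hi = (t + 1) * (N / NTHREADS);
        pthread_create(&tid[t], NULL, worker, &ranges[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);

    printf("c[0]=%.1f  c[N-1]=%.1f\n", c[0], c[N - 1]);
    return 0;
}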

4 1999 ©UCB Memory Latency Problem
Even if we increase CPU power, memory is the real bottleneck. Techniques to alleviate the memory latency problem:
1. Memory hierarchy – Program locality, cache memory, multiple levels, pages and context switching
2. Prefetching – Get the instruction/data before the CPU needs it. Good for instructions because of sequential locality, so all modern processors use prefetch buffers for instructions. What to do with data?
3. Multithreading – Can the CPU jump to another thread while it waits on memory? It is like multiprogramming!!
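To make point 2 concrete for data, here is a small sketch (mine, assuming a GCC/Clang compiler that provides __builtin_prefetch) of software data prefetching; the array name and prefetch distance are illustrative.

#include <stddef.h>

double sum_with_prefetch(const double *a, size_t n)
{
    const size_t dist = 16;              /* prefetch ~16 elements ahead (tunable) */
    double sum = 0.0;

    for (size_t i = 0; i < n; i++) {
        if (i + dist < n)
            __builtin_prefetch(&a[i + dist], /*rw=*/0, /*locality=*/1);
        sum += a[i];                     /* this load should now hit in the cache */
    }
    return sum;
}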

5 1999 ©UCB Hardware Multithreading
°We need a hardware multithreading technique because switching between threads in software is very time-consuming (Why?), so it is suitable only for hiding I/O latency, not main-memory latency. Ex: Multitasking
°Provide multiple PCs and register sets on the CPU so that thread switching can occur without having to store the register contents in main memory (on the stack, as is done for context switching).
°Several threads reside in the CPU simultaneously, and execution switches between the threads on a main-memory access.
°How about combining multiprocessing and multithreading on one chip? => Network Processor
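An illustrative sketch (not from the slides) of the per-thread state a hardware-multithreaded core keeps: a separate PC and register set per context, so a switch needs no save/restore to main memory. The names and the round-robin policy are assumptions for the example.

#include <stdint.h>
#include <stdbool.h>

#define NUM_HW_THREADS 4
#define NUM_REGS       32

struct hw_context {
    uint32_t pc;                 /* per-thread program counter        */
    uint32_t regs[NUM_REGS];     /* per-thread register file          */
    bool     waiting_on_memory;  /* stalled on a main-memory access   */
};

static struct hw_context ctx[NUM_HW_THREADS];

/* Pick the next runnable thread; real hardware does this selection in
 * a single cycle, typically round-robin over the contexts. */
int next_thread(int current)
{
    for (int i = 1; i <= NUM_HW_THREADS; i++) {
        int t = (current + i) % NUM_HW_THREADS;
        if (!ctx[t].waiting_on_memory)
            return t;            /* switch without touching memory    */
    }
    return current;              /* every context stalled: keep waiting */
}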

6 1999 ©UCB Architectural Comparisons (cont.)
[Figure: issue-slot diagrams over time (processor cycles) comparing Superscalar, Fine-Grained multithreading, Coarse-Grained multithreading, Multiprocessing, and Simultaneous Multithreading; slots are filled by Threads 1–5, with idle slots marked.]

7 1999 ©UCB Intel IXP1200 Network Processor
°Initial component of the Intel Internet Exchange Architecture (IXA)
°Each microengine is a 5-stage pipeline – no ILP, 4-way multithreaded
°7-core multiprocessing – 6 microengines and a StrongARM core
°166 MHz fundamental clock rate – Intel claims 2.5 Mpps IP routing for 64-byte packets
°Already the most widely used NPU – or, more accurately, the most widely admitted use

8 1999 ©UCB IXP1200 Chip Layout
°StrongARM processing core
°Microengines introduce a new ISA
°I/O: PCI, SDRAM, SRAM, IX (a PCI-like packet bus)
°On-chip FIFOs: 16 entries, 64B each

9 1999 ©UCB IXP1200 Microengine
°4 hardware contexts – single-issue processor, explicit optional context switch on SRAM access
°Registers – all single-ported, separate GPRs, 1536 registers total
°32-bit ALU – can access GPR or XFER registers
°Standard 5-stage pipe
°4KB SRAM instruction store – not a cache!

10 1999 ©UCB Intel IXP2400 Microengine (New)
°XScale core replaces StrongARM
°1.4 GHz target in a 0.13-micron process
°Nearest-neighbor routes added between microengines
°Hardware to accelerate CRC operations and random-number generation
°16-entry CAM

11 1999 ©UCB MIPS Pipeline (Chapter 6, CS 161 text)

12 1999 ©UCB Review: Single-cycle Datapath for MIPS
[Figure: single-cycle datapath – PC, Instruction Memory (Imem), Registers, ALU, Data Memory (Dmem) – divided into Stage 1 through Stage 5: IFtch, Dcd, Exec, Mem, WB.]
°Use the datapath figure to represent the pipeline: IM, Reg, ALU, DM, Reg

13 1999 ©UCB Stages of Execution in Pipelined MIPS
5-stage instruction pipeline:
1) I-fetch: fetch instruction, increment PC
2) Decode: decode instruction, read registers
3) Execute: mem-reference: calculate address; R-format: perform ALU operation
4) Memory: load: read data from data memory; store: write data to data memory
5) Write Back: write data to register

14 1999 ©UCB Pipelined Execution Representation
°To simplify the pipeline, every instruction takes the same number of steps, called stages
°One clock cycle per stage
[Figure: five instructions, each passing through IFtch, Dcd, Exec, Mem, WB, staggered by one cycle; program flow runs down the page, time runs to the right.]
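A toy sketch (mine, not course code) that prints the staggered picture from the last two slides: which of the five stages each of five instructions occupies in every clock cycle. Names are simplified for illustration.

#include <stdio.h>

enum stage { IF, ID, EX, MEM, WB, NUM_STAGES };

static const char *stage_name[NUM_STAGES] = {
    "IFtch",  /* fetch instruction, increment PC          */
    "Dcd",    /* decode instruction, read registers       */
    "Exec",   /* ALU op, or address calculation for lw/sw */
    "Mem",    /* read (lw) or write (sw) data memory      */
    "WB"      /* write result back to the register file   */
};

int main(void)
{
    const int n_instr = 5;                       /* five instructions in flight */

    for (int cycle = 0; cycle < n_instr + NUM_STAGES - 1; cycle++) {
        printf("cycle %2d:", cycle + 1);
        for (int i = 0; i < n_instr; i++) {
            int s = cycle - i;                   /* stage instruction i is in   */
            if (s >= 0 && s < NUM_STAGES)
                printf("  instr%d=%s", i + 1, stage_name[s]);
        }
        printf("\n");
    }
    return 0;
}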

15 1999 ©UCB Datapath Timing: Single-cycle vs. Pipelined
°Assume the following delays for the major functional units: 2 ns for a memory access or ALU operation, 1 ns for a register file read or write
°Total datapath delay for single-cycle: set by the slowest instruction (lw) = 8 ns, so the single-cycle clock is 8 ns
°In the pipelined machine, each stage = length of the longest delay = 2 ns; 5 stages = 10 ns per instruction

Insn Type | Insn Fetch | Reg Read | ALU Oper | Data Access | Reg Write | Total Time
beq       | 2 ns       | 1 ns     | 2 ns     |             |           | 5 ns
R-format  | 2 ns       | 1 ns     | 2 ns     |             | 1 ns      | 6 ns
sw        | 2 ns       | 1 ns     | 2 ns     | 2 ns        |           | 7 ns
lw        | 2 ns       | 1 ns     | 2 ns     | 2 ns        | 1 ns      | 8 ns
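A worked comparison (my own arithmetic, using the slide's delays): total time for a short run of lw instructions on the single-cycle datapath versus the pipelined one. The instruction count is an assumption for the example.

#include <stdio.h>

int main(void)
{
    const int n = 3;                   /* number of lw instructions (assumed) */
    const int single_cycle_clock = 8;  /* longest instruction (lw) = 8 ns     */
    const int pipeline_clock = 2;      /* longest stage = 2 ns                */
    const int stages = 5;

    int t_single = n * single_cycle_clock;             /* each instr takes 8 ns      */
    int t_pipe   = (stages + n - 1) * pipeline_clock;  /* pipeline fill + 1 per instr */

    printf("single-cycle: %d ns, pipelined: %d ns\n", t_single, t_pipe);
    /* prints: single-cycle: 24 ns, pipelined: 14 ns */
    return 0;
}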

16 1999 ©UCB Pipelining Lessons
°Pipelining doesn't help the latency (execution time) of a single task; it helps the throughput of the entire workload
°Multiple tasks operate simultaneously using different resources
°Potential speedup = number of pipe stages
°Time to "fill" the pipeline and time to "drain" it reduce the speedup
°Pipeline rate is limited by the slowest pipeline stage
°Unbalanced lengths of pipe stages also reduce the speedup
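A small calculation (mine, using the previous slide's numbers) showing how fill/drain overhead limits speedup: with the 8 ns single-cycle clock and 2 ns stages, speedup = n*8 / ((5 + n - 1)*2) approaches 4, not 5, because the 8 ns single-cycle clock is less than 5x the 2 ns stage time; with perfectly balanced stages the limit would be the stage count.

#include <stdio.h>

int main(void)
{
    const double stage_ns = 2.0, single_ns = 8.0;
    const int stages = 5;

    for (int n = 1; n <= 1000000; n *= 10) {
        double t_single = n * single_ns;                 /* single-cycle time   */
        double t_pipe   = (stages + n - 1) * stage_ns;   /* fill + one per instr */
        printf("n=%7d  speedup=%.3f\n", n, t_single / t_pipe);
    }
    return 0;   /* speedup grows from 0.8 toward 4.0 as fill/drain amortizes */
}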

17 1999 ©UCB Single Cycle Datapath (From Ch 5)
[Figure: single-cycle datapath – PC, Imem (Read Addr, Instruction), register file (Read Reg1/Reg2, Write Reg, Write Data, Read data1/data2), Sign Extend, ALU with Zero output, Dmem (Address, Read data, Write Data), the PC+4 and branch adders with shift-left-2, and muxes controlled by RegDst, RegWrite, ALUSrc, ALUOp, MemRead, MemWrite, MemToReg, PCSrc; instruction fields 25:21, 20:16, 15:11, 15:0.]

18 1999 ©UCB Required Changes to Datapath
°Introduce registers to separate the 5 stages by putting IF/ID, ID/EX, EX/MEM, and MEM/WB registers in the datapath.
°The next PC value is computed in the 3rd stage, but we need to bring in the next instruction in the next cycle – move the PCSrc mux to the 1st stage. The PC is incremented unless there is a new branch address.
°The branch address is computed in the 3rd stage. With the pipeline, the PC value has changed by then! We must carry the PC value along with the instruction. Width of the IF/ID register = (IR) + (PC) = 64 bits.

19 1999 ©UCB Changes to Datapath Contd.
°For the lw instruction, we need the write-register address at stage 5. But the IR is now occupied by another instruction! So we must carry the IR destination field along as we move through the stages. See the connection in the figure.
°Length of the ID/EX register = (Reg1: 32) + (Reg2: 32) + (offset: 32) + (PC: 32) + (destination register: 5) = 133 bits
°Assignment: What are the lengths of the EX/MEM and MEM/WB registers?
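For illustration, here is a sketch (mine, not from the text) of the IF/ID and ID/EX pipeline registers written as C structs whose fields add up to the widths the slides compute; field names are assumptions, and the slide's bit widths are shown in the comments.

#include <stdint.h>

struct if_id {                 /* 32 + 32 = 64 bits */
    uint32_t ir;               /* fetched instruction                    */
    uint32_t pc_plus_4;        /* incremented PC, carried along          */
};

struct id_ex {                 /* 32 + 32 + 32 + 32 + 5 = 133 bits */
    uint32_t pc_plus_4;        /* still needed to form the branch target */
    uint32_t reg1;             /* register file read data 1              */
    uint32_t reg2;             /* register file read data 2              */
    uint32_t imm;              /* sign-extended 16-bit offset            */
    uint8_t  dest_reg;         /* 5-bit destination register field       */
};

/* Assignment hook: fill in EX/MEM and MEM/WB yourself; the next slide's
 * figure labels them 102 bits and 69 bits. */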

20 1999 ©UCB Pipelined Datapath (with Pipeline Regs) (6.2)
[Figure: the single-cycle datapath (PC, Imem, register file, sign extend, ALU, Dmem, PC+4 and branch adders with shift-left-2, muxes) partitioned into Fetch, Decode, Execute, Memory, and Write Back stages by the IF/ID (64 bits), ID/EX (133 bits), EX/MEM (102 bits), and MEM/WB (69 bits) pipeline registers; the 5-bit write-register field is carried through to Write Back.]

