Natawut Nupairoj, Assembly Language: Pipelining Processor

1 Pipelining Processor

2 Instruction Cycle
    pc = 0;
    do {
        ir = memory[pc++];   /* Fetch the instruction. */
        decode(ir);          /* Decode the instruction. */
        fetch(operands);     /* Fetch the operands. */
        execute();           /* Execute the instruction. */
        store(results);      /* Store the results. */
    } while (ir != HALT);
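The cycle above can be tried out on a toy machine. The Python sketch below is only an illustration: the three opcodes, the tuple encoding and the single accumulator are hypothetical and are not part of the slides or of any real instruction set.

```python
# Hypothetical toy machine: each instruction is a tuple (opcode, operand...).
HALT, LOAD, ADD = "halt", "load", "add"

def run(memory):
    """Run the fetch/decode/execute/store loop until HALT; return the accumulator."""
    pc = 0
    acc = 0
    while True:
        ir = memory[pc]          # fetch the instruction
        pc += 1
        opcode, *operands = ir   # decode it and fetch its operands
        if opcode == HALT:
            break
        if opcode == LOAD:       # execute, then store the result in acc
            acc = operands[0]
        elif opcode == ADD:
            acc = acc + operands[0]
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return acc

program = [(LOAD, 9), (ADD, 14), (HALT,)]
result = run(program)
print(result)
```

Every instruction passes through all the steps one after another here, which is exactly the non-pipelined behaviour that the following slides improve on.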

3 Pipelining
Improve the execution speed.
Divide the instruction cycle into “stages”.
Each stage executes independently and concurrently.
Pipelining is natural! (From David Patterson’s lecture notes.)

7 Pipelining Lessons
Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload.
Multiple tasks operate simultaneously, using different resources.
Potential speedup = number of pipe stages.

8 Pipelining in Modern Processors
The instruction cycle is divided into five stages:
Fetch, Decode, Operand Fetch, Execute, Store

9 Pipelining Execution
Time     1  2  3  4  5  6  7
Inst 1   F  D  O  E  S
Inst 2      F  D  O  E  S
Inst 3         F  D  O  E  S
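The staggered diagram follows one rule: instruction i enters stage s in cycle i + s (counting from zero). A minimal Python sketch of that rule follows; the function name `pipeline_rows` and the `.` filler for idle cycles are choices made for this illustration.

```python
STAGES = "FDOES"  # Fetch, Decode, Operand fetch, Execute, Store

def pipeline_rows(n_instructions, stages=STAGES):
    """Return one string per instruction; column i is clock cycle i + 1."""
    total_cycles = n_instructions + len(stages) - 1
    rows = []
    for i in range(n_instructions):
        cells = ["."] * total_cycles
        for s, name in enumerate(stages):
            cells[i + s] = name   # instruction i is in stage s during cycle i + s + 1
        rows.append(" ".join(cells))
    return rows

for row in pipeline_rows(3):
    print(row)
```

With three instructions this reproduces the seven-cycle picture: each row is the previous one shifted right by one cycle.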

10 Performance of Pipeline
What do we gain?
Suppose we execute 1000 instructions on non-pipelined and pipelined CPUs.
Clock speed = 500 MHz (1 clock = 2 ns).
Non-pipelined CPU:
– total time = 2 ns/cycle x 5 cycles/instr. x 1000 instr. = 10,000 ns = 10 µs.
Perfectly pipelined CPU:
– total time = 2 ns/cycle x (1 cycle/instr. x 1000 instr. + 4 cycles drain) = 2,008 ns = 2.008 µs.
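The same arithmetic can be checked in a few lines of Python. This is a sketch under the slide's own assumptions (5 stages, a 2 ns clock, a perfect pipeline with no stalls); the function names are made up for the example, and all results are in nanoseconds.

```python
def non_pipelined_ns(n_instr, stages=5, cycle_ns=2):
    """Every instruction occupies the CPU for all of its stages."""
    return cycle_ns * stages * n_instr

def pipelined_ns(n_instr, stages=5, cycle_ns=2):
    """One instruction completes per cycle, plus stages - 1 cycles to drain the pipe."""
    return cycle_ns * (n_instr + stages - 1)

slow = non_pipelined_ns(1000)
fast = pipelined_ns(1000)
speedup = slow / fast
print(slow, fast, round(speedup, 2))
```

The speedup comes out just under 5, the number of stages, and it approaches 5 as the instruction count grows because the 4 drain cycles matter less and less.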

11 Nothing Is Perfect!
Problem with branches: we don’t know what to fetch next until the branch is decoded.
Time            1  2  3  4  5  6  7  8
Inst 1          F  D  O  E  S
Inst 2: JMP X      F  D  O  E  S
Inst X                   F  D  O  E  S
The branch target address is not available until the JMP has been decoded, so Inst X is fetched late.

12 Stalled Pipe
When the pipeline does not flow smoothly, we say it is “stalled”.
Besides branches, what else can stall the pipe?
– Subroutine calls
– Memory accesses
– Multi-cycle execution
Can we do better? Yes, but we will discuss that later.

13 Branching in Sparc
Sparc uses a 5-stage pipeline.
Recall: the pipe is stalled due to the branch!
Time            1  2  3  4  5  6  7  8
Inst 1          F  D  O  E  S
Inst 2: JMP X      F  D  O  E  S
Inst X                   F  D  O  E  S
The branch target address is not available until the JMP has been decoded, so Inst X is fetched late.

14 Branching in Sparc
However, Sparc does not stall. Instead, it executes the instruction that follows the branch (or call) instruction BEFORE it actually branches.
The position of that instruction is called the “delay slot”.

15 Delay Slot
Time            1  2  3  4  5  6  7  8
Inst 1          F  D  O  E  S
Inst 2: JMP X      F  D  O  E  S
Inst 3                F  D  O  E  S     <- delay slot
Inst X                   F  D  O  E  S
The branch target address is not available until the JMP has been decoded; Inst 3 uses the cycle that would otherwise be lost.
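The effect of the stall and of the delay slot can be put into numbers. The Python sketch below is an illustration only: the function name is made up, and the cost of one lost cycle per stalled branch is an assumption read off the eight-column diagrams, not a statement about any particular Sparc implementation.

```python
def total_cycles(n_instructions, stalled_branches=0, stages=5, bubbles_per_branch=1):
    """Cycles needed to push n_instructions through the pipe, including stall bubbles."""
    fill = stages - 1   # cycles spent before the first instruction completes
    return fill + n_instructions + stalled_branches * bubbles_per_branch

smooth = total_cycles(3)                        # three instructions, no branch
stalled = total_cycles(3, stalled_branches=1)   # Inst 1, JMP X, Inst X with a stall
delay_slot = total_cycles(4)                    # Inst 1, JMP X, Inst 3, Inst X, no stall
print(smooth, stalled, delay_slot)
```

Under these assumptions the stalled version and the delay-slot version take the same number of cycles, but the delay-slot version completes one more useful instruction in that time.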

16 Filling Delay Slots with NOP
        .global main
main:   save  %sp, -64, %sp
        mov   9, %l0
        sub   %l0, 2, %o0
        add   %l0, 14, %o1     ! Instruction before branch
        call  .mul
        nop                    ! Delay slot => wasted
        add   %l0, 8, %o1      ! Instruction before branch
        call  .div
        nop                    ! Delay slot => wasted
        mov   %o0, %l1
        mov   1, %g1
        ta    0

17 Filling Delay Slots
        .global main
main:   save  %sp, -64, %sp
        mov   9, %l0
        sub   %l0, 2, %o0
        call  .mul
        add   %l0, 14, %o1     ! Delay slot filled
        call  .div
        add   %l0, 8, %o1      ! Delay slot filled
        mov   %o0, %l1
        mov   1, %g1
        ta    0

18 Optimizing Our Second Program
Can we fill the delay slot?
        ...
        mov   %o0, %l1         ! Store it in y
        add   %l0, 1, %l0      ! x++
        cmp   %l0, 11          ! x < 11 ?
        bl    loop
        nop                    ! Delay slot => wasted
        ...
Not with cmp (bl depends on it), and not with add (cmp depends on add).
mov can! No instruction after it, up to and including bl, depends on it.

19 Optimizing Our Second Program
        ...
        add   %l0, 1, %l0      ! x++
        cmp   %l0, 11          ! x < 11 ?
        bl    loop
        mov   %o0, %l1         ! Store it in y (moved into the delay slot)
        ...
The key is to fill the delay slot with an instruction whose result no other instruction, up to and including the branch, depends on.

20 Filling Delay Slot Summary
After a branch or call, there is one delay slot.
Always try to fill the delay slot to improve performance.
When filling the slot, don’t change the results the program computes:
– No other instruction (before the branch, or the branch itself) may depend on the instruction in the delay slot.
You can always fill the slot with “nop”.
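The dependence rule can be written as a small check. The Python sketch below is a simplified model, not a real scheduler: the `Instr` record, the hand-written read/write sets and the pseudo-register `icc` for the condition codes are assumptions of this example. It also rejects a move when a later instruction overwrites one of the candidate's inputs or outputs, which goes slightly beyond the rule stated on the slide.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    text: str
    reads: frozenset
    writes: frozenset

def can_fill_delay_slot(candidate_index, block):
    """block is straight-line code ending with the branch.

    The candidate may move into the delay slot only if moving it past every
    later instruction (the branch included) cannot change any result.
    """
    cand = block[candidate_index]
    for later in block[candidate_index + 1:]:
        if cand.writes & later.reads:    # later instruction needs the candidate's result
            return False
        if cand.reads & later.writes:    # later instruction overwrites an input
            return False
        if cand.writes & later.writes:   # later instruction overwrites the same output
            return False
    return True

block = [
    Instr("mov %o0, %l1",    frozenset({"%o0"}), frozenset({"%l1"})),
    Instr("add %l0, 1, %l0", frozenset({"%l0"}), frozenset({"%l0"})),
    Instr("cmp %l0, 11",     frozenset({"%l0"}), frozenset({"icc"})),
    Instr("bl loop",         frozenset({"icc"}), frozenset()),
]

for i, instr in enumerate(block[:-1]):
    print(instr.text, "->", can_fill_delay_slot(i, block))
```

On the loop tail from the second program, only the mov passes the check, which matches the choice made on the earlier slides.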

21 Do…While Delay Slot
How to fill the delay slot?
– With an independent instruction (see our second program).
– With the target instruction and an annulled branch, when you cannot find any independent instruction.

22 Filling with Target Instruction
        sub   %l0, 1, %o0      ! (x-1) to %o0, executed once
loop:   call  .mul
        sub   %l0, 7, %o1      ! (x-7) to %o1, delay slot
        call  .div
        sub   %l0, 11, %o1     ! (x-11) to %o1, delay slot
        mov   %o0, %l1         ! Store it in y
        add   %l0, 1, %l0      ! x++
        cmp   %l0, 11          ! x < 11 ?
        bl,a  loop
        sub   %l0, 1, %o0      ! (x-1) to %o0 (delay slot)

23 Annulled Branch
The instruction in the delay slot is executed if and only if the branch is taken.
The program is one instruction longer and wastes one cycle when the loop exits.
There is no need to find an independent instruction.
Can be used with any type of branch.
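The rule for the delay slot fits in one line of logic. The Python sketch below models conditional branches only, which is an assumption of this example; the function name is made up for the illustration.

```python
def delay_slot_executes(branch_taken, annul_bit):
    """Conditional branch: without ,a the slot always runs; with ,a only if taken."""
    return branch_taken or not annul_bit

for annul_bit in (False, True):
    for branch_taken in (False, True):
        print(f"annul={annul_bit!s:5} taken={branch_taken!s:5} "
              f"slot executes={delay_slot_executes(branch_taken, annul_bit)}")
```

The only combination that skips the delay slot is an annulled branch that is not taken, which is the loop-exit case where one cycle is wasted.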

24 While Loop Optimization
Reduce the number of instructions executed inside the loop:
– First, jump to the comparison at the end of the loop.
– Then fill the delay slots.

25 While Loop Optimization Example
        ba    test             ! Initial jump
        nop                    ! Delay slot
loop:   add   %l0, %l1, %l0    ! a = a + b
        add   %l2, 1, %l2      ! c++
test:   cmp   %l0, 17          ! Check condition
        ble   loop             ! Repeat if true
        nop                    ! Delay slot

26 While Loop Optimization Example
        ba    test             ! Initial jump
        cmp   %l0, 17          ! Check condition (DS)
loop:   add   %l2, 1, %l2      ! c++
        cmp   %l0, 17          ! Check condition
test:   ble,a loop             ! Repeat if true
        add   %l0, %l1, %l0    ! a = a + b, very tricky! (DS)
Performance improvement (instructions executed):
– Direct translation = 7 * number of loop iterations
– With initial jump = 5 * number of loop iterations
– With delay slots filled = 4 * number of loop iterations
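The three instruction counts can be reproduced with a toy interpreter that models delay slots and the annul bit. The Python sketch below is an illustration under stated assumptions: the tuple encoding, the `run` function and the `exit` label are invented for this example, only the handful of instructions used here are supported, and the `DIRECT` listing is an assumed direct translation, since the slides do not show it.

```python
def run(program, regs):
    """Execute a toy Sparc-like program with delay slots; return instructions executed."""
    labels = {label: i for i, (label, _) in enumerate(program) if label}
    labels.setdefault("exit", len(program))   # hypothetical label just past the end
    pc, npc = 0, 1          # current and next program counter
    le = False              # outcome of the last cmp (register <= immediate)
    executed = 0
    while pc < len(program):
        op, *args = program[pc][1]
        executed += 1
        next_npc = npc + 1
        annul = False
        if op == "add":
            src, operand, dst = args
            value = regs[operand] if isinstance(operand, str) else operand
            regs[dst] = regs[src] + value
        elif op == "cmp":
            le = regs[args[0]] <= args[1]
        elif op == "ba":
            next_npc = labels[args[0]]
        elif op in ("ble", "ble,a", "bg"):
            taken = (not le) if op == "bg" else le
            if taken:
                next_npc = labels[args[0]]
            elif op.endswith(",a"):
                annul = True              # untaken annulled branch skips its delay slot
        elif op != "nop":
            raise ValueError(f"unsupported instruction: {op}")
        pc, npc = npc, next_npc           # the delay slot runs before the target
        if annul:
            pc, npc = npc, npc + 1
    return executed

DIRECT = [  # assumed direct translation of: while (a <= 17) { a = a + b; c++; }
    ("test", ("cmp", "l0", 17)), (None, ("bg", "exit")), (None, ("nop",)),
    (None, ("add", "l0", "l1", "l0")), (None, ("add", "l2", 1, "l2")),
    (None, ("ba", "test")), (None, ("nop",)),
]
INITIAL_JUMP = [
    (None, ("ba", "test")), (None, ("nop",)),
    ("loop", ("add", "l0", "l1", "l0")), (None, ("add", "l2", 1, "l2")),
    ("test", ("cmp", "l0", 17)), (None, ("ble", "loop")), (None, ("nop",)),
]
FILLED = [
    (None, ("ba", "test")), (None, ("cmp", "l0", 17)),
    ("loop", ("add", "l2", 1, "l2")), (None, ("cmp", "l0", 17)),
    ("test", ("ble,a", "loop")), (None, ("add", "l0", "l1", "l0")),
]

def count(program):
    """Run with a = 0, b = 1, c = 0, which gives 18 loop iterations."""
    regs = {"l0": 0, "l1": 1, "l2": 0}
    return run(program, regs), regs

for name, program in (("direct", DIRECT), ("initial jump", INITIAL_JUMP), ("filled", FILLED)):
    print(name, count(program))
```

All three versions leave the same values in the registers. For 18 iterations the sketch counts 129, 95 and 75 executed instructions, that is 7n + 3, 5n + 5 and 4n + 3, so the per-iteration costs agree with the 7, 5 and 4 quoted on the slide once the small fixed overhead is set aside.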

