CPU performance CPU power consumption

Name: CPU performance CPU power consumption
Uploaded: 2017-08-27T05:20:01+00:00
Duration: PTM9S11
Channel: Denzel Spearman
Description: CPU performance CPU power consumption

CPU performance CPU power consumption
CPUs CPU performance CPU power consumption

Elements of CPU performance
Cycle time CPU pipeline Memory system

Pipelining Several instructions are executed simultaneously at different stages of completion Various conditions can cause pipeline bubbles that reduce utilization: branches memory system delays etc.

Pipeline structures ARM7 has 3-stage pipes: ARM9 have 5-stage pipes:
fetch instruction from memory decode opcode and operands execute ARM9 have 5-stage pipes: Instruction fetch Decode Execute Data memory access Register write

ARM9 core instruction pipeline

ARM7 pipeline execution
add r0,r1,#5 fetch decode fetch execute decode fetch execute decode sub r2,r3,r6 execute cmp r2,#3 time 1 2 3

Performance measures Latency: time it takes for an instruction to get through the pipeline Throughput: number of instructions executed per time period Pipelining increases throughput without reducing latency

Pipeline stalls If every step cannot be completed in the same amount of time, pipeline stalls Bubbles introduced by stall increase latency, reduce throughput

ARM multi-cycle LDMIA instruction
r0,{r2,r3} fetch decode ex ld r2 ex ld r3 sub r2,r3,r6 fetch decode ex sub cmp r2,#3 fetch decode ex cmp time

Control stalls Branches often introduce stalls (branch penalty)
Stall time may depend on whether branch is taken May have to squash instructions that already started executing Don’t know what to fetch until condition is evaluated

ARM pipelined branch fetch decode fetch decode time ex bne ex bne
bne foo sub r2,r3,r6 foo add r0,r1,r2 ex bne fetch decode ex add time

Delayed branch To increase pipeline efficiency, delayed branch mechanism requires n instructions after branch always executed whether branch is executed or not

Example: ARM7 execution time
Determine execution time of FIR filter: for (i=0; i<N; i++) f = f + c[i]*x[i]; Only branch in loop test may take more than one cycle. BLT loop takes 1 cycle best case, 3 worst case.

ARM10 processor execution time
Impossible to describe briefly the exact behavior of all instructions in all circumstances Branch prediction Prefetch buffer Branch folding The independent Load/Store Unit Data alignment How many accesses hit in the cache and TLB

ARM10 integer core 3 instr’s

Integer core Prefetch Unit Integer Unit Load/store Unit
Fetches instructions from I-cache or external memory Predicts the outcome of branches whenever it can Integer Unit Decode Barrel shifter, ALU, Multiplier Main instruction sequencer Load/store Unit Load or store two registers(64bits) per cycle Decouple from the integer unit after the first access of a LDM or STM instruction Supports Hit-Under-Miss (HUM) operation

Pipeline Fetch Issue Decode Execute Memory Write
I-cache access, branch prediction Issue Initial instruction decode Decode Final instruction decode, register read for ALU op, forwarding, and initial interlock resolution Execute Data address calculation, shift, flag setting, CC check, branch mispredict detection, and store data register read Memory Data cache access Write Register writes, instruction retirement

Typical operations

Load or store operation

LDR operation that misses

Interlocks Integer core Pipeline interlocks
forwarding to resolve data dependencies between instructions Pipeline interlocks Data dependency interlocks: Instructions that have a source register that is loaded from memory by the previous instruction Hardware dependency A new load waiting for the LSU to finish an existing LDM or STM A load that misses when the HUM slot is already occupied A new multiply waiting for a previous multiply to free up the first stage of the multiply

Pipeline forwarding paths

Example of interlocking and forwarding
Execute-to-execute mov r0, #1 add r1, r0, #1 Memory-to-execute sub r1, r2, #2 add r2, r0, #1

Example of interlocking and forwarding, cont’d
Single cycle interlock ldr r0, [r1, r2] str r3, [r0, r4] fetch decode write memory execute issue r1+r2 r0 read ldr fetch issue decode execute memory write r0+r4 r3 read r3 write str

Superscalar execution
Superscalar processor can execute several instructions per cycle. Uses multiple pipelined data paths. Programs execute faster, but it is harder to determine how much faster.

Data dependencies Execution time depends on operands, not just opcode.
Superscalar CPU checks data dependencies dynamically: data dependency r0 r1 add r2,r0,r1 add r3,r2,r5 r2 r5 r3

Memory system performance
Caches introduce indeterminacy in execution time Depends on order of execution Cache miss penalty: added time due to a cache miss Several reasons for a miss: compulsory, conflict, capacity

CPU performance CPU power consumption

Similar presentations

Presentation on theme: "CPU performance CPU power consumption"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

CPU performance CPU power consumption

Similar presentations

Presentation on theme: "CPU performance CPU power consumption"— Presentation transcript:

Similar presentations

About project

Feedback