Presentation on theme: "CPU performance CPU power consumption"— Presentation transcript:
1 CPUs
- CPU performance
- CPU power consumption
2 Elements of CPU performance
- Cycle time
- CPU pipeline
- Memory system
3 Pipelining
Several instructions are executed simultaneously at different stages of completion.
Various conditions can cause pipeline bubbles that reduce utilization:
- branches
- memory system delays
- etc.
4 Pipeline structures
ARM7 has a 3-stage pipeline:
- fetch instruction from memory
- decode opcode and operands
- execute
ARM9 has a 5-stage pipeline:
- instruction fetch
- decode
- execute
- data memory access
- register write
7 Performance measures
- Latency: time it takes for an instruction to get through the pipeline
- Throughput: number of instructions executed per time period
- Pipelining increases throughput without reducing latency
8 Pipeline stalls
If every step cannot be completed in the same amount of time, the pipeline stalls.
Bubbles introduced by a stall increase latency and reduce throughput.
10 Control stalls
- Branches often introduce stalls (branch penalty).
- Stall time may depend on whether the branch is taken.
- May have to squash instructions that have already started executing.
- Don't know what to fetch until the condition is evaluated.
11 ARM pipelined branch
    bne foo
    sub r2, r3, r6
foo add r0, r1, r2
The bne is fetched, decoded, and executed; by the time the branch resolves in the execute stage, the sub has already been fetched and decoded and must be squashed. The add at foo is then fetched, decoded, and executed.
12 Delayed branch
To increase pipeline efficiency, the delayed branch mechanism requires that the n instructions after a branch always execute, whether or not the branch is taken.
13 Example: ARM7 execution time
Determine the execution time of an FIR filter:
for (i = 0; i < N; i++)
    f = f + c[i] * x[i];
Only the branch in the loop test may take more than one cycle: BLT loop takes 1 cycle in the best case, 3 in the worst case.
14 ARM10 processor execution time
Impossible to describe briefly the exact behavior of all instructions in all circumstances:
- branch prediction
- prefetch buffer
- branch folding
- the independent load/store unit
- data alignment
- how many accesses hit in the cache and TLB
16 Integer core
Prefetch unit:
- fetches instructions from the I-cache or external memory
- predicts the outcome of branches whenever it can
Integer unit:
- decode
- barrel shifter, ALU, multiplier
- main instruction sequencer
Load/store unit:
- loads or stores two registers (64 bits) per cycle
- decouples from the integer unit after the first access of an LDM or STM instruction
- supports Hit-Under-Miss (HUM) operation
17 Pipeline
- Fetch: I-cache access, branch prediction
- Issue: initial instruction decode
- Decode: final instruction decode, register read for ALU op, forwarding, and initial interlock resolution
- Execute: data address calculation, shift, flag setting, CC check, branch mispredict detection, and store data register read
- Memory: data cache access
- Write: register writes, instruction retirement
21 Interlocks
The integer core uses pipeline interlocks and forwarding to resolve data dependencies between instructions.
Data dependency interlocks:
- instructions that have a source register that is loaded from memory by the previous instruction
Hardware dependency interlocks:
- a new load waiting for the LSU to finish an existing LDM or STM
- a load that misses when the HUM slot is already occupied
- a new multiply waiting for a previous multiply to free up the first stage of the multiplier
23 Example of interlocking and forwarding
Execute-to-execute:
mov r0, #1
add r1, r0, #1
Memory-to-execute:
sub r1, r2, #2
add r2, r0, #1
24 Example of interlocking and forwarding, cont'd
Single-cycle interlock:
ldr r0, [r1, r2]
str r3, [r0, r4]
The ldr computes r1 + r2 in execute and only produces r0 in its memory stage; the str, whose execute stage needs r0 + r4 (and reads r3 for the store), must therefore wait one cycle, introducing a single-cycle bubble.
25 Superscalar execution
A superscalar processor can execute several instructions per cycle.
It uses multiple pipelined data paths.
Programs execute faster, but it is harder to determine how much faster.
26 Data dependencies
Execution time depends on operands, not just the opcode. A superscalar CPU checks data dependencies dynamically:
add r2, r0, r1
add r3, r2, r5
The second add reads r2, which the first add writes, so the two cannot issue in the same cycle.
27 Memory system performance
- Caches introduce indeterminacy in execution time, which depends on the order of execution.
- Cache miss penalty: added time due to a cache miss.
- Several reasons for a miss: compulsory, conflict, capacity.