Wackiness: Algorithm A: generate 200,000 random values 0-255, then add up all values >= 128. Algorithm B: same as Algorithm A, but sort the values before adding them up.
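The two algorithms can be sketched as below. This is a minimal illustration, not the code used for the slide: the function names are hypothetical, and Algorithm B is assumed to differ from A only by sorting first. Both versions do the same additions and return the same total; the only difference is how predictable the `v >= 128` branch is. In an interpreted language the timing gap may be partly hidden by interpreter overhead.

```python
import random


def sum_large_values(values):
    """Add up every value that is at least 128."""
    total = 0
    for v in values:
        if v >= 128:  # the conditional branch whose predictability matters
            total += v
    return total


def algorithm_a(values):
    # Unsorted input: the branch outcome looks random to a predictor.
    return sum_large_values(values)


def algorithm_b(values):
    # Sorted input: a long run of "not taken", then a long run of "taken".
    return sum_large_values(sorted(values))


values = [random.randrange(256) for _ in range(200_000)]
same_answer = algorithm_a(values) == algorithm_b(values)
```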
Pipelining Pt2
Pipelining Limits: In theory, an n-stage pipeline gives an n-times speedup, but only if all stages are balanced and only if the pipeline can be kept full.
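A small sketch of where the n-times figure comes from and how unbalanced stages erode it. The function names and the stage times in the usage note are assumptions for illustration, and the model ignores hazards entirely.

```python
def pipelined_cycles(num_instructions, num_stages):
    """Cycles to finish all instructions if the pipeline never stalls."""
    # The first instruction takes num_stages cycles to pass through;
    # each later instruction finishes one cycle after the one before it.
    return num_stages + (num_instructions - 1)


def unpipelined_cycles(num_instructions, num_stages):
    """Cycles if each instruction finishes every stage before the next starts."""
    return num_stages * num_instructions


def speedup(num_instructions, num_stages):
    return unpipelined_cycles(num_instructions, num_stages) / pipelined_cycles(
        num_instructions, num_stages
    )


def steady_state_speedup(stage_times_ns):
    """Speedup with a full pipeline when the clock must fit the slowest stage."""
    unpipelined_time = sum(stage_times_ns)  # one instruction, start to finish
    pipelined_time = max(stage_times_ns)    # one instruction completes per clock
    return unpipelined_time / pipelined_time
```

With 1000 instructions on 5 stages, `speedup(1000, 5)` is just under 5. Balanced stages such as `[2, 2, 2, 2, 2]` give a steady-state speedup of 5, while unbalanced stages such as `[1, 2, 4, 2, 1]` give only 2.5, because every stage has to wait for the 4 ns one.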
Hazards: A hazard is a situation preventing the next instruction from continuing in the pipeline. Structural: resource (shared hardware) conflict. Data: needed data not ready. Control: correct action depends on an earlier instruction.
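As one concrete case of a data hazard, the sketch below checks whether a later instruction reads a register that an earlier one writes. The `Instruction` record and the sample instructions are hypothetical. Whether such a dependence actually causes a stall depends on how far apart the instructions are and on forwarding hardware, which this check does not model.

```python
from typing import NamedTuple, Optional, Tuple


class Instruction(NamedTuple):
    text: str
    dest: Optional[str]       # register written, if any
    sources: Tuple[str, ...]  # registers read


def has_data_hazard(earlier, later):
    """True if `later` reads the register that `earlier` writes."""
    return earlier.dest is not None and earlier.dest in later.sources


add = Instruction("ADD X1, X2, X3", dest="X1", sources=("X2", "X3"))
sub = Instruction("SUB X4, X1, X5", dest="X4", sources=("X1", "X5"))
orr = Instruction("ORR X6, X7, X8", dest="X6", sources=("X7", "X8"))
```

Here `has_data_hazard(add, sub)` is true because SUB needs X1 before ADD has written it back, while `has_data_hazard(add, orr)` is false.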
Branch: Unconditional branch in a perfect world: skip inst 3 and 4, no bubble.
Branch Timing: We don’t know the instruction is a branch until ID.
Branch Timing: The branch address is not available until after EX.
Branch Real Timing: Branch destination is calculated at T4, so we can’t start the target instruction until T5. Need to insert a NOP bubble.
Branch Real Timing: If we can forward the address from EX to IF, we can start x at T4.
Branch Real Timing: Branch destination is calculated at T4, but we have already started running instruction 3. Need the ability to ignore a started instruction. Still a bubble: an ignored instruction instead of a NOP.
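The timing described above can be drawn as a text chart. This is a rough sketch under assumptions: a 5-stage IF/ID/EX/MEM/WB pipeline, the branch as the second instruction (fetched at T2, in EX at T4), and the hypothetical labels `inst1`, `B`, `inst3` and `x`. The squashed instruction is fetched and then ignored, shown as `--`.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def timing_chart(instructions, squashed=()):
    """One text row per instruction; instruction i is fetched in cycle T(i+1)."""
    total_cycles = len(instructions) + len(STAGES) - 1
    rows = []
    for i, name in enumerate(instructions):
        cells = []
        for cycle in range(total_cycles):
            stage = cycle - i
            if 0 <= stage < len(STAGES):
                # A squashed instruction is fetched, then ignored afterwards.
                if name in squashed and stage > 0:
                    cells.append("--")
                else:
                    cells.append(STAGES[stage])
            else:
                cells.append("")
        rows.append(f"{name:<8}" + " ".join(f"{c:<3}" for c in cells).rstrip())
    return rows


rows = timing_chart(["inst1", "B", "inst3", "x"], squashed={"inst3"})
chart = "\n".join(rows)
```

Printing `chart` shows `inst3` entering IF at T3 and then being ignored, while the branch target `x` is fetched at T4, one cycle later than it would be with no bubble.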
Conditional Branch: A conditional branch has two possibilities: not taken or taken.
Solving Conditional Branch: Option 1: stall until we know the outcome. Cases: not taken; taken.
Solving Conditional Branch: Option 2: prediction. Cases: predict not taken and is not taken; predict not taken and is taken.
Predicting Taken: Calculating the branch destination in time to use it in the next cycle requires more hardware.
Solving Conditional Branch: Option 2: prediction. Cases: predict taken and is not taken; predict taken and is taken.
Branch Prediction Penalty: In our CPU: predict correct = 0 cycle penalty; predict wrong = 1 cycle penalty. Longer pipeline: no way to decode before the next fetch; bigger penalty for a miss; penalty for any taken branch.
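To see how the penalty feeds into performance, here is a back-of-the-envelope sketch. The formula is the usual average-CPI estimate; the function name and all the numbers in the usage note (branch fraction, miss rate, the 15-cycle penalty) are assumptions chosen for illustration, not measurements.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Average cycles per instruction once misprediction stalls are included."""
    # Only branches can be mispredicted, and only mispredicted ones pay.
    stall_cycles_per_instruction = branch_fraction * mispredict_rate * penalty_cycles
    return base_cpi + stall_cycles_per_instruction


# Assumed workload: 20% branches, 10% of them mispredicted.
short_pipeline = effective_cpi(1.0, 0.20, 0.10, penalty_cycles=1)
long_pipeline = effective_cpi(1.0, 0.20, 0.10, penalty_cycles=15)
```

With a 1-cycle penalty the CPI rises only to about 1.02, but with an assumed 15-cycle penalty the same miss rate pushes it to about 1.3, which is why longer pipelines care so much more about prediction accuracy.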
Static Branch Prediction: Static prediction: hardcoded assumptions. If a branch goes backwards, it is a loop, so assume we take the branch.
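The backward-branch rule fits in one comparison. In this sketch the function name is hypothetical, and predicting forward branches as not taken is an added assumption, since the rule above only covers the backward case.

```python
def predict_static(branch_address, target_address):
    """Return True if the branch is predicted taken."""
    # A target at a lower address means the branch jumps backwards,
    # which usually closes a loop, so predict taken.
    # Assumption: forward branches are predicted not taken.
    return target_address < branch_address


loop_branch_taken = predict_static(branch_address=0x1010, target_address=0x1000)
skip_branch_taken = predict_static(branch_address=0x1000, target_address=0x1010)
```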
Dynamic Branch Prediction: Dynamic prediction: predict based on runtime behavior. More hardware: a branch prediction buffer (aka branch history table), indexed by recent branch instruction addresses, that stores the outcome (taken / not taken). To execute a branch: check the table and expect the same outcome; start fetching from the fall-through or the target; if wrong, flush the pipeline and flip the prediction.
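A minimal software model of such a table, assuming one bit per entry. The class name, the 16-entry size, the indexing by low-order address bits, and the initial "not taken" state are all assumptions for this sketch; real hardware differs in the details.

```python
class OneBitPredictor:
    """Toy branch history table: one taken/not-taken bit per entry."""

    def __init__(self, table_size=16):
        self.table_size = table_size
        self.table = [False] * table_size  # False = predict not taken

    def _index(self, branch_address):
        # Assume 4-byte instructions: drop the two low bits, then use
        # the low-order bits of what remains to pick a table entry.
        return (branch_address >> 2) % self.table_size

    def predict(self, branch_address):
        return self.table[self._index(branch_address)]

    def update(self, branch_address, taken):
        """Record the real outcome; return True if the prediction was right."""
        idx = self._index(branch_address)
        correct = self.table[idx] == taken
        self.table[idx] = taken  # on a miss this flips the stored prediction
        return correct


predictor = OneBitPredictor()
outcomes = [True] * 9 + [False]  # a loop branch: taken 9 times, then falls out
misses = sum(not predictor.update(0x1000, taken) for taken in outcomes)
```

For this single pass through a loop the model misses twice: once on the first iteration (the entry starts at not taken) and once on the exit.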
Prediction: 1-bit history (taken / not taken) may not be optimal. Ex. nested loop: the inner CBZ is mispredicted on the last iteration and again on the next first iteration.
Prediction 2 bit history avoids that issue
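The nested-loop case can be checked by counting misses under both schemes. This sketch makes several assumptions: the inner branch is modeled as taken on every inner iteration except the last, the 2-bit scheme is a saturating counter (a common choice, not necessarily the one pictured), and both predictors start out predicting taken.

```python
def inner_branch_outcomes(outer_iters, inner_iters):
    """Inner-loop branch: taken on every inner iteration except the last."""
    return ([True] * (inner_iters - 1) + [False]) * outer_iters


def count_misses_one_bit(outcomes):
    prediction = True  # assumed initial state: predict taken
    misses = 0
    for taken in outcomes:
        if prediction != taken:
            misses += 1
        prediction = taken  # remember only the most recent outcome
    return misses


def count_misses_two_bit(outcomes):
    counter = 3  # saturating counter 0..3; values 2 and 3 predict taken
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        # Move one step toward the observed outcome, staying within 0..3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


outcomes = inner_branch_outcomes(outer_iters=10, inner_iters=10)
one_bit_misses = count_misses_one_bit(outcomes)
two_bit_misses = count_misses_two_bit(outcomes)
```

Over 10 outer iterations the 1-bit scheme misses 19 times (the loop exit, plus the first iteration of every later pass), while the 2-bit counter misses 10 times (the loop exits only), because a single not-taken outcome is not enough to change its prediction.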
Real Stuff Is it worth it?
Pipelining worth it? Yes… to a point
ARM Pipelines Early ARM Pipeline: ARM v6 pipeline
Modern Pipeline Cortex A53 : ARMv8
Modern Pipeline Cortex A53 : Pipeline stalls basically double CPI
Fun Fact: Why Loads Have +8 in Address. LDR calculates the location as currentLocation + 8 + immediate. Example (hex): 1000 (PC) + 8 + C (decimal 8 + 12) = 1000 + 14 (decimal 20) = 1014. By the time it executes, PC will be 8 greater.
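The same arithmetic as a quick check. The function and constant names are hypothetical; the 8 is the amount by which the PC has advanced by the time the load executes.

```python
PC_READ_AHEAD = 8  # PC is 8 bytes past the LDR by the time it executes


def ldr_target_address(instruction_address, immediate):
    """Address an LDR reads from: currentLocation + 8 + immediate."""
    return instruction_address + PC_READ_AHEAD + immediate


address = ldr_target_address(0x1000, 0xC)
address_hex = hex(address)
```

Here `address_hex` is `'0x1014'`, matching the worked example.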
Intel Pipelines
Intel i7 Branch Performance A few mispredictions can have large impact:
Intel vs AMD Part of Intel's IPC advantage: Branch prediction AMD claiming major advances in new architecture: