2 3.13. Fallacies and Pitfalls
Fallacy: Processors with lower CPIs will always be faster.
Fallacy: Processors with faster clock rates will always be faster.
– A balance must be found: e.g. a sophisticated pipeline may lower CPI but lengthen the clock cycle (CPI ↓, clock cycle ↑).
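To make the balance concrete: CPU time = instruction count × CPI × clock cycle time. Below is a minimal C sketch with purely hypothetical figures (not taken from the text); here the processor with the lower CPI is nevertheless slower, because its clock is disproportionately slower.

    #include <stdio.h>

    /* Hypothetical illustration of CPU time = IC * CPI * cycle time.
     * Processor A has the lower CPI; processor B has the faster clock.
     * Neither attribute alone decides -- the product does.            */
    int main(void) {
        double ic = 1e9;                     /* instructions executed (assumed) */

        double cpi_a = 1.0, clk_a = 500e6;   /* 1.0 CPI at 500 MHz */
        double cpi_b = 1.5, clk_b = 1e9;     /* 1.5 CPI at 1 GHz   */

        double t_a = ic * cpi_a / clk_a;     /* 2.0 s */
        double t_b = ic * cpi_b / clk_b;     /* 1.5 s */

        printf("A: %.2f s, B: %.2f s\n", t_a, t_b);
        return 0;
    }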

3 Fallacies and Pitfalls
Pitfall: Emphasizing an improvement in CPI by increasing the issue rate, while sacrificing clock rate, can decrease performance.
– Again, a question of balance (SuperSPARC vs HP PA 7100)
– Complex interactions between cycle time and organisation

4 Fallacies and Pitfalls
Pitfall: Improving only one aspect of a multiple-issue processor and expecting overall performance improvement.
– Amdahl’s Law!
– Boosting performance in one area may uncover problems in another
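A minimal sketch of Amdahl’s Law, which caps the overall gain from improving only one part of the machine; the 30% fraction and 2× factor below are hypothetical values for illustration.

    #include <stdio.h>

    /* Amdahl's Law: overall speedup from improving a fraction f of
     * execution time by a factor s.                                 */
    static double amdahl(double f, double s) {
        return 1.0 / ((1.0 - f) + f / s);
    }

    int main(void) {
        /* Doubling the speed of a unit used 30% of the time yields
         * only about 1.18x overall, however impressive the 2x looks. */
        printf("speedup = %.2f\n", amdahl(0.30, 2.0));
        return 0;
    }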

5 Fallacies and Pitfalls
Pitfall: Sometimes bigger and dumber is better!
– Alpha 21264: sophisticated multilevel tournament branch predictor
– Alpha 21164: simple two-bit predictor
– The 21164 performs better on a transaction-processing application: its simpler predictor can hold twice as many local branch predictions
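A minimal sketch of the simple two-bit (saturating-counter) scheme referred to above; the table size and PC indexing are illustrative assumptions, not the 21164's actual parameters.

    #include <stdbool.h>
    #include <stdint.h>

    /* Two-bit saturating counters: values 0,1 predict not-taken;
     * 2,3 predict taken.  Table size and indexing are assumptions. */
    #define ENTRIES 2048
    static uint8_t table[ENTRIES];          /* counters start at 0 */

    static bool predict(uint32_t pc) {
        return table[(pc >> 2) % ENTRIES] >= 2;
    }

    static void update(uint32_t pc, bool taken) {
        uint8_t *c = &table[(pc >> 2) % ENTRIES];
        if (taken  && *c < 3) (*c)++;       /* saturate at strongly taken     */
        if (!taken && *c > 0) (*c)--;       /* saturate at strongly not-taken */
    }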

6 Concluding Remarks
Lots of open questions!
– Clock speed vs CPI
– Power issues
– Exploiting parallelism: ILP vs explicit parallelism

7 Characteristics of Modern (2001) Processors (Figure 3.61)
– 3–4-way superscalar
– 4–22 stage pipelines
– Branch prediction
– Register renaming (except UltraSPARC)
– 400 MHz – 1.7 GHz
– 7–130 million transistors

8 Chapter 4 Exploiting ILP with Software

9 4.1. Compiler Techniques for Exposing ILP
Compilers can improve the performance of simple pipelines
– Reduce data hazards
– Reduce control hazards

10 Loop Unrolling
Compiler technique to increase ILP
– Duplicate the loop body
– Decrease the number of iterations
Example:
– Basic code: 10 cycles per iteration
– Scheduled: 6 cycles

    for (int k = 0; k < 1000; k++) {
        x[k] = x[k] + s;
    }

11 Loop Unrolling
Original loop:

    for (int k = 0; k < 1000; k++) {
        x[k] = x[k] + s;
    }

Unrolled four times:

    for (int k = 0; k < 1000; k += 4) {
        x[k]   = x[k]   + s;
        x[k+1] = x[k+1] + s;
        x[k+2] = x[k+2] + s;
        x[k+3] = x[k+3] + s;
    }

Basic code: 7 cycles per “iteration”
Scheduled: 3.5 cycles (no stalls!)
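The unrolled version above relies on the trip count (1000) being a multiple of 4. A hedged sketch of how the same transformation handles an arbitrary trip count n; the function wrapper and names are illustrative, not from the text.

    /* Unroll by 4 with a cleanup (epilogue) loop, so the transformation
     * also works when the trip count n is not a multiple of 4. */
    void add_scalar(double *x, int n, double s) {
        int k = 0;
        for (; k + 3 < n; k += 4) {      /* main unrolled body */
            x[k]   = x[k]   + s;
            x[k+1] = x[k+1] + s;
            x[k+2] = x[k+2] + s;
            x[k+3] = x[k+3] + s;
        }
        for (; k < n; k++)               /* at most 3 leftover iterations */
            x[k] = x[k] + s;
    }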

12 Loop Unrolling
Requires clever compilers
– Analysing data dependences, name dependences and control dependences
Limitations
– Code size
– Diminishing returns: each extra copy amortises less of the loop overhead
– “Register pressure”
– Compiler limitations
Useful for any architecture

13 Superscalar Performance
Two-issue MIPS (one integer + one FP instruction per cycle)
– 2.4 cycles per “iteration”
– Loop unrolled five times

14 4.2. Static Branch Prediction
Useful:
– where behaviour can be predicted at compile time
– to assist dynamic prediction
Architectural support
– Delayed branches

15 Static Branch Prediction
Simple scheme: predict taken
– Average misprediction rate of 34% (SPEC)
– Range: 9% – 59%
Better scheme: predict backward branches taken, forward branches not-taken
– Worse for SPEC!
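A minimal sketch of the backward-taken / forward-not-taken heuristic, assuming the branch and target addresses are known statically; names are illustrative.

    #include <stdbool.h>
    #include <stdint.h>

    /* Backward-taken / forward-not-taken static heuristic:
     * branches that jump backwards (typically loop branches) are
     * predicted taken, forward branches predicted not-taken. */
    static bool predict_taken(uint32_t branch_pc, uint32_t target_pc) {
        return target_pc < branch_pc;   /* backward branch => taken */
    }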

16 Static Branch Prediction
Advanced compiler analysis can do better
Profiling is very useful
– FP misprediction: 9% ± 4%
– Integer misprediction: 15% ± 5%

17 4.3. Static Multiple Issue: VLIW
Compiler groups instructions into “packets”, checking for dependences
– Removes dependences where it can
– Flags those that remain
Simplifies the hardware

18 VLIW
The first machines used a wide instruction with multiple operations per instruction
– Hence Very Long Instruction Word (VLIW)
– 64–128 bits
Alternative: group several instructions into an issue packet

19 VLIW Architectures
Multiple functional units
Compiler selects instructions for each unit to create one long instruction (an issue packet); a sketch of such a packet follows below
Example: five operations per packet
– Integer/branch, 2 × FP, 2 × memory access
Need lots of parallelism
– Use loop unrolling or global scheduling
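An illustrative sketch only: a five-slot issue packet with the slot mix described above. The 32-bit operation encoding and field layout are assumptions, not a real instruction format.

    #include <stdint.h>

    typedef uint32_t op_t;              /* one RISC-style operation (assumed 32 bits) */

    struct vliw_packet {
        op_t int_or_branch;             /* 1 integer ALU or branch op */
        op_t fp[2];                     /* 2 floating-point ops       */
        op_t mem[2];                    /* 2 loads/stores             */
    };  /* 5 slots x 32 bits = 160-bit word with this assumed encoding */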

20 Example
Loop unrolled seven times!
– 1.29 cycles per result
– 60% of the available instruction “slots” filled

    for (int k = 0; k < 1000; k++) {
        x[k] = x[k] + s;
    }

21 Summary of Improvements (cycles per result)

    Technique            Unscheduled   Scheduled
    Basic code               10            6
    Loop unrolled (×4)        7            3.5
    Superscalar (×5)          –            2.4
    VLIW (×7)                 –            1.29

22 Drawbacks of Original VLIWs
Large code size
– Need to use loop unrolling
– Wasted space for unused slots
– Mitigated by clever encoding techniques and compression
Lock-step execution
– A stall in one unit stalls them all
Lack of binary code compatibility
– Variations in structure required recompilation

23 4.4. Compiler Support for Exploiting ILP
We will not cover this section in detail
Loop unrolling
– Loop-carried dependences
Software pipelining
– Interleave instructions from different iterations (see the sketch below)
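A source-level sketch of software pipelining for the running loop x[k] = x[k] + s: each iteration of the new loop mixes the store from iteration k-2, the add from iteration k-1 and the load from iteration k. The function wrapper and variable names are illustrative; it assumes n >= 2.

    void add_scalar_pipelined(double *x, int n, double s) {
        double t_load = x[0];        /* prologue: load for iteration 0 */
        double t_add  = t_load + s;  /* prologue: add  for iteration 0 */
        t_load = x[1];               /* prologue: load for iteration 1 */

        for (int k = 2; k < n; k++) {
            x[k - 2] = t_add;        /* store for iteration k-2 */
            t_add    = t_load + s;   /* add   for iteration k-1 */
            t_load   = x[k];         /* load  for iteration k   */
        }
        x[n - 2] = t_add;            /* epilogue */
        x[n - 1] = t_load + s;
    }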

24 4.5. Hardware Support for Extracting More Parallelism
Techniques like loop unrolling work well when branch behaviour can be predicted at compile time
If not, we need more advanced techniques:
– Conditional instructions
– Hardware support for compiler speculation

25 Conditional or Predicated Instructions
Instructions have an associated condition
– If the condition is true, execution proceeds normally
– If not, the instruction becomes a no-op
Removes control hazards
Example (a is in %r8):

    if (a == 0) b = c;

Branch version:

        bnez  %r8, L1        ; branch past the move if a != 0
        nop                  ; branch delay slot
        mov   %r1, %r2       ; b = c
    L1: ...

Conditional-move version:

        cmovz %r8, %r1, %r2  ; the move happens only if %r8 (a) is zero
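At the source level the same if-conversion can be sketched as a branch-free select; this is an illustration, not the text's example, and the function wrapper is assumed.

    /* Branch-free form of  if (a == 0) b = c;  -- a compiler targeting a
     * machine with conditional moves can turn this select into a single
     * cmovz-style instruction instead of a branch. */
    int select_move(int a, int b, int c) {
        return (a == 0) ? c : b;     /* new value of b */
    }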

26 Conditional Instructions
Control hazards are effectively replaced by data hazards
Can be used for speculation
– Compiler reorders instructions depending on the likely outcome of branches

27 Limitations on Conditional Instructions
Annulled instructions still take execution time
– But they may occupy otherwise-stalled slots
Most useful when the condition can be evaluated early
Limited usefulness for complex conditions
May be slower than unconditional operations

28 Conditional Instructions in Practice

    Machine              Conditional instructions
    MIPS, Alpha, SPARC   Conditional move only
    HP PA                Any register–register instruction can annul the following instruction
    IA-64                Full predication
