
Slide 1: Is There Anything More to Learn about High Performance Processors?
J. E. Smith, June 2003

Slide 2: Underlying Issues
- Power
- Wire delays
- Many available transistors
- Applications: web, databases, entertainment, office, embedded

Slide 3: The State of the Art
- Multiple instructions per cycle
- Out-of-order issue
- Register renaming
- Deep pipelining
- Branch prediction
- Speculative execution
- Cache memories
- Multi-threading

Slide 4: History Quiz
Superscalar processing was invented by:
a) Intel in 1993
b) RISC designers in the late '80s, early '90s
c) IBM ACS in the late '60s; Tjaden and Flynn, 1970

Slide 5: History Quiz
Out-of-order issue was invented by:
a) Intel in 1993
b) RISC designers in the late '80s, early '90s
c) Thornton/Cray in the 6600, 1963

Slide 6: History Quiz
Register renaming was invented by:
a) Intel in 1995
b) RISC designers in the late '80s, early '90s
c) Tomasulo in the late '60s; also Tjaden and Flynn, 1970
What Keller said in 1975: …

Slide 7: History Quiz
Deep pipelining was invented by:
a) Intel in 2001
b) RISC designers in the late '80s, early '90s
c) Seymour Cray:
   …: … gates/stage (?)
   1976: Cray-1, 8 gates/stage
   1985: Cray-2, 4 gates/stage
   1991: Cray-3, 6 gates/stage (?)

Slide 8: History Quiz
Branch prediction was invented by:
a) Intel in 1995
b) RISC designers in the late '80s, early '90s
c) Stretch, 1959 (static); Livermore S-1 (?), 1979, or earlier at IBM (?)

Slide 9: History Quiz
Speculative execution was invented by:
a) Intel in 1995
b) RISC designers in the late '80s, early '90s
c) CDC 180/990 (?) in 1983

Slide 10: History Quiz
Cache memories were invented by:
a) Intel in 1985
b) RISC designers in the late '80s, early '90s
c) Maurice Wilkes in 1965

Slide 11: History Quiz
Multi-threading was invented by:
a) Intel in 2001
b) RISC designers in the '80s
c) Seymour Cray in 1964

Slide 12: Summary
- Multiple instructions per cycle
- Out-of-order issue
- Register renaming
- Deep pipelining
- Branch prediction
- Speculative execution
- Cache memories
- Multi-threading
All were done as part of a development project and immediately put into practice. After introduction, only a few remained in common use.

Slide 13: The 1970s & 80s – Less Complexity
- The level of integration wouldn't support it: not because of transistor counts, but because of small replaceable units
- Cray went toward simple issue and deep pipelining
- Microprocessor development first used high complexity, then drove pipelines deeper
- Limits to wide issue
- Limits to deep pipelining

Slide 14: Typical Superscalar Performance
- Your basic superscalar processor: 4-way issue, 32-entry window, 16K I-cache and D-cache, 8K gshare branch predictor
- Wide performance range
- Performance is typically much less than the peak of 4 IPC

Slide 15: Superscalar Processor Performance
- Compare 4-way issue, 32-entry window with ideal versus non-ideal I-cache, D-cache, and branch predictor
- Peak performance would be achievable IF it weren't for "bad" events: I-cache misses, D-cache misses, branch mispredictions

Slide 16: Performance Model
- Consider the profile of dynamic instructions issued per cycle: a background of near-peak ("issue-width") IPC, interrupted by a never-ending series of transient events
- Approach: determine performance with ideal caches and predictors, then account for the "bad" transient events
[Figure: IPC versus time, with dips at branch mispredicts, I-cache misses, and long D-cache misses]
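
A minimal sketch of this style of model in Python: overall CPI is the ideal-machine CPI plus the cycles lost to each class of transient event. The event rates and penalties below are illustrative assumptions, not numbers from the talk.

```python
# Minimal interval-style performance model (illustrative sketch).
# Overall CPI = ideal CPI + per-event penalties weighted by event rates.
# All rates and penalties below are assumed, not measured values.

def effective_ipc(ideal_ipc, events):
    """events: list of (events_per_instruction, penalty_cycles) pairs."""
    cpi = 1.0 / ideal_ipc                  # background (ideal) CPI
    for rate, penalty in events:
        cpi += rate * penalty              # add transient-event cycles per instruction
    return 1.0 / cpi

if __name__ == "__main__":
    ipc = effective_ipc(
        ideal_ipc=4.0,                     # 4-wide issue at peak
        events=[
            (1 / 100, 9.2),                # branch mispredicts: 1 per 100 insts, ~9.2 cycles each
            (1 / 200, 12.0),               # I-cache misses (assumed rate and penalty)
            (1 / 150, 200.0),              # long D-cache misses (assumed rate and penalty)
        ],
    )
    print(f"effective IPC ≈ {ipc:.2f}")
```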

Slide 17: Backend: Ideal Conditions
- Key result (Michaud, Seznec, Jourdan): a square-root relationship between issue rate (IR) and window size (W)
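
Informally, the result says the sustainable issue rate grows only as the square root of the window size, roughly IR ≈ k·√W, with k depending on the program. A small illustrative sketch (the constant k is assumed):

```python
import math

# Square-root issue-rate model: IR ≈ k * sqrt(W).
# k is program-dependent; the value below is purely illustrative.
K = 0.7

for window in (16, 32, 64, 128, 256):
    print(f"window {window:4d} -> sustainable issue rate ≈ {K * math.sqrt(window):.1f}")
# Note the diminishing returns: quadrupling the window only doubles the issue rate.
```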

Slide 18: Branch Misprediction Penalty
1) Lost opportunity: performance lost by issuing soon-to-be-flushed instructions
2) Pipeline re-fill penalty: the obvious penalty; most people equate this with the whole penalty
3) Window-fill penalty: performance lost due to window startup

Slide 19: Calculating the Mispredict Penalty
- 8.5 insts / 4 = 2.1 cp
- 9 insts / 4 = 2.2 cp
- … insts / 4 = 4.9 cp
- Total penalty = 9.2 cp
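
The arithmetic, reconstructed as a small script. The three lines plausibly correspond, in order, to the lost-opportunity, re-fill, and window-fill components from the previous slide; the instruction count for the third component is not preserved in the transcript, so its cycle value is used directly.

```python
# Per-component mispredict penalty for a 4-wide machine:
# each component is (instructions of lost work) / (issue width), in clock periods (cp).
ISSUE_WIDTH = 4

component_1 = 8.5 / ISSUE_WIDTH   # ≈ 2.1 cp (plausibly the lost-opportunity component)
component_2 = 9.0 / ISSUE_WIDTH   # ≈ 2.2 cp (plausibly the pipeline re-fill component)
component_3 = 4.9                 # cp, taken directly from the slide (window fill)

total = component_1 + component_2 + component_3
print(f"total mispredict penalty ≈ {total:.1f} cp")   # ≈ 9.2 cp
```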

Slide 20: Importance of Branch Prediction

Slide 21: Importance of Branch Prediction
- Doubling the issue width means the predictor has to be four times better for a similar performance profile
- This assumes everything else (I-caches and D-caches) is ideal
- Research state of the art: about 5 percent mispredicts on average (perceptron predictor), i.e., roughly one misprediction per 100 instructions
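
One back-of-the-envelope way to see the "four times better" claim, using the square-root model from slide 17 (a sketch, not the talk's derivation): doubling the sustained issue rate requires roughly four times the effective window, so the run of correctly predicted instructions between mispredicts must also grow about four-fold.

```python
# IR ≈ k * sqrt(W)  =>  W ≈ (IR / k)**2.
# The constant k is illustrative; only the ratio between the two cases matters.
K = 0.7

for ir in (2.0, 4.0):
    window = (ir / K) ** 2
    print(f"sustained issue rate {ir:.0f} needs an effective window of ≈ {window:.0f} instructions")
# Doubling the issue rate quadruples the window that must be kept usefully full,
# i.e. mispredictions must become ~4x rarer for a similar performance profile.
```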

Slide 22: Next-Generation Branch Prediction
- A classic memory/computation tradeoff
- Conventional branch predictors: heavy on memory, light on computation
- Perceptron predictor: adds heavier computation, but also adds latency to the prediction
- Future predictors should balance memory, computation, and prediction latency

Slide 23: Implications of Deeper Pipelines
- Assume 1 misprediction per 96 instructions
- Vary the fetch/decode/rename section of the pipe
- The advantage of wide issue diminishes as the pipe deepens
- This ignores implementation complexity
- The graph also ignores longer execution latencies
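
A rough illustration of the trend (a simplified model, not the talk's data): charge each mispredict a re-fill penalty equal to the front-end depth and compare a 4-wide and an 8-wide machine. The sustained issue IPCs are assumed values.

```python
# Per mispredict interval of N instructions:
#   cycles ≈ N / sustained_ipc + front_end_depth   (re-fill penalty only)
INTERVAL = 96  # instructions between mispredicts (from the slide)

def effective_ipc(sustained_ipc, front_end_depth):
    return INTERVAL / (INTERVAL / sustained_ipc + front_end_depth)

for depth in (5, 10, 20, 30):
    ipc4 = effective_ipc(4.0, depth)
    ipc8 = effective_ipc(8.0, depth)
    print(f"front end {depth:2d} stages: 4-wide ≈ {ipc4:.2f} IPC, 8-wide ≈ {ipc8:.2f} IPC")
# The gap between the 4-wide and 8-wide machines narrows as the front end deepens.
```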

Slide 24: Deep Pipelining: the Optimality of Eight
- Hrishikesh et al.: 8 FO4s per stage
- Kunkel and me: 8 gates per stage
- Cray-1: 8 levels of 4/5-input NANDs
- We're getting there!

Slide 25: Deep Pipelining
- Consider time per instruction (TPI) versus pipeline depth (Hartstein and Puzak)
- The curve is very flat near the optimum (the graph labels one side of the optimum "good engineering" and the other "good sales")
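
A generic time-per-instruction model in the spirit of Hartstein and Puzak (not their exact equations, and all parameter values are assumed): per-stage time shrinks as 1/p while hazard penalties grow with p, giving a shallow optimum with a flat curve around it.

```python
# Generic TPI-vs-pipeline-depth model (illustrative; not Hartstein & Puzak's
# exact formulation).  All parameter values are assumptions, not from the talk.
#   cycle time (FO4)     ≈ T_LOGIC / p + T_LATCH
#   CPI                  ≈ CPI_BASE + DISRUPTIONS_PER_INST * p
#                           (each disruption costs roughly one pipeline refill, ~p cycles)
#   time per instruction = cycle time * CPI
T_LOGIC = 250.0              # total logic depth per instruction, in FO4 delays
T_LATCH = 3.0                # latch and clock-skew overhead per stage, in FO4 delays
CPI_BASE = 0.25              # base CPI of a 4-wide machine with no disruptions
DISRUPTIONS_PER_INST = 0.04  # mispredicts plus misses per instruction

def tpi(p):
    cycle = T_LOGIC / p + T_LATCH
    cpi = CPI_BASE + DISRUPTIONS_PER_INST * p
    return cycle * cpi

best = min(range(2, 80), key=tpi)
for p in sorted({4, 8, 16, best, 32, 64}):
    print(f"{p:2d} stages ({T_LOGIC / p + T_LATCH:5.1f} FO4/stage): TPI ≈ {tpi(p):5.1f} FO4")
print(f"optimum near {best} stages, and the curve is flat around it")
```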

Slide 26: Transistor Radios and High MHz
- A lesson from transistor radios…
- A wonderful new technology in the late '50s
- Clearly, the more transistors, the better the radio! An easy way to improve sales: 6 transistors, 8 transistors, 14 transistors… even using transistors as diodes
- Lesson: eventually, people caught on

Slide 27: The Optimality of Eight
- 8 transistors!

Slide 28: So, Processors Are Dead for Research?
- Of course not
- BUT IPC-oriented research may be on life support

Slide 29: Consider Car Engine Development
- Conclusion: we should be driving cars with 48 cylinders!
- Don't focus (obsess) on one aspect of performance
- And don't focus only on performance: also power efficiency, reliability, security, design complexity

Slide 30: Co-Designed VMs
- Move the hardware/software boundary
- Give the "hardware" designer some software, held in concealed memory
- Hardware does what it does best: speed
- Software does what it does best: managing complexity
[Diagram labels: Operating System, Application Program, VMM, Profiling HW, Configuration HW, Visible Memory, Concealed Memory, Hardware, Data Tables]

Slide 31: Co-Designed VMs: a Micro-OS
- Manage the processor with micro-OS (VMM) software
- Manage processor resources in an integrated way
- Identify program phase changes
- Save/restore implementation contexts
- A microprocessor-controlled microprocessor
- Configurable resources: I-cache size, simultaneous multithreading, branch-predictor global history length, instruction window, D-cache size, D-cache prefetch algorithm, reorder buffer, pipeline

Slide 32: Co-Designed VMs
- Other applications:
  - Binary translation (e.g., Transmeta), which enables new ISAs
  - Security (Dynamo/RIO)

Slide 33: Speculative Multi-threading
- Reasons for skepticism:
  - Complex
  - Incompatible with deep pipelining
  - The devil will be in the details (a researcher sees 4 instruction types; a designer sees hundreds)
  - High power consumption
  - Performance advantages tend to be focused on specific programs (benchmarks)
- Better to push ahead with the real thread

Slide 34: The Memory Wall: D-Cache Misses
- Divide misses into:
  - Short misses: handle like a long-latency functional unit
  - Long misses: need special treatment
- Things that can reduce performance:
  1) Structural hazards: the ROB fills up behind the load and dispatch stalls, or the window fills with instructions dependent on the load and issue stops
  2) Control dependences: a mispredicted branch depends on the load data, so instructions beyond the branch are wasted

Slide 35: Structural and Data Blockages
- Experiment: window size 32, issue width 4, ideal branch prediction, cache miss delay of 1000 cycles, separate window and ROB of 4K entries each
- Simulate a single cache miss and see what happens

Slide 36: Results
- Issue continues at full speed
- Typically about 30 dependent instructions
- The dependent instructions usually follow the load closely
[Table: per-benchmark averages of instructions issued after the miss and instructions in the window dependent on the load, for Bzip, Crafty, Eon, Gap, Gcc, Mcf, Gzip, Parser, Perl, Twolf, Vortex, and Vpr; the numeric values were not preserved]

Slide 37: Control Dependences
- Non-ideal branch prediction
- How many cache misses lead to a branch mispredict, and when?
- Use an 8K gshare predictor

Slide 38: Results
- Bimodal behavior: for some programs, branch mispredictions are crucial
- In many cases, 30-40% of cache-miss data leads to a mispredicted branch
- This inhibits the ability to overlap data cache misses
- One more reason to worry about branch prediction
[Table: per-benchmark fraction of loads driving a mispredict and number of instructions before the mispredict, for Bzip, Crafty, Eon, Gap, Gcc, Mcf, Gzip, Parser, Perl, Twolf, Vortex, and Vpr; the numeric values were not preserved]

Slide 39: Dealing with the Memory Wall
- Don't speculate about it: run through it
- The ROB grows as nD, for issue width n and miss delay D cycles; e.g., a miss delay of 200 cycles on a four-issue machine implies a ROB of about 800 entries
- The window grows as dm, for m outstanding misses with d dependent instructions each; e.g., with 6 outstanding misses and 30 dependent instructions each, the window should be enlarged by 180 slots
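
The sizing rules from this slide, as a tiny script (the example values are the slide's own):

```python
# "Run through" long misses: size the ROB and window to cover them.
def rob_entries(issue_width, miss_delay_cycles):
    # ROB grows as n * D: it must hold everything issued during the miss.
    return issue_width * miss_delay_cycles

def extra_window_slots(outstanding_misses, dependents_per_miss):
    # Window grows as d * m: slots occupied by instructions waiting on miss data.
    return outstanding_misses * dependents_per_miss

print(rob_entries(4, 200))          # -> 800 entries (slide example)
print(extra_window_slots(6, 30))    # -> 180 extra slots (slide example)
```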

Slide 40: Future High-Performance Processors
- Fast clock cycle: 8 gates per stage
- Less speculation: deciding what to take out is more important than finding new things to put in
- Return to simplicity: leave the baroque era behind
- ILP less important

Slide 41: Research in the Deep-Pipeline Domain
- When there are 40 gate levels, we can be sloppy about adding gadgets
- When there are 8 gate levels, a gadget requiring even one more level slows the clock by 12.5%
[Diagram: a "neat gadget" inserted between the logic and the latch of a pipeline stage]
- To really evaluate the performance impact of adding a gadget, we need a detailed logic design
- Future research should focus on jettisoning gadgets, not adding them
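
The arithmetic behind the 12.5% figure, with the 40-level case for contrast:

```python
# Cycle-time cost of one extra gate level at different pipeline design points.
for levels in (40, 8):
    slowdown = 1 / levels          # one more level on top of `levels` gate delays
    print(f"{levels} gate levels/stage: +1 level slows the clock by {slowdown:.1%}")
# 40 levels: +2.5% (easy to ignore); 8 levels: +12.5% (painful).
```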

Slide 42: Conclusion: Important Research Areas
- Processor simplicity
- Power efficiency
- Security
- Reliability
- Reduced design times
- Systems (on a chip) balancing threads and on-chip RAM
- Many very simple processors on a chip: look at the architecture of the Denelcor HEP…

Slide 43: Attack of the Killer Game Chips
- OR: the most important thing I learned at Cray Research
- OR: what happened to SSTs?
- It isn't enough that we can build them; it isn't enough that there are interested customers. Volume rules!
- News excerpt: Researchers have made a supercomputer - powerful enough to rival the top systems in the world - out of PlayStation 2 components. A US research centre has clustered 70 Sony PlayStation 2 game consoles into a Linux supercomputer that ranks among the 500 most powerful in the world. According to the New York Times, the National Centre for Supercomputing Applications (NCSA) at the University of Illinois assembled the $50,000 (£30,000) machine out of components bought in retail shops. In all, 100 PlayStation 2 consoles were bought, but only 70 have been used for this project.

Slide 44: Acknowledgements
- Performance model: Tejas Karkhanis
- Funding: NSF, SRC, IBM, Intel
- Japanese transistor radio: Radiophile.com

