EECE476 Lecture 28: Simultaneous Multithreading (aka HyperThreading) (ISCA’96 research paper “Exploiting Choice…Simultaneous Multithreading Processor”)


1 EECE476 Lecture 28: Simultaneous Multithreading (aka HyperThreading) (ISCA’96 research paper “Exploiting Choice…Simultaneous Multithreading Processor” by Tullsen, Eggers, Emer, Levy, Lo, Stamm) The University of British Columbia, EECE 476, © 2005 Guy Lemieux

2 The Speed Limit?
Weiss and Smith [1984]: 1.58 (≈1.6 instr/cycle)
Sohi and Vajapeyam [1987]: 1.81
Tjaden and Flynn [1970]: 1.86
Tjaden and Flynn [1973]: 1.96
Uht [1986]: 2.00
Smith et al. [1989]: 2.00
Jouppi and Wall [1988]: 2.40
Johnson [1991]: 2.50
Acosta et al. [1986]: 2.79
Wedig [1982]: 3.00
Butler et al. [1991]: 5.8
Melvin and Patt [1991]: 6
Wall [1991]: 7
Kuck et al. [1972]: 8
Riseman and Foster [1972]: 51
Nicolau and Fisher [1984]: 90 (90 instrs in parallel!!!)

3 Barrier to Performance
Wide Superscalar
– Many Functional Units (ALUs, Ld/St Units, etc.)
– Lots of “potential” performance if all are busy
– Fact: often idle!!!
– Idle FUs → actual performance << potential performance
Cause of idle FUs?
– Waiting for memory results
– Out-of-order: fetch more instructions while waiting
– Limits on ILP → only 2 or 3 instructions available
– Reach more load/store instructions → miss-under-miss → wait longer → more idle FUs
How to extract parallelism?
– Can try to explicitly write a parallel program
– Most languages are inherently sequential
– Humans break down complex tasks sequentially
– Difficult to write a “parallel program” to make parallelism explicit

4 More Parallelism? Multithreading
Key observation
– Hard to get parallelism out of 1 program
– Latency: execution time of 1 program
– Difficult to improve latency! Doomed! Give up!
Concorde vs Boeing 747?
– Concorde: 2170 km/h → NYC to London in 2.6 hrs
– 747: 980 km/h → NYC to London in 5.7 hrs
– Concorde is faster, has lower latency
BUT
– 747 carries 3.5 times more people
– 747 throughput is higher: 1.6 times more people*km/hr
– Airlines prefer the 747 to the Concorde
Multithreading: carry more programs, improve throughput!
– Compute centres prefer CPUs with higher throughput

5 Which has Greater Performance?

Airplane            Capacity (ppl)   Range (km)   Speed (km/h)   Throughput (ppl*km/h)
Boeing 777          375              7 450        980            367 500
Boeing 747          470              6 680        980            460 600
BAC/Sud Concorde    132              6 440        2 170          286 440
Douglas DC-8-50     146              14 030       875            127 750
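The throughput column is simply passenger capacity times cruise speed. A quick sketch, using the figures quoted on the slide, reproduces the comparison:

```python
# Throughput (ppl*km/h) = passenger capacity * cruise speed.
# Figures are those quoted on the slide; range is not needed here.
aircraft = {
    "Boeing 777": (375, 980),
    "Boeing 747": (470, 980),
    "BAC/Sud Concorde": (132, 2170),
    "Douglas DC-8-50": (146, 875),
}

for name, (capacity, speed) in aircraft.items():
    print(f"{name}: {capacity * speed} ppl*km/h")
```

Note that the 747 wins on throughput (460 600) even though the Concorde is more than twice as fast, which is exactly the latency-versus-throughput trade-off the lecture is building toward.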

6 Multithreading: Basic Idea
Execute program 1
– FUs busy
– Cache miss
Switch to program 2
– FUs busy
– Cache miss
Switch to program 3
– Etc …
Switch to program 1
– Cache should now have data from memory

7 Multithreading: Messy Details
Multithreading for a max of 4 programs
– 4 programs → 4 Program Counters and 4 Register Files
– Share Data cache? Programs are competing with each other; Program 1 may evict data necessary for others
– Multiple Data caches? Each one smaller. If only running 1 program, can’t use the whole cache. If 3 programs don’t need much cache, Program 1 can’t use the “unused” part
– Instruction cache: shared or multiple?
– Share TLB? Bigger TLB? Must prevent program 1 from accessing data in programs 2, 3, 4!!!
– How to switch between 4 programs fairly and quickly? Ensure no single program hogs or starves the CPU

8 Multithreading Limits
Executing 1 program at a time
– Maybe switch programs after 20 to 100 instructions
Still extracting parallelism from 1 program at a time
– Many FUs still idle
Need more parallelism!

9 Simultaneous MultiThreading (SMT)
Switch between programs more frequently
– Switch after 1 to 20 instructions!
Wide superscalar has many FUs (e.g., 10)
– Dispatch 10 instructions every clock cycle
Simultaneous Multithreading
– Allow issue of instructions from multiple programs every clock cycle. This is the key difference from regular multithreading!
– Find small amounts of parallelism from multiple programs
– Combine their parallelism: keeps FUs very busy!
– Still many messy details…
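A toy sketch of that key difference (all numbers invented for illustration): if each program only exposes 2 or 3 independent instructions per cycle, a single thread can never fill a 10-wide machine, but pooling the ready instructions of 4 threads nearly can.

```python
import random

random.seed(1)
ISSUE_WIDTH = 10   # issue slots (FUs) available per cycle
N_THREADS = 4
CYCLES = 1000

def ready_instrs():
    """Toy model: one program exposes only 2-3 independent
    (ready) instructions per cycle, limited by its own ILP."""
    return random.randint(2, 3)

# Conventional multithreading: only one thread issues per cycle,
# so issue is capped by that single thread's ILP.
single = sum(min(ISSUE_WIDTH, ready_instrs()) for _ in range(CYCLES))

# SMT: pool the ready instructions of all threads each cycle.
smt = sum(min(ISSUE_WIDTH, sum(ready_instrs() for _ in range(N_THREADS)))
          for _ in range(CYCLES))

print(f"single-thread issue ~ {single / CYCLES:.1f} instr/cycle")
print(f"SMT issue           ~ {smt / CYCLES:.1f} instr/cycle")
```

The single-thread case hovers around 2.5 instructions per cycle while the SMT case keeps the issue slots nearly full, which is the whole point of the technique.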

10 “Exploiting Choice…” Paper
Earlier version published at ISCA 1995
– Mostly “hypothetical” (idealistic), many assumptions
– ISCA = International Symposium on Computer Architecture. This is The Conference on Computer Architecture: best of the best!
This version published at ISCA 1996
– Many improvements over the 1995 paper: adds realism!
– Key question: is it practical to build a real CPU with SMT?
– Key contribution: yes, how to, and why you want to!

11 SMT Study: Baseline Superscalar CPU
Lots of caches
– 32KB Instr cache, 32KB Data cache
– 256KB L2 combined I+D cache
– 2MB L3 cache
– Lockup-free
Fetch (up to) 8 instructions per cycle
Branch Prediction + Branch Target Buffer + Subroutine Return Stacks
– 256 entries in BTB, 2K x 2 bits for Branch Prediction Buffer
Register Renaming
– 32 regs + 100 more for renaming
Two Instruction Queues: 1 Floating-Point, 1 Integer
– 32 entries in each queue
– 3 Floating-Point Units
– 6 Integer Units: 4 can also do Loads/Stores

12 Baseline CPU + Multithreading PCs

13 a) Baseline CPU Pipeline, b) Multithreaded Pipeline

14 CPU Performance on 1 Program
Use Multiflow compiler (good choice!)
Baseline pipeline (does not support multithreading)
– a.k.a. “unmodified superscalar”
– 1/CPI = 2.16 instructions per cycle
Longer pipeline (required to support multithreading)
– 1/CPI = 2.11 instructions per cycle
Not much harm due to pipeline changes
– Longer pipeline has a larger misprediction penalty
– Misprediction is rare, so not much increase in CPI

15 CPU Performance on >1 Program
Modify: extra PCs, extra registers, and a longer pipeline for SMT

16 Improved Parallelism!
Throughput dramatically improves
Can execute extra programs “almost for free”
– Some slowdown to the first program, but better throughput!
Example
– One program: CPI = 1/2.11 = 0.473
– Total execution time of 6 programs run sequentially: proportional to 6 / 2.11 = 2.84
– Six programs SMT: CPI = 1/4 = 0.25
– Total execution time of 6 programs run simultaneously: proportional to 6 / 4 = 1.5
– Speedup with SMT = 2.84 / 1.5 = 1.9
– Nearly twice the throughput!!!
Consider SMT of 6 nearly-equal programs
– All programs start & finish around the same time
– Latency of each program = total runtime of all 6 programs
– Without SMT, latency of each program = 1/6 of the total runtime of all 6 programs
– If you need “partial results early”, lower latency is better (don’t use SMT)
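The arithmetic above is worth stepping through once. Using the slide’s figures (IPC of 2.11 for one program on the longer pipeline, IPC of 4 with six programs under SMT):

```python
# Reproducing the slide's throughput arithmetic.
ipc_one = 2.11      # longer (SMT-capable) pipeline, one program at a time
ipc_smt = 4.0       # six programs running simultaneously
n_programs = 6

time_sequential = n_programs / ipc_one   # proportional to 2.84
time_smt = n_programs / ipc_smt          # proportional to 1.5

speedup = time_sequential / time_smt
print(f"SMT throughput speedup = {speedup:.1f}")  # 1.9
```

Note the speedup reduces to ipc_smt / ipc_one = 4.0 / 2.11, since the number of programs cancels out.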

17 ISCA’96 Paper Continues…
Best way to fill Instruction Queues
Best way to keep FUs busy
Best way to fetch instructions from the cache
Unmodified SMT speedup = 1.9
Combining the “best” techniques, speedup = 2.5
SMT is great for throughput!

18 Fetching Instructions from Cache
RR.1.8 is the “baseline”; RR.2.8 is “best”
RR.1.8 fetches up to 8 instrs. from 1 program every cycle
RR is “round robin”: next cycle always fetches instructions from the next program (program i+1)

19 Choosing which Thread to Fetch (versus Round-Robin)
RR.1.8 is the “baseline”; ICOUNT.2.8 is “best”
ICOUNT: fetch from the programs with the fewest # of instructions in the Instruction Queue (IQ)
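The ICOUNT heuristic itself is tiny. A minimal sketch, with invented queue occupancies (the function name and numbers are illustrative, not from the paper):

```python
def icount_pick(iq_occupancy):
    """ICOUNT heuristic: pick the thread with the fewest
    instructions already sitting in the instruction queues.
    A thread with a nearly empty IQ is draining instructions
    quickly, so giving it the fetch slot keeps the FUs fed."""
    return min(range(len(iq_occupancy)), key=lambda t: iq_occupancy[t])

# Invented example: thread 2 has the emptiest queue, so it is
# fetched from next; round-robin would have ignored this entirely.
occupancy = [12, 9, 3, 7]   # instructions in IQ per thread
print(icount_pick(occupancy))  # 2
```

The intuition is that queue occupancy is a cheap, indirect measure of how much forward progress each thread is making.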

20 Improving Fetch Efficiency (Longer IQ, Avoid Icache Misses)
BIGQ: A bigger IQ doesn’t help much.
ITAG: Check the Instr cache 1 cycle ahead; on a cache miss, don’t try to fetch instructions for that program. Helps sometimes.

