Caltech CS184 Spring2003 -- DeHon 1 CS184b: Computer Architecture (Abstractions and Optimizations) Day 11: April 30, 2003 Multithreading.

1 CS184b: Computer Architecture (Abstractions and Optimizations)
Day 11: April 30, 2003
Multithreading

2 Today
Multitasking/Multithreading model
Fine-Grained Multithreading
SMT (Simultaneous Multithreading)

3 Problem
Long latency of operations
–IO or page-fault
–Non-local memory fetch: main memory, L3, remote node in distributed memory
–Long latency operations (mpy, fp)
Wastes processor cycles while stalled
If processor stalls on return
–Latency problem turns into a throughput (utilization) problem
–CPU sits idle

4 Idea
Run something else useful while stalled
In particular, another process/thread
–Another PC
Use parallelism to “tolerate” latency
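A rough way to quantify the idea, under a simple assumed model that is not from the slides: each thread runs for R cycles, then stalls for L cycles, and switching between threads costs nothing. The symbols R, L, N and U are introduced here for illustration only. With N such threads sharing the processor, utilization U is approximately:

```latex
U(1) = \frac{R}{R + L},
\qquad
U(N) = \min\!\left(1,\ \frac{N R}{R + L}\right),
\qquad
U(N) = 1 \iff N \ge 1 + \frac{L}{R}
```

For example, with R = 1 and L = 7 a single thread keeps the processor busy 1/8 of the time, and about 8 threads are needed to hide the stall entirely.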

5 Old Idea
Share expensive machine among multiple users (jobs)
When one user task must wait on IO
–Run another one
Time multiplex machine among users

6 Mandatory Concurrency
Some tasks must be run “concurrently” (interleaved) with user tasks
–DRAM Refresh
–IO: keyboard, network, …
–Window system (xclock…)
–Autosave
–Clippy

7 Other Useful Concurrency
Print spooler
Web browser
–Download images in parallel
Instant Messenger/Zephyr (Gale)
biff/xbiff/xfaces…

8 Multitasking
Single machine runs multiple tasks
Machine provides same ISA/sequential semantics to each task
–Task believes it owns the machine
–Same as if other tasks were running on different machines
Tasks isolated from one another
–Cannot affect each other functionally
–(may impact each other’s performance)

9 Each task/process
Process: virtualization of the CPU
–Has own unique set of state: PC, registers, VM page table (hence memory image)

10 Sharing the CPU
Save/Restore
–PC/Registers/Page Table
Virtual Memory Isolation
Privileged system software
–User/System mode execution
Functionally, task does not notice that it gave up the CPU for a period of time
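The save/restore step can be sketched as a toy model. Everything named here (`Context`, `ToyCPU`, `context_switch`, the register count) is hypothetical and stands in for what privileged system software and hardware would actually do. It only illustrates why the task cannot tell that it lost the CPU.

```python
from dataclasses import dataclass, field


@dataclass
class Context:
    """Saved per-task state named on the slide: PC, registers, page table."""
    pc: int = 0
    registers: list = field(default_factory=lambda: [0] * 4)
    page_table: dict = field(default_factory=dict)


class ToyCPU:
    """Hypothetical single-CPU model; not a real OS or hardware interface."""

    def __init__(self):
        self.pc = 0
        self.registers = [0] * 4
        self.page_table = {}

    def save(self) -> Context:
        # Copy so later CPU activity cannot alter the saved image.
        return Context(self.pc, list(self.registers), dict(self.page_table))

    def restore(self, ctx: Context) -> None:
        self.pc = ctx.pc
        self.registers = list(ctx.registers)
        self.page_table = dict(ctx.page_table)


def context_switch(cpu, saved, current, nxt):
    """Save the state of task `current`, then load the state of task `nxt`."""
    saved[current] = cpu.save()
    cpu.restore(saved[nxt])


cpu = ToyCPU()
saved = {
    "A": Context(pc=100, registers=[1, 2, 3, 4], page_table={0: 7}),
    "B": Context(pc=200, registers=[9, 9, 9, 9], page_table={0: 8}),
}
cpu.restore(saved["A"])
cpu.pc += 4                 # task A makes some progress
cpu.registers[0] = 42
context_switch(cpu, saved, "A", "B")   # A gives up the CPU
context_switch(cpu, saved, "B", "A")   # A resumes later
resumed = (cpu.pc, cpu.registers, cpu.page_table)
print(resumed)              # (104, [42, 2, 3, 4], {0: 7})
```

Task A resumes with exactly the PC, registers and page table it had when it was switched out, which is the functional transparency the slide describes.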

11 Threads
Threads: separate PC, but share an address space
Has own processor state:
–PC
–Registers
Shares
–Memory
–VM Page Table
Process may have multiple threads
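A minimal sketch of the private/shared split, using Python's standard `threading` module as a stand-in for hardware or OS threads. The names `shared` and `worker` are made up for this example. Each thread has its own locals, playing the role of per-thread PC and registers, while all of them update one object in the common address space.

```python
import threading

shared = {"total": 0}          # lives in the one address space all threads share
lock = threading.Lock()


def worker(n):
    local_sum = 0              # local variable: private to this thread
    for i in range(n):
        local_sum += i
    with lock:                 # updates to shared data need coordination
        shared["total"] += local_sum


threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared["total"])         # 1998000
```

The lock hints at the concurrent memory model question that arises once threads cooperate on shared data.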

12 Multitasking/Multithreading
Gives us an initial model for parallelism
So far, parallelism of unrelated tasks
Eventually, cooperating
–Have to address concurrent memory model
–(next time)

13 Fine Grained

14 HEP/MicroUnity/Tera
Provide a number of contexts
–Separate PCs, register files, …
Number of contexts ≥ operation latency
–Pipeline depth
–Roundtrip time to main memory
Run each context in round-robin fashion
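The relationship between context count and latency can be explored with a small simulation. This is a sketch under assumptions that are not from the slides: every instruction depends on the previous instruction of the same context, its result is ready a fixed `latency` cycles after issue, and the rotation is strict, so a context that is not ready wastes its slot. The function name and parameters are hypothetical.

```python
def round_robin_utilization(num_contexts, latency, cycles=1000):
    """Fraction of cycles that issue an instruction under strict round-robin.

    Toy model: each context may issue again only `latency` cycles after
    its previous issue, and contexts take turns in a fixed rotation.
    """
    ready_at = [0] * num_contexts      # first cycle each context may issue again
    issued = 0
    for cycle in range(cycles):
        ctx = cycle % num_contexts     # fixed rotation, no priorities
        if ready_at[ctx] <= cycle:
            issued += 1
            ready_at[ctx] = cycle + latency
        # otherwise this issue slot is a bubble
    return issued / cycles


for n in (1, 4, 8):
    print(n, round_robin_utilization(n, latency=8))
# 1 0.125
# 4 0.5
# 8 1.0
```

In this model the pipeline stays full once the number of contexts reaches the latency, while a single context sees only one issue slot in eight.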

15 HEP Pipeline [figure: Arvind+Iannucci, DFVLR’87]

16 Strict Interleaved Threading
Uses parallelism to get throughput
–Avoid interlocking, bypass…
–Cover memory latency
–Essentially C-slow transformation of processor
Potentially poor single-threaded performance
–Increases end-to-end latency of thread

17 SMT

18 Superscalar and Multithreading?
Do both?
Issue from multiple threads into pipeline
No worse than (super)scalar on single thread
More throughput with multiple threads
–Fill in what would have been empty issue slots with instructions from different threads

19 SuperScalar Inefficiency
[figure label: Unused Slot]
Recall: limited Scalar IPC

20 SMT Promise
Fill in empty slots with other threads
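A toy illustration of slot filling. The policy and the per-cycle numbers are invented for this sketch and are not measurements: threads are taken in list order, and each gets as many of the remaining issue slots as it has ready instructions. The function name `fill_issue_slots` is hypothetical.

```python
def fill_issue_slots(ready_per_thread, width):
    """Assign up to `width` issue slots in one cycle; return slots per thread."""
    remaining = width
    granted = []
    for ready in ready_per_thread:
        take = min(ready, remaining)
        granted.append(take)
        remaining -= take
    return granted


# Ready instructions per cycle for two hypothetical threads (made-up numbers).
thread_a = [2, 0, 1, 3]
thread_b = [1, 2, 0, 2]
WIDTH = 4

# Thread A alone on a 4-wide machine.
alone = sum(fill_issue_slots([a], WIDTH)[0] for a in thread_a)

# Both threads sharing the same 4 issue slots each cycle.
together = sum(
    sum(fill_issue_slots([a, b], WIDTH)) for a, b in zip(thread_a, thread_b)
)

print(alone, together)   # 6 10  (out of 16 available slots)
```

Thread A keeps priority and issues exactly what it would have issued alone, matching the "no worse on a single thread" goal, while thread B uses slots that would otherwise have gone empty.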

21 SMT Estimates (ideal) [Tullsen et al. ISCA ’95]

22 SMT Estimates (ideal) [Tullsen et al. ISCA ’95]

23 SMT uArch
Observation: exploit register renaming
–Requires only small modifications to existing superscalar architecture

24 SMT uArch
N.B. remarkable thing is how similar superscalar core is [Tullsen et al. ISCA ’96]

25 Alpha: Basic Out-of-order Pipeline
[figure: pipeline stages Fetch, Decode/Map, Queue, Reg Read, Execute, Dcache/Store Buffer, Reg Write, Retire; structures PC, Icache, Register Map, Regs, Dcache; label: Thread-blind]
[Src: Tryggve Fossum, Compaq 2000]

26 Alpha SMT Pipeline
[figure: pipeline stages Fetch, Decode/Map, Queue, Reg Read, Execute, Dcache/Store Buffer, Reg Write, Retire; structures PC, Icache, Register Map, Regs, Dcache]
[Src: Tryggve Fossum, Compaq]

27 SMT uArch
Changes:
–Multiple PCs
–Control to decide which thread to fetch from
–Separate return stacks per thread
–Per-thread reorder/commit/flush/trap
–Thread id w/ BTB
–Larger register file
More things outstanding

28 Performance [Tullsen et al. ISCA ’96]

29 Relative Performance (Alpha)
[figure: Relative Multithreaded Performance; y-axis: Relative Performance, 0.00 to 2.50; x-axis (Programs): Int95, Fp95, Int95/FP95, SQL; series: 1-T, 2-T, 3-T, 4-T]
[Src: Tryggve Fossum, Compaq]

30 Alpha SMT
Cost-effective Multiprocessing: increased throughput
4x architectural registers
2x performance gain with little additional cost and complexity

31 Optimizing: fetch freedom
RR=Round Robin
RR.X.Y
–X: number of threads that fetch in a cycle
–Y: instructions fetched per thread
[Tullsen et al. ISCA ’96]
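One way to read the RR.X.Y notation is as a fetch schedule. The sketch below is an interpretation, not code from the cited paper: each cycle the next X threads in rotation are each allowed to fetch up to Y instructions. The function name and the assumption that X is no larger than the thread count are introduced here.

```python
def rr_fetch_schedule(num_threads, x, y, cycles):
    """Per cycle, list the (thread_id, max_instructions) pairs allowed to fetch.

    Assumes x <= num_threads so no thread is picked twice in one cycle.
    """
    schedule = []
    nxt = 0                            # next thread in the rotation
    for _ in range(cycles):
        picks = [((nxt + i) % num_threads, y) for i in range(x)]
        nxt = (nxt + x) % num_threads
        schedule.append(picks)
    return schedule


# RR.2.4 with 4 threads: two threads per cycle, up to four instructions each.
for cycle, picks in enumerate(rr_fetch_schedule(4, 2, 4, 3)):
    print(cycle, picks)
# 0 [(0, 4), (1, 4)]
# 1 [(2, 4), (3, 4)]
# 2 [(0, 4), (1, 4)]
```

Choosing X and Y trades fetch bandwidth per thread against the number of threads fed each cycle, with X times Y bounding the total fetched per cycle.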

32 Optimizing: Fetch Alg.
ICOUNT: priority to thread w/ fewest pending instrs
BRCOUNT
MISSCOUNT
IQPOSN: penalize threads w/ old instrs (at front of queues)
[Tullsen et al. ISCA ’96]
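The ICOUNT rule as stated on the slide can be sketched in a few lines. The pending-instruction counts are taken as given inputs, how hardware would maintain them is outside this sketch, and the tie-break by lower thread id is an assumption made here. The function name `icount_pick` is hypothetical.

```python
def icount_pick(pending, num_to_pick=1):
    """Return thread ids with the fewest pending instructions first.

    `pending[t]` is the number of not-yet-issued instructions for thread t.
    Ties are broken in favour of the lower thread id (an assumption).
    """
    order = sorted(range(len(pending)), key=lambda t: (pending[t], t))
    return order[:num_to_pick]


pending = [12, 3, 7, 3]
print(icount_pick(pending, num_to_pick=2))   # [1, 3]
```

Favouring the threads with the fewest pending instructions steers fetch bandwidth away from threads that are currently clogging the queues.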

33 Throughput Improvement
8-issue superscalar
–Achieves a little over 2 instructions per cycle
Optimized SMT
–Achieves 5.4 instructions per cycle on 8 threads
2.5x throughput increase
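A quick consistency check on these figures. The baseline IPC below is implied by the other two numbers rather than quoted from the slide, and the slot-utilization fractions are derived here simply by dividing by the 8 issue slots:

```latex
\mathrm{IPC}_{\text{base}} \approx \frac{5.4}{2.5} = 2.16,
\qquad
\frac{\mathrm{IPC}_{\text{SMT}}}{8} = \frac{5.4}{8} = 0.675,
\qquad
\frac{\mathrm{IPC}_{\text{base}}}{8} \approx \frac{2.16}{8} = 0.27
```

So "a little over 2" agrees with the 2.5x claim, and under this reading SMT raises issue-slot use from roughly a quarter to roughly two thirds.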

34 Costs [Burns+Gaudiot HPCA’2000]

35 Costs [Burns+Gaudiot HPCA’2000]
Single and double cache line fetches

36 Check on B5

37 Machines
Pull machines off grid?
–Matrix1?
–Matrix37-39
Off for someone else, leave off and use if they finish in timely fashion?
–Not necessary for single processor machine
Department grid cluster

38 Big Ideas
0, 1, Infinity → virtualize resources
–Processes virtualize CPU
Latency Hiding
–Processes, Threads
–Find something else useful to do while waiting…

