
1 Hardware Multithreading

2 Increasing CPU Performance
By increasing clock frequency
By increasing Instructions per Clock
– Minimizing memory access impact: data cache
– Maximising inst issue rate: branch prediction
– Maximising inst issue rate: superscalar
– Maximising pipeline utilisation: avoid instruction dependencies, out-of-order execution

3 Increasing Parallelism
The amount of parallelism that we can exploit is limited by the programs
– Some areas exhibit great parallelism
– Some others are essentially sequential
In the latter case, where can we find additional independent instructions?
– In a different program!

4 Hardware Multithreading
Allow multiple threads to share a single processor
Requires replicating the independent state of each thread
Virtual memory can be used to share memory among threads
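The replicated-versus-shared split described on this slide can be sketched in software. This is a minimal illustrative model, not any real ISA: the names `ThreadContext`, `Core` and `asid`, and the 32-entry register file, are all assumptions.

```python
# Sketch of the per-thread state a multithreaded core must replicate,
# and the state it can share. All names are illustrative.

from dataclasses import dataclass, field

@dataclass
class ThreadContext:
    """Architectural state that must exist once per hardware thread."""
    pc: int = 0                                           # program counter
    regs: list = field(default_factory=lambda: [0] * 32)  # register file
    asid: int = 0             # tag selecting this thread's VA mapping

@dataclass
class Core:
    """Shared resources: one pipeline and its caches serve all contexts."""
    contexts: list            # one ThreadContext per hardware thread
    # Caches, TLB and functional units are shared; virtual memory lets the
    # threads share physical memory safely via their per-thread mappings.

core = Core(contexts=[ThreadContext(asid=0), ThreadContext(asid=1)])
```

Note that only the small `ThreadContext` is duplicated; the expensive datapath in `Core` exists once, which is the whole attraction of hardware multithreading.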

5 CPU Support for Multithreading
[Diagram: a pipeline of Fetch, Decode, Exec, Mem and Write logic with a shared Inst Cache, Data Cache and Address Translation; replicated per thread: PC A / PC B, Reg A / Reg B, VA Mapping A / VA Mapping B]

6 Hardware Multithreading
Different ways to exploit this new source of parallelism:
– Coarse-grain multithreading
– Fine-grain multithreading
– Simultaneous Multithreading

7 Coarse-Grain Multithreading

8 Issue instructions from a single thread
Operate like a simple pipeline
Switch thread on an “expensive” operation:
– E.g. an I-cache miss
– E.g. a D-cache miss

9 Switch Threads on I-cache miss

Cycle:   1    2    3    4    5    6    7
Inst a   IF   ID   EX   MEM  WB
Inst b        IF   ID   EX   MEM  WB
Inst c             IF(miss)  – removed
Inst X                  IF   ID   EX   MEM
Inst Y                       IF   ID   EX
Inst Z                            IF   ID

Remove Inst c and switch to the other thread
The next thread will continue its execution until there is another I-cache or D-cache miss

10 Switch Threads on D-cache miss

Cycle:   1    2    3    4    5    6    7
Inst a   IF   ID   EX   M(miss)   – removed
Inst b        IF   ID   EX        – aborted
Inst c             IF   ID        – aborted
Inst d                  IF        – aborted
Inst X                       IF   ID   EX
Inst Y                            IF   ID

Remove Inst a and switch to the other thread
– Remove the rest of the instructions from the ‘blue’ thread (abort those in the miss shadow)
– Roll back the ‘blue’ PC to point to Inst a

11 Coarse-Grain Multithreading
Good to compensate for infrequent but expensive pipeline disruptions
Minimal pipeline changes:
– Need to abort all the instructions in the “shadow” of a D-cache miss → overhead
– Resume the instruction stream to recover
Short stalls (data/control hazards) are not solved
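The switch-on-miss policy of the last few slides can be sketched as a toy scheduler. Everything here is an illustrative assumption: the thread names, the instruction streams, and the simplification that a missed instruction is serviced (once) while the other thread runs and then retried.

```python
# Toy model of coarse-grain multithreading: issue from one thread until an
# "expensive" operation (a cache miss), then switch to the other thread.
# The missed instruction is not committed; its thread's PC stays put so it
# is retried when the thread is next selected.

def coarse_grain_schedule(streams, misses, n_cycles):
    """streams: dict thread-name -> list of instructions
       misses:  set of (thread, inst) pairs that miss on first execution"""
    names = list(streams)
    pc = {n: 0 for n in names}          # per-thread PC: the replicated state
    cur, issued, serviced = 0, [], set()
    for _ in range(n_cycles):
        n = names[cur]
        if pc[n] >= len(streams[n]):
            cur = (cur + 1) % len(names)        # thread finished: switch
            continue
        inst = streams[n][pc[n]]
        if (n, inst) in misses and (n, inst) not in serviced:
            serviced.add((n, inst))             # miss serviced in background
            cur = (cur + 1) % len(names)        # switch on expensive operation;
            continue                            # PC unchanged, inst retried later
        issued.append((n, inst))
        pc[n] += 1
    return issued

streams = {'blue': ['a', 'b', 'c'], 'orange': ['X', 'Y']}
order = coarse_grain_schedule(streams, {('blue', 'c')}, 8)
# 'blue' runs until Inst c misses, 'orange' runs to completion,
# then 'blue' resumes and retries Inst c.
```

The model deliberately omits the abort-and-rollback of shadow instructions noted on slide 11; it captures only the thread-selection policy.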

12 Fine-Grain Multithreading

13 Overlap in time the execution of several threads
Usually using round robin among all the threads in a ‘ready’ state
Requires instantaneous thread switching

14 Fine-Grain Multithreading
Multithreading helps alleviate fine-grain dependencies (e.g. it can remove the need for forwarding between adjacent dependent instructions)

Cycle:   1    2    3    4    5    6    7
Inst a   IF   ID   EX   MEM  WB
Inst M        IF   ID   EX   MEM  WB
Inst b             IF   ID   EX   MEM  WB
Inst N                  IF   ID   EX   MEM
Inst c                       IF   ID   EX
Inst P                            IF   ID

15 I-cache misses in Fine-Grain Multithreading
An I-cache miss is overcome transparently:

Cycle:   1    2    3    4    5    6    7
Inst a   IF   ID   EX   MEM  WB
Inst M        IF   ID   EX   MEM  WB
Inst b             IF(miss)  – removed
Inst N                  IF   ID   EX   MEM
Inst P                       IF   ID   EX
Inst Q                            IF   ID

Inst b is removed and the thread is marked as not ‘ready’
The ‘blue’ thread is not ready, so ‘orange’ is executed

16 D-cache misses in Fine-Grain Multithreading
Mark the thread as not ‘ready’ and issue only from the other thread:

Cycle:   1    2    3    4    5    6    7
Inst a   IF   ID   EX   M(miss) miss  WB
Inst M        IF   ID   EX   MEM  WB
Inst b             IF   ID        – removed
Inst N                  IF   ID   EX   MEM
Inst P                       IF   ID   EX
Inst Q                            IF   ID

The thread is marked as not ‘ready’; remove Inst b and update the PC
The ‘blue’ thread is not ready, so ‘orange’ is executed
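The ‘ready’-set round-robin policy used in the last two slides can be sketched as a toy scheduler. The thread names, streams and miss latencies are illustrative assumptions, and the model issues at most one instruction per cycle as on the slides.

```python
# Toy model of fine-grain multithreading: round robin among 'ready' threads
# every cycle. A thread whose instruction misses is removed from the 'ready'
# set until the miss is serviced; its PC is not advanced, so the missed
# instruction is retried when the thread wakes up.

def fine_grain_schedule(streams, miss_latency, n_cycles):
    """streams:      dict thread-name -> list of instructions
       miss_latency: dict (thread, inst) -> cycles the miss takes (first time)"""
    names = list(streams)
    pc = {n: 0 for n in names}          # per-thread PC (replicated state)
    wake = {n: 0 for n in names}        # cycle at which a thread is 'ready' again
    last, issued = -1, []
    for cyc in range(n_cycles):
        pick, start = None, (last + 1) % len(names)
        for i in range(len(names)):     # round robin over 'ready' threads
            cand = names[(start + i) % len(names)]
            if wake[cand] <= cyc and pc[cand] < len(streams[cand]):
                pick = cand
                break
        if pick is None:
            issued.append(('-', 'stall'))       # no thread ready this cycle
            continue
        inst = streams[pick][pc[pick]]
        if (pick, inst) in miss_latency:
            # Miss: remove the instruction, mark the thread not 'ready',
            # keep the PC so the instruction is retried after waking.
            wake[pick] = cyc + miss_latency.pop((pick, inst))
            continue
        issued.append((pick, inst))
        pc[pick] += 1
        last = names.index(pick)
    return issued

streams = {'blue': ['a', 'b'], 'orange': ['M', 'N']}
trace = fine_grain_schedule(streams, {('blue', 'b'): 3}, 8)
# 'orange' fills the cycles while 'blue' waits out its miss.
```

When every thread is either finished or waiting on a miss, the model stalls, matching the slide's point that a single not-ready thread simply yields its slots.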

17 D-cache misses in Fine-Grain Multithreading (out of order)
In an out-of-order processor we may continue issuing instructions from both threads:

Cycle:   1    2    3    4       5     6    7
Inst a   IF   RO   EX   M(miss) miss  WB
Inst M        IF   RO   EX      MEM   WB
Inst b             IF   RO      (RO)  EX
Inst N                  IF      RO    EX   MEM
Inst c                          IF    (RO)
Inst P                                IF   RO

Dependent ‘blue’ instructions wait in the RO (read operands) stage until the miss resolves, while ‘orange’ instructions continue to issue

18 Fine-Grain Multithreading
Improves the utilisation of pipeline resources
The impact of short stalls is alleviated by executing instructions from other threads
Single-thread execution is slowed
Requires an instantaneous thread-switching mechanism

19 Simultaneous Multi-Threading

20 The main idea is to exploit instruction-level parallelism and thread-level parallelism at the same time
In a superscalar processor, issue instructions from different threads
Instructions from different threads can be using the same stage of the pipeline

21 Simultaneous MultiThreading
Let’s look simply at instruction issue (here a 2-way superscalar):

Cycle:   1    2    3    4    5    6    7    8    9
Inst a   IF   ID   EX   MEM  WB
Inst b   IF   ID   EX   MEM  WB
Inst M        IF   ID   EX   MEM  WB
Inst N        IF   ID   EX   MEM  WB
Inst c             IF   ID   EX   MEM  WB
Inst P             IF   ID   EX   MEM  WB
Inst Q                  IF   ID   EX   MEM  WB
Inst d                  IF   ID   EX   MEM  WB
Inst e                       IF   ID   EX   MEM  WB
Inst R                       IF   ID   EX   MEM  WB
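The issue pattern above can be sketched as a toy model of the issue stage: each cycle, up to `width` instructions are picked, possibly from different threads. The width, thread names and streams are illustrative assumptions, and real SMT issue logic must also respect dependencies and structural hazards, which this sketch ignores.

```python
# Toy model of SMT issue: a 'width'-wide superscalar fills its issue slots
# each cycle from whichever threads still have instructions, using a
# round-robin pointer so one thread cannot monopolise the slots.

def smt_issue(streams, width, n_cycles):
    """streams: dict thread-name -> list of instructions"""
    names = list(streams)
    pc = {n: 0 for n in names}
    schedule, rr = [], 0                # rr: round-robin pointer over threads
    for _ in range(n_cycles):
        slot, tries = [], 0
        while len(slot) < width and tries < len(names):
            n = names[rr % len(names)]
            rr += 1
            if pc[n] < len(streams[n]):
                slot.append((n, streams[n][pc[n]]))
                pc[n] += 1
                tries = 0               # found one: rescan all threads for next slot
            else:
                tries += 1              # this thread is drained
        schedule.append(slot)           # instructions issued together this cycle
    return schedule

streams = {'blue': ['a', 'b', 'c'], 'orange': ['M', 'N']}
sched = smt_issue(streams, 2, 3)
# Cycles 1-2 issue one instruction from each thread; once 'orange' drains,
# the remaining slot goes unfilled rather than forcing a stall.
```

Note the key SMT property the model shows: a single cycle's issue group mixes threads, so a drained or stalled thread leaves slots that other threads can use.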

22 SMT issues
Asymmetric pipeline stall:
– One part of the pipeline stalls; we want the other part to continue
Overtaking: we want the unstalled thread to make progress
Existing implementations are on out-of-order, register-renamed architectures (similar to Tomasulo’s algorithm)

23 SMT: Glimpse Into The Future?
Scout threads?
– A thread to prefetch memory and reduce cache-miss overhead
Speculative threads?
– Allow a thread to execute speculatively way past a branch/jump/call/miss/etc.
– Needs revised out-of-order logic
– Needs extra memory support
– See Transactional Memory

24 Simultaneous MultiThreading
Extracts the most parallelism from instructions and threads
Implemented only in out-of-order processors, because they are the only ones able to exploit that much parallelism
Has a significant hardware overhead

25 Hardware Multithreading

26 Benefits of Hardware Multithreading
All multithreading techniques improve the utilisation of processor resources and, hence, the performance
If the different threads are accessing the same input data, they may be using the same regions of memory
– Cache efficiency improves in these cases

27 Disadvantages of Hardware Multithreading
The perceived performance may be degraded compared with a single-thread CPU
– Multiple threads interfere with each other
The cache has to be shared among several threads, so effectively each thread sees a smaller cache
Thread scheduling at the hardware level adds high complexity to processor design
– Thread state, managing priorities, OS-level information, …

28 Comparison of Multithreading Techniques

29 Multithreading Summary
A cost-effective way of finding additional parallelism for the CPU pipeline
Available in x86, Itanium, POWER and SPARC (most architectures)
Each additional CPU thread is presented to the Operating System as an additional CPU
Operating Systems beware!!! (why?)
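The point about hardware threads appearing as CPUs can be observed from software: standard CPU-count APIs report logical CPUs, i.e. hardware threads, not physical cores. A small illustrative check (the actual count is machine-dependent, so nothing specific is assumed about it):

```python
# os.cpu_count() reports logical CPUs: on an SMT machine this is the number
# of hardware threads the OS sees, typically a multiple of the core count.
import os

logical = os.cpu_count()
print(f"logical CPUs visible to the OS: {logical}")
```

This is one reason the slide says "Operating Systems beware": a scheduler that treats two SMT threads on the same core as two independent CPUs may co-schedule work that then competes for one core's cache and functional units.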

