
1 SIMULTANEOUS MULTITHREADING Ting Liu Liu Ren Hua Zhong

2 Contemporary forms of parallelism

Instruction-level parallelism (ILP)
- Wide-issue superscalar processors (SS)
  - Issue 4 or more instructions per cycle
  - Execute a single program or thread
  - Attempt to find multiple instructions to issue each cycle

Thread-level parallelism (TLP)
- Fine-grained multithreaded superscalars (FGMT)
  - Contain hardware state for several threads
  - Execute multiple threads, but on any given cycle the processor executes instructions from only one thread
- Multiprocessors (MP)
  - Improve performance by adding more CPUs

(A small program contrasting the two forms follows this list.)
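
As an illustration of the distinction (not from the slides; the function and variable names are hypothetical): in the sketch below, the two statements in the loop body are independent operations a wide-issue superscalar can issue in the same cycle (ILP), while the two pthreads are separate instruction streams that multithreaded hardware can interleave (TLP).

    #include <pthread.h>
    #include <stdio.h>

    #define N 1024
    static double a[N], b[N], c[N], d[N];

    static void *worker(void *arg) {
        int half = *(int *)arg;       /* 0 = first half, 1 = second half */
        int lo = half * (N / 2), hi = lo + N / 2;
        for (int i = lo; i < hi; i++) {
            c[i] = a[i] + b[i];       /* independent of the next line,   */
            d[i] = a[i] * b[i];       /* so both can issue in one cycle: */
        }                             /* instruction-level parallelism   */
        return NULL;
    }

    int main(void) {
        /* Two separate instruction streams: thread-level parallelism. */
        pthread_t t0, t1;
        int h0 = 0, h1 = 1;
        pthread_create(&t0, NULL, worker, &h0);
        pthread_create(&t1, NULL, worker, &h1);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        printf("c[0]=%f d[N-1]=%f\n", c[0], d[N - 1]);
        return 0;
    }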

3 Simultaneous Multithreading

Key idea: issue multiple instructions from multiple threads each cycle.

Features:
- Fully exploits both thread-level and instruction-level parallelism.
- Better performance on:
  - a mix of independent programs
  - programs that are parallelizable
  - a single-threaded program

4 [Figure: issue-slot utilization over time in a superscalar (SS), a fine-grained multithreaded processor (FGMT), and an SMT processor]
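
A minimal sketch (not from the slides) of what the figure conveys: each cycle offers a fixed number of issue slots; SS fills them from one thread and wastes slots when that thread lacks ready instructions, FGMT switches threads across cycles but still issues from a single thread per cycle, and SMT fills each cycle's slots from any ready thread. The ready-instruction counts are made-up inputs.

    #include <stdio.h>

    #define WIDTH   4   /* issue slots per cycle */
    #define THREADS 4
    #define CYCLES  6

    /* ready[t][c]: hypothetical count of issuable instructions
     * that thread t has available in cycle c. */
    static const int ready[THREADS][CYCLES] = {
        {2, 1, 4, 0, 3, 2},
        {3, 0, 2, 4, 1, 3},
        {1, 4, 0, 2, 2, 1},
        {4, 2, 1, 3, 0, 4},
    };

    static int min(int x, int y) { return x < y ? x : y; }

    int main(void) {
        int ss = 0, fgmt = 0, smt = 0;
        for (int c = 0; c < CYCLES; c++) {
            /* SS: a single thread (thread 0) every cycle. */
            ss += min(WIDTH, ready[0][c]);
            /* FGMT: a different single thread each cycle (round-robin). */
            fgmt += min(WIDTH, ready[c % THREADS][c]);
            /* SMT: fill the slots from all threads combined. */
            int slots = WIDTH;
            for (int t = 0; t < THREADS && slots > 0; t++) {
                int issued = min(slots, ready[t][c]);
                slots -= issued;
                smt += issued;
            }
        }
        printf("IPC  SS=%.2f  FGMT=%.2f  SMT=%.2f\n",
               (double)ss / CYCLES, (double)fgmt / CYCLES,
               (double)smt / CYCLES);
        return 0;
    }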

5 Multiprocessor vs. SMT

[Figure: resource partitioning in a two-processor multiprocessor (MP2) versus a single SMT processor]

6 SMT Architecture (1)

Base processor: an out-of-order superscalar processor (modeled on the MIPS R10000).

Changes: with N simultaneously running threads, the processor needs N program counters, N subroutine return stacks, and more than N*32 physical registers in total for register renaming (a sketch of this state follows).
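
A rough sketch of how the replicated and shared state could be laid out; this is an assumption for illustration, not the paper's actual structures. 32 is the MIPS architectural register count; EXTRA_RENAME_REGS and RET_STACK_DEPTH are hypothetical sizes.

    #include <stdint.h>

    #define N_THREADS          8
    #define ARCH_REGS          32   /* architectural registers per thread */
    #define EXTRA_RENAME_REGS  100  /* hypothetical headroom for renaming */
    #define RET_STACK_DEPTH    12   /* hypothetical return-stack depth    */

    /* Replicated per thread: PC, subroutine return stack, rename map. */
    struct thread_state {
        uint64_t pc;
        uint64_t ret_stack[RET_STACK_DEPTH];
        int      ret_top;
        uint16_t rename_map[ARCH_REGS];  /* arch reg -> physical reg */
    };

    /* Shared by all threads: one physical register file sized at
     * more than N*32 registers so renaming has spare registers. */
    struct smt_core {
        struct thread_state thread[N_THREADS];
        uint64_t phys_regs[N_THREADS * ARCH_REGS + EXTRA_RENAME_REGS];
    };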

7 SMT Architecture (2)

- Larger register files are needed; register access takes longer, so pipeline stages are added (register reads and writes each take 2 stages).
- Threads share the cache hierarchy and the branch prediction hardware.
- Each cycle, select up to 2 threads and fetch up to 4 instructions from each (the 2.4 scheme); a sketch of this fetch stage follows.

Pipeline: Fetch -> Decode -> Renaming -> Queue -> Reg Read -> Exec -> Reg Write -> Commit
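
A minimal sketch of a 2.4-style fetch stage. The thread-selection policy here is simple round-robin, an assumption for brevity (the SMT literature also describes priority heuristics, which these slides do not cover), and the per-thread fetchable counts are made up.

    #include <stdio.h>

    #define N_THREADS        8
    #define FETCH_THREADS    2   /* "2": threads fetched per cycle          */
    #define FETCH_PER_THREAD 4   /* "4": max instructions per thread/cycle  */

    /* Hypothetical per-thread state: how many instructions each
     * thread's I-cache line can supply this cycle (0 = stalled). */
    static int fetchable[N_THREADS] = {4, 1, 0, 4, 2, 4, 3, 4};

    /* Select up to 2 ready threads round-robin and fetch up to 4
     * instructions from each -- the 2.4 scheme. */
    static int fetch_cycle(int *next_thread) {
        int fetched = 0, picked = 0;
        for (int i = 0; i < N_THREADS && picked < FETCH_THREADS; i++) {
            int t = (*next_thread + i) % N_THREADS;
            if (fetchable[t] == 0)
                continue;                 /* skip stalled threads */
            int n = fetchable[t] < FETCH_PER_THREAD ? fetchable[t]
                                                    : FETCH_PER_THREAD;
            fetched += n;
            picked++;
        }
        *next_thread = (*next_thread + 1) % N_THREADS;
        return fetched;
    }

    int main(void) {
        int next = 0;
        for (int c = 0; c < 4; c++)
            printf("cycle %d: fetched %d instructions\n",
                   c, fetch_cycle(&next));
        return 0;
    }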

8 Effectively Using Parallelism on an SMT Processor

Instruction throughput (instructions per cycle) executing a parallel workload:

    Threads   SS    MP2   MP4   FGMT   SMT
       1      3.3   2.4   1.5   3.3    3.3
       2      --    4.3   2.6   4.1    4.7
       4      --    --    4.2   4.2    5.6
       8      --    --    --    3.5    6.1
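
A quick reading of the table:

    SMT over SS at its best:       6.1 / 3.3 ≈ 1.8x
    SMT over MP4 at 4 threads:     5.6 / 4.2 ≈ 1.3x
    FGMT past 4 threads declines:  4.2 -> 3.5

FGMT's decline is consistent with its one-thread-per-cycle issue limit from slide 2, while SMT keeps gaining as threads are added.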

9 Effects of Thread Interference in Shared Structures
- Interthread cache interference
- Increased memory requirements
- Interference in branch prediction hardware

10 Interthread Cache Interference

Because threads share the caches, more threads means a lower hit rate. Two reasons why this is not a significant problem:
1. L1 cache misses are almost entirely covered by the 4-way set-associative L2 cache.
2. Out-of-order execution, write buffering, and the use of multiple threads allow SMT to hide the small increase in memory latency.
Eliminating interthread cache misses entirely gains only a 0.1% speedup.

11 Increased Memory Requirements

The more threads run, the more memory references are made per cycle; bank conflicts in the L1 cache account for most of the resulting contention. The effect is negligible:
1. With longer cache lines, the gains from better spatial locality outweigh the cost of L1 bank contention.
2. Eliminating interthread bank contention entirely gains only a 3.4% speedup.
A sketch of how bank conflicts arise follows.
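
To make "bank conflict" concrete, here is a hedged sketch; the line size, bank count, and addresses are hypothetical, not the simulated machine's parameters. Two accesses in the same cycle that map to the same bank must serialize.

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_SIZE 64   /* hypothetical cache line size in bytes */
    #define N_BANKS    8   /* hypothetical number of L1 banks       */

    /* A banked L1 spreads consecutive lines across the banks. */
    static unsigned bank_of(uint64_t addr) {
        return (addr / LINE_SIZE) % N_BANKS;
    }

    int main(void) {
        uint64_t thread0_addr = 0x10040;      /* made-up addresses that  */
        uint64_t thread1_addr = 0x10040       /* land in the same bank   */
                                + LINE_SIZE * N_BANKS;
        if (bank_of(thread0_addr) == bank_of(thread1_addr))
            printf("bank conflict: accesses serialize\n");
        else
            printf("no conflict: accesses proceed in parallel\n");
        return 0;
    }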

12 Interference in Branch Prediction Hardware

Since all threads share the branch prediction hardware, it experiences interthread interference. The effect is negligible because:
1. The speedup from multithreading outweighs the added misprediction latency.
2. Going from 1 to 8 threads, misprediction rates rise only from 2.0% to 2.8% for branches and from 0.0% to 0.1% for jumps.
A sketch of the aliasing mechanism follows.
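
A hedged sketch of why sharing causes interference; the table size and the plain 2-bit-counter design are assumptions, not the simulated machine's predictor. Branches from different threads can alias to the same counter and train it against each other.

    #include <stdint.h>
    #include <stdio.h>

    #define TABLE_SIZE 4096  /* hypothetical predictor table entries */

    /* Shared table of 2-bit saturating counters, indexed by PC bits
     * only, with no thread ID, so threads can alias. */
    static uint8_t counters[TABLE_SIZE];  /* 0..3; >= 2 predicts taken */

    static int predict(uint64_t pc) {
        return counters[(pc >> 2) % TABLE_SIZE] >= 2;
    }

    static void train(uint64_t pc, int taken) {
        uint8_t *c = &counters[(pc >> 2) % TABLE_SIZE];
        if (taken  && *c < 3) (*c)++;
        if (!taken && *c > 0) (*c)--;
    }

    int main(void) {
        /* Two made-up branch PCs that map to the same counter. */
        uint64_t pc_t0 = 0x1000;
        uint64_t pc_t1 = 0x1000 + 4 * TABLE_SIZE;
        train(pc_t0, 1); train(pc_t0, 1);  /* thread 0: always taken     */
        train(pc_t1, 0); train(pc_t1, 0);  /* thread 1: always not taken */
        printf("thread 0 now predicted %s\n",   /* mistrained by thread 1 */
               predict(pc_t0) ? "taken" : "not taken");
        return 0;
    }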

13 Discussion ???

