
1 CS 7810 Lecture 20: Initial Observations of the Simultaneous Multithreading Pentium 4 Processor. N. Tuck and D.M. Tullsen, Proceedings of PACT-12, September 2003.

2 Pentium 4 Architecture
Fetch/commit width = 3 μops, execution width = 6 μops
128 physical registers; 126 instructions in flight (48 loads, 24 stores)
Trace cache holds 12K μops, 6 μops per line
Latencies: L1 – 2 cycles, L2 – 18 cycles, memory – 361 cycles

3 Hyper-Threading
Two threads – the Linux operating system behaves as if it were executing on a two-processor system
When only one thread is available, the core behaves like a regular single-threaded superscalar processor
Statically divided resources: ROB, load/store queues, issue queues – a slow thread cannot cripple throughput (but this might not scale to more threads)
Dynamically shared resources: trace cache and decode (fine-grained multithreaded, round-robin), functional units, data cache, branch predictor

4 Results
Throughput rises from 2.2 IPC (single thread) to 3.9 (base SMT) to 5.4 (with the ICOUNT.2.8 fetch policy)

5 Methodology
Three workloads: single-threaded base, parallel workload (two parallel threads of the same SPLASH application), and heterogeneous workload (each single-threaded app running alongside each of the other apps)
For heterogeneous workloads: execute the two threads together and restart a program whenever it finishes; do this 12 times, discard the last execution, and compute the average IPC for each thread
If thread A executes at 85% of its single-threaded efficiency and thread B at 75%, the combined speedup is 0.85 + 0.75 = 1.6
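The speedup arithmetic on this slide can be checked with a few lines of Python (a sketch; `weighted_speedup` is a name chosen here for the metric, which sums each thread's IPC relative to its single-threaded IPC):

```python
def weighted_speedup(efficiencies):
    """Multiprogrammed speedup: the sum of each thread's IPC
    relative to its own single-threaded IPC (its "efficiency")."""
    return sum(efficiencies)

# Slide's example: thread A at 85% of its solo IPC, thread B at 75%.
print(round(weighted_speedup([0.85, 0.75]), 2))  # -> 1.6
```

A value above 1.0 means the two co-scheduled threads together retire more work per cycle than running them one at a time would.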

6 Static Partitioning
A single thread is statically assigned half of the queue entries – this by itself reduces IPC
A dummy thread ensures that there is no contention for dynamically shared resources (caches, branch predictor) – this isolates the effect of static partitioning
SPEC-int runs at 83% efficiency and SPEC-fp at 85% (range: 71–98%)

7 Multi-Programmed Speedup

8 sixtrack and eon do not degrade their partners (small working sets?)
swim and art degrade their partners (cache contention?)
Best combination: swim & sixtrack; worst combination: swim & art
Static partitioning ensures low interference – in the worst case a thread slows to 0.9 of its single-threaded performance

9 Static vs. Dynamic
Statically partitioned resources (queues, ROB): threads run at 83–85% efficiency
Dynamically shared resources (fetch bandwidth, caches, branch predictor): threads run at ~60% efficiency
Both contribute comparably – but without static partitioning, the losses from dynamic sharing could grow out of control

10 Parallel Thread Results Parallel threads have similar characteristics and put more pressure on shared resources

11 Communication Speed
Locking and reading a value takes 68 cycles; locking and updating a value takes 171 cycles (still lower than the memory access time)
To parallelize efficiently, each loop must contain enough parallel work to offset the synchronization cost – roughly 20,000 computations for SMT versus 200,000 for an SMP
The synchronization mechanisms assumed in past research were more optimistic than the real design
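The amortization argument can be sketched numerically. The simple model below (an assumption of this sketch, not the paper's analysis) treats the overhead as synch cycles divided by total cycles per iteration; the 171-cycle lock-and-update cost is from the slide, and the 1% target is an illustrative choice:

```python
def synch_overhead(work_cycles, synch_cycles):
    """Fraction of one loop iteration spent on synchronization,
    under a simple work + synch cost model."""
    return synch_cycles / (synch_cycles + work_cycles)

LOCK_AND_UPDATE = 171  # measured Hyper-Threading cost from the slide

# With ~20,000 cycles of parallel work per iteration, synchronization
# through the shared cache costs under 1% of each iteration.
print(synch_overhead(20_000, LOCK_AND_UPDATE))
```

Under the same model, an SMP with much costlier synchronization needs proportionally more work per iteration (the slide's ~200,000 computations) to reach a comparably small overhead.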

12 Microbenchmark
(Figure: a loop with a parallel region of independent work followed by a loop-carried dependence that forces synchronization)
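The microbenchmark itself is not reproduced on the slide; the sketch below, using Python threads, shows a plausible shape for it. The structure is the point – private parallel work per iteration, then a lock-protected update to a shared value that carries a dependence across iterations – and the constants (`WORK`, `ITERS`) are arbitrary knobs of this sketch:

```python
import threading

WORK = 1000        # amount of independent work per iteration (tunable)
ITERS = 100
lock = threading.Lock()
shared_total = 0   # loop-carried dependence: each update reads the previous value

def worker():
    global shared_total
    for _ in range(ITERS):
        # Parallel region: private computation, no synchronization.
        local = sum(range(WORK))
        # Serial region: lock, read, update -- the part whose cost
        # (68 cycles to read, 171 to update on Hyper-Threading)
        # must be amortized by the parallel region above.
        with lock:
            shared_total += local

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared_total)  # 2 * ITERS * sum(range(WORK))
```

Sweeping `WORK` while timing the loop traces out the computation-vs-communication curve of the next slide: small `WORK` is dominated by lock traffic, large `WORK` amortizes it.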

13 Computation vs. Communication

14 Thread Co-Scheduling
Diverse programs interfere less with each other
Average speedup is 1.20; running two copies of the same program yields only 1.11, int–int pairs 1.17, fp–fp pairs 1.20, and int–fp pairs 1.21
Symbiotic jobscheduling: each thread has two favorable partners – construct a schedule such that every thread is co-scheduled only with its partners – average speedup of 1.27
Linux cannot exploit this: it runs two independent schedulers, one per logical processor
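Symbiotic jobscheduling can be sketched as picking the pairing of jobs that maximizes the average co-scheduled speedup. The pairwise speedup numbers below are hypothetical illustrations, not the paper's measurements, and brute force stands in for the paper's online sampling:

```python
from itertools import permutations

# Hypothetical pairwise speedups for four jobs (symmetric).
jobs = ["mcf", "eon", "swim", "sixtrack"]
speedup = {
    ("mcf", "eon"): 1.25, ("mcf", "swim"): 1.05, ("mcf", "sixtrack"): 1.22,
    ("eon", "swim"): 1.18, ("eon", "sixtrack"): 1.20, ("swim", "sixtrack"): 1.28,
}

def pair_speedup(a, b):
    """Look up a symmetric pairwise speedup."""
    return speedup.get((a, b)) or speedup[(b, a)]

def best_pairing(jobs):
    """Brute-force the partition of jobs into co-scheduled pairs
    that maximizes the average pairwise speedup."""
    best, best_avg = None, 0.0
    for perm in permutations(jobs):
        pairs = [(perm[i], perm[i + 1]) for i in range(0, len(perm), 2)]
        avg = sum(pair_speedup(a, b) for a, b in pairs) / len(pairs)
        if avg > best_avg:
            best, best_avg = pairs, avg
    return best, best_avg

pairs, avg = best_pairing(jobs)
print(pairs, avg)
```

With these numbers the scheduler pairs the low-interference jobs together instead of mixing the cache-hungry job (`swim` here) with its worst partner, which is the essence of the symbiosis result.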

15 Compiler Optimizations
Multithreading is tolerant of low-ILP code
Higher optimization levels improve overall performance, but reduce the relative speedup from SMT

16 Unanswered Questions
Area overhead of SMT? (duplicated rename tables, return address stacks, PC registers)
Register utilization?
Effect of fetch policies – is fetch a bottleneck?
Influence on power, energy, and temperature?

17 Conclusions
The real design matches simulation-based expectations
Static partitioning is important to minimize conflicts and bound throughput losses
Dynamic partitioning might be required at 8 threads
Synchronization is an order of magnitude faster than on an SMP, but there is still room for improvement


