CS 7810 Lecture 20 Initial Observations of the Simultaneous Multithreading Pentium 4 Processor N. Tuck and D.M. Tullsen Proceedings of PACT-12 September 2003.


CS 7810 Lecture 20 Initial Observations of the Simultaneous Multithreading Pentium 4 Processor N. Tuck and D.M. Tullsen Proceedings of PACT-12 September 2003

Pentium 4 Architecture Fetch/commit width = 3 μops, execution width = 6, 128 registers, 126 in-flight instrs (48 loads, 24 stores) Trace cache has 12K entries, each line has 6 μops Latencies: L1 – 2 cycles, L2 – 18 cycles, memory – 361 cycles

Hyper-Threading Two threads – the Linux operating system behaves as if it were executing on a two-processor system When only one thread is available, the core behaves like a regular single-threaded superscalar processor Statically divided resources: ROB, LSQ, issue queue – a slow thread cannot cripple throughput (but this may not scale to more threads) Dynamically shared resources: trace cache and decode (fine-grained multithreaded, round-robin), FUs, data cache, bpred

Results Throughput goes from 2.2 (single-thread) to 3.9 (base SMT) to 5.4 (ICOUNT.2.8)

Methodology Three workloads: single-threaded base, parallel workload (two parallel threads of the same SPLASH application), heterogeneous workload (a single-threaded app running with each of the other apps) For heterogeneous workloads – execute two threads together and restart a program when it finishes; do this 12 times, discard the last execution, and compute the average IPC for each thread If thread A executes at 85% efficiency and thread B at 75%, the speedup equals 1.6
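The speedup arithmetic on this slide can be made explicit with a short sketch (my own illustration; the function name and the normalization of single-threaded IPC to 1.0 are not from the paper):

```python
# Multiprogrammed speedup: each thread's IPC while co-scheduled on the SMT
# core is divided by its IPC when running alone, and the two relative
# rates are summed. A value of 2.0 would mean zero interference.

def multiprogrammed_speedup(ipc_smt_a, ipc_alone_a, ipc_smt_b, ipc_alone_b):
    """Sum of each thread's SMT efficiency relative to running alone."""
    return ipc_smt_a / ipc_alone_a + ipc_smt_b / ipc_alone_b

# Example from the slide: thread A at 85% efficiency, thread B at 75%.
print(multiprogrammed_speedup(0.85, 1.0, 0.75, 1.0))  # -> 1.6
```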

Static Partitioning A single thread is statically assigned half the queues – this impacts IPC A dummy thread ensures that there is no contention for dynamically shared resources (caches, bpred) – this isolates the effect of static partitioning SPEC-int achieves 83% efficiency and SPEC-fp achieves 85%; the range is 71-98%

Multi-Programmed Speedup

sixtrack and eon do not degrade their partners (small working sets?) swim and art degrade their partners (cache contention?) Best combination: swim & sixtrack; worst combination: swim & art Static partitioning ensures low interference – the worst per-thread relative performance is 0.9

Static vs. Dynamic Statically partitioned resources (queues, ROB): threads run at 83-85% efficiency Dynamically shared resources (fetch bandwidth, caches, bpred): threads run at ~60% efficiency Both contribute roughly equally – however, without static partitioning, contention in the dynamically shared resources could grow unchecked

Parallel Thread Results Parallel threads have similar resource demands and therefore put more pressure on shared resources

Communication Speed Locking and reading a value takes 68 cycles Locking and updating a value takes 171 cycles (less than the 361-cycle memory access time) To parallelize efficiently, each loop must contain enough parallel work to offset synchronization costs – roughly 20,000 computations for SMT and 200,000 for an SMP – the synchronization mechanisms assumed in past research were more optimistic than the real design
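A back-of-envelope model (mine, not the paper's) shows why a minimum amount of work per loop is needed: with n threads, W cycles of work plus one synchronization of S cycles runs in roughly W/n + S cycles, so splitting only pays off when W > S·n/(n−1). The slide's 20,000/200,000 figures come from the paper's real microbenchmark, which this toy model does not reproduce:

```python
# Toy break-even model: parallel time W/n + S beats serial time W
# exactly when W > S * n / (n - 1). All constants are illustrative.

def break_even_work(sync_cycles, n_threads=2):
    """Smallest work (in cycles) for which splitting across threads pays off."""
    # W/n + S < W  =>  W > S * n / (n - 1)
    return sync_cycles * n_threads / (n_threads - 1)

print(break_even_work(171))  # SMT-style lock+update cost -> 342.0 cycles
```

The model understates the real thresholds because an actual parallel loop synchronizes repeatedly and also suffers cache interference, but it captures the qualitative point: cheaper synchronization (SMT vs. SMP) lowers the granularity at which parallelization becomes profitable.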

Microbenchmark (figure): a parallel region containing a loop-carried dependence

Computation vs. Communication

Thread Co-Scheduling Diverse programs interfere less with each other Avg. speedup is 1.20; running two copies of the same thread yields only 1.11, int-int yields 1.17, fp-fp 1.20, and int-fp 1.21 Symbiotic job scheduling: each thread has two favorable partners – construct a schedule such that every thread is co-scheduled only with its partners – avg. speedup of 1.27 Linux cannot exploit this – it uses two independent schedulers
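Symbiotic co-scheduling can be sketched as a greedy matching over measured pairwise speedups (a hypothetical illustration; the real symbiotic scheduler of Snavely and Tullsen samples schedules online rather than assuming a known speedup table):

```python
# Greedy partner matching: given measured SMT speedups for job pairs,
# repeatedly co-schedule the best-performing pair of still-unscheduled
# jobs. Pair names and speedup values below are illustrative only.

def pair_jobs(speedup):
    """speedup: {(job_a, job_b): measured SMT speedup}; returns chosen pairs."""
    pairs = sorted(speedup, key=speedup.get, reverse=True)
    scheduled, result = set(), []
    for a, b in pairs:
        if a not in scheduled and b not in scheduled:
            scheduled |= {a, b}
            result.append((a, b))
    return result

demo = {("swim", "sixtrack"): 1.4, ("swim", "art"): 0.95,
        ("art", "eon"): 1.25, ("sixtrack", "eon"): 1.2}
print(pair_jobs(demo))  # -> [('swim', 'sixtrack'), ('art', 'eon')]
```

Greedy matching is not optimal in general (maximum-weight matching would be), but it conveys the idea on the slide: pairing swim with sixtrack rather than with art avoids the worst combination.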

Compiler Optimizations Multithreading is tolerant of low-ILP code Higher optimization levels improve overall performance but reduce the speedup from SMT

Unanswered Questions Area overhead of SMT? (multiple renamers, RASs, PC registers) Register utilization? Effect of fetch policies – is fetch a bottleneck? Influence on power, energy, and temperature?

Conclusions The real design matches simulation-based expectations Static partitioning is important to minimize conflicts and bound throughput losses Dynamic partitioning might be required for 8 threads Synchronization is an order of magnitude faster than on an SMP, but there is still room for improvement
