TRANSPARENT THREADS
Gautham K. Dorai and Dr. Donald Yeung
ECE Dept., Univ. Maryland, College Park

SMT Processors
Pipeline multiple threads.
Priority mechanisms: ICOUNT, IQPOSN, MISSCOUNT, BRCOUNT [Tullsen ISCA'96]
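As a rough sketch of how such a priority mechanism works (hypothetical names and values, not the simulator's code), ICOUNT steers fetch slots to the thread with the fewest instructions in the pre-issue pipeline stages:

    #include <stdio.h>

    #define NUM_THREADS 4

    /* icount[t]: instructions thread t holds in the decode, rename,
     * and issue stages this cycle (example values). */
    static int icount[NUM_THREADS] = {12, 3, 27, 9};

    /* ICOUNT policy: fetch from the least-occupied thread. */
    static int pick_fetch_thread(void) {
        int best = 0;
        for (int t = 1; t < NUM_THREADS; t++)
            if (icount[t] < icount[best])
                best = t;
        return best;
    }

    int main(void) {
        printf("fetch from thread %d\n", pick_fetch_thread()); /* prints 1 */
        return 0;
    }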

Individual Thread Performance
Individual threads run slower! With equal priority, each thread runs 31% slower.

Single-Thread Performance
- Multiprogramming (process scheduling)
- Subordinate threading (prefetching/pre-execution, cache management, branch prediction, etc.)
- Performance monitoring (dynamic profiling)

Transparent Threads
Foreground thread + background thread (transparent): the foreground thread sees 0% slowdown.

Single-Thread Performance
- Multiprogramming: latency of a critical high-priority process
- Subordinate threading and performance monitoring: benefit vs. cost (overhead) tradeoff

Road Map
- Motivation
- Transparent Threads
- Experimental Evaluation
- Transparent Software Prefetching
- Conclusion

Shared vs. Private
Transparency – no stealing of shared resources.
[Pipeline diagram: PC, Fetch Queue, ROB, Issue Queues, Functional Units, Register File, Register Map, Predictor, I-Cache, D-Cache]

Slots, Buffers and Memories
- SLOTS – allocation based on the current cycle only
- BUFFERS – allocation based on future cycles
- MEMORIES – allocation based on future cycles
[Pipeline diagram: PC, Fetch Queue, ROB, Issue Queues, Functional Units, Register File, Register Map, Predictor, I-Cache, D-Cache]
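The taxonomy can be made concrete with a small table; this is a sketch for this writeup (the enum names and exact resource assignments are ours, following the slide's diagram):

    /* Sketch of the slot/buffer/memory taxonomy (illustrative only).
     * Slots are re-arbitrated every cycle; buffers and memories hold
     * state across cycles, so granting one to the background thread
     * can deprive the foreground thread in future cycles. */
    typedef enum { SLOT, BUFFER, MEMORY } resource_class;

    struct resource { const char *name; resource_class cls; };

    static const struct resource pipeline_resources[] = {
        { "fetch slot",        SLOT   },  /* reclaimed each cycle */
        { "issue slot",        SLOT   },
        { "fetch queue entry", BUFFER },  /* held until drained   */
        { "issue queue entry", BUFFER },
        { "ROB entry",         BUFFER },
        { "I-cache line",      MEMORY },  /* held until evicted   */
        { "D-cache line",      MEMORY },
        { "predictor entry",   MEMORY },
    };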

Slot Prioritization
ICOUNT(background) = ICOUNT(background) + instruction-window size
Under the ICOUNT 2.N fetch policy, inflating the background thread's instruction count by the window size ensures the foreground thread wins any contested fetch slot.
[Animation: foreground and background I-cache blocks competing for fetch slots]
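A minimal sketch of the bias (our code, not the authors'; the 128-entry window matches the simulated ROB):

    /* Slot prioritization: the background thread's ICOUNT is biased
     * upward by the instruction-window size, so ICOUNT-based fetch
     * selection always favors the foreground thread. */
    #define WINDOW_SIZE 128

    static int effective_icount(int icount, int is_background) {
        return is_background ? icount + WINDOW_SIZE : icount;
    }

    /* Even an idle background thread (ICOUNT 0 -> 128) loses to a
     * nearly full foreground thread (ICOUNT 120 -> 120). */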

Buffer Transparency
[Diagram: fetch hardware (PC1, PC2) feeding the fetch queue, issue queue, and ROB shared by the foreground and background threads]

Background Thread Window Partitioning
- Limit the number of background-thread instructions in the window; fetch stops when the background thread's ICOUNT reaches the limit.
- The foreground thread can still occupy all available entries.
[Diagram: fetch hardware (PC1, PC2), fetch queue, issue queue, partitioned ROB]
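A sketch of the fetch gate (our names; the 32-entry partition is the value used in the evaluation):

    /* Background-thread window partitioning: gate background fetch
     * once it holds its partition's worth of instructions; the
     * foreground thread is never gated. */
    #define BG_PARTITION 32

    static int may_fetch(int icount, int is_background, int rob_free) {
        if (rob_free == 0)
            return 0;                       /* window is full       */
        if (is_background && icount >= BG_PARTITION)
            return 0;                       /* background at limit  */
        return 1;                           /* foreground unrestricted */
    }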

Background Thread Flushing
- No limit on background-thread instructions.
- A flush is triggered on conflict: background instructions are squashed from the ROB tail so the foreground thread can claim the entries.
[Animation: head/tail pointers in the ROB as background entries are flushed]
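A simplified model of the mechanism (our sketch, treating the ROB as a circular buffer): on a conflict, background instructions are squashed youngest-first from the tail, and the background fetch PC rolls back so they can be refetched later:

    /* Sketch of background-thread flushing (not the authors' code). */
    #define ROB_SIZE 128

    struct rob_entry { int is_background; long pc; };

    static struct rob_entry rob[ROB_SIZE];
    static int head;                    /* oldest instruction (unused here) */
    static int tail, occupancy;         /* tail = next free slot            */
    static long bg_fetch_pc;            /* where background refetch resumes */

    static void flush_background_from_tail(void) {
        while (occupancy > 0) {
            int youngest = (tail - 1 + ROB_SIZE) % ROB_SIZE;
            if (!rob[youngest].is_background)
                break;                       /* stop at a foreground inst */
            bg_fetch_pc = rob[youngest].pc;  /* refetch from here later   */
            tail = youngest;
            occupancy--;
        }
    }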

Foreground Thread Flushing
- On a load miss at the head of the ROB, foreground instructions stagnate in the ROB.
- Flush F entries from the tail and block foreground fetch for T cycles; after T cycles, fetching resumes.
- F and T depend on R, the residual cache-miss latency.
[Animation: load miss at the ROB head; stagnant entries flushed from the tail]
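A sketch of how F and T might be derived (the slides only say both depend on R; the scaling below is our placeholder, not the authors' formula):

    /* Foreground-thread flushing parameters (illustrative only).
     * On a load miss at the ROB head, flush F foreground entries
     * from the tail and block foreground fetch for T cycles; both
     * depend on R, the residual miss latency. */
    static void on_head_load_miss(int R, int *F, int *T) {
        *T = R;        /* placeholder: block fetch until the data returns */
        *F = R / 2;    /* placeholder: flush less for nearly-done misses  */
    }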

SimpleScalar-based SMT Simulator
Number of contexts:     4
Issue width:            8
IFQ size:               32
Int/FP issue queues
ROB size:               128
Branch predictor:       gshare + bimodal
L1 caches:              split, 32K, 4-way
L2 unified cache:       512K, 4-way
L1/L2/memory latency:   1/10/122 cycles
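For reference, the simulated machine transcribed as a C struct (a convenience for this writeup, not SimpleScalar's actual configuration interface):

    /* The evaluated SMT configuration as a C struct. */
    struct smt_config {
        int contexts;                 /* hardware thread contexts */
        int issue_width;
        int ifq_size;                 /* instruction fetch queue  */
        int rob_size;
        int l1_kb, l1_assoc;          /* split I/D caches         */
        int l2_kb, l2_assoc;          /* unified L2               */
        int l1_lat, l2_lat, mem_lat;  /* cycles                   */
    };

    static const struct smt_config cfg = {
        .contexts = 4, .issue_width = 8, .ifq_size = 32, .rob_size = 128,
        .l1_kb = 32, .l1_assoc = 4, .l2_kb = 512, .l2_assoc = 4,
        .l1_lat = 1, .l2_lat = 10, .mem_lat = 122,
    };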

Benchmark Suites
Name     Type
VPR      SPECint 2000
BZIP     SPECint 2000
GZIP     SPECint 2000
EQUAKE   SPECfp 2000
ART      SPECfp 2000
GAP      SPECint 2000
AMMP     SPECfp 2000
IRREG    PDE solver
Used to evaluate the transparency mechanisms and transparent software prefetching.

Transparency Mechanisms
Configurations evaluated:
- EP – equal priority
- SP – slot prioritization
- BP – background thread window partitioning (32 entries)
- BF – background thread flushing
- PC – private caches
- PP – private predictor

Transparency Mechanisms (slowdown imposed on the foreground thread)
- Equal priority (EP): 30% slowdown
- Slot prioritization (SP): 16% slowdown
- Background window partitioning (BP): 9% slowdown
- Background thread flushing (BF): 3% slowdown

Performance Mechanisms
Configurations evaluated (EP, 2P, 2F, 2B):
- Equal priority (EP)
- ICOUNT 2.8
- ICOUNT 2.8 with flushing
- Foreground thread window partitioning (112 foreground + 32 background entries)

Performance Mechanisms (normalized IPC)
- Equal priority (EP): 31% degradation
- ICOUNT 2.8: 41% slower than EP
- ICOUNT 2.8 + foreground thread flushing: 23% slower than EP
- Foreground thread window partitioning: 13% slower than EP

Transparent Software Prefetching
Conventional software prefetching in-lines the prefetch code into the computation loop:

    for (i = 0; i < N - PD; i += 8) {
        prefetch(&b[i]);
        b[i] = z + b[i];
    }

Transparent software prefetching offloads the prefetch code to a transparent thread:

    /* computation thread */
    for (i = 0; i < N - PD; i += 8) {
        b[i] = z + b[i];
    }

    /* transparent prefetch thread */
    for (i = 0; i < N - PD; i += 8) {
        prefetch(&b[i]);
    }

- In-lined prefetch code: benefit vs. cost (overhead) tradeoff, so profiling is required to decide where prefetching is profitable.
- Offloaded prefetch code: zero overhead, so no profiling is required.
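A runnable approximation of the idea (our sketch: a POSIX thread stands in for the spare SMT context the paper uses, and GCC's __builtin_prefetch for the ISA prefetch instruction; N and PD are illustrative; compile with -pthread):

    #include <pthread.h>
    #include <stdio.h>

    #define N  1000000
    #define PD 64               /* prefetch distance (illustrative) */

    static double b[N];
    static const double z = 3.0;

    static void *prefetch_thread(void *arg) {
        (void)arg;
        for (long i = 0; i < N - PD; i += 8)
            __builtin_prefetch(&b[i + PD]);  /* warm the cache ahead */
        return NULL;
    }

    int main(void) {
        pthread_t pf;
        pthread_create(&pf, NULL, prefetch_thread, NULL);
        for (long i = 0; i < N; i++)         /* computation thread */
            b[i] = z + b[i];
        pthread_join(pf, NULL);
        printf("b[0] = %.1f\n", b[0]);       /* 3.0 */
        return 0;
    }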

Transparent Software Prefetching (results)
- Naive conventional software prefetching (PF): 19.6% overhead, 0.8% performance gain
- Profiled/selective conventional software prefetching (PS): 14.13% overhead, 2.47% performance gain
- Transparent software prefetching (TSP): 1.38% overhead, 9.52% performance gain
[Chart: normalized execution time under NP (no prefetching), PF, PS, and TSP for VPR, BZIP, GAP, EQUAKE, ART, AMMP, IRREG]

Conclusions
- Transparency mechanisms: 3% overhead on the foreground thread; less than 1% without cache and predictor contention.
- Throughput mechanisms: within 23% of equal priority.
- Transparent software prefetching: 9.52% gain with 1.38% overhead; eliminates the need for profiling.
- Spare bandwidth is available and can be used transparently for interesting applications.

Related Work
- Tullsen's work on flushing mechanisms [Tullsen MICRO-2001]
- Raasch's work on prioritization [Raasch MTEAC Workshop 1999]
- Snavely's work on job scheduling [Snavely ICMM-2001]
- Chappell's work on subordinate multithreading [Chappell ISCA-1999] and Dubois's work on assisted execution [Dubois Tech Report Oct '98]

Foreground Thread Window Partitioning
- Advantage: a minimal number of entries is guaranteed.
- Disadvantage: transparency is minimized.
[Diagram: fetch hardware (PC1, PC2), fetch queue, issue queue, partitioned ROB]

Benchmark Suites
Name     Type           ROB Occupancy
VPR      SPECint 2000   Medium
BZIP     SPECint 2000   High
GZIP     SPECint 2000   Low
EQUAKE   SPECfp 2000    –
ART      SPECfp 2000    –
GAP      SPECint 2000   –
AMMP     SPECfp 2000    –
IRREG    PDE solver     –
Used to evaluate the transparency mechanisms and transparent software prefetching.

Transparency Mechanisms
[Per-benchmark results charts: EP, SP, and BF bars for each benchmark]

Transparent Software Prefetching
[Per-benchmark results chart: NP, PF, PS, TSP, and NF bars for each benchmark]