DLL-Conscious Instruction Fetch Optimization for SMT Processors
Fayez Mohamood, Mrinmoy Ghosh, Hsien-Hsin (Sean) Lee
School of Electrical and Computer Engineering, Georgia Institute of Technology

Presentation transcript:

DLL-Conscious Instruction Fetch Optimization for SMT Processors Fayez Mohamood Mrinmoy Ghosh Hsien-Hsin (Sean) Lee School of Electrical and Computer Engineering Georgia Institute of Technology

DLL-conscious Instruction Fetch, Mohamood 2
Dynamically Linked Libraries
An efficient way to develop software on a common platform
Modules that provide a set of services to application software
System DLLs help manage system functionality
Application DLLs enable flexibility and modularity

Name          Functionality
KERNEL32.DLL  Memory, IO, and interrupt functions
NTDLL.DLL     Core operating system functions
USER32.DLL    User interface functionality such as window handling and message passing
GDI32.DLL     Functions for creating 2-D graphics
MFC42.DLL     The Microsoft Foundation Classes used by many Windows applications

DLL-conscious Instruction Fetch, Mohamood 3
Shared Libraries
DLLs house major system and application functionality
A typical Microsoft Windows application uses 30 DLLs on average
An average of 20 DLLs are shared among different applications
Different applications share system DLLs on the same virtual page
[Figure: Process 0 and Process 1 address spaces, each holding its own application code plus the same shared system DLL]

DLL-conscious Instruction Fetch, Mohamood 4
Simultaneous Multithreading
Boosts instruction throughput with minimal hardware increase
Bottleneck due to resource sharing: the I-Cache, branch predictor, LSQ, ROB, etc. are shared
Commercial processors: IBM Power5, Intel Pentium 4, Alpha
The presence of DLLs exacerbates the I-Cache performance bottleneck

DLL-conscious Instruction Fetch, Mohamood 5
DLL Thrashing and Duplication
Virtual memory is supported by common desktop platforms
Virtually-indexed instruction caches accelerate lookup, but aliasing must be resolved in the I-Cache and the I-TLB
How can homonym aliasing be prevented?
Non-SMT processors can flush the cache/TLB upon a context switch
SMT processors require a Process ID (PID) or Address Space Identifier (ASID) to prevent access violations
The PID or ASID induces false misses when a different process looks up an instruction that is part of a shared DLL

DLL-conscious Instruction Fetch, Mohamood 6 X 0 X X DLL Thrashing and Duplication DLL Thrashing: In a direct-mapped I-Cache, shared DLL instructions will result in an increased number of conflict misses DLL Duplication: In a set-associative I-Cache, shared DLL instructions will exist in multiple locations resulting in wasted space Process 0: 0x1000 0x3453 Process 1: 0x1000 0x3453 PIDValidTagData 0 1 0x100 0x3453 X 0 X X 1 1 0x100 0x3453  FALSE EVICTION Process 0: 0x1000 0x3453 Process 1: 0x1000 0x3453 PIDValidTagData X 0 X X PIDValidTagData 0 1 0x100 0x x100 0x3453 DUPLICATION

DLL-conscious Instruction Fetch, Mohamood 7
DLL-Conscious Instruction Fetch
Program locality in the presence of DLLs is disturbed by PID matching
Goal: alleviate the DLL thrashing and/or duplication effects
We propose making the micro-architecture aware, with the capability to distinguish DLL from non-DLL instructions
DLL-Conscious Instruction Fetch:
A DLL (or L) bit in the page table and I-TLB
A modified OS page fault handler that sets the L bit for DLL pages
For VIVT caches, an L bit in each line of the I-Cache to facilitate faster translation
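The proposed hit check can be sketched as follows. This is our simplified reading of the scheme, not the exact hardware: each cache line carries an L bit (set via the modified page-fault handler for shared-DLL pages), and on lookup the PID comparison is bypassed whenever the L bit is set, so threads share DLL lines instead of thrashing:

```python
class DLLConsciousICache:
    """Direct-mapped I-Cache with the proposed L bit (illustrative sketch)."""

    def __init__(self, num_sets, line_bytes=32):
        self.num_sets = num_sets
        self.line_bytes = line_bytes
        self.lines = [None] * num_sets  # each entry: (pid, tag, l_bit) or None

    def fetch(self, pid, addr, is_dll):
        index = (addr // self.line_bytes) % self.num_sets
        tag = addr // (self.line_bytes * self.num_sets)
        entry = self.lines[index]
        if entry is not None:
            e_pid, e_tag, e_l = entry
            # L bit set: the line belongs to a shared DLL page, so any PID
            # may hit; otherwise the usual PID match still applies.
            if e_tag == tag and (e_l or e_pid == pid):
                return "hit"
        self.lines[index] = (pid, tag, is_dll)
        return "miss"


cache = DLLConsciousICache(num_sets=512)
print(cache.fetch(pid=0, addr=0x1000, is_dll=True))   # miss (cold)
print(cache.fetch(pid=1, addr=0x1000, is_dll=True))   # hit: L bit bypasses PID
print(cache.fetch(pid=0, addr=0x2000, is_dll=False))  # miss (cold, private)
print(cache.fetch(pid=1, addr=0x2000, is_dll=False))  # miss: private lines stay isolated
```

Note that non-DLL lines keep the full PID check, so ordinary per-process homonym protection is unchanged.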

DLL-conscious Instruction Fetch, Mohamood 8
VIVT I-Cache Optimization
[Figure: VIVT lookup path. Per-thread I-TLB entries hold Valid, Shared/L, VPN/PID, and PPN fields; each I-Cache line holds PID, V, L, Tag, and Data. The virtual address splits into virtual page number, L1 cache index, and block offset, and the I-L1 tag compare produces the hit.]
I-TLB lookup is necessary only upon an I-Cache miss

DLL-conscious Instruction Fetch, Mohamood 9
VIPT I-Cache Optimization
[Figure: VIPT lookup path. Per-thread I-TLB entries hold Valid, Shared/L, VPN/PID, and PPN fields; each I-Cache line holds V, Tag, and Data. The virtual address of the instruction splits into virtual page number, L1 cache index, and block offset for the I-L1 tag compare.]

DLL-conscious Instruction Fetch, Mohamood 10
VIPT Illustration
[Figure: walkthrough of the VIPT lookup for Process 0 and Process 1, both fetching the shared DLL instruction 0x3453 at virtual address 0x1000; the annotated lookup with tag 0x100 results in a MISS.]
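For the VIPT case the cache tag is physical, so the sharing logic lives in the I-TLB rather than in the cache lines. The sketch below is our assumed simplification: the entry layout (Valid, L, PID, VPN, PPN) follows the slide, but the field names, the fully-associative search, and the example PPN value 0x7F3 are ours. With the L bit set for a shared-DLL page, the translation hits regardless of which thread's PID performs the lookup:

```python
class ITLB:
    """Fully-associative I-TLB with the proposed L (shared) bit (sketch)."""

    def __init__(self):
        self.entries = []  # dicts: valid, l, pid, vpn, ppn

    def insert(self, pid, vpn, ppn, l_bit):
        self.entries.append({"valid": True, "l": l_bit,
                             "pid": pid, "vpn": vpn, "ppn": ppn})

    def translate(self, pid, vpn):
        for e in self.entries:
            # L bit set: shared-DLL page, so skip the PID comparison.
            if e["valid"] and e["vpn"] == vpn and (e["l"] or e["pid"] == pid):
                return e["ppn"]
        return None  # TLB miss: fall back to a page-table walk


tlb = ITLB()
tlb.insert(pid=0, vpn=0x1000, ppn=0x7F3, l_bit=True)  # shared-DLL page
print(hex(tlb.translate(pid=1, vpn=0x1000)))          # hit despite the PID mismatch
```

This mirrors the global-bit idea in commercial TLBs noted in the related-work slide, but driven by the OS marking DLL pages specifically.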

DLL-conscious Instruction Fetch, Mohamood 11 x86 SMT Out-Of-Order Performance Simulator x86 Out-Of-Order Performance Simulator Simulation Methodology Studying DLLs required the modeling of an entire platform TAXI: Trace Analysis for x86 Interpretation (by Vlaovic et al.) Bochs System Emulator Modified SimpleScalar with x86 front end Kernel Debugger to capture DLL behavior Bochs System Emulator Instruction Traces Memory Traces Instruction Traces Memory Traces

DLL-conscious Instruction Fetch, Mohamood 12
Simulation Parameters

Parameter             Value
Fetch/Decode width    4
Issue/Commit width    4
Branch predictor      2-level GAg, 512 entries
BTB                   4-way, 128 sets
L1 I-Cache            DM, 2-way, and 4-way; 16KB and 8KB; 32B line
L1 D-Cache            DM, 16KB, 32B line
L2 Cache              4-way, unified, 256KB, 64B line
L1/L2 latency         1 cycle / 6 cycles
Main memory latency   120 cycles
ROB size              48 entries

DLL-conscious Instruction Fetch, Mohamood 13
DLL Instruction Percentage
[Table: total instructions (millions) and system DLL instruction percentage per application; the numeric values did not survive transcription]
Applications measured: Adobe Acrobat Reader, MS PowerPoint, MS Word, MS Internet Explorer, MS Visual C, Netscape Communicator

DLL-conscious Instruction Fetch, Mohamood 14 DLL Usage Distribution

DLL-conscious Instruction Fetch, Mohamood 15
2-Way DLL I-Cache Misses
The number of misses per thread decreases by between 3.3 and 5.0 times for homogeneous threads
Heterogeneous threads decrease the number of misses by up to 2.5 times
[Charts: Homogeneous Threads | Heterogeneous Threads]

DLL-conscious Instruction Fetch, Mohamood 16
2-Way I-Cache Hit Rate
Overall I-Cache hit rate increased by 50% (from 30% to 47% for Netscape Communicator)
Homogeneous threads show promise for larger performance benefits
[Charts: Homogeneous Threads | Heterogeneous Threads]

DLL-conscious Instruction Fetch, Mohamood 17 4-Way I-Cache Misses and Hit Rate Misses per thread decrease by up to 5.5 times for homogeneous threads I-Cache hit rate improves by as much as 62% (from 28% to 47% for 4 instances of Acrobat Reader)

DLL-conscious Instruction Fetch, Mohamood 18 4-Way DLL IPC Improvement 4-Wide Machine: Up to 21% improvement 8-Wide Machine: Up to 24% improvement High Latency Machine: Up to 30% improvement

DLL-conscious Instruction Fetch, Mohamood 19 4-Way IPC Improvement 4-Wide Machine: Up to 10% improvement 8-Wide Machine: Up to 14% improvement High Latency Machine: Up to 15% improvement

DLL-conscious Instruction Fetch, Mohamood 20
Related Work
Execution Trace Characteristics of Windows NT Applications (Lee et al., ISCA 1998)
DLL BTB proposed by Vlaovic et al. (MICRO 2000)
OS techniques including Page Coloring and Bin Hopping (Lo et al., ISCA 1998)
Commercial implementations of a global bit for reducing context-switch overhead:
MIPS: (G)lobal bit in the TLB
ARM 1176: nG bit in the TLB for global data
Intel P6: PGE bit in the CR4 register

DLL-conscious Instruction Fetch, Mohamood 21
Conclusions & Contributions
Current and future generations of operating systems will be highly modular
Analyzed and quantified the effects of DLL thrashing and duplication
Devised a lightweight technique to reinstate DLL sharing in the processor micro-architecture
Evaluated the benefits using a complete system-level simulation methodology
2-way IPC improved by up to 10%; 4-way IPC improved by up to 15%
Exploiting system features is yet another way to continue providing performance boosts in processors

DLL-conscious Instruction Fetch, Mohamood 22
That's All Folks! Questions & Answers