A Performance Comparison of DRAM Memory System Optimizations for SMT Processors
Zhichun Zhu (ECE Department, Univ. Illinois at Chicago), Zhao Zhang (ECE Department, Iowa State Univ.)



Feb. 15, 2005, HPCA-11

DRAM Memory Optimizations
Optimizations at the DRAM side can make a big difference on single-threaded processors:
- Enhancement of the chip interface/interconnect
- Access scheduling [Hong et al. HPCA'99; Mathew et al. HPCA'00; Rixner et al. ISCA'00]
- DRAM-side locality [Cuppu et al. ISCA'99, ISCA'01; Zhang et al. MICRO'00; Lin et al. HPCA'01]

How Does SMT Impact the Memory Hierarchy?
- Less performance loss per cache miss to DRAM memory -- lower benefit from DRAM-side optimizations?
- But more cache misses due to cache contention -- much more pressure on main memory
- So is DRAM memory design more important, or not?

Outline
- Motivation
- Memory optimization techniques
- Thread-aware memory access scheduling
  - Outstanding request-based
  - Resource occupancy-based
- Methodology
- Memory performance analysis on SMT systems
  - Effectiveness of single-thread techniques
  - Effectiveness of thread-aware schemes
- Conclusion

Memory Optimization Techniques
- Page modes
  - Open page: good for programs with good locality
  - Close page: good for programs with poor locality
- Mapping schemes
  - Exploitation of concurrency (multiple channels, chips, banks)
  - Row-buffer conflicts
- Memory access scheduling
  - Reordering of concurrent accesses
  - Reduces average latency and improves bandwidth utilization
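The open- vs. close-page trade-off above can be sketched with a toy latency model (not the paper's simulator; it uses the 15 ns row/column/precharge figures from the simulation parameters and ignores bank-level overlap):

```python
# Toy model of row-buffer page policies on a single bank.
# Timings follow the slide deck: 15 ns row activate, 15 ns column, 15 ns precharge.
ROW, COL, PRE = 15, 15, 15

def open_page_latency(row_stream):
    """Open-page policy: the row stays open after each access, so a
    repeated access to the same row (a row-buffer hit) pays only the
    column latency; a conflict pays precharge + activate + column."""
    total, open_row = 0, None
    for row in row_stream:
        if row == open_row:
            total += COL                  # row-buffer hit
        else:
            if open_row is not None:
                total += PRE              # close the stale row first
            total += ROW + COL            # activate new row, then read
            open_row = row
    return total

def close_page_latency(row_stream):
    """Close-page policy: the row is precharged right after each access,
    so every access pays activate + column (precharge is hidden)."""
    return len(row_stream) * (ROW + COL)
```

A stream with good locality (`[1, 1, 1, 1]`) costs 75 ns open-page vs. 120 ns close-page, while a conflict-heavy stream (`[1, 2, 1, 2]`) costs 165 ns open-page vs. 120 ns close-page, matching the slide's "good locality vs. poor locality" rule of thumb.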

Memory Access Scheduling for Single-Threaded Systems
- Hit-first: a row-buffer hit has a higher priority than a row-buffer miss
- Read-first: a read has a higher priority than a write
- Age-based: an older request has a higher priority than a newer one
- Criticality-based: a critical request has a higher priority than a non-critical one
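The first three rules compose naturally as a lexicographic priority: hits before misses, then reads before writes, then oldest first. A minimal sketch (field names are illustrative, not from the paper):

```python
from dataclasses import dataclass

@dataclass
class Request:
    is_hit: bool    # targets the currently open row in its bank
    is_read: bool   # read vs. write
    arrival: int    # cycle the request entered the queue

def schedule(queue):
    """Pick the next request: hit-first, then read-first, then age-based.
    Python tuples compare element by element, and False < True, so
    'not r.is_hit' puts hits ahead of misses, and so on down the tuple."""
    return min(queue, key=lambda r: (not r.is_hit, not r.is_read, r.arrival))

queue = [Request(is_hit=False, is_read=True,  arrival=0),  # oldest, but a miss
         Request(is_hit=True,  is_read=False, arrival=5),  # hit, but a write
         Request(is_hit=True,  is_read=True,  arrival=9)]  # hit and read: wins
```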

Memory Access Concurrency with Multithreaded Processors
[Figure: processor-to-memory request streams, single-threaded vs. multithreaded]

Thread-Aware Memory Scheduling
- A new dimension in memory scheduling for SMT systems: consider the current state of each thread
- States related to memory accesses:
  - Number of outstanding requests
  - Number of processor resources occupied

Outstanding Request-Based Scheme
- Request-based: a request generated by a thread with fewer pending requests has a higher priority
[Figure: request timeline; hits H_A1 H_A2 H_B1 H_A3 H_A4 H_B2 reordered to H_A1 H_A2 H_A3 H_A4 H_B1 H_B2]

Outstanding Request-Based Scheme (cont.)
- Hit-first and read-first are applied on top
- For SMT processors, sustained memory bandwidth is more important than the latency of an individual access
[Figure: request timeline; H_A1 H_A2 M_B1 H_A3 H_A4 M_B2 reordered to H_A1 H_A2 H_A3 H_A4 M_B1 M_B2]
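Combining the stated rules, one plausible sketch of the outstanding request-based scheme (field names are illustrative): hit-first and read-first take precedence, and among otherwise-equal requests the thread with fewer pending memory requests wins the tie.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Request:
    thread: str
    is_hit: bool
    is_read: bool
    arrival: int

def schedule(queue):
    """Lexicographic priority: hit-first, read-first, then favor the
    request whose thread has the fewest outstanding requests in the
    queue, breaking remaining ties by age."""
    pending = Counter(r.thread for r in queue)   # outstanding requests per thread
    return min(queue, key=lambda r:
               (not r.is_hit, not r.is_read, pending[r.thread], r.arrival))

# Thread A has four pending hits, thread B only two: among equal hits,
# B's request is served first under the fewer-pending rule.
queue = ([Request("A", True, True, t) for t in range(4)]
         + [Request("B", True, True, t) for t in (1, 4)])
```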

Resource Occupancy-Based Scheme
- ROB-based: higher priority to requests from threads holding more ROB entries
- IQ-based: higher priority to requests from threads holding more IQ entries
- Hit-first and read-first are applied on top
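The ROB-based variant can be sketched the same way (names and structure are illustrative, not the paper's implementation): serving the thread that holds the most reorder-buffer entries first drains the shared pipeline resources fastest, with hit-first and read-first still taking precedence.

```python
def schedule(queue, rob_entries):
    """Resource occupancy-based priority sketch.
    rob_entries maps each thread to the ROB slots it currently holds;
    the negation makes higher occupancy sort earlier under min()."""
    return min(queue, key=lambda r:
               (not r["is_hit"], not r["is_read"],
                -rob_entries[r["thread"]], r["arrival"]))

queue = [{"thread": "A", "is_hit": True, "is_read": True, "arrival": 0},
         {"thread": "B", "is_hit": True, "is_read": True, "arrival": 3}]
rob = {"A": 40, "B": 120}   # B's loads are clogging the shared ROB
```

The IQ-based variant is identical in shape, just keyed on issue-queue occupancy instead of ROB occupancy.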

Methodology
- Simulator
  - SMT extension of sim-Alpha
  - Event-driven memory simulator (DDR SDRAM and Direct Rambus DRAM)
- Workloads
  - Mixtures of SPEC 2000 applications
  - 2-, 4-, and 8-thread workloads
  - "ILP", "MIX", and "MEM" workload mixes

Simulation Parameters
- Processor speed: 3 GHz
- Fetch width: 8 instructions
- Baseline fetch policy: DWarn.2.8
- Pipeline depth: 11
- Issue queue size: 64 Int., 32 FP
- Reorder buffer size: 256/thread
- Physical registers: 384 Int., 384 FP
- Load/store queue size: 64 LQ, 64 SQ
- L1 caches: 64 KB I/D, 2-way, 1-cycle latency
- L2 cache: 512 KB, 2-way, 10-cycle latency
- L3 cache: 4 MB, 4-way, 20-cycle latency
- MSHR entries: (16 + 4 prefetch)/cache
- Memory channels: 2/4/8
- Memory bandwidth/channel: 200 MHz, DDR, 16 B width
- Memory banks: 4 banks/chip
- DRAM access latency: 15 ns row, 15 ns column, 15 ns precharge

Workload Mixes
- 2-thread
  - ILP: bzip2, gzip
  - MIX: gzip, mcf
  - MEM: mcf, ammp
- 4-thread
  - ILP: bzip2, gzip, sixtrack, eon
  - MIX: gzip, mcf, bzip2, ammp
  - MEM: mcf, ammp, swim, lucas
- 8-thread
  - ILP: gzip, bzip2, sixtrack, eon, mesa, galgel, crafty, wupwise
  - MIX: gzip, mcf, bzip2, ammp, sixtrack, swim, eon, lucas
  - MEM: mcf, ammp, swim, lucas, equake, applu, vpr, facerec

Performance Loss Due to Memory Access
[Results figure]

Memory Access Concurrency
[Results figure]

Memory Channel Configurations
[Results figures, two slides]

Mapping Schemes
[Results figure]

Memory Access Concurrency
[Results figure]

Thread-Aware Schemes
[Results figure]

Conclusion
DRAM optimizations have a significant impact on the performance of SMT (and likely CMP) processors:
- They are most effective when a workload mix includes some memory-intensive programs
- Performance is sensitive to the memory channel organization
- DRAM-side locality is harder to exploit due to inter-thread contention
- Thread-aware access scheduling schemes do bring good performance