A Performance Comparison of DRAM Memory System Optimizations for SMT Processors
Zhichun Zhu, ECE Department, Univ. Illinois at Chicago
Zhao Zhang, ECE Department, Iowa State Univ.

DRAM Memory Optimizations
Optimizations at the DRAM side can make a big difference on single-threaded processors
  Enhancement of chip interface/interconnect
  Access scheduling [Hong et al. HPCA’99, Mathew et al. HPCA’00, Rixner et al. ISCA’00]
  DRAM-side locality [Cuppu et al. ISCA’99, ISCA’01, Zhang et al. MICRO’00, Lin et al. HPCA’01]
Feb. 15, 2005 HPCA-11

How does SMT Impact Memory Hierarchy?
Less performance loss per cache miss to DRAM memories
  – Lower benefit from DRAM-side optimizations?
But more cache misses due to cache contention
  – Much more pressure on main memory
Is DRAM memory design more important or not?

Outline
Motivation
Memory optimization techniques
Thread-aware memory access scheduling
  Outstanding request-based
  Resource occupancy-based
Methodology
Memory performance analysis on SMT systems
  Effectiveness of single-thread techniques
  Effectiveness of thread-aware schemes
Conclusion

Memory Optimization Techniques
Page modes
  Open page: good for programs with good locality
  Close page: good for programs with poor locality
Mapping schemes
  Exploitation of concurrency (multiple channels, chips, banks)
  Row buffer conflicts
Memory access scheduling
  Reordering of concurrent accesses
  Reducing average latency and improving bandwidth utilization
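The page-mode trade-off above can be illustrated with a small average-latency model. This is only a sketch under simplifying assumptions: every non-hit under open page is treated as a row-buffer conflict, the precharge under close page is assumed to be fully hidden, and queuing and bus transfer time are ignored. The function names are hypothetical; the 15 ns values mirror the "DRAM access latency" row of the Simulation Parameters slide.

```python
# Timing values in nanoseconds (assumed equal, as on the parameters slide).
T_ROW = 15.0        # row activation
T_COLUMN = 15.0     # column access
T_PRECHARGE = 15.0  # precharge


def open_page_latency(hit_rate: float) -> float:
    """Average latency when the row stays open after each access.

    Simplification: every non-hit is a row-buffer conflict that pays
    precharge + row activation + column access.
    """
    hit = T_COLUMN
    conflict = T_PRECHARGE + T_ROW + T_COLUMN
    return hit_rate * hit + (1.0 - hit_rate) * conflict


def close_page_latency() -> float:
    """Average latency when the bank is precharged right after each access.

    Simplification: the precharge is fully hidden, so every access pays
    row activation + column access.
    """
    return T_ROW + T_COLUMN


def break_even_hit_rate() -> float:
    """Row-buffer hit rate above which open page beats close page."""
    return T_PRECHARGE / (T_PRECHARGE + T_ROW)


if __name__ == "__main__":
    for rate in (0.1, 0.5, 0.9):
        print(rate, open_page_latency(rate), close_page_latency())
```

In this model open page wins only when the row-buffer hit rate exceeds T_PRECHARGE / (T_PRECHARGE + T_ROW), which is 0.5 for the assumed timings. That is one way to read "good locality" versus "poor locality" above.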

Memory Access Scheduling for Single-Threaded Systems
Hit-first
  A row buffer hit has a higher priority than a row buffer miss
Read-first
  A read has a higher priority than a write
Age-based
  An older request has a higher priority than a new one
Criticality-based
  A critical request has a higher priority than a non-critical one
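The policies above can be combined into a single sort key. The sketch below is illustrative only: the Request fields, the function names, and the relative ordering of the policies (hit-first, then read-first, then age) are assumptions, and the criticality-based policy is left out because the slide does not say how criticality is determined. A real controller would also re-evaluate row-buffer hits after every served request; here the hit status is a fixed snapshot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    name: str          # label used for printing
    arrival: int       # arrival time; smaller means older
    is_row_hit: bool   # True if the target row is already in the row buffer
    is_read: bool      # True for a read, False for a write


def priority_key(req: Request):
    """Sort key: hit-first, then read-first, then age-based (oldest first).

    False sorts before True, so hits and reads come first.
    """
    return (not req.is_row_hit, not req.is_read, req.arrival)


def schedule(pending):
    """Return the pending requests in the order they would be served."""
    return sorted(pending, key=priority_key)


if __name__ == "__main__":
    pending = [
        Request("W0", 0, True, False),
        Request("R1", 1, False, True),
        Request("R2", 2, True, True),
        Request("R3", 3, True, True),
    ]
    print([r.name for r in schedule(pending)])
```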

Memory Access Concurrency with Multithreaded Processors
[Figure: memory access concurrency; panels: Single-threaded, Multi-threaded]

Thread-Aware Memory Scheduling
New dimension in memory scheduling for SMT systems: considering the current state of each thread
States related to memory accesses
  Number of outstanding requests
  Number of processor resources occupied

Outstanding Request-Based Scheme
A request generated by a thread with fewer pending requests has a higher priority
[Figure: two request sequences along a time axis: HA1 HA2 HB1 HA3 HA4 HB2, and HB1 HB2 HA1 HA2 HA3 HA4]
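A minimal sketch of this rule follows, assuming that the first sequence in the figure is the arrival order, that the second is the scheduled order, and that a label such as HA1 denotes a request from thread A. The function name and the tie-break by arrival order are hypothetical. The outstanding-request counts are taken once as a snapshot, whereas a real controller would update them as requests complete.

```python
from collections import Counter


def request_based_order(pending):
    """Order requests so threads with fewer pending requests go first.

    `pending` is a list of (label, thread_id) pairs in arrival order.
    Ties are broken by arrival order (an assumption of this sketch).
    """
    outstanding = Counter(thread for _, thread in pending)
    indexed = list(enumerate(pending))
    indexed.sort(key=lambda item: (outstanding[item[1][1]], item[0]))
    return [label for _, (label, _) in indexed]


if __name__ == "__main__":
    arrivals = [("HA1", "A"), ("HA2", "A"), ("HB1", "B"),
                ("HA3", "A"), ("HA4", "A"), ("HB2", "B")]
    print(request_based_order(arrivals))
```

Thread B has two pending requests and thread A has four, so B's requests move to the front, which reproduces the second sequence in the figure.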

Outstanding Request-Based Scheme
Hit-first and read-first are applied on top
For SMT processors, sustained memory bandwidth is more important than the latency of an individual access
[Figure: two request sequences along a time axis: HA1 HA2 MB1 HA3 HA4 MB2, and MB1 MB2 HA1 HA2 HA3 HA4]

Resource Occupancy-Based Scheme
ROB-based
  Higher priority to requests from threads holding more ROB entries
IQ-based
  Higher priority to requests from threads holding more IQ entries
Hit-first and read-first are applied on top
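One possible reading of this scheme is sketched below. It assumes that "applied on top" means hit-first and read-first take precedence and that resource occupancy breaks the remaining ties, followed by age; the slide does not confirm that ordering. The Request fields, the entries_held mapping, and the function names are hypothetical. The same key covers the ROB-based and IQ-based variants, depending on which occupancy counts are passed in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    name: str          # label used for printing
    thread: str        # id of the issuing thread
    arrival: int       # arrival time; smaller means older
    is_row_hit: bool   # True if the target row is already in the row buffer
    is_read: bool      # True for a read, False for a write


def occupancy_key(req, entries_held):
    """Sort key for a resource occupancy-based scheme.

    `entries_held` maps a thread id to the number of ROB (or IQ) entries
    that thread currently holds. Assumed ordering: hit-first, read-first,
    larger occupancy first, then older request first.
    """
    return (not req.is_row_hit, not req.is_read,
            -entries_held[req.thread], req.arrival)


def schedule(pending, entries_held):
    """Return the pending requests in the order they would be served."""
    return sorted(pending, key=lambda r: occupancy_key(r, entries_held))


if __name__ == "__main__":
    rob_entries = {"A": 40, "B": 200}
    pending = [
        Request("A0", "A", 0, True, True),
        Request("B1", "B", 1, True, True),
        Request("B2", "B", 2, False, True),
        Request("A3", "A", 3, True, False),
    ]
    print([r.name for r in schedule(pending, rob_entries)])
```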

Methodology
Simulator
  SMT extension of sim-Alpha
  Event-driven memory simulator (DDR SDRAM and Direct Rambus DRAM)
Workload
  Mixture of SPEC 2000 applications
  2-, 4-, 8-thread workload
  “ILP”, “MIX”, and “MEM” workload mixes

Simulation Parameters
Processor speed: 3 GHz
Fetch width: 8 inst.
Baseline fetch policy: DWarn.2.8
Pipeline depth: 11
Issue queue size: 64 Int., 32 FP
Reorder buffer size: 256/thread
Physical register num: 384 Int., 384 FP
Load/store queue size: 64 LQ, 64 SQ
L1 caches: 64KB I/D, 2-way, 1-cycle latency
L2 cache: 512KB, 2-way, 10-cycle latency
L3 cache: 4MB, 4-way, 20-cycle latency
MSHR entries: (16+4 prefetch)/cache
Memory channels: 2/4/8
Memory BW/channel: 200 MHz, DDR, 16B width
Memory banks: 4 banks/chip
DRAM access latency: 15ns row, 15ns column, 15ns precharge
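A few derived numbers make these parameters easier to compare. The sketch below computes peak bandwidth per channel configuration and converts the DRAM timings into processor cycles. The function names and the three latency cases are assumptions for illustration; they ignore queuing delay, bus transfer time, and controller overhead, so they are lower bounds rather than simulated results.

```python
CPU_GHZ = 3.0                # processor speed: 3 GHz
BUS_HZ = 200e6               # memory bus clock: 200 MHz
TRANSFERS_PER_CLOCK = 2      # DDR: two transfers per bus clock
BUS_WIDTH_BYTES = 16         # 16B channel width
T_ROW_NS = 15                # row activation
T_COLUMN_NS = 15             # column access
T_PRECHARGE_NS = 15          # precharge


def peak_bandwidth_gb_per_s(channels: int) -> float:
    """Peak transfer rate in GB/s (10^9 bytes per second)."""
    bytes_per_s = BUS_HZ * TRANSFERS_PER_CLOCK * BUS_WIDTH_BYTES * channels
    return bytes_per_s / 1e9


def ns_to_cpu_cycles(ns: float) -> float:
    """Convert nanoseconds to processor cycles at CPU_GHZ."""
    return ns * CPU_GHZ


# Idealized DRAM latency for three row-buffer situations, in nanoseconds.
LATENCY_NS = {
    "row buffer hit": T_COLUMN_NS,
    "bank precharged": T_ROW_NS + T_COLUMN_NS,
    "row buffer conflict": T_PRECHARGE_NS + T_ROW_NS + T_COLUMN_NS,
}

if __name__ == "__main__":
    for channels in (2, 4, 8):
        print(channels, "channels:", peak_bandwidth_gb_per_s(channels), "GB/s")
    for case, ns in LATENCY_NS.items():
        print(case, ns, "ns =", ns_to_cpu_cycles(ns), "CPU cycles")
```

Under these assumptions one channel peaks at 6.4 GB/s, and a row buffer conflict costs 45 ns, or 135 processor cycles at 3 GHz, before any queuing.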

Workload Mixes
2-thread
  ILP: bzip2, gzip
  MIX: gzip, mcf
  MEM: mcf, ammp
4-thread
  ILP: bzip2, gzip, sixtrack, eon
  MIX: gzip, mcf, bzip2, ammp
  MEM: mcf, ammp, swim, lucas
8-thread
  ILP: gzip, bzip2, sixtrack, eon, mesa, galgel, crafty, wupwise
  MIX: gzip, mcf, bzip2, ammp, sixtrack, swim, eon, lucas
  MEM: mcf, ammp, swim, lucas, equake, applu, vpr, facerec

Performance Loss Due to Memory Access

Memory Access Concurrency

Memory Channel Configurations

Mapping Schemes

Memory Access Concurrency

Thread-Aware Schemes

Conclusion
DRAM optimizations have significant impacts on the performance of SMT (and likely CMP) processors
Mostly effective when a workload mix includes some memory-intensive programs
Performance is sensitive to memory channel organizations
DRAM-side locality is harder to exploit due to contention
Thread-aware access scheduling schemes do bring good performance
