A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy Jason Zebchuk, Elham Safi, and Andreas Moshovos

Presentation transcript:

1 A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy
Jason Zebchuk, Elham Safi, and Andreas Moshovos
{zebchuk,elham,moshovos}@eecg.toronto.edu
AENAO Research Group, Department of Electrical and Computer Engineering, University of Toronto

2 Conventional Block Centric Cache
- “Small” Blocks
  - Optimizes Bandwidth and Performance
- Large L2/L3 caches especially
[Figure: Fine-Grain View of Memory in the L2 Cache. Big Picture Lost.]

3 “Big Picture” View
- Region: 2^n-sized, aligned area of memory
- Patterns and behavior exposed
  - Spatial locality
- Exploit for performance/area/power
[Figure: Coarse-Grain View of Memory in the L2 Cache]
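The region definition above comes down to simple bit arithmetic. A minimal sketch in Python, assuming byte addresses and 1KB regions (n = 10, the region size the talk uses in its later example); the function names are illustrative and do not come from the talk:

```python
REGION_BITS = 10  # assumed: 2**10 bytes = 1KB regions


def region_base(addr, region_bits=REGION_BITS):
    """Return the first byte address of the aligned region containing addr."""
    return addr & ~((1 << region_bits) - 1)


def same_region(a, b, region_bits=REGION_BITS):
    """Two addresses share a region when all bits above the region offset match."""
    return (a >> region_bits) == (b >> region_bits)


print(hex(region_base(0x12345)))      # 0x12000
print(same_region(0x12000, 0x123FF))  # True: first and last byte of one region
print(same_region(0x123FF, 0x12400))  # False: 0x12400 starts the next region
```

Because regions are aligned powers of two, membership tests need only a shift or a mask, which is what makes region-level tracking cheap in hardware.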

4 Exploiting Coarse-Grain Patterns
- Many existing coarse-grain optimizations
- Add new structures to track coarse-grain information
- Hard to justify for a commercial design
[Figure: CPU and L2 Cache with a Coarse-Grain Framework, surrounded by optimizations: Stealth Prefetching, Flexible Snooping, Destination-Set Prediction, Spatial Memory Streaming, Coarse-Grain Coherence Tracking, RegionScout, Circuit-Switched Coherence]

5 Exploiting Coarse-Grain Patterns
- Embed coarse-grain information in tag array
- Support many different optimizations with less area overhead
- Adaptable optimization FRAMEWORK
[Figure: CPU and L2 Cache with the Coarse-Grain Framework]

6 RegionTracker Solution
- Manage blocks, but also track and manage regions
[Figure: L1 and L2 Cache. The L2 Tag Array is replaced by the Region Tracker next to the Data Array. Labeled flows: Block Requests, Data Blocks, Region Probes, Region Responses]

7 RegionTracker Summary
- Replace conventional tag array:
  - 4-core CMP with 8MB shared L2 cache
  - Within 1% of original performance
  - Up to 20% less tag area
  - Average 33% less energy consumption
- Optimization Framework:
  - Stealth Prefetching: same performance, 36% less area
  - RegionScout: 2x more snoops avoided, no area overhead

8 Road Map
- Introduction
- Goals
- Coarse-Grain Cache Designs
- RegionTracker: A Tag Array Replacement
- RegionTracker: An Optimization Framework
- Conclusion

9 Goals
1. Conventional Tag Array Functionality
  - Identify data block location and state
  - Leave data array unchanged
2. Optimization Framework Functionality
  - Is Region X cached?
  - Which blocks of Region X are cached? Where?
  - Evict or migrate Region X
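The three region-level questions in goal 2 can be read as a small query interface. The following is a toy software model only, not the hardware design from the talk; the class and method names are hypothetical, and 16 blocks per region is assumed (1KB regions with 64-byte blocks):

```python
BLOCKS_PER_REGION = 16  # assumed: 1KB region / 64-byte blocks


class RegionDirectory:
    """Toy model: per region, remember which blocks are cached and in which way."""

    def __init__(self):
        # region number -> list with one slot per block (None = not cached)
        self.regions = {}

    def insert_block(self, region, block, way):
        slots = self.regions.setdefault(region, [None] * BLOCKS_PER_REGION)
        slots[block] = way

    def is_region_cached(self, region):
        """Is Region X cached?"""
        return region in self.regions

    def cached_blocks(self, region):
        """Which blocks of Region X are cached, and where (block -> way)?"""
        slots = self.regions.get(region, [])
        return {b: w for b, w in enumerate(slots) if w is not None}

    def evict_region(self, region):
        """Evict Region X; return the block numbers that were removed."""
        slots = self.regions.pop(region, [])
        return [b for b, w in enumerate(slots) if w is not None]


d = RegionDirectory()
d.insert_block(region=7, block=0, way=3)
d.insert_block(region=7, block=15, way=1)
print(d.is_region_cached(7), d.cached_blocks(7), d.evict_region(7))
```

Each query is answered from a single per-region record, with no scan over block-level tags; the design requirements later in the talk ask for exactly this property.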

10 Coarse-Grain Cache Designs: Large Block Size
- Increased BW, Decreased hit-rates
[Figure: Region X in the Tag Array and Data Array]

11 Sector Cache
- Decreased hit-rates
[Figure: Region X in the Tag Array and Data Array]

12 Sector Pool Cache
- High Associativity (2 - 4 times)
[Figure: Region X in the Tag Array and Data Array]

13 Decoupled Sector Cache
- Region information not exposed
- Region replacement requires scanning multiple entries
[Figure: Region X in the Tag Array, Data Array, and Status Table]

14 Design Requirements
- Small block size (64B)
- Miss-rate does not increase
- Lookup associativity does not increase
- No additional access latency
  - (i.e., No scanning, no multiple block evictions)
- Does not increase latency, area, or energy
- Allows banking and interleaving
- Fit in conventional tag array “envelope”

15 RegionTracker: A Tag Array Replacement
- 3 SRAM arrays, combined smaller than tag array
  - Region Vector Array
  - Block Status Table
  - Evicted Region Buffer
[Figure: L1 and the Data Array, with the three arrays in place of the tag array]

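16 Common Case: Hit
Ex: 8MB, 16-way set-associative cache, 64-byte blocks, 1KB region
[Figure: The address is split into Region Tag, RVA Index, Region Offset, and Block Offset (bit-position labels: 49, 0, 6, 10, 21). The Region Vector Array (RVA) entry holds the Region Tag and, for each of block 0 through block 15, a way and a valid bit (V). The selected way plus address bits (labels: 19, 6, 0) form the Data Array + BST Index, which selects a status entry in the Block Status Table (BST) and the block in the Data Array. Field-width labels "14" and "32" did not survive extraction cleanly.]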
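The address split in the hit-path figure can be sketched in a few lines. This is a sketch under stated assumptions, not the authors' implementation: the figure's bit labels (0, 6, 10, 21, 49) are read as field boundaries, so the block offset is bits 0-5, the region offset is bits 6-9, the RVA index is bits 10-20, and the region tag is bit 21 and up. Only the 6-bit block offset (64-byte blocks), the 4-bit region offset (16 blocks per 1KB region), and the 13-bit data-array set index (8MB / 16 ways / 64B = 8192 sets) follow directly from the stated geometry. The 11-bit RVA index is an inference from the labels, and all names are illustrative.

```python
BLOCK_OFFSET_BITS = 6    # 64-byte blocks
REGION_OFFSET_BITS = 4   # 1KB region / 64B blocks = 16 blocks per region
RVA_INDEX_BITS = 11      # assumption: inferred from the figure's bit labels
DATA_SET_BITS = 13       # 8MB / (16 ways * 64B) = 8192 sets


def split_address(addr):
    """Split an address into the fields used for the RVA lookup."""
    block_offset = addr & ((1 << BLOCK_OFFSET_BITS) - 1)
    region_offset = (addr >> BLOCK_OFFSET_BITS) & ((1 << REGION_OFFSET_BITS) - 1)
    rva_shift = BLOCK_OFFSET_BITS + REGION_OFFSET_BITS
    rva_index = (addr >> rva_shift) & ((1 << RVA_INDEX_BITS) - 1)
    region_tag = addr >> (rva_shift + RVA_INDEX_BITS)
    return region_tag, rva_index, region_offset, block_offset


def data_set_index(addr):
    """Set index into the data array (and BST); the way comes from the RVA entry."""
    return (addr >> BLOCK_OFFSET_BITS) & ((1 << DATA_SET_BITS) - 1)


addr = (5 << 21) | (3 << 10) | (7 << 6) | 9
print(split_address(addr))   # (5, 3, 7, 9)
print(data_set_index(addr))  # 55
```

On a hit, the RVA entry selected by `rva_index` and matched on `region_tag` supplies the way for block `region_offset`; that way, combined with the data-array set index, locates both the block and its status entry.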

17 Worst Case (Rare): Region Miss
[Figure: Same address split (Region Tag, RVA Index, Region Offset, Block Offset; bit-position labels: 49, 0, 6, 10, 21) and Region Vector Array (RVA) lookup as in the hit case, but the Region Tag comparison reports "No Match!". The Evicted Region Buffer (ERB) and pointer (Ptr) fields alongside the Block Status Table (BST) status entries are involved in handling the miss. Labels "status 3" and "Ptr 2" did not survive extraction cleanly.]

18 Methodology
- Flexus simulator from CMU SimFlex group
  - Based on Simics full-system simulator
- 4-core CMP modeled after Piranha
  - Private 32KB, 4-way set-associative L1 caches
  - Shared 8MB, 16-way set-associative L2 cache
  - 64-byte blocks
- Miss-rates: Functional simulation of 2 billion instructions per core
- Performance and Energy: Timing simulation using SMARTS sampling methodology
- Area and Power: Full custom implementation on 130nm commercial technology
- 9 commercial workloads:
  - WEB: SpecWEB on Apache and Zeus
  - OLTP: TPC-C on DB2 and Oracle
  - DSS: 5 TPC-H queries on DB2
[Figure: Four processors (P), each with D$ and I$, connected through an Interconnect to the shared L2]

19 Miss-Rates vs. Area
- Sector Cache: 512KB sectors, SPC and RT: 1KB regions
- Trade-offs comparable to conventional cache
[Plot: Relative Miss-Rate vs. Relative Tag Array Area. Sector Cache falls at (0.25, 1.26); other points are labeled 14-way, 15-way, 52-way, and 48-way.]

20 Performance & Energy
- 12-way set-associative RegionTracker: 20% less area
- Error bars: 95% confidence interval
- Performance within 1%, with 33% tag energy reduction
[Plots: Performance (Normalized Execution Time) and Energy (Reduction in Tag Energy)]

21 Road Map
- Introduction
- Goals
- Coarse-Grain Cache Designs
- RegionTracker: A Tag Array Replacement
- RegionTracker: An Optimization Framework
- Conclusion

22 RegionTracker: An Optimization Framework
- Stealth Prefetching: Average 20% performance improvement
  - Drop-in RegionTracker for 36% less area overhead
- RegionScout: In-depth analysis
[Figure: L1, and the L2 built from RVA, ERB, BST, and Data Array]

23 Snoop Coherence: Common Case
- Many snoops are to non-shared regions
[Figure: CPUs and Main Memory. A CPU issues Read x, Read x+1, ..., Read x+n, and each broadcast results in a miss at the other CPUs.]

24 RegionScout
- Eliminate broadcasts for non-shared regions
[Figure: CPUs and Main Memory. Each CPU tracks Non-Shared Regions and Locally Cached Regions. On Read x, each remote CPU answers Region Miss, producing a Global Region Miss.]

25 RegionTracker Implementation
- Minimal overhead to support RegionScout optimization
- Still uses less area than conventional tag array
- Non-Shared Regions: Add 1 bit to each RVA entry
- Locally Cached Regions: Already provided by RVA
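The effect of the extra non-shared bit can be illustrated with a small filter model. This is a simplified sketch of the general idea, not the implementation evaluated in the talk: it assumes that a broadcast answered by a global region miss marks the region as non-shared, that later misses to that region skip the broadcast, and that a remote request to the region clears the mark. All names are hypothetical.

```python
class SnoopFilter:
    """Toy model of one node's non-shared-region tracking."""

    def __init__(self):
        self.non_shared = set()  # stands in for the extra bit per RVA entry
        self.broadcasts = 0
        self.avoided = 0

    def on_miss(self, region, remote_has_region):
        """Handle a local miss; return True if a snoop broadcast was sent."""
        if region in self.non_shared:
            self.avoided += 1        # known non-shared: go straight to memory
            return False
        self.broadcasts += 1
        if not remote_has_region:    # every remote node reported a region miss
            self.non_shared.add(region)
        return True

    def on_remote_request(self, region):
        """Another node touched the region, so it may now be shared."""
        self.non_shared.discard(region)


f = SnoopFilter()
for _ in range(4):
    f.on_miss(region=4, remote_has_region=False)
print(f.broadcasts, f.avoided)  # 1 3
```

Only the first miss to a private region pays for a broadcast in this model; the following misses to the same region are filtered locally.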

26 RegionTracker + RegionScout
- 4 processors, 512KB L2 Caches
- 1KB regions
- Avoid 41% of Snoop Broadcasts, no area overhead compared to conventional tag array
- BlockScout (4KB): New optimization possible with RegionTracker
[Plot: Reduction in Snoop Broadcasts]

27 Result Summary
- Replace Conventional Tag Array:
  - 20% Less tag area
  - 33% Less tag energy
  - Within 1% of original performance
- Coarse-Grain Optimization Framework:
  - 36% reduction in area overhead for Stealth Prefetching
  - Filter 41% of snoop broadcasts with no area overhead compared to conventional cache

28 Conclusion: Exploiting Coarse-Grain Patterns
- RegionTracker framework makes coarse-grain optimizations more attractive
[Figure: CPU and L2 Cache surrounded by optimizations: Stealth Prefetching, Run-time Adaptive Cache Hierarchy Management via Reference Analysis, Destination-Set Prediction, Spatial Memory Streaming, Coarse-Grain Coherence Tracking, RegionScout, Circuit-Switched Coherence]

29 A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy
Jason Zebchuk, Elham Safi, and Andreas Moshovos
{zebchuk,elham,moshovos}@eecg.toronto.edu
AENAO Research Group, Department of Electrical and Computer Engineering, University of Toronto

