Region-Centric Memory Design AENAO Research Group Patrick Akl, M.A.Sc. Ioana Burcea, Ph.D. C. Myrto Papadopoulou, M.A.Sc. C. Elham Safi, Ph.D. C. Jason.


1 Region-Centric Memory Design AENAO Research Group Patrick Akl, M.A.Sc. Ioana Burcea, Ph.D. C. Myrto Papadopoulou, M.A.Sc. C. Elham Safi, Ph.D. C. Jason Zebchuk, M.A.Sc. C. Andreas Moshovos {pakl, ioana, myrto, elham, zebchuk, moshovos}@eecg.toronto.edu

2 Sept. 11, 2007, JJPAR, Aenao Group/Toronto
Future On-Chip Caches: Just Larger?
[Diagram: three CPUs, each with I$ and D$, connected through an interconnect to Main Memory; label: 10s – 100s of MB]
Observe and Exploit Memory Access Behavior at a Coarse Grain

3 Conventional Block-Centric Memory Hierarchy
Conventional Fine-Grain Tracking
- “Small” Blocks
  - Performance and Bandwidth
- Several optimizations exist
Big picture is lost

4 “Big Picture” View
Supplemental Coarse-Grain Tracking
- Region: 2^n-sized, aligned memory area
  - Concept already in use: TLBs
- Patterns Emerge in Space / Time
  - Exploit for performance & power
  - Expose to software

5 This Presentation
- Examples of Coarse-Grain Optimizations
  - Snoop Coherence
  - Thread-level speculation disambiguation
- Region-Centric Memory Design
  - RegionTracker Cache
  - Snoop Coherence Revisited
- Current Activities
  - Coherence Delegation
  - Predictor Virtualization

6 An Example: Snoop Coherence
- Conventional Considerations: Complexity and Correctness, NOT Power/Bandwidth
- Can we: (1) Reduce Power/Bandwidth (2) Leverage snoop coherence?
- Remains Attractive: Simple / Design Re-use
Yes: Exploit Program Behavior to Dynamically Identify Requests that do not Need Snooping
[Diagram: three CPUs, each with I$ and D$, connected through an interconnect to Main Memory]

7 Coherence Basics
- Given request for memory block X (address)
- Detect where current value resides
[Diagram: CPUs and Main Memory; labels: snoop X, hit]

8 Conventional Coherence not Power-Aware/Bandwidth-Effective
All L2 tags see all accesses
- Perf. & Complexity: Have L2 tags, why not use them
- Power: All L2 tags consume power on all accesses
- Bandwidth: broadcast all coherent requests
[Diagram: CPUs with L2 caches and Main Memory; label: miss]

9 RegionScout Motivation: Sharing is Coarse
- Region: large contiguous memory area, power-of-2 size
- CPU asks for data block X in region R
  1. No one else has X
  2. No one else has any block in R
RegionScout Exploits this Behavior
Layered Extension over Snoop Coherence
[Figure: Typical Memory Space Snapshot; addresses colored by owner(s)]

10 Optimization Opportunities
- Power and Bandwidth
  - Originating node: avoid asking others
  - Remote node: avoid tag lookup
[Diagram: three CPUs, each with I$ and D$, connected through a SWITCH to Memory]

11 Potential: Region Miss Frequency
[Figure: Global Region Misses, as % of all requests, vs. Region Size]
Even with a 16K Region, ~45% of requests miss in all remote nodes

12 RegionScout at Work: Non-Shared Region Discovery
First request detects a non-shared region
[Diagram: CPUs and Main Memory with numbered steps 1, 2, 2, 3; labels: Region Miss, Global Region Miss, Record: Non-Shared Regions, Record: Locally Cached Regions]

13 RegionScout at Work: Avoiding Snoops
Subsequent request avoids snoops
[Diagram: CPUs and Main Memory with numbered steps 1, 2; labels: Global Region Miss, Record: Non-Shared Regions, Record: Locally Cached Regions]

14 RegionScout is Self-Correcting
Request from another node invalidates non-shared record
[Diagram: CPUs and Main Memory with numbered steps 1, 2, 2; labels: Record: Non-Shared Regions, Record: Locally Cached Regions]
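The three "RegionScout at Work" slides (discovery, snoop avoidance, self-correction) can be sketched as a small simulation. This is only an illustration under stated assumptions: the `Node` class, the `request` function, the 16K region size, and the exact (set-based) check for locally cached regions are all hypothetical stand-ins, and data transfer and coherence states are ignored.

```python
REGION_SHIFT = 14  # assumed 16K regions: lg(16K) = 14 offset bits


class Node:
    """Hypothetical node holding the two records named on the slides."""

    def __init__(self, name):
        self.name = name
        self.blocks = set()      # block addresses cached locally
        self.non_shared = set()  # "Record: Non-Shared Regions"

    def has_region(self, region):
        # Exact stand-in for "Record: Locally Cached Regions".
        return any(b >> REGION_SHIFT == region for b in self.blocks)


def request(requester, nodes, addr):
    """Issue a request; return True if a broadcast snoop was needed."""
    region = addr >> REGION_SHIFT
    if region in requester.non_shared:
        requester.blocks.add(addr)   # subsequent request: snoop avoided
        return False
    global_region_miss = True
    for node in nodes:
        if node is requester:
            continue
        node.non_shared.discard(region)  # self-correction at remote nodes
        if node.has_region(region):
            global_region_miss = False
    if global_region_miss:
        requester.non_shared.add(region)  # first request discovers the region
    requester.blocks.add(addr)
    return True


a, b = Node("A"), Node("B")
nodes = [a, b]
first = request(a, nodes, 0x4000)    # discovery: broadcast, global region miss
second = request(a, nodes, 0x4040)   # same region: no snoop needed
third = request(b, nodes, 0x4080)    # B's request invalidates A's record
fourth = request(a, nodes, 0x40C0)   # A must snoop again
```

After B touches the region, A's non-shared record is gone, so A falls back to ordinary snooping, which is the self-correcting behavior the slide describes.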

15 Implementation: Requirements
- Requesting Node provides address:
- At Originating Node – from CPU:
  - Have I discovered that this region is not shared?
- At Remote Nodes – from Interconnect:
  - Do I have a block in the region?
[Diagram: CPU address split into Region Tag and offset; the offset is lg(Region Size) bits wide]
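The address split on this slide can be illustrated with a few lines of code. This is a minimal sketch: the function name `split_address` and the default 16K region size are assumptions chosen for illustration.

```python
REGION_SIZE = 16 * 1024  # assumed region size in bytes (must be a power of two)


def split_address(addr, region_size=REGION_SIZE):
    """Return (region_tag, offset) for a byte address."""
    if region_size <= 0 or region_size & (region_size - 1):
        raise ValueError("region size must be a power of two")
    offset_bits = region_size.bit_length() - 1   # lg(Region Size)
    region_tag = addr >> offset_bits             # upper bits name the region
    offset = addr & (region_size - 1)            # lower bits locate the byte
    return region_tag, offset


tag, offset = split_address(0x12345678)
```

Two addresses belong to the same region exactly when their region tags are equal, which is the only comparison the originating and remote nodes need.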

16 Remembering Non-Shared Regions
Non-Shared Region Table
- Records non-shared regions
- Lookup by Region portion prior to issuing a request
- Snoop requests and invalidate
- Few entries: 16x4 in most experiments
[Diagram labels: address (Region Tag, offset), valid]
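A Non-Shared Region Table along these lines might look like the sketch below. It assumes that "16x4" means 16 sets of 4 ways, indexes sets by the low-order region-tag bits, and evicts the oldest entry in a full set; none of these choices is stated on the slide, and the class and method names are hypothetical.

```python
class NonSharedRegionTable:
    """Small set-associative table of region tags known to be non-shared."""

    def __init__(self, sets=16, ways=4):
        self.sets = [[] for _ in range(sets)]  # each set: tags, oldest first
        self.ways = ways

    def _set_for(self, region):
        return self.sets[region % len(self.sets)]  # assumed index function

    def lookup(self, region):
        """Checked before issuing a request: True means no snoop is needed."""
        return region in self._set_for(region)

    def record(self, region):
        """Remember a region after a global region miss."""
        entries = self._set_for(region)
        if region in entries:
            entries.remove(region)
        elif len(entries) == self.ways:
            entries.pop(0)                     # assumed policy: drop the oldest
        entries.append(region)

    def snoop_invalidate(self, region):
        """Another node requested a block in this region."""
        entries = self._set_for(region)
        if region in entries:
            entries.remove(region)


table = NonSharedRegionTable()
for region in (0, 16, 32, 48, 64):  # five regions that map to set 0
    table.record(region)
```

Dropping an entry is always safe in this sketch: the node simply snoops again for that region, so a small table costs opportunity but not correctness.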

17 What Regions are Locally Cached?
- If we had as many counters as regions:
  - Block Allocation: counter[region]++
  - Block Eviction: counter[region]--
  - Region cached only if counter[region] non-zero
- Not Practical:
  - E.g., 16K Regions and 4G Memory → 256K counters
[Diagram labels: Region Tag, offset, counter]
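The counter scheme and the 256K figure can be checked with a short sketch. The class name `ExactRegionCounters` is hypothetical, and a dictionary stands in for the impractically large counter array.

```python
from collections import defaultdict

MEMORY_BYTES = 4 * 2**30    # 4G of memory
REGION_BYTES = 16 * 2**10   # 16K regions
NUM_COUNTERS = MEMORY_BYTES // REGION_BYTES  # 2**32 / 2**14 = 2**18 = 256K


class ExactRegionCounters:
    """One counter per region: precise, but too large to build in hardware."""

    def __init__(self, region_bytes=REGION_BYTES):
        self.shift = region_bytes.bit_length() - 1
        self.counter = defaultdict(int)

    def allocate(self, block_addr):
        self.counter[block_addr >> self.shift] += 1   # counter[region]++

    def evict(self, block_addr):
        self.counter[block_addr >> self.shift] -= 1   # counter[region]--

    def region_cached(self, addr):
        return self.counter.get(addr >> self.shift, 0) != 0


exact = ExactRegionCounters()
exact.allocate(0x4000)
exact.allocate(0x4040)
exact.evict(0x4000)
```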

18 What Regions are Locally Cached?
[Diagram: Region Tag passes through hash() to select a counter; the offset is not used]
- Imprecise:
  - Records a superset of locally cached Regions
  - False positives: lost opportunity, correctness preserved
  - Small: e.g., 256 entries for 1M cache
- Power-Optimized structures described in the paper
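The hashed-counter idea can be sketched as follows. The class name, the 16K region size, and the choice of hash (low-order region-tag bits) are assumptions for illustration; the slide only says that some hash() selects the counter.

```python
class HashedRegionCounters:
    """Counters shared by all regions that hash to the same entry."""

    def __init__(self, entries=256, region_bytes=16 * 1024):
        self.shift = region_bytes.bit_length() - 1
        self.counters = [0] * entries

    def _index(self, addr):
        # Assumed hash: low-order bits of the region tag.
        return (addr >> self.shift) % len(self.counters)

    def allocate(self, block_addr):
        self.counters[self._index(block_addr)] += 1

    def evict(self, block_addr):
        self.counters[self._index(block_addr)] -= 1

    def may_be_cached(self, addr):
        """False is definite; True may be a false positive."""
        return self.counters[self._index(addr)] != 0


crh = HashedRegionCounters()
region_a = 1 << 14     # region tag 1
region_b = 257 << 14   # region tag 257, same entry as tag 1 with 256 entries
region_c = 2 << 14     # region tag 2, a different entry
crh.allocate(region_a)
```

A zero counter guarantees that no block of any region mapping to it is cached, so a remote node can answer "region miss" safely. A non-zero counter may be caused by a different region, which only costs an unnecessary tag lookup.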

19 LFSR-Based Implementation
[Diagram: Region Tag passes through hash() to select an LFSR, followed by a Zero Detector]
- Linear-Feedback Shift Register Array
  - Increment/Decrement/Is Zero?
- 130nm commercial technology
  - ISLPED ’06
  - Faster: 1.6x to 3.7x
  - More Energy Efficient: 1.4x to 2.3x
  - But Area: 3.2x
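An LFSR can act as a counter because stepping it forward or backward walks a fixed cycle of states, and "is zero" becomes a comparison against one chosen state. The sketch below assumes a 4-bit register with the recurrence x^4 + x + 1 and state 0001 as the encoding of zero; the width, polynomial, and encoding are illustrative choices, not the design from the ISLPED ’06 work.

```python
WIDTH = 4
MASK = (1 << WIDTH) - 1
ZERO_STATE = 0b0001  # assumed encoding of count 0


def lfsr_increment(state):
    """Step forward: shift left and feed back bit3 XOR bit2."""
    feedback = ((state >> 3) ^ (state >> 2)) & 1
    return ((state << 1) | feedback) & MASK


def lfsr_decrement(state):
    """Step backward: recover the bit that the forward step shifted out."""
    old_bit3 = ((state & 1) ^ (state >> 3)) & 1
    return (state >> 1) | (old_bit3 << 3)


def is_zero(state):
    return state == ZERO_STATE


# Walk the whole cycle starting from the zero encoding.
cycle = [ZERO_STATE]
for _ in range(14):
    cycle.append(lfsr_increment(cycle[-1]))
```

With this polynomial the register visits all 15 non-zero states before repeating, so it can represent counts 0 through 14. Incrementing past 14 or decrementing below 0 wraps around, so a real design would need a width that covers the largest possible count.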

20 Filter Rates: SPLASH-II
[Figure: Identified Global Region Misses vs. CRH Size]
Jason Cantin (Wisconsin) studied commercial workloads: 40% filter rate

21 Region-Centric Disambiguation
Joint work w/ Greg Steffan and Mihai Burcea
Patrick Akl, Andreas Moshovos

22 Speculative Parallelization Models
- Thread level speculation
- Transactional Memory
[Diagram: Original vs. Speculative Parallelization over time, with a Good Scenario and a Bad Scenario; accesses shown: write a, read b, write a, read a]
Need to Compare Addresses Across Code Pieces

23 Ex #2: Region-Centric Disambiguation
- Send digest at region level
  - Region-conflict
    - Send block-level info
- Reduced bandwidth, potential for performance and power
[Diagram: Task 1 and Task 2 accesses over the Memory Space, Conventional vs. Region-Centric]
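The two-step exchange on this slide (region digest first, block-level information only on a region conflict) can be sketched as below. The function names, the 1KB region size, and counting "items sent" as a traffic proxy are assumptions for illustration, not the evaluated mechanism.

```python
REGION_SHIFT = 10  # assumed 1KB regions


def region_digest(block_addrs):
    """Summarize a set of block addresses by the regions they touch."""
    return {addr >> REGION_SHIFT for addr in block_addrs}


def disambiguate(writes_task1, reads_task2):
    """Return (conflicting blocks, number of items task 1 had to send)."""
    write_digest = region_digest(writes_task1)
    conflict_regions = write_digest & region_digest(reads_task2)
    items_sent = len(write_digest)           # step 1: region-level digest
    if not conflict_regions:
        return set(), items_sent
    # Step 2: block-level info, only for regions that conflicted.
    block_info = {a for a in writes_task1
                  if a >> REGION_SHIFT in conflict_regions}
    items_sent += len(block_info)
    return block_info & set(reads_task2), items_sent


writes = {0x1000, 0x1040, 0x1080}            # three blocks in one region
no_conflict = disambiguate(writes, {0x8000})
conflict = disambiguate(writes, {0x1040, 0x8000})
```

In the conflict-free case one digest entry replaces three block addresses, which is where the bandwidth saving would come from. When regions do conflict, the block-level exchange still finds the exact dependence.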

24 How Much Traffic Can We Save?
- TLS benchmarks from STAMPEDE group (G. Steffan)
- Approximate timing model
Potential for traffic reduction by 38%

25 Exploiting Region-Level Information
- Region Coherence Arrays
  - Cantin, Lipasti and Smith
- RegionScout
  - Our work
  - Both of these reduce snoop lookups (and broadcasts) in snoop coherence protocols
- Spatial Memory Prefetching
  - Leverages spatial memory patterns for prefetching with commercial workloads
  - Impetus Group at CMU
- Stealth Prefetching
  - Cantin, Lipasti and Smith

26 Coarse-Grain Techniques Today
- Overhead
  - Storage: e.g., 60% of tags
  - Functionality: Restrict placement, Region Evictions
- Loss of Information
Hard to justify for a commercial design
[Diagram: CPU with I$ and D$; Conventional Cache (TAGS, DATA) with Auxiliary Tracking]

27 Rethinking Cache Design
- Can we provide a common substrate for all these optimizations?
- Redesign caches:
  - Regions a first class citizen
- RegionTracker Cache
[Diagram: CPU with I$ and D$; cache with Dual-Grain TAGS, DATA, and Embedded Tracking]

28 RegionTracker Cache
- Goals
  - Expose region behavior
    - Is region X cached?
    - Which blocks are?
  - Facilitate management at the region level
    - Evict/migrate region X
    - Do something with all blocks in X
- Constraints:
  - Data movement only at the block level
  - No increase in area
  - No decrease in performance
  - Complexity
  - Associativity

29 Region-Based Caches
- Start with conventional 16-way cache and replace tag array
- Sector Caches
  - Hit rate suffers: 20% loss
- Sector Pool Caches
  - High Associativity: 48-way for matching a 16-way cache
- Decoupled-Sector Caches
  - No coarse-grain info
  - Replacements require searching
- No previous design is adequate
- RegionTracker:
  - Meets all requirements
  - But does not save as much tag resources

30 Sector Cache
- Reduced Area and Power
- Increased miss-rates (2.5% - 96% for 1kB sectors)
- Replacement?
[Diagram labels: D-way Data, D-way Region Tags, RVA, Data Array]

31 Sector Pool Cache
- M > D
- Requires highly associative cache to achieve same performance as RegionTracker (~48-way)
[Diagram labels: D-way Data, Data Array, M-way Region Tags, RVA, DSR]

32 Decoupled-Sectored Cache
- Has multiple block evictions
  - Requires scanning “status” array
  - No simple mechanism to avoid this
- Does NOT expose region-level information

33 RegionTracker
- In practice L <= D
- Decouple Data and Lookup organizations
- Lower Associativity lookups with no hit-rate penalty
- RegionTracker provides complete solution
[Diagram labels: D-way Data, Data Array, L-way Region Tags, RVA, DSR]

34 RegionTracker Cache
[Diagram labels: L1, RVA, ERB, Data Array, BST]
- Block and Region Lookups
- Region Tag + Way Per Block
- Evict Region Blocks Lazily
- Simplify replacement and reduce area
- Status per block + RVA set backpointer
- Can be banked and partitioned

35 Region-Aware Cache: Performance vs. Area
- Commercial workloads: DB2, Oracle, TPC-C and TPC-H, Apache, Zeus
- SimICS + SimFlex, Sampling, 2K Regions
[Figure: Performance vs. Area]

36 RegionTracker-RegionScout
- One bit per Region tag: Known to be not shared
- 1KB Regions, Commercial workloads
- 512KB L2 private caches
[Figure: Reduction in Broadcasts; label: BlockScout]
Filter 41% of snoops at “Zero Cost” compared to conventional cache

37 Directory Optimizations
Base Architecture
[Diagram labels: Core, L2 Tags, L2 Data, Directory, L3 Tags, L3 Data, DRAM]

38 Coherence Delegation
- Eliminate 3-hop overhead
- Attract directory tracking to nodes
[Diagram labels: Requesting Node, Directory Lookup, Remote L2 containing data, Ideal Path]

39 Predictor Virtualization
[Diagram: many CPUs, each with L1-D and L1-I caches; labels: L2, Interconnect, Main Memory, Optimization Engines: Predictors]

40 Motivating Trends
- Chip multiprocessors
  - Space dedicated to predictors × #processors
- Larger predictor table
  - Increased performance
- Memory hierarchies
  - Increased capacities
Use conventional memory hierarchies to store predictor information

41 PV Architecture
[Diagram labels: Optimization Engine, Predictor Table, entry, index, prediction]

42 PV Architecture
[Diagram labels: Optimization Engine, Predictor Virtualization, entry, index, prediction]

43 PV Architecture
[Diagram labels: Optimization Engine, entry, index, prediction, PVStart + index, PVCache, MSHR, PVProxy, L2, Main Memory, PVTable]
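One way to read this diagram: the predictor table lives in ordinary memory starting at PVStart, an index is turned into an address, and a small PVCache keeps recently used entries next to the optimization engine. The sketch below follows that reading. The class name, the 8-byte entry size, LRU replacement, write-through updates, and the dictionary standing in for the L2 and main memory are all assumptions.

```python
from collections import OrderedDict


class VirtualizedPredictorTable:
    """Predictor table stored in memory, fronted by a small PVCache."""

    def __init__(self, memory, pv_start, entry_bytes=8, cache_entries=8):
        self.memory = memory            # dict: address -> prediction
        self.pv_start = pv_start        # base address of the PVTable
        self.entry_bytes = entry_bytes  # assumed entry size
        self.cache_entries = cache_entries
        self.pvcache = OrderedDict()    # address -> prediction, LRU order
        self.memory_requests = 0

    def _address(self, index):
        return self.pv_start + index * self.entry_bytes

    def _fill(self, addr, prediction):
        self.pvcache[addr] = prediction
        self.pvcache.move_to_end(addr)
        if len(self.pvcache) > self.cache_entries:
            self.pvcache.popitem(last=False)   # evict least recently used

    def lookup(self, index):
        addr = self._address(index)
        if addr in self.pvcache:
            self.pvcache.move_to_end(addr)
            return self.pvcache[addr]
        self.memory_requests += 1              # miss: fetch from the hierarchy
        prediction = self.memory.get(addr)
        self._fill(addr, prediction)
        return prediction

    def update(self, index, prediction):
        addr = self._address(index)
        self.memory[addr] = prediction         # write-through for simplicity
        self._fill(addr, prediction)


memory = {}
pv = VirtualizedPredictorTable(memory, pv_start=0x100000, cache_entries=2)
pv.update(3, "A")
hit = pv.lookup(3)               # served by the PVCache
pv.update(4, "B")
pv.update(5, "C")                # pushes entry 3 out of the 2-entry PVCache
refetched = pv.lookup(3)         # served from memory
```

The dedicated on-chip storage in this sketch is only the PVCache, while the full table occupies regular memory.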

44 Virtualized Spatial Memory Streaming
- Original Prefetcher: Cost: 80KB
- Virtualized Prefetcher: Cost: <1KB
Nearly Identical Performance

45 Region-Centric Memory Design AENAO Research Group Patrick Akl, M.A.Sc. C. Ioana Burcea, Ph.D. C. Myrto Papadopoulou, M.A.Sc. C. Elham Safi, Ph.D. C. Jason Zebchuk, M.A.Sc. C. Andreas Moshovos {pakl, ioana, myrto, elham, zebchuk, moshovos}@eecg.toronto.edu

46 Summary
- Caches are getting larger
- Time to look at the “big picture”
- Region-Centric Memory Design
  - Expose region-level info
  - Allow management at the region-level
- RegionScout
  - Eliminate broadcasts for snoop coherence
- Region-Centric Disambiguation
  - Reduce bandwidth for TLS or TM
- Region-Aware Memory
  - “Same” area and performance as conventional + region info
- Predictor Virtualization

