
1 Managing Distributed, Shared L2 Caches through OS-Level Page Allocation
Jason Bosko, March 5th, 2008
Based on "Managing Distributed, Shared L2 Caches through OS-Level Page Allocation" by Sangyeun Cho and Lei Jin, IEEE/ACM International Symposium on Microarchitecture (MICRO), December 2006.

2 Outline
- Background and Motivation
- Page Allocation
- Specifics of Page Allocation
- Evaluation of Page Allocation
- Conclusion

3 Motivation
- With multicore processors, on-chip memory design and management become crucial.
- Increasing L2 cache sizes result in non-uniform cache access latencies, which complicate the management of these caches.

4 Private Caches
- Each cache slice is associated with a specific processor core.
- Data must be replicated across slices as different cores access it.
- Advantages? Data is always close to the processor, reducing hit latency.
- Disadvantages? Replication limits overall cache space, resulting in more capacity misses.
[Figure: blocks 0-15 in memory, with copies held in per-core cache slices T0-T3]

5 Shared Caches
- Each memory block maps to one (and only one) cache slice, which all processors access: S = A mod N, for block address A and N slices.
- Advantages? Increases the effective L2 cache size, and coherence protocols are easier to implement (data exists in only one place).
- Disadvantages? Requested data is not always close, so hit latency increases; network traffic increases due to accesses to distant slices.
[Figure: blocks 0-15 in memory interleaved across cache slices T0-T15 by S = A mod N]

6 Page Allocation
- Add another level of indirection: pages! Built on top of a shared cache architecture.
- Use the physical page number (PPN) to map physical pages to cache slices: S = PPN mod N.
- The OS controls the mapping of virtual pages to physical pages. If the OS knows which cache slice each physical page maps to, it can place any virtual page on whichever slice it desires! (See the sketch below.)
[Figure: virtual pages a-h mapped to physical pages 0-15, whose PPNs determine the cache slice]
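
To make the indirection concrete, here is a minimal Python sketch assuming a 16-slice tiled chip. The function names and the free-page pool are illustrative assumptions; only the two mod-N mappings come from the slides.

```python
# Sketch: block-interleaved mapping vs. OS-steerable page-granularity mapping.
N_SLICES = 16        # one L2 slice per core on a 4x4 tiled CMP

def slice_of_block(block_addr: int) -> int:
    """Conventional shared cache: memory block A maps to slice A mod N."""
    return block_addr % N_SLICES

def slice_of_page(ppn: int) -> int:
    """Page-granularity mapping: physical page PPN maps to slice PPN mod N."""
    return ppn % N_SLICES

def pick_ppn_for_slice(free_ppns: list, target_slice: int) -> int:
    """The OS's lever: to place a virtual page on a chosen slice, allocate a
    free physical page whose PPN is congruent to that slice (mod N)."""
    for ppn in free_ppns:
        if ppn % N_SLICES == target_slice:
            free_ppns.remove(ppn)
            return ppn
    raise MemoryError("no free physical page maps to the requested slice")

free = list(range(64))              # hypothetical pool of free physical pages
ppn = pick_ppn_for_slice(free, 5)   # pin this page's cache home to tile 5
assert slice_of_page(ppn) == 5
```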

7 How does Page Allocation work?
- A congruence group CG_i is the partition of physical pages that map to processor core i.
- Each congruence group maintains a "free list" of available pages.
- Private caching: when processor i requests a page, allocate a free page from CG_i.
- Shared caching: when any page is requested, allocate a page from any CG.
- Hybrid caching: split the CGs into K clusters, keeping track of which CG belongs to which cluster; when a page is requested, allocate from any CG in the requester's cluster.
- All of this is controlled by the OS without any additional hardware support! (A sketch of the three policies follows.)
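
A sketch of the three policies, assuming simple per-group free lists. The data structures and names are illustrative (the slides specify only the policy rules), and the fixed cluster size k is one plausible reading of "split the CGs into K groups".

```python
# Sketch of private/shared/hybrid allocation on top of congruence groups.
from collections import deque
import random

N = 16  # number of cores / cache slices

# CG[i] holds the free physical pages whose PPN mod N == i
CG = [deque(ppn for ppn in range(1024) if ppn % N == i) for i in range(N)]

def alloc_private(core: int) -> int:
    """Private-like: always allocate from the requester's own group."""
    return CG[core].popleft()

def alloc_shared() -> int:
    """Shared-like: allocate from any group that still has free pages."""
    candidates = [i for i in range(N) if CG[i]]
    return CG[random.choice(candidates)].popleft()

def alloc_hybrid(core: int, k: int) -> int:
    """Hybrid: partition the N groups into clusters of k consecutive groups;
    allocate from any non-empty group in the requesting core's cluster."""
    cluster = [i for i in range(N) if i // k == core // k and CG[i]]
    return CG[random.choice(cluster)].popleft()
```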

8 Page Spreading & Page Spilling
- If the OS always allocates pages from the CG corresponding to the requesting processor, the system behaves like a private cache.
- Page spreading: the OS deliberately directs some allocations to cache slices in other cores in order to increase the effective cache size.
- Page spilling: when the available pages in a CG drop below some threshold, the OS may be forced to allocate pages from another group.
- Each tile sits on a tier corresponding to its distance from the target tile; tier-1 tiles are the target's immediate neighbors. (A sketch of the tier-ordered fallback follows.)
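
A sketch of the tier-ordered fallback, with an assumed threshold value and tier table; the slides state the idea, not these specifics.

```python
# Sketch of tier-ordered allocation with spreading and spilling.
SPILL_THRESHOLD = 32   # assumed: min free pages a group keeps in reserve

def allocate_near(core: int, CG, tiers) -> int:
    """tiers[core] is a list of tile lists ordered by distance:
    tiers[core][0] = [core] itself, tiers[core][1] = its neighbors, ..."""
    # Page spreading: prefer the closest group with healthy occupancy.
    for tier in tiers[core]:
        for tile in tier:
            if len(CG[tile]) > SPILL_THRESHOLD:
                return CG[tile].popleft()
    # Page spilling: every nearby group is low; take any remaining page.
    for tier in tiers[core]:
        for tile in tier:
            if CG[tile]:
                return CG[tile].popleft()
    raise MemoryError("out of physical pages")
```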

9 Cache Pressure
- Add hardware support for counting "unique" page accesses in a cache slice.
- But weren't we supposed to need no hardware support? True, but a little still doesn't hurt!
- When the measured cache pressure is high, new pages are allocated to other tiles on the same tier, or to tiles on the next tier. (A software analogue of the metric is sketched below.)
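
A software analogue of the pressure metric: a set of touched pages stands in for the hardware counter. The interval reset and the capacity ratio are assumptions for illustration.

```python
# Sketch: cache pressure as distinct pages touched per slice per interval.
class PressureMonitor:
    def __init__(self, slice_capacity_pages: int):
        self.capacity = slice_capacity_pages
        self.pages_seen = set()

    def on_l2_access(self, ppn: int) -> None:
        self.pages_seen.add(ppn)      # count each unique page once

    def pressure(self) -> float:
        """Roughly, > 1.0 means the working set exceeds the slice."""
        return len(self.pages_seen) / self.capacity

    def new_interval(self) -> None:
        self.pages_seen.clear()       # periodic reset, like a hardware counter
```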

10 Home allocation policy
- The profitability of choosing a home cache slice depends on several factors:
  - Recent miss rates of the L2 caches
  - Recent network contention levels
  - Current page allocation
  - QoS requirements
  - Processor configuration (number of processors, etc.)
- The OS can easily find the cache slice with the highest profitability, as in the sketch below.
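
One plausible way to score candidates: the factor list comes from the slide, while the linear cost form and the weights are purely illustrative assumptions.

```python
# Sketch: rank candidate home slices by a weighted cost; lower cost = better.
def profitability(s: int, miss_rate, contention, hops,
                  w_miss=1.0, w_net=0.5, w_dist=0.25) -> float:
    """Higher is better; each input sequence is indexed by slice id."""
    return -(w_miss * miss_rate[s] + w_net * contention[s] + w_dist * hops[s])

def best_home(candidates, miss_rate, contention, hops) -> int:
    """Pick the most profitable slice among the candidates."""
    return max(candidates,
               key=lambda s: profitability(s, miss_rate, contention, hops))
```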

11 Virtual Multicore (VM)
- For parallel applications, the OS should coordinate page allocation to minimize latency and traffic: schedule the parallel application onto a set of cores in close proximity (one possible placement search is sketched below).
- When cache pressure increases, pages can still be allocated outside of the VM.
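
A sketch of one possible proximity-aware placement, assuming a 4x4 mesh and square VM shapes; the slide only requires that the cores be close together.

```python
# Sketch: pick a compact square block of free cores as the virtual multicore.
import math

def pick_vm(n_threads: int, free_cores: set, mesh_w: int = 4):
    """Return a square block of free cores that can hold n_threads, or None."""
    side = math.ceil(math.sqrt(n_threads))
    for top in range(mesh_w - side + 1):
        for left in range(mesh_w - side + 1):
            block = [(top + r) * mesh_w + (left + c)
                     for r in range(side) for c in range(side)]
            if all(t in free_cores for t in block):
                return block
    return None  # no compact placement available; fall back to any cores

print(pick_vm(4, set(range(16))))   # e.g. [0, 1, 4, 5]: a 2x2 corner block
```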

12 Hardware Support
- The best feature of OS-level page allocation is that it can be built on a simple shared cache organization with no hardware support.
- But additional hardware support can still be leveraged:
  - Data replication
  - Data migration
  - Bloom filters

13 Evaluation
- Used the SimpleScalar tool set to model a 4x4 mesh multicore processor chip.
- Demand paging: every memory access is checked against the allocated pages; on the first access to an unallocated page, a physical page is allocated according to the desired policy (see the sketch below).
- No page spilling was ever experienced.
- Used single-threaded, multiprogrammed, and parallel workloads:
  - Single-threaded: a variety of SPEC2K integer and floating-point benchmarks
  - Multiprogrammed: one core (core 5 in the experiments) runs a target benchmark, while the other cores run a synthetic benchmark that continuously generates memory accesses
  - Parallel: SPLASH-2 benchmarks
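
A sketch of the demand-paging step as described, with hypothetical names; the allocation callback stands in for whichever policy is being evaluated.

```python
# Sketch: lazy page-table fill; first touch triggers policy-driven allocation.
page_table = {}   # virtual page number -> physical page number

def translate(vpn: int, allocate_page) -> int:
    """Map a virtual page, allocating a physical page on first access."""
    if vpn not in page_table:
        page_table[vpn] = allocate_page()   # policy under study picks the PPN
    return page_table[vpn]
```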

14 Performance on single-threaded workloads
- PRV: private caches (PRV8: 8MB cache size instead of 512KB)
- SL: shared caches
- SP: OS-based page allocation
  - SP-RR: round-robin allocation
  - SP-80: 80% of pages allocated locally, 20% spread across tier-1 cores

15 Performance on single-threaded workloads
- Decreased sharing = higher miss rate
- Decreased sharing = less on-chip traffic

16 Performance on multiprogrammed workloads
- SP40-CS: uses controlled spreading to keep unrelated pages from spreading onto cores that hold the target application's data.
- The synthetic benchmarks produce low, mid, or high traffic.
- SP40 usually performs better under high traffic, but its performance is similar to SL under low traffic.
- Not shown here: SP40 reduces on-chip network traffic by 50% compared to SL.

17 Performance on parallel workloads
- VM: virtual multicore with round-robin page allocation across the participating cores.
- lu and ocean have higher L1 miss rates, so the L2 cache policy had a greater effect on their performance.
- For the other benchmarks there is no real difference, but on lu and ocean VM outperforms the rest!

18 Related Issues
- Remember NUMA? It used a page scanner that maintained reference counters and generated page faults to let the OS take some control.
- In CC-NUMA, hardware-based counters informed OS decisions.
- Big difference: NUMA deals with main memory, while the OS-level page allocation presented here deals with distributed L2 caches.

19 Conclusion
- Page allocation allows for a very simple shared cache architecture, but how can we use advances in architecture to our benefit?
  - Architecture can provide more detailed information about the current state of the cores
  - CMP-NuRAPID, victim replication, cooperative caching
- Can we apply other OS-level modifications as well?
  - Page coloring and page recoloring
- We are trading hardware complexity for software complexity: where is the right balance?

