Address Translation for Manycore Systems

Address Translation for Manycore Systems
Scott Beamer Henry Cook CS258 Final Presentation May 14th, 2008

ParLab Background Parallel (manycore) is coming, how can we use this opportunity to accomplish high level computing goals? productive, efficient, correct Context: Mobile Consumer Device Low power Single socket Bursty Workloads Quality of Service and Response Time important

Problem Statement Modern processors want translation (from VM to PM), how does this scale to parallel? When a PTE that may be cached in many places is modified, the caches (TLBs) need be kept consistent Differences from cache coherence problem Invalidations are much less frequent Translation can be performed anywhere Removes it from the critical path In ParLab, we are using partitions Spatially dividing tiled cores to work on a single app Shared L2 cache provided within a partition

Coherence Method: Shootdown
Use a conventional TLB per core On a PTE modification, broadcast: Interrupt all other processors Force them flush relevant entries from their TLB’s Modification cannot be completed until all processors comply and respond Can work with any TLB/cache configuration, but synchronization costs are high In modern SMP OS, software handler is responsible for shootdown

Coherence Method: Validation
Allows cached translations to get stale and fixes them at memory controller Every TLB entry stores a timestamp for its translation On a PTE modification, update a generation count associated with the page On a memory access: Translation timestamp is checked at memory controller Outdated translations are fixed and the TLB with the outdated translation is updated Only gets gain with virtual caches Virtual cache could save energy because fewer TLB lookups are needed On context switch virtual cache must be flushed Other overhead as well

Better Schemes Shared Hierarchal Hybrid Let several cores share a TLB
Could benefit from constructive interference L2 is already shared, so TLB could be shared at that level L1 would have to be virtual Hierarchal Add a second or third level TLB to reduce reload penalty Hybrid

Methodology Virtutech Simics system simulator PARSEC
ISA functional simulator enhanced with memory hierarchy and TLB timing modules Can measure latencies from memory access events, count coherence messages 4, 8, 16, 32, 64, 128 SPARC processor systems Running unmodified Solaris 10 Measure behavior over 1B cycles PARSEC Princeton Application Repository for Shared Memory Computers

Applications

Results - Basic Blackscholes, 128 entry

Results – Application

Results – TLB Size

Results – Invalidation Rate
1000x rate leads to 1000x increase in communication, but still not visible For 5% need roughly a PTE write per 1000 cycles 64 entry

Results – Traffic Comparison

Future Work Investigate the “32 problem” Further explore design space
Complete validation scheme Experiment across sharing levels Experiment across levels of hierarchy More applications Several other PARSEC apps recently working Multiple kernels at same time to show time multiplexing

Conclusion TLB size most important observed factor so far
Application has some effect Invalidation rate and type has less effect TLB coherence network traffic insignificant Shootdown not bad as a first pass

Address Translation for Manycore Systems

Similar presentations

Presentation on theme: "Address Translation for Manycore Systems"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Address Translation for Manycore Systems

Similar presentations

Presentation on theme: "Address Translation for Manycore Systems"— Presentation transcript:

Similar presentations

About project

Feedback