Mihai Burcea, J. Gregory Steffan, Cristiana Amza

The Potential for Variable-Granularity Access Tracking for Optimistic Parallelism
Mihai Burcea, J. Gregory Steffan, Cristiana Amza
University of Toronto
MSPC 2008

Getting the Most Out of Your CPUs
- Chip multiprocessors (CMPs) are ubiquitous, e.g. AMD Barcelona quad-core, Intel Kentsfield quad-core
- How do we exploit all this parallelism?
- How do we improve sequential applications?

Optimistic Parallelism
- Flavors:
  - Transactional Memory (TM)
  - Thread-Level Speculation (TLS)
- Implementations: hardware, software, hybrid
- Common required support:
  - Buffering speculative memory changes
  - Tracking and detecting memory access conflicts

Traditional Access Tracking
- Most approaches use some fixed granularity
  - Hardware TM/TLS: cache-line size, typically 32/64/128 bytes
  - Software TLS: word-, object-level
  - Software TM: word/page/object granularity
  - Hybrid TM: mixture of above (in HW/SW)
- Is fixed granularity the best approach?

Can We Reduce The Overhead of Dependence Tracking?
- Fine granularity: too much overhead
- Coarse granularity: too many false conflicts
- Key intuition: the "best" granularity likely varies within and across benchmarks

False Conflicts when Using Uniform Coarse Granularity
- Measured in a TLS simulator; 32/64/128 = cache line sizes (bytes)
- Uniform coarse-grain approach suffers false conflicts
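A minimal sketch of how a false conflict arises at cache-line granularity. The addresses and helper names (`blocks`, `conflicts`) are hypothetical and only illustrate the idea; this is not the simulator's detection logic.

```python
def blocks(addresses, granularity):
    """Map byte addresses to the block IDs tracked at the given granularity."""
    return {addr // granularity for addr in addresses}


def conflicts(writes, reads, granularity):
    """Blocks written by an earlier thread and read by a later one (RAW)."""
    return blocks(writes, granularity) & blocks(reads, granularity)


# Hypothetical trace: the earlier thread writes one word,
# the later thread reads the neighbouring word.
writes = [0x1000]
reads = [0x1004]

for g in (4, 32, 64, 128):
    print(g, len(conflicts(writes, reads, g)))
# At 4 bytes the two words are tracked separately (0 conflicts);
# at 32, 64 and 128 bytes they share a block, so 1 (false) conflict is reported.
```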

Is there potential for a variable-granularity approach?

Goals Of Our Work
- Show potential for Variable-Granularity Access Tracking (VGAT)
  - Finest grain too expensive; which coarse grain?
- Show that ideal granularity varies across and within applications
  - Suggests need for dynamic, adaptive scheme
- Show significant reduction in number of tracked memory ranges when using VGAT

Related Work
- Hardware TLS / TM: track accesses at cache-line size (32/64/128 bytes)
  - Stampede (Steffan et al., ACM Trans. 2005), Speculative Versioning Cache (Vijaykumar et al., HPCA 1998)
  - Unbounded TM (Ananian et al., HPCA 2005), LogTM (Moore et al., HPCA 2006)
- Software TLS:
  - Word (Cintra et al., PPoPP 2003)
  - Object (Pickett et al., LCPC 2005)
- Software TM:
  - Word (McRT-STM, Saha et al., PPoPP 2006)
  - Page (Manassiev et al., PPoPP 2006)
  - Object: RSTM (Marathe et al., PLDI 2006), DSTM (Herlihy et al., PODC 2003)
- Most systems use fixed or object grain, but not necessarily the best

Related Work – Bulk Disambiguation
- Ceze et al., ISCA 2006
- Encode read/write sets into signatures
- Detect conflicts by performing operations on signatures (fast)
- Hashing (encoding) addresses into signatures admits false positives
- Reduces conflict-detection traffic, but increases false conflicts
- Our goal: minimize false conflicts
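A rough illustration of why signature-based detection can report false positives. The 16-bit width and the single modulo hash are assumptions chosen for brevity, not the encoding used by Ceze et al.; `signature` and `may_conflict` are hypothetical names.

```python
SIG_BITS = 16  # hypothetical signature width


def signature(addresses, bits=SIG_BITS):
    """Encode a set of byte addresses into a fixed-width bit signature."""
    sig = 0
    for addr in addresses:
        # Hash: word address modulo the signature width (assumed, for illustration).
        sig |= 1 << ((addr >> 2) % bits)
    return sig


def may_conflict(sig_a, sig_b):
    """Signature intersection: fast, but may report conflicts that do not exist."""
    return (sig_a & sig_b) != 0


write_sig = signature([0x1000])

# Different addresses that hash to the same bit: a false positive.
print(may_conflict(write_sig, signature([0x2000])))  # True

# Neighbouring word that hashes to a different bit: no conflict reported.
print(may_conflict(write_sig, signature([0x1004])))  # False
```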

Variable Granularity Access Tracking
- Approaches: vary granularity across
  - Time: parts of applications (speculative code regions)
  - Space: ranges of memory
- Can potentially reduce:
  - Tracking storage
  - Tracking traffic
  - Commit latency
  - False conflicts

Impact On Conflicts Of Increasing Granularity
Granularity (bytes)   Number of conflicts
4                     100    <- true (actual) conflicts
8                            <- same number of conflicts, still OK
16                    103    <- extra (false) conflicts!
32                    120

Ideal Granularity: the coarsest granularity that incurs no false conflicts
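The definition of ideal granularity can be sketched as a small search over per-granularity conflict counts. The function name `ideal_granularity` is hypothetical, and the count for 8 bytes is an assumption: it is taken to equal the 4-byte count, as the "same number of conflicts" annotation suggests.

```python
def ideal_granularity(conflicts_by_granularity):
    """Return the coarsest granularity whose conflict count still equals
    the count at the finest granularity (i.e. no false conflicts)."""
    sizes = sorted(conflicts_by_granularity)
    true_conflicts = conflicts_by_granularity[sizes[0]]
    ideal = sizes[0]
    for size in sizes[1:]:
        if conflicts_by_granularity[size] > true_conflicts:
            break  # first granularity that adds false conflicts
        ideal = size
    return ideal


# Counts from the slide; the 8-byte entry is assumed (see lead-in).
counts = {4: 100, 8: 100, 16: 103, 32: 120}
print(ideal_granularity(counts))  # 8
```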

Measuring the Potential for VGAT

Experimental Framework
- TLS simulator (CMU)
- Subset of SpecINT2000 benchmarks
  - Instrumented for TLS
  - TLS regions mostly loop-based
  - TLS regions pre-selected based on 32-byte reading and 4-byte writing granularity
- Focus on specific aspects:
  - Simulate first billion instructions
  - Track only Read-After-Write dependences
- Speculative code regions pre-selected for 32 bytes -> our results are conservative!

Variable Granularity at Code Region Level
[Figure: three speculative code regions, each bracketed by fork and join, each using a single granularity for all the memory it accesses: Region 1 at 4 bytes, Region 2 at 32 bytes, Region 3 at 8 bytes.]

Ideal Granularity at Code Region Level
[Figure: ideal granularity per code region, ranging from word-level through cache-line level to page-level (4 KB).]
- Code regions with no conflicts not shown in figure (in parentheses)
- Ideal granularity varies significantly between code regions

Variable Granularity Across Memory Ranges
[Figure: the same three speculative code regions (fork / join), with the memory accessed by each region divided into ranges tracked at different granularities: 4 bytes, 8 bytes, 32 bytes.]

Ideal Granularity Across Memory Ranges
- Cache-line size sometimes good, sometimes not
- Word-level rarely necessary
- Page-level often sufficient
- Ideal granularity varies widely across memory ranges

Can VGAT improve performance?

Reducing the Number of Tracked Elements by using Variable Granularity
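[Figure: reduction in the number of tracked elements with VGAT; value labels on the chart: 458, 61, 31, 50, 51, 35, 9, 5, 3.]
- Gmean: 2.88/5.17/9.15/35.03
- VGAT can reduce the number of tracked elements by more than 3x!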
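A small sketch of where the reduction in tracked elements comes from: each memory range is tracked at its own ideal granularity instead of one uniform fine granularity. The ranges, sizes, and the 4-byte baseline are hypothetical and are not taken from the measured results.

```python
def tracked_elements(ranges, granularity=None):
    """Count tracked elements over (size_bytes, ideal_granularity) ranges.

    With `granularity` set, every range uses that uniform granularity;
    otherwise each range uses its own ideal granularity.
    """
    total = 0
    for size, ideal in ranges:
        g = granularity if granularity is not None else ideal
        total += -(-size // g)  # ceiling division
    return total


# Hypothetical memory ranges: (size in bytes, ideal granularity in bytes).
ranges = [(4096, 4096), (256, 32), (64, 4)]

fixed = tracked_elements(ranges, granularity=4)  # uniform word-level tracking
variable = tracked_elements(ranges)              # per-range granularity
print(fixed, variable)  # 1104 25
```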

Ongoing Work
- Should memory-centric or code-centric accesses determine granularity?
- Dynamic, adaptive system for deciding granularity based on iterative sampling
  - How best to use and store profile information
  - May tolerate some percentage of false conflicts
- Hardware TLS
  - Reduce conflict-detection traffic, possibly power
- Software TM (lock-based)
  - Reduce number of locks: save space and time
  - Reduce lock contention

Conclusions (for Stampede TLS)
- TM/TLS systems with only fixed coarse granularity may suffer many false conflicts
  - 2x – 4x on average
- Variable granularity can reduce false conflicts and tracking overhead
  - 3x – 35x reduction in tracked ranges
- Ideal granularity varies widely across memory ranges and speculative code regions

Thank you! Questions?
