Mihai Burcea, J. Gregory Steffan, Cristiana Amza

The Potential for Variable-Granularity Access Tracking for Optimistic Parallelism
Mihai Burcea, J. Gregory Steffan, Cristiana Amza
University of Toronto
MSPC 2008

Getting the Most Out of Your CPUs
- Ubiquitous chip multiprocessors (CMPs), e.g. AMD Barcelona quad-core and Intel Kentsfield quad-core
- How do we exploit all this parallelism?
- How do we improve sequential applications?

Optimistic Parallelism
- Flavors: Transactional Memory (TM), Thread-Level Speculation (TLS)
- Implementations: hardware, software, hybrid
- Common required support:
  - Buffering speculative memory changes
  - Tracking and detecting memory access conflicts

Traditional Access Tracking
- Most approaches use some fixed granularity
- Hardware TM/TLS: cache-line size, typically 32/64/128 bytes
- Software TLS: word or object level
- Software TM: word, page, or object granularity
- Hybrid TM: mixture of the above (in HW/SW)
- Is fixed granularity the best approach?

Can We Reduce the Overhead of Dependence Tracking?
- Fine granularity: too much overhead
- Coarse granularity: too many false conflicts
- Key intuition: the "best" granularity likely varies within and across benchmarks

False Conflicts when Using Uniform Coarse Granularity
- Measured in a TLS simulator; 32/64/128 = cache-line sizes (bytes)
- A uniform coarse-grain approach suffers false conflicts
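
To make the notion of a false conflict concrete, here is a minimal Python sketch. It is not the TLS simulator used for the measurements: the function names, the trace format (lists of byte addresses), and the sample addresses are all illustrative assumptions. A speculative read counts as a conflict when it falls in the same tracking block as a write by the earlier thread, and any conflict that disappears at 4-byte (word) granularity is treated as false.

```python
WORD_BYTES = 4  # assumed finest tracking granularity (one word)


def count_conflicts(writes, reads, granularity):
    """Count reads that land in a block written by the earlier thread.

    writes, reads: iterables of byte addresses (hypothetical trace format).
    granularity: tracking block size in bytes (a power of two).
    """
    written_blocks = {addr // granularity for addr in writes}
    return sum(1 for addr in reads if addr // granularity in written_blocks)


def count_false_conflicts(writes, reads, granularity):
    """Conflicts reported at `granularity` that do not exist at word level."""
    true_conflicts = count_conflicts(writes, reads, WORD_BYTES)
    return count_conflicts(writes, reads, granularity) - true_conflicts


# Made-up example trace: only the read of 0x1000 truly depends on a write.
writes = [0x1000, 0x1040]
reads = [0x1000, 0x1004, 0x1010, 0x1044, 0x2000]

for g in (4, 32, 64, 128):
    print(g, count_conflicts(writes, reads, g), count_false_conflicts(writes, reads, g))
```

In this toy trace the single true dependence is joined by three false ones as soon as tracking moves from words to 32-byte lines, because neighbouring reads share a line with the two writes.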

Is there potential for a variable-granularity approach?

Goals of Our Work
- Show the potential for Variable-Granularity Access Tracking (VGAT)
  - Finest grain is too expensive; which coarse grain?
- Show that the ideal granularity varies across and within applications
  - Suggests the need for a dynamic, adaptive scheme
- Show a significant reduction in the number of tracked memory ranges when using VGAT

Related Work
- Hardware TLS/TM: track accesses at cache-line size (32/64/128 bytes)
  - Stampede (Steffan et al., ACM Trans. 2005), Speculative Versioning Cache (Vijaykumar et al., HPCA 1998)
  - Unbounded TM (Ananian et al., HPCA 2005), LogTM (Moore et al., HPCA 2006)
- Software TLS:
  - Word (Cintra et al., PPoPP 2003)
  - Object (Pickett et al., LCPC 2005)
- Software TM:
  - Word (McRT-STM, Saha et al., PPoPP 2006)
  - Page (Manassiev et al., PPoPP 2006)
  - Object: RSTM (Marathe et al., PLDI 2006), DSTM (Herlihy et al., PODC 2003)
- Most systems use fixed or object grain, which is not necessarily the best

Related Work – Bulk Disambiguation (Ceze et al., ISCA 2006)
- Encode read/write sets into signatures
- Detect conflicts by performing operations on signatures (fast)
- Hashing (encoding) addresses into signatures admits false positives
- Reduces conflict-detection traffic, but increases false conflicts
- Our goal: minimize false conflicts
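
The trade-off behind signature-based detection can be sketched in a few lines of Python. This is not the Bulk hardware design: the 64-bit width, the single modulo hash, and the function names are assumptions chosen only to show how compressing an address set into a bit vector makes the conflict check cheap while allowing false positives.

```python
SIGNATURE_BITS = 64  # assumed signature width, for illustration only


def encode(addresses):
    """Hash a set of byte addresses into a fixed-width bit signature."""
    signature = 0
    for addr in addresses:
        word = addr >> 2                           # word (4-byte) index
        signature |= 1 << (word % SIGNATURE_BITS)  # toy hash: index modulo width
    return signature


def may_conflict(signature_a, signature_b):
    """True if any bit is set in both signatures (possibly a false positive)."""
    return (signature_a & signature_b) != 0


write_sig = encode({0x1000})
print(may_conflict(write_sig, encode({0x1000})))  # same address: real conflict
print(may_conflict(write_sig, encode({0x1004})))  # neighbouring word: different bit
print(may_conflict(write_sig, encode({0x1100})))  # different address, same bit
```

The last check reports a conflict although the two address sets are disjoint, which is the kind of false positive the slide refers to. The check never misses a real overlap, since a shared address sets the same bit in both signatures.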

Variable Granularity Access Tracking
- Approaches: vary granularity across
  - Time: parts of applications (speculative code regions)
  - Space: ranges of memory
- Can potentially reduce:
  - Tracking storage
  - Tracking traffic
  - Commit latency
  - False conflicts

Impact on Conflicts of Increasing Granularity
Granularity (bytes)   Number of conflicts
4                     100 (true, i.e. actual, conflicts)
8                     same number of conflicts as at 4 bytes, still OK
16                    103 (extra, i.e. false, conflicts)
32                    120 (extra, i.e. false, conflicts)
Ideal granularity: the coarsest granularity that incurs no false conflicts
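
The definition of ideal granularity on this slide can be written as a short Python function. This is a sketch under stated assumptions: the function name and the dictionary input are illustrative, and the 8-byte count of 100 is inferred from the slide's "same number of conflicts" annotation rather than read from it.

```python
def ideal_granularity(conflicts_by_granularity):
    """Return the coarsest granularity that adds no conflicts over the finest one.

    conflicts_by_granularity: dict mapping granularity in bytes to the
    number of conflicts measured at that granularity.
    """
    granularities = sorted(conflicts_by_granularity)
    true_conflicts = conflicts_by_granularity[granularities[0]]
    ideal = granularities[0]
    for g in granularities[1:]:
        if conflicts_by_granularity[g] > true_conflicts:
            break  # first granularity that introduces false conflicts
        ideal = g
    return ideal


# Counts from the slide; the value for 8 bytes is an assumption (see above).
measured = {4: 100, 8: 100, 16: 103, 32: 120}
print(ideal_granularity(measured))  # 8
```

The loop stops at the first granularity whose count exceeds the word-level count, which relies on conflict counts never decreasing as the tracking block grows.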

Measuring the Potential for VGAT

Experimental Framework
- TLS simulator (CMU)
- Subset of SPECint2000 benchmarks
  - Instrumented for TLS
  - TLS regions mostly loop-based
  - TLS regions pre-selected based on 32-byte read and 4-byte write granularity
- Focus on specific aspects:
  - Simulate the first billion instructions
  - Track only read-after-write (RAW) dependences
- Speculative code regions were pre-selected for 32 bytes, so our results are conservative

Variable Granularity at Code Region Level
[Figure: three speculative code regions, each delimited by fork and join, with the memory accessed by each region tracked at a single per-region granularity]
- Region 1: 4 bytes
- Region 2: 32 bytes
- Region 3: 8 bytes

Ideal Granularity at Code Region Level
[Figure: ideal granularity per code region, with word-level, cache-line-level, and page-level (4 KB) marked]
- Code regions with no conflicts are not shown in the figure (in parentheses)
- Ideal granularity varies significantly between code regions

Variable Granularity Across Memory Ranges
[Figure: the three speculative code regions (fork/join) and the memory each accesses, with granularity varying by memory range: 4 bytes, 8 bytes, 32 bytes]

Ideal Granularity Across Memory Ranges
- Cache-line size is sometimes good, sometimes not
- Word-level is rarely necessary
- Page-level is often sufficient
- Ideal granularity varies widely across memory ranges

Can VGAT improve performance?

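Reducing the Number of Tracked Elements by Using Variable Granularity
[Figure: reduction in tracked elements per benchmark; values surviving from the chart: 458, 61, 31, 50, 51, 35, 9, 5, 3]
- Geometric mean: 2.88 / 5.17 / 9.15 / 35.03
- VGAT can reduce the number of tracked elements by more than 3x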
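
Why variable granularity shrinks the tracking state can be illustrated with a small Python sketch. Everything here is assumed for illustration: the function names, the per-range granularity choice, and the access trace are made up and are not taken from the measured benchmarks.

```python
def tracked_elements(addresses, granularity_for):
    """Count distinct tracking entries needed to cover the accessed addresses.

    granularity_for: function mapping a byte address to the tracking
    granularity (bytes) chosen for the memory range containing it.
    """
    entries = set()
    for addr in addresses:
        g = granularity_for(addr)
        entries.add((g, addr // g))  # one entry per (granularity, block)
    return len(entries)


def uniform_word(addr):
    return 4  # baseline: everything tracked at word level


def variable(addr):
    # Hypothetical per-range choice: one 4 KB page tolerates page-level
    # tracking, everything else stays at word level.
    return 4096 if 0x10000 <= addr < 0x11000 else 4


# Made-up access trace: 256 words in the coarse page plus 4 words elsewhere.
trace = [0x10000 + 4 * i for i in range(256)] + [0x20000 + 4 * i for i in range(4)]

fine = tracked_elements(trace, uniform_word)
coarse = tracked_elements(trace, variable)
print(fine, coarse, fine / coarse)
```

The ratio printed for this toy trace says nothing about the real benchmarks; it only shows the mechanism, where one coarse entry replaces many word-level entries in ranges that tolerate it.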

Ongoing Work
- Should memory-centric or code-centric accesses determine granularity?
- Dynamic, adaptive system for deciding granularity based on iterative sampling
  - How best to use and store profile information
  - May tolerate some percentage of false conflicts
- Hardware TLS
  - Reduce conflict-detection traffic, and possibly power
- Software TM (lock-based)
  - Reduce the number of locks, saving space and time
  - Reduce lock contention
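
The slides do not describe how the adaptive scheme would work, so the following Python sketch is purely hypothetical. It shows one possible shape of "iterative sampling with a tolerated false-conflict percentage": start coarse, sample, and refine until the sampled rate is acceptable. The function name, the sampler callback, the starting granularity, and the 5% tolerance are all assumptions.

```python
def adapt_granularity(sample_false_conflict_rate, start=4096, finest=4, tolerance=0.05):
    """Halve the granularity until the sampled false-conflict rate is tolerable.

    sample_false_conflict_rate: callable taking a granularity in bytes and
    returning the fraction of sampled conflicts that were false (hypothetical).
    """
    granularity = start
    while granularity > finest and sample_false_conflict_rate(granularity) > tolerance:
        granularity //= 2  # refine and sample again
    return granularity


# Made-up profile: false conflicts vanish at 64 bytes and below.
def fake_sampler(granularity):
    return 0.0 if granularity <= 64 else 0.2


print(adapt_granularity(fake_sampler))
```

A real system would also have to decide when to re-sample and whether to apply the result per code region or per memory range, which is exactly the open question the slide raises.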

Conclusions (for Stampede TLS)
- TM/TLS systems with only fixed coarse granularity may suffer many false conflicts
  - 2x – 4x on average
- Variable granularity can reduce false conflicts and tracking overhead
  - 3x – 35x reduction in tracked ranges
- Ideal granularity varies widely across memory ranges and speculative code regions

Thank you! Questions?