Aritra Sengupta, Man Cao, Michael D. Bond and Milind Kulkarni PPPJ 2015, Melbourne, Florida, USA Toward Efficient Strong Memory Model Support for the Java.

Slides:

Advertisements

Similar presentations

Dataflow Analysis for Datarace-Free Programs (ESOP 11) Arnab De Joint work with Deepak DSouza and Rupesh Nasre Indian Institute of Science, Bangalore.

Advertisements

TRAMP Workshop Some Challenges Facing Transactional Memory Craig Zilles and Lee Baugh University of Illinois at Urbana-Champaign.

A Randomized Dynamic Program Analysis for Detecting Real Deadlocks Pallavi Joshi  Chang-Seo Park  Koushik Sen  Mayur Naik ‡  Par Lab, EECS, UC Berkeley‡

Michael Bond (Ohio State) Milind Kulkarni (Purdue)

Michael Bond Milind Kulkarni Man Cao Minjia Zhang Meisam Fathi Salmi Swarnendu Biswas Aritra Sengupta Jipeng Huang Ohio State Purdue.

Memory Models (1) Xinyu Feng University of Science and Technology of China.

CS 162 Memory Consistency Models. Memory operations are reordered to improve performance Hardware (e.g., store buffer, reorder buffer) Compiler (e.g.,

“FENDER” AUTOMATIC MEMORY FENCE INFERENCE Presented by Michael Kuperstein, Technion Joint work with Martin Vechev and Eran Yahav, IBM Research 1.

Goldilocks: Efficiently Computing the Happens-Before Relation Using Locksets Tayfun Elmas 1, Shaz Qadeer 2, Serdar Tasiran 1 1 Koç University, İstanbul,

Verification of Multithreaded Object- Oriented Programs with Invariants Bart Jacobs, K. Rustan M. Leino, Wolfram Schulte.

A Randomized Dynamic Program Analysis for Detecting Real Deadlocks Koushik Sen CS 265.

Scalable and Precise Dynamic Datarace Detection for Structured Parallelism Raghavan RamanJisheng ZhaoVivek Sarkar Rice University June 13, 2012 Martin.

D u k e S y s t e m s Time, clocks, and consistency and the JMM Jeff Chase Duke University.

DRF x A Simple and Efficient Memory Model for Concurrent Programming Languages Dan Marino Abhay Singh Todd Millstein Madan Musuvathi Satish Narayanasamy.

Iterative Context Bounding for Systematic Testing of Multithreaded Programs Madan Musuvathi Shaz Qadeer Microsoft Research.

Dynamic Feedback: An Effective Technique for Adaptive Computing Pedro Diniz and Martin Rinard Department of Computer Science University of California,

ISBN Chapter 3 Describing Syntax and Semantics.

SOS: Saving Time in Dynamic Race Detection with Stationary Analysis Du Li, Witawas Srisa-an, Matthew B. Dwyer.

Transactional Memory (TM) Evan Jolley EE 6633 December 7, 2012.

Atomicity in Multi-Threaded Programs Prachi Tiwari University of California, Santa Cruz CMPS 203 Programming Languages, Fall 2004.

ADVERSARIAL MEMORY FOR DETECTING DESTRUCTIVE RACES Cormac Flanagan & Stephen Freund UC Santa Cruz Williams College PLDI 2010 Slides by Michelle Goodstein.

Structure-driven Optimizations for Amorphous Data-parallel Programs 1 Mario Méndez-Lojo 1 Donald Nguyen 1 Dimitrios Prountzos 1 Xin Sui 1 M. Amber Hassaan.

Formalisms and Verification for Transactional Memories Vasu Singh EPFL Switzerland.

Department of Computer Science Presenters Dennis Gove Matthew Marzilli The ATOMO ∑ Transactional Programming Language.

Describing Syntax and Semantics

Cormac Flanagan UC Santa Cruz Velodrome: A Sound and Complete Dynamic Atomicity Checker for Multithreaded Programs Jaeheon Yi UC Santa Cruz Stephen Freund.

Prospector : A Toolchain To Help Parallel Programming Minjang Kim, Hyesoon Kim, HPArch Lab, and Chi-Keung Luk Intel This work will be also supported by.

1 Scalable and transparent parallelization of multiplayer games Bogdan Simion MASc thesis Department of Electrical and Computer Engineering.

University of Michigan Electrical Engineering and Computer Science 1 Dynamic Acceleration of Multithreaded Program Critical Paths in Near-Threshold Systems.

Accelerating Precise Race Detection Using Commercially-Available Hardware Transactional Memory Support Serdar Tasiran Koc University, Istanbul, Turkey.

- 1 - Dongyoon Lee, Peter Chen, Jason Flinn, Satish Narayanasamy University of Michigan, Ann Arbor Chimera: Hybrid Program Analysis for Determinism * Chimera.

LOOM: Bypassing Races in Live Applications with Execution Filters Jingyue Wu, Heming Cui, Junfeng Yang Columbia University 1.

School of Information Technologies Michael Cahill 1, Uwe Röhm and Alan Fekete School of IT, University of Sydney {mjc, roehm, Serializable.

Evaluating FERMI features for Data Mining Applications Masters Thesis Presentation Sinduja Muralidharan Advised by: Dr. Gagan Agrawal.

1 Evaluating the Impact of Thread Escape Analysis on Memory Consistency Optimizations Chi-Leung Wong, Zehra Sura, Xing Fang, Kyungwoo Lee, Samuel P. Midkiff,

Foundations of the C++ Concurrency Memory Model Hans-J. Boehm Sarita V. Adve HP Laboratories UIUC.

Synchronization Transformations for Parallel Computing Pedro Diniz and Martin Rinard Department of Computer Science University of California, Santa Barbara.

DoubleChecker: Efficient Sound and Precise Atomicity Checking Swarnendu Biswas, Jipeng Huang, Aritra Sengupta, and Michael D. Bond The Ohio State University.

COMP 111 Threads and concurrency Sept 28, Tufts University Computer Science2 Who is this guy? I am not Prof. Couch Obvious? Sam Guyer New assistant.

Colorama: Architectural Support for Data-Centric Synchronization Luis Ceze, Pablo Montesinos, Christoph von Praun, and Josep Torrellas, HPCA 2007 Shimin.

Aritra Sengupta, Swarnendu Biswas, Minjia Zhang, Michael D. Bond and Milind Kulkarni ASPLOS 2015, ISTANBUL, TURKEY Hybrid Static-Dynamic Analysis for Statically.

Efficient Deterministic Replay of Multithreaded Executions in a Managed Language Virtual Machine Michael Bond Milind Kulkarni Man Cao Meisam Fathi Salmi.

Low-Overhead Software Transactional Memory with Progress Guarantees and Strong Semantics Minjia Zhang, 1 Jipeng Huang, Man Cao, Michael D. Bond.

Drinking from Both Glasses: Adaptively Combining Pessimistic and Optimistic Synchronization for Efficient Parallel Runtime Support Man Cao Minjia Zhang.

Wait-Free Multi-Word Compare- And-Swap using Greedy Helping and Grabbing Håkan Sundell PDPTA 2009.

…and region serializability for all JESSICA OUYANG, PETER CHEN, JASON FLINN & SATISH NARAYANASAMY UNIVERSITY OF MICHIGAN.

A Safety-First Approach to Memory Models Madan Musuvathi Microsoft Research ISMM ‘13 Keynote 1.

Detecting Atomicity Violations via Access Interleaving Invariants

Vertical Profiling : Understanding the Behavior of Object-Oriented Applications Sookmyung Women’s Univ. PsLab Sewon,Moon.

Eraser: A dynamic Data Race Detector for Multithreaded Programs Stefan Savage, Michael Burrows, Greg Nelson, Patrick Sobalvarro, Thomas Anderson Presenter:

Testing Concurrent Programs Sri Teja Basava Arpit Sud CSCI 5535: Fundamentals of Programming Languages University of Colorado at Boulder Spring 2010.

Using Escape Analysis in Dynamic Data Race Detection Emma Harrington `15 Williams College

FastTrack: Efficient and Precise Dynamic Race Detection [FlFr09] Cormac Flanagan and Stephen N. Freund GNU OS Lab. 23-Jun-16 Ok-kyoon Ha.

Prescient Memory: Exposing Weak Memory Model Behavior by Looking into the Future MAN CAO JAKE ROEMER ARITRA SENGUPTA MICHAEL D. BOND 1.

Optimistic Hybrid Analysis

Mihai Burcea, J. Gregory Steffan, Cristiana Amza

Lightweight Data Race Detection for Production Runs

An Operational Approach to Relaxed Memory Models

Aritra Sengupta Man Cao Michael D. Bond and Milind Kulkarni

Threads Cannot Be Implemented As a Library

Effective Data-Race Detection for the Kernel

Specifying Multithreaded Java semantics for Program Verification

Threads and Memory Models Hal Perkins Autumn 2011

Man Cao Minjia Zhang Aritra Sengupta Michael D. Bond

Jipeng Huang, Michael D. Bond Ohio State University

Threads and Memory Models Hal Perkins Autumn 2009

Parallelism and Concurrency

Design and Implementation Issues for Atomicity

Xinyu Feng University of Science and Technology of China

Rethinking Support for Region Conflict Exceptions

Presentation transcript:

Aritra Sengupta, Man Cao, Michael D. Bond and Milind Kulkarni PPPJ 2015, Melbourne, Florida, USA Toward Efficient Strong Memory Model Support for the Java Platform via Hybrid Synchronization 1

Programming Language Semantics? Data Races Java provides weak semantics 2

Weak Semantics T1 T2 a = new A(); init = true; if (init) a.field++; A a = null; boolean init = false; 3

Weak Semantics T1 T2 a = new A(); init = true; if (init) a.field++; No data depende nce 4

Weak Semantics T1 T2 a = new A(); init = true; if (init) a.field++; A a = null; boolean init = false; 5

Weak Semantics T1T2 init = true; a = new A(); if (init) a.field++; 6

Weak Semantics T1T2 init = true; a = new A(); if (init) a.field++; Null derefere nce 7

Java Memory Model JMM (Manson et al., POPL, 2005) variant of DRF0 (Adve and Hill, ISCA, 1990) Atomicity of synchronization-free regions for data-race-free programs Data races: weak semantics 8

Need for Stronger Memory Models “The inability to define reasonable semantics for programs with data races is not just a theoretical shortcoming, but a fundamental hole in the foundation of our languages and systems…” Give better semantics to programs with data races Stronger memory models – Adve and Boehm, CACM,

Memory Models: Run-time cost vs Strength Run-time cost Strength 1. Ouyang et al.... and region serializability for all. In HotPar, DRF0 SC Synchronization- free region serializability 1 10 Statically Bounded Region Serializability

Memory Models: Run-time cost vs Strength Run-time cost Strength 2. Sengupta et al. Hybrid Static-Dynamic Analysis for Statically Bounded Region Serializability. ASPLOS, DRF0 SC Synchronization- free region serializability 11 Statically Bounded Region Serializability 2

Statically Bounded Region Serializability (SBRS) Compiler demarcated regions execute atomically Execution is an interleaving of these regions – Sengupta et al., ASPLOS,

Statically Bounded Region Serializability (SBRS) rel(lock) acq(lock) methodCall() Synchroniz ation operations Method calls Loop backedges 13

Statically Bounded Region Serializability (SBRS) rel(lock) acq(lock) methodCall() 14

Statically Bounded Region Serializability (SBRS) Statically and dynamicall y bounded Loop backedge s 15

Overview 16 Enforcement of SBRS with dynamic locks Precise dynamic locks: EnfoRSer-D (our prior work), low contention Enforcement of SBRS with static locks Imprecise static locks: EnfoRSer-S, low instrumentation overhead Hybridization of locks EnfoRSer-H: static and dynamic locks, right locks for right sites? Results EnfoRSer-H does at least as well as either. Some cases significant benefit

Enforcement of SBRS Prevent two concurrent accesses to the same memory location where one is a write 17

Enforcement of SBRS Prevent two concurrent accesses to the same memory location where one is a write 18 Prevent regions that have races from running concurrently!

Enforcement of SBRS 19 Acquire locks before each memory access Acquire locks at the start of the region

Enforcement of SBRS 20 Acquire locks before each memory access Acquire locks at the start of the region Precise object locks: dynamic locks

Enforcement of SBRS 21 Acquire locks before each memory access Acquire locks at the start of the region Region level locks: statically chosen for an access site

22 EnfoRSer-D Precise dynamic locks Per-access locks with retry mechanism Compiler Transformations: Speculative execution

SBRS with Dynamic Locks 23 Y = = X Z = Dynamic per- access locks

SBRS with Dynamic Locks Y = = X Z = Program access 24

SBRS with Dynamic Locks Y = = X Z = Ownership transferred 25

SBRS with Dynamic Locks Y = = X Z = Ownership transferred 26 Dynamic locks: precise location, hence no reasoning about races statically!

SBRS with Dynamic Locks Y = = X Z = Ownership transferred 27 Precise conflict detection- efficient for high-conflicting regions High instrumentation cost at each access High overhead for low- conflicting programs

Experimental Methodology Benchmarks DaCapo 2006, 9.12-bach Fixed-workload versions of SPECjbb2000 and SPECjbb2005 Platform Intel Xeon system: 32 cores 28

Implementation and Evaluation Developed in Jikes RVM Code publicly available on Jikes RVM Research Archive 29

Run-time Performance 30

Run-time Performance 27% overhead on average 31

Run-time Performance Can we do better? 32 Precise conflict detection but high instrumentation overhead! Complex compiler transformations (additional code)

EnfoRSer with Static Locks 33 Reduce the instrumentation overhead of EnfoRSer-D Less complex code generation

34 EnfoRSer-S Static region level locks Racing sites acquire same lock Coarsened locks to reduce instrumentation overhead

SBRS with Locks on Static Sites 35 Y = = X Z = Static Region Locks L0 L1 L2 Static locks: racy sites protected by same lock

SBRS with Locks on Static Sites 36 Y = = X Z = All locks acquired before access L0 L1 L2

SBRS with Locks on Static Sites 37 Y = = X Z = L0 L1 L2 Ownership transferred

SBRS with Locks on Static Sites 38 Y = = X Z = L0 L1 L2 Ownership transferred Imprecise and does not lower instrumenta tion overhead!

SBRS with Locks on Static Sites 39 Y = = X Z = Coarsened single lock L012

SBRS with Locks on Static Sites 40 Y = = X Z = Coarsened single lock L012 Low instrumentation overhead Imprecise conflict detection

EnfoRSer-S 41 s0 s6 s5 s3 s2 R1 R2 R3 R4 s7 s1s4 s8

EnfoRSer-S 42 s0 s6 s5 s3 s2 R1 R2 R3 R4 s7 s1s4 s8

EnfoRSer-S 43 s0 s6 s5 s3 s2 R1 R2 R3 R4 s7 s1s4 s8 RACE

EnfoRSer-S 44 s0 s6 s5 s3 s2 R1 R2 R3 R4 s7 s1s4 s8 RACE Naik et al.’s 2006 race detection algorithm, Chord

EnfoRSer-S 45 s0 s6 s5 s3 s2 R1 R2 R3 R4 s7 s1s4 s8 L0 L1 L2 L3 L4

EnfoRSer-S 46 R1 R2 R3 R4 L0 L1 L3 s0 s6 s3 s7 s1s4 s8 s5 s2

EnfoRSer-S 47 R1 R2 R3 R4 L0 L1 L3 s0 s6 s5 s3 s2 s7 s1s4 s8 L1 L2 L4 L2

EnfoRSer-S 48 R1 R2 R3 R4 L0 L1 L3 s0 s6 s5 s3 s2 s7 s1s4 s8 L1 L2 L4 L2 Does not reduce instrumentation overhead!

EnfoRSer-S 49 s0 s6 s5 s3 s2 R1 R2 R3 R4 s7 s1s4 s8

EnfoRSer-S 50 s0 s6 s5 s3 s2 R1 R2 R3 R4 s7 s1s4 s8

EnfoRSer-S 51 s0 s6 s5 s3 s2 R1 R2 R3 R4 s7 s1s4 s8 L0 L1

EnfoRSer-S 52 s0 s6 s5 s3 s2 R1 R2 R3 R4 s7 s1s4 s8 L0 L1

EnfoRSer-S 53 s0 s6 s5 s3 s2 R1 R2 R3 R4 s7 s1s4 s8 L0 L1 Reduces instrumentation overhead Increases contention

Run-time Performance 54

Run-time Performance 2600% overhead on average 55

Run-time Performance % overhead on average EnfoRSer-S performs better

Hybridizing Locks High contention sites: Precise dynamic locks (precise conflict detection) Low contention sites: Single static lock (low instrumentation overhead) 57

58 EnfoRSer-H Static locks to reduce instrumentation Dynamic locks for precise conflict detection Correctly and efficiently combine: best of both

Run-time Performance 26% overhead on average 59

Run-time Performance 60 Provides nearly the best of either approaches! 63% reduction in overhead

Run-time Performance 87% reduction in overhead 49% reduction in lock acquires and not increasing contention! 61

Run-time Performance 87% reduction in overhead 49% reduction in lock acquires 62 Dramatically improves on both when hybridization helps!

Hybridizing Locks Right synchronization for right program sites Combining different synchronization mechanisms Best of different synchronization mechanisms 63

EnfoRSer-H 64 Right locks for right sites Cost- benefit model Assignment algorithm Combine static and dynamic locks Profiling

65 Program exectuion Second iterationUse hybridized instrumentation Assignment Algorithm Between iterationsCompute SRSL and estimate cost Program execution First iterationProfiling (use RACE relation) Two-iteration Methodology

Assignment Algorithm 66

67 Compute initial cost Revert to previous state Retain state Change a static lock to dynamic lock Change all its racing sites to dynamic locks Recompute cost current cost < previous cost No Yes Profiled data

Assignment Algorithm 68 s0 s7 s6 s5 s4 s3 s2 s8 R1 R2 R3 R4 s9 s1

Assignment Algorithm 69 s0 s7 s6 s5 s4 s3 s2 s8 R1 R2 R3 R4 1.Conflicts on each site 2.Lock acquires on each site s9 s1

Assignment Algorithm 70 s0 s7 s6 s5 s4 s3 s2 s8 R1 R2 R3 R4 s9 s1

Assignment Algorithm 71 s0 s7 s6 s5 s4 s3 s2 s8 R1 R2 R3 R4 s9 s1

Assignment Algorithm 72 s0 s7 s6 s5 s4 s3 s2 s8 R1 R2 R3 R4 If (current cost < previous cost) // retain state s9 s1

Assignment Algorithm 73 s0 s7 s6 s5 s4 s3 s2 s8 R1 R2 R3 R4 Switch on next site s9 s1

Assignment Algorithm 74 s0 s7 s6 s5 s4 s3 s2 s8 R1 R2 R3 R4 s9 s1

Assignment Algorithm 75 s0 s7 s6 s5 s4 s3 s2 s8 R1 R2 R3 R4 If (current cost < previous cost) // retain state s9 s1

Assignment Algorithm 76 s0 s7 s6 s5 s4 s3 s2 s8 R1 R2 R3 R4 Switch on next site s9 s1

Assignment Algorithm 77 s0 s7 s6 s5 s4 s3 s2 s8 R1 R2 R3 R4 current cost > previous cost // revert state s9 s1

Assignment Algorithm 78 s0 s7 s6 s5 s4 s3 s2 s8 R1 R2 R3 R4 s9 s1

Assignment Algorithm 79 s0 s7 s6 s5 s4 s3 s2 s8 R1 R2 R3 R4 s1, s9 s9 s1

Assignment Algorithm 80 s0 s7 s6 s5 s4 s3 s2 s8 R1 R2 R3 R4 s9 s1 Final State!

Lock Assignment 81 R1 R2 R3 R4 L1L1 s0 s6 s3 s9 s1 s7 s4 s2 s8 s5

Lock Assignment 82 R1 R2 R3 R4 L1L1 L1L1 L1L1 L1L1 s0 s6 s3 s9 s1 s7 s4 s2 s8 s5

Lock Assignment 83 R1 R2 R3 R4 L1L1 L1L1 L1L1 L1L1 s0 s7 s6 s5 s4 s3 s2 s8 s9 s1

Lock Assignment 84 R1 R2 R3 R4 L1L1 L1L1 L1L1 s0 s7 s6 s5 s4 s3 s2 s8 s9 s1

Lock Assignment 85 R1 R2 R3 R4 L1L1 L1L1 L1L1 s0 s7 s6 s5 s4 s3 s2 s8 Per-object locks and transformat ions s9 s1

Related Work Use of static locks Chimera, Lee et al., PLDI 2012 Use of static analysis Static Conflict Analysis for Multi-Threaded Object- Oriented Programs, Von Praun and Gross, PLDI, Goldilocks, Elmas et al., PLDI Red Card, Flanagan and Freund, ECOOP 2013 Hybridizing locks Hybrid Tracking, Cao et al., WODET

Conclusion 87 Combine synchronization mechanisms Best of different kinds of synchronization Efficient enforcement of SBRS Static locks Dynamic locks Precise conflict detection Low instrumentation overhead Select the best for a region