
Effective Fine-Grain Synchronization For Automatically Parallelized Programs Using Optimistic Synchronization Primitives
Martin Rinard
University of California, Santa Barbara

Problem: Efficiently Implementing Atomic Operations On Objects
Key Issue: Mutual Exclusion Locks Versus Optimistic Synchronization Primitives
Context: Parallelizing Compiler For Irregular Object-Based Programs
  Linked Data Structures
  Commutativity Analysis

Talk Outline
  Histogram Example
  Advantages and Limitations of Optimistic Synchronization
  Synchronization Selection Algorithm
  Experimental Results

Histogram Example

  class histogram {
   private:
    int counts[N];
   public:
    void update(int i) { counts[i]++; }
  };

  parallel for (i = 0; i < iterations; i++) {
    int c = f(i);
    h->update(c);
  }

Cloud Of Parallel Histogram Updates

  (Diagram: iterations 0 through 8 all update the same Histogram object concurrently.)

  Updates Must Execute Atomically

One Lock Per Object

  class histogram {
   private:
    int counts[N];
    lock mutex;
   public:
    void update(int i) {
      mutex.acquire();
      counts[i]++;
      mutex.release();
    }
  };

Problem: False Exclusion (updates to different elements of counts serialize on the single per-object lock even though they do not conflict)

One Lock Per Item

  class histogram {
   private:
    int counts[N];
    lock mutex[N];
   public:
    void update(int i) {
      mutex[i].acquire();
      counts[i]++;
      mutex[i].release();
    }
  };

Problem: Memory Consumption

Optimistic Synchronization

  Load Old Value
  Compute New Value Into Local Storage
  Commit Point:
    No Write Between Load and Commit - Commit Succeeds, Write New Value
    Write Between Load and Commit - Commit Fails, Retry Update
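As a concrete illustration of this load/compute/commit pattern, here is a minimal sketch using C++11 std::atomic and compare_exchange_weak; this is an assumption made for illustration only, since the talk's compiler emits load linked/store conditional instructions directly (shown later).

  #include <atomic>

  // Hypothetical optimistically synchronized counter update.
  // compare_exchange_weak plays the role of the commit point:
  // it installs new_value only if no other write has changed
  // the location since the load; otherwise the loop retries.
  void optimistic_increment(std::atomic<int>& count) {
    int old_value = count.load();
    int new_value;
    do {
      new_value = old_value + 1;  // compute new value into local storage
    } while (!count.compare_exchange_weak(old_value, new_value));
  }

On failure, compare_exchange_weak reloads the current value into old_value, so the retry recomputes from fresh data, mirroring the diagram's retry arc.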

Parallel Updates With Optimistic Synchronization

  (Diagram: two processors each Load Old Value and Compute New Value Into Local Storage; one Commit Succeeds and Writes the New Value, the other Commit Fails and Retries the Update.)

Optimistic Synchronization In Modern Processors

  Load Linked (LL) - Used To Load Old Value
  Store Conditional (SC) - Used To Commit New Value

Atomic Increment Using Optimistic Synchronization Primitives

  retry: LL    $2, 0($4)     # Load Old Value
         addiu $3, $2, 1     # Compute New Value Into Local Storage
         SC    $3, 0($4)     # Attempt To Store New Value
         beq   $3, 0, retry  # Retry If Failure

Optimistically Synchronized Histogram

  class histogram {
   private:
    int counts[N];
   public:
    void update(int i) {
      int new_count;
      do {
        new_count = LL(counts[i]);
        new_count++;
      } while (!SC(new_count, counts[i]));
    }
  };

Aspects of Optimistic Synchronization

  Advantages
    Slightly More Efficient Than Locked Updates
    No Memory Overhead
    No Data Cache Overhead
    Potentially Fewer Memory Consistency Requirements
  Advantages In Other Contexts
    No Deadlock, No Priority Inversions, No Lock Convoys
  Limitations
    Existing Primitives Support Only Single Word Updates
    Each Update Must Be Synchronized Individually
    Lack of Fairness

Synchronization In Automatically Parallelized Programs

  Serial Program --(Commutativity Analysis)--> Unsynchronized Parallel Program --(Synchronization Selection)--> Synchronized Parallel Program

  Assumption: Operations Execute Atomically
  Requirement: Correctly Synchronize Atomic Operations
  Goal: Choose An Efficient Synchronization Mechanism For Each Operation

Atomicity Issues In Generated Code

  Serial Program --(Commutativity Analysis)--> Unsynchronized Parallel Program --(Synchronization Selection)--> Synchronized Parallel Program

  Assumption: Operations Execute Atomically
  Requirement: Correctly Synchronize Atomic Operations
  Goal: Choose An Efficient Synchronization Mechanism For Each Operation

Use Optimistic Synchronization Whenever Possible

Model Of Computation

Objects With Instance Variables

  class histogram {
   private:
    int counts[N];
  };

Operations Update Objects By Modifying Instance Variables

  void histogram::update(int i) { counts[i]++; }

  h->update(1);

Commutativity Analysis

  Compiler Computes Extent Of Computation
    Representation Of All Operations In Computation
    In Example: { histogram::update }
  Do All Pairs Of Operations Commute?
    No - Generate Serial Code
    Yes - Automatically Generate Parallel Code
    In Example: h->update(i) and h->update(j) commute for all i, j
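To make the commutativity claim concrete, here is a small, self-contained check (an illustration only, not part of the compiler) that applying two updates in either order leaves the histogram in the same state; the fixed value of N is an assumption for the sketch.

  #include <array>
  #include <cassert>

  constexpr int N = 16;  // assumed size for the sketch

  // Stand-in for the histogram class from the example.
  struct histogram {
    std::array<int, N> counts{};
    void update(int i) { counts[i]++; }
  };

  int main() {
    histogram a, b;
    // Apply the same pair of updates in opposite orders.
    a.update(3); a.update(7);
    b.update(7); b.update(3);
    // Because the updates commute, both orders produce identical state.
    assert(a.counts == b.counts);
    return 0;
  }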

Synchronization Requirements

  Traditional Parallelizing Compilers
    Parallelize Loops With Independent Iterations
    Barrier Synchronization
  Commutativity Analysis
    Parallel Operations May Update Same Object
    For Generated Code To Execute Correctly, Operations Must Execute Atomically
    Code Generation Algorithm Must Insert Synchronization

Default Synchronization Algorithm

One Lock Per Object; Operations Acquire and Release Lock

  class histogram {
   private:
    int counts[N];
    lock mutex;
   public:
    void update(int i) {
      mutex.acquire();
      counts[i]++;
      mutex.release();
    }
  };

Synchronization Constraints

  Operation:
    counts[i] = counts[i]+1;
  Synchronization Constraint:
    Can Use Optimistic Synchronization - Read/Compute/Write Update To A Single Instance Variable

  Operation:
    temp = counts[i]; counts[i] = counts[j]; counts[j] = temp;
  Synchronization Constraint:
    Must Use Lock Synchronization - Updates Involve Multiple Interdependent Instance Variables
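A minimal sketch of the second case, assuming a std::mutex-guarded histogram (the names locked_histogram and swap_counts are illustrative, not from the talk): the swap reads and writes two interdependent instance variables, so no single-word commit can make it atomic, and the whole sequence is protected by a lock instead.

  #include <mutex>

  constexpr int N = 16;  // assumed size for the sketch

  // Hypothetical lock-synchronized histogram for the multi-word update case.
  struct locked_histogram {
    int counts[N] = {};
    std::mutex mutex;

    // The swap involves two interdependent instance variables, so the
    // entire read/write sequence executes under the object's lock.
    void swap_counts(int i, int j) {
      std::lock_guard<std::mutex> guard(mutex);
      int temp = counts[i];
      counts[i] = counts[j];
      counts[j] = temp;
    }
  };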

Synchronization Selection Constraints

  Can Use Optimistic Synchronization Only For Single Word Updates That
    Read An Instance Variable
    Compute A New Value That Depends On No Other Updated Instance Variable
    Write New Value Back Into Instance Variable
  All Updates To Same Instance Variable Must Use Same Synchronization Mechanism

Synchronization Selection Algorithm

  Operates At Granularity Of Instance Variables
  Compiler Scans All Updates To Each Instance Variable
    If All Updates Can Use Optimistic Synchronization, Instance Variable Is Marked Optimistically Synchronized
    If At Least One Update Must Use Lock Synchronization, Instance Variable Is Marked Lock Synchronized
  If A Class Has A Lock Synchronized Variable, Class Is Marked Lock Synchronized
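A minimal sketch of this selection pass, assuming simple, hypothetical compiler-side summaries (Update, InstanceVariable, ClassInfo, and the single_word_read_compute_write flag are illustrative names, not the paper's implementation):

  #include <string>
  #include <vector>

  enum class Sync { Optimistic, Lock };

  struct Update {
    bool single_word_read_compute_write;  // update fits the optimistic pattern?
  };

  struct InstanceVariable {
    std::string name;
    std::vector<Update> updates;
    Sync chosen = Sync::Optimistic;
  };

  struct ClassInfo {
    std::vector<InstanceVariable> variables;
    bool needs_lock = false;  // class is augmented with a lock if true
  };

  // Scan all updates to each instance variable and pick its mechanism;
  // any lock-synchronized variable marks the whole class as lock synchronized.
  void select_synchronization(ClassInfo& cls) {
    for (InstanceVariable& var : cls.variables) {
      bool all_optimistic = true;
      for (const Update& u : var.updates) {
        if (!u.single_word_read_compute_write) all_optimistic = false;
      }
      var.chosen = all_optimistic ? Sync::Optimistic : Sync::Lock;
      if (var.chosen == Sync::Lock) cls.needs_lock = true;
    }
  }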

Synchronization Selection In Example

  class histogram {
   private:
    int counts[N];   // Optimistically Synchronized Instance Variable
   public:
    void update(int i) { counts[i]++; }
  };

  histogram Is NOT Marked As A Lock Synchronized Class

Code Generation Algorithm

  All Lock Synchronized Classes Augmented With Locks
  Operations That Update Lock Synchronized Variables Acquire And Release The Lock In The Object
  Operations That Update Optimistically Synchronized Variables Use Optimistic Synchronization Primitives

Optimistically Synchronized Histogram

  class histogram {
   private:
    int counts[N];
   public:
    void update(int i) {
      int new_count;
      do {
        new_count = LL(counts[i]);
        new_count++;
      } while (!SC(new_count, counts[i]));
    }
  };

Experimental Results

Methodology

  Implemented Parallelizing Compiler
  Implemented Synchronization Selection Algorithm
  Parallelized Three Complete Scientific Applications: Barnes-Hut, String, Water
  Produced Four Versions
    Optimistic (All Updates Optimistically Synchronized)
    Item Lock (Produced By Hand)
    Object Lock
    Coarse Lock
  Used Inline Intrinsic Locks With Exponential Backoff
  Measured Performance On SGI Challenge XL

Time For One Update

  (Charts: update time in microseconds for one cached update and one uncached update on the Challenge XL, comparing Locked, Optimistic, and Unsynchronized updates; data and lock on different cache lines.)

Synchronization Frequency

  (Chart: microseconds per synchronization in Barnes-Hut, String, and Water for the Coarse Lock, Object Lock, Optimistic, and Item Lock versions.)

Memory Consumption For Barnes-Hut

  (Chart: total memory used to store objects, in MBytes, for the Optimistic, Item Lock, Object Lock, and Coarse Lock versions.)

Memory Consumption For String

  (Chart: total memory used to store objects, in MBytes, for the Optimistic, Item Lock, and Object Lock versions.)

Memory Consumption For Water

  (Chart: total memory used to store objects, in MBytes, for the Optimistic, Item Lock, Object Lock, and Coarse Lock versions.)

Speedups For Barnes-Hut

  (Chart: speedup versus number of processors for the Optimistic, Item Lock, Object Lock, and Coarse Lock versions.)

Speedups For String

  (Chart: speedup versus number of processors for the Optimistic, Item Lock, and Object Lock versions.)

Speedups For Water

  (Chart: speedup versus number of processors for the Optimistic, Item Lock, Object Lock, and Coarse Lock versions.)

Acknowledgements

  Pedro Diniz - Parallelizing Compiler
  Silicon Graphics - Challenge XL Multiprocessor
  Rohit Chandra, T.K. Lakshman, Robert Kennedy, Alex Poulos - Technical Assistance With SGI Hardware and Software

Bottom Line

  Optimistic Synchronization Offers
    No Memory Overhead
    No Data Cache Overhead
    Reasonably Small Execution Time Overhead
    Good Performance On All Applications
  Good Choice For Parallelizing Compiler
    Minimal Impact On Parallel Program
    Simple, Robust, Works Well In Range Of Situations
  Major Drawback
    Current Primitives Support Only Single Word Updates
  Use Optimistic Synchronization Whenever Applicable

Future

  The Efficient Implementation Of Atomic Operations On Objects Will Become A Crucial Issue For Mainstream Software
    Small-Scale Shared-Memory Multiprocessors
    Multithreaded Applications And Libraries
    Popularity Of Object-Oriented Programming
    Specific Example: Java Standard Library
  Optimistic Synchronization Primitives Will Play An Important Role