Executing Parallel Programs with Potential Bottlenecks Efficiently
Yoshihiro Oyama, Kenjiro Taura, Akinori Yonezawa
{oyama, tau,
University of Tokyo

Bottlenecks
Research context: implementing a concurrent object-oriented language on SMP or DSM machines.
A bottleneck object (e.g., a shared counter) receives many concurrent invocations of exclusive methods (e.g., synchronized methods in Java) that update it. The methods are serialized, so the execution time spent at the object becomes very large.

Speedup Curves for Programs with Bottlenecks
[Figure: execution time vs. number of processors; the ideal curve keeps falling, while in reality the time rises again past some point.]
Good compilers should give the ideal curve! We may execute a program on too many processors, because it is not always easy to predict its dynamic behavior.

Goal
Making the whole execution time on multiprocessors close to the time needed to execute only the bottleneck parts sequentially.
[Figure: execution-time bars for a naïve and an ideal implementation on 1 PE and 50 PEs, each split into "bottleneck parts" and "other parts"; on 50 PEs the ideal implementation is dominated by the bottleneck parts alone.]

Experiment Using a Counter Program in C
Environment: Solaris threads on an Ultra Enterprise machine.
Each processor increments a shared counter in parallel.
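For concreteness, here is a minimal sketch of such a counter microbenchmark (not the authors' code), written with POSIX threads rather than the original Solaris threads API; NTHREADS, NITERS, and all names are assumptions, and the lock flavor is what the following slides vary:

```c
/* A minimal sketch of the shared-counter microbenchmark. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 8
#define NITERS   1000000

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < NITERS; i++) {
        pthread_mutex_lock(&lock);    /* the exclusive "method" */
        counter++;                    /* the bottleneck object  */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t threads[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);
    printf("counter = %ld\n", counter);
    return 0;
}
```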

Implementation with Spinlocks
Each processor acquires a lock on the bottleneck object and executes methods by itself.
Advantage: no need to move "computation" among processors.
Disadvantage: frequent cache misses in reading the bottleneck object (because of cache invalidations by other processors).
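A minimal sketch of this variant, using a C11 test-and-set spinlock (the names obj_lock, obj_data, and exclusive_method are illustrative):

```c
/* Spinlock variant: every processor runs the method in place. */
#include <stdatomic.h>

static atomic_flag obj_lock = ATOMIC_FLAG_INIT;
static long obj_data = 0;         /* the bottleneck object's data */

void exclusive_method(void) {
    while (atomic_flag_test_and_set(&obj_lock))
        ;                         /* spin until the lock is free */
    obj_data++;                   /* method body runs locally,
                                     bouncing the object's cache line */
    atomic_flag_clear(&obj_lock);
}
```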

Implementation with Simple Blocking Locks
Non-owners enqueue "contexts" onto a queue attached to the bottleneck object; the owner dequeues contexts one by one, with mutex operations, and executes them.
Advantage: few cache misses in reading the bottleneck object.
Disadvantage: overheads to move "computation" among processors.
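A minimal sketch of this variant with a mutex-protected context queue (all names are illustrative, and the mutex must be initialized with pthread_mutex_init before use):

```c
/* Simple blocking-lock variant: non-owners hand contexts to the owner. */
#include <pthread.h>
#include <stddef.h>

typedef struct Ctx {
    struct Ctx *next;
    void (*run)(void *);          /* the method to execute */
    void *arg;                    /* its argument */
} Ctx;

typedef struct {
    pthread_mutex_t lock;
    Ctx *head, *tail;             /* the queue of contexts */
} Obj;

/* Non-owner side: enqueue the computation for the owner. */
void enqueue_ctx(Obj *o, Ctx *c) {
    c->next = NULL;
    pthread_mutex_lock(&o->lock);
    if (o->tail) o->tail->next = c; else o->head = c;
    o->tail = c;
    pthread_mutex_unlock(&o->lock);
}

/* Owner side: one mutex acquire/release per dequeued context. */
void drain(Obj *o) {
    for (;;) {
        pthread_mutex_lock(&o->lock);
        Ctx *c = o->head;
        if (c) { o->head = c->next; if (!o->head) o->tail = NULL; }
        pthread_mutex_unlock(&o->lock);
        if (!c) break;
        c->run(c->arg);
    }
}
```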

Overview of Our Scheme
An improvement of simple blocking locks.
Overheads in simple blocking locks:
  - Mutex operations on the queue of contexts
  - Waiting time imposed on the owner for the mutex
  - Cache misses in reading contexts
Our solutions:
  - Detaching a whole list of contexts from an object at once
  - Giving higher priority to the owner
  - Prefetching context data

Our Scheme (Inserting a Context)
When a non-owner invokes a method, it inserts a new context at the head of the list of contexts attached to the bottleneck object.
[Figure: non-owners X, Y, and Z insert contexts into bottleneck object A's list while the owner runs.]

Our Scheme (Detaching Contexts)
When the owner executes methods, it detaches the whole list of inserted contexts from the object at once. The detached contexts are then executed in turn without mutex operations on the list, so many of the owner's mutex operations are eliminated.
[Figure: the owner detaches the entire context list from bottleneck object A in one step, while non-owners keep inserting new contexts.]

Our Scheme (Low-Level Implementation)
The bottleneck object holds a one-word area that points to the list of contexts.
  - Owner (with high priority): detaches the list by updating the area with swap; detachment always succeeds in constant time.
  - Non-owners (with low priority): insert contexts by updating the area with compare-and-swap; an insertion may fail many times.
The owner no longer has the overhead of waiting for a mutex.
Why one word? Why a list, not a queue? To make our algorithm lock-free and non-blocking.
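A minimal sketch of this one-word insert/detach protocol, using C11 atomics (the Context and Object types and all function names are assumptions, not the authors' code):

```c
/* One-word insert/detach: non-owners push with CAS, owner takes all
 * pending contexts with a single unconditional swap. */
#include <stdatomic.h>
#include <stddef.h>

typedef struct Context {
    struct Context *next;
    void (*run)(void *);           /* the exclusive method to execute */
    void *arg;                     /* its argument */
} Context;

typedef struct {
    _Atomic(Context *) contexts;   /* the one-word area */
} Object;

/* Non-owner: insert a context with compare-and-swap; may retry. */
void insert_context(Object *obj, Context *ctx) {
    Context *head = atomic_load(&obj->contexts);
    do {
        ctx->next = head;          /* link to the current list head */
    } while (!atomic_compare_exchange_weak(&obj->contexts, &head, ctx));
}

/* Owner: detach the whole list with one swap (always succeeds in
 * constant time), then run the contexts with no further mutex ops. */
void detach_and_run(Object *obj) {
    Context *list = atomic_exchange(&obj->contexts, NULL);
    for (Context *c = list; c != NULL; c = c->next)
        c->run(c->arg);
}
```

Because the owner uses an unconditional swap, it never retries; only non-owners can lose a compare-and-swap race, which is exactly the priority scheme described above.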

Compile-time Optimizations
  - Prefetching context data: while one detached context is being executed, the next context is prefetched, which reduces the number of cache misses in reading contexts.
  - Assigning object data to registers: object data is kept in registers and passed between consecutive method executions.
These optimizations are realized implicitly by the compiler and runtime of the concurrent object-oriented language.
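A sketch of what the owner's loop looks like with prefetching, assuming the GCC/Clang builtin __builtin_prefetch (the Context type repeats the previous sketch so this fragment stands alone; the real optimization is emitted by the language's compiler, not written by hand):

```c
/* Owner-side prefetching: fetch the next context into cache while
 * the current one executes. */
typedef struct Context {
    struct Context *next;
    void (*run)(void *);
    void *arg;
} Context;

void run_with_prefetch(Context *c) {
    while (c != NULL) {
        if (c->next != NULL)
            __builtin_prefetch(c->next);   /* start fetching the next context */
        c->run(c->arg);                    /* execute the current one */
        c = c->next;
    }
}
```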

Experimental Results (1)

Experimental Results (2)

The main explanation ends here.

Other Interesting Facts
  - Waiting time for the mutex is very large: 70% of the owner's execution time.
  - Our scheme also gives good performance on a uniprocessor (execution times of a simple counter program):
      - spinlock: 641 msec
      - simple blocking lock: 1025 msec
      - our scheme: 810 msec

Examples of Bottlenecks
  - MT-unsafe libraries: many libraries assume single-threaded use.
  - I/O calls: printf, etc.
  - Stub objects in distributed systems: one representative object is responsible for all communication in a site.
  - Shared global variables: e.g., counters that collect statistics.

Limitations
  - Our scheme may use a large amount of memory, because non-owners create many contexts.
  - Our scheme does not guarantee FIFO scheduling of the methods in an object; a simple solution is to reverse a detached list, as sketched below.
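A minimal sketch of the reversal fix: a detached list holds contexts newest-first (LIFO), so reversing it restores insertion (FIFO) order. The Context type repeats the earlier sketches so this fragment stands alone:

```c
/* Reverse a detached context list to recover FIFO method order. */
typedef struct Context {
    struct Context *next;
    void (*run)(void *);
    void *arg;
} Context;

Context *reverse_contexts(Context *list) {
    Context *prev = NULL;
    while (list != NULL) {
        Context *next = list->next;   /* save the rest of the list */
        list->next = prev;            /* relink this node backwards */
        prev = list;
        list = next;
    }
    return prev;                      /* new head: the oldest insertion */
}
```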

Future Work
Solving a potential problem in memory use:
  - Problem: huge memory may be required for contexts.
  - Simple solution: switch dynamically between the two execution modes, falling back to local-based execution when the memory used for contexts exceeds some threshold.
Owner-based execution: more efficient in bottlenecks, uses more memory.
Local-based execution: less efficient in bottlenecks, uses less memory.

Achieving the Same Effect in Low-Level Languages (e.g., C)
Typical behavior of programmers:
  - Local-based execution in non-bottlenecks
  - Owner-based execution in bottlenecks
Disadvantages:
  - Some bottlenecks emerge only dynamically (depending on the number of processors and runtime parameters).
  - It is tedious to implement owner-based execution by hand, because the context data structure varies with the object and the method.