© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 1 Concurrency in Programming Languages Matthew J. Sottile Timothy G. Mattson Craig E Rasmussen

Chapter 8 Objectives Understand performance implications of the memory hierarchy Look at Amdahl’s law and the concept of parallel speedup Understand sources of performance overhead that limit parallel speedup © 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 2

3 Outline CPU vs. Memory Performance –Impact on concurrent programs Amdahl’s law and parallel speedup Overhead associated with lock-based algorithms Thread overhead considerations

CPU vs. Memory Performance Gap CPU cycle time is much shorter than the latency to access memory. This means that a single memory access can waste many cycles while the CPU waits for it. –Utilization of the machine decreases. John Backus coined the term for this gap between CPU and memory performance in 1977: –The von Neumann bottleneck © 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 4

von Neumann Bottleneck Named after the inventor of the model of computation used in modern computers: stored-program computers in which a memory is connected to a CPU, and instructions and data are fetched from memory, decoded, executed, and results stored back. Execution involves frequent movement across the CPU/memory boundary to fetch and store. –This implies that the latency between CPU and memory is a primary performance factor. © 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 5

Hardware solution The solution: hardware assistance to hide the latency to memory. [Diagram: CPU connected to main memory over a slow path] © 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 6

Hardware solution The solution: hardware assistance to hide the latency to memory. Small, low-latency cache memory holds a copy of memory near recent accesses. [Diagram: a small, fast cache sits between the CPU and slow main memory] © 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 7

Common SMP activity overheads The gap between cache latency (an L1 cache hit) and main memory access (a miss in all caches) is big! © 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 8

Locality The key is to exploit locality properties of programs. –Given an access to location X, it is likely that subsequent accesses will be very close to X (spatial locality). –Furthermore, recently accessed locations such as X are likely to be accessed again soon (temporal locality). These properties occur commonly in programs, which is why caches are so successful and widespread. © 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 9

Violating locality Locality implies that the data layout in memory should be related to the access pattern. –E.g.: Accessing a matrix one row at a time means we want rows to be contiguous in memory. If the access pattern doesn’t match the data layout, we lose locality. –The performance benefit of the cache goes away. –E.g.: Lay the matrix out with columns contiguous, but access it one row at a time. © 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 10
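As a concrete illustration (a small C sketch of my own, not from the slides), summing a matrix with the loops in the two possible orders touches exactly the same data but with very different locality, since C arrays are row-major:

```c
#include <stdio.h>

#define N 1024

static double a[N][N];   /* row-major: a[i][0..N-1] is contiguous in memory */

/* Good locality: the inner loop walks along a row, so consecutive accesses
   hit consecutive addresses and each cache line fill serves several iterations. */
double sum_by_rows(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Poor locality: the inner loop strides down a column, jumping N*sizeof(double)
   bytes between accesses, so most accesses touch a different cache line. */
double sum_by_columns(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void) {
    printf("%f %f\n", sum_by_rows(), sum_by_columns());
    return 0;
}
```

On typical hardware the column-order version runs several times slower once the matrix no longer fits in cache, even though it performs the same arithmetic.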

Parallelism and memory The cache solution to the memory latency problem becomes complicated when multiple CPUs or cores share a single main memory. This is the case in typical multicore systems. How do we: –Maintain the cache structure to hide latency, –While ensuring that cores all see a consistent view of memory? © 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 11

The Problem Say core 1 reads memory location X, pulling it and its neighbors on the same cache line into its L1 cache. Core 2 then writes to memory location X, changing the value in its own cache and in main memory. How do we ensure that core 1 does not use the out-of-date copy of location X? …Cache coherence protocols. © 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 12

Cache coherence protocols Cache coherence protocols maintain this coherent view of memory for each core. When memory is accessed (either read or written), caches observe the activity on the bus and update the state of the data they contain: –Invalidating it if it is no longer up to date –Marking values that are replicated between caches as shared –Marking values as modified –Establishing exclusive ownership of a memory location © 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 13

Cache coherence protocols The protocol defines a state machine that tells the cache how to treat data that it contains. –Cache reads a value from memory that no other core has read, marks it as exclusively owned. –Cache observes another core reading the same memory, transitions the state from exclusive to shared. –Cache observes another core write to that memory, transitions the state to invalid to force an update if it is accessed again. Many protocols exist, and different ones are used by different CPU manufacturers. © 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 14
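To make those transitions concrete, here is a deliberately simplified C sketch of a MESI-style state machine for a single cache line, as seen by one cache. It illustrates the idea only, not any particular vendor's protocol, and it ignores details such as write-backs and whether another cache already holds the line when it is filled:

```c
/* Simplified MESI states for one cache line in one cache. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } line_state;

typedef enum {
    LOCAL_READ,   /* this core reads the line                       */
    LOCAL_WRITE,  /* this core writes the line                      */
    REMOTE_READ,  /* a read by another core is snooped on the bus   */
    REMOTE_WRITE  /* a write by another core is snooped on the bus  */
} line_event;

line_state next_state(line_state s, line_event e) {
    switch (e) {
    case LOCAL_READ:
        /* A miss fills the line; assume no other cache holds it, so it
           arrives exclusively owned. Otherwise the state is unchanged. */
        return (s == INVALID) ? EXCLUSIVE : s;
    case LOCAL_WRITE:
        /* Writing requires ownership (other copies are invalidated);
           the line is now dirty. */
        return MODIFIED;
    case REMOTE_READ:
        /* Another core now has a copy: MODIFIED/EXCLUSIVE drop to SHARED
           (a MODIFIED line would first be written back). */
        return (s == INVALID) ? INVALID : SHARED;
    case REMOTE_WRITE:
        /* Another core took ownership: our copy is stale. */
        return INVALID;
    }
    return s;
}
```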

Implications for performance A parallel algorithm should avoid causing the cache coherence protocol to frequently invalidate cached data, which would result in frequent accesses to slow main memory. In other words, cores should be used in a way that minimizes overlap in the memory locations they access. © 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 15

Example Consider three threads across three cores. In the bad case, cores access memory near each other, resulting in frequent invalidations of cached data. In the good case, they reduce the overlap to a minimal set of locations at the boundaries of the regions that they are using. © 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 16
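A small pthreads sketch of the bad case (mine, not the book's): three threads each update their own counter, but the counters sit on the same cache line, so every increment invalidates that line in the other cores (false sharing). Padding each counter out to its own line, as in the padded struct, removes the coherence traffic; the 64-byte line size is an assumption about the target machine.

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 3
#define ITERS    10000000L

/* Bad layout: the three counters share one cache line. */
static long packed[NTHREADS];

/* Better layout: each counter padded out to an (assumed) 64-byte line. */
struct padded { long value; char pad[64 - sizeof(long)]; };
static struct padded spread[NTHREADS];

static void *worker(void *arg) {
    long id = (long)arg;
    for (long i = 0; i < ITERS; i++)
        packed[id]++;   /* change to spread[id].value++ to avoid false sharing */
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    printf("%ld\n", packed[0] + spread[0].value);
    return 0;
}
```

Switching the increment from the packed array to the padded one typically makes the program dramatically faster, even though each thread touches logically private data in both versions.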

Data layout The key is to understand how data layout of a parallel program relates to the access pattern of the concurrently executing threads. For languages, this requires understanding the layout patterns that are assumed for data structures like arrays. Example: row-major versus column-major layout of multidimensional arrays. –Fortran vs. C style © 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 17

Measurement How do we measure if our programs suffer from performance problems related to the cache? Performance counters. Most modern CPUs provide registers that count events like cache misses, coherence protocol activities, etc… Profiling code and looking at those counters in regions of the parallel program can give insight into whether this is a cause of performance problems. © 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 18
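As one hedged example of reading such a counter programmatically, Linux exposes hardware counters through the perf_event_open interface; the sketch below counts hardware cache misses around a region of interest. It is Linux-specific, and the exact events available depend on the CPU:

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CACHE_MISSES;  /* last-level cache misses */
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    /* Count for this thread, on any CPU it runs on. */
    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... region of interest: run the code being profiled here ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t misses = 0;
    read(fd, &misses, sizeof(misses));
    printf("cache misses: %llu\n", (unsigned long long)misses);
    close(fd);
    return 0;
}
```

Tools such as perf, VTune, or PAPI expose the same counters without this low-level plumbing.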

© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 19 Outline CPU vs. Memory Performance –Impact on concurrent programs Amdahl’s law and parallel speedup Overhead associated with lock-based algorithms Thread overhead considerations

Speedup Why do we go parallel? Often we have a problem with a fixed size that we want to make go faster by having multiple cores working on parts of it at once. How do we measure the benefit of parallelism? Speedup: How much faster did the problem get solved in parallel compared to a sequential version. © 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 20
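Written as a formula (standard notation, not necessarily the book's): if the sequential program takes time $T_{seq}$ and the parallel program on $p$ processors takes $T_{par}(p)$, then the speedup is

$$ S(p) = \frac{T_{seq}}{T_{par}(p)}, $$

and perfect (linear) speedup corresponds to $S(p) = p$.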

Perfect speedup The sequential version of a program takes t_s time. Given p processors, the parallel version would ideally take t_p = t_s / p time. A class of problems known as embarrassingly parallel problems have speedups that approach this. –No dependencies between parallel threads: all of them execute without ever interacting in any way with the others. © 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 21

Realistic speedup Even embarrassingly parallel problems don’t have perfect speedup, because time is typically required to generate the work to execute in parallel and to gather the results of the parallel threads into the final solution. –These activities are typically sequential. We therefore break the time to execute a parallel program into the parts that are sequential and the parts that are parallel. –T = t_s + t_p © 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 22

Realistic speedup T = t_s + t_p As the number of parallel threads increases, t_p approaches a very small constant. Therefore in the limit, the parallel program performance is bounded by the parts that are intrinsically sequential. © 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 23

Amdahl’s law This is Amdahl’s law. The speedup of a program that has any components that are not concurrent can never be perfect, as it is ultimately bounded by the sequential parts. In some cases, this means the code will never get faster than the time to set up and finalize the problem. In more realistic cases, serialization also occurs due to synchronization between cores during parallel execution. © 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 24
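In the usual notation (mine, consistent with the slide's t_s and t_p): if a fraction $f$ of the work is inherently sequential and the remainder parallelizes perfectly over $p$ processors, then

$$ S(p) = \frac{1}{\,f + \frac{1 - f}{p}\,} \;\le\; \frac{1}{f}. $$

For example, a program that is 10% sequential ($f = 0.1$) can never run more than 10 times faster, no matter how many cores are added.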

© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 25 Outline CPU vs. Memory Performance –Impact on concurrent programs Amdahl’s law and parallel speedup Overhead associated with lock-based algorithms Thread overhead considerations

Locks and mutual exclusion Recall that mutual exclusion means that for a critical section of code, only one thread is ever allowed to execute inside it at a given time. Concurrent threads of execution are therefore serialized through critical sections. Locking is often based on the use of critical sections. –Not always: hardware assistance may exist. –This discussion assumes the case where it does not. © 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 26

Lock overhead Locking induces overhead. Locking requires entry into mutually exclusive regions of code. –This leads to potential serialization if there is contention for these regions. –Some cores sit idle waiting their turn. The use of locks can cause performance degradation as a result. © 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 27

Serialization due to locks If many threads simultaneously attempt to acquire the same lock, they will be serialized. –They acquire the lock one at a time, and each sits waiting until its turn arrives. A frequently used lock may cause frequent contention. –This corresponds to frequent serialization and large amounts of time spent in an unproductive wait state. © 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 28

Conservative locking A common problem in concurrent programs is excessive, conservative locking. Acquiring a lock even if, ultimately, it wasn’t actually necessary. –E.g.: protecting a critical section that exists in a rarely executed branch of a conditional, but acquiring the lock before testing the condition. Excessive use of the Java “synchronized” keyword can have this result. © 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 29
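A small pthreads sketch of the pattern just described (the flag, counter, and function names are made up for illustration): the conservative version pays for the lock on every call, while the lazier version tests an atomic flag first and only acquires the lock, re-checking the flag, when the rare branch actually appears necessary.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static long shared_count = 0;          /* protected by m */
static atomic_bool rare_flag = false;  /* set only rarely, by some other thread */

/* Conservative: the lock is acquired on every call, even though the
   protected branch is almost never taken. */
void handle_conservative(void) {
    pthread_mutex_lock(&m);
    if (atomic_load(&rare_flag))
        shared_count++;
    pthread_mutex_unlock(&m);
}

/* Lazier: test first, and lock only when the rare branch seems needed.
   The flag is re-checked under the lock (the double-checked pattern),
   since it may have changed between the test and the acquisition. */
void handle_lazy(void) {
    if (atomic_load(&rare_flag)) {
        pthread_mutex_lock(&m);
        if (atomic_load(&rare_flag))
            shared_count++;
        pthread_mutex_unlock(&m);
    }
}
```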

Lock overhead Even in the case where serialization does not occur, locks induce overhead. –Time to acquire the lock –Time to release the lock If the critical section protected by a lock is executed frequently, we pay this price every time it executes. –This can add up and reduce speedup, since lock overhead is additional work not present in the sequential algorithm. © 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 30

Optimistic schemes Optimistic schemes avoid locking unless it is absolutely necessary. Consider a program in which it is very rare for multiple threads to attempt a lock-protected activity at the same time. –It might be cheaper not to lock, but instead to check whether a conflict occurred after doing the work, and pay the penalty of cleaning up the mess in that rare case. This is the basis of the transaction concept. © 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 31
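The simplest concrete instance of this idea is a lock-free update built on compare-and-swap; the sketch below uses C11 atomics. A real transactional-memory system tracks whole read and write sets, but the do-the-work-then-check structure is the same:

```c
#include <stdatomic.h>

static _Atomic long balance = 0;

/* Optimistic update: read the current value, compute the new value without
   holding any lock, then try to publish it. If another thread updated the
   value in the meantime, the compare-and-swap fails, 'observed' is refreshed
   with the latest value, and we simply redo the (cheap) computation. */
void deposit(long amount) {
    long observed = atomic_load(&balance);
    long desired;
    do {
        desired = observed + amount;
    } while (!atomic_compare_exchange_weak(&balance, &observed, desired));
}
```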

© 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 32 Outline CPU vs. Memory Performance –Impact on concurrent programs Amdahl’s law and parallel speedup Overhead associated with lock-based algorithms Thread overhead considerations

Thread overhead Threads themselves are often not free. Some overhead exists to schedule threads onto cores, and to create and destroy them. This overhead varies drastically between systems. –E.g.: Kernel thread vs. process vs. user thread based runtimes. © 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 33

Thread overhead Understanding the thread overhead for the system you choose to use is important. How many threads to use? What granularity of computation for individual threads to handle? Reuse of existing threads for new work versus creation of fresh threads? © 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 34

Thread overhead Consider these three components of the time to run a computation in a thread: –The computation itself requires t_c time. –Thread creation and scheduling takes t_s time. –Thread destruction and cleanup takes t_f time. The overall time for the computation is therefore: –t_c + t_s + t_f © 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 35
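A quick way to get a feel for t_s + t_f on a given system is to time the creation and joining of threads with an empty body (a rough measurement sketch of mine, not from the book; numbers vary widely across operating systems and thread libraries):

```c
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define NTHREADS 1000

static void *empty_worker(void *arg) {
    (void)arg;
    return NULL;   /* t_c is essentially zero here */
}

int main(void) {
    struct timespec start, end;
    pthread_t tid;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < NTHREADS; i++) {
        pthread_create(&tid, NULL, empty_worker, NULL);
        pthread_join(tid, NULL);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ns = (end.tv_sec - start.tv_sec) * 1e9
              + (end.tv_nsec - start.tv_nsec);
    /* With an empty body, this is roughly t_s + t_f per thread. */
    printf("approx. create+join overhead per thread: %.0f ns\n", ns / NTHREADS);
    return 0;
}
```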

Implication The implication is that we would ideally choose computations to execute in a thread that take enough time that t_s and t_f are negligible by comparison. If we choose too small a computation, the time to execute the thread may be dominated by the time to start and stop the thread itself. –This reduces the effectiveness of the parallel algorithm! © 2009 Matthew J. Sottile, Timothy G. Mattson, and Craig E Rasmussen 36