Concurrent programming: From theory to practice. Concurrent Algorithms 2014. Vasileios Trigonakis, Georgios Chatzopoulos.



2-5 From theory to practice
Theoretical (design): impossibilities; upper/lower bounds; techniques; system models; correctness proofs; design (pseudo-code)
Practical (design): system models (shared memory vs. message passing); finite memory; practicality issues (re-usable objects); performance; design (pseudo-code, prototype)
Practical (implementation): hardware (which atomic ops, memory consistency, cache coherence, locality); performance; scalability; implementation (code)

6 Outline: CPU caches; Cache coherence; Placement of data; Memory consistency; Hardware synchronization; Concurrent programming techniques on locks


8-11 Why do we use caching?
Core freq: 2 GHz = 0.5 ns / instruction
Core → Disk: ~ms
Core → Memory: ~100 ns
Caches trade size for speed (large = slow, medium = medium, small = fast):
Core → L3: ~20 ns
Core → L2: ~7 ns
Core → L1: ~1 ns

12 Typical server configurations
Intel Xeon: … GHz; L1: 32 KB; L2: 256 KB; L3: 24 MB; Memory: 64 GB
AMD Opteron: 8 × 2.4 GHz; L1: 64 KB; L2: 512 KB; L3: 12 MB; Memory: 64 GB

13 Experiment: throughput of accessing memory, as a function of the working-set size

14 Outline: CPU caches; Cache coherence; Placement of data; Memory consistency; Hardware synchronization; Concurrent programming techniques on locks

15 Until ~2004: single-cores
Core freq: 3+ GHz; one core behind its private cache hierarchy (L1, L2), memory, and disk
[figure: single core with L1 and L2, memory, disk]

16 After ~2004: multi-cores
Core freq: ~2 GHz; each core has private L1 and L2; the L3 is shared between cores
[figure: Core 0 and Core 1 with private L1/L2, shared L3, memory, disk]

17 Multi-cores with private caches
Private caches = multiple copies of the same cache line can co-exist
[figure: Core 0 and Core 1 with private L1/L2, shared L3, memory, disk]

18 Cache coherence for consistency
Core 0 has X in its cache, and Core 1: wants to write X; wants to read X; did Core 0 write or read X?
[figure: Core 0 and Core 1 with private caches; X in Core 0's cache]

19 Cache coherence principles
To perform a write: invalidate all readers, or the previous writer
To perform a read: find the latest copy
[figure: Core 0 and Core 1 with private caches; X in Core 0's cache]

20 Cache coherence with MESI
A state diagram; one state per cache line: Modified (the only dirty copy), Exclusive (the only clean copy), Shared (a clean copy), Invalid (useless data)

21-22 The ultimate goal for scalability
Possible states: Modified (the only dirty copy), Exclusive (the only clean copy), Shared (a clean copy), Invalid (useless data)
Which state is our "favorite"? One that lets threads keep the data close (in the L1 cache) = faster

23 Experiment: the effects of false sharing

24 Outline: CPU caches; Cache coherence; Placement of data; Memory consistency; Hardware synchronization; Concurrent programming techniques on locks

25 Uniformity vs. non-uniformity
Typical desktop machine: uniform (all cores reach memory through the same caches, at the same cost)
Typical server machine: non-uniform (multiple sockets, each with its own caches and its own memory)
[figure: uniform topology vs. non-uniform (NUMA) topology]

26-34 Latency (ns) to access data
[figure: two sockets, each with cores, per-core L1 and L2, a shared L3, and memory; latencies annotated one by one: local L1 = 1, local L2 = 7; the remaining values did not survive the transcript]
Conclusion: we need to take care of locality

35 Experiment: the effects of locality