Jakub Yaghob Martin Kruliš

System architecture

Terminology – 1: Distributed systems
  Host/node/system: one computer
  Cluster: a set of nodes connected together by a network; usually homogeneous
  Grid: a set of clusters connected by the internet

Terminology – 2: One system
  Socket/package: physical CPU; multiple sockets connected together by a high-speed connection; cache hierarchy; cores
  NUMA node: main memory; can contain processing units
  Core: owns execution units; shared registers; contains logical CPUs
  Logical CPU/thread: processing unit; executes instructions; private registers; hyper-threading

NUMA system – 4S
  [Figure: four sockets, CPU0 to CPU3, with memories MEM0 to MEM3.]

NUMA system – 8S
  [Figure: eight sockets, CPU0 to CPU7.]

NUMA – physical memory layout
  [Figure: two layouts of physical memory across NODE0 to NODE3: block and interleaving.]
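The two layouts differ only in how a physical address maps to a NUMA node. A minimal sketch, assuming four equally sized nodes of 1 GiB and an interleaving unit of one 4 KiB page; neither size appears on the slide, and the function names are hypothetical:

```python
NODES = 4             # NODE0..NODE3, as on the slide
NODE_SIZE = 1 << 30   # assumption: 1 GiB of memory per node
GRANULE = 4096        # assumption: interleaving unit of one 4 KiB page


def _check(addr: int) -> None:
    """Reject addresses outside the modelled physical memory."""
    if not 0 <= addr < NODES * NODE_SIZE:
        raise ValueError("address outside physical memory")


def node_block(addr: int) -> int:
    """Block layout: each node owns one contiguous address range."""
    _check(addr)
    return addr // NODE_SIZE


def node_interleaved(addr: int) -> int:
    """Interleaved layout: consecutive granules rotate over the nodes."""
    _check(addr)
    return (addr // GRANULE) % NODES
```

With the block layout a large contiguous buffer tends to live on one node; with interleaving it is spread over all nodes, so the average access distance evens out.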

Simplified package architecture
  CORE 0: threads T0, T4; EU; L1I, L1D; L2
  CORE 1: threads T1, T5; EU; L1I, L1D; L2
  CORE 2: threads T2, T6; EU; L1I, L1D; L2
  CORE 3: threads T3, T7; EU; L1I, L1D; L2
  L3/LLC
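The thread numbering on this slide (core i hosts T(i) and T(i+4)) can be captured in a few lines. A minimal sketch of that numbering only; the function names are hypothetical, and the claim that the single L3/LLC is shared by all four cores is an assumption read off the diagram:

```python
CORES = 4              # CORE 0..CORE 3 on the slide
THREADS_PER_CORE = 2   # two logical CPUs per core


def core_of(thread: int) -> int:
    """Core that hosts logical CPU `thread` (T0..T7)."""
    if not 0 <= thread < CORES * THREADS_PER_CORE:
        raise ValueError("no such logical CPU")
    return thread % CORES


def siblings(thread: int) -> list[int]:
    """All logical CPUs on the same core, including `thread` itself."""
    core = core_of(thread)
    return [core + k * CORES for k in range(THREADS_PER_CORE)]


def closest_shared_cache(a: int, b: int) -> str:
    """Nearest cache level that both logical CPUs can reach."""
    # Same core: L1 and L2 are shared. Different cores: only L3 (assumed shared).
    return "L1" if core_of(a) == core_of(b) else "L3"
```

For example, T0 and T4 compete for the same L1 and L2, while T0 and T1 meet only in the last-level cache.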

Cache terminology
  Cache line: data is transferred between memory and cache in atomic blocks; 64B
  Cache hit: data load/store from/to a cache
  Cache line load: cache line read from main memory
  Cache line flush: cache line stored to main memory
  Cache miss: a cache line is selected for eviction; if it is modified, the cache line will be flushed; the cache line is loaded
  False sharing: private data of different threads in the same cache line
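False sharing is a matter of layout: two fields collide when their offsets fall into the same 64B block. A minimal sketch using `ctypes` structures to inspect field offsets, assuming the structure itself starts on a cache line boundary; the class and function names are hypothetical:

```python
import ctypes

CACHE_LINE = 64  # bytes, as stated on the slide


class Packed(ctypes.Structure):
    """Two per-thread counters placed back to back."""
    _fields_ = [("a", ctypes.c_uint64), ("b", ctypes.c_uint64)]


class Padded(ctypes.Structure):
    """Same counters, padded so that each starts its own cache line."""
    _fields_ = [
        ("a", ctypes.c_uint64),
        ("pad", ctypes.c_char * (CACHE_LINE - ctypes.sizeof(ctypes.c_uint64))),
        ("b", ctypes.c_uint64),
    ]


def same_line(struct, first: str, second: str) -> bool:
    """True if both fields land in one cache line (base assumed line-aligned)."""
    o1 = getattr(struct, first).offset
    o2 = getattr(struct, second).offset
    return o1 // CACHE_LINE == o2 // CACHE_LINE
```

`same_line(Packed, "a", "b")` holds, so two threads updating `a` and `b` would contend for one line; in `Padded` the counters sit in different lines at the cost of 56 bytes of padding.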

Cache coherency
  Coherency inside the package: inclusive vs. exclusive caches
  Coherency between packages: ccNUMA
  MESI protocol: Modified, Exclusive, Shared, and Invalid; snooping
  Cache line ping-pong: moving a cache line among caches/packages in rapid succession

MESI protocol
  [Figure: state diagram with states M, E, S, I; transition labels: R, R/W, W, RX+F, R+F, W+RX, R/~S+R.]
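A minimal sketch of the textbook MESI transitions for one cache line in one cache. It does not try to reproduce the edge labels of the slide's diagram; the event names and the `step` function are hypothetical, and the snooped "remote write" event stands for any remote request for exclusive ownership:

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


class Event(Enum):
    LOCAL_READ = 1
    LOCAL_WRITE = 2
    REMOTE_READ = 3    # another cache reads the line (snooped)
    REMOTE_WRITE = 4   # another cache requests exclusive ownership (snooped)


def step(state: State, event: Event, others_have_copy: bool = False):
    """Return (next_state, flush) for one line; flush means write-back."""
    if event is Event.LOCAL_READ:
        if state is State.I:
            # Miss: load as Shared if another cache holds it, else Exclusive.
            return (State.S if others_have_copy else State.E), False
        return state, False
    if event is Event.LOCAL_WRITE:
        # Any local write ends in Modified; other copies get invalidated.
        return State.M, False
    flush = state is State.M   # only a Modified line holds unsaved data
    if event is Event.REMOTE_READ:
        return (State.I if state is State.I else State.S), flush
    return State.I, flush      # REMOTE_WRITE invalidates the local copy
```

Two caches that write the same line in turn drive each copy through M, I, M, and so on, with a flush at every hand-over; that is the cache line ping-pong.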

Latencies
  Action                          Cycles     Time (3 GHz)
  Local L1 cache hit              4          1-2 ns
  Local L2 cache hit              14         4-6 ns
  Branch misprediction            16         5-6 ns
  Local L3 cache hit              40-75      12-40 ns
  Mutex lock/unlock                          75 ns
  Remote L3                       100-300    30-100 ns
  Local memory                               100 ns
  Remote memory                              100-300 ns
  Send 1KB over FDR InfiniBand               900 ns
  Send 1KB over 1 Gb Ethernet                20000 ns
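The cycle and time figures are related by the clock frequency: at 3 GHz one cycle lasts 1/3 ns. A minimal sketch of the conversion; the function name is hypothetical:

```python
def cycles_to_ns(cycles: float, clock_ghz: float = 3.0) -> float:
    """Convert a cycle count to nanoseconds at the given clock frequency."""
    if clock_ghz <= 0:
        raise ValueError("clock frequency must be positive")
    return cycles / clock_ghz
```

For example, `cycles_to_ns(4)` gives about 1.33 ns and `cycles_to_ns(16)` about 5.33 ns, both inside the ranges quoted for 4 and 16 cycles.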

Package schema

Core schema