Jakub Yaghob Martin Kruliš

Slides:



Advertisements
Similar presentations
The University of Adelaide, School of Computer Science
Advertisements

Concurrent programming: From theory to practice Concurrent Algorithms 2014 Vasileios Trigonakis Georgios Chatzopoulos.
Multicore Architectures Michael Gerndt. Development of Microprocessors Transistor capacity doubles every 18 months © Intel.
Multi-core systems System Architecture COMP25212 Daniel Goodman Advanced Processor Technologies Group.
© Cray Inc. CSC, Finland September 21-24, XT3XT4XT5XT6 Number of cores/socket Number of cores/node Clock Cycle (CC) ??
ECE 454 Computer Systems Programming Parallel Architectures and Performance Implications (II) Ding Yuan ECE Dept., University of Toronto
Nov COMP60621 Concurrent Programming for Numerical Applications Lecture 6 Chronos – a Dell Multicore Computer Len Freeman, Graham Riley Centre for.
The University of Adelaide, School of Computer Science
1 Lecture 1: Parallel Architecture Intro Course organization:  ~5 lectures based on Culler-Singh textbook  ~5 lectures based on Larus-Rajwar textbook.
Parallel Computer Architectures
1 Shared-memory Architectures Adapted from a lecture by Ian Watson, University of Machester.
Spring 2003CSE P5481 Cache Coherency Cache coherent processors reading processor must get the most current value most current value is the last write Cache.
More on Locks: Case Studies
InputsMetricsCodeResults MAIN MEMORY core Interconnection network Private data (LI) cache Cache controller core Cache controller Private data (LI)
Shared Address Space Computing: Hardware Issues Alistair Rendell See Chapter 2 of Lin and Synder, Chapter 2 of Grama, Gupta, Karypis and Kumar, and also.
Introduction, background, jargon Jakub Yaghob. Literature T.G.Mattson, B.A.Sanders, B.L.Massingill: Patterns for Parallel Programming, Addison- Wesley,
Chapter 2 Parallel Architecture. Moore’s Law The number of transistors on a chip doubles every years. – Has been valid for over 40 years – Can’t.
ECE200 – Computer Organization Chapter 9 – Multiprocessors.
Lecture 13: Multiprocessors Kai Bu
Effects of wrong path mem. ref. in CC MP Systems Gökay Burak AKKUŞ Cmpe 511 – Computer Architecture.
SYNAR Systems Networking and Architecture Group CMPT 886: Computer Architecture Primer Dr. Alexandra Fedorova School of Computing Science SFU.
MODERN OPERATING SYSTEMS Third Edition ANDREW S. TANENBAUM Chapter 8 Multiple Processor Systems Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall,
RSIM: An Execution-Driven Simulator for ILP-Based Shared-Memory Multiprocessors and Uniprocessors.
Martin Kruliš by Martin Kruliš (v1.1)1.
Princess Sumaya Univ. Computer Engineering Dept. Chapter 5:
컴퓨터교육과 이상욱 Published in: COMPUTER ARCHITECTURE LETTERS (VOL. 10, NO. 1) Issue Date: JANUARY-JUNE 2011 Publisher: IEEE Authors: Omer Khan (Massachusetts.
SYNAR Systems Networking and Architecture Group CMPT 886: Computer Architecture Primer Dr. Alexandra Fedorova School of Computing Science SFU.
The University of Adelaide, School of Computer Science
Lecture 13: Multiprocessors Kai Bu
Itanium® 2 Processor Architecture
COSC6385 Advanced Computer Architecture
CS 152 Computer Architecture and Engineering Lecture 18: Snoopy Caches
תרגול מס' 5: MESI Protocol
The University of Adelaide, School of Computer Science
The University of Adelaide, School of Computer Science
Lecture 18: Coherence and Synchronization
A Study on Snoop-Based Cache Coherence Protocols
Task Scheduling for Multicore CPUs and NUMA Systems
12.4 Memory Organization in Multiprocessor Systems
Architecture Background
Jason F. Cantin, Mikko H. Lipasti, and James E. Smith
The University of Adelaide, School of Computer Science
CMSC 611: Advanced Computer Architecture
Flynn’s Taxonomy Flynn classified by data and control streams in 1966
Directory-based Protocol
The University of Adelaide, School of Computer Science
Chip-Multiprocessor.
Lecture 1: Parallel Architecture Intro
CMPT 886: Computer Architecture Primer
Kai Bu 13 Multiprocessors So today, we’ll finish the last part of our lecture sessions, multiprocessors.
Multiprocessor Highlights
Lecture 25: Multiprocessors
High Performance Computing
Lecture 25: Multiprocessors
The University of Adelaide, School of Computer Science
Chapter 4 Multiprocessors
The University of Adelaide, School of Computer Science
Introduction, background, jargon
Lecture 17 Multiprocessors and Thread-Level Parallelism
Lecture 24: Virtual Memory, Multiprocessors
Lecture 23: Virtual Memory, Multiprocessors
Lecture 24: Multiprocessors
Lecture: Coherence Topics: wrap-up of snooping-based coherence,
Coherent caches Adapted from a lecture by Ian Watson, University of Machester.
Lecture 17 Multiprocessors and Thread-Level Parallelism
Lecture 19: Coherence and Synchronization
Lecture 18: Coherence and Synchronization
The University of Adelaide, School of Computer Science
CSE 486/586 Distributed Systems Cache Coherence
Lecture 17 Multiprocessors and Thread-Level Parallelism
Presentation transcript:

Jakub Yaghob Martin Kruliš System architecture Jakub Yaghob Martin Kruliš

Terminology – 1 Distributed systems Host/node/system Cluster Grid One computer Cluster A set of nodes connected together by a network Usually homogenous Grid A set of clusters connected by internet

Terminology – 2 One system Socket/package NUMA node Core Physical CPU Multiple sockets connected together by a high-speed connection Cache hierarchy Cores NUMA node Main memory Can contain processing units Core Owns execution units Shared registers Contains logical CPUs Logical CPU/thread Processing unit Executes instructions Private registers Hyper-threading

NUMA system – 4S MEM0 MEM1 CPU0 CPU1 CPU3 CPU2 MEM3 MEM2

NUMA system – 8S CPU0 CPU4 CPU1 CPU7 CPU5 CPU3 CPU6 CPU2

NUMA – physical memory layout Block Interleaving NODE0 NODE1 NODE2 NODE3

Simplified package architecture CORE 0 T0 T4 EU L1I L1D L2 CORE 1 T1 T5 EU L1I L1D L2 CORE 2 T2 T6 EU L1I L1D L2 CORE 3 T3 T7 EU L1I L1D L2 L3/LLC

Cache terminology Cache line Cache hit Cache line load Data transferred between memory and cache in atomic blocks 64B Cache hit Data load/store from/to a cache Cache line load Cache line read from main memory Cache line flush Cache line stored to main memory Cache miss A cache line is selected for eviction If it is modified, cache line will be flushed The cache line is loaded False sharing Private data of different threads in the same cache line

Cache coherency Coherency inside the package Inclusive x exclusive caches Coherency between packages ccNUMA MESI protocol Modified, Exclusive, Shared, and Invalid Snooping Cache line ping-pong Moving cache line among caches/packages in rapid succession

MESI protocol M E S I R R/W W RX+F R R+F RX+F W+RX R/~S+R W+RX R+F

Latencies Action Cycles Time (3GHz) Local L1 cache hit 4 1-2 ns 14 4-6 ns Branch misprediction 16 5-6 ns Local L3 cache hit 40-75 12-40 ns Mutex lock/unlock 75 ns Remote L3 100-300 30-100 ns Local memory 100 ns Remote memory 100-300 ns Send 1KB over FDR InfiniBand 900 ns Send 1KB over 1GB Ethernet 20000 ns

Package schema

Core schema