Multiprocessing and NUMA

What we sort of assumed so far…
- The Northbridge connects the CPU and memory to the rest of the system
- The memory controller is implemented in the Northbridge chipset
- Devices and the CPU access memory via requests to the Northbridge
- The CPU connects to the Northbridge using a Front Side Bus

Modern Systems
- Almost all current systems have more than one CPU/core
  - iPhones have 2 CPU cores and 3 GPU cores; the Galaxy S3 has 4 cores!
- Multiprocessor: more than one physical CPU
  - SMP: symmetric multiprocessing; each CPU is identical to every other
  - Each has the same capabilities and privileges
  - Each CPU plugs into the system via its own slot/socket
- Multicore: more than one CPU in a single physical package
  - Multiple CPUs connect to the system via a shared slot/socket
- Currently most multicores are SMP, but this might change soon!

SMP Operation
- Each processor in the system can perform the same tasks:
  - Execute the same set of instructions
  - Access memory
  - Interact with devices
- Each processor connects to the system in the same way
  - Traditional approach: a shared bus
  - Modern approach: an interconnect
- Interacting with the rest of the system (memory/devices) is done via communication over the shared bus/interconnect
- Obviously this can easily lead to chaos, which is why we need synchronization

SMP architecture
- The first approach to multiprocessing: just connect another CPU to the Northbridge
- Most of these systems used a shared bus
  - CPUs could communicate with each other and with the Northbridge
  - But only one user of the bus at a time, so scalability was limited (bus contention)

Multicore architecture
- During the early/mid 2000s CPUs started to change dramatically
  - Clock speeds could no longer increase exponentially, but transistor density was still increasing
  - The only thing architects could do was add more computing elements
  - Entire CPUs were replicated inside the same processor die
- The standard architecture is just like SMP, but with only one CPU slot in the system

Multiprocessor-Multicores
- SMP with multicore CPUs
  - Multiple processor slots in the system
  - Each slot hosts multiple CPU cores
- What does this mean for the OS?
  - Mostly hidden by the hardware: the OS sees N identical CPUs, so it treats them the same way
  - But the similarity does not always hold for memory; more on that in a minute

The Future (?)
- Manycore CPUs are currently being developed
- This could be a game changer: a local machine starts to look like a distributed system

What does this mean for the OS?
- Many more resources must be managed
- The OS must ensure that all CPUs cooperate
  - Example: what if two CPUs try to schedule the same process simultaneously?
- How do we identify CPUs?
  - Hardware must provide an identification interface
  - x86: each CPU is assigned a number at boot time (see the sketch after this list)
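
As a user-level illustration of CPU identification (a hedged, Linux-specific sketch; it queries the scheduler rather than the boot-time numbering itself):

    #define _GNU_SOURCE
    #include <sched.h>   /* sched_getcpu() */
    #include <stdio.h>

    int main(void)
    {
        /* Ask the kernel which CPU this thread is executing on right now.
           The answer can change at any moment unless the thread is pinned. */
        int cpu = sched_getcpu();
        if (cpu < 0) {
            perror("sched_getcpu");
            return 1;
        }
        printf("running on CPU %d\n", cpu);
        return 0;
    }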

Programming models
- What do we do with all these CPUs? Actually, we don't really know yet…
  - 6 cores are about as many as we can effectively use in a desktop environment
  - Still waiting for the killer app
- Some ideas…
  - Side core: dedicate entire cores to a single task
    - I/O core: dedicate an entire core to handling an I/O device
    - GUI core: dedicate an entire core to handling the GUI
  - Fine-grained parallelization of apps
    - Pretty difficult… how much parallelism is actually in an interactive task?
  - Virtual machines: run an entirely separate OS environment on dedicated cores

Dealing with devices
- Current I/O devices must generally be handled by a single core
  - Device interrupts are delivered to only one core
  - CPUs must coordinate access to the device controller
  - But this is changing
- Basic approach: dedicate a single core to I/O
  - All I/O requests are forwarded to one CPU core
  - Other cores queue up I/O requests that the I/O core then services
- Slightly more advanced approach: I/O devices are balanced across cores
  - E.g., one core handles the network, another core handles the disk
- Even more advanced approach: I/O devices are reassigned to the cores that are using them
  - Interrupts are routed to the core that is making the most I/O requests (see the sketch after this list)
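
A hedged, Linux-specific sketch of steering an interrupt to a particular core from user space, by writing a CPU bitmask to /proc/irq/<N>/smp_affinity (the IRQ number 24 below is a made-up example; root is required):

    #include <stdio.h>

    int main(void)
    {
        /* Route IRQ 24 (a hypothetical device interrupt) to CPU 2.
           The file takes a hex bitmask: bit n selects CPU n, so CPU 2 = 0x4. */
        FILE *f = fopen("/proc/irq/24/smp_affinity", "w");
        if (!f) {
            perror("fopen");
            return 1;
        }
        fprintf(f, "4\n");
        fclose(f);
        return 0;
    }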

Cross CPU Communication (Shared Memory)
- The OS must still track the state of the entire system
  - Global data structures are updated by each core
  - E.g., the system load average is computed from the load average of every core
- Traditional approach: a single copy of the data, protected by locks
  - Bad scalability: every CPU constantly takes a global lock just to update its own state
  - This is why Vista cannot scale past 32 cores
- Modern approach: replicate state across all CPUs/cores (see the sketch after this list)
  - Each core updates its own local copy (so NO locks!)
  - Contention only when the state is read: a global lock is required, but reads are rare
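
A minimal user-level sketch of the replicated-state idea, assuming a 64-byte cache line: each core updates only its own padded slot, and a (rare) reader sums the slots:

    #include <stdio.h>

    #define MAX_CPUS   64
    #define CACHE_LINE 64

    /* One counter per core, padded out to a full cache line so that
       updates by different cores never share a line (no false sharing). */
    struct percpu_counter {
        unsigned long count;
        char pad[CACHE_LINE - sizeof(unsigned long)];
    };

    static struct percpu_counter counters[MAX_CPUS];

    /* Fast path: a core bumps only its own slot; no lock is needed. */
    static void percpu_inc(int cpu)
    {
        counters[cpu].count++;
    }

    /* Slow path: a reader sums every slot; in a real kernel this is where
       a global lock against concurrent updates would live, but reads are rare. */
    static unsigned long percpu_total(int ncpus)
    {
        unsigned long total = 0;
        for (int i = 0; i < ncpus; i++)
            total += counters[i].count;
        return total;
    }

    int main(void)
    {
        percpu_inc(0);   /* pretend core 0 updated its state */
        percpu_inc(1);   /* pretend core 1 updated its state */
        printf("total = %lu\n", percpu_total(2));
        return 0;
    }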

Cross CPU Communication (Signals)
- The system allows CPUs to explicitly signal each other
  - Two approaches: notifications and cross calls
  - Almost always built on top of interrupts; x86: Inter-Processor Interrupts (IPIs)
- Notifications
  - A CPU is notified that "something" has happened, with no other information
  - Mostly used to wake up a remote CPU
- Cross calls (see the sketch after this list)
  - The target CPU jumps to a specified instruction: the source CPU makes a function call that executes on the target CPU
  - Synchronous or asynchronous? Can be both; it's up to the programmer
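
In the Linux kernel, a cross call can be made with smp_call_function_single(); a hedged kernel-side sketch (this only runs inside the kernel, e.g. in a module, not as a user program):

    #include <linux/smp.h>      /* smp_call_function_single() */
    #include <linux/printk.h>

    /* Runs on the *target* CPU, in interrupt (IPI) context. */
    static void remote_hello(void *info)
    {
        pr_info("hello from CPU %d\n", smp_processor_id());
    }

    static void cross_call_example(void)
    {
        /* Ask CPU 1 to execute remote_hello(). wait=1 makes the call
           synchronous: we block until the target CPU has finished. */
        smp_call_function_single(1, remote_hello, NULL, 1);
    }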

CPU interconnects
- The mechanism by which CPUs communicate
- Old way: Front Side Bus (FSB)
  - Slow, with limited scalability
  - With potentially 100s of CPUs in a system, a bus won't work
- Modern approach: exploit HPC networking techniques and embed a true interconnect into the system
  - Intel: QPI (QuickPath Interconnect)
  - AMD: HyperTransport
- Interconnects allow point-to-point communication
  - Multiple messages can be sent in parallel if they don't intersect

Interconnects and Memory
- Interconnects allow for complex message types
  - They can interface directly with memory
  - Memory controllers can be moved onto the CPU, so memory references no longer have to go through the Northbridge
- The definition of memory has become… less concrete
  - PCIe devices can handle memory operations
  - NVRAM and DRAM can exist in the same address space: is it a disk or is it main memory?

Multiprocessing and memory
- Shared memory is by far the most popular approach to multiprocessing
  - Each CPU can access all of the system's memory
  - Conflicting accesses are resolved via synchronization (locks); see the sketch after this list
- Benefits: easy to program, allows direct communication
- Disadvantages: limits scalability and performance
  - Requires more advanced caching behavior: systems contain a cache hierarchy with different scopes
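
A minimal user-level sketch of the shared memory model, with two threads standing in for two CPUs and a lock resolving their conflicting accesses (compile with -pthread):

    #include <pthread.h>
    #include <stdio.h>

    static long shared_counter = 0;   /* memory both "CPUs" can access */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        for (int i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&lock);    /* resolve conflicting accesses */
            shared_counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, worker, NULL);
        pthread_create(&b, NULL, worker, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("%ld\n", shared_counter);  /* always 2000000 with the lock */
        return 0;
    }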

Multiprocessor caching
- On multicore CPUs some (but not all) caches are shared
  - Each core has its own private L1 cache
  - The L2 cache can be either private to a core or shared between cores
  - The L3 cache is almost always shared between cores
  - Caches are not shared across physical CPU dies
- What if two CPUs update the same memory location stored in their L1 caches?
  - Shared memory systems require an absolute ordering of operations
  - Cache coherency ensures this ordering
  - It is implemented in hardware to ensure that memory updates are propagated throughout the entire system, using the CPU interconnect for communication

Memory Issues
- As core count increases, shared memory becomes harder
  - We already established that lock contention can kill performance and scalability
  - It is increasingly difficult for hardware to provide shared memory behavior to all CPU cores
- Example: manycore CPUs
  - To get to memory, a request has to cross other cores, so some cores are closer to memory and thus faster
- On current small-scale systems (8-16 cores) we are already seeing issues
  - Memory is slow or fast depending on which CPU is accessing it
  - This is called Non-Uniform Memory Access (NUMA)

Non-Uniform Memory Access
- Memory is organized in a non-uniform manner
  - It's closer to some CPUs than others
  - Far-away memory is slower than close memory
- Not required to be cache coherent, but usually is
  - ccNUMA: Cache Coherent NUMA
- The typical organization is to divide the system into "zones"
  - A zone usually contains a CPU socket/slot and a portion of the system memory
  - Memory is "local" if it's in the CPU's zone: fast to access (see the sketch after this list)
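
A hedged sketch of zone-aware allocation using Linux's libnuma (link with -lnuma); placing the buffer on node 0 is an arbitrary choice here:

    #include <numa.h>    /* libnuma; link with -lnuma */
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }
        /* Allocate 1 MB that is local to zone (node) 0, so CPUs in node 0
           get fast access and CPUs in other zones pay a remote-access penalty. */
        size_t len = 1024 * 1024;
        void *buf = numa_alloc_onnode(len, 0);
        if (!buf)
            return 1;
        /* ... use buf from a CPU in node 0 ... */
        numa_free(buf, len);
        return 0;
    }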

NUMA cont'd
- Accessing memory in the local zone does not impact performance in other zones
  - Recall: the interconnect is point-to-point
- This looks a lot like a distributed shared memory (DSM) system…
  - Local operations are fast, but if you go to another zone you take a performance hit
  - DSM died in the 90s because it couldn't scale and was hard to program
  - It is unclear whether NUMA will share that same fate

Dealing with NUMA
- Programming a NUMA system is hard; ultimately it's a failed abstraction
  - Goal: make all memory operations the same
  - But they aren't, because some are slower AND the abstraction hides the details
- Result: very few people explicitly design an application with NUMA support
  - Those that do are generally in the HPC community
- So it's up to the user and the OS to deal with it, but mostly people just ignore it…

Dealing with NUMA (users)
- Users can query the system for the NUMA layout:

    [jarusl@cambria ~]$ numactl --hardware
    available: 2 nodes (0-1)
    node 0 cpus: 0 2 3 4 5 6
    node 0 size: 8182 MB
    node 0 free: 7215 MB
    node 1 cpus: 1 7 8 9 10 11
    node 1 size: 8192 MB
    node 1 free: 7475 MB
    node distances:
    node   0   1
      0:  10  16
      1:  16  10

Dealing with NUMA (users)
- Users can then force the OS to confine a process to a specific zone
  - Restricts what memory the process gets allocated
  - Restricts which CPUs the process can run on (see the sketch after this list)
- Per process, via the command line: 'numactl --physcpubind=<cpus> <cmd>'
- For groups of processes, using scheduling domains
  - Linux: cgroups and containers
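
The same confinement can also be done programmatically; a minimal Linux sketch using sched_setaffinity() (the CPU IDs 0-5 are hypothetical, e.g. the CPUs of node 0 in the numactl output above):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        /* Confine the calling process to CPUs 0-5: the CPUs of one
           hypothetical zone; real IDs come from 'numactl --hardware'. */
        for (int cpu = 0; cpu <= 5; cpu++)
            CPU_SET(cpu, &mask);

        if (sched_setaffinity(0 /* this process */, sizeof(mask), &mask) < 0) {
            perror("sched_setaffinity");
            return 1;
        }
        /* From here on, the scheduler will only run us on CPUs 0-5. */
        return 0;
    }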

Dealing with NUMA (OS)
- An OS can deal with NUMA systems by restricting its own behavior
  - Force processes to always execute in one zone, and to always allocate memory from that same zone
  - This makes balancing resource utilization tricky
- However, nothing prevents an application from forcing bad behavior
  - E.g., two applications in separate zones want to communicate using shared memory…

Managing NUMA (OS)
- How can the OS know what zone a process should run in?
  - It needs to know what the process's behavior will be
  - The OS cannot know the future, but it can predict it based on past events
  - Recent OS X and Windows versions profile application behavior
- When should a process switch zones?
  - If it is communicating with a process in another zone
  - If the system load is currently imbalanced in one zone
  - If we can save power by shutting down a zone's CPUs
- How should we lay out process memory?
  - Keep all memory in a single zone, or just the working set?

Multiprocessing and Power
- More cores require more energy (and produce more heat)
  - Managing the energy consumption of a system is becoming critically important
  - Modern systems cannot fully utilize all resources for very long
- Approaches
  - Slow down processors periodically: CPUs are no longer identical (some faster, some slower)
  - Shut down entire cores: the system dynamically powers down CPUs (see the sketch after this list)
- The OS must deal with processors coming and going; this doesn't really match the SMP model anymore
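
On Linux, individual cores can be taken offline through the CPU-hotplug sysfs interface; a hedged sketch (CPU 3 is an arbitrary example; root is required, and CPU 0 typically cannot be taken offline):

    #include <stdio.h>

    int main(void)
    {
        /* Take CPU 3 offline; writing "1" instead brings it back online.
           The OS must then migrate work away from the vanished core. */
        FILE *f = fopen("/sys/devices/system/cpu/cpu3/online", "w");
        if (!f) {
            perror("fopen");
            return 1;
        }
        fputs("0\n", f);
        fclose(f);
        return 0;
    }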

Heterogeneous CPUs
- Systems are beginning to look much different: the SMP model is on its way out
- Heterogeneous computing resources across the system
  - Core specialization: CPU resources tailored to specific workloads
  - GPUs, lightweight cores, I/O cores, stream processors
- The OS must manage these dynamically: what to schedule where and when?
- How should the OS approach this issue? It's an active area of current research