NumaConnect Technology. Steffen Persvold, Chief Architect. OPM Meeting, 11.-12. March 2015.

Technology Background: Dolphin's Low Latency Clustering HW, Dolphin's Cache Chip
- Convex Exemplar (acquired by HP): first implementation of the ccNUMA architecture, from Dolphin in 1994
- Data General Aviion (acquired by EMC): designed in 1996, delivered with Dolphin chipsets through 3 generations of Intel processor/memory buses
- I/O Attached Products for Clustering OEMs: Sun Microsystems (SunCluster), Siemens RM600 Server (I/O Expansion), Siemens Medical (3D CT), Philips Medical (3D Ultra Sound), Dassault/Thales Rafale
- HPC Clusters (WulfKit w. Scali): first low latency cluster interconnect
(Image: Convex Exemplar Supercomputer)

NumaChip-1: IBM Microelectronics ASIC; FCPBGA1023 package, 33mm x 33mm, 1mm ball pitch; IBM Cu-11 technology, ~2 million gates; chip size 9x11mm.

NumaConnect-1 Card

SMP - Symmetrical: SMP used to mean "Symmetrical Multi Processor". [Diagram: several CPUs and I/O sharing one memory over a common bus.]

SMP – Symmetrical, but… caches complicate things: Cache Coherency (CC) is required. [Diagram: CPUs with caches sharing memory over a common bus with I/O.]

NUMA – Non Uniform Memory Access: nodes of CPUs, caches and local memory are connected by point-to-point links. Access to memory controlled by another CPU is slower, so memory accesses are Non Uniform (NUMA). [Diagram: CPU/cache/memory nodes with I/O, joined by point-to-point links.]
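To make the non-uniform access cost concrete, here is a minimal user-space sketch (not from the slides) that uses libnuma to place one buffer on the local NUMA node and one on another node; the node choice, buffer size and build line are illustrative assumptions. Timing the two memset loops on a multi-node machine exposes the latency gap described above.

// Minimal sketch (not from the slides): allocate one buffer on the local NUMA
// node and one on another node with libnuma, then touch both. Timing the two
// memsets on a NUMA machine shows the non-uniform access cost.
// Assumed build line: g++ -O2 numa_demo.cpp -lnuma
#include <numa.h>
#include <sched.h>
#include <cstdio>
#include <cstring>

int main() {
    if (numa_available() < 0) {                       // libnuma unusable here
        std::printf("NUMA not available\n");
        return 1;
    }
    const size_t bytes = 256UL << 20;                 // 256 MiB, illustrative size
    int local  = numa_node_of_cpu(sched_getcpu());    // node running this thread
    int remote = (local + 1) % numa_num_configured_nodes();

    char* near_buf = static_cast<char*>(numa_alloc_onnode(bytes, local));
    char* far_buf  = static_cast<char*>(numa_alloc_onnode(bytes, remote));
    if (!near_buf || !far_buf) return 1;

    std::memset(near_buf, 0, bytes);                  // local memory accesses
    std::memset(far_buf, 0, bytes);                   // remote accesses over the fabric

    numa_free(near_buf, bytes);
    numa_free(far_buf, bytes);
    return 0;
}

On a NumaConnect system the remote buffer would be reached over the NumaChip fabric, which is the traffic the NumaCache on later slides is meant to absorb.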

SMP - Shared MP - CC-NUMA: shared memory with cache coherence, but with non-uniform access times. [Diagram sequence over three slides: the same CC-NUMA system, animating a data item ("Foo") being accessed across nodes.]

Numascale System Architecture: shared everything, one single operating system image. Each node (CPUs, caches, memory, I/O) attaches through a NumaChip with its NumaCache to the NumaConnect fabric, which provides on-chip distributed switching. [Diagram: several nodes joined by NumaChips over the NumaConnect fabric.]

NumaConnect™ Node Configuration: two multi-core CPUs, each with four local memory channels, and an I/O bridge are connected over Coherent HyperTransport to the NumaChip; the NumaChip has its own memory holding the NumaCache and tags, and drives six x4 SERDES fabric links. [Diagram: node block diagram.]

NumaConnect™ System Architecture: six external links allow flexible system configurations in multi-dimensional topologies, such as a 2-D or 3-D torus of multi-CPU nodes. [Diagram: the node block diagram from the previous slide, plus 2-D and 3-D torus topologies.]

NumaChip-1 Block Diagram. [Diagram: ccHT cave (Coherent HyperTransport interface), H2S, SCC, crossbar switch with link controllers (LCs) and SERDES for the six fabric links (XA, XB, YA, YB, ZA, ZB), SDRAM cache and SDRAM tag interfaces, CSRs, SM, and an SPI init module holding config data and microcode.]

2-D Dataflow. [Diagram: request and response paths between three nodes (CPUs/caches with memory and a NumaChip) across the 2-D fabric.]

The Memory Hierarchy

LMbench - Chart. [Chart: LMbench memory latency results across the memory hierarchy.]

Principal Operation – L1 Cache Hit: the access is satisfied in the core's L1 cache. [Diagram: two nodes, each with CPU cores, L1&L2 caches, L3 cache, memory controller and memory, HT interface, and a NumaChip with NumaCache, linked to other nodes in the same dimension.]

Principal Operation – L2 Cache Hit: the access is satisfied in the core's L2 cache. [Diagram: same two-node hierarchy as above.]

Principal Operation – L3 Cache Hit: the access is satisfied in the shared L3 cache. [Diagram: same two-node hierarchy as above.]

Principal Operation – Local Memory: local memory access, with an HT probe for shared data. [Diagram: same two-node hierarchy as above.]

Principal Operation – NumaCache Hit: remote memory access that hits in the local NumaCache. [Diagram: same two-node hierarchy as above.]

Principal Operation – Remote Memory: remote memory access that misses the NumaCache and crosses the fabric to the home node. [Diagram: same two-node hierarchy as above.] The full lookup order across these six cases is summarized in the sketch below.
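The sketch below is an illustrative software model of that lookup order, not the hardware implementation; the struct fields and function names are invented for the example.

// Illustrative model only (not the hardware implementation): the order in
// which a load is satisfied in a NumaConnect system, as walked through on
// the "Principal Operation" slides.
#include <cstdio>

enum class ServicedBy {
    L1, L2, L3,          // on-chip CPU caches
    LocalMemory,         // local DRAM (HT probe for shared data)
    NumaCache,           // remote address, but cached in the local NumaCache
    RemoteMemory         // remote address, NumaCache miss: cross the fabric
};

struct Access {
    bool l1_hit, l2_hit, l3_hit;   // CPU cache lookups
    bool local_address;            // does the address map to local memory?
    bool numacache_hit;            // for remote addresses only
};

ServicedBy service(const Access& a) {
    if (a.l1_hit) return ServicedBy::L1;
    if (a.l2_hit) return ServicedBy::L2;
    if (a.l3_hit) return ServicedBy::L3;
    if (a.local_address) return ServicedBy::LocalMemory;
    if (a.numacache_hit) return ServicedBy::NumaCache;
    return ServicedBy::RemoteMemory;
}

int main() {
    // A remote address that misses all CPU caches but hits the NumaCache.
    Access a{false, false, false, false, true};
    std::printf("%d\n", static_cast<int>(service(a)));  // prints 4 (NumaCache)
    return 0;
}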

NumaChip Features:
- Converts between snoop-based (broadcast) and directory-based coherency protocols
- Write-back to NumaCache
- Coherent and non-coherent memory transactions
- Distributed coherent memory controller
- Pipelined memory access (16 outstanding transactions)
- NumaCache size up to 8 GBytes/node; current boards support up to 4 GBytes of NumaCache

Scale-up Capacity: single system image or multiple partitions. Limits: physical address space (terabytes), number of nodes, number of cores. Coherency is maintained per cache line (64 bytes). The largest and most cost-effective coherent shared memory.

NumaConnect in Supermicro

Cabling Example

5,184 cores, 20.7 TBytes: 108 nodes in a 6 x 6 x 3 torus, 5,184 CPU cores, 58 TFlops, 20.7 TBytes of shared memory, single image OS.

OPM ”cpchop” scaling Steffen Persvold, Chief Architect OPM Meeting, March 2015

What is scaling? In High Performance Computing there are two common notions of scaling:
Strong scaling: how the solution time varies with the number of processors for a fixed total problem size. Efficiency on N processors: \eta_N = t_1 / (N \cdot t_N) \cdot 100\%, where t_1 is the time on one processor and t_N the time on N processors.
Weak scaling: how the solution time varies with the number of processors for a fixed problem size per processor. Efficiency: \eta_N = (t_1 / t_N) \cdot 100\%.
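As a worked example (not from the slides), both efficiencies can be computed directly from measured wall-clock times; the numbers in the comments are made up for illustration.

#include <cstdio>

// Strong scaling efficiency: fixed total problem size.
// t1 = time on 1 processor, tN = time on N processors.
double strong_efficiency(double t1, double tN, int N) {
    return t1 / (N * tN) * 100.0;
}

// Weak scaling efficiency: fixed problem size per processor.
double weak_efficiency(double t1, double tN) {
    return t1 / tN * 100.0;
}

int main() {
    // Illustrative numbers: 100 s on 1 core vs 7 s on 16 cores (strong case),
    // and 100 s vs 110 s on 16 cores with 16x the work (weak case).
    std::printf("strong: %.1f %%\n", strong_efficiency(100.0, 7.0, 16));  // ~89.3 %
    std::printf("weak:   %.1f %%\n", weak_efficiency(100.0, 110.0));      // ~90.9 %
    return 0;
}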

OPM – "cpchop": What was done? 3 weeks to enable "cpchop" scalability.
Initial state (Jan '14): no scaling beyond 4 threads on a single server node.
A few changes after code analysis enabled scalability:
- Removed #pragma omp critical sections from opm-porsol (needed fixes to dune-istl/dune-common) (Arne Morten)
- Removed excessive verbose printouts (not so much for scaling as for "cleaner" multithreaded output)
- Made sure thread context was allocated locally per thread
- Created local copies of the parsed input
- Changed the Dune::Timer class to use clock_gettime(CLOCK_MONOTONIC, &now) instead of std::clock() and getrusage(), avoiding kernel spinlock calls (see the sketch after this list)
- When building UMFPACK, use -DNO_TIMING in the configuration, or modify the code to use calls without spinlocks
- Reduced excessive use of malloc/free by setting the environment variables MALLOC_TRIM_THRESHOLD_=-1, MALLOC_MMAP_MAX_=0, and MALLOC_TOP_PAD_ set to 500 MB
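A minimal sketch of the timer change mentioned above; this is a standalone stand-in, not the actual Dune::Timer code. On Linux, clock_gettime(CLOCK_MONOTONIC) is typically serviced through the vDSO without entering the kernel, whereas the std::clock()/getrusage() path, as the slide notes, involves kernel spinlock calls.

// Minimal standalone sketch (assumption: not the actual Dune::Timer code) of a
// wall-clock timer based on clock_gettime(CLOCK_MONOTONIC), replacing the
// std::clock()/getrusage() approach described above.
#include <time.h>

class MonotonicTimer {
public:
    MonotonicTimer() { reset(); }

    void reset() { clock_gettime(CLOCK_MONOTONIC, &start_); }

    // Elapsed wall-clock seconds since construction or the last reset().
    double elapsed() const {
        timespec now;
        clock_gettime(CLOCK_MONOTONIC, &now);
        return (now.tv_sec - start_.tv_sec) + (now.tv_nsec - start_.tv_nsec) * 1e-9;
    }

private:
    timespec start_;
};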

OPM – "cpchop": Making local object copies

diff --git a/examples/cpchop.cpp b/examples/cpchop.cpp
index 212ac53..da52c6a
--- a/examples/cpchop.cpp
+++ b/examples/cpchop.cpp
@@ -617,7 +617,16 @@ try
 #ifdef HAVE_OPENMP
     threads = omp_get_max_threads();
 #endif
-    std::vector<ChopThreadContext> ctx(threads);
+    std::vector<ChopThreadContext*> ctx(threads);
+
+#pragma omp parallel for schedule(static)
+    for (int i = 0; i < threads; ++i) {
+        int thread = 0;
+#ifdef HAVE_OPENMP
+        thread = omp_get_thread_num();
+#endif
+        ctx[thread] = new ChopThreadContext();
+    }
 
     // draw the random numbers up front to ensure consistency with or without threads
     std::vector ris(settings.subsamples);

OPM – "cpchop": Distribute input

@@ -638,6 +643,42 @@ try
         rzs[j] = rz();
     }
 
+#ifdef HAVE_HWLOC
+    Dune::Timer watch1;
+    Numa::Topology numatopology;
+
+    // Create a local copy of the CornerPointChopper object, one per numa board used
+    int boards = numatopology.num_boards();
+
+    std::vector<Opm::CornerPointChopper*> ch_local(boards);
+    std::atomic<bool> have_ch_local[boards];
+    for (int i = 0; i < boards; ++i) {
+        have_ch_local[i] = false;
+    }
+
+    // Assign the master instance into our pointer vector
+    {
+        int board = numatopology.get_current_board();
+        ch_local[board] = &ch;
+        have_ch_local[board] = true;
+    }

OPM – "cpchop": Distribute input, cont'd

+
+    std::cout << "Distributing input to boards" << std::endl;
+    int distributed = 0;
+
+#pragma omp parallel for schedule(static) reduction(+:distributed)
+    for (int i = 0; i < threads; ++i) {
+        int board = numatopology.get_current_board();
+
+        if (!have_ch_local[board].exchange(true)) {
+            ch_local[board] = new Opm::CornerPointChopper(ch);
+            distributed++;
+        }
+    }
+
+    std::cout << "Distribution to " << distributed << " boards took " << watch1.elapsed() << " seconds" << std::endl;
+#endif // HAVE_HWLOC
 
     double init_elapsed = watch.elapsed();
     watch.reset();
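The Numa::Topology helper used in the diff is not shown on the slides (the real code is hwloc-based, guarded by HAVE_HWLOC). Below is a hypothetical stand-in with the same two methods, using libnuma and sched_getcpu() for brevity; everything beyond the class and method names taken from the diff is an assumption.

// Hypothetical stand-in for the Numa::Topology helper used in the diff above.
// The real helper is hwloc-based; this sketch uses libnuma + sched_getcpu()
// to answer the same two questions. Assumed build flags: -lnuma
#include <numa.h>
#include <sched.h>

namespace Numa {

class Topology {
public:
    Topology()
        : boards_(numa_available() < 0 ? 1 : numa_num_configured_nodes()) {}

    // Number of NUMA nodes ("boards") in the machine.
    int num_boards() const { return boards_; }

    // NUMA node the calling thread is currently running on.
    int get_current_board() const {
        return boards_ == 1 ? 0 : numa_node_of_cpu(sched_getcpu());
    }

private:
    int boards_;
};

} // namespace Numa

With such a helper, the parallel loop lets the first thread that runs on each board copy the parsed CornerPointChopper input to that board, so subsequent reads from every thread stay in board-local memory.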

OPM – “cpchop” scaling