COMP60621 Concurrent Programming for Numerical Applications
Lecture 6: Chronos – a Dell Multicore Computer
Len Freeman, Graham Riley
Centre for Novel Computing, School of Computer Science, University of Manchester
November 2010

Overview
 - Processor
   - AMD Opteron quad-core processor ('Shanghai')
   - Chronos has four processors (i.e. 16 cores)
 - Cache structure
   - L1 and L2 cache per core
   - L3 cache shared between the four cores
 - Memory
   - 6GB (6 x 1GB memory modules) per processor, 24GB in total
 - Interconnect
   - AMD 'Direct Connect Architecture' (Coherent HyperTransport Technology)
   - No 'front-side bus', as found in some Intel platforms
 - Performance issues
 - Further information

Processor: Quad-Core AMD Opteron
Source: Quad-Core AMD Opteron Product Brief, www.amd.com

Processor – AMD Opteron 8378
 - 'Shanghai', 64-bit
 - 2.4GHz clock speed
 - Separate 64KB level-1 data and instruction caches per core
   - 2-way set associative, LRU replacement, exclusive
 - 512KB level-2 cache per core (exclusive, i.e. data in L1 does not need to be in other caches)
   - Unified (code and data)
   - 16-way set associative, pseudo-LRU replacement
 - 6144KB (6MB) level-3 cache per processor (can be inclusive)
   - Shared by the 4 cores
   - Unified
   - 64-way set associative, pseudo-LRU replacement
 - Cache line size is 64B (the 'unit of coherency')
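A quick sanity check on these figures: a minimal C sketch (using only the sizes, associativities and line size quoted above) that derives the number of sets in each cache from size = sets x ways x line size.

    #include <stdio.h>

    /* Number of sets = cache size / (line size * associativity).
       Figures are those quoted for the Opteron 'Shanghai' above. */
    static unsigned sets(unsigned size_bytes, unsigned line_bytes, unsigned ways)
    {
        return size_bytes / (line_bytes * ways);
    }

    int main(void)
    {
        const unsigned line = 64;                              /* 64B cache line */
        printf("L1: %u sets\n", sets(64 * 1024,   line, 2));   /* 64KB,  2-way   */
        printf("L2: %u sets\n", sets(512 * 1024,  line, 16));  /* 512KB, 16-way  */
        printf("L3: %u sets\n", sets(6144 * 1024, line, 64));  /* 6MB,   64-way  */
        return 0;
    }

This prints 512 sets for L1, 512 for L2 and 1536 for L3.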

AMD Opteron cache behaviour
 - L1 and L2 are exclusive caches: data is never in both; L2 holds data evicted from L1
   - On an L2 hit, the data is moved to L1 and removed from L2
   - L2 evicts data to L3
 - An access that would miss in L3 brings the data straight into L1
   - Only after eviction from L1 and then from L2 does data reach L3 (L2 and L3 are 'victim' caches)
 - If data is required in L1 again, L3 keeps a copy (inclusive behaviour) if the data is likely to be shared with other cores, but does not keep a copy if it is unlikely to be shared (exclusive behaviour)
 - Cache behaviour on the Opteron is therefore 'mostly exclusive'
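This movement of lines between levels can be hard to picture, so below is a deliberately over-simplified sketch (not AMD's actual protocol): it tracks a single line that lives in exactly one of memory, L3, L2 or L1, and ignores capacity, associativity and the sharing heuristic that lets L3 keep an inclusive copy.

    #include <stdio.h>

    /* In this toy model the line lives in exactly one place. */
    enum level { MEM, L3, L2, L1 };
    static const char *name[] = { "memory", "L3", "L2", "L1" };

    /* The core touches the line: wherever it is, it is (re)filled straight
       into L1, even on an L3 miss. */
    static enum level touch(enum level where)
    {
        printf("touch: %s -> L1\n", name[where]);
        return L1;
    }

    /* Eviction pushes the line down to the next victim cache. */
    static enum level evict(enum level where)
    {
        enum level next = (where == L1) ? L2 : (where == L2) ? L3 : MEM;
        printf("evict: %s -> %s\n", name[where], name[next]);
        return next;
    }

    int main(void)
    {
        enum level line = MEM;
        line = touch(line);   /* memory -> L1, bypassing L2 and L3 */
        line = evict(line);   /* L1 -> L2 (L2 is a victim cache)   */
        line = evict(line);   /* L2 -> L3 (L3 is a victim cache)   */
        line = touch(line);   /* L3 hit -> back into L1            */
        return 0;
    }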

AMD Opteron latencies
 - Getting data into the registers:
   - L1 access: 3 cycles, then 1 cycle per load (~1.5ns)
   - L2 access: 9 cycles beyond L1 (~4ns)
   - L3 access: 29 cycles at best (~13ns)
   - Local memory (read access): ~140ns (not directly related to CPU cycles!)
     - An average benchmarked figure obtained using, e.g., lmbench
 - On chronos, 1 CPU cycle is just under 0.42ns
 - Memory access time is approximate: it depends on how much work the memory system has to do to get the data, and on how 'busy' it is
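Latencies like these are usually measured with a dependent-load ('pointer chasing') loop, which is what lmbench does. The sketch below only illustrates the idea: the working-set size and iteration count are arbitrary choices, and the regular 64-byte stride is easily detected by the hardware prefetcher, so a real measurement (as in lmbench) randomises the chain.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define LINE   64                        /* Opteron cache line size             */
    #define NLINES (1 << 20)                 /* 64MB working set: misses all caches */
    #define ITERS  (10 * 1000 * 1000)

    int main(void)
    {
        char *buf = malloc((size_t)NLINES * LINE);
        size_t i, p = 0;
        struct timespec t0, t1;
        if (!buf) return 1;

        /* Each line stores the offset of the next line to visit, so every
           load depends on the previous one and latencies cannot overlap. */
        for (i = 0; i < NLINES; i++)
            *(size_t *)(buf + i * LINE) = ((i + 1) % NLINES) * LINE;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < ITERS; i++)
            p = *(size_t *)(buf + p);        /* dependent load chain */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("~%.1f ns per load (p = %zu)\n", ns / ITERS, p);
        free(buf);
        return 0;
    }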

AMD Opteron 4P server architecture
Source: AMD 4P Server and Workstation Comparison, www.amd.com

AMD quad-quad ccNUMA architecture
 - Each processor is directly connected to some of the system's memory
   - Each processor has its own memory controller
   - Bandwidth: 12.8GB/s (aggregate over two channels)
 - Processors are connected to each other with bi-directional Coherent HyperTransport Technology (HT):
   - Coherency unit is 64 bytes (i.e. the cache line size)
   - Up to 8.0GB/s per link (4GB/s in each direction)
   - 3 HT links per processor; usually 2 are used to connect to other processors and 1 for I/O (via a PCI bridge)
 - Separate memory and I/O paths
 - Compare with the front-side-bus architecture used by, e.g., Intel

Performance issues
 - Cores on the same processor access some of the system's memory (local memory) directly through the cache hierarchy
   - They can communicate with each other via the shared L3 cache
 - Cores on different processors access remote memory via the cHT (coherent HyperTransport) links, which maintain coherency of data in the L3 caches (and memory)
 - Access to remote memory may take 1 'hop' (to memory on the two other processors one cHT link away) or 2 'hops' (to memory on the fourth processor, two cHT links away)

AMD Opteron memory latencies
 - Local memory reads: 100% (base case)
 - Local memory writes: ~113%
 - 1-hop reads: ~108%
 - 2-hop reads: ~130%
 - 1-hop writes: ~128%
 - 2-hop writes: ~150%
 - Remember, data is placed in physical memory according to a 'first touch' policy: a page is allocated in the memory local to the thread that first touches it!
 - These are benchmarked figures: 1 thread, idle machine
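First-touch placement means that, in a threaded numerical code, shared arrays should be initialised by the same threads that will later compute on them, so that each thread's pages end up in its processor's local memory. A minimal OpenMP sketch of the idea (the array size and the update loop are illustrative only, not taken from the slides):

    #include <stdlib.h>
    #include <omp.h>

    #define N (1 << 24)

    int main(void)
    {
        double *a = malloc(N * sizeof *a);   /* pages are not placed yet */
        if (!a) return 1;

        /* First touch in parallel: each page is allocated in the memory local
           to the processor whose thread initialises it. The static schedule
           keeps the same thread-to-index mapping as the compute loop below. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] = 0.0;

        /* Compute loops using the same schedule then see mostly local (0-hop)
           accesses rather than 1- or 2-hop remote accesses. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] = 2.0 * a[i] + 1.0;

        free(a);
        return 0;
    }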

Further information
 - See www.amd.com; follow Products and Technologies -> Server Products -> Server Processors:
   - Product Brief
   - Key Architectural Features
   - Direct Connect Architecture
   - HyperTransport Technology
   - Quad-Core AMD Opteron Processor 4P Server and Workstation Comparison
 - Another useful, though slightly older, document is:
   - Performance Guidelines for AMD Athlon and Opteron ccNUMA Multiprocessor Systems. Available at: tech_docs/40555.pdf

Information on chronos
 - Look in files such as:
   - /proc/cpuinfo
   - /proc/meminfo
   - /sys/devices/system/cpu/cpu0/cache/index0 to index3
 - From the information in /proc/cpuinfo you can create a map from the logical processor ids (in the range [0-15], one per core) to physical processor ids [0-3] and (physical) core ids [0-3]
 - You should do this! (One way to do it is sketched below.)
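A possible way to build that map, as a sketch only: the /proc/cpuinfo field names 'processor', 'physical id' and 'core id' are the standard ones on Linux/x86, and error handling is minimal.

    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/cpuinfo", "r");
        char line[256];
        int proc = -1, phys = -1, core = -1;

        if (!f) { perror("/proc/cpuinfo"); return 1; }

        /* Each logical CPU's block lists "processor", then "physical id",
           then "core id"; print the mapping once the core id is seen. */
        while (fgets(line, sizeof line, f)) {
            if (sscanf(line, "processor : %d", &proc) == 1) continue;
            if (sscanf(line, "physical id : %d", &phys) == 1) continue;
            if (sscanf(line, "core id : %d", &core) == 1)
                printf("logical cpu %2d -> physical id %d, core id %d\n",
                       proc, phys, core);
        }
        fclose(f);
        return 0;
    }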

Results of vec.f on chronos
[Plot of performance (Mflop/s) against log10 N (bytes); the L1 = 64KB, L2 = 512KB and L3 = 6MB cache sizes are marked on the N axis]