Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead
Zoltan Majo and Thomas R. Gross, Department of Computer Science, ETH Zurich

Presentation transcript:

Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead. Zoltan Majo and Thomas R. Gross, Department of Computer Science, ETH Zurich.

NUMA multicores. [Diagram: Processor 0 and Processor 1, each with its own cores, cache, memory controller (MC), interconnect link (IC), and local DRAM memory.]

NUMA multicores. Two problems: NUMA: interconnect overhead; multicore: cache contention. [Diagram: processes A and B with their memory M_A and M_B on a two-processor system (Processor 0 and Processor 1), each processor with a cache, a memory controller (MC), and DRAM memory, linked by the interconnect (IC).]

Outline. NUMA: experimental evaluation. Scheduling: N-MASS; N-MASS evaluation.

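Multi-clone experiments: memory behavior of unrelated programs. Intel Xeon E[...]; clones of soplex (SPEC CPU2006), each run either as a local clone or as a remote clone. [Diagram: clones (C) and their memory (M) placed on Processor 0 and Processor 1, each with a cache, a memory controller (MC), and DRAM memory, linked by the interconnect (IC).]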

[Diagram: five schedules of the clones (C) relative to their memory (M), with local bandwidth of 100%, 80%, 57%, 32%, and 0%.]

Performance of schedules. Which is the best schedule? Baseline: single-program execution mode. [Diagram: a single clone (C) running next to its memory (M).]

Execution time. [Plot: slowdown relative to baseline for local clones, remote clones, and their average.]

Outline. NUMA: experimental evaluation. Scheduling: N-MASS; N-MASS evaluation.

N-MASS (NUMA-Multicore-Aware Scheduling Scheme). Two steps: Step 1: maximum-local mapping; Step 2: cache-aware refinement.

Step 1: Maximum-local mapping. [Diagram: processes A, B, C, and D and their memory M_A, M_B, M_C, and M_D on Processor 0 and Processor 1.]
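
The slide names the step but does not spell out a procedure, so the following is only a minimal sketch of what a maximum-local mapping could look like. The function name `maximum_local_mapping`, its inputs, and the rule for processes that do not fit on their home processor are all assumptions made for illustration, not the authors' algorithm.

```python
def maximum_local_mapping(memory_home, cores_per_processor):
    """Map each process to the processor holding its memory when a core is free.

    memory_home: dict of process name -> processor id that holds its memory
                 (hypothetical input format).
    cores_per_processor: dict of processor id -> number of available cores.
    Returns a dict of process name -> processor id.
    """
    free = dict(cores_per_processor)
    mapping = {}
    overflow = []

    # First pass: place every process locally while cores remain.
    for process in sorted(memory_home):
        home = memory_home[process]
        if free.get(home, 0) > 0:
            mapping[process] = home
            free[home] -= 1
        else:
            overflow.append(process)

    # Second pass (assumed rule): leftover processes go to the processor
    # with the most free cores and therefore run remotely.
    for process in overflow:
        target = max(free, key=lambda p: free[p])
        if free[target] == 0:
            raise ValueError("more processes than cores")
        mapping[process] = target
        free[target] -= 1

    return mapping


all_on_zero = maximum_local_mapping(
    {"A": 0, "B": 0, "C": 0, "D": 0}, {0: 2, 1: 2}
)
print(all_on_zero)
```

With all four memories on processor 0 and two cores per processor, only two processes can run locally; the other two end up on processor 1, which is the situation the refinement step then has to deal with.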

Default OS scheduling. [Diagram: processes A, B, C, and D placed on Processor 0 and Processor 1 without regard to the location of their memory M_A, M_B, M_C, and M_D.]

N-MASS (NUMA-Multicore-Aware Scheduling Scheme). Two steps: Step 1: maximum-local mapping; Step 2: cache-aware refinement.

Step 2: Cache-aware refinement. In an SMP: [Diagram: processes A, B, C, and D redistributed between the caches of Processor 0 and Processor 1; memory M_A, M_B, M_C, and M_D. Plot: performance degradation of A, B, C, and D, labeled with the NUMA penalty.]

Step 2: Cache-aware refinement. In a NUMA: [Diagram: processes A, B, C, and D redistributed between the caches of Processor 0 and Processor 1; memory M_A, M_B, M_C, and M_D. Plot: performance degradation of A, B, C, and D, labeled with the NUMA penalty and the NUMA allowance.]

Performance factors. Two factors cause performance degradation: 1. NUMA penalty: slowdown due to remote memory access. 2. Cache pressure: for local processes, misses / KINST (MPKI); for remote processes, MPKI x NUMA penalty. [Plot axis: NUMA penalty.]
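
The cache-pressure definition above can be written down directly. This is a small sketch under stated assumptions: the function names `mpki` and `cache_pressure` and the sample counter values are hypothetical, and the NUMA penalty is treated as a plain multiplicative slowdown factor.

```python
def mpki(cache_misses, instructions):
    """Cache misses per thousand instructions (misses / KINST)."""
    return cache_misses / (instructions / 1000.0)


def cache_pressure(mpki_value, numa_penalty, is_remote):
    """Cache pressure as defined on the slide.

    Local processes contribute their MPKI; remote processes contribute
    MPKI multiplied by their NUMA penalty.
    """
    if is_remote:
        return mpki_value * numa_penalty
    return mpki_value


# Hypothetical counter readings: 5000 misses over one million instructions.
sample_mpki = mpki(5000, 1_000_000)
local_pressure = cache_pressure(sample_mpki, numa_penalty=1.5, is_remote=False)
remote_pressure = cache_pressure(sample_mpki, numa_penalty=1.5, is_remote=True)
print(sample_mpki, local_pressure, remote_pressure)
```

The same process therefore weighs more heavily in the refinement step when it runs remotely than when it runs next to its memory.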

Implementation. User-mode extension to the Linux scheduler. Performance metrics: hardware performance counter feedback; NUMA penalty, obtained either as perfect information from program traces or as an estimate based on MPKI. All memory for a process is allocated on one processor.

Outline. NUMA: experimental evaluation. Scheduling: N-MASS; N-MASS evaluation.

Workloads. SPEC CPU2006 subset. 11 multi-program workloads (WL1 to WL11): 4-program workloads (WL1 to WL9) and 8-program workloads (WL10, WL11). [Plot: NUMA penalty of the programs, ranging from CPU-bound to memory-bound.]

Memory allocation setup. Where the memory of each process is allocated influences performance. Controlled setup: memory allocation maps.

Memory allocation maps. [Diagram: processes A, B, C, and D on Processor 0 and Processor 1, with memory M_A, M_B, M_C, and M_D placed according to two allocation maps: 0000 (unbalanced) and 0011 (balanced).]
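
One way to read the allocation maps is sketched below. It assumes, based only on the two examples shown, that digit i of the map names the processor holding the memory of the i-th process (A, B, C, D), and that a map is balanced when every processor holds the memory of the same number of processes. The helper names `parse_allocation_map` and `is_balanced` are hypothetical.

```python
from collections import Counter


def parse_allocation_map(allocation_map, processes="ABCD"):
    """Return process -> processor id for a map string such as "0011".

    Assumption: the i-th digit is the processor holding the memory of
    the i-th process.
    """
    if len(allocation_map) != len(processes):
        raise ValueError("one digit per process expected")
    return {p: int(d) for p, d in zip(processes, allocation_map)}


def is_balanced(allocation_map, num_processors=2):
    """True if every processor holds the memory of equally many processes."""
    counts = Counter(int(d) for d in allocation_map)
    per_processor = {counts.get(n, 0) for n in range(num_processors)}
    return len(per_processor) == 1


print(parse_allocation_map("0011"))
print(is_balanced("0000"), is_balanced("0011"))
```

Under this reading, map 0000 puts all four memories on Processor 0, and map 0011 splits them two and two.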

Evaluation. Baseline: Linux average. The Linux scheduler is non-deterministic, so the baseline is the average performance degradation over all possible cases. Compared against: N-MASS with perfect NUMA penalty information.

WL9: Linux average. [Plot: average slowdown relative to single-program mode.]

WL9: N-MASS. [Plot: average slowdown relative to single-program mode.]

WL1: Linux average and N-MASS. [Plot: average slowdown relative to single-program mode.]

N-MASS performance. N-MASS reduces performance degradation by up to 22%. Which factor is more important: interconnect overhead or cache contention? Compare: maximum-local mapping alone against full N-MASS (maximum-local mapping plus the cache refinement step).

Data locality vs. cache balancing (WL9). [Plot: performance improvement relative to Linux average.]

Data locality vs. cache balancing (WL1). [Plot: performance improvement relative to Linux average.]

Data locality vs. cache balancing. Data locality is more important than cache balancing. Cache balancing gives performance benefits mostly with unbalanced allocation maps. What if information about the NUMA penalty is not available?

Estimating NUMA penalty. The NUMA penalty is not directly measurable. Estimate: fit a linear regression onto MPKI data. [Plot axis: NUMA penalty.]
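
As an illustration of the estimate, the sketch below fits an ordinary least-squares line that predicts the NUMA penalty from MPKI. The slide does not give the regression details or any data, so the function names `fit_line` and `estimate_numa_penalty` are hypothetical and the sample points are invented purely to exercise the code.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_numa_penalty(mpki_value, slope, intercept):
    """Predict the NUMA penalty of a process from its measured MPKI."""
    return slope * mpki_value + intercept


# Invented training points (MPKI, NUMA penalty); not measurements.
train_mpki = [0.0, 10.0, 20.0, 30.0]
train_penalty = [1.0, 1.1, 1.2, 1.3]

slope, intercept = fit_line(train_mpki, train_penalty)
predicted = estimate_numa_penalty(15.0, slope, intercept)
print(slope, intercept, predicted)
```

At run time only the MPKI from the hardware performance counters is needed; the fitted line stands in for the trace-based penalty.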

Estimate-based N-MASS: performance. [Plot: performance improvement relative to Linux average.]

Conclusions. N-MASS: a NUMA-multicore-aware scheduler. Data locality optimizations are more beneficial than cache contention avoidance. Better performance metrics are needed for scheduling.

Thank you! Questions?