Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead. Zoltan Majo and Thomas R. Gross, Department of Computer Science, ETH Zurich.


Presentation transcript:

Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead Zoltan Majo and Thomas R. Gross Department of Computer Science ETH Zurich 1

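NUMA multicores [Diagram: Processor 0 and Processor 1, each with cores, a cache, a memory controller (MC), and local DRAM memory; the two processors are connected by an interconnect (IC); cores are numbered 0 to 3] 2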

NUMA multicores Two problems: – NUMA: interconnect overhead [Diagram: processes A and B and their memory M_A and M_B on the two-processor system] 3

NUMA multicores Two problems: – NUMA: interconnect overhead – multicore: cache contention [Diagram: processes A and B and their memory M_A and M_B on the two-processor system] 4

Outline NUMA: experimental evaluation Scheduling – N-MASS – N-MASS evaluation 5

Multi-clone experiments: memory behavior of unrelated programs. Intel Xeon E clones of soplex (SPEC CPU2006) – local clone – remote clone [Diagram: clones (C) on the cores of Processor 0 and Processor 1 and their memory (M) in DRAM] 6

[Diagram: five schedules of the clones (C) across the two caches, with their memory (M) in DRAM. Local bandwidth per schedule: 100%, 80%, 57%, 32%, 0%]
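
One way to read the "local bandwidth" figures is as the share of total memory bandwidth that is served without crossing the interconnect. The following is a minimal sketch under that assumption; the function name and the per-clone bandwidth numbers are hypothetical and are not taken from the slides.

```python
def local_bandwidth_share(bandwidth_by_clone, is_local):
    """Return the percentage of total memory bandwidth that is served locally.

    bandwidth_by_clone: dict mapping clone name -> measured bandwidth (any unit)
    is_local: dict mapping clone name -> True if the clone runs on the
              processor that holds its memory
    """
    total = sum(bandwidth_by_clone.values())
    if total == 0:
        return 0.0
    local = sum(bw for name, bw in bandwidth_by_clone.items() if is_local[name])
    return 100.0 * local / total


# Hypothetical measurements: local clones achieve more bandwidth than remote ones,
# so two local clones out of four yield more than 50% local bandwidth.
bandwidth = {"c0": 3.0, "c1": 3.0, "c2": 2.0, "c3": 2.0}
placement = {"c0": True, "c1": True, "c2": False, "c3": False}
share = local_bandwidth_share(bandwidth, placement)
```

Under this reading, the share need not move in steps of 25% as clones become remote, because local and remote clones do not achieve the same bandwidth.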

Performance of schedules Which is the best schedule? Baseline: single-program execution mode [Diagram: a single clone (C) with its memory (M)] 8

Execution time [Chart: slowdown relative to baseline for local clones, remote clones, and the average] 9
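
The slowdown metric can be sketched as the ratio of the execution time in the multi-program run to the execution time in single-program mode. The timings below are invented for illustration, and the helper names are assumptions rather than anything from the talk.

```python
def slowdown(multi_program_time, single_program_time):
    """Slowdown relative to the single-program baseline (1.0 means no slowdown)."""
    if single_program_time <= 0:
        raise ValueError("baseline time must be positive")
    return multi_program_time / single_program_time


def average_slowdown(multi_times, baseline_times):
    """Arithmetic mean of the per-program slowdowns."""
    ratios = [slowdown(multi_times[name], baseline_times[name]) for name in multi_times]
    return sum(ratios) / len(ratios)


# Hypothetical execution times in seconds.
baseline = {"local": 100.0, "remote": 100.0}
measured = {"local": 120.0, "remote": 150.0}
avg = average_slowdown(measured, baseline)
```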

Outline NUMA: experimental evaluation Scheduling – N-MASS – N-MASS evaluation 10

N-MASS (NUMA-Multicore-Aware Scheduling Scheme) Two steps: – Step 1: maximum-local mapping – Step 2: cache-aware refinement 11

Step 1: Maximum-local mapping [Diagram: processes A, B, C, D and their memory M_A, M_B, M_C, M_D on Processor 0 and Processor 1] 12
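
A maximum-local mapping can be sketched as a greedy placement: each process goes to the processor that holds its memory while that processor still has a free core, and the remaining processes take whatever cores are left. This is an illustrative sketch, not the authors' implementation. The `priority` argument (for example, a per-process NUMA penalty) and the tie-breaking rule are assumptions.

```python
def maximum_local_mapping(home_node, cores_per_node, priority=None):
    """Map processes to processors, keeping as many as possible local.

    home_node: dict process -> processor that holds the process's memory
    cores_per_node: dict processor -> number of cores
    priority: optional dict process -> weight; higher-weight processes are
              placed first (assumption: e.g. their NUMA penalty)
    """
    weights = priority or {}
    free = dict(cores_per_node)
    order = sorted(home_node, key=lambda p: -weights.get(p, 0.0))
    mapping, overflow = {}, []
    for process in order:
        node = home_node[process]
        if free.get(node, 0) > 0:      # a local core is still available
            mapping[process] = node
            free[node] -= 1
        else:                          # must run remotely; decide later
            overflow.append(process)
    for process in overflow:
        node = max(free, key=free.get)  # processor with the most free cores
        if free[node] == 0:
            raise ValueError("more processes than cores")
        mapping[process] = node
        free[node] -= 1
    return mapping


# Hypothetical example: three processes have memory on processor 0, one on processor 1.
home = {"A": 0, "B": 0, "C": 0, "D": 1}
penalty = {"A": 1.1, "B": 1.4, "C": 1.3, "D": 1.0}
mapping = maximum_local_mapping(home, {0: 2, 1: 2}, penalty)
```

In this example B and C keep their local cores because they carry the highest weights, D stays local on processor 1, and A is the only process that runs remotely.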

Default OS scheduling [Diagram: placement of processes A, B, C, D and their memory M_A, M_B, M_C, M_D on Processor 0 and Processor 1 under the default OS scheduler] 13

N-MASS (NUMA-Multicore-Aware Scheduling Scheme) Two steps: – Step 1: maximum-local mapping – Step 2: cache-aware refinement 14

Step 2: Cache-aware refinement In an SMP: [Diagram: processes A, B, C, D and their memory M_A, M_B, M_C, M_D on Processor 0 and Processor 1] 15

Step 2: Cache-aware refinement In an SMP: [Diagram: processes A, B, C, D and their memory M_A, M_B, M_C, M_D on Processor 0 and Processor 1] 16

Step 2: Cache-aware refinement In an SMP: [Diagram: processes A, B, C, D and their memory M_A, M_B, M_C, M_D on Processor 0 and Processor 1. Chart: performance degradation of A, B, C, D, with the NUMA penalty marked] 17

Step 2: Cache-aware refinement In a NUMA: [Diagram: processes A, B, C, D and their memory M_A, M_B, M_C, M_D on Processor 0 and Processor 1] 18

Step 2: Cache-aware refinement In a NUMA: [Diagram: processes A, B, C, D and their memory M_A, M_B, M_C, M_D on Processor 0 and Processor 1] 19

Step 2: Cache-aware refinement In a NUMA: [Diagram: processes A, B, C, D and their memory M_A, M_B, M_C, M_D on Processor 0 and Processor 1. Chart: performance degradation of A, B, C, D, with the NUMA penalty and the NUMA allowance marked] 20

Performance factors Two factors cause performance degradation: 1. NUMA penalty: slowdown due to remote memory access 2. Cache pressure: – local processes: misses / KINST (MPKI) – remote processes: MPKI × NUMA penalty 21
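
The two cache-pressure rules can be written down directly: a local process contributes its MPKI, and a remote process contributes its MPKI multiplied by its NUMA penalty. The sketch below sums these contributions per processor. The data layout, the function names, and the numbers are assumptions made for illustration.

```python
def process_pressure(mpki, numa_penalty, is_remote):
    """Cache pressure of one process: MPKI if local, MPKI x NUMA penalty if remote."""
    return mpki * numa_penalty if is_remote else mpki


def cache_pressure(processes, mapping, home_node):
    """Total cache pressure per processor.

    processes: dict process -> (mpki, numa_penalty)
    mapping: dict process -> processor the process runs on
    home_node: dict process -> processor that holds the process's memory
    """
    pressure = {}
    for name, (mpki, penalty) in processes.items():
        node = mapping[name]
        remote = node != home_node[name]
        pressure[node] = pressure.get(node, 0.0) + process_pressure(mpki, penalty, remote)
    return pressure


# Hypothetical workload: all memory on processor 0, two processes run remotely.
procs = {"A": (20.0, 1.2), "B": (10.0, 1.5), "C": (5.0, 1.1), "D": (1.0, 1.0)}
home = {"A": 0, "B": 0, "C": 0, "D": 0}
placement = {"A": 0, "B": 0, "C": 1, "D": 1}
pressure = cache_pressure(procs, placement, home)
```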

Implementation User-mode extension to the Linux scheduler. Performance metrics: – hardware performance counter feedback – NUMA penalty: perfect information from program traces, or an estimate based on MPKI. All memory for a process is allocated on one processor 22
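
MPKI (misses per thousand instructions) can be derived from two hardware counter readings taken at the start and end of a sampling interval. The sketch below assumes the counters are already available as plain integers; how they are read (for example, which counter interface is used) is outside its scope, and the function names are hypothetical.

```python
def mpki(cache_misses, instructions):
    """Cache misses per thousand retired instructions."""
    if instructions <= 0:
        raise ValueError("instruction count must be positive")
    return cache_misses / (instructions / 1000.0)


def mpki_between(start, end):
    """MPKI over an interval, given two (misses, instructions) counter samples."""
    misses = end[0] - start[0]
    instructions = end[1] - start[1]
    return mpki(misses, instructions)


# Hypothetical counter samples: (cache misses, retired instructions).
sample_start = (1_000, 2_000_000)
sample_end = (51_000, 12_000_000)
interval_mpki = mpki_between(sample_start, sample_end)
```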

Outline NUMA: experimental evaluation Scheduling – N-MASS – N-MASS evaluation 23

Workloads SPEC CPU2006 subset. 11 multi-program workloads (WL1–WL11): – 4-program workloads (WL1–WL9) – 8-program workloads (WL10, WL11) [Chart: NUMA penalty of the programs, from CPU-bound to memory-bound] 24

Memory allocation setup Where the memory of each process is allocated influences performance Controlled setup: memory allocation maps 25

Memory allocation maps [Diagram: processes A, B, C, D on Processor 0 and Processor 1; with allocation map 0000, M_A, M_B, M_C, M_D are all in the DRAM of Processor 0] 26

Memory allocation maps [Diagram: two allocation maps for processes A, B, C, D. Allocation map 0000: M_A, M_B, M_C, M_D all on Processor 0. Allocation map 0011: M_A, M_B on Processor 0 and M_C, M_D on Processor 1] 27

Memory allocation maps [Diagram: two allocation maps for processes A, B, C, D. Allocation map 0000 (unbalanced): M_A, M_B, M_C, M_D all on Processor 0. Allocation map 0011 (balanced): M_A, M_B on Processor 0 and M_C, M_D on Processor 1] 28
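
An allocation map such as 0000 or 0011 can be decoded as one digit per process, where each digit names the processor that holds that process's memory. The sketch below assumes this reading and treats a map as balanced when every processor holds the memory of the same number of processes. Both helper names are hypothetical.

```python
def decode_allocation_map(allocation_map, processes="ABCD"):
    """Turn a map such as "0011" into {process: processor holding its memory}."""
    if len(allocation_map) != len(processes):
        raise ValueError("one digit per process is required")
    return {process: int(digit) for process, digit in zip(processes, allocation_map)}


def is_balanced(allocation_map, num_processors=2):
    """True if every processor holds the memory of the same number of processes."""
    counts = [allocation_map.count(str(node)) for node in range(num_processors)]
    return max(counts) == min(counts)


unbalanced = decode_allocation_map("0000")
balanced = decode_allocation_map("0011")
```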

Evaluation Baseline: Linux average – the Linux scheduler is non-deterministic – the baseline is the average performance degradation over all possible cases. Compared against: N-MASS with perfect NUMA penalty information 29
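
An average over all possible cases can be sketched by enumerating every assignment of processes to cores and averaging a degradation value over them. The sketch assumes that each core-level placement is equally likely, which is a simplification of real scheduler behavior. The degradation function is a toy stand-in, not a measured model.

```python
from itertools import permutations


def core_level_schedules(processes, core_nodes):
    """Yield every placement of processes on distinct cores.

    core_nodes: list with one entry per core, giving the processor of that core
    Each yielded schedule maps process -> processor.
    """
    for placement in permutations(core_nodes, len(processes)):
        yield dict(zip(processes, placement))


def average_degradation(processes, core_nodes, degradation):
    """Mean of degradation(schedule) over all core-level placements."""
    values = [degradation(s) for s in core_level_schedules(processes, core_nodes)]
    if not values:
        raise ValueError("more processes than cores")
    return sum(values) / len(values)


# Toy model (assumption): each remote process adds 0.1 to the degradation.
home_node = {"A": 0, "B": 0, "C": 0, "D": 0}


def toy_degradation(schedule):
    remote = sum(1 for p, node in schedule.items() if node != home_node[p])
    return 0.1 * remote


linux_average = average_degradation("ABCD", [0, 0, 1, 1], toy_degradation)
```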

WL9: Linux average [Chart: average slowdown relative to single-program mode] 30

WL9: N-MASS [Chart: average slowdown relative to single-program mode] 31

WL1: Linux average and N-MASS [Chart: average slowdown relative to single-program mode] 32

N-MASS performance N-MASS reduces performance degradation by up to 22%. Which factor is more important: interconnect overhead or cache contention? Compare: – maximum-local – N-MASS (maximum-local + cache refinement step) 33

Data locality vs. cache balancing (WL9) [Chart: performance improvement relative to Linux average] 34

Data locality vs. cache balancing (WL1) [Chart: performance improvement relative to Linux average] 35

Data locality vs. cache balancing Data locality is more important than cache balancing. Cache balancing gives performance benefits mostly with unbalanced allocation maps. What if information about the NUMA penalty is not available? 36

Estimating NUMA penalty NUMA penalty is not directly measurable. Estimate: fit linear regression onto MPKI data [Chart: NUMA penalty plotted against MPKI] 37
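
The regression step can be sketched as an ordinary least-squares line fitted to (MPKI, NUMA penalty) pairs and then used to predict the penalty of a process from its MPKI alone. The training points below are invented for illustration. The slides do not state which fitting procedure or data set the authors used.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_numa_penalty(mpki, slope, intercept):
    """Predict the NUMA penalty of a process from its MPKI."""
    return slope * mpki + intercept


# Hypothetical training data: MPKI and the corresponding measured NUMA penalty.
train_mpki = [0.0, 10.0, 20.0, 30.0]
train_penalty = [1.0, 1.1, 1.2, 1.3]
slope, intercept = fit_line(train_mpki, train_penalty)
predicted = estimate_numa_penalty(25.0, slope, intercept)
```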

Estimate-based N-MASS: performance [Chart: performance improvement relative to Linux average] 38

Conclusions N-MASS: NUMA-multicore-aware scheduler. Data locality optimizations are more beneficial than cache contention avoidance. Better performance metrics are needed for scheduling 39

Thank you! Questions? 40