1 PhD Defense Presentation Managing Shared Resources in Chip Multiprocessor Memory Systems 12. October 2010 Magnus Jahre.



2 Outline Chip Multiprocessors (CMPs) CMP Resource Management Miss Bandwidth Management –Greedy Miss Bandwidth Management –Interference Measurement –Model-Based Miss Bandwidth Management Off-Chip Bandwidth Management Conclusion

3 CHIP MULTIPROCESSORS

4 Historical Processor Performance Technology scaling was first used to increase clock frequency and is now used to add processor cores Moore's Law: 50% annual increase in the number of components per chip 52% annual performance increase, later 20% annual performance increase Aggregate performance still follows Moore's Law

5 Power Dissipation Limits Practical Clock Frequency Source: Wikipedia, List of CPU Power Dissipation (retrieved) Technology scaling increases clock frequency Technology scaling adds processor cores Power Wall

6 Chip Multiprocessors (CMPs) CMPs utilize chip resources with a constant power budget How does technology scaling impact CMPs? Intel Nehalem

7 Projected Number of Cores ITRS expects 40% annual increase Observation 2: Software parallelism is needed in the long-term Observation 1: Multiprogramming can provide near-term throughput improvement

8 Processor Memory Gap 7% Annual Memory Latency Improvement Memory Wall Observation 3: Latency hiding techniques are necessary

9 Performance vs. Bandwidth Observation 4: Bandwidth must be used efficiently

10 Software parallelism Multi- programming Latency hiding Bandwidth efficiency Concurrent applications share hardware Complex Memory Systems Shared Resource Management Application TrendsHardware Trends

11 CMP RESOURCE MANAGEMENT

12 Why Manage Shared Resources? Provide predictable performance Support OS scheduler assumptions Cloud: Fulfill Service Level Agreement

13 Performance Variability Metrics Fairness –The performance reduction due to interference between processes is distributed across all processes in proportion to their priorities –Equal priorities: the performance reduction from sharing affects all processes equally Quality of Service –The performance of a process never drops below a certain limit, regardless of the behavior of co-scheduled processes
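The equal-priority fairness notion above is commonly quantified in the literature as the ratio between the smallest and largest per-process slowdown, where slowdown is shared-mode runtime over private-mode runtime. A minimal sketch (the function names and the specific ratio-based metric are illustrative, not taken from the thesis):

```python
def slowdown(shared_time, private_time):
    """Slowdown of a process when co-scheduled, relative to running alone."""
    return shared_time / private_time

def fairness(shared_times, private_times):
    """Fairness in (0, 1]: 1.0 means sharing slows every process down equally."""
    slowdowns = [s / p for s, p in zip(shared_times, private_times)]
    return min(slowdowns) / max(slowdowns)

# Two cores: process 0 slowed 1.5x, process 1 slowed 3.0x -> fairness 0.5
print(fairness([150, 300], [100, 100]))
```

With equal priorities, a system is perfectly fair under this metric exactly when all slowdowns are equal, matching the definition above.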

14 Performance Variability (Fairness) Paper B.I

15 Resource Management Tasks

16 Miss Bandwidth Management Prefetch Scheduling Off-line Interference Measurement On-line Interference Measurement Dynamic Miss Handling Architecture Greedy Miss Bandwidth Allocation Performance Model Based Miss Bandwidth Allocation Low-Cost Open Page Prefetching Opportunistic Prefetch Scheduling Contributions

17 GREEDY MISS BANDWIDTH MANAGEMENT Miss Bandwidth Management

18 Conventional Resource Allocation Implementation

19 Alternative Resource Allocation Implementation Dynamic Miss Handling Architecture

20 Miss Handling Architecture (MHA) / Dynamic Miss Handling Architecture [Diagram: accesses A–E; MSHR entries holding Address, Target Info and a used bit; when all MSHRs are occupied, the cache blocks until one frees up.] A DMHA controls the number of concurrent shared memory system requests that are allowed for each processor

21 Greedy Miss Bandwidth Management Idea: Reduce the number of MSHRs if a metric exceeds a certain threshold Metrics: –Paper A.II: Memory bus utilization –Paper A.III: Simple interference counters (Interference Points) Performance feedback avoids excessive performance degradation Paper A.II and A.III
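The greedy policy can be sketched as a per-core controller that halves the MSHR allocation while a congestion metric (e.g. memory bus utilization) exceeds its threshold, and uses performance feedback to back off when the throttled core slows down too much. All names and threshold values below are illustrative assumptions, not taken from Papers A.II/A.III:

```python
# Hypothetical sketch of a greedy MSHR controller. Thresholds and the
# halving/doubling steps are assumptions for illustration.
class GreedyMSHRController:
    def __init__(self, max_mshrs=16, util_threshold=0.9, slowdown_limit=1.1):
        self.mshrs = max_mshrs
        self.max_mshrs = max_mshrs
        self.util_threshold = util_threshold
        self.slowdown_limit = slowdown_limit

    def adjust(self, bus_utilization, measured_slowdown):
        """Called once per sampling interval; returns the new MSHR count."""
        if measured_slowdown > self.slowdown_limit:
            # Performance feedback: throttling hurts too much, restore MSHRs.
            self.mshrs = min(self.max_mshrs, self.mshrs * 2)
        elif bus_utilization > self.util_threshold and self.mshrs > 1:
            # Greedy step: reduce this core's miss bandwidth.
            self.mshrs //= 2
        return self.mshrs
```

Checking the slowdown limit before the utilization threshold is what keeps the greedy reduction from degrading performance without bound.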

22 INTERFERENCE MEASUREMENT Miss Bandwidth Management

23 Resource Allocation Baselines Baseline = Interference-free configuration Quantify performance impact from interference Private Mode and Shared Mode

24 Interference Definition Interference = Shared Mode Latency − Private Mode Latency. In practice the private mode latency must be estimated, so the estimate carries an error relative to the measured private mode latency.

25 Offline Interference Measurement Interference Penalty Frequency (IPF) counts the number of requests that experienced an interference latency of i cycles Interference Impact Factor (IIF) is the interference latency times the probability of it arising, i.e. IIF(i) = i · P(i) Paper B.I
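The IPF and IIF definitions above can be sketched directly: build a histogram of per-request interference latencies (the IPF), then weight each latency by its empirical probability (the IIF). This is a minimal sketch of the definitions, not the paper's measurement infrastructure:

```python
from collections import Counter

def ipf(interference_latencies):
    """IPF: how many requests experienced an interference latency of i cycles."""
    return Counter(interference_latencies)

def iif(interference_latencies):
    """IIF(i) = i * P(i), where P(i) is the fraction of requests with latency i."""
    hist = ipf(interference_latencies)
    total = sum(hist.values())
    return {i: i * count / total for i, count in hist.items()}

latencies = [0, 0, 10, 10, 100]   # interference latency per request (cycles)
print(iif(latencies))             # {0: 0.0, 10: 4.0, 100: 20.0}
```

Note how a rare but large interference latency (100 cycles for one of five requests) dominates the impact, which is exactly what the i · P(i) weighting is meant to expose.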

26 Aggregate Interference Impact Paper B.I

27 Resource Management Baselines

28 Baseline Weaknesses Multiprogrammed Baseline –Only accounts for interference in partitioned resources –Static and equal division of DRAM bandwidth does not give equal latency –Complex relationship between resource allocation and performance Single Program Baseline –Does not exist in shared mode Online Interference Measurement Dynamic Interference Estimation Framework (DIEF) Paper B.II

29 Online Interference Measurement Dynamic Interference Estimation Framework (DIEF) Estimates private mode average memory latency General, component-based framework Paper B.II

30 Shared Cache Interference [Diagram: per-CPU Auxiliary Tag Directories track the tags each CPU's cache would have held in private mode, alongside the shared cache's actual accesses, hits and misses.] An eviction is interference only if the block would still have been resident in private mode; an eviction the CPU would have caused itself is not interference. Interference latency cost = miss penalty
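An Auxiliary Tag Directory (ATD) holds only tags and LRU state for a same-capacity private cache, so a shared-cache miss that hits in the CPU's ATD would have been a hit in private mode, i.e. it is interference. A minimal sketch, assuming a fully associative LRU directory (the real design is set-associative and this is not the paper's implementation):

```python
from collections import OrderedDict

class LRUTagDirectory:
    """Tags-only model of a private cache with LRU replacement."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.tags = OrderedDict()   # tag -> None, ordered by recency

    def access(self, tag):
        """Returns True on hit; updates recency and handles fills/evictions."""
        if tag in self.tags:
            self.tags.move_to_end(tag)
            return True
        if len(self.tags) >= self.capacity:
            self.tags.popitem(last=False)   # evict the LRU tag
        self.tags[tag] = None
        return False

def is_interference_miss(atd, tag, shared_cache_hit):
    """A shared-cache miss that hits in this CPU's ATD is interference."""
    private_hit = atd.access(tag)
    return (not shared_cache_hit) and private_hit
```

The ATD is consulted on every access so its LRU state stays consistent with the private-mode behavior being emulated.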

31 Bus Interference Requirements Out-of-order memory bus scheduling Shared-mode-only cache misses and cache hits Shared cache writebacks Computing the private mode latency from shared mode queue contents is difficult, so we emulate private scheduling in the shared mode

32 Memory Latency Estimation Buffer [Diagram: a shared bus queue with arrival order and a head pointer versus the emulated private execution order; a latency lookup table and per-bank open page emulation registers supply the latency each request would have seen under private scheduling. Example bank/page mapping: A (0,15), B (0,19), C (0,15), D (1,32); estimated queue latency 200 cycles.]
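The open page emulation can be sketched as a per-bank register holding the row that would currently be open under private scheduling: a request to that row is charged a cheap row-hit latency, anything else a row-miss latency. The latency constants and interface below are assumptions for illustration, not values from Paper B.II:

```python
ROW_HIT_LATENCY = 40    # cycles, assumed
ROW_MISS_LATENCY = 80   # cycles, assumed

def estimate_private_latency(requests):
    """requests: list of (bank, row) in emulated private execution order.
    Returns the total estimated private-mode queue latency."""
    open_page = {}   # bank -> emulated open row (the emulation registers)
    total = 0
    for bank, row in requests:
        if open_page.get(bank) == row:
            total += ROW_HIT_LATENCY      # row buffer hit in private mode
        else:
            total += ROW_MISS_LATENCY     # activate a new row
            open_page[bank] = row
    return total
```

A real estimator would also emulate the private controller's request reordering; here the input is assumed to already be in emulated execution order.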

33 MODEL-BASED MISS BANDWIDTH MANAGEMENT Miss Bandwidth Management

34 Model-Based Miss Bandwidth Allocation DIEF provides accurate estimates of the average private mode memory latency Can we use the estimates provided by DIEF to choose miss bandwidth allocations? We need a model that relates average memory latency to performance Paper A.IV

35 Performance Model Paper A.IV Observation: The performance impact of memory latency depends on the parallelism of memory requests. This parallelism is very similar in private and shared mode, so shared mode measurements can provide private mode performance estimates.

36 Bandwidth Management Flow Paper A.IV Measurement → Modeling → Allocation: measure shared mode memory latency, private mode memory latency, committed instructions, number of memory requests and CPU stall time; feed these into per-CPU performance models and a performance metric model; find the MSHR allocation that maximizes the chosen performance metric; set the number of MSHRs for all last-level private caches.
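The allocation step above can be sketched as a search over per-CPU MSHR allocations that maximizes a system-level metric predicted by the performance models. The exhaustive search, the `perf_model(cpu, mshrs)` interface and the default sum-of-performance metric are all illustrative assumptions; a real implementation would prune the search space:

```python
from itertools import product

def best_allocation(num_cpus, mshr_options, perf_model, metric=sum):
    """Return the per-CPU MSHR allocation maximizing the chosen metric.
    perf_model(cpu, mshrs) -> predicted performance for that CPU (assumed)."""
    best, best_score = None, float("-inf")
    for alloc in product(mshr_options, repeat=num_cpus):
        score = metric(perf_model(cpu, m) for cpu, m in enumerate(alloc))
        if score > best_score:
            best, best_score = alloc, score
    return best

# Toy model where contention makes very large allocations counterproductive.
toy_model = lambda cpu, m: m - 0.1 * m * m
print(best_allocation(2, [2, 4, 8], toy_model))
```

Swapping `metric` for, say, a fairness function changes which allocation wins, which is what "maximizes the chosen performance metric" refers to on the slide.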

37 OFF-CHIP BANDWIDTH MANAGEMENT

38 Modern DRAM Interfaces Maximize bandwidth with a 3D organization (banks × rows × columns) Repeated requests to the open row buffer are very efficient [Diagram: a row address selects a row into the row buffer of a DRAM bank; a column address selects within the row buffer.]

39 Low-Cost Open Page Prefetching Idea: Piggyback prefetches to open DRAM pages on demand reads Performance win if prefetcher accuracy is above ~40% Paper C.I
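The piggybacking idea can be sketched as: when a demand read opens a DRAM row, generate prefetch candidates for neighboring cache blocks in the same row, since they are served cheaply from the open row buffer. The block size, row size and prefetch-degree parameters below are assumptions for illustration:

```python
BLOCK = 64   # cache block size in bytes, assumed

def open_page_prefetches(demand_addr, row_size=4096, degree=1):
    """Return up to `degree` block addresses in the same DRAM row as
    demand_addr, suitable for piggybacking on the open row."""
    row_start = demand_addr - (demand_addr % row_size)
    row_end = row_start + row_size
    out = []
    addr = demand_addr + BLOCK
    while addr < row_end and len(out) < degree:
        out.append(addr)   # same row: served from the open row buffer
        addr += BLOCK
    return out
```

Because the candidates never cross the row boundary, each prefetch costs only a cheap row-buffer access, which is why even a moderately accurate (~40%) prefetcher comes out ahead.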

40 Opportunistic Prefetch Scheduling Paper C.II Idea: Issue queued prefetches to a DRAM page when that page is about to be closed, tracked with a Page Vector Table (PVT) [Diagram: demand accesses and prefetch requests to the same page; prefetches are issued before the page closes.] Increased efficiency: 8 transfers for 3 activations

41 CONCLUSION

42 Conclusion Managing bandwidth allocations can improve CMP system performance Miss bandwidth management –Greedy allocations –Management guided by accurate measurements and performance models Off-chip bandwidth management with prefetching

43 Thank You Visit our website:

44 EXTRA SLIDES

45 Future Work Performance-directed management of shared caches and the memory bus Improving OS and system software with dynamic measurements Combining dynamic MHAs with prefetching to improve system performance Managing workloads of single-threaded and multi-threaded benchmarks

46 Example Chip Multiprocessor