Reducing Memory Interference in Multicore Systems


1 Reducing Memory Interference in Multicore Systems
Lavanya Subramanian, Department of ECE, 11/04/2011. Hello. My name is Lavanya Subramanian. Today, I am going to talk about application-aware memory channel partitioning.

2 Main Memory is a Bottleneck
Main memory latency is long, and it reduces performance by stalling the cores. In a multicore system, applications running on multiple cores share the main memory, which they access over a memory channel.

3 Problem of Inter-Application Interference
Applications' requests interfere at the main memory, and this inter-application interference degrades system performance. The problem is further exacerbated by fast-growing core counts and limited off-chip pin bandwidth.

4 Talk Summary Goal: address the problem of inter-application interference at main memory, with the aim of improving performance. Outline of this talk: background and motivation, previous approaches, our approach. I shall give some background on main memory organization and operation, describe previous approaches and their shortcomings, and explain why we need our approach: memory channel partitioning.

5 Background: Main Memory Organization

6 DRAM Main Memory Organization
The processor accesses the off-chip DRAM main memory through one or more channels; here, I show a single channel. The smallest independently accessible unit within a channel is a bank. There are other levels in the hierarchy, such as ranks and DIMMs, which I shall not go into in detail. Accesses to multiple banks can proceed in parallel, but only one bank can send data on the channel at a time.

7 DRAM Organization: Bank Organization
Each bank is a 2D array of DRAM cells. The x dimension is a row (4 KB), and a row is organized as several columns (8 bytes each). The row address drives a row decoder, which selects one row; the column address drives a column mux, which selects the required column out of the row buffer.

8 DRAM Organization: Accessing data
[Figure: a row of columns A-F; one column holds the required data.] Now, I want to access this highlighted piece of data.

9 DRAM Organization: The Row-Buffer
The entire row is read from the array into the row buffer, shown below the array. The read is destructive: the row's data is now present only in the row buffer. The required piece of data is then read from the row buffer and sent onto the channel.

10 DRAM Organization: Row Hit
A subsequent access to another piece/column of data from the same row is serviced directly from the row buffer; an array access is NOT required. This is called a row hit.

11 DRAM Organization: Row Miss
On the other hand, if a subsequent access is to data in another row, then (i) the data in the row buffer must be written back into the array, and (ii) the new row must be destructively read into the row buffer, before the required data is sent onto the channel. This is called a row (buffer) miss. Row miss latency = 2 x row hit latency.
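To make the hit/miss asymmetry concrete, here is a minimal Python sketch of one bank's row-buffer behavior. This is my illustration, not part of the talk; the 200/400-cycle latencies are the values the methodology slide later assumes.

```python
class Bank:
    """Minimal model of one DRAM bank's row buffer (hypothetical sketch)."""
    ROW_HIT_CYCLES = 200   # latency assumed on the methodology slide
    ROW_MISS_CYCLES = 400  # write back old row + read new row = 2x hit

    def __init__(self):
        self.open_row = None  # row currently held in the row buffer

    def access(self, row):
        """Return the latency of accessing `row`, updating row-buffer state."""
        if self.open_row == row:
            return self.ROW_HIT_CYCLES   # served from the row buffer
        self.open_row = row              # write back + destructive read
        return self.ROW_MISS_CYCLES

bank = Bank()
print([bank.access(r) for r in [7, 7, 7, 3]])  # [400, 200, 200, 400]
```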

12 The Memory Controller
The memory controller is the medium between the core and the main memory. It buffers memory requests from the core in a request buffer, and it re-orders and schedules requests to the main memory banks.

13 FR-FCFS (Rixner et al. ISCA’00)
FR-FCFS exploits row hits to minimize overall DRAM access latency: row-hit (first-ready) requests are prioritized over other requests, and ties are broken in first-come-first-served order. [Figure: service timelines at a bank for the same set of requests under FCFS vs. FR-FCFS.]
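As a rough sketch (my illustration of the selection rule, not Rixner et al.'s hardware), FR-FCFS can be expressed as a priority function: row hits first, then oldest first. The usage below previews the starvation scenario on the next slide.

```python
from collections import namedtuple

Request = namedtuple("Request", "arrival app row")

def frfcfs_pick(queue, open_row):
    """Pick the next request: row hits first, then oldest (FR-FCFS)."""
    # A row hit targets the currently open row; hits sort before misses.
    return min(queue, key=lambda r: (r.row != open_row, r.arrival))

# app2 has one request that arrived FIRST, but it is a row miss;
# app1's three younger row-hit requests keep winning.
queue = [Request(0, "app2", row=9),
         Request(1, "app1", row=5),
         Request(2, "app1", row=5),
         Request(3, "app1", row=5)]
open_row = 5
while queue:
    req = frfcfs_pick(queue, open_row)
    queue.remove(req)
    open_row = req.row
    print(req.app, "row", req.row)
# app1's requests are all serviced before app2's lone, older request
```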

14 Memory Scheduling in Multicore Systems
[Figure: FR-FCFS service timeline for App 1 (Core 1) and App 2 (Core 2) sharing a bank.] Application 2's single request starves behind three of Application 1's requests: the low memory-intensity application 2 starves behind application 1. Minimizing overall DRAM access latency != maximizing system performance.

15 Need for Application Awareness
The memory scheduler needs to be aware of application characteristics. Thread Cluster Memory (TCM) scheduling (Kim et al., MICRO'10) is the best previous application-aware memory scheduling policy. TCM always prioritizes low memory-intensity applications and shuffles priorities among high memory-intensity applications. Strength: provides good system performance. Shortcoming: high hardware complexity due to ranking and prioritization logic.
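To give a flavor of the ranking logic (a simplified sketch of my own, omitting TCM's niceness metric and insertion-shuffle details), the clustering and shuffling could look like this:

```python
import random

def tcm_ranks(apps, mpki, threshold, rng=random.Random(0)):
    """Return apps ordered highest priority first (simplified TCM sketch)."""
    low = [a for a in apps if mpki[a] <= threshold]   # latency-sensitive cluster
    high = [a for a in apps if mpki[a] > threshold]   # bandwidth-sensitive cluster
    rng.shuffle(high)    # periodically shuffle high-intensity apps for fairness
    return low + high    # low-intensity apps are always prioritized

apps = ["A", "B", "C", "D"]
mpki = {"A": 0.5, "B": 25.0, "C": 1.2, "D": 40.0}
print(tcm_ranks(apps, mpki, threshold=10.0))  # e.g. ['A', 'C', 'D', 'B']
```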

16 Modern Systems have Multiple Channels
Modern systems have multiple memory channels, each with its own memory controller. The allocation of data to channels is a new degree of freedom.

17 Interleaving rows across channels
Interleaving rows across channels enables parallelism: rows on different channels can be accessed in parallel.

18 Interleaving cache lines across channels
Interleaving cache lines across channels enables finer-grained parallelism, at the cache-line granularity. A sketch of both mappings follows.
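A small sketch contrasting the two mappings (illustrative; the 4 KB row size is from the earlier bank slide, while the 64-byte cache-line size and the two-channel system are my assumptions):

```python
ROW_BYTES = 4096   # 4 KB row, as on the bank-organization slide
LINE_BYTES = 64    # typical cache-line size (an assumption)
NUM_CHANNELS = 2   # assumed two-channel system

def channel_row_interleaved(addr):
    """Consecutive 4 KB rows alternate across channels."""
    return (addr // ROW_BYTES) % NUM_CHANNELS

def channel_line_interleaved(addr):
    """Consecutive 64 B cache lines alternate across channels."""
    return (addr // LINE_BYTES) % NUM_CHANNELS

addrs = [0, 64, 4096, 4160]
print([channel_row_interleaved(a) for a in addrs])   # [0, 0, 1, 1]
print([channel_line_interleaved(a) for a in addrs])  # [0, 1, 0, 1]
```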

19 Key Insight 1 High memory-intensity applications interfere with low memory-intensity applications in shared memory channels. [Figure: service timelines (time units 1-5) for App A (Core 0) and App B (Core 1) on Channels 0/1 and Banks 0/1, under conventional page mapping vs. channel partitioning.] Solution: map the data of low and high memory-intensity applications to different channels.

20 Key Insight 2 Applications with low row-buffer locality interfere with applications with high row-buffer locality in shared memory channels. [Figure: request buffer state and service order (requests A-E, slots 1-6) on Channels 0/1 and Banks 0/1, under conventional page mapping vs. channel partitioning.] Solution: map the data of low and high row-buffer-locality applications to different channels.

21 Memory Channel Partitioning (MCP)
Hardware: profile applications. System software: classify applications into groups, partition available channels between groups, assign a preferred channel to each application, and allocate application pages to the preferred channel.

22 Profile/Classify Applications
Profiling: collect each application's last-level-cache Misses Per Kilo Instruction (MPKI) and row-buffer hit rate (RBH) online. Classification: if MPKI <= MPKIt, the application is low intensity. If MPKI > MPKIt, it is high intensity and is further classified by row-buffer locality: low row-buffer locality if RBH <= RBHt, high row-buffer locality if RBH > RBHt.
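In code form, the classification is a small decision tree (a sketch; the threshold values MPKI_t = 10 and RBH_t = 0.5 are placeholders, not the talk's tuned values):

```python
def classify(mpki, rbh, mpki_t=10.0, rbh_t=0.5):
    """Classify an application from its online-profiled MPKI and RBH."""
    if mpki <= mpki_t:
        return "low-intensity"
    if rbh > rbh_t:
        return "high-intensity, high row-buffer locality"
    return "high-intensity, low row-buffer locality"

print(classify(mpki=2.0,  rbh=0.9))   # low-intensity
print(classify(mpki=30.0, rbh=0.9))   # high-intensity, high row-buffer locality
print(classify(mpki=30.0, rbh=0.1))   # high-intensity, low row-buffer locality
```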

23 Partition Between Low and High Intensity Groups
Partition channels between the low and high memory-intensity groups, assigning channels in proportion to the number of applications in each group. In this example, Channels 1 and 2 go to the low-intensity group, and Channels 3 and 4 go to the high-intensity group. (A code sketch of both partitioning steps follows the next slide.)

24 Partition between Low and High RBH Groups
Within the high-intensity group, partition channels between the low and high row-buffer-locality subgroups, assigning channels in proportion to each subgroup's bandwidth demand. In this example, Channel 3 goes to the high-intensity, low row-buffer-locality subgroup, and Channel 4 goes to the high-intensity, high row-buffer-locality subgroup.
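A sketch of both partitioning steps from slides 23 and 24 (my illustration; the weights, bandwidth demands, and channel IDs are placeholders):

```python
def split_channels(channels, weight_a, weight_b):
    """Split a channel list between two groups, proportional to weights."""
    n_a = max(1, round(len(channels) * weight_a / (weight_a + weight_b)))
    return channels[:n_a], channels[n_a:]

channels = [0, 1, 2, 3]
# Slide 23: low vs. high intensity, proportional to application count.
low_ch, high_ch = split_channels(channels, weight_a=2, weight_b=2)
# Slide 24: within high intensity, low vs. high row-buffer locality,
# proportional to each subgroup's bandwidth demand (placeholder numbers).
lo_rbh_ch, hi_rbh_ch = split_channels(high_ch, weight_a=3.0, weight_b=5.0)
print(low_ch, lo_rbh_ch, hi_rbh_ch)  # [0, 1] [2] [3]
```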

25 Preferred Channel Assignment/Allocation
Load-balance each group's bandwidth demand across the group's allocated channels; each application now has a preferred channel. Pages are allocated to the preferred channel on first touch: the operating system assigns a page to the application's preferred channel if a free page is available there; otherwise, it uses a modified replacement policy to preferentially choose a replacement candidate from the preferred channel.
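A sketch of the first-touch allocation path (hypothetical structure; `evict_from` stands in for the modified replacement policy, which the slide does not detail):

```python
def allocate_page(app, free_pages_by_channel, preferred):
    """Place a newly touched page on the app's preferred channel if possible."""
    ch = preferred[app]
    if free_pages_by_channel[ch]:
        return ch, free_pages_by_channel[ch].pop()
    # No free page on the preferred channel: the modified replacement
    # policy would preferentially reclaim a page there instead.
    return evict_from(ch)

def evict_from(channel):
    """Placeholder for the modified replacement policy on `channel`."""
    raise NotImplementedError("preferentially reclaim a page on this channel")

free = {0: [100, 101], 1: []}
preferred = {"appA": 0}
print(allocate_page("appA", free, preferred))  # (0, 101)
```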

26 Integrating Partitioning and Scheduling
Inter-application interference can be mitigated through memory scheduling, through memory partitioning, or through both together: integrated memory partitioning and scheduling.

27 Integrated Memory Partitioning and Scheduling (IMPS)
Applications with very low memory intensities (< 1 MPKI) do not need dedicated bandwidth; in fact, dedicating bandwidth to them results in wastage. These applications need short access latencies and interfere minimally with other applications. Solution: always prioritize them in the scheduler, and handle the other applications via memory channel partitioning.
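As a sketch, IMPS can be seen as adding one more level to the FR-FCFS priority function from slide 13 (my rendering; only the < 1 MPKI threshold comes from the slide):

```python
from collections import namedtuple

Request = namedtuple("Request", "arrival app row")
VERY_LOW_MPKI = 1.0  # threshold from the slide: < 1 MPKI

def imps_pick(queue, open_row, mpki):
    """IMPS scheduler rule: very low-intensity apps bypass FR-FCFS ordering."""
    return min(queue, key=lambda r: (mpki[r.app] >= VERY_LOW_MPKI,  # tiny apps first
                                     r.row != open_row,             # then row hits
                                     r.arrival))                    # then oldest

mpki = {"app1": 30.0, "app2": 0.4}
queue = [Request(0, "app1", row=5), Request(1, "app2", row=9)]
print(imps_pick(queue, open_row=5, mpki=mpki).app)  # app2, despite app1's row hit
```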

28 Methodology
Core model: 4 GHz out-of-order processor; 128-entry instruction window; 512 KB cache per core.
Memory model: DDR2; 1 GB capacity; 4 channels, 4 banks per channel; row-interleaved mapping; row hit: 200 cycles; row miss: 400 cycles.

29 Comparison to Previous Scheduling Policies
MCP performs 1% better than TCM (the best previous scheduler) with no extra hardware complexity. IMPS performs 5% better than TCM with minimal extra hardware complexity. Both perform consistently well across all intensity categories.

30 Comparison to AFT/DPM (Awasthi et al. PACT’11)
MCP and IMPS outperform AFT and DPM by 7% and 12.4%, respectively (across 40 workloads): application-aware page allocation mitigates inter-application interference better.

31 Future Work
Further exploration of integrated memory partitioning and scheduling for system performance. Integrated partitioning and scheduling for fairness. Workload-aware memory scheduling.

