A Requests Bundling DRAM Controller for Mixed-Criticality System


1 A Requests Bundling DRAM Controller for Mixed-Criticality System
RTAS 2017, April 23, 2017. By: Danlu Guo, Rodolfo Pellizzoni

2 Outline
Introduction
DRAM Background
Predictable DRAM Controllers Evaluation
Requests Bundling DRAM Controller
Worst Case Latency Analysis
Evaluation
Conclusion

3 Introduction
Multicore architecture: shared DRAM main memory; inter-core memory interference.
Real-time system: Hard Real-Time (HRT) applications; Soft Real-Time (SRT) applications.
What do we want from DRAM: a tighter upper bound on latency for HRT requests; a better lower bound on bandwidth for SRT requests.
Solution: innovative predictable DRAM controllers.
[Diagram: multicore architecture - Core 0 through Core N, each with a CPU and private cache, a shared LL cache, the DRAM controller, and the DRAM main memory.]
Multicore architectures have become popular in computer systems, since multiple processors running in parallel improve performance. However, sharing resources such as the DRAM main memory creates new challenges and performance bottlenecks. A DRAM access is long, and due to inter-core interference the latency of a memory request can grow very large. As applications become more data intensive, the main memory becomes the performance bottleneck. When a multicore architecture is used in a real-time system, temporal requirements impose additional constraints; the challenge arises when the processors compete for the same resource, which motivates isolation of shared resources.

4 Outline
Introduction
DRAM Background
Predictable DRAM Controllers Evaluation
Requests Bundling DRAM Controller
Worst Case Latency Analysis
Evaluation
Conclusion
Before we move into real-time DRAM controller design and analysis, we first give a brief overview of DRAM in order to understand the details of the controller design and the contribution.

5 DRAM Background
Organization:
Channel: independent DRAM controller.
Rank: shares the command/data bus.
Bank: can be accessed in parallel.
Row, Column, Row Buffer: the data cells.
[Diagram: a DRAM channel with its address/command and data buses, the DRAM controller, ranks 0..N of DRAM chips (chip 0..7), and within each bank the row decoder, the row/column cell array, and the row buffer.]
Let's first look at the organization of a DRAM device. In this presentation we consider a single-channel memory device connected to a DRAM controller. There can be multiple ranks within a channel, and each rank contains a set of DRAM chips; in our case we only consider a single-rank device. Within each chip there are a number of banks which can be accessed in parallel. Each bank contains the actual 2D data array organized as rows and columns, plus a row buffer which temporarily holds data from the array. One thing to notice is that every memory access must be performed through the row buffer. Now let's look at how data is accessed from the memory.
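To make this hierarchy concrete, a minimal C++ sketch follows; the type names, bank count, and buffer representation are illustrative assumptions, not taken from the talk or the DDR specification.

#include <array>
#include <cstdint>
#include <vector>

// Illustrative hierarchy only; field names and sizes are assumptions.
struct Bank {
    // 2D cell array: rows x columns of data cells.
    std::vector<std::vector<uint8_t>> cells;
    int openRow = -1;                // row currently in the row buffer, -1 = closed
    std::vector<uint8_t> rowBuffer;  // every access goes through this buffer
};

struct Rank {
    std::array<Bank, 8> banks;       // banks can be accessed in parallel
};

struct Channel {
    Rank rank;                       // single-rank device, as assumed in the talk
};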

6 DRAM Background
Operation:
Activate (ACT): retrieve data into the row buffer.
Column-Access-Strobe (RD/WR): access data in the row buffer.
Precharge (PRE): restore data back to the cells.
Timing constraints: refer to the DDR specifications.
[Timing diagram for RD [0,0,1]: ACT, then after tRCD the RD, then after tRL the data; the PRE may follow the RD after tRTP.]
The DRAM device is operated by commands. We only show the three most important commands related to latency: Activate, CAS, and Precharge, which retrieve data, access data, and restore data back to the cells, respectively. A number of timing constraints must be satisfied between commands, depending on which bank each command targets; the details can be found in the DDR specification. We will demonstrate the basic timing constraints between these three commands through a simple example.
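A controller must track these constraints per bank. The sketch below shows one minimal way to do so; the constant values are placeholders in the style of DDR3-1600 and should be looked up in the DDR specification for the actual speed grade.

// Placeholder timing constraints in memory-clock cycles; real values
// come from the DDR specification for the target speed grade.
constexpr int tRCD = 10;  // ACT -> first CAS to the same bank
constexpr int tRL  = 10;  // RD  -> first data beat on the bus
constexpr int tRTP = 6;   // RD  -> earliest PRE to the same bank
constexpr int tRP  = 10;  // PRE -> earliest next ACT to the same bank

struct BankTimer {
    long lastAct = -1000, lastRead = -1000, lastPre = -1000;

    bool canIssueRead(long now) const { return now - lastAct  >= tRCD; }
    bool canIssuePre(long now)  const { return now - lastRead >= tRTP; }
    bool canIssueAct(long now)  const { return now - lastPre  >= tRP;  }
};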

7 DRAM Background
Page policy:
Close-Page: Precharge (PRE) immediately after each access (CAS).
Open-Page: Precharge (PRE) only when required.
[Timing diagrams for RD[0,0,1], RD[0,0,0]: under close-page, each access is a full ACT (tRCD), RD (tRL, data), PRE (tRTP) sequence spanning tRC; under open-page, a row miss pays tRP and tRCD before its RD, while a row hit issues the RD directly.]
Since the precharge command is not strictly tied to a memory access, it can be issued at any time, at the designer's discretion. There are generally two page policies, which determine when a row buffer is closed. Close-page precharges the row buffer after each access (CAS). Open-page keeps the row buffer open: the first request to a row may take longer (a miss pays the precharge and a new activation), but successive row hits take less time, which pays off when the application has a high row-hit ratio. Guaranteeing such a row-hit ratio in a predictable system, however, is difficult.
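The effect of the two policies on a single request's command latency can be sketched as follows, reusing the placeholder timings from the previous sketch and ignoring bus transfer and inter-bank interference; this is a simplification for illustration only.

// Placeholder timings (cycles), as in the earlier sketch.
constexpr int tRCD = 10, tRL = 10, tRP = 10;

// Close-page: every access pays a fresh activation (the precharge is
// hidden after the previous access), then the read latency.
int closePageLatency() { return tRCD + tRL; }

// Open-page: a row hit needs only the CAS; a row miss first pays the
// precharge of the old row plus a new activation.
int openPageLatency(bool rowHit) {
    return rowHit ? tRL : tRP + tRCD + tRL;
}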

8 DRAM Background
Data allocation:
Shared banks: allow data sharing among cores, but cause contention on the same bank.
Private banks: provide isolation between cores/banks, but limit data sharing.
[Diagram: under shared banks, requests 1-17 from both cores are spread over Bank 0 and Bank 1; under private banks, Core 0 maps entirely to Bank 0 and Core 1 to Bank 1.]
Another important factor for memory performance is the data allocation used by the applications, since the allocation scheme determines which techniques can be applied in the controller. There are two basic allocation schemes: shared bank and private bank. With shared banks, an application's data can be distributed over any bank in the memory; this allows data sharing between cores, but can cause access contention on the same bank. With private banks, each application's data is allocated in a specific bank and there is no sharing between cores, so the OS must handle shared data separately. The benefit of this scheme is that it isolates the cores from each other and prevents contention on the same bank.
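The two schemes correspond to different address-to-bank mappings. The sketch below illustrates the idea for a 4-bank device; the bit positions and cache-line size are assumptions, not from the talk.

#include <cstdint>

// Shared banks: interleave consecutive 64-byte lines across all four
// banks, so any core's data can land in any bank.
int sharedBank(uint64_t addr) { return (addr >> 6) & 0x3; }

// Private banks: the OS maps each core's pages so that all of its
// data falls into one dedicated bank.
int privateBank(int coreId) { return coreId; }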

9 Outline
Introduction
DRAM Background
Predictable DRAM Controllers Evaluation
Requests Bundling DRAM Controller
Worst Case Latency Analysis
Evaluation
Conclusion
Having gone through the basic operation and allocation of DRAM devices, we can now look at the techniques used in real-time memory controllers for each allocation scheme, and how the analytical models are constructed.

10 Predictable DRAM Controllers Evaluation
Shared bank + Close-Page: in the worst case, the request under analysis waits behind N-1 reactivations (tRC each) of the same bank.
Private bank + Open-Page: in the worst case, the request suffers N-1 PREs, N-1 ACTs, and N-1 CAS switchings.
[Timing diagrams: four cores (Core 0-3) and four banks (Bank 0-3); under shared bank + close-page, three full ACT-CAS-PRE reactivations precede the request under analysis; under private bank + open-page, the PRE (tRP), ACT (tRCD), and CAS (tRTW/tWTR switching, tWL, data) phases of the four banks are serialized.]
We demonstrate the worst-case analytical model for each technique with examples, assuming 4 cores running in parallel and 4 banks available to allocate the data. Close-page is commonly applied with shared banks because there is a high possibility that all cores access the same bank at the same time. We can observe that in the worst case the core under analysis suffers the reactivation process three times, resulting in a very long latency. To improve the worst-case latency, researchers have paid more attention to private-bank allocation in order to benefit from its isolation. Open-page is generally used there, because future requests from the same core are likely to hit the open row. For a close request, the worst-case analytical model sums N-1 PREs, N-1 ACTs, and N-1 CAS switchings.
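The rough shape of the two bounds can be written down as follows; the published analyses contain additional terms (e.g., read/write switching and refresh), so treat this only as an illustration of how the N-1 interference terms enter.

// Placeholder per-command delays (cycles); only the shape of the
// bounds, not the exact formulas from the literature.
constexpr int tRC = 32;  // assumed ACT-to-ACT distance on one bank

// Shared bank + close-page: the request under analysis can wait behind
// N-1 full reactivations of its own bank before being served.
int sharedClosePageBound(int n, int ownAccess) {
    return (n - 1) * tRC + ownAccess;
}

// Private bank + open-page, per-command analysis: each of PRE, ACT and
// CAS is assumed to be delayed by N-1 same-type commands.
int privateOpenPageBound(int n, int dPre, int dAct, int dCas) {
    return n * (dPre + dAct + dCas);
}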

11 Predictable DRAM Controllers Evaluation
Example CAS-to-CAS distances for DDR3-1600H: RD-RD: 4, RD-WR: 7, WR-RD: 18.
Private Bank + Open-Page: 32 cycles in the worst case.
Private Bank + Open-Page + CAS reordering [L. Ecco & R. Ernst, RTSS'15]: 15 cycles in the worst case.
[Timing diagrams: without reordering, the four banks' PRE (tRP), ACT (tRCD), and alternating CAS commands pay the tRTW/tWTR switching penalties repeatedly; with reordering, same-type CAS commands are grouped, separated only by tCCD, with a single type switch.]
However, scheduling requests in this parallel manner does not yield a good solution, because the switching penalty between CAS commands of different types is quite large. A newer approach therefore reorders the CAS commands so that commands of the same type are scheduled together before switching to the other type. This reordering further reduces the worst-case latency. But is reordering good enough? No. Let's take a closer look at the analytical model.
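Using the CAS-to-CAS distances quoted on the slide, the following sketch shows how alternating access types pay the large WR-to-RD penalty repeatedly, while a bundled order pays it once. The WR-WR distance (tCCD) is an assumed value, and these gap sums alone do not reproduce the slide's 32- and 15-cycle figures, which also include PRE/ACT terms.

#include <vector>

// Switching delays (cycles) between consecutive CAS commands on
// DDR3-1600H, as quoted on the slide; WR->WR is an assumed tCCD.
int casGap(char prev, char next) {
    if (prev == 'R' && next == 'R') return 4;   // RD -> RD
    if (prev == 'R' && next == 'W') return 7;   // RD -> WR
    if (prev == 'W' && next == 'R') return 18;  // WR -> RD
    return 4;                                   // WR -> WR (assumed tCCD)
}

// Total gap time of a CAS sequence: alternating types pay the large
// WR->RD penalty repeatedly, a bundled sequence pays it once.
int casSequenceCycles(const std::vector<char>& seq) {
    int total = 0;
    for (size_t i = 1; i < seq.size(); ++i) total += casGap(seq[i - 1], seq[i]);
    return total;
}
// e.g. {'W','R','W','R'} -> 18 + 7 + 18 = 43 cycles of gaps,
//      {'W','W','R','R'} ->  4 + 18 + 4 = 26 cycles of gaps.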

12 Predictable DRAM Controllers Evaluation
Current analytical model: per-command worst cases; does not reflect the actual command arrival times.
Pipeline system: PRE, ACT, and CAS as three overlapping stages.
HRT latency objective.
[Diagrams: per-bank PRE (tRP), ACT (tRCD), CAS, and data phases for Banks 0-3, first analyzed individually, then overlapped as a three-stage pipeline.]
The current models for private-bank, open-page controllers analyze each command individually, assuming the worst case for each command: there are always N-1 other commands of the same type that delay the command under analysis. The fundamental drawback of this analytical model is that it does not reflect the actual arrival times of commands. If a command is issued early, its following command should also be able to issue early instead of waiting for the command under analysis. To demonstrate this pipeline effect: since commands execute in sequence, we can view the system as a three-stage pipeline of PRE, ACT, and CAS. By analyzing the entire system as a whole, we see large areas of overlap between the three stages.
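The difference between summing per-command worst cases and treating the sequence as a pipeline can be illustrated as below; the stage delays are abstract placeholders, and the pipelined bound is the generic fill-plus-steady-state form, not the paper's exact analysis.

#include <algorithm>

// Summing the three per-command delays for every request over-counts,
// because a command issued early lets the next stage start early.
int summedBound(int n, int dPre, int dAct, int dCas) {
    return n * (dPre + dAct + dCas);
}

// Pipeline view: one fill of all three stages, then the sequence is
// bounded by the slowest stage per additional request.
int pipelinedBound(int n, int dPre, int dAct, int dCas) {
    int slowest = std::max({dPre, dAct, dCas});
    return (dPre + dAct + dCas) + (n - 1) * slowest;
}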

13 Predictable DRAM Controllers Evaluation
Mixed-criticality system: co-existence of HRT and SRT applications on different cores.
Fixed priority can guarantee the HRT latency, but limits the SRT bandwidth.
[Diagram: Banks 0-3 stream HRT requests back to back while the SRT requests on Bank 4 starve.]
We have looked at the issue of HRT latency; we should also pay attention to SRT bandwidth. Fixed priority has been widely used to handle SRT applications in mixed-criticality systems. However, in the worst case it limits the SRT bandwidth by a large amount, up to starvation. This motivates the SRT bandwidth objective.

14 Requests Bundling DRAM Controller
Objective summary:
HRT latency: pipelining covers the overlapping interference; reordering avoids the repetitive CAS switching. However, reordering CAS commands breaks the execution sequence.
SRT bandwidth: co-scheduling SRT and HRT requests avoids starvation.
Pipelining has been applied without reordering, but that does not remove the repetitive switching; and it cannot be applied directly on top of reordering, because reordering breaks the order of the command execution sequence (the priority order of requestors changes between commands). We therefore need a new methodology: to apply a pipeline mechanism, the order of commands must be maintained during execution, so instead of reordering commands we reorder requests. We also need a new way to provide bandwidth to SRT requests, since HRT requests can starve SRT requests depending on the memory intensity. This leads to the Requests Bundling DRAM Controller.

15 Outline
Introduction
DRAM Background
Predictable DRAM Controllers Evaluation
Requests Bundling DRAM Controller
Worst Case Latency Analysis
Evaluation
Conclusion

16 Requests Bundling (REQBundle) DRAM Controller
HRT latency:
Isolation: private banks.
Pipelining and reordering: close-page => fixed command sequence; reordering at the request level => avoids multiple switchings => fixed request sequence.
SRT bandwidth:
Fast access: shared banks + open-page.
Co-scheduling of SRT and HRT requests: fixed SRT execution slots before the HRT requests.

17 Command Scheduler
InRound scheduler: schedules both HRT and SRT commands; bundles same-type requests; switches the access type between rounds.
OutRound scheduler: schedules SRT commands only.
[Diagram: the HRT banks and SRT banks feed the command scheduler; a read InRound ends and a write InRound starts (and vice versa), with the OutRound scheduler running in between; Banks 0-3 hold HRT data, plus the SRT bank.]
The OutRound scheduler schedules only SRT commands, while the InRound scheduler schedules both HRT and SRT commands and has two tasks: bundling requests of the same type, and switching the access type between rounds. We will use an example to demonstrate the transition between the schedulers and rounds, and how the access type is determined by the InRound scheduler.
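A high-level reconstruction of the scheduler's control flow, inferred from the slide rather than taken from the authors' implementation, might look as follows.

#include <deque>

enum class RoundType { Read, Write };

struct CommandScheduler {
    RoundType current = RoundType::Read;
    std::deque<int> hrtQueue, srtQueue;  // pending request IDs per class

    // InRound: snapshot the pending HRT requests of the current type,
    // bundle them, and co-schedule SRT commands in the fixed SRT slots.
    void inRound(RoundType) { /* issue bundled HRT + slotted SRT */ }

    // OutRound: between rounds, only SRT commands are issued, filling
    // the gaps left by the timing constraints.
    void outRound() { /* issue SRT-only commands */ }

    void step() {
        inRound(current);
        outRound();
        // Switch the access type each round: RD -> WR -> RD -> ...
        current = (current == RoundType::Read) ? RoundType::Write
                                               : RoundType::Read;
    }
};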

18 InRound Scheduler
The execution time of an InRound, R(N), is built from:
the time to determine the number of HRT requests (N);
the time to issue the last SRT CAS;
the time to issue the last HRT ACT.
Execution time: R(N) = max((N-1) * _, _)
[Timing diagram: the round starts with the SRT ACT and SRT CAS slots, then the HRT ACTs (A) and CAS commands (R/W) of Banks 0-3 follow, with the data phase closing the round.]
Having derived the worst-case execution time of a round, we can calculate the worst-case analytical latency for a request. By adjusting the number of SRT slots, we can change the guaranteed bandwidth for the SRT requests.

19 Outline
Introduction
DRAM Background
Predictable DRAM Controllers Evaluation
Requests Bundling DRAM Controller
Worst Case Latency Analysis
Evaluation
Conclusion

20 Request Arrival Time and Latency
Case 0: the request arrives before the snapshot of a round of the same type.
L_Req = R(N0) + tRL + tBus
[Timing diagram: round R0 starts with the SRT ACT slot; the RD request's ACT (A) and RD (R) on Bank 3 are followed by tRL and the data transfer (tBus) before R0 ends; L_Req spans arrival to data.]

21 Request Arrival Time and Latency
Case 1: the request arrives before/after the snapshot of a round of a different type.
L_Req = R(N0) + R(N1) + tRL + tBus
[Timing diagram: the request waits out round R0 (the opposite type) and is served in round R1; its ACT (A) and RD (R) on Bank 3 are followed by tRL and the data transfer (tBus).]

22 Request Arrival Time and Latency
Case 2: the request arrives after the snapshot of a round of the same type (worst case).
L_Req = R(N0) + R(N1) + R(N2) + tRL + tBus
[Timing diagram: the request misses round R0's snapshot, waits through rounds R0 and R1, and is served in round R2, followed by tRL and the data transfer (tBus).]
The three cases are summarized in the sketch below.
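The three bounds can be folded into one hypothetical helper; R(N0), R(N1), R(N2) are the round execution times from the InRound analysis, and tRL and tBus are the read latency and data-bus transfer time.

// A sketch of the three arrival cases above; not the authors' code.
int worstCaseLatency(int arrivalCase, int rN0, int rN1, int rN2,
                     int tRL, int tBus) {
    switch (arrivalCase) {
        case 0: return rN0 + tRL + tBus;              // before snapshot, same-type round
        case 1: return rN0 + rN1 + tRL + tBus;        // different-type round in between
        case 2: return rN0 + rN1 + rN2 + tRL + tBus;  // after snapshot, same-type round
        default: return -1;                           // invalid case
    }
}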

23 Outline
Introduction
DRAM Background
Predictable DRAM Controllers Evaluation
Requests Bundling DRAM Controller
Worst Case Latency Analysis
Evaluation
Conclusion

24 Evaluation
Implemented in a general DRAM controller simulation framework in C++ [DRAMController Demo, RTSS'16].
EEMBC benchmark memory traces generated from MACsim: 1 GHz CPU, private L1/L2 caches, shared L3 cache.
Evaluated against the Command Bundling (CMDBundle) DRAM controller [L. Ecco and R. Ernst, RTSS'15], in burst mode and non-burst mode; no assumption is made on the request arrival pattern.

25 Benchmark Worst Case Execution Time (8 HRTs)
HRT0 runs a benchmark trace while the other 7 HRTs run memory-intensive traces. Results are normalized to CMDBundle (non-burst).

26 Worst Case HRT Request Latency (8 HRTs)
[Plots: worst-case RD request latency; worst-case WR request latency.]

27 Worst Case SRT Requests Bandwidth (8 HRTs)
[Plots: RD bandwidth; WR bandwidth.]

28 Mixed-Criticality System (8 HRTs, 8 SRTs)
HRT latency: a virtual HRT requestor mechanism is implemented for CMDBundle; the virtual requestors are treated as HRT cores in the system, and all SRT requests share them.
SRT bandwidth: with virtual hard requestors, the HRT latency increases as a trade-off for better SRT bandwidth.
[Plots: HRT latency; SRT bandwidth.]

29 Outline
Introduction
DRAM Background
Predictable DRAM Controllers Evaluation
Requests Bundling DRAM Controller
Worst Case Latency Analysis
Evaluation
Conclusion

30 Conclusion
Employing request bundling with pipelining improves the worst-case request latency.
Considering the gaps left by the command timing constraints provides a good trade-off between SRT bandwidth and HRT latency.
Compared against a state-of-the-art real-time memory controller, we show the balance point as a function of a task's row-hit ratio: the measured row-hit ratio is generally lower than 50%, and a guaranteed row-hit ratio requires static analysis and is lower than the measured one.
Based on our measurements of the applications' row-hit ratios, the ratio is generally below 50%, and that is only the measured value; obtaining a guaranteed row-hit ratio requires the designer to perform a static memory-access-pattern analysis and can result in an even lower ratio. Therefore, for some applications, one can simply skip that analysis and use close-page for a more predictable design.

31 Thank you

32 Evaluation: Burst Mode vs. Non-Burst Mode

