Recent Progress In Embedded Memory Controller Design

Name: Recent Progress In Embedded Memory Controller Design
Uploaded: 2017-08-22T21:14:58+00:00
Duration: PTM29S40
Channel: Kieran Rodgerson
Description: Recent Progress In Embedded Memory Controller Design

Recent Progress In Embedded Memory Controller Design
MEAOW’13. Jianwen Zhu, Department of Electrical and Computer Engineering, University of Toronto

Acknowledgment: PhD work of Zefu Dai

Memory Hierarchy: Latency (L), Capacity (C), Bandwidth (BW)
Nothing is more important than the concept of memory hierarchy.
Cache (with cache controller): L: 0.5 ns, C: 10 MB
DRAM: L: 50 ns, C: 100 GB, BW: 100 GB/s
Flash: L: 10 us, C: 2 TB, BW: 2 GB/s
Disk: L: 10 ms, C: 4 TB, BW: 600 MB/s

DRAM Primer <bank, row, column> Page buffer per bank

DRAM Characteristics: DRAM page crossing
Page crossing: charges ~10K DRAM cells and bitlines; increases power and latency; decreases effective bandwidth. Sequential access vs. random access: less page crossing, lower power consumption, 4.4x shorter latency, 10x better BW.
Looking further at DRAM characteristics, one important feature is that DRAM page crossing has a significant impact on the latency and bandwidth of off-chip memory. Each page crossing charges tens of thousands of DRAM cells and bitlines, which greatly increases power and decreases performance. In general, DRAM prefers sequential access over random access, because a sequential access pattern incurs fewer page crossings, which can significantly improve latency and bandwidth.
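As a rough illustration of why page crossings hurt, the sketch below models average effective bandwidth as a function of the page-hit rate. The burst size and both timing values are hypothetical placeholders, not figures from the talk or from any datasheet, so the output is only indicative and is not expected to reproduce the 4.4x or 10x numbers quoted on the slide.

```python
def effective_bandwidth(page_hit_rate, burst_bytes=64, t_burst_ns=5.0, t_cross_ns=30.0):
    """Average bandwidth in GB/s (bytes per ns) for a simple two-cost model.

    Every access pays t_burst_ns to move burst_bytes. An access that crosses
    to a new page additionally pays t_cross_ns (precharge plus activate).
    All default values are illustrative assumptions.
    """
    if not 0.0 <= page_hit_rate <= 1.0:
        raise ValueError("page_hit_rate must be in [0, 1]")
    avg_time_ns = t_burst_ns + (1.0 - page_hit_rate) * t_cross_ns
    return burst_bytes / avg_time_ns


# Mostly sequential traffic (95% page hits) versus fully random traffic (0% hits).
seq_bw = effective_bandwidth(0.95)
rand_bw = effective_bandwidth(0.0)
print(f"sequential: {seq_bw:.2f} GB/s, random: {rand_bw:.2f} GB/s")
```

The only point of the model is the trend: the fewer accesses that cross a page, the closer the effective bandwidth gets to the raw burst rate.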

Take Away: DRAM = Disk

Embedded Controller: Bad News, Good News
Bad news: none available as in general-purpose processors. Good news: opportunities for customization.

Agenda
Overview
Multi-Port Memory Controller (MPMC) Design
“Out-of-Core” Algorithmic Exploration

Motivating Example: H.264 Decoder
Diverse QoS requirements: bandwidth sensitive, latency sensitive. Dynamic latency, BW and power.
[Figure: H.264 decoder memory ports, labelled 6.4, 9.6, 1.2, 164.8, 0.09, 31.0, 156.7, 94 MB/s]
For example, bandwidth-sensitive ports want to reserve a certain amount of memory bandwidth but are less sensitive to service latency; in contrast, latency-sensitive ports such as caches require their requests to be serviced as fast as possible. Adding to the problem, off-chip memory has dynamic latency and bandwidth that depend on the scheduling order. Therefore, it is important to explore the characteristics of DRAM when designing a memory scheduler.

Wanted: bandwidth guarantee, prioritized access, reduced page crossing
Now let us restate the problem: we need an MPMC that provides a minimum bandwidth guarantee for bandwidth-sensitive ports and prioritized access for latency-sensitive ports. Finally, because power consumption is generally important in embedded systems, the scheduler should also improve performance by reducing the number of DRAM page crossings.

Previous Works
Q0: Distinguish bandwidth guarantee for different classes of ports
Q1: Distinguish bandwidth guarantee for each port
Q2: Prioritized access
Q3: Residual bandwidth allocation
Q4: Effective DRAM bandwidth
[Table: Q0 Q1 Q2 Q3 Q4 versus [Rixner,00][McKee,00][Hur,04] ✓, [Heighecker,03,05][Whitty,08], [Lee,05], [Burchard,05], Proposed BCBR; remaining check marks not recoverable]
Previous works have tried to address the challenges of MPMC design. They made progress on different aspects of the problem, but none has succeeded in covering all of them at the same time. In particular, few of them have paid attention to residual bandwidth, which is bandwidth that is statically allocated to one port but underutilized at run time.

Key Observations
Port locality: requests from the same port tend to fall into the same DRAM page. Weighted round robin: minimum BW guarantee, bursting service.
Service time flexibility: 1/24 second to decode a video frame, i.e. 4M cycles at 100 MHz available for request reordering. Credit borrow & repay: reorder requests according to priority.
Residual bandwidth: statically allocated BW that is underutilized at runtime. Dynamic BW calculation: capture and re-allocate residual BW.
To achieve the design objectives, we leverage three key observations about MPMC design. The first is port locality: we believe that requests from the same port are more likely to fall into the same DRAM page. We use WRR to exploit port locality because it provides a minimum bandwidth guarantee and, more importantly, supports continuous service to the same port, which can potentially reduce the number of DRAM page crossings. We use WRR as the baseline scheduler. The second observation is service time flexibility: bandwidth-sensitive ports may not be sensitive to the service latency of their requests. For example, each video frame has a time budget of 1/24 second to be decoded, which translates into about 4 million cycles at a 100 MHz clock. To exploit this flexibility, we propose a credit borrow and repay mechanism that reorders requests according to priority. The last observation is that there is residual bandwidth at runtime: ports may underutilize their statically allocated bandwidth. We use a dynamic BW calculation scheme to capture the residual bandwidth and re-allocate it to important users.
Notes: baseline algorithm; breadth-first page crossing; reduce latency while keeping the BW guarantee; difference between static allocation and dynamic usage; can be utilized effectively, not aware.

Weighted Round Robin
Assume bandwidth requirements Q2: 30%, Q1: 50%, Q0: 20%, with Tround = 10. Time is measured in scheduling cycles; T(Rij) is the arrival time of the jth request for Qi.
[Timing diagram: clock cycles 1 to 9; arrival rows T(R2): R20 R21 R22; T(R1): R10 R11 R12 R13 R14; T(R0): R00 R01; service rows Q2: R20 R21 R22; Q1: R10 R11 R12 R13 R14; Q0: R00 R01]
Let us illustrate how WRR works. Consider three ports with different bandwidth requirements, and assume there are 10 cycles in each round of scheduling. In the diagram, the top row shows the clock cycles, a separate row shows the arrival times of the requests for each port, and another row shows the service times of the requests. In the first cycle, three requests arrive at the ports, and the scheduler chooses to service Q2. In the following cycles, Q2 gets continuous service. In cycle 4, because Q2 has exhausted its scheduling credits, the scheduler switches to servicing Q1, and so on; Q0 gets scheduled in the last two cycles of the round. We can see that, besides the minimum bandwidth guarantee, the WRR algorithm provides continuous service to each port and thus can potentially reduce the number of DRAM page crossings by exploiting port locality.
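The round described above can be replayed with a few lines of Python. This is a minimal sketch rather than the scheduler from the talk: it assumes every request is already queued at cycle 1, that ports are served in the order Q2, Q1, Q0, and that an unused slot is simply left idle. The function name and data layout are illustrative.

```python
from collections import deque


def wrr_round(order, weights, queues):
    """Serve one weighted round-robin round.

    order   : ports in service order
    weights : port -> number of slots (credits) per round
    queues  : port -> deque of pending request names
    Returns a list of (cycle, port, request) tuples.
    """
    schedule = []
    cycle = 1
    for port in order:
        for _ in range(weights[port]):
            if queues[port]:
                schedule.append((cycle, port, queues[port].popleft()))
            # A slot whose owner has nothing to send stays idle in this sketch.
            cycle += 1
    return schedule


weights = {"Q2": 3, "Q1": 5, "Q0": 2}  # 30%, 50%, 20% of a 10-cycle round
queues = {
    "Q2": deque(["R20", "R21", "R22"]),
    "Q1": deque(["R10", "R11", "R12", "R13", "R14"]),
    "Q0": deque(["R00", "R01"]),
}
schedule = wrr_round(["Q2", "Q1", "Q0"], weights, queues)

# Waiting time of Q0's first request, assuming it arrived in cycle 1.
q0_wait = next(c for c, p, _ in schedule if p == "Q0") - 1
print(schedule)
print("Q0 waits", q0_wait, "cycles")
```

Under these assumptions Q0 is first served in cycle 9, which is where the long waiting time of plain WRR comes from.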

Problem with WRR
Priority: Q0 > Q2. 8 cycles of waiting time! Could be worse!
[Timing diagram: clock cycles 1 to 9; service rows Q2: R20 R21 R22; Q1: R10 R11 R12 R13 R14; Q0: R00 R01, served only after 8 cycles of waiting]
The problem with WRR is that the average waiting time can be very long. For example, although Q0 is latency sensitive and has higher priority, it needs to wait 8 cycles in each round of scheduling. This waiting time can be much longer if each round of scheduling takes more than 10 cycles.

Borrow Credits
Zero waiting time for Q0!
[Timing diagram: clock cycles 1 to 9; arrival rows T(R2): R20 R21 R22; T(R1): R10 R11 R12; T(R0): R00 R01; Q0* borrows the first two slots from Q2 and sends R00 R01; Q2 then sends R20; debtQ0 holds Q2, Q2]
To solve this problem, we propose the credit borrow and repay technique. We illustrate it by attaching a FIFO queue, the debtQ, to Q0. Now let us restart the scheduling process. In the first cycle, three requests arrive and the scheduler chooses to service port 2. However, since Q0 has higher priority, it borrows the scheduling opportunity from Q2 and sends its request to DRAM; at the same time, the port ID of Q2 is pushed into the debtQ. Then Q0 borrows a second scheduling slot from Q2 and pushes the ID into the debtQ again. In the third cycle, since there is no request in Q0, Q2 keeps the scheduling opportunity for itself and sends a request.

Repay Later
At Q0’s turn, the BW guarantee is recovered. Prioritized access!
[Timing diagram: clock cycles 1 to 9; Q1 sends R10 R11 R12 R13 R14; at Q0’s turn, Q0* repays its slots to Q2, which sends its remaining requests; debtQ0 drains]
In the end, when Q0 gets its scheduling opportunities, it repays the credits to Q2, so that the bandwidth guarantee is recovered while Q0 still gets immediate service for its requests.
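The borrow and repay steps can be sketched as a small extension of a weighted round-robin loop. This is an illustrative model, not the BCBR implementation: it assumes a single high-priority port, all requests queued at cycle 1, a debtQ depth of 2, and that a repaid slot is simply handed to the lender at the head of the debtQ. All names are hypothetical.

```python
from collections import deque


def cbr_round(order, weights, queues, high, debt_depth):
    """One WRR round with credit borrow and repay for the port `high`."""
    debt = deque()   # FIFO of port IDs that lent a slot to `high`
    schedule = []
    cycle = 1
    for owner in order:
        for _ in range(weights[owner]):
            port = owner
            if owner != high and queues[high] and len(debt) < debt_depth:
                debt.append(owner)     # borrow: remember who lent the slot
                port = high
            elif owner == high and not queues[high] and debt:
                port = debt.popleft()  # repay: give this slot to a lender
            if queues[port]:
                schedule.append((cycle, port, queues[port].popleft()))
            cycle += 1
    return schedule, list(debt)


weights = {"Q2": 3, "Q1": 5, "Q0": 2}
queues = {
    "Q2": deque(["R20", "R21", "R22"]),
    "Q1": deque(["R10", "R11", "R12", "R13", "R14"]),
    "Q0": deque(["R00", "R01"]),
}
schedule, debt = cbr_round(["Q2", "Q1", "Q0"], weights, queues, "Q0", 2)
print(schedule)
```

In this run Q0 is served in cycles 1 and 2, Q2 still receives three slots over the round (one in cycle 3 and two repaid in cycles 9 and 10), and the debtQ ends the round empty.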

Problem: Depth of DebtQ
DebtQ as residual BW collector: the BW allocated to Q0 increases to 20% + residual BW, and the requirement for the depth of DebtQ0 decreases.
[Timing diagram: clock cycles 1 to 9; arrival rows T(R2): R20 R21 R22; T(R1): R10 R11 R12 R13; T(R0): R00 R01 R03; Q1’s unused slot helps repay the debt to Q2]
However, the problem with CBR is that performance is limited by the depth of the debtQ. For example, if Q0 has one extra request arriving at cycle 2, no borrowing can occur because the debtQ is full, so the extra request cannot receive immediate service. To solve the problem, we either have to increase the depth of the debtQ or leverage the residual bandwidth. Now assume Q1 has only 4 requests. At cycle 7, when Q0 gets the scheduling token, it first uses it to send out the extra request. It then uses the other scheduling slot to repay part of the debt; however, the debtQ is still not empty. Now, if we use the residual bandwidth to help repay the debt, Q0 can have an empty debtQ at the end of the round while also having the extra request sent out to DRAM. By doing this, the residual bandwidth is actually consumed by Q0, and the requirement for the depth of the debtQ is reduced.
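The effect of residual bandwidth on the debtQ can be checked with a small simulation. The sketch below is an assumption-laden model, not the scheduler from the talk: requests carry an arrival cycle, the debtQ depth is 2, slots follow a fixed 3/5/2 split in the order Q2, Q1, Q0, and a slot whose owner has nothing to send is used to repay the oldest debt when `use_residual` is set. The exact cycle numbers therefore differ from the slide; only the outcome (empty versus non-empty debtQ) is the point.

```python
from collections import deque


def cbr_round_residual(order, weights, queues, high, debt_depth, use_residual):
    """One round of borrow/repay where idle slots may help repay debt.

    queues maps port -> deque of (arrival_cycle, request_name).
    """
    debt = deque()
    schedule = []
    cycle = 1

    def ready(port):
        q = queues[port]
        return bool(q) and q[0][0] <= cycle

    for owner in order:
        for _ in range(weights[owner]):
            port = owner
            if owner != high and ready(high) and len(debt) < debt_depth:
                debt.append(owner)      # borrow a slot from its owner
                port = high
            elif owner == high and not ready(high) and debt:
                port = debt.popleft()   # repay from the borrower's own slot
            elif owner != high and not ready(owner) and debt and use_residual:
                port = debt.popleft()   # repay using residual bandwidth
            if ready(port):
                schedule.append((cycle, port, queues[port].popleft()[1]))
            cycle += 1
    return schedule, list(debt)


def make_queues():
    return {
        "Q2": deque((1, n) for n in ["R20", "R21", "R22"]),
        "Q1": deque((1, n) for n in ["R10", "R11", "R12", "R13"]),
        "Q0": deque([(1, "R00"), (1, "R01"), (2, "R03")]),
    }


order, weights = ["Q2", "Q1", "Q0"], {"Q2": 3, "Q1": 5, "Q0": 2}
with_res, debt_res = cbr_round_residual(order, weights, make_queues(), "Q0", 2, True)
without, debt_no = cbr_round_residual(order, weights, make_queues(), "Q0", 2, False)
print("with residual:   ", [r for _, _, r in with_res], "debt left:", debt_res)
print("without residual:", [r for _, _, r in without], "debt left:", debt_no)
```

With residual repayment Q0 sends three requests in the round and the debtQ drains; without it, one debt to Q2 is still outstanding at the end of the round and R22 is not served.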

Evaluation Framework
Simulation framework. Workload: ALPBench suite. DRAMSim: simulates DRAM latency, BW and power. Reference schedulers: PQ, RR, WRR, BGPQ.
To evaluate our algorithm, we built a simulation framework that includes a cycle-accurate DRAM simulator, which simulates the latency, bandwidth and power consumption of DRAM; a C model of our scheduler; and a set of workload traces, including CPU cache traces and multimedia traces. We compare the results with other schedulers, including PQ, RR, WRR and BGPQ.
Notes: benchmark explanation, Pin tool.

Bandwidth Guarantee
Bandwidth guarantees: P0: 2%, P1: 30%, P2: 20%, P3: 20%, P4: 20%; system residual: 8%.
[Table of achieved bandwidth per port, cells partly lost: Port 1 2 3 4; RR: 1.08%, 24%; PQ: 0.73%, 80%, 18%, 0%; BGPQ: 1.07%, 39%, 20%; WRR: 0.76%, 33%, 22%; BCBR. RR and PQ: no BW guarantee. BGPQ, WRR and BCBR: provide BW guarantee!]
We first examine whether the proposed scheduler can achieve the minimum bandwidth guarantee. We set the bandwidth requirement of each port to be 2 percent for port 0… Note that we leave 8% as system residual to test the impact of residual bandwidth. The final bandwidth allocation results of the different schedulers are shown in the table. We can see that RR and PQ do not achieve the bandwidth guarantee; in particular, the PQ scheduler causes starvation on ports 3 and 4. In contrast, BGPQ, WRR and BCBR all achieve the minimum bandwidth guarantees and have similar bandwidth allocation results. However, does that mean these three schedulers also have the same service latency?

Cache Response Latency
Average 16x faster than WRR. As fast as PQ (prioritized access).
[Figure: cache response latency (ns) per scheduler]
The figure shows the cache response latency under the different schedulers. The three schedulers on the right have very different service latencies. The WRR is significantly slower than the others; the proposed scheduler is more than 16x faster than the WRR scheduler, more than 1.6x faster than the RR scheduler, and as fast as the PQ scheduler. This demonstrates that it can provide prioritized access to latency-sensitive ports.

DRAM Energy & BW Efficiency
30% less page crossing (compared to RR); 1.4x more energy efficient; 1.2x higher effective DRAM BW; as good as WRR (exploits port locality).
Scheduler: RR BGPQ WRR BCBR
GB/J: 0.298 0.289 0.412 0.411
Act-Pre Ratio: 29.6% 30.1% 23.0%
Improvement: 1.0x 0.97x 1.38x
Finally, we examine the DRAM energy and bandwidth efficiency. BCBR issues 30% fewer page crossing commands and is therefore 1.4x more energy efficient and achieves 1.2x higher effective bandwidth compared to RR. It is as good as WRR, which confirms our intuition that by providing continuous service to the same port, we can exploit port locality. It looks like we have achieved our design objectives, but do all these benefits come at a cost?
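The Improvement row is consistent with dividing each scheduler's GB/J figure by the RR baseline. The short check below recomputes those ratios from the four GB/J values on the slide; treating RR as the baseline is an inference from the 1.0x entry, and the variable names are illustrative.

```python
# GB/J values as listed on the slide.
gb_per_joule = {"RR": 0.298, "BGPQ": 0.289, "WRR": 0.412, "BCBR": 0.411}

baseline = gb_per_joule["RR"]  # assumption: improvement is relative to RR
improvement = {name: round(value / baseline, 2) for name, value in gb_per_joule.items()}

for name, ratio in improvement.items():
    print(f"{name}: {ratio:.2f}x")
```

Both WRR and BCBR come out at about 1.38x, which matches the rounded 1.4x energy-efficiency claim.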

Hardware Cost
BCBR (frontend): 1393 LUTs, 884 registers, 0 BRAM
Reference backend (Speedy DDRMC): 1986 LUTs, 1380 registers, 4 BRAMs
BCBR + Speedy: 3379 LUTs, 2264 registers, 4 BRAMs
Xilinx MPMC (frontend + backend): 3450 LUTs, 5540 registers, 1-9 BRAMs
Better performance without higher cost!
To evaluate the hardware cost, we implemented the proposed scheduler on a Virtex-6 FPGA. The scheduler costs about 1.4K LUTs and 900 registers and requires no BRAMs. However, this cost accounts only for the frontend controller. As a reference, the Speedy backend memory controller for FPGAs costs about 2000 LUTs, 1.4K registers and 4 BRAMs. The aggregate cost is still less than the number reported by Xilinx for their MPMC design. In conclusion, our scheduler has better performance without incurring higher cost.

Agenda
Overview
Multi-Port Memory Controller (MPMC) Design
“Out-of-Core” Algorithm / Architecture Exploration

Idea
Out-of-core algorithms: data does not fit in DRAM; performance is dominated by I/O. Key questions: reduce the number of I/Os; block granularity.
Remember DRAM = Disk. So let’s ask the same questions, plug in DRAM parameters, and get DRAM-specific answers.

Motivating Example: CDN
Caches in CDN: get closer to users, save bandwidth. Zipf’s law: 80-20 rule, hit rate.
For services like YouTube to provide a good user experience, there needs to be a content distribution network with many caches to bring content close to users and save upstream bandwidth. The idea of caching is well supported by Zipf’s law, also known as the 80-20 rule, which states that 80% of viewers are attracted by the 20% most popular content. With Zipf’s law, we don’t need to worry about the cache hit rate. However, for flash memory, the cache churn rate becomes an important metric.
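The 80-20 intuition can be checked numerically. The sketch below computes the share of requests that go to the most popular fraction of a catalog when popularity follows a Zipf distribution. The catalog size of 1000 items and the exponent of 1 are assumptions chosen for illustration; the talk does not give them.

```python
def zipf_top_share(n_items, top_fraction, exponent=1.0):
    """Fraction of requests hitting the top `top_fraction` of items.

    Item popularity is proportional to 1 / rank**exponent (Zipf's law).
    """
    weights = [1.0 / (rank ** exponent) for rank in range(1, n_items + 1)]
    top_k = int(n_items * top_fraction)
    return sum(weights[:top_k]) / sum(weights)


share = zipf_top_share(1000, 0.2)
print(f"top 20% of items receive {share:.1%} of requests")
```

Under these assumptions a cache holding the most popular 20% of items would serve a little under 80% of requests, which is the sense in which Zipf's law makes the hit rate a minor concern.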

Video Cache

Defining the Knobs
Transaction: a number of column access commands enclosed by row activation / precharge. W: burst size. s: number of bursts.
[Figure annotations: function of algorithmic parameters; function of array organization & timing params]

D-ary Heap
Algorithmic design variables: branching factor, record size
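To make the branching-factor knob concrete, here is a minimal array-based d-ary min-heap. It is a generic textbook sketch, not code from the talk. The property relevant here is that the children of node i sit at the contiguous positions d*i+1 through d*i+d, so each sift-down level reads one contiguous group of d records; a larger d gives fewer levels but more records per level. Mapping such a group onto one DRAM transaction is an assumption about how the knob would be used.

```python
class DaryHeap:
    """Array-based min-heap with branching factor d."""

    def __init__(self, d):
        if d < 2:
            raise ValueError("branching factor must be at least 2")
        self.d = d
        self.items = []

    def push(self, key):
        items = self.items
        items.append(key)
        i = len(items) - 1
        while i > 0:                       # sift up: one parent per level
            parent = (i - 1) // self.d
            if items[parent] <= items[i]:
                break
            items[parent], items[i] = items[i], items[parent]
            i = parent

    def pop(self):
        items = self.items
        if not items:
            raise IndexError("pop from empty heap")
        top = items[0]
        last = items.pop()
        if items:
            items[0] = last
            i, n = 0, len(items)
            while True:                    # sift down: d contiguous children per level
                first = self.d * i + 1
                if first >= n:
                    break
                children = range(first, min(first + self.d, n))
                smallest = min(children, key=items.__getitem__)
                if items[i] <= items[smallest]:
                    break
                items[i], items[smallest] = items[smallest], items[i]
                i = smallest
        return top


def heap_sort(values, d):
    heap = DaryHeap(d)
    for v in values:
        heap.push(v)
    return [heap.pop() for _ in values]


print(heap_sort([7, 3, 9, 1, 4, 8, 2, 6, 5, 0], d=4))
```

Changing `d` does not change the result, only how the work is split between tree depth and per-level fan-out.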

B+ Tree

Lessons Learned
Optimal results can be beautifully derived!
Big O does not matter in some cases, depending on data input characteristics.
