Presentation on theme: "Recent Progress In Embedded Memory Controller Design"— Presentation transcript:
1 Recent Progress In Embedded Memory Controller Design. MEAOW’13. Definition of FPGA. Jianwen Zhu, Department of Electrical and Computer Engineering, University of Toronto
2 Acknowledgment: PhD work of Zefu Dai
3 Memory Hierarchy: Latency (L), Capacity (C), Bandwidth (BW)
Cache: L: 0.5ns, C: 10MB
Controller
DRAM: L: 50ns, C: 100GB, BW: 100GB/s
Flash: L: 10us, C: 2TB, BW: 2GB/s
Disk: L: 10ms, C: 4TB, BW: 600MB/s
Nothing is more important than the concept of memory hierarchy.
4 DRAM Primer: address is <bank, row, column>; one page buffer per bank
5 DRAM Characteristics
DRAM page crossing: charges ~10K DRAM cells and bitlines; increases power and latency; decreases effective bandwidth.
Sequential access vs. random access: fewer page crossings; lower power consumption; 4.4x shorter latency; 10x better BW.
Looking further into DRAM characteristics, one important feature is that DRAM page crossing has a significant impact on the latency and bandwidth of off-chip memory. Each page crossing charges tens of thousands of DRAM cells and bitlines, which greatly increases power and decreases performance. In general, DRAM prefers sequential access over random access, because a sequential access pattern incurs fewer page crossings, which can significantly improve latency and bandwidth.
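A minimal sketch of why sequential access crosses fewer pages. The address mapping here (2 KB pages, 8 banks, consecutive pages interleaved across banks, 64-byte requests) is an assumption for illustration and is not taken from the slides. The function counts how often an access lands on a row other than the one currently held in its bank's page buffer.

```python
import random

PAGE_BYTES = 2048   # assumed page (row) size, not from the slides
NUM_BANKS = 8       # assumed number of banks, not from the slides
ACCESS_BYTES = 64   # assumed request size, not from the slides


def decode(addr):
    """Split a byte address into (bank, row, column) under the assumed mapping."""
    page, column = divmod(addr, PAGE_BYTES)
    row, bank = divmod(page, NUM_BANKS)
    return bank, row, column


def count_page_crossings(addresses):
    """Count accesses that need a different row than the one open in their bank."""
    open_row = {}          # bank -> row currently held in that bank's page buffer
    crossings = 0
    for addr in addresses:
        bank, row, _ = decode(addr)
        if open_row.get(bank) != row:
            crossings += 1             # the first activation of a bank also counts
            open_row[bank] = row
    return crossings


N = 1024
sequential = [i * ACCESS_BYTES for i in range(N)]
rng = random.Random(0)
random_access = [rng.randrange(0, 1 << 20, ACCESS_BYTES) for _ in range(N)]

seq_crossings = count_page_crossings(sequential)
rand_crossings = count_page_crossings(random_access)
print("sequential:", seq_crossings, "random:", rand_crossings)
```

Under these assumptions the sequential stream opens each page once, while the random stream reopens a row on almost every access. The sketch only counts crossings; it does not model the latency or bandwidth figures quoted on the slide.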
9 Motivating Example: H.264 Decoder
Diverse QoS requirements: bandwidth sensitive; latency sensitive. Dynamic latency, BW and power.
For example, bandwidth-sensitive ports want to reserve a certain amount of memory bandwidth but are less sensitive to service latency. In contrast, latency-sensitive ports such as caches require their requests to be serviced as fast as possible. Adding to the problem, off-chip memory has dynamic latency and bandwidth that depend on the scheduling order. Therefore, it is important to consider the characteristics of DRAM when designing a memory scheduler.
10 Wanted
Bandwidth guarantee; prioritized access; reduced page crossing.
To restate the problem: we need a multi-port memory controller (MPMC) that provides a minimum bandwidth guarantee for bandwidth-sensitive ports and prioritized access for latency-sensitive ports. Finally, because power consumption generally matters in embedded systems, the scheduler should also improve performance by reducing the number of DRAM page crossings.
11 Previous Works
Q0: Distinguish bandwidth guarantee for different classes of ports
Q1: Distinguish bandwidth guarantee for each port
Q2: Prioritized access
Q3: Residual bandwidth allocation
Q4: Effective DRAM bandwidth
Comparison table (columns Q0 to Q4; rows [Rixner,00], [McKee,00], [Hur,04], [Heighecker,03,05], [Whitty,08], [Lee,05], [Burchard,05], Proposed BCBR).
Previous works have tried to address the challenges of MPMC design. They made progress on different aspects of the problem, but none has covered all of them at the same time. In particular, few of them have paid attention to residual bandwidth, which is bandwidth that is statically allocated to one port but underutilized at run time.
12 Key Observations
Port locality: requests from the same port go to the same DRAM page. Addressed by weighted round robin: minimum BW guarantee; bursting service.
Service time flexibility: 1/24 second to decode a video frame; about 4M cycles at 100 MHz for request reordering. Addressed by credit borrow & repay: reorder requests according to priority.
Residual bandwidth: statically allocated BW; underutilized at runtime. Addressed by dynamic BW calculation: capture and re-allocate residual BW.
To achieve the design objectives, we leverage three key observations about MPMC design.
The first is port locality: we believe that requests from the same port are more likely to fall into the same DRAM page. We use weighted round robin (WRR) to exploit port locality, because it provides a minimum bandwidth guarantee. More importantly, it supports continuous service to the same port, which can potentially reduce the number of DRAM page crossings. We use WRR as the baseline scheduler.
The second observation is service time flexibility. Bandwidth-sensitive ports may not be sensitive to the service latency of their requests. For example, each video frame has a time budget of 1/24 second to be decoded, which translates into about 4 million cycles at a 100 MHz clock. To use this flexibility, we propose a credit borrow and repay mechanism that reorders requests according to priority.
The last observation is that there is residual bandwidth at runtime: ports may underutilize their statically allocated bandwidth. We use a dynamic BW calculation scheme to capture residual bandwidth and re-allocate it to important users.
Notes: baseline algorithm; breadth-first page crossing; reduce the latency while keeping the BW guarantee; difference between static allocation and dynamic usage; can be utilized effectively, not aware.
13 Weighted Round Robin
Assume bandwidth requirements Q2: 30%, Q1: 50%, Q0: 20%, with Tround = 10.
Time: scheduling cycles. T(Rij): arrival time of the jth request for Qi.
Clock: 1 2 3 4 5 6 7 8 9
Request time T(R2): R20 R21 R22. Service time Q2: R20 R21 R22
Request time T(R1): R10 R11 R12 R13 R14. Service time Q1: R10 R11 R12 R13 R14
Request time T(R0): R00 R01. Service time Q0: R00 R01
Let us illustrate how WRR works. Consider three ports with different bandwidth requirements, and assume there are 10 cycles in each round of scheduling. The diagram shows the scheduling process: the top row shows the clock cycles, a separate row shows the arrival time of requests for each port, and another row shows the service time of those requests.
In the first cycle, requests arrive at each of the three ports, and the scheduler chooses to service Q2. In the following cycles, Q2 gets continuous service. In cycle 4, because Q2 has exhausted its scheduling credits, the scheduler switches to Q1, and so on. Q0 is scheduled in the last two cycles of the round.
Besides the minimum bandwidth guarantee, the WRR algorithm provides continuous service to each port, and thus can potentially reduce the number of DRAM page crossings by exploiting port locality.
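The round described above can be sketched as follows. This is a minimal illustration, not the scheduler's actual implementation: the function name `wrr_round`, the service order Q2, Q1, Q0, and the choice to preload every queue before the round starts are assumptions made for the example.

```python
from collections import deque


def wrr_round(order, credits, pending):
    """Serve one WRR round: each port gets `credits[port]` consecutive slots."""
    schedule = []
    for port in order:
        for _ in range(credits[port]):
            # An empty queue leaves the slot unused in this sketch.
            request = pending[port].popleft() if pending[port] else None
            schedule.append((port, request))
    return schedule


order = ["Q2", "Q1", "Q0"]                  # service order assumed from the slide
credits = {"Q2": 3, "Q1": 5, "Q0": 2}       # 30%, 50%, 20% of Tround = 10
pending = {
    "Q2": deque(["R20", "R21", "R22"]),
    "Q1": deque(["R10", "R11", "R12", "R13", "R14"]),
    "Q0": deque(["R00", "R01"]),
}

schedule = wrr_round(order, credits, pending)
for cycle, (port, request) in enumerate(schedule, start=1):
    print(cycle, port, request)

# Cycle in which Q0 is first served, and how long it waited if its
# requests were already present in cycle 1.
first_q0_cycle = next(c for c, (p, _) in enumerate(schedule, start=1) if p == "Q0")
q0_wait = first_q0_cycle - 1
```

Each port is served in one contiguous run, which is what lets WRR exploit port locality. The same run structure is also why Q0 is not served until the end of the round.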
14 Problem with WRR
Priority: Q0 > Q2. 8 cycles of waiting time! Could be worse!
Diagram: same schedule as the previous slide (Q2 serves R20 to R22, Q1 serves R10 to R14, then Q0 serves R00 and R01).
The problem with WRR is that the average waiting time can be very long. For example, although Q0 is latency sensitive and has higher priority, it needs to wait for 8 cycles in each round of scheduling. This waiting time can be much larger if each round of scheduling takes more than 10 cycles.
15 Borrow Credits
Zero waiting time for Q0!
Diagram: Q0* borrows slots from Q2 and serves R00 and R01; Q2 then serves R20; debtQ0: Q2, Q2.
To solve this problem, we propose the credit borrow and repay technique. We illustrate the technique by attaching a FIFO debt queue (debtQ) to Q0.
Now, let's restart the scheduling process. In the first cycle the requests arrive, and the scheduler chooses to service port 2. However, since Q0 has higher priority, it borrows the scheduling opportunity from Q2 and sends its request to DRAM. At the same time, the port ID of Q2 is pushed into the debtQ. Then Q0 borrows a second scheduling slot from Q2 and again pushes the ID into the debtQ. In the third cycle, since there is no request in Q0, Q2 keeps the scheduling opportunity for itself and sends a request.
16 Repay Later
At Q0's turn, the BW guarantee is recovered. Prioritized access!
Diagram: Q2 serves R20 to R22 and Q1 serves R10 to R14 over the round; Q0* has already served R00 and R01; debtQ0 drains as Q0 repays Q2.
In the end, when Q0 gets its own scheduling opportunities, it repays the credits to Q2. The bandwidth guarantee is thus recovered, yet Q0 gets immediate service for its requests.
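The borrow and repay steps can be sketched as below. This is a simplified model under stated assumptions, not the BCBR hardware: the names `cbr_round`, `slots`, and `debt_depth` are hypothetical, all requests are assumed to be queued before the round starts, and a borrowed slot is assumed to consume the lender's credit until it is repaid at Q0's turn.

```python
from collections import deque


def cbr_round(slots, pending, hi="Q0", debt_depth=2):
    """One round of credit borrow and repay on top of a WRR slot order.

    slots:   token owner for each cycle of the round (the WRR order)
    pending: dict mapping port name -> deque of queued requests
    hi:      the latency-sensitive port that owns the debt queue
    """
    debt = deque()       # FIFO of port IDs that hi has borrowed slots from
    served = []
    for owner in slots:
        if owner != hi and pending[hi] and len(debt) < debt_depth:
            debt.append(owner)        # borrow: hi uses the owner's slot now
            port = hi
        elif owner == hi and not pending[hi] and debt:
            port = debt.popleft()     # repay: hi gives its own slot to a lender
        else:
            port = owner              # ordinary WRR service
        request = pending[port].popleft() if pending[port] else None
        served.append((port, request))
    return served, list(debt)


slots = ["Q2"] * 3 + ["Q1"] * 5 + ["Q0"] * 2
pending = {
    "Q2": deque(["R20", "R21", "R22"]),
    "Q1": deque(["R10", "R11", "R12", "R13", "R14"]),
    "Q0": deque(["R00", "R01"]),
}

served, leftover_debt = cbr_round(slots, pending)
for cycle, (port, request) in enumerate(served, start=1):
    print(cycle, port, request)
```

In this run Q0 is served in cycles 1 and 2 instead of 9 and 10, and Q2 still receives its three slots within the round because the last two slots are handed back to it.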
17 Problem: Depth of DebtQ
DebtQ as residual BW collector: BW allocated to Q0 increases to 20% + residual BW; the required depth of DebtQ0 decreases.
Diagram: Q2 serves R20 to R22; Q1 has only R10 to R13, and its unused slot helps repay; Q0* serves R00, R01 and the extra request R03.
However, the problem with CBR is that its performance is limited by the depth of the debtQ. For example, if Q0 has one extra request arriving at cycle 2, no borrowing can occur because the debtQ is full, so the extra request cannot receive immediate service. To solve the problem, we either have to increase the depth of the debtQ or leverage the residual bandwidth.
Now assume Q1 has only 4 requests. At cycle 7, when Q0 gets the scheduling token, it first uses it to send out the extra request. It then uses the other scheduling slot to repay part of the debt. However, the debtQ is still not empty. If we now use the residual bandwidth to help repay the debt, Q0 can have an empty debtQ at the end of the round while also having sent the extra request to DRAM. By doing this, the residual bandwidth is effectively consumed by Q0, and the required depth of the debtQ is reduced.
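A minimal sketch of using residual bandwidth to help repay the debt, under the same simplifying assumptions as a slot-based WRR model: every request is queued before the round starts, the debtQ depth is 2, and an unused slot of any port is treated as residual bandwidth. The function name `cbr_round` and the `use_residual` flag are hypothetical, and the cycle numbering of this model need not match the slide's diagram.

```python
from collections import deque


def cbr_round(slots, pending, hi="Q0", debt_depth=2, use_residual=True):
    """Credit borrow and repay, optionally repaying debt with residual slots."""
    debt = deque()
    served = []
    for owner in slots:
        if owner != hi and pending[hi] and len(debt) < debt_depth:
            debt.append(owner)        # borrow the owner's slot for hi
            port = hi
        elif owner == hi and not pending[hi] and debt:
            port = debt.popleft()     # hi repays with its own slot
        elif use_residual and not pending[owner] and debt:
            port = debt.popleft()     # an unused slot repays on hi's behalf
        else:
            port = owner
        request = pending[port].popleft() if pending[port] else None
        served.append((port, request))
    return served, list(debt)


def make_pending():
    """Slide scenario: Q0 has one extra request, Q1 has only 4 requests."""
    return {
        "Q2": deque(["R20", "R21", "R22"]),
        "Q1": deque(["R10", "R11", "R12", "R13"]),
        "Q0": deque(["R00", "R01", "R03"]),
    }


slots = ["Q2"] * 3 + ["Q1"] * 5 + ["Q0"] * 2

with_residual, debt_with = cbr_round(slots, make_pending(), use_residual=True)
without_residual, debt_without = cbr_round(slots, make_pending(), use_residual=False)

idle_with = sum(1 for _, r in with_residual if r is None)
idle_without = sum(1 for _, r in without_residual if r is None)
print("with residual:   ", with_residual, debt_with)
print("without residual:", without_residual, debt_without)
```

In this model, the run without residual repayment wastes Q1's unused slot and ends the round with one unpaid entry in the debtQ. The run with residual repayment serves all ten requests and ends with an empty debtQ at the same depth.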
18 Evaluation Framework
Simulation framework. Workload: ALPBench suite. DRAMSim: simulates DRAM latency + BW + power. Reference schedulers: PQ, RR, WRR, BGPQ.
To evaluate our algorithm, we built a simulation framework that includes a cycle-accurate DRAM simulator, which simulates the latency, bandwidth and power consumption of DRAM; a C model of our scheduler; and a set of workload traces, including CPU cache traces and multimedia traces. We compare the results with other schedulers, including PQ, RR, WRR and BGPQ.
Notes: benchmark explanation, Pin tool.
19 Bandwidth Guarantee
Bandwidth requirements: P0: 2%, P1: 30%, P2: 20%, P3: 20%, P4: 20%. System residual: 8%.
Measured allocation per scheduler (table columns by port; merged cells appear once):
RR: 1.08%, 24% (no BW guarantee)
PQ: 0.73%, 80%, 18%, 0% (no BW guarantee)
BGPQ: 1.07%, 39%, 20%
WRR: 0.76%, 33%, 22%
BCBR
Provides BW guarantee!
We first examine whether the proposed scheduler achieves the minimum bandwidth guarantee. We set the bandwidth requirement of each port as listed above, starting with 2 percent for port 0. Note that we leave 8% as system residual to test the impact of residual bandwidth. The final bandwidth allocation results of the different schedulers are shown in the table.
We can see that RR and PQ do not achieve the bandwidth guarantee. In particular, the PQ scheduler causes starvation in ports 3 and 4. In contrast, BGPQ, WRR and BCBR all achieve the minimum bandwidth guarantees and have similar bandwidth allocation results. However, does that mean these three schedulers also have the same service latency?
20 Cache Response Latency
Average 16x faster than WRR. As fast as PQ (prioritized access).
Figure: cache response latency (ns) by scheduler.
The figure shows the cache response latency under the different schedulers. The three schedulers on the right have very different service latencies. WRR is significantly slower than the others. The proposed scheduler is more than 16x faster than the WRR scheduler and more than 1.6x faster than the RR scheduler. Notably, it is as fast as the PQ scheduler, which demonstrates that it can provide prioritized access to latency-sensitive ports.
21 DRAM Energy & BW Efficiency
30% less page crossing (compared to RR); 1.4x more energy efficient; 1.2x higher effective DRAM BW; as good as WRR (exploits port locality).

                RR      BGPQ    WRR     BCBR
GB/J            0.298   0.289   0.412   0.411
Act-Pre Ratio   29.6%   30.1%   23.0% (WRR and BCBR)
Improvement     1.0x    0.97x   1.38x (WRR and BCBR)

Finally, we examine the DRAM energy and bandwidth efficiency. BCBR issues 30% fewer page crossing commands and is therefore 1.4x more energy efficient and achieves 1.2x higher effective bandwidth compared to RR. It is as good as WRR, which supports our intuition that by providing continuous service to the same port, we can exploit port locality.
It looks like we have achieved our design objectives, but do all these benefits come at a cost?
22 Hardware Cost
BCBR (frontend): 1393 LUTs, 884 registers, 0 BRAM
Reference backend (Speedy DDRMC): 1986 LUTs, 1380 registers, 4 BRAMs
BCBR + Speedy: 3379 LUTs, 2264 registers, 4 BRAMs
Xilinx MPMC (frontend + backend): 3450 LUTs, 5540 registers, 1-9 BRAMs
Better performance without higher cost!
To evaluate the hardware cost, we implemented the proposed scheduler on a Virtex-6 FPGA. The scheduler costs about 1.4K LUTs and 900 registers, and requires no BRAMs. However, this cost only accounts for the frontend controller. As a reference, the Speedy backend memory controller for FPGAs costs about 2000 LUTs, 1.4K registers and 4 BRAMs. The aggregate cost is still less than the numbers reported by Xilinx for their MPMC design. In conclusion, our scheduler has better performance without incurring higher cost.
24 Idea
Out-of-core algorithms: data does not fit in DRAM; performance is dominated by IO.
Key questions: reduce #IOs; block granularity.
Remember DRAM = DISK. So let's: ask the same question; plug in DRAM parameters; get DRAM-specific answers.
25 Motivating Example: CDN
Caches in CDN: get closer to users; save bandwidth. Zipf's law: 80-20 rule; hit rate.
For services like YouTube to provide a good user experience, they need a content distribution network with many caches that bring content close to users and save upstream bandwidth. The idea of caching is well supported by Zipf's law, also known as the 80-20 rule, which states that 80% of viewers are attracted by the 20% most popular content. With Zipf's law, we don't need to worry about the cache hit rate. However, for flash memory, the cache churn rate becomes an important metric.
27 Defining the Knobs
Transaction: a number of column access commands enclosed by row activation / precharge.
W: burst size; s: # bursts.
Function of algorithmic parameters; function of array organization & timing params.