Recent Progress In Embedded Memory Controller Design MEAOW’13 Jianwen Zhu Department of Electrical and Computer Engineering University of Toronto


1 Recent Progress In Embedded Memory Controller Design MEAOW’13 Jianwen Zhu Department of Electrical and Computer Engineering University of Toronto jzhu@eecg.toronto.edu

2 Acknowledgment PhD work of Zefu Dai

3 Memory Hierarchy
Legend: L = latency, C = capacity, BW = bandwidth

Level   Latency   Capacity   Bandwidth
Cache   0.5 ns    10 MB      -
DRAM    50 ns     100 GB     100 GB/s
Flash   10 us     2 TB       2 GB/s
Disk    10 ms     4 TB       600 MB/s

Diagram label: Controller

4 DRAM Primer Page buffer per bank

5 DRAM Characteristics
- DRAM page crossing
  - Charges ~10K DRAM cells and bitlines
  - Increases power and latency
  - Decreases effective bandwidth
- Sequential access vs. random access
  - Fewer page crossings
  - Lower power consumption
  - 4.4x shorter latency
  - 10x better BW
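The effect of page crossings can be illustrated with a small counting model. This is only a sketch: the page size, bank count, address mapping and function names below are assumptions chosen for illustration, not parameters from the talk, and the model counts row activations rather than reproducing the latency or bandwidth figures on the slide.

```python
import random


def count_page_crossings(addresses, page_bytes=2048, num_banks=8):
    """Count row activations in a simple open-page DRAM model.

    Assumed mapping (illustrative only): consecutive pages are interleaved
    across banks, and each bank keeps exactly one row open in its page buffer.
    """
    open_row = {}      # bank -> row currently held in that bank's page buffer
    crossings = 0
    for addr in addresses:
        page = addr // page_bytes
        bank = page % num_banks
        row = page // num_banks
        if open_row.get(bank) != row:
            crossings += 1          # a precharge + activate would be needed here
            open_row[bank] = row
    return crossings


def demo(n=10_000, span=1 << 26, stride=64, seed=0):
    """Return (sequential, random) page-crossing counts for n accesses."""
    rng = random.Random(seed)
    sequential = [i * stride for i in range(n)]
    scattered = [rng.randrange(0, span, stride) for _ in range(n)]
    return count_page_crossings(sequential), count_page_crossings(scattered)
```

Under these assumptions the sequential stream opens each page once, while nearly every random access lands on a closed row.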

6 Take Away: DRAM = Disk

7 Embedded Controller
- Bad news: none available as in general-purpose processors
- Good news: opportunities for customization

8 Agenda
- Overview
- Multi-Port Memory Controller (MPMC) Design
- “Out-of-Core” Algorithmic Exploration

9 Motivating Example: H.264 Decoder
[Figure: H.264 decoder memory ports annotated with bandwidth in MB/s; ports are labeled latency sensitive or bandwidth sensitive]
- Dynamic latency, BW and power
- Diverse QoS requirements

10 Wanted
- Bandwidth guarantee
- Prioritized access
- Reduced page crossing

11 Previous Works
- Bandwidth guarantee
  - Q0: Distinguish bandwidth guarantee for different classes of ports
  - Q1: Distinguish bandwidth guarantee for each port
- Q2: Prioritized access
- Q3: Residual bandwidth allocation
- Q4: Effective DRAM bandwidth

Work                              Criteria met (of Q0 to Q4)
[Rixner,00][McKee,00][Hur,04]     ✓
[Heighecker,03,05][Whitty,08]     ✓✓✓
[Lee,05]                          ✓✓
[Burchard,05]                     ✓✓
Proposed BCBR                     ✓✓✓✓

12 Key Observations
Observations:
- Port locality
  - Requests from the same port -> same DRAM page
- Service time flexibility
  - 1/24 second to decode a video frame
  - About 4M cycles at 100 MHz available for request reordering
- Residual bandwidth
  - Statically allocated BW
  - Underutilized at runtime
Techniques:
- Weighted round robin
  - Minimum BW guarantee
  - Bursting service
- Credit borrow & repay
  - Reorder requests according to priority
- Dynamic BW calculation
  - Capture and re-allocate residual BW

13 Weighted Round Robin
- Assumed bandwidth requirements: Q2: 30%, Q1: 50%, Q0: 20%
- T_round = 10 (time measured in scheduling cycles)
- T(Rij): arrival time of the jth request for Qi
[Timing diagram: request times T(R2), T(R1), T(R0) and service times for Q2, Q1, Q0 over clock cycles 0-9]
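A weighted round robin round with these shares can be sketched as follows. The arrival times and the function names are hypothetical, chosen only to make the example concrete; the slot order (Q2, then Q1, then Q0) follows the order in which the queues are listed on the slide.

```python
from collections import deque


def build_slots(weights):
    """Expand (queue, slots_per_round) pairs into a per-cycle owner list."""
    slots = []
    for name, share in weights:
        slots.extend([name] * share)
    return slots


def wrr(weights, arrivals):
    """Serve one WRR round; return (cycle, queue, wait) for every slot.

    arrivals maps each queue to its sorted request arrival cycles
    (hypothetical values, not taken from the slide).
    """
    pending = {q: deque(times) for q, times in arrivals.items()}
    served = []
    for cycle, owner in enumerate(build_slots(weights)):
        queue = pending[owner]
        if queue and queue[0] <= cycle:
            arrived = queue.popleft()
            served.append((cycle, owner, cycle - arrived))
        else:
            served.append((cycle, None, None))   # owner has nothing ready
    return served


# 30% / 50% / 20% of a 10-cycle round
WEIGHTS = [("Q2", 3), ("Q1", 5), ("Q0", 2)]
ARRIVALS = {"Q2": [0, 1, 2], "Q1": [0, 1, 2, 3, 4], "Q0": [0, 1]}
schedule = wrr(WEIGHTS, ARRIVALS)
```

With these assumed arrivals, Q0's first request arrives at cycle 0 but is not served until cycle 8, because Q0 owns the last two slots of the round.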

14 Problem with WRR
- Priority: Q0 > Q2
- 8 cycles of waiting time! Could be worse!
[Timing diagram: WRR request and service times for Q2, Q1, Q0 over clock cycles 0-9]

15 Borrow Credits
- Zero waiting time for Q0!
[Timing diagram: Q0 borrows service slots from Q2 over clock cycles 0-9; the lender Q2 is recorded in debtQ0]

16 Repay Later
- At Q0’s turn, the BW guarantee is recovered
- Prioritized access!
[Timing diagram: at Q0’s turn, the borrowed slots are repaid to Q2 using the entries in debtQ0]
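The borrow and repay steps can be sketched as a single scheduling round. This is an illustration under stated assumptions, not the BCBR implementation: the arrival times, the `max_debt` bound on the debt queue, and all names are hypothetical.

```python
from collections import deque


def borrow_repay_round(slots, arrivals, high="Q0", max_debt=2):
    """One round of slot-based scheduling with credit borrow and repay.

    slots: per-cycle slot owners, e.g. ["Q2"] * 3 + ["Q1"] * 5 + ["Q0"] * 2
    arrivals: queue -> sorted arrival cycles (hypothetical values)
    high: the prioritized queue that may borrow slots
    max_debt: assumed depth of the debt queue
    Returns (cycle, served_queue, wait, kind) for every slot.
    """
    pending = {q: deque(times) for q, times in arrivals.items()}
    debt = deque()     # lenders that the high-priority queue still owes
    log = []

    def ready(q, cycle):
        return bool(pending[q]) and pending[q][0] <= cycle

    def serve(q, cycle, kind):
        arrived = pending[q].popleft()
        log.append((cycle, q, cycle - arrived, kind))

    for cycle, owner in enumerate(slots):
        if owner != high and ready(high, cycle) and len(debt) < max_debt:
            debt.append(owner)                      # borrow the owner's slot
            serve(high, cycle, "borrow")
        elif owner == high and debt and ready(debt[0], cycle):
            serve(debt.popleft(), cycle, "repay")   # give the slot back
        elif ready(owner, cycle):
            serve(owner, cycle, "own")
        else:
            log.append((cycle, None, None, "idle"))
    return log


SLOTS = ["Q2"] * 3 + ["Q1"] * 5 + ["Q0"] * 2
ARRIVALS = {"Q2": [0, 1, 2], "Q1": [0, 1, 2, 3, 4], "Q0": [0, 1]}
log = borrow_repay_round(SLOTS, ARRIVALS)
```

With these assumed arrivals, both Q0 requests are served with zero wait in slots borrowed from Q2, and Q2 still receives its three slots in the round once Q0's own slots are used to repay it.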

17 Problem: Depth of DebtQ
- DebtQ as residual BW collector
- BW allocated to Q0 increases to: 20% + residual BW
- Requirement for the depth of DebtQ0 decreases
[Timing diagram: residual slots help repay the debt recorded in debtQ0 over clock cycles 0-9]

18 Evaluation Framework
- Simulation framework
- Workload: ALPBench suite
- DRAMSim: simulates DRAM latency, BW and power
- Reference schedulers: PQ, RR, WRR, BGPQ

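19 Bandwidth Guarantee
- Bandwidth guarantees: P0: 2%, P1: 30%, P2: 20%, P3: 20%, P4: 20%
- System residual: 8%

Measured bandwidth per scheduler (ports 0 to 4, values in port order as listed on the slide):
- RR: 1.08%, 24%
- PQ: 0.73%, 80%, 18%, 0%
- BGPQ: 1.07%, 39%, 20%
- WRR: 0.76%, 33%, 22%
- BCBR: 0.76%, 33%, 22%

Annotations: No BW guarantee; Provides BW guarantee!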

20 Cache Response Latency
- Average 16x faster than WRR
- As fast as PQ (prioritized access)
[Chart: latency (ns) per scheduler]

21 DRAM Energy & BW Efficiency
- 30% less page crossing (compared to RR)
- 1.4x more energy efficient
- 1.2x higher effective DRAM BW
- As good as WRR (exploits port locality)

                RR       BGPQ     WRR      BCBR
GB/J            0.298    0.289    0.412    0.411
Act-Pre Ratio   29.6%    30.1%    23.0%    23.0%
Improvement     1.0x     0.97x    1.38x    1.38x

22 Hardware Cost
- Xilinx MPMC (frontend + backend): 3450 LUTs, 5540 registers, 1-9 BRAMs
- BCBR + Speedy: 3379 LUTs, 2264 registers, 4 BRAMs
  - BCBR (frontend): 1393 LUTs, 884 registers, 0 BRAM
  - Reference backend (Speedy DDRMC): 1986 LUTs, 1380 registers, 4 BRAMs
- Better performance without higher cost!

23 Agenda
- Overview
- Multi-Port Memory Controller (MPMC) Design
- “Out-of-Core” Algorithm / Architecture Exploration

24 Idea
- Remember: DRAM = Disk
- So let’s
  - Ask the same question
  - Plug in DRAM parameters
  - Get DRAM-specific answers
- Out-of-core algorithms
  - Data does not fit in DRAM
  - Performance dominated by I/O
- Key questions
  - Reduce the number of I/Os
  - Block granularity

25 Motivating Example: CDN
- Caches in CDN
  - Get closer to users
  - Save bandwidth
- Zipf’s law
  - 80-20 rule
  - Hit rate
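The link between Zipf's law and cache hit rate can be sketched numerically. The model below assumes an idealized cache that always holds the most popular items and a Zipf exponent `alpha` of 1; both are illustrative assumptions, as are the function name and the catalog size.

```python
def zipf_hit_rate(num_items, cache_items, alpha=1.0):
    """Hit rate of an ideal cache holding the cache_items most popular
    objects, when request popularity follows Zipf's law (weight 1/rank^alpha)."""
    weights = [1.0 / (rank ** alpha) for rank in range(1, num_items + 1)]
    return sum(weights[:cache_items]) / sum(weights)


# Caching the top 20% of a hypothetical 10,000-item catalog
rate = zipf_hit_rate(10_000, 2_000)
```

For this catalog size and exponent the hit rate comes out a little above 0.8, which is in line with the 80-20 rule; other exponents or catalog sizes shift the figure.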

26 Video Cache

27 Defining the Knobs
- Transaction: a number of column access commands enclosed by row activation / precharge
- W: burst size (function of array organization & timing params)
- s: # bursts (function of algorithmic parameters)
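One way to see why s matters is a simple cost model in which every transaction pays a fixed activation / precharge overhead and then transfers s bursts of W bytes. The model and all timing values below are hypothetical placeholders, not the parameters or formulas used in the talk.

```python
def effective_bandwidth(s, w_bytes=64, t_burst_ns=5.0, t_overhead_ns=30.0):
    """Bytes per nanosecond for one transaction of s bursts.

    t_overhead_ns stands in for the row activation + precharge cost,
    t_burst_ns for the time to transfer one burst (both assumed values).
    """
    if s < 1:
        raise ValueError("a transaction needs at least one burst")
    return (s * w_bytes) / (t_overhead_ns + s * t_burst_ns)


# Sweep the number of bursts per transaction
sweep = {s: effective_bandwidth(s) for s in (1, 2, 4, 8, 16, 32)}
```

In this model the effective bandwidth rises with s and approaches, but never reaches, the peak of `w_bytes / t_burst_ns`, which is why block granularity is a key question.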

28 D-ary Heap
- Algorithmic design variable: branching factor
- Record size
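For reference, a minimal array-based d-ary min-heap with the branching factor exposed as a parameter might look like the sketch below. It is a generic textbook structure, not the design evaluated in the talk, and the class and method names are illustrative.

```python
class DaryHeap:
    """Array-based min-heap where every node has up to d children."""

    def __init__(self, d=4):
        if d < 2:
            raise ValueError("branching factor must be at least 2")
        self.d = d
        self.items = []

    def push(self, key):
        items = self.items
        items.append(key)
        i = len(items) - 1
        while i > 0:                       # sift up
            parent = (i - 1) // self.d
            if items[parent] <= items[i]:
                break
            items[parent], items[i] = items[i], items[parent]
            i = parent

    def pop(self):
        items = self.items
        top = items[0]                     # raises IndexError if empty
        last = items.pop()
        if items:
            items[0] = last
            i, n = 0, len(items)
            while True:                    # sift down
                first = self.d * i + 1     # index of the first child
                if first >= n:
                    break
                children = range(first, min(first + self.d, n))
                child = min(children, key=items.__getitem__)
                if items[i] <= items[child]:
                    break
                items[i], items[child] = items[child], items[i]
                i = child
        return top
```

A larger branching factor makes the tree shallower, so fewer levels are touched per operation, but each sift-down step compares more children; when the children of a node sit in one contiguous block, that trade-off maps naturally onto block-sized memory accesses.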

29 B+ Tree

30 Lessons Learned
- Optimal result can be beautifully derived!
- Big O does not matter in some cases
  - Depending on data input characteristics

