Recent Progress In Embedded Memory Controller Design

Presentation transcript:

Recent Progress In Embedded Memory Controller Design MEAOW'13 Jianwen Zhu, Department of Electrical and Computer Engineering, University of Toronto, jzhu@eecg.toronto.edu

Acknowledgment: PhD work of Zefu Dai.

Memory Hierarchy Latency, capacity, and bandwidth at each level: cache (L: 0.5 ns, C: 10 MB), behind its cache controller; DRAM (L: 50 ns, C: 100 GB, BW: 100 GB/s), behind the DRAM controller; flash (L: 10 us, C: 2 TB, BW: 2 GB/s); disk (L: 10 ms, C: 4 TB, BW: 600 MB/s). Nothing is more important than the concept of the memory hierarchy.

DRAM Primer A DRAM address decomposes into <bank, row, column>; each bank has its own page buffer (row buffer).
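To make the <bank, row, column> split concrete, here is a minimal sketch (the field widths and the open-row bookkeeping are illustrative assumptions, not the organization of any particular device):

```python
# Minimal sketch of <bank, row, column> address decomposition.
# Field widths are assumptions for illustration (8 banks, 32K rows, 1K columns).
BANK_BITS, ROW_BITS, COL_BITS = 3, 15, 10

def decode(addr: int):
    """Split a word address into (bank, row, column)."""
    col = addr & ((1 << COL_BITS) - 1)
    row = (addr >> COL_BITS) & ((1 << ROW_BITS) - 1)
    bank = (addr >> (COL_BITS + ROW_BITS)) & ((1 << BANK_BITS) - 1)
    return bank, row, col

# One open page ("page buffer") per bank: an access to a different row in the
# same bank forces a precharge + activate, i.e. a page crossing.
open_row = {}

def access(addr: int) -> bool:
    bank, row, _ = decode(addr)
    hit = open_row.get(bank) == row
    open_row[bank] = row
    return hit        # True = row-buffer hit, False = page crossing

if __name__ == "__main__":
    print(access(0x1234), access(0x1238))   # second access hits the open page
```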

DRAM Characteristics DRAM page crossing charges ~10K DRAM cells and bitlines, increases power and latency, and decreases effective bandwidth. Sequential access vs. random access: fewer page crossings, lower power consumption, 4.4x shorter latency, 10x better BW.
Digging further into DRAM characteristics, one important feature of DRAM is that page crossing has a significant impact on the latency and bandwidth of off-chip memory. Each page crossing charges tens of thousands of DRAM cells and bitlines, which greatly increases power and degrades performance. In general, DRAM prefers sequential access over random access, because a sequential access pattern incurs fewer page crossings and therefore achieves significantly better latency and bandwidth.
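A back-of-the-envelope model of why page crossings hurt (the timing constants below are generic assumptions, not the measurements behind the 4.4x and 10x figures above):

```python
# Rough effective-bandwidth model: every page crossing adds precharge + activate
# overhead on top of the burst transfer itself. Timing values are assumptions.
T_BURST = 5.0      # ns to transfer one 64-byte burst from an open page
T_CROSS = 40.0     # ns of precharge + activate overhead per page crossing
BURSTS_PER_PAGE = 32

def effective_bw(page_hit_rate: float) -> float:
    """GB/s delivered for a given fraction of accesses that hit the open page."""
    avg_time = T_BURST + (1.0 - page_hit_rate) * T_CROSS   # ns per 64-byte burst
    return 64.0 / avg_time                                  # bytes/ns == GB/s

print(effective_bw(31 / 32))   # sequential: cross once per page -> ~10 GB/s
print(effective_bw(0.0))       # random: cross on every access   -> ~1.4 GB/s
```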

Take Away: DRAM = Disk

Embedded Controller Bad news: none is available off the shelf, as there is in a general-purpose processor. Good news: opportunities for customization.

Agenda Overview Multi-Port Memory Controller (MPMC) Design “Out-of-Core” Algorithmic Exploration

Motivating Example: H.264 Decoder Diverse QoS requirements: bandwidth sensitive, latency sensitive; dynamic latency, BW and power. (Slide figure: per-port bandwidth demands in MB/s: 6.4, 9.6, 1.2, 164.8, 0.09, 31.0, 156.7, 94.)
For example, bandwidth-sensitive ports want to reserve a certain amount of memory bandwidth but are less sensitive to service latency; latency-sensitive ports such as the caches, on the contrary, require their requests to be serviced as fast as possible. Adding to the problem, off-chip memory has dynamic latency and bandwidth that depend on the scheduling order. Therefore, it is important to exploit the characteristics of DRAM when designing the memory scheduler.

Wanted Bandwidth guarantee, prioritized access, reduced page crossing.
Restating the problem: we need an MPMC that provides a minimum bandwidth guarantee for bandwidth-sensitive ports and prioritized access for latency-sensitive ports. Finally, because power consumption is generally important in embedded systems, the scheduler should also improve performance by reducing the number of DRAM page crossings.

Previous Works Q0: Distinguish bandwidth guarantee for different classes of ports. Q1: Distinguish bandwidth guarantee for each port. Q2: Prioritized access. Q3: Residual bandwidth allocation. Q4: Effective DRAM bandwidth. (Slide table: prior schedulers [Rixner,00], [McKee,00], [Hur,04], [Heighecker,03,05], [Whitty,08], [Lee,05], [Burchard,05] and the proposed BCBR compared against Q0-Q4.)
Previous works have tried to address the challenges of MPMC design. They made progress on different aspects of the problem, but none has succeeded in covering all of them at the same time. In particular, few of them have paid attention to residual bandwidth, i.e., bandwidth that is statically allocated to one port but underutilized at run time.

Key Observations Port locality: same-port requests → same DRAM page. Service time flexibility: 1/24 second to decode a video frame, i.e., about 4M cycles at 100 MHz, available for request reordering. Residual bandwidth: statically allocated BW that is underutilized at runtime. Corresponding mechanisms: weighted round robin (minimum BW guarantee, bursting service); credit borrow & repay (reorder requests according to priority); dynamic BW calculation (capture and re-allocate residual BW).
To achieve the design objectives, we leverage three key observations about MPMC design. The first is port locality: requests from the same port are likely to fall into the same DRAM page. We exploit it with weighted round robin (WRR), our baseline scheduler, because it provides a minimum bandwidth guarantee and, more importantly, supports continued service to the same port, which can reduce the number of DRAM page crossings. The second observation is service time flexibility: bandwidth-sensitive ports may not be sensitive to the service latency of individual requests; for example, each video frame has a 1/24-second decoding budget, which translates into roughly 4 million cycles at a 100 MHz clock. We exploit this flexibility with a credit borrow and repay mechanism that reorders requests according to priority while keeping the bandwidth guarantee. The last observation is that there is residual bandwidth at runtime: ports may underutilize their statically allocated bandwidth, and this gap between static allocation and dynamic usage can be utilized effectively, although existing schedulers are not aware of it. We use a dynamic bandwidth calculation scheme to capture residual bandwidth and re-allocate it to more important users.

Weighted Round Robin Assume bandwidth requirements Q2: 30%, Q1: 50%, Q0: 20%, and Tround = 10 scheduling cycles. T(Rij) denotes the arrival time of the jth request of Qi. (Slide diagram: a 10-cycle timeline with one row of request arrivals and one row of service times per port; Q2's requests R20-R22 are serviced in cycles 1-3, Q1's R10-R14 in cycles 4-8, and Q0's R00-R01 in cycles 9-10.)
Let us illustrate how WRR works. Consider three ports with different bandwidth requirements and 10 cycles per scheduling round. The top row of the diagram shows the clock cycles, a separate row shows the arrival times of each port's requests, and another row shows their service times. In the first cycle, requests arrive at all three ports and the scheduler chooses to service Q2, which then receives continuous service in the following cycles. In cycle 4, Q2 has exhausted its scheduling credits, so the scheduler switches to Q1, and so on; Q0 is scheduled in the last two cycles of the round. Besides the minimum bandwidth guarantee, WRR provides continuous service to each port, and can therefore reduce the number of DRAM page crossings by exploiting port locality.
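A minimal sketch of this WRR baseline (class and method names are illustrative; a real controller would also rotate the starting port each round rather than always favoring the highest-numbered port):

```python
from collections import deque

class WRR:
    """Weighted round robin: each round, port i may be served weights[i] times,
    and a port keeps the grant until its credits run out (good for port locality)."""

    def __init__(self, weights):
        self.weights = list(weights)
        self.queues = [deque() for _ in weights]
        self.credits = list(weights)

    def enqueue(self, port, request):
        self.queues[port].append(request)

    def _pending(self, port):
        return bool(self.queues[port])

    def _slot_owner(self):
        """Port whose credit pays for this cycle (highest port first, as on the slide)."""
        for p in range(len(self.weights) - 1, -1, -1):
            if self.credits[p] > 0 and self._pending(p):
                return p
        return None

    def schedule(self):
        """Return (port, request) serviced this cycle, or None if everything is idle."""
        owner = self._slot_owner()
        if owner is None:
            self.credits = list(self.weights)   # start a new Tround
            owner = self._slot_owner()
            if owner is None:
                return None
        self.credits[owner] -= 1
        return owner, self.queues[owner].popleft()

# Slide example: Q0/Q1/Q2 want 20%/50%/30% of a 10-cycle round.
wrr = WRR([2, 5, 3])
```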

Problem with WRR Priority: Q0 > Q2, yet Q0 waits 8 cycles. (Slide diagram: the same timeline as before; Q0's requests arrive in cycles 1-2 but are not serviced until cycles 9-10.)
The problem with WRR is that the average waiting time can be very long. Although Q0 is latency sensitive and has higher priority, it waits 8 cycles in each round of scheduling, and the wait grows further if a round takes more than 10 cycles.

Borrow Credits Zero waiting time for Q0! (Slide diagram: the timeline with a debt FIFO, debtQ0, attached to Q0; in cycles 1-2 Q0 borrows Q2's slots and Q2's port ID is pushed into debtQ0 twice; in cycle 3 Q2 keeps its slot and sends R20.)
To solve this problem, we propose the credit borrow and repay technique, illustrated here by attaching a FIFO queue to Q0. Restart the scheduling process: in the first cycle three requests arrive and the scheduler chooses to service port 2, but since Q0 has higher priority it borrows the scheduling opportunity from Q2 and sends its request to DRAM, while the port ID of Q2 is pushed into the debt queue. Q0 then borrows a second scheduling slot from Q2 and again pushes the ID into the debt queue. In the third cycle, since Q0 has no pending request, Q2 keeps the scheduling opportunity to itself and sends a request.

Repay Later At Q0's turn, the BW guarantee is recovered. (Slide diagram: in cycles 9-10 Q0's own slots are used to service Q2's remaining requests R21 and R22, and the two entries are popped from debtQ0.)
In the end, when Q0 gets its own scheduling opportunities, it repays the credits to Q2, so the bandwidth guarantee is recovered while Q0's requests received immediate service. Prioritized access!
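A minimal sketch of credit borrow & repay layered on the WRR sketch above (the single debt FIFO and the method names are assumptions for illustration, and it must be run together with that WRR class): the priority port steals slots, records whose slots they were, and pays them back out of its own slots, so per-round shares are preserved.

```python
from collections import deque   # continuation of the WRR sketch above

class BCBR(WRR):
    """Credit borrow & repay on top of WRR, for one latency-sensitive port."""

    def __init__(self, weights, prio, debt_depth=4):
        super().__init__(weights)
        self.prio = prio
        self.debtq = deque(maxlen=debt_depth)   # IDs of ports whose slots we borrowed

    def _pending(self, port):
        # A repayable debt counts as pending work for the priority port.
        if port == self.prio and self.debtq and self.queues[self.debtq[0]]:
            return True
        return bool(self.queues[port])

    def schedule(self):
        owner = self._slot_owner()
        if owner is None:
            self.credits = list(self.weights)   # start a new round
            owner = self._slot_owner()
            if owner is None:
                return None
        self.credits[owner] -= 1
        if owner != self.prio and self.queues[self.prio] and len(self.debtq) < self.debtq.maxlen:
            # Borrow: serve the priority port right now, remember whose slot this was.
            self.debtq.append(owner)
            return self.prio, self.queues[self.prio].popleft()
        if owner == self.prio and self.debtq and self.queues[self.debtq[0]]:
            # Repay: the priority port's own slot goes back to the lender.
            lender = self.debtq.popleft()
            return lender, self.queues[lender].popleft()
        return owner, self.queues[owner].popleft()

if __name__ == "__main__":
    sched = BCBR([2, 5, 3], prio=0, debt_depth=4)   # Q0 is the latency-sensitive port
    for port, reqs in ((2, ["R20", "R21", "R22"]),
                       (1, ["R10", "R11", "R12", "R13", "R14"]),
                       (0, ["R00", "R01"])):
        for r in reqs:
            sched.enqueue(port, r)
    print([sched.schedule() for _ in range(10)])
    # Q0's requests go out in cycles 1-2 (borrowed from Q2); Q2 is repaid in cycles 9-10.
```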

Problem: Depth of DebtQ The DebtQ can also act as a residual-BW collector: the bandwidth effectively available to Q0 grows to 20% plus the residual BW, and the required depth of DebtQ0 decreases. (Slide diagram: a round in which Q1 issues only four requests; Q0 has an extra request R03, and residual slots help repay the debt.)
The limitation of credit borrow and repay is that its benefit is bounded by the depth of the debt queue. For example, if Q0 has one extra request arriving at cycle 2 while the debt queue is already full, no further borrowing can occur, so that request cannot receive immediate service. We can either increase the depth of the debt queue or leverage residual bandwidth. Now assume Q1 has only four requests. At cycle 7, when Q0 gets the scheduling token, it first uses it to send out the extra request, then uses its other scheduling slot to repay part of the debt; the debt queue is still not empty. If the residual bandwidth is used to help repay the debt, Q0 ends the round with an empty debt queue while the extra request still goes out to DRAM. The residual bandwidth is thus consumed by Q0, and the required depth of the debt queue is reduced. One possible reading of this policy is sketched below.
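A loose illustration of the residual-bandwidth idea on top of the BCBR sketch above (this is an assumed policy, not the paper's exact accounting): any port that still holds credits this round but has nothing of its own to send donates that slot to repaying the debt FIFO, so the FIFO drains within the round and can be provisioned shallower.

```python
class BCBRResidual(BCBR):
    """Loose illustration: a port with leftover credits but an empty request
    queue spends that residual slot repaying the debt FIFO early."""

    def schedule(self):
        for p in range(len(self.weights)):
            has_residual_slot = self.credits[p] > 0 and not self.queues[p]
            can_repay = bool(self.debtq) and bool(self.queues[self.debtq[0]])
            # The priority port's own slots already repay via the base class.
            if has_residual_slot and can_repay and p != self.prio:
                self.credits[p] -= 1
                lender = self.debtq.popleft()
                return lender, self.queues[lender].popleft()
        return super().schedule()
```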

Evaluation Framework Simulation framework. Workload: ALPBench suite. DRAMSim: simulates DRAM latency, BW and power. Reference schedulers: PQ, RR, WRR, BGPQ.
To evaluate our algorithm, we built a simulation framework consisting of DRAMSim, a cycle-accurate DRAM simulator that models latency, bandwidth and power consumption; a C model of our scheduler; and a set of workload traces, including CPU cache traces (collected with a Pin tool) and multimedia traces from the ALPBench suite. We compare the results against other schedulers: PQ, RR, WRR and BGPQ.

Bandwidth Guarantee Bandwidth guarantees: P0: 2%, P1: 30%, P2: 20%, P3: 20%, P4: 20%, system residual: 8%. RR and PQ provide no BW guarantee. (Slide table, partially preserved: measured per-port bandwidth under RR, PQ, BGPQ, WRR and BCBR; surviving cells are RR 1.08%, 24%; PQ 0.73%, 80%, 18%, 0%; BGPQ 1.07%, 39%, 20%; WRR 0.76%, 33%, 22%.)
We first examine whether the proposed scheduler achieves the minimum bandwidth guarantee. We set the bandwidth requirement to 2% for port 0, 30% for port 1 and 20% for ports 2-4, leaving 8% as system residual to test the impact of residual bandwidth. The final bandwidth allocations show that RR and PQ do not achieve the guarantee; in particular, PQ starves ports 3 and 4. On the contrary, BGPQ, WRR and BCBR all achieve the minimum bandwidth guarantee with similar allocation results. But does that mean these three schedulers also have the same service latency? Provides BW guarantee!

Cache Response Latency Average 16x faster than WRR; as fast as PQ (prioritized access). (Slide figure: cache response latency in ns for each scheduler.)
The figure shows the cache response latency under the different schedulers. The three schedulers on the right have very different service latencies: the proposed scheduler is more than 16x faster than WRR, more than 1.6x faster than RR, and as fast as PQ, which demonstrates that it provides prioritized access to latency-sensitive ports.

DRAM Energy & BW Efficiency 30% fewer page crossings (compared to RR), 1.4x more energy efficient, 1.2x higher effective DRAM BW; as good as WRR (exploits port locality). (Slide table: GB/J — RR 0.298, BGPQ 0.289, WRR 0.412, BCBR 0.411; Act-Pre ratio — 29.6%, 30.1%, 23.0%; improvement over RR — 1.0x, 0.97x, 1.38x.)
Finally, we examine DRAM energy and bandwidth efficiency. BCBR issues about 30% fewer page-crossing commands than RR, and is therefore about 1.4x more energy efficient and achieves about 1.2x higher effective bandwidth. It is as good as WRR, which confirms our intuition that continuous service to the same port exploits port locality. It looks like we have achieved our design objectives, but do all these benefits come at a cost?

Hardware Cost BCBR (frontend only): 1393 LUTs, 884 registers, 0 BRAMs. Reference backend (Speedy DDR memory controller): 1986 LUTs, 1380 registers, 4 BRAMs. BCBR + Speedy: 3379 LUTs, 2264 registers, 4 BRAMs. Xilinx MPMC (frontend + backend): 3450 LUTs, 5540 registers, 1-9 BRAMs.
To evaluate the hardware cost, we implemented the proposed scheduler on a Virtex-6 FPGA. The scheduler costs about 1.4K LUTs and 900 registers and requires no BRAMs, but this accounts only for the frontend controller. As a reference, the Speedy backend memory controller for FPGAs costs about 2K LUTs, 1.4K registers and 4 BRAMs. The aggregate cost is still less than the numbers Xilinx reports for their MPMC design. In conclusion, our scheduler has better performance without incurring higher cost. Better performance without higher cost!

Agenda Overview Multi-Port Memory Controller (MPMC) Design “Out-of-Core” Algorithm / Architecture Exploration

Idea Out-of-core algorithms: data does not fit in DRAM, and performance is dominated by IO. Key questions: how to reduce the number of IOs, and at what block granularity to operate. Remember DRAM = Disk. So let's ask the same questions, plug in DRAM parameters, and get DRAM-specific answers.

Motivating Example: CDN Caches in a CDN: get closer to users, save bandwidth. Zipf's law: 80-20 rule → hit rate.
For services like YouTube to provide a good user experience, there needs to be a content distribution network with many caches to bring content close to users and save upstream bandwidth. The idea of caching is well supported by Zipf's law, also known as the 80-20 rule, which states that roughly 80% of viewers are attracted by the 20% most popular content. Under Zipf's law, we do not need to worry about cache hit rate; for flash memory, however, the cache churn rate becomes the important metric.
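A small sketch of the 80-20 intuition (the catalog size, cache fraction and Zipf exponent below are placeholder assumptions):

```python
def zipf_hit_rate(n_items: int, cache_fraction: float, s: float = 1.0) -> float:
    """Fraction of requests served from a cache that holds the most popular items,
    assuming item popularity follows a Zipf distribution with exponent s."""
    weights = [1.0 / (rank ** s) for rank in range(1, n_items + 1)]
    cached = int(n_items * cache_fraction)
    return sum(weights[:cached]) / sum(weights)

# Caching the top 20% of a 1,000,000-item catalog captures roughly 89% of requests (s = 1.0).
print(round(zipf_hit_rate(1_000_000, 0.20), 2))
```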

Video Cache

Defining the Knobs Transaction: a number of column access commands enclosed by a row activation / precharge pair. W: burst size; s: number of bursts. The transaction cost is a function of the algorithmic parameters (s and W) and of the DRAM array organization and timing parameters.
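A sketch of the kind of cost model these knobs feed (the timing names and values are generic DDR-style placeholders, not the talk's parameters): a transaction pays one precharge/activate plus s bursts of W bytes, so effective bandwidth grows with s and saturates at W divided by the burst time.

```python
# Generic DDR-style timing assumptions in ns (placeholders, not the talk's values).
T_RP, T_RCD, T_BURST = 15.0, 15.0, 5.0   # precharge, activate-to-read, one W-byte burst

def transaction_time(s: int) -> float:
    """One transaction = precharge + row activation + s column bursts."""
    return T_RP + T_RCD + s * T_BURST

def effective_bandwidth(s: int, w_bytes: int = 64) -> float:
    """Bytes per ns (i.e. GB/s) when every transaction moves s bursts of w_bytes."""
    return (s * w_bytes) / transaction_time(s)

for s in (1, 4, 16, 64):
    print(s, round(effective_bandwidth(s), 2))   # rises toward w_bytes / T_BURST
```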

Algorithmic Design Variables: d-ary heap branching factor and record size.
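In the out-of-core framing, the branching factor would be chosen so that one heap node fills one block-granularity transfer; a minimal sketch with placeholder sizes:

```python
def branching_factor(block_bytes: int, record_bytes: int) -> int:
    """Out-of-core rule of thumb: size one d-ary heap node (d records) to fill one
    block-granularity transfer, so touching a node costs exactly one IO."""
    return max(2, block_bytes // record_bytes)

# e.g. a 2 KB DRAM transaction and 16-byte records -> d = 128 (placeholder numbers)
print(branching_factor(2048, 16))
```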

B+ Tree

Lessons Learned The optimal result can be beautifully derived! Big O does not matter in some cases, depending on the characteristics of the input data.