Virtually Pipelined Network Memory

Virtually Pipelined Network Memory
Banit Agrawal, Tim Sherwood
UC Santa Barbara

Memory Design is Hard
- Increasing functionality
- Increasing size of data structures: IPv4 routing tables have grown from 100k to 200k to 360k entries, and packet classification rule sets from 2,000 to 5,000 to 10,000 rules
- Increasing line rates: 10 Gbps, 40 Gbps, 160 Gbps
- Throughput must hold in the worst case: traffic has to be serviced at the advertised rate

What is the problem? What do network programmers expect of DRAM?
- Low cost
- Low power
- High capacity
- High bandwidth, but only in the case of some access patterns
[Figure: network programmers' view of the memory (DRAM) inside a network system]

DRAM Bank Conflicts
- Bank interleaving hides the DRAM macro latency only when accesses fall in different banks
- A request to a busy bank must wait, so bank conflicts cause variable latency, variable throughput, and a reduction in efficiency
- Worst case: every access conflicts
[Figure: four DRAM banks on shared address and data buses, each with its own row decoder, sense amplifiers, and column decoder; two of the banks are marked busy]
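To make the worst case concrete, here is a minimal Python sketch (not from the talk; the 15-cycle busy time and the address-to-bank mappings are illustrative assumptions) of bank-conflict timing: traffic spread across banks sees constant latency, while traffic aimed at one bank sees latency grow with every conflict.

```python
def access_latencies(bank_of, requests, bank_busy=15):
    """Toy bank-conflict model: one request is issued per cycle, each
    access occupies its bank for `bank_busy` cycles, and an access to a
    busy bank must wait. Returns the latency seen by each request."""
    free_at = {}                          # bank id -> cycle the bank frees up
    latencies = []
    for t, addr in enumerate(requests):
        bank = bank_of(addr)
        start = max(t, free_at.get(bank, 0))
        free_at[bank] = start + bank_busy
        latencies.append(start + bank_busy - t)
    return latencies

# Accesses spread across banks: constant latency.
print(access_latencies(lambda a: a % 8, range(8)))   # [15, 15, ..., 15]
# Worst case, every access to one bank: latency grows with each conflict.
print(access_latencies(lambda a: 0, range(8)))       # [15, 29, 43, ...]
```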

Prior Work
- Reducing bank conflicts in common access patterns
  - Prefetching and memory-aware layout [Lin-HPCA'01, Mathew-HPCA'00]
  - Reordering of requests [Hong-HPCA'99, Rixner-ISCA'00]
  - Vector processing domain [Espasa-Micro'97]
  - Good for desktop computing, but no guarantees for the worst case
- Reducing bank conflicts for special access patterns, e.g. packet buffering, where each packet is written once and read once
  - Low bank conflicts: optimizations including row locality and scheduling [Hasan-ISCA'03, Nikologiannis-ICC'01]
  - No bank conflicts: reordering and clever memory management algorithms [Garcia-Micro'03, Iyer-StanTechReport'02]
  - Not applicable to arbitrary access patterns

Where Do Network Systems Stand?
A spectrum of timing requirements:
- Full determinism required: 0% deadline failures
- Network systems sit in between: no exploitable deadline failures
- Best effort (co-operative): common-case optimized parts

Virtually Pipelined Memory
- Normalize the overall latency using randomization and buffering
- Deterministic latency for all accesses: a request entering the memory controller at time t returns its data at time t + D
- Trillions of accesses without a bank-conflict stall, for any access pattern
[Figure: request enters the memory controller at time t, accesses DRAM, and the data is returned at t + D]

Outline
- Memory for networking systems
- Memory controller
- Design analysis
- Hardware design
- How do we compare?
- Conclusion

Memory Controller
- A keyed universal hash (H_U) randomizes the mapping from addresses to banks and rows (the slide's example: address 5 -> bank 2 row A, 6 -> bank 0 row F, 7 -> bank 2 row B, 8 -> bank 3 row A)
- Each DRAM bank gets its own bank controller; a bus scheduler arbitrates the shared address and data buses among them
- A request entering the controller at time t returns its data at time t + D
[Figure: per-bank controllers sitting between the interface and the DRAM banks, with the bus scheduler in the middle]
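The slide only shows that the keyed hash H_U maps each address to a (bank, row) pair; the sketch below illustrates the idea with a hypothetical mixing function and field widths. The point is that the key is secret, so an adversary cannot craft an address stream that repeatedly collides on one bank.

```python
import random

NUM_BANKS = 4          # the diagram shows four banks
ROWS_PER_BANK = 1024   # hypothetical row count

def make_keyed_hash(key):
    """A toy stand-in for the universal hash H_U: mixes the address with
    a secret key so no fixed address pattern maps predictably to a bank.
    (The mixing constants are illustrative, not the paper's hash.)"""
    def h_u(address):
        x = ((address ^ key) * 0x9E3779B97F4A7C15) & (2**64 - 1)
        x ^= x >> 29
        return x
    return h_u

h_u = make_keyed_hash(random.getrandbits(64))  # key chosen at boot, kept secret

def map_address(address):
    """Split the hashed address into a (bank, row) pair, as in the
    slide's table (e.g. address 5 -> bank 2, row A)."""
    x = h_u(address)
    return x % NUM_BANKS, (x // NUM_BANKS) % ROWS_PER_BANK

for addr in (5, 6, 7, 8):
    print(addr, "->", map_address(addr))
```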

Non-conflicting Accesses
- Bank latency (L): 15 cycles; normalized delay (D): 30 cycles
- Requests A, B, and C map to different banks; each one's data is ready exactly D = 30 cycles after the request
[Timeline over cycles 10-80: requests A, B, C; data ready for A, B, C thirty cycles later]

Redundant Accesses
- Bank latency (L): 15 cycles; normalized delay (D): 30 cycles
- Repeated requests to the same address (A, B, A, A, B) need no extra bank accesses: the buffered data is reused, and each copy is still released exactly D cycles after its own request
[Timeline over cycles 10-80: requests A, B, A, A, B; data ready D = 30 cycles after each request]
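A small sketch of the merging behavior (assumed semantics, simplified from the refcounted delay storage buffer described later): only the first request to an address within the delay window reads the bank, yet every request's data appears exactly D cycles after that request.

```python
def schedule_with_merging(requests, D=30):
    """Merge redundant accesses: the first request to an address within
    the D-cycle window reads the bank; later repeats reuse the buffered
    entry. Every request, merged or not, is answered at t + D."""
    live_until = {}         # address -> time its buffered entry expires
    bank_reads, releases = 0, []
    for t, addr in requests:
        if live_until.get(addr, 0) <= t:
            bank_reads += 1               # first access: really read the bank
        live_until[addr] = t + D          # conceptually, refcount++ and extend
        releases.append((addr, t + D))    # answered at t + D regardless
    return bank_reads, releases

reads, out = schedule_with_merging([(10, 'A'), (20, 'B'), (30, 'A'), (35, 'A'), (45, 'B')])
print(reads)   # 2: A and B are each read from the bank only once
print(out)     # every request's data appears 30 cycles after the request
```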

Conflicting Accesses
- Bank latency (L): 15 cycles; normalized delay (D): 30 cycles
- Requests A through E conflict on the same bank; the delay budget absorbs the first conflicts, but a long enough run forces a stall
[Timeline over cycles 10-80: requests A, B, C, D, E with a stall; data ready follows D cycles after each request, shifted by the stall]
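These timelines reduce to one scheduling rule per bank, sketched below with the slides' assumed parameters (L = 15, D = 30): a result finished early is held until t + D, so conflicts are invisible while the backlog fits the delay budget, and a stall shows up only when an access cannot finish by its deadline.

```python
def vpm_bank_timing(request_times, L=15, D=30):
    """Toy timing for a single virtually pipelined bank. Each request at
    time t should return data at t + D; the bank needs L cycles per
    access, so same-bank conflicts are absorbed while the backlog fits
    inside the delay budget, and stall once it does not.
    Returns (ready_time, stalled) for each request."""
    bank_free = 0
    results = []
    for t in request_times:
        start = max(t, bank_free)          # wait for the bank if busy
        bank_free = start + L
        ready = max(start + L, t + D)      # hold early data until t + D
        results.append((ready, start + L > t + D))
    return results

# Spaced same-bank requests: all data ready at t + 30, no stalls.
print(vpm_bank_timing([10, 25, 40]))
# A dense conflicting run: the first repeat is absorbed, then stalls begin.
print(vpm_bank_timing([10, 11, 12, 13, 14]))
```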

Implementing Virtual Pipelined Banks
Each virtual bank combines several structures (the figure shows two sets, Set 0 and Set 1):
- Circular delay buffer: a ring of slots access[t-d+1] ... access[t] with in/out pointers, so an access entered now is released exactly d cycles later
- Delay storage buffer: per-row entries holding a valid bit, address, data words, and an increment/decrement reference count, with first-zero logic to allocate a free entry
- Bank access queue: scheduled accesses (read/write flag, row id, scheduled-access address and data) waiting their turn at the DRAM bank
- Write buffer (FIFO): buffered writes (address and data words) drained to memory in the background
- Control logic connects the interface (address and data) to these structures and to memory
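As a software analogue of the central structure (a sketch; the real design is hardware with in/out pointers offset by the delay), the circular delay buffer is a ring of d slots in which whatever is written now is read back exactly d cycles later, which is precisely what normalizes every access to the same latency:

```python
class CircularDelayBuffer:
    """Software model of the circular delay buffer in the diagram: a
    ring holding access[t-d+1] ... access[t]. An entry written this
    cycle is released exactly d cycles later. (Slot contents here are
    arbitrary Python objects; the hardware stores control state.)"""

    def __init__(self, d):
        self.slots = [None] * d
        self.ptr = 0            # in pointer; the out pointer trails it by d

    def tick(self, entry=None):
        """Advance one cycle: release the entry written d cycles ago and
        store this cycle's entry (if any) in the freed slot."""
        released = self.slots[self.ptr]
        self.slots[self.ptr] = entry
        self.ptr = (self.ptr + 1) % len(self.slots)
        return released

buf = CircularDelayBuffer(d=30)
for cycle in range(70):
    done = buf.tick(("request", cycle) if cycle in (10, 20, 30) else None)
    if done is not None:
        print("cycle", cycle, "releases", done)   # cycles 40, 50, and 60
```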


Delay Storage Buffer Stall
- Mean time to stall (MTS): with B banks, 1/B is the probability that a given request maps to a particular bank
- A stall happens when more than K accesses to one bank fall within an interval of D cycles
- Illustration: normalized latency D = 30 cycles, number of entries in the delay storage buffer K = 3

Delay Storage Buffer Stall (continued)
[Timeline over cycles 10-80: requests A-F each increment (+1) the buffer occupancy on arrival and decrement (-1) it when their data is released]
With each of the D - 1 neighboring requests hitting the same bank with probability 1/B, the mean time to stall is

    MTS = log(1/2) / log(1 - C(D-1, K-1) * (1/B)^(K-1)) + D
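Putting the formula into code makes its sensitivity to B and K easy to explore. Treat the closed form as a reconstruction of the slide's garbled equation, a sketch of the analysis rather than the paper's exact expression, and note that the approximation only makes sense where the computed probability stays below 1 (i.e., for large enough B).

```python
from math import comb, log

def mts_delay_buffer(B, D=30, K=3):
    """Mean time to stall for the delay storage buffer. A request
    overflows the buffer if at least K - 1 of the D - 1 neighboring
    requests hit the same bank, each with probability 1/B; MTS is the
    number of cycles until the cumulative stall probability reaches 1/2,
    plus the D-cycle window itself (the trailing + D on the slide)."""
    p_stall = comb(D - 1, K - 1) * (1.0 / B) ** (K - 1)
    return log(0.5) / log(1.0 - p_stall) + D

# The slide's illustration uses D = 30 and K = 3; bank counts are examples.
for banks in (32, 64, 128, 256):
    print(banks, round(mts_delay_buffer(banks), 1))

# A larger buffer grows MTS rapidly, e.g. K = 8:
print(round(mts_delay_buffer(32, D=30, K=8), 1))
```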

Markovian Analysis: Bank Access Queue Stall
- State-based analysis: with B banks, 1/B is the probability that a given access goes to a particular bank
- The state tracks the cycles of pending work at a bank; if more than D cycles of work accumulate, a stall occurs
- Example: bank access latency L = 3, normalized delay D = 6
- MTS is the point at which the probability of being in the stall state becomes 0.5
[Figure: Markov chain with states idle, 1-6, and stall; each cycle an arrival occurs with probability 1/B and the queue drains with probability 1 - 1/B]

Markovian Analysis (continued)
- With initial state vector I and per-cycle transition matrix M, the state distribution after n cycles is P = I M^n
- Find the smallest n such that the stall-state probability in P reaches 50%; that n is the MTS
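A runnable version of this analysis (a sketch: the chain below models the pending work at one bank with the slide's example parameters L = 3 and D = 6; the bank counts in the demo loop are assumptions) builds M, starts from the idle distribution I, and steps P = I M^n one cycle at a time until the stall probability crosses 50%.

```python
import numpy as np

def mts_bank_queue(B, L=3, D=6, target=0.5):
    """Markov-chain MTS for the bank access queue. State = cycles of
    pending work at one bank; each cycle an access arrives with
    probability 1/B and adds L cycles of work, while one cycle of work
    drains. More than D cycles of pending work means a stall."""
    p = 1.0 / B
    stall = D + 1                         # absorbing stall state
    M = np.zeros((D + 2, D + 2))
    M[stall, stall] = 1.0
    for w in range(D + 1):
        drained = max(w - 1, 0)
        M[w, drained] += 1 - p                    # no arrival this cycle
        M[w, min(drained + L, stall)] += p        # arrival; overflow stalls
    P = np.zeros(D + 2)
    P[0] = 1.0                            # initial distribution I: idle
    n = 0
    while P[stall] < target:              # step P = I * M^n until 50%
        P = P @ M
        n += 1
    return n

for banks in (8, 16, 32):
    print(banks, mts_bank_queue(banks))
```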

Hardware Design and Overhead
- Verilog implementation, verified with ModelSim and a C++ simulation model, synthesized using Synopsys Design Compiler
- Hardware overhead tool: built on Cacti parameters and validated against the synthesized design; used to find optimal design parameters
- Two resulting design points: 45.7-second MTS with 34.1 mm^2 of area overhead at 77% efficiency, or 10-hour MTS with 34 mm^2 at 71.4% efficiency

How Does VPNM Perform?
- Packet buffering: 35% less area; only the head and tail pointers per queue must be stored, so an arbitrarily large number of logical queues can be supported
- Packet reassembly:

  Scheme        Line rate (Gbps)  Area (mm^2)  Total delay (ns)  Supported interfaces
  RADS [17]     40                10           53                130
  CFDS [12]     160               60           10000             850
  Our approach  160               41.9         960               4096

  Versus CFDS: 35% less area, 10x less latency, 5x more queues

Conclusion
VPNM provides:
- Deterministic latency through randomization and normalization: a request at time t is answered at time t + D
- Higher throughput, with a worst case that is impossible to exploit
- Support for any access pattern
- Ease of programmability and mapping, demonstrated for packet buffering and packet reassembly

Thanks for your attention. Questions?
http://www.cs.ucsb.edu/~arch/