0 Smart Memory
Bill Lynch, Sailesh Kumar, and team

1 Overview
High Performance Packet Processing Challenges
Solution: Smart Memory
Smart Memory Architecture

2 Packet Processing Workload Challenges
Sequential memory references
- For lookups (L2, L3, L4, and L7)
- Finite automata traversal
Read-modify-write
- Statistics, counters, token buckets, mutexes, etc.
Pointer and linked-list management
- Buffer management, packet queues, etc.
Traditional implementations use
- Commodity memory to store the data
- NPs and ASICs to process the data in memory
- Tons of memory references and minimal compute
Performance barriers
- Memory and chip I/O bandwidth
- Memory latency
- Locks for atomic access
[Diagram: processors (P) connected to banks of commodity memory over an interconnect]

3 Illustration of Performance Barrier I
[Diagram: an IP lookup tree spread across commodity memories; processors (P1-P5) traverse it over the interconnection network]
Requires several transactions between memory and processors
High lookup delay due to high interconnect and memory latency
Low IPC, so more processors are needed, which adds even more interconnect latency
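A minimal sketch in C of why this lookup pattern is latency-bound (the node layout and latency figures are assumptions for illustration, not the deck's actual data structure): each tree level is a dependent read, so the off-chip round trips serialize and cannot be overlapped.

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical binary-trie node kept in external commodity memory. */
    struct trie_node {
        struct trie_node *child[2];   /* next-level pointers           */
        uint32_t          next_hop;   /* valid when is_leaf is set     */
        uint8_t           is_leaf;
    };

    /* Longest-prefix-match style walk: one dependent memory read per bit.
     * With up to 32 levels and on the order of 100 ns per off-chip access,
     * a single lookup costs microseconds, and the reads cannot be
     * overlapped because each address depends on the previous node.     */
    uint32_t fib_lookup(const struct trie_node *root, uint32_t dst_ip)
    {
        const struct trie_node *n = root;
        uint32_t best = 0;                      /* default route          */

        for (int bit = 31; bit >= 0 && n != NULL; bit--) {
            if (n->is_leaf)
                best = n->next_hop;             /* remember the last match */
            n = n->child[(dst_ip >> bit) & 1];  /* dependent read: full
                                                   memory latency per level */
        }
        return best;
    }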

4 Illustration of Performance Barrier II
Lookups are read-only, so they are relatively easy
Linked lists, counters, policers, etc. are read-modify-write
- Requires a per-memory-address lock in multi-core systems
Enqueue: lock the free list, get a free node, unlock the free list, lock the list tail, read the list tail, link the free node, update the list tail, unlock the list tail (dequeue is similar)
Counters: lock the counter, read the counter, write the counter, unlock the counter
Locks are often kept in memory, which requires another transaction and adds significant latency
Single-queue or single-counter operations are therefore extremely slow
[Diagram: processors contending over the interconnection network for locks held in commodity memory]
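A sketch of the traditional locked counter update described above (hypothetical structure layout, with a pthread mutex standing in for a lock word kept in shared memory). Every step is a separate memory transaction, and all cores updating the same counter serialize on the lock.

    #include <stdint.h>
    #include <pthread.h>

    /* Hypothetical per-counter state in shared (off-chip) memory.  The
     * lock lives next to the counter, so every update pays extra memory
     * transactions just to acquire and release it.                      */
    struct stat_counter {
        pthread_mutex_t lock;
        uint64_t        packets;
        uint64_t        bytes;
    };

    /* Traditional read-modify-write: lock, read, modify, write, unlock.
     * Each step is a round trip to memory, and contention on a single
     * hot counter serializes all cores behind it.                       */
    void counter_update(struct stat_counter *c, uint32_t pkt_len)
    {
        pthread_mutex_lock(&c->lock);    /* transaction 1: acquire lock  */
        c->packets += 1;                 /* transactions 2-3: read+write */
        c->bytes   += pkt_len;           /* transactions 4-5: read+write */
        pthread_mutex_unlock(&c->lock);  /* transaction 6: release lock  */
    }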

5 Overview
High Performance Packet Processing Challenges
- Memory bandwidth and latency
- Limited I/O bandwidth
Solution: Smart Memory
- Attach simple compute to the data
- Attach locks to the data
- Enable local memory-to-memory communication
Smart Memory Architecture
- Hybrid memory: eDRAM + DDR3 DRAM
- Serial I/O

6 Introduction to Smart Memory
What is the real problem?
- Compute occurs far away from the data
- Lock acquire/release occurs far from the data
Fortunately, the compute in packet processing jobs is very modest!
Solution: make the memory smarter by
- Keeping compute close to the data
- Managing locks close to the data
- Enabling local memory-to-memory communication
[Diagram: processors on an interconnection network; each memory block now carries its own small compute engine]

7 Introduction to Smart Memory
What is the real problem?
- Compute occurs far away from the data
- Lock acquire/release occurs far from the data
Fortunately, the compute in packet processing jobs is very modest!
Smart Memory advantages (get more out of fewer transactions!)
- Lower I/O bandwidth
- Lower processing latency
- Higher IPC
- Significantly higher single-counter/single-queue performance
[Diagram: same system, with the compute moved into the memory blocks]
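A minimal sketch of what the processor-facing interface could look like once compute and locking live inside the memory. The command format, opcodes, and sm_execute() are assumptions made for illustration, not a real device API; the stub merely models the engine performing the read-modify-write locally so that only one request crosses the chip I/O.

    #include <stdint.h>

    /* Hypothetical command descriptor for a smart-memory engine. */
    enum sm_opcode { SM_ADD64, SM_ENQUEUE, SM_FIB_LOOKUP };

    struct sm_cmd {
        enum sm_opcode op;       /* what the engine should do            */
        uint64_t       addr;     /* target structure inside smart memory */
        uint64_t       operand;  /* e.g. byte count, node pointer, key   */
    };

    /* Toy model of the engine side, standing in for real hardware: the
     * read-modify-write happens next to the data, under a lock that the
     * engine manages itself, so the processor never sees lock traffic.  */
    uint64_t sm_execute(const struct sm_cmd *cmd)
    {
        static uint64_t backing_store;           /* placeholder counter  */
        if (cmd->op == SM_ADD64)
            backing_store += cmd->operand;       /* done "inside" memory */
        return backing_store;
    }

    /* From the processor's point of view, a counter update is now one
     * transaction instead of lock/read/write/unlock round trips.        */
    static inline void sm_counter_add(uint64_t counter_addr, uint64_t bytes)
    {
        struct sm_cmd cmd = { .op = SM_ADD64, .addr = counter_addr,
                              .operand = bytes };
        (void)sm_execute(&cmd);
    }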

8 Overview
High Performance Packet Processing Challenges
- Memory bandwidth and latency
- Limited I/O bandwidth
Solution: Smart Memory
- Attach simple compute to the data
- Attach locks to the data
- Enable local memory-to-memory communication
Smart Memory Architecture
- Hybrid memory: eDRAM + DDR3 DRAM
- Serial chip I/O

9 Smart Memory Capacity and Bandwidth @ 100G
[Chart: memory bandwidth versus memory capacity (MB) at 100G for DPI (string), DPI (regex), ACL (algorithmic), FIB (algorithmic), Layer 2 forwarding, queuing/scheduling, statistics/counters, basic Layer 2, packet buffers, and video buffers]

10 Smart Memory Capacity and Bandwidth @ 100G
Smart Memory uses intelligent algorithms to split the data structures across 64 banks of eDRAM and 8 channels of DDR3 DRAM
[Chart: the same bandwidth-versus-capacity plot, annotated to show which workloads are served from eDRAM and which from DDR3 DRAM]
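A sketch of the kind of placement policy such splitting implies (the constants and the depth threshold below are assumptions for illustration, not the deck's actual algorithm): hot upper levels of a structure are spread across the on-chip eDRAM banks, and the deep, cold levels spill to the DDR3 channels.

    #include <stdint.h>
    #include <stdbool.h>

    /* Assumed resources: 64 on-chip eDRAM banks (high bandwidth, low
     * latency, limited capacity) and 8 channels of DDR3 DRAM (large
     * capacity, lower bandwidth).                                       */
    #define NUM_EDRAM_BANKS    64
    #define NUM_DDR3_CHANNELS   8
    #define EDRAM_LEVEL_LIMIT  12   /* assumed: first 12 levels fit on chip */

    struct placement {
        bool     in_edram;   /* true: eDRAM bank, false: DDR3 channel */
        unsigned unit;       /* bank or channel index                 */
    };

    /* Decide where a node at a given depth lives, spreading nodes across
     * banks/channels with a simple hash so accesses stay balanced.      */
    struct placement place_node(unsigned depth, uint64_t node_id)
    {
        struct placement p;
        if (depth < EDRAM_LEVEL_LIMIT) {
            p.in_edram = true;
            p.unit = (unsigned)(node_id % NUM_EDRAM_BANKS);
        } else {
            p.in_edram = false;
            p.unit = (unsigned)(node_id % NUM_DDR3_CHANNELS);
        }
        return p;
    }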

11 Smart Memory High Level Architecture
Packet processor complex: an array of processors (P)
Smart Memory complex: many eDRAM blocks, each paired with an SM engine, plus DDR3 DRAM behind its own SM engine
Global interconnect: provides fair communication between the processors and the smart memories
Local interconnect: provides local communication between the smart memory blocks
Smart memory supports deterministic and non-deterministic modes; for GGSN, it provides much higher cache performance
[Diagram: packet processor complex and Smart Memory complex joined by the global interconnect; eDRAM + SM-engine tiles joined by the local interconnect]

12 Smart Memory High Level Architecture
FIB lookup example: the processor sends the FIB key to an SM engine; the engine performs the reads against its local eDRAM (and against DDR3 DRAM for the parts of the table split there) and returns only the result
Computation occurs close to the memory, reducing latency
Requires fewer memory transactions
Tables are split between eDRAM and DRAM
[Diagram: FIB key flowing from a processor to an SM engine, reads served from eDRAM and DDR3 DRAM, result returned to the processor]
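A sketch of the engine-side handler for that FIB example (hypothetical node layout, not vendor firmware): the pointer chasing from the earlier performance-barrier illustration now runs next to the eDRAM banks, and only the key and the result cross the chip I/O.

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical FIB node kept in the smart memory's local eDRAM, with
     * the deeper, colder levels of the table split out to DDR3 DRAM.    */
    struct fib_node {
        struct fib_node *child[2];
        uint32_t         next_hop;
        uint8_t          is_leaf;
    };

    /* Engine-side lookup: the full pointer chase runs at local eDRAM
     * latency inside the Smart Memory complex.  Only two messages cross
     * the chip I/O: the 32-bit key in, and the result out.              */
    uint32_t sm_engine_fib_lookup(const struct fib_node *root, uint32_t key)
    {
        const struct fib_node *n = root;
        uint32_t best = 0;

        for (int bit = 31; bit >= 0 && n != NULL; bit--) {
            if (n->is_leaf)
                best = n->next_hop;
            n = n->child[(key >> bit) & 1];   /* local access, not a round
                                                 trip over the global
                                                 interconnect            */
        }
        return best;                          /* single result message back
                                                 to the requesting processor */
    }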

13 I/O Technology Choice in Smart Memory
Smart Memory already reduces the chip I/O bandwidth significantly; how can it be optimized further?
- The gap between on-chip bandwidth and memory I/O bandwidth keeps growing: on-chip bandwidth is much higher than what memory I/O can deliver
Smart Memory uses serial I/O
- 4x the throughput of RLDRAM and QDR
- 3x fewer pins than DDR3 and DDR4
- 2.5x lower I/O power
(Based on MoSys data)

14 High Speed Line Card with Smart Memory
         Traditional Line Card    Line Card with SM
Power    ~2-3x higher             212 W
Area     472+ cm2                 148 cm2
Cost     $5600+                   $2520

15 Concluding Remarks
Packet Processing Bottlenecks
- Data is far away from compute
- I/O and memory bandwidth
Smart Memory
- Keeps compute close to the data
- Keeps locking close to the data
- Provides an inter-memory interconnect
Advantages
- Reduced chip I/O bandwidth
- High performance and low latency
- Feature-rich, flexible, and programmable
- Lower cost: one chip for several functions

16 Thank You

