Hardware Implementation of Fast Forwarding Engine using Standard Memory and Dedicated Circuit
  Kazuya ZAITSU, Shingo ATA, Ikuo OKA (Osaka City University, Japan)
  Koji YAMAMOTO (Renesas Design Corporation, Japan)
  Yasuto KURODA, Kazunari INOUE (Renesas Electronics Corporation, Japan)
Outline
  Background
  Objective
  Proposed hardware architecture
  Hardware architecture evaluation
  FPGA implementation
  Hardware evaluation
  Conclusion
What is TCAM?
  TCAM = Ternary Content Addressable Memory
  Features
    – Very high-speed searching
    – Input: data to match; output: memory address of the matching entry
    – A third matching state, “don’t care”, in addition to 1s and 0s
  Application
    – Looking up the routing table in IP routers
  [Figure: routing table (columns Addr. and Prefix) searched by TCAM, with Input and Output]
TCAM problems
  Manufacturing cost
    – The price per bit is 4 times more expensive than SRAM.
  Power consumption
    – All logic gates must be energized for every search.
  Capacity
    – The high price per bit and the need for power-saving measures make denser TCAM hard to pursue.

                 Search performance   Manufacturing cost   Power consumption   Capacity
  Requirements   High                 Low                  Low                 High
  TCAM           High                 High                 High                Low
Objective
  Propose a new hardware architecture
    – Focus on address lookup in the routing table of routers
    – RAM-based design
    – Named “Custom Memory”
  Hardware design of the Custom Memory
  Verify the effectiveness of the Custom Memory
    – Show that the architecture reduces cost and power consumption compared with TCAM
  Implementation on an FPGA

                  Speed   Cost   Power   Capacity
  Custom Memory   High    Low    Low     High
  TCAM            High    High   High    Low
Design concepts
  Divide the memory area into equal-sized tables
    – Low-power RAM-based design
    – Low cost, low power, high capacity
  Lookup operation by single access
    – High search performance
  Same physical user interface as TCAM
    – Aim to replace the TCAM in the market

                  Speed   Cost   Power   Capacity   Interface
  Custom Memory   High    Low    Low     High       Same as TCAM
Architectural overview
  [Figure: block diagram of the Custom Memory. Inputs: command, address, and IP address/prefix. Search devices #0 to #N each contain a RAM divided into subtables (Table #0, Table #1, ...) and a comparator. Annotations: same physical user interface as TCAM; RAM-based design; divide into subtables.]
Search device partitioning
  How is the search device that stores a prefix chosen?
    – Partitioning based on prefix length
  Search device #0 (prefix length 8), search device #1 (prefix length 9), ..., search device #N (prefix length 32)
  [Figure: example prefixes assigned to search devices according to their length, up to /32]
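The partitioning rule can be sketched in a few lines of Python. This is a minimal illustration, not the authors' hardware logic: the slide only names devices #0 (length 8), #1 (length 9), and #N (length 32), so the one-device-per-length mapping for the lengths in between, and the function name `device_for_prefix`, are assumptions. How prefixes shorter than /8 are handled is not stated, so the sketch rejects them.

```python
MIN_PREFIX_LEN = 8   # search device #0 stores /8 prefixes (from the slide)
MAX_PREFIX_LEN = 32  # search device #N stores /32 prefixes (from the slide)


def device_for_prefix(prefix_len: int) -> int:
    """Return the search device index for a prefix of the given length.

    Assumes one device per consecutive prefix length (hypothetical mapping).
    """
    if not MIN_PREFIX_LEN <= prefix_len <= MAX_PREFIX_LEN:
        raise ValueError(f"prefix length {prefix_len} is outside the modeled range")
    return prefix_len - MIN_PREFIX_LEN


if __name__ == "__main__":
    for length in (8, 9, 16, 32):
        print(f"/{length} -> search device #{device_for_prefix(length)}")
```

Under this assumed mapping, a /16 prefix goes to device #8 and a /32 prefix to device #24.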
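Table partitioning
  How is the table that stores a prefix chosen?
    – A fixed number of bits of the prefix are extracted as “index bits”, which select the table.
    – The remainder bits are stored in that table.
  How are the index bits determined?
    – Extract the last bits of the prefix.
  Example (8 index bits, search device for prefix length 16)
    – /16 prefix → remainder bits followed by index bits
  [Figure: RAM of the search device divided into tables, most of them empty; the index bits of the example prefix select the table that stores its remainder bits]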
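The split into index bits and remainder bits can be illustrated as follows. This is a sketch under assumptions: the prefix is passed as an integer holding only its significant bits, the number of index bits is 8 as in the example, and the name `split_prefix` and the sample prefix 192.168.0.0/16 are made up for illustration.

```python
INDEX_BITS = 8  # number of index bits used in the slide's example


def split_prefix(prefix_value: int, prefix_len: int, index_bits: int = INDEX_BITS):
    """Split a prefix into (index, remainder).

    prefix_value holds the prefix_len significant bits of the prefix.
    The last index_bits bits select the table; the bits above them are
    the remainder that is stored in that table.
    """
    if prefix_len < index_bits:
        raise ValueError("prefix is shorter than the index width")
    index = prefix_value & ((1 << index_bits) - 1)
    remainder = prefix_value >> index_bits
    return index, remainder


if __name__ == "__main__":
    # Hypothetical prefix 192.168.0.0/16: significant bits are 0xC0A8.
    index, remainder = split_prefix((192 << 8) | 168, 16)
    print(f"table #{index}, stored remainder {remainder}")
```

For this sample prefix the last 8 bits (168) select the table and the top 8 bits (192) are stored as the remainder.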
Search operation
  [Figure: search data path of the Custom Memory. The input-output controller receives the search command and the destination IP address. The index calculator derives a table number from the destination IP address for each search device (prefix lengths 8, 9, ..., 32). Each search device reads the selected table from its RAM and checks it with its comparator. The LPM comparator outputs the hit address.]
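A small software model may help to follow the search path. It is only a sketch under assumptions, not the Verilog implementation: one search device per prefix length from 8 to 32, 8 index bits taken from the last bits of the prefix, and a Python dict standing in for the comparator that checks the stored remainders of the selected table. The class names and the `insert`/`search` methods are hypothetical, and the table capacity limit is not modeled.

```python
import ipaddress

INDEX_BITS = 8  # assumed index width (last bits of the prefix)


class SearchDevice:
    """Models one search device: a RAM split into 2**INDEX_BITS tables."""

    def __init__(self, prefix_len, index_bits=INDEX_BITS):
        self.prefix_len = prefix_len
        self.index_bits = index_bits
        # Each table maps stored remainder bits -> hit address.
        self.tables = [dict() for _ in range(1 << index_bits)]

    def _locate(self, value):
        index = value & ((1 << self.index_bits) - 1)
        return index, value >> self.index_bits

    def insert(self, prefix_value, hit_address):
        index, remainder = self._locate(prefix_value)
        self.tables[index][remainder] = hit_address

    def search(self, dest_ip):
        # Index calculator: keep the top prefix_len bits of the address.
        key = dest_ip >> (32 - self.prefix_len)
        index, remainder = self._locate(key)
        # Single table access; the dict lookup stands in for the comparator.
        return self.tables[index].get(remainder)


class CustomMemoryModel:
    def __init__(self, prefix_lens=range(8, 33)):
        self.devices = {n: SearchDevice(n) for n in prefix_lens}

    def insert(self, cidr, hit_address):
        net = ipaddress.IPv4Network(cidr)
        if net.prefixlen not in self.devices:
            raise ValueError(f"no search device for /{net.prefixlen}")
        value = int(net.network_address) >> (32 - net.prefixlen)
        self.devices[net.prefixlen].insert(value, hit_address)

    def search(self, ip):
        dest = int(ipaddress.IPv4Address(ip))
        # Hardware would query all devices in parallel; here we loop.
        hits = [(n, dev.search(dest)) for n, dev in self.devices.items()]
        matched = [h for h in hits if h[1] is not None]
        # LPM comparator: the hit from the longest prefix length wins.
        return max(matched, key=lambda h: h[0])[1] if matched else None


if __name__ == "__main__":
    cm = CustomMemoryModel()
    cm.insert("10.0.0.0/8", 1)
    cm.insert("10.1.0.0/16", 2)
    print(cm.search("10.1.2.3"), cm.search("10.2.0.1"), cm.search("11.0.0.1"))
```

With the two sample routes, 10.1.2.3 hits both devices and the /16 entry wins, 10.2.0.1 matches only the /8 entry, and 11.0.0.1 matches nothing.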
Evaluation of partitioning
  Which bits are better to use as index bits?
    – The distribution of prefixes across tables affects the cost.
  Evaluation metric
    – Maximum number of prefixes in a table
  [Figure: a table in RAM with its word lines and comparators, related to the number of prefixes in the table; index bits are extracted from the last bits of the prefix]
Effectiveness of indexing
    – Top k bits: using the top bits as index bits
    – Proposal: using the last bits as index bits
    – Bottom: ideal value (unrealizable)
  [Plot: maximum number of prefixes in a table versus prefix length]
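The evaluation metric, the maximum number of prefixes that fall into one table, can be computed for both indexing choices as sketched below. The function name `max_table_load` and the input data are assumptions: the sample is synthetic and only shows how the metric behaves, it does not reproduce the measured results on the slide.

```python
from collections import Counter


def max_table_load(prefix_values, prefix_len, index_bits, use_last_bits):
    """Return the largest number of prefixes mapped to a single table.

    prefix_values are same-length prefixes given as integers holding
    their prefix_len significant bits.
    """
    if use_last_bits:
        mask = (1 << index_bits) - 1
        indices = (v & mask for v in prefix_values)
    else:
        shift = prefix_len - index_bits
        indices = (v >> shift for v in prefix_values)
    counts = Counter(indices)
    return max(counts.values(), default=0)


if __name__ == "__main__":
    # Synthetic /16 prefixes that share the top octet 192 (made-up data).
    prefixes = [(192 << 8) | second_octet for second_octet in range(100)]
    print("top bits :", max_table_load(prefixes, 16, 8, use_last_bits=False))
    print("last bits:", max_table_load(prefixes, 16, 8, use_last_bits=True))
```

In this synthetic case all 100 prefixes share the same top 8 bits, so top-bit indexing puts them in one table, while last-bit indexing spreads them over 100 tables with one prefix each.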
FPGA implementation
  Altera Stratix IV GX FPGA Development Kit
  Verilog-HDL
  Parameters
    – 4 search devices
    – 256 tables/device
    – 128 prefixes/table
  [Figure: search devices #0 to #3, each with a RAM holding tables #0 to #255 (128 prefixes per table) and a comparator]
Hardware evaluation

                  Search performance      Power consumption (mA)   Chip area (ratio)
  Custom Memory   every clock (125 MHz)   6.53 (52%)               62%
  TCAM            every clock (360 MHz)   [missing]                [missing] %

  Conditions: 4 kbit array, Vdd = 1.0 V, room temperature, 125 Msps
  [Figure: operation area per search in TCAM and in the Custom Memory, shown as RAM word lines and comparators]
FPGA experiment
    – Examine the hardware operation
    – Use raw data (a BGP routing table)
Conclusion
  Designed a RAM-based fast forwarding engine
    – Hardware architecture
    – FPGA implementation
  Reduced cost and power consumption
    – Cost: 62% of TCAM
    – Power consumption: 52% of TCAM
  Future work
    – Optimization of implementation parameters
    – Handling of table overflow