Performance Analysis of Packet Classification Algorithms on Network Processors Deepa Srinivasan, IBM Corporation Wu-chang Feng, Portland State University.

Performance Analysis of Packet Classification Algorithms on Network Processors Deepa Srinivasan, IBM Corporation Wu-chang Feng, Portland State University November 18, 2004 IEEE Local Computer Networks

Network Processors
Emerging platform for high-speed packet processing
–Splice in a statistic here?
–Provide device programmability while keeping performance
Architectures differ, but common features include…
–Multiple processing units executing in parallel
–Instruction set customized for network applications
–Binary image pre-determined at compile time

Example: Intel’s IXP

IXP Architecture
Multi-processor
–StrongARM core for slow-path processing
–6 microengines for fast-path processing
Hardware support for multithreading
Each microengine has 4 thread contexts
Zero or minimal overhead context switch

Motivation for study
NPs offer a programmable, parallel alternative, but current packet processing algorithms are
–Written for sequential execution or
–Designed using custom, invariant ASICs
To use them on NPs
–Algorithms must be mapped onto NPs in different ways with each mapping having varying performance

Our study
Examine several mappings of a packet classification algorithm onto NP hardware
Identify general problems in performing such mappings

Why packet classification?
Fundamental function performed by all network devices
–Routers, switches, bridges, firewalls, IDS
Increasing complexity makes packet classification the bottleneck
–Increase in size of rulesets
–Increase in dimension of rulesets
–Algorithms must perform at high speed on the fast path

Picking an algorithm
Many algorithms are sequential
–Do not leverage inherent parallelism in NPs
Several parallel algorithms
–BitVector [Lakshman98]
  Parallel lookup implemented via FPGA
  Maps well onto NP platform

Bit Vector algorithm
T.V. Lakshman, D. Stiliadis, “High-speed policy-based packet forwarding using efficient multidimensional range matching”, SIGCOMM 1998
–Parallel search algorithm
–Preprocessing phase
–Two-stage classification phase
  Perform lookup for each dimension in parallel
  Combine results to determine matching rule

Example ruleset
Rule   Field 1    Field 2    Field 3    Action
r1     (10, 11)   (2, 4)     (8, 11)    Allow
r2     (4, 6)     (8, 11)    (1, 4)     Allow
r3     (9, 11)    (5, 7)     (12, 14)   Deny
r4     (6, 8)     (1, 3)     (5, 9)     Allow
Number of rules (N) = 4
Number of dimensions (d) = 3
Width of dimension (W) = 4 (bits)

BitVector example
Packet = {6, 10, 2}
Matching rule = r2
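
The lookup in this example can be reproduced with a short sketch. It is a simplified illustration, not the authors' implementation: it assumes inclusive ranges, replaces the range tries with a direct per-value table (feasible only because W = 4), and assumes the lowest-numbered matching rule wins. The names `build_bitvectors` and `classify` are hypothetical.

```python
# Example ruleset from the slides: one inclusive range per dimension, plus an action.
RULES = [
    ("r1", [(10, 11), (2, 4), (8, 11)], "Allow"),
    ("r2", [(4, 6), (8, 11), (1, 4)], "Allow"),
    ("r3", [(9, 11), (5, 7), (12, 14)], "Deny"),
    ("r4", [(6, 8), (1, 3), (5, 9)], "Allow"),
]
W = 4  # width of each dimension in bits


def build_bitvectors(rules, width):
    """Preprocessing: for each dimension, map every value to an N-bit vector.

    Bit i is set when rule i matches that value in that dimension.
    Assumption: a direct table per value stands in for the range tries.
    """
    dims = len(rules[0][1])
    tables = []
    for dim in range(dims):
        table = []
        for value in range(1 << width):
            vector = 0
            for i, (_, ranges, _) in enumerate(rules):
                lo, hi = ranges[dim]
                if lo <= value <= hi:
                    vector |= 1 << i
            table.append(vector)
        tables.append(table)
    return tables


def classify(packet, tables, rules):
    """Classification: one lookup per dimension, then AND the vectors."""
    result = (1 << len(rules)) - 1
    for dim, value in enumerate(packet):
        result &= tables[dim][value]
    if result == 0:
        return None  # no rule matches in every dimension
    # Assumption: the lowest set bit is the highest-priority matching rule.
    index = (result & -result).bit_length() - 1
    return rules[index][0], rules[index][2]


TABLES = build_bitvectors(RULES, W)
match = classify((6, 10, 2), TABLES, RULES)
print(match)  # ('r2', 'Allow')
```

Field 1 = 6 matches r2 and r4, while Field 2 = 10 and Field 3 = 2 each match only r2, so the AND leaves r2 as the single candidate.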

Two design mappings
Consider multiple mappings of BitVector onto Intel’s IXP1200 microengines
–Option 1 (Parallel): All processing for a single packet handled by one microengine (μEngine)
–Option 2 (Pipelined): Processing for a single packet is split across μEngines
Recall: IXP has 6 μEngines

Parallel Mapping

Pipelined Mapping

Memory allocation
Purpose                                      Type of memory
Queue for inter-microengine communication    SRAM
List of rule actions                         SRAM
Tries representing ranges                    SDRAM
Bit vectors                                  SDRAM

Evaluation platform
Intel IXP1200 Developer Workbench
–Graphical IDE
–Cycle-accurate simulator
–Performance statistics
All experiments run within simulator
–Configurable
–Logging facility

Simulator configuration
IXP1200 chip
–1K microstore
–Core frequency (~165 MHz)
–4 ports receive data
Simulations run until packets received by IXP
–Simulator sends packets as fast as possible
Rulesets used
–Experiments use a small, fixed set of rules
–Availability of real-world firewall rulesets limited

Performance metrics
Transmit rate (Mbps): The overall packet transmit rate of the IXP, for all the ports that are configured to send packets.
Microengine execution time (%): The percentage of the total number of microengine cycles that a microengine spent in performing useful tasks.
Microengine aborted time (%): The percentage of the total time of a microengine that was wasted due to instructions in its pipeline being aborted, typically due to branch instructions.
Microengine idle time (%): The percentage of the total time of a microengine that was wasted due to none of the 4 hardware threads being available to run, typically due to memory access wait time.
SDRAM access (%): The total percentage of SDRAM bandwidth utilized by all microengines.
SRAM access (%): The total percentage of SRAM bandwidth utilized by all microengines.
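
As a rough illustration of how the three microengine percentages relate, the sketch below derives them from raw cycle counts. It assumes that executing, aborted, and idle cycles together account for all of a microengine's time. The function name and the cycle counts are hypothetical and are not measurements from the study.

```python
def microengine_breakdown(executing, aborted, idle):
    """Return (execution %, aborted %, idle %) from raw cycle counts.

    Assumption: the three categories partition the microengine's cycles.
    """
    total = executing + aborted + idle
    if total == 0:
        raise ValueError("no cycles recorded")
    return tuple(100.0 * cycles / total for cycles in (executing, aborted, idle))


# Hypothetical cycle counts, chosen only to make the arithmetic easy to follow.
exec_pct, abort_pct, idle_pct = microengine_breakdown(6000, 1000, 3000)
print(f"execution {exec_pct:.1f}%  aborted {abort_pct:.1f}%  idle {idle_pct:.1f}%")
```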

Results and Analysis

Throughput

Packets sent/receive ratio

Analysis
Overall, Parallel performs better than Pipelined
Pipelined: A single packet header in SDRAM is read multiple (3) times
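
The extra SDRAM traffic can be illustrated with a toy model that counts header reads per packet. It assumes that the Parallel mapping reads the header once and that the Pipelined mapping has one stage per dimension, each re-reading the header. The slides state only the total of 3 reads, so the stage layout is a guess. `Sdram`, `lookup`, and both mapping functions are hypothetical names.

```python
class Sdram:
    """Toy packet buffer that counts how often a header is read."""

    def __init__(self, headers):
        self._headers = headers
        self.reads = 0

    def read_header(self, handle):
        self.reads += 1
        return self._headers[handle]


def lookup(dim, value):
    """Placeholder for a per-dimension lookup; returns a dummy bit vector."""
    return 0b1111


def parallel_mapping(sdram, handle):
    """One microengine handles the whole packet: a single header read."""
    header = sdram.read_header(handle)
    result = 0b1111
    for dim, value in enumerate(header):
        result &= lookup(dim, value)
    return result


def pipelined_mapping(sdram, handle, stages=3):
    """Each stage (assumed: one per dimension) re-reads the header."""
    result = 0b1111
    for dim in range(stages):
        header = sdram.read_header(handle)
        result &= lookup(dim, header[dim])
    return result


par, pipe = Sdram({0: (6, 10, 2)}), Sdram({0: (6, 10, 2)})
parallel_mapping(par, 0)
pipelined_mapping(pipe, 0)
print(par.reads, pipe.reads)  # 1 3
```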

Microengine utilization

Microengine aborted time

Analysis
Aborted time is typically caused by branch instructions
Algorithms must reduce branch instructions to maximize throughput

Microengine idle time

Distribution of microengine time (panels: Parallel, Pipelined)

Analysis
High microengine idle time in Pipelined due to memory latency
Lower microengine aborted time in Pipelined due to what?

Discussion
Pipelined mappings can bottleneck through memory
–Repeated memory reads to send work from μEngine to μEngine
–Direct hardware support for pipelining required
  IXP2xxx = next-neighbor registers
  Currently re-examining our results on IXP2400
Algorithms with fewer branch instructions result in better microengine utilization (lower aborted time)

Conclusion
Packet classification is a fundamental function
Parallel nature of NPs is well suited to parallel search algorithms

Conclusion
Network processors offer high packet processing speed and programmability
–Performance of an algorithm depends on the design mapping chosen
Contributions
–Demonstrated that mapping has considerable impact on performance
  Pipelined mappings benefit from hardware support
  Algorithms with fewer branch instructions result in better processor utilization

Future work
Analyze other mappings
–Split work across different hardware threads in a single microengine
–Placement of data structures in different memory banks
IXP2400
–Examine how hardware features change trade-offs in algorithm mapping
Algorithms designed specifically for network processors

Backup Slides

Definitions
Packet classification: process of categorizing packets according to predefined rules
Classifier or ruleset: collection of rules
Dimension or field: packet header field used
Rule: range of field values and action
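
These definitions map naturally onto a small data structure. The sketch below is illustrative only: it assumes inclusive ranges and first-match priority, uses a made-up two-field ruleset, and classifies with a plain linear search rather than the Bit Vector algorithm studied in the talk. `Rule` and `linear_search` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Rule:
    """A rule: one inclusive (low, high) range per dimension, plus an action."""
    name: str
    ranges: Tuple[Tuple[int, int], ...]
    action: str

    def matches(self, packet: Sequence[int]) -> bool:
        # A packet matches only if every field falls inside its range.
        return all(lo <= v <= hi for (lo, hi), v in zip(self.ranges, packet))


def linear_search(classifier: Sequence[Rule], packet: Sequence[int]) -> Optional[Rule]:
    """Return the first matching rule (assumed priority order), or None."""
    for rule in classifier:
        if rule.matches(packet):
            return rule
    return None


# Hypothetical classifier (ruleset) over two 4-bit dimensions.
classifier = [
    Rule("allow-low", ((0, 7), (0, 15)), "Allow"),
    Rule("deny-rest", ((0, 15), (0, 15)), "Deny"),
]
hit = linear_search(classifier, (3, 9))
print(hit.name, hit.action)  # allow-low Allow
```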

Packet classification algorithms
Algorithm                           Time complexity        Storage complexity
Linear Search                       Nd                     Nd
Set-pruning Tries                   dW                     N^d dW
Grid of tries                       W^(d-1)                NdW
Cross-producting                    dW                     N^d
Fat Inverted Segment tree           (l + 1)W               l x N^(1+1/l)
Recursive Flow Classification       d                      N^d
Hierarchical Intelligent Cuttings   d                      N^d
Tuple Space Search                  m                      N
Bit Vector                          dW + N/memory-width    dN^2
N: number of rules
d: number of dimensions
W: maximum number of bits
l: number of levels occupied by a FIS-tree
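
To get a feel for the Bit Vector row, the sketch below evaluates it numerically, reading the row as time dW + N/memory-width and storage dN^2. These are asymptotic expressions, so the results indicate scale rather than exact cycle or bit counts. Rounding N/memory-width up to whole memory words and the 32-bit memory width are assumptions. `bitvector_costs` is a hypothetical name.

```python
import math


def bitvector_costs(n_rules, dims, width, memory_width):
    """Evaluate the Bit Vector row: (time, storage).

    time    = d*W + ceil(N / memory_width)   (ceil is an assumption)
    storage = d * N^2 bits
    """
    lookup_steps = dims * width + math.ceil(n_rules / memory_width)
    storage_bits = dims * n_rules ** 2
    return lookup_steps, storage_bits


# Example ruleset from the talk (N=4, d=3, W=4); 32-bit memory width is assumed.
small = bitvector_costs(4, 3, 4, 32)
# A larger, hypothetical ruleset to show how the N^2 storage term grows.
large = bitvector_costs(1000, 5, 32, 32)
print(small)  # (13, 48)
print(large)  # (192, 5000000)
```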

Implementation constraints 8 SRAM locks Queue