SCALABLE PACKET CLASSIFICATION USING INTERPRETING: A CROSS-PLATFORM MULTI-CORE SOLUTION
Author: Haipeng Cheng, Zheng Chen, Bei Hua and Xinan Tang
Publisher/Conf.: ACM PPoPP '08

Presentation transcript:

SCALABLE PACKET CLASSIFICATION USING INTERPRETING: A CROSS-PLATFORM MULTI-CORE SOLUTION Author: Haipeng Cheng, Zheng Chen, Bei Hua and Xinan Tang Publisher/Conf.: ACM/PPoPP '08 (the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming) Speaker: Han-Jhen Guo Date:

OUTLINE
Developing TIC Algorithm
  RFC Reduction Tree
  TIC Algorithm Description
  Instruction Encoding
  The Range Interpreter
Architecture-aware Design and Implementation
Simulation and Performance Analysis
  Relative Speedups for Core 2 Duo
  Relative Speedups for IXP2800

DEVELOPING TIC ALGORITHM - RFC REDUCTION TREE (1/2) A simple example of an RFC reduction tree

DEVELOPING TIC ALGORITHM - RFC REDUCTION TREE (2/2)
Actual architecture of the RFC reduction tree:
  4 phases
  13 memory accesses per packet
  disadvantage: the cost is memory explosion (the precomputed combining tables can grow very large)
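For readers unfamiliar with RFC, the following is a minimal C sketch of a reduction-tree lookup whose tree shape gives exactly 4 phases and 13 memory accesses; the table names, row widths and the combining order are illustrative assumptions, not the paper's layout.

    /* Minimal sketch of an RFC-style reduction-tree lookup (background for TIC).
     * Phase 0 indexes chunk tables directly with header chunks; later phases
     * combine the resulting equivalence-class IDs (eqIDs) until a single table
     * access yields the final class. Tree shape and table names are assumed. */
    #include <stdint.h>

    extern const uint16_t t_sip_hi[], t_sip_lo[], t_dip_hi[], t_dip_lo[];
    extern const uint16_t t_sport[], t_dport[], t_proto[];
    extern const uint16_t t1a[], t1b[], t1c[], t2a[], t2b[], t3[];   /* combining tables */
    extern const unsigned w1a, w1b, w1c, w2a, w2b, w3;               /* their row widths */

    uint16_t rfc_classify(uint32_t sip, uint32_t dip,
                          uint16_t sport, uint16_t dport, uint8_t proto)
    {
        /* Phase 0: 7 direct chunk lookups. */
        uint16_t e0 = t_sip_hi[sip >> 16], e1 = t_sip_lo[sip & 0xffff];
        uint16_t e2 = t_dip_hi[dip >> 16], e3 = t_dip_lo[dip & 0xffff];
        uint16_t e4 = t_sport[sport], e5 = t_dport[dport], e6 = t_proto[proto];

        /* Phase 1: 3 combining lookups. */
        uint16_t a = t1a[e0 * w1a + e1];
        uint16_t b = t1b[e2 * w1b + e3];
        uint16_t c = t1c[e4 * w1c + e5];

        /* Phase 2: 2 combining lookups. */
        uint16_t d = t2a[a * w2a + b];
        uint16_t e = t2b[c * w2b + e6];

        /* Phase 3: the final lookup returns the matching class/rule. */
        return t3[d * w3 + e];
    }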

DEVELOPING TIC ALGORITHM - TIC ALGORITHM DESCRIPTION
Two-stage Interpreting based Classification (TIC) algorithm
  Stage 1: use the source IP address and destination IP address to retrieve a list of range expressions (the possibly matched rules)
  Stage 2: search for the matched rules against the source port, destination port and protocol in the code block, executed by the Range Interpreter (RI)
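A minimal C sketch of this two-stage flow, assuming hypothetical helpers ip_stage_lookup() and ri_execute() (the names and signatures are mine, not the paper's):

    /* Hedged sketch of the TIC two-stage classification flow.
     * Stage 1 maps (source IP, destination IP) to the address of a code block
     * that encodes the possibly matching rules as range-test instructions;
     * Stage 2 lets the Range Interpreter (RI) execute that block against the
     * remaining three fields. */
    #include <stdint.h>
    #include <stddef.h>

    struct packet {
        uint32_t sip, dip;
        uint16_t sport, dport;
        uint8_t  proto;
    };

    /* Stage 1: IP-based lookup returning the first code block's address (hypothetical). */
    const uint8_t *ip_stage_lookup(uint32_t sip, uint32_t dip);

    /* Stage 2: the RI walks the instruction block and reports matched rule IDs (hypothetical). */
    int ri_execute(const uint8_t *code_block,
                   uint16_t sport, uint16_t dport, uint8_t proto,
                   uint16_t *matched, int max_matched);

    int tic_classify(const struct packet *p, uint16_t *matched, int max_matched)
    {
        const uint8_t *block = ip_stage_lookup(p->sip, p->dip);    /* Stage 1 */
        if (block == NULL)
            return 0;                                              /* no candidate rules */
        return ri_execute(block, p->sport, p->dport, p->proto,     /* Stage 2 */
                          matched, max_matched);
    }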

DEVELOPING TIC ALGORITHM - INSTRUCTION ENCODING (1/2)
  Operator (8-bit): encodes the protocol-srcPORT-desPORT class combination
  Operand0 (8-bit): protocol
  Operand1~Operand4 (16-bit each): srcPORT (begin|end) and desPORT (begin|end)
  RuleID (16-bit): identifies a matched rule; with 16 bits, the maximum number of rules in a classifier is 64K (2^16)
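For concreteness, a packed C struct for the widest instruction implied by this field list; this is an illustrative assumption, since the actual encoding is variable-length and omits operands that the operator makes unnecessary (see the 4-byte example on the next slide).

    /* Sketch of the widest TIC instruction suggested by the field list above.
     * The operator byte encodes the protocol/srcPORT/desPORT class combination;
     * in the real CISC-style encoding, operands a class does not need are
     * dropped, so this packed layout is the worst case, not the only one. */
    #include <stdint.h>

    #pragma pack(push, 1)
    struct tic_insn_full {
        uint8_t  op;           /* operator: protocol-srcPORT-desPORT class combination */
        uint8_t  proto;        /* operand0: protocol value                              */
        uint16_t sport_begin;  /* operand1: source-port range begin                     */
        uint16_t sport_end;    /* operand2: source-port range end                       */
        uint16_t dport_begin;  /* operand3: destination-port range begin                */
        uint16_t dport_end;    /* operand4: destination-port range end                  */
        uint16_t rule_id;      /* matched rule ID; 16 bits allow up to 64K (2^16) rules */
    };
    #pragma pack(pop)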

DEVELOPING TIC ALGORITHM - INSTRUCTION ENCODING (2/2)
Field classes of "ClassBench":
  port range: WC (wildcard), HI ([1024 : 65535]), LO ([0 : 1023]), AR (arbitrary range), EM (exact match)
  protocol range: WC, EM
A small example of instruction encoding: an EM-WC-WC rule uses a 4-byte instruction

DEVELOPING TIC ALGORITHM - THE RANGE INTERPRETER
All instruction blocks are stored in external memory; the address of the first code block is obtained after Stage 1, and the RI then interprets that block to complete the classification.
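A sketch of what the RI's fetch-decode-test loop over such a code block might look like; the opcode values, operand layout and END terminator are assumptions, with the 4-byte EM-WC-WC case from the encoding slides written out explicitly.

    /* Hedged sketch of a Range Interpreter loop. It walks the variable-length
     * instructions of one code block, applies the range/exact tests each opcode
     * implies, and records the rule IDs that match. Opcode numbering and the
     * END terminator are illustrative assumptions. */
    #include <stdint.h>
    #include <string.h>

    enum { OP_END = 0, OP_EM_WC_WC = 1, OP_EM_AR_AR = 2 /* ... other class combinations */ };

    int ri_execute(const uint8_t *pc,
                   uint16_t sport, uint16_t dport, uint8_t proto,
                   uint16_t *matched, int max_matched)
    {
        int n = 0;
        for (;;) {
            uint8_t op = *pc++;
            if (op == OP_END)
                break;
            switch (op) {
            case OP_EM_WC_WC: {            /* exact protocol, both ports wildcarded: 4B instruction */
                uint8_t p = *pc++;
                uint16_t rule;
                memcpy(&rule, pc, 2); pc += 2;
                if (proto == p && n < max_matched)
                    matched[n++] = rule;
                break;
            }
            case OP_EM_AR_AR: {            /* exact protocol, both ports arbitrary ranges: 12B instruction */
                uint8_t p = *pc++;
                uint16_t sb, se, db, de, rule;
                memcpy(&sb, pc, 2); memcpy(&se, pc + 2, 2);
                memcpy(&db, pc + 4, 2); memcpy(&de, pc + 6, 2);
                memcpy(&rule, pc + 8, 2); pc += 10;
                if (proto == p && sport >= sb && sport <= se &&
                    dport >= db && dport <= de && n < max_matched)
                    matched[n++] = rule;
                break;
            }
            default:
                return n;                  /* unknown opcode: stop interpreting */
            }
        }
        return n;
    }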

ARCHITECTURE-AWARE DESIGN AND IMPLEMENTATION (1/3)
Hardware:
  Intel Core 2 Duo with two levels of cache (multi-core architecture): 4MB L2 cache and 64B cache lines
  Intel IXP2800 without cache (multi-threaded architecture)

ARCHITECTURE-AWARE DESIGN AND IMPLEMENTATION (2/3)
Space reduction:
  CISC-style instruction encoding produces a smaller program size than the RISC encoding
  the variable-size CISC encoding can save up to 15% of memory compared with a fixed 8-byte RISC encoding
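A hedged illustration of where the saving comes from: the instruction size depends on which operands the rule's field classes actually require, whereas a fixed RISC encoding spends 8 bytes on every instruction. The class codes and operand sizes below are assumptions kept consistent with the 4-byte EM-WC-WC example above.

    /* Illustrative sizing of a variable-length (CISC-style) TIC instruction.
     * WC/HI/LO port classes are implied by the operator byte and need no
     * operands; EM and AR ports carry explicit 16-bit begin/end operands
     * (an assumption for this sketch). */
    enum field_class { FC_WC, FC_HI, FC_LO, FC_AR, FC_EM };

    static unsigned port_operand_bytes(enum field_class c)
    {
        return (c == FC_AR || c == FC_EM) ? 4 : 0;   /* begin + end, 2 bytes each */
    }

    unsigned tic_insn_size(enum field_class proto_class,
                           enum field_class sport_class,
                           enum field_class dport_class)
    {
        unsigned size = 1 /* operator */ + 2 /* rule ID */;
        if (proto_class == FC_EM)
            size += 1;                               /* explicit protocol operand */
        size += port_operand_bytes(sport_class) + port_operand_bytes(dport_class);
        return size;                                 /* EM-WC-WC: 1 + 1 + 2 = 4 bytes */
    }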

ARCHITECTURE-AWARE DESIGN AND IMPLEMENTATION (3/3)
Latency hiding of memory accesses:
  Core 2: one CPU core can run a helper thread that warms up the L2 cache; the main thread on the other core executes faster because the cache lines it needs have already been fetched
  IXP2800: 1) issue outstanding memory requests whenever possible, e.g. the memory operations of phase 0 in the first stage can be issued simultaneously; 2) overlap memory accesses with ALU execution, so that memory-address calculation is overlapped with other memory operations
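A minimal sketch of the Core 2 helper-thread idea, assuming a hypothetical next_block_addr() feed and an illustrative prefetch distance and block size; it is not the paper's implementation.

    /* Hedged sketch of the helper-thread cache warm-up on Core 2: the second
     * core prefetches the code blocks the main classification thread is about
     * to touch, so the main thread finds them already resident in the shared
     * L2. next_block_addr(), the look-ahead of 4 packets and the 256B block
     * size are illustrative assumptions. */
    #include <pthread.h>
    #include <stdint.h>
    #include <stddef.h>

    static volatile int done = 0;

    /* Hypothetical helper: address of the code block N packets ahead of the main thread. */
    const uint8_t *next_block_addr(unsigned packets_ahead);

    static void *cache_warmer(void *arg)
    {
        (void)arg;
        while (!done) {
            const uint8_t *blk = next_block_addr(4);
            if (blk != NULL)
                /* Step through the block in 64B strides (Core 2 cache-line size);
                 * __builtin_prefetch is a GCC/Clang builtin. */
                for (unsigned off = 0; off < 256; off += 64)
                    __builtin_prefetch(blk + off, 0 /* read */, 1 /* low temporal locality */);
        }
        return NULL;
    }

    int run_with_helper(void)
    {
        pthread_t helper;
        pthread_create(&helper, NULL, cache_warmer, NULL);
        /* ... main thread classifies packets here, e.g. with tic_classify() ... */
        done = 1;
        pthread_join(helper, NULL);
        return 0;
    }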

SIMULATION AND PERFORMANCE ANALYSIS - EFFECTIVE SPACE REDUCTION

SIMULATION AND PERFORMANCE ANALYSIS - RELATIVE SPEEDUPS FOR CORE 2 DUO (1/2)
In the worst cases:
  there is a large number of L2 cache misses, so the overhead of the RI is insignificant
  TIC's worst-case classification speed is slower than RFC's in the 1- and 2-thread cases but faster than RFC's in the 3- and 4-thread cases
  TIC's performance is better than RFC's when more threads are available

SIMULATION AND PERFORMANCE ANALYSIS - RELATIVE SPEEDUPS FOR CORE 2 DUO (2/2)
In the average cases:
  the main thread has very few L2 cache misses, so the interpreter overhead might be noticeable
  TIC is still faster than RFC in classification speed when the memory footprint of RFC is bigger than the L2 cache size

SIMULATION AND PERFORMANCE ANALYSIS - RELATIVE SPEEDUPS FOR IXP2800 (1/3) TIC’s performance is worse than RFC’s

SIMULATION AND PERFORMANCE ANALYSIS - RELATIVE SPEEDUPS FOR IXP2800 (2/3)
Factors behind the poorer performance:
  TIC issues fewer memory accesses, but each access transfers more long-words
  the SRAM FIFO occupancy of TIC is larger than that of RFC in both the average and the worst cases, so an SRAM operation stays in the FIFO longer for TIC than for RFC

SIMULATION AND PERFORMANCE ANALYSIS - RELATIVE SPEEDUPS FOR IXP2800 (3/3)
Block-size impact on the IXP2800: both the classification speed and the speedup are higher with a 32B block size than with a 64B block size