Intel Compiler Lab and USTC, PPoPP08 Scalable Packet Classification Using Interpreting -- Cross-platform Multi-core Solution Haipeng Cheng & Bei Hua Univ.

Presentation transcript:

Intel Compiler Lab and USTC, PPoPP08
Scalable Packet Classification Using Interpreting: A Cross-platform Multi-core Solution
Haipeng Cheng & Bei Hua, Univ. of Science & Technology of China (USTC)
Xinan Tang, Intel Compiler Lab

Outline
- Background
- Packet Classification Problem
- Review RFC Algorithm
- TIC Algorithm
- Experimental Results and Analysis
- Future Work

10GbE: Smaller, Cheaper, and Denser
Switch Ports
Servers with 10GbE: 10%, 30-40%, 50-60%, >60%
Port Cost: $2-5K, $1-2K, <$400, <$250

Background (Networking)
- 10 Gbps offers too much bandwidth for multi-core computers to handle
- Traffic complexity: triple-play (voice, video, and data) support is essential
- Traffic types: P2P packets occupy 70% of the total network traffic
- Packet classification becomes increasingly important to identify and control the traffic

Background (Multi-core)
- Multi-core becomes prevalent
  - Networking (Intel IXP, Cavium Octeon, RMI XLR)
  - Multi-media (IBM Cell, Intel Larrabee)
  - General-purpose: Intel Core 2 Duo, AMD Barcelona, IBM Power5, Sun Niagara
- Comment: finding an efficient solution for one multi-core architecture is hard; finding a cross-platform solution is even harder

Classification Problem
- The process of partitioning packets into groups is called packet classification.
- Packet classification typically uses a 5-tuple
- It enables value-added services:
  - Security: classify packets based on security policies
  - QoS: sort packets and ensure that they receive an appropriate bandwidth share
  - P2P management: tame the P2P traffic

Packet Classification Example
- Packet (000, 010)
- Which rule does it match?
- How to match?

Why Is Packet Classification Hard?
- Packet classification is NP-hard
- Heuristic solutions seek O(1) lookup time
- At 10 Gbps (OC-192) speed, a 64-byte packet needs to be classified within 40 ns
  - one DRAM access time
  - 100 cycles for a 2.5 GHz CPU
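
The cycle figure above follows directly from the time budget and the clock rate. A minimal sketch of that arithmetic (the function name `cycle_budget` is hypothetical, not from the slides):

```python
def cycle_budget(time_budget_ns: int, cpu_mhz: int) -> int:
    """CPU cycles available within a per-packet time budget."""
    # ns * MHz = cycles * 1000, so integer arithmetic keeps the result exact.
    return time_budget_ns * cpu_mhz // 1000


# 40 ns per packet on a 2.5 GHz (2500 MHz) CPU
print(cycle_budget(40, 2500))  # 100
```

With only about 100 cycles available, a single cache miss that goes to DRAM can consume the entire budget, which is why the later slides focus on memory footprint and access patterns.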

Packet Classification Solutions
At 10 Gbps (OC-192) speed, classification is done by:
- Special ASIC
- TCAM
- Algorithms (?)
  - Hierarchical Tries
  - Recursive Flow Classification (RFC)
  - Two-stage Interpreter based Classification (TIC)

RFC Example
- Even though the search space is huge, (2^3)*(2^3)*(2^3), the number of rules matched per field for a given packet is limited
- A class bitmap can be used to describe the matched rules:
  - 0001 means R4 is the matched rule
  - 1101 means R1, R2, and R4 are the ones matched
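
A small sketch of how such class bitmaps can be read, assuming four rules R1..R4 with the leftmost bit standing for R1 (as the two examples above imply). The helper names are hypothetical, and the `intersect` step is only an illustration of combining the bitmaps of two fields, not the slides' exact procedure:

```python
RULES = ["R1", "R2", "R3", "R4"]  # leftmost bitmap bit corresponds to R1


def matched_rules(bitmap: str) -> list:
    """Decode a class bitmap such as '1101' into the rule names it marks."""
    return [rule for rule, bit in zip(RULES, bitmap) if bit == "1"]


def intersect(a: str, b: str) -> str:
    """Bitwise AND of two equal-length bitmaps: rules matched by both fields."""
    return "".join("1" if x == "1" and y == "1" else "0" for x, y in zip(a, b))


print(matched_rules("0001"))                     # ['R4']
print(matched_rules("1101"))                     # ['R1', 'R2', 'R4']
print(matched_rules(intersect("1101", "0001")))  # ['R4']
```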

RFC Exam.

Recursive Flow Classification
- Map an S-bit string, concatenated from the d fields of the packet header, to a T-bit number through multiple phases (T << S)
- Fields: S-IP (32b), D-IP (32b), S-Port (16b), D-Port (16b), Proto (8b)
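
To make the size of S concrete, the sketch below concatenates the five header fields listed above into one integer key. The field names and the `pack_header` helper are hypothetical; only the field widths come from the slide. It shows the input to RFC, not the multi-phase reduction itself:

```python
# (name, width in bits) for the five header fields, in concatenation order
FIELDS = [("s_ip", 32), ("d_ip", 32), ("s_port", 16), ("d_port", 16), ("proto", 8)]
S = sum(width for _, width in FIELDS)  # 32 + 32 + 16 + 16 + 8 = 104 bits


def pack_header(values: dict) -> int:
    """Concatenate the header fields into a single S-bit integer key."""
    key = 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        key = (key << width) | value
    return key


header = {"s_ip": 0x0A000001, "d_ip": 0x0A000002,
          "s_port": 1234, "d_port": 80, "proto": 6}
print(S)                                     # 104
print(pack_header(header).bit_length() <= S)  # True
```

A direct table indexed by all 104 bits is clearly infeasible, which is why RFC reduces the key in several phases to a much smaller T-bit class number.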

What's Wrong with RFC?
- Memory explosion
- Too slow to update in practice
- However, with 13 memory accesses per packet, RFC is the fastest classification algorithm

Two-stage Interpreting based Classification
Domain knowledge: divide the RFC into two stages:
- Search the source-destination prefix pair
  - 99.9% of the time, no more than 5 rules match a given source-destination prefix pair
- Search the list of port-range expressions
  - Range [2..14] as prefixes: 001*, 01**, 10**, 110*, 1110
  - Range search is based on calculation (<, >, =)
  - Encode the type of the range expressions intelligently
  - Evaluate them sequentially
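
The prefix expansion of the range [2..14] quoted above can be reproduced with a standard range-to-prefix split. This is a minimal sketch (the function name `range_to_prefixes` is hypothetical), shown only to illustrate why one range costs several prefix entries while a calculated comparison needs just two bounds:

```python
def range_to_prefixes(lo: int, hi: int, width: int) -> list:
    """Split the integer range [lo, hi] into the minimal list of bit prefixes."""
    prefixes = []
    while lo <= hi:
        # Largest power-of-two block that is aligned at lo ...
        size = lo & -lo if lo else 1 << width
        # ... shrunk until it no longer overshoots hi.
        while size > hi - lo + 1:
            size >>= 1
        wild = size.bit_length() - 1  # number of wildcard bits
        fixed = width - wild          # number of fixed leading bits
        head = format(lo >> wild, f"0{fixed}b") if fixed > 0 else ""
        prefixes.append(head + "*" * wild)
        lo += size
    return prefixes


print(range_to_prefixes(2, 14, 4))
# ['001*', '01**', '10**', '110*', '1110']

# The same test by calculation needs only the two bounds:
port = 9
print(2 <= port <= 14)  # True
```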

TIC Main Ideas
- L2 cache size is in the range of megabytes
- Network applications are memory intensive
- Memory is best accessed sequentially
  - 64-byte cache line size for Core 2 Duo
  - 64-byte local memory for IXP
- Can compression be used to optimize performance?
  - CISC encoding for a smaller memory footprint

Putting Everything Together
- Domain knowledge: two-stage classification
- Architecture features:
  - Plenty of CPU cycles
  - Large L2 cache
  - Block-based sequential access
  - Branch prediction can eliminate infrequently executed paths

Port-Range Expressions
- There are five types of range expressions
  - WC (wildcard)
  - HI ([1024, 65535])
  - LO ([0, 1023])
  - AR (arbitrary range)
  - EM (exact match)
- For (s-port, d-port, proto), there are at least 5x5x2=50 operators
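
The operator count can be checked by enumerating the combinations. In this sketch the operator naming scheme is hypothetical, and the two protocol types (wildcard or exact match) are an assumption made to account for the factor of 2; the slide does not spell them out:

```python
from itertools import product

PORT_TYPES = ("WC", "HI", "LO", "AR", "EM")  # the five range-expression types
PROTO_TYPES = ("WC", "EM")                   # assumption: protocol is wildcard or exact

# One operator per (s-port type, d-port type, proto type) combination
OPERATORS = ["_".join(combo) for combo in product(PORT_TYPES, PORT_TYPES, PROTO_TYPES)]

print(len(OPERATORS))  # 50
print(OPERATORS[:3])   # ['WC_WC_WC', 'WC_WC_EM', 'WC_HI_WC']
```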

Characteristics of Range Expressions for Destination Port
Classifier  WC      HI      LO  EM      AR
seed1       30.42%                      11.6%
seed2       9.25%   13.96%  -   65.75%  11.04%
seed3       8.56%   12.15%  -   68.08%  11.21%
seed4       30.00%  4.08%   -   60.72%  5.20%
seed5       55.46%  6.52%   -   35.48%  2.53%

Encoding and Interpreting
- Eliminate WC calculation
- Introduce HI and LO operators without storing the constants
  - HI ([1024, 65535])
  - LO ([0, 1023])
- Store AR and EM parameters in the operand fields
- NOP for code block alignment
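
A rough sketch of the encoding idea, under stated assumptions: the operator is named after the source-port and destination-port types, and operands are stored only for EM (one constant) and AR (two constants). The names `port_type` and `encode` are hypothetical, the protocol field and NOP padding are left out for brevity, and the real TIC instruction layout is not reproduced here:

```python
def port_type(lo: int, hi: int) -> str:
    """Classify a 16-bit port range into one of the five expression types."""
    if (lo, hi) == (0, 65535):
        return "WC"
    if (lo, hi) == (1024, 65535):
        return "HI"
    if (lo, hi) == (0, 1023):
        return "LO"
    return "EM" if lo == hi else "AR"


def encode(sport: tuple, dport: tuple) -> tuple:
    """Return (operator, operands) for one rule's source and destination port ranges."""
    names, operands = [], []
    for lo, hi in (sport, dport):
        kind = port_type(lo, hi)
        names.append(kind)
        if kind == "EM":
            operands.append(lo)        # one constant
        elif kind == "AR":
            operands.extend((lo, hi))  # two constants
        # WC, HI, LO: nothing stored, the operator itself implies the bounds
    return "_".join(names), operands


print(encode((0, 65535), (80, 80)))         # ('WC_EM', [80])
print(encode((1024, 65535), (6000, 6010)))  # ('HI_AR', [6000, 6010])
print(encode((0, 65535), (0, 1023)))        # ('WC_LO', [])
```

Since the destination-port table shows EM, WC, and HI dominating the classifiers, most rules need zero or one operand under this kind of encoding.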

Can We Afford to Increase the Number of Operators?
- The interpreter is a big switch-case statement.
- The compiler stores the starting address of each case in a jump table.
- The interpreter executes two instructions per iteration:
  - load an address into a register from the jump table
  - jump to the address in the indirect addressing mode
- IXP –E compiler can optimize switch-case with:
  - Default Case Removal
  - Switch Block Packing
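
The dispatch pattern described above can be imitated in Python with a list of handler functions indexed by opcode, which plays the role of the compiler's jump table. Everything here is a hypothetical sketch for a single port field (opcodes, handler names, and code layout are assumptions), not the authors' interpreter:

```python
# Hypothetical opcodes for a single port field
OP_HI, OP_LO, OP_EM, OP_AR = range(4)


def op_hi(port, code, pc):
    return port >= 1024, pc                       # no operand


def op_lo(port, code, pc):
    return port <= 1023, pc                       # no operand


def op_em(port, code, pc):
    return port == code[pc], pc + 1               # one operand


def op_ar(port, code, pc):
    return code[pc] <= port <= code[pc + 1], pc + 2  # two operands


HANDLERS = [op_hi, op_lo, op_em, op_ar]  # stands in for the jump table


def first_match(port: int, code: list) -> int:
    """Evaluate the encoded expressions in order; return the first match index or -1."""
    pc, index = 0, 0
    while pc < len(code):
        handler = HANDLERS[code[pc]]               # load the target from the table
        matched, pc = handler(port, code, pc + 1)  # indirect call to the handler
        if matched:
            return index
        index += 1
    return -1


program = [OP_EM, 80, OP_AR, 6000, 6010, OP_HI]
print(first_match(80, program))    # 0
print(first_match(6005, program))  # 1
print(first_match(2000, program))  # 2
print(first_match(22, program))    # -1
```

Adding an operator only adds one table entry and one handler; the per-iteration dispatch cost stays the same, which is the point the slide makes about the jump table.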

Experimental Setup
- Intel Xeon 5160 Core 2 Duo running at 3.00 GHz with 4 MB L2 cache and a 1333 MHz system bus
- Cycle-accurate IXP2800 simulator; each ME runs at 1.2 GHz with 8 threads
- Packet traces are generated with ClassBench; the low-locality traces are used to cancel out locality effects

Space Reduction
Table columns: Size, Classifier, #Rules, RFC (MB), TIC (MB)

Relative Speedups on Core 2 Duo
Imp.   1 T      2 T     3 T     4 T
DB1    -16.4%   -4.7%   3.9%    12.8%
DB2    4.9%     10.2%   8.6%    3.1%
DB3    2.1%     5.2%    9.8%    1.5%
DB4    -12%     1.9%            1.8%
DB5    17.5%    22%     42.4%   13.1%
Ave.   -3.1%    5.2%    11.4%   6.39%

Speedups on IXP (RFC vs. TIC)
Imp.   1 T     2 T      4 T      8 T      16 T     32 T
DB1            -33.7%   -31.1%   -48.1%   -46.7%   -41.9%
DB2    -4.5%   -9.7%    -9.3%    -14.5%   -11.5%   -14.3%
DB3    -5.6%   -10.4%   -9.8%    -17.2%   -11.9%   -14%
DB4            -15.9%   -15.6%   -15.4%   -16.7%   -23.4%
DB5    -2.3%   -7.9%    -8.8%    -13.7%   -11.8%   -15.6%

Why Is RFC Better than TIC on IXP?
- Block size plays an important role in the IXP architecture since SRAM is optimized for 32-bit access
       #SRAM Accesses   #Words Accessed
RFC    13               13 W
TIC    7 + 1 = 8        7 + 8 = 15 W
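
The two rows can be read as follows, assuming that "7 + 1" means seven single-word accesses plus one block access and that the block is 8 words long (this reading is inferred from the "7 + 8 = 15 W" entry). The helper name is hypothetical:

```python
def access_cost(words_per_access: list) -> tuple:
    """Return (number of SRAM accesses, total 32-bit words fetched)."""
    return len(words_per_access), sum(words_per_access)


rfc = [1] * 13       # 13 single-word accesses
tic = [1] * 7 + [8]  # 7 single-word accesses plus one 8-word block

print(access_cost(rfc))  # (13, 13)
print(access_cost(tic))  # (8, 15)
```

TIC issues fewer accesses but moves more words, so on a memory optimized for single 32-bit words the block fetch does not pay off the way a 64-byte cache line does on Core 2 Duo.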

Block Size Impacts on IXP

Future Work
- Improve TIC performance on IXP
- Improve TIC performance on firewall rules
- Improve update speeds