Centre de Comunicacions Avançades de Banda Ampla (CCABA) Universitat Politècnica de Catalunya (UPC) Identification of Network Applications based on Machine.

Centre de Comunicacions Avançades de Banda Ampla (CCABA) Universitat Politècnica de Catalunya (UPC) Identification of Network Applications based on Machine Learning Techniques Terena Networking Conference 2008 Valentín Carela-Español Pere Barlet-Ros Josep Solé-Pareta {vcarela, pbarlet, pareta}@ac.upc.edu

Outline Motivations and objectives Scenario and requirements Existing solutions  Well-known ports  Payload based (pattern matching)  Machine Learning –Supervised –Unsupervised Proposed method Summary and conclusions

Motivations and objectives Typical method based on well-known ports is no longer valid to identify applications Network administration and management tasks  Network dimensioning, capacity planning, network performance evaluation, … QoS monitoring  Class-of-Service mapping  Quality-of-Service policies  Possible way of pricing for QoS

Scenario: SMARTxAC SMARTxAC: Traffic Monitoring and Analysis System for the Anella Científica  Operational since July 2003  Developed under a collaboration agreement CESCA-UPC  Tailor-made traffic monitoring system for the Anella Científica Main objectives  Low-cost platform  Continuous monitoring of high-speed links without packet loss  Detection of network anomalies and irregular usage  Multi-user system: network operators and institutions Measurement of two full-duplex GigE links  Connection between Anella Científica and RedIRIS

Measurement scenario

Requirements Real-time classification Independent from packet contents High-speed links Without packet loss High accuracy Method implemented in SMARTxAC

Well-known ports Characteristics  Use of well-known ports from IANA  Packet inspection is not needed  Computationally lightweight Limitations (especially due to new P2P applications)  Dynamic ports  HTTP Requests  New applications do not register their ports in IANA Consequence: Very low accuracy
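
The port-based approach can be illustrated with a minimal sketch. The table below is a small hand-picked excerpt of IANA assignments, and the function name `classify_by_port` is an assumption made for this example, not part of SMARTxAC.

```python
# Small excerpt of IANA well-known port assignments (illustrative, not complete).
WELL_KNOWN_PORTS = {
    20: "ftp-data", 21: "ftp", 22: "ssh", 25: "smtp",
    53: "dns", 80: "http", 110: "pop3", 443: "https",
}


def classify_by_port(src_port, dst_port):
    """Return the application registered for either port, else 'unknown'."""
    for port in (dst_port, src_port):
        app = WELL_KNOWN_PORTS.get(port)
        if app is not None:
            return app
    return "unknown"


# A plain web request is recognized from its destination port.
print(classify_by_port(51234, 80))
# A P2P client on a dynamic port is not recognized at all.
print(classify_by_port(51234, 6881))
```

Any application that chooses to run over port 80 would be labeled `http` by this lookup, which is the kind of misclassification listed under the limitations.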

Well-known ports example

Payload based Characteristics  Try to find characteristic signatures in packet/flow payloads  Very high accuracy Limitations  Packet contents are required  Computationally expensive  Difficult to keep up to date  Connection encryption  Privacy legislation Consequence: Not a feasible solution in our scenario
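
A rough sketch of signature matching, assuming the first bytes of each flow's payload are available. The three patterns are simplified illustrations written for this example; they are not the signatures used by L7-filter or any other tool.

```python
import re

# Simplified, illustrative signatures (assumptions for this sketch).
SIGNATURES = [
    ("http", re.compile(rb"^(GET|POST|HEAD) \S+ HTTP/1\.[01]")),
    ("ssh", re.compile(rb"^SSH-\d\.\d")),
    ("bittorrent", re.compile(rb"^\x13BitTorrent protocol")),
]


def classify_by_payload(payload):
    """Return the first application whose signature matches the payload."""
    for app, pattern in SIGNATURES:
        if pattern.search(payload):
            return app
    return "unknown"


print(classify_by_payload(b"GET /index.html HTTP/1.1\r\nHost: example.org\r\n"))
# An encrypted payload carries no recognizable signature.
print(classify_by_payload(b"\x17\x03\x01\x00\x20\x8a\x11\xf3"))
```

Every payload must be scanned against every signature, which hints at why the cost grows with both link speed and the number of applications to recognize.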

Machine Learning Subfield of Artificial Intelligence Process that allows computers to extract knowledge (to learn) from examples (training set) Characteristics  Packet contents are not required  High accuracy  Respects privacy legislation  Computationally viable Limitations  Difficult training phase  Needs to be retrained

Supervised learning Classification techniques create knowledge structures that classify new instances into pre-defined classes. The knowledge learnt can be presented as:  Decision tree  Flowchart  Classification rules Training dataset:  Object: Represented as a vector of features  Class: Value to be predicted (label obtained “manually”)

Unsupervised learning Clustering methods find the best partition based on similarities among the examples Labels are not available for the training phase Clustering methods:  K-Means algorithm  Incremental algorithm  Probability-based
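
As an illustration of clustering without labels, here is a minimal K-Means sketch in plain Python. The flow values (average packet size, duration) are made up, the centroids are simply initialized from the first `k` points to keep the run deterministic, and the features are left unscaled for brevity.

```python
def kmeans(points, k, iterations=10):
    """Minimal K-Means: returns (cluster index per point, centroids)."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Made-up flows: (average packet size in bytes, duration in seconds).
flows = [(60.0, 0.01), (1400.0, 0.20), (64.0, 0.02),
         (1450.0, 0.25), (70.0, 0.01), (1380.0, 0.22)]
assignment, centroids = kmeans(flows, k=2)
print(assignment)
print(centroids)
```

The clusters come out unlabeled: mapping each cluster to an application is a separate step.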

Supervised vs Unsupervised learning Supervised methods:  Need a complete pre-labeled dataset  Better accuracy for predefined classes  No detection of new classes  Difficult to detect when retraining is needed Unsupervised methods:  Do not need a completely labeled dataset  Automatic detection of new classes  Better accuracy for new classes

Feature selection Methods to detect irrelevant or redundant features.  Improve the accuracy  Reduce the computational load Wrapper methods  Evaluate the performance of different feature subsets using the ML algorithm itself during the learning phase.  Depend on the ML algorithm Filter methods  Make an independent assessment based on general characteristics of the data  Independent of the ML algorithm  e.g. Correlation-based Feature Selection (CFS), typically combined with a search strategy such as Best-First
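
To make the idea of a redundant feature concrete, here is a deliberately simple filter-style sketch: it drops any feature that is almost perfectly correlated with one already kept. This is not CFS itself, and the threshold, feature names, and values are assumptions for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def drop_redundant(features, threshold=0.95):
    """Keep a feature only if it is not strongly correlated with a kept one."""
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Made-up per-flow values; "bytes" is exactly 60 bytes per packet here.
features = {
    "packets": [3, 10, 25, 40, 100],
    "bytes": [180, 600, 1500, 2400, 6000],
    "duration": [0.5, 0.1, 2.0, 0.3, 1.2],
}
print(drop_redundant(features))
```

Because the check never consults a classifier, it behaves like a filter method; a wrapper method would instead retrain the classifier on each candidate subset.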

Proposed method Supervised identification based on C4.5 algorithm  Developed by Ross Quinlan as an extension of ID3  Based on the construction of a classification tree  Feature selection based on maximizing the information gain Training set  Actual traffic flows  Pairs (feature vector, application)  Feature vector contains relevant characteristics of traffic flows  The application of each flow is identified “manually”
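
The information-gain criterion can be sketched as follows for a single numeric feature and a binary split. The sample values and labels are invented; C4.5 itself refines this criterion (gain ratio, pruning), which this sketch omits.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting on value <= threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0
    n = len(labels)
    remainder = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - remainder


def best_threshold(values, labels):
    """Candidate threshold with the highest information gain."""
    candidates = sorted(set(values))[:-1]
    return max(candidates, key=lambda t: information_gain(values, labels, t))


# Invented training pairs: average packet size and manually assigned class.
avg_size = [70, 80, 90, 1200, 1300, 1400]
app = ["dns", "dns", "dns", "http", "http", "http"]
t = best_threshold(avg_size, app)
print(t, information_gain(avg_size, app, t))
```

A tree is grown by repeating this choice recursively on each side of the split, each time picking the feature and threshold with the best score.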

Features Requirements  Real-time extraction  Independence from packet contents Feature examples (total: 25)  Packets and bytes per flow  Flow duration  min/avg/max packet size  min/avg/max TCP window size  min/avg/max packet interarrival time  Packets with flags PUSH, URG, DF, … set  Average increase of IPID  OS estimation (source and destination)  Also ports and protocols (but not in the traditional way)  …
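
A few of the listed features can be computed from packet headers alone, as in the sketch below. The packet representation (timestamp, size, TCP flags byte) and the dictionary keys are assumptions for this example; only a subset of the 25 features is shown.

```python
PSH = 0x08  # TCP PUSH flag bit


def extract_features(packets):
    """Compute header-only features for one flow.

    packets: non-empty list of (timestamp_s, size_bytes, tcp_flags),
    sorted by timestamp.
    """
    times = [ts for ts, _, _ in packets]
    sizes = [size for _, size, _ in packets]
    gaps = [b - a for a, b in zip(times, times[1:])]
    return {
        "packets": len(packets),
        "bytes": sum(sizes),
        "duration": times[-1] - times[0],
        "min_size": min(sizes),
        "avg_size": sum(sizes) / len(sizes),
        "max_size": max(sizes),
        "avg_interarrival": sum(gaps) / len(gaps) if gaps else 0.0,
        "push_packets": sum(1 for _, _, flags in packets if flags & PSH),
    }


# Invented three-packet flow: a SYN followed by two PSH+ACK segments.
flow = [(0.0, 60, 0x02), (0.5, 1500, 0x18), (1.5, 540, 0x18)]
features = extract_features(flow)
print(features)
```

All of these values can be maintained incrementally (running counters, minimum, maximum, sum), which is what makes real-time extraction plausible.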

Training Phase (I) Collection of training traffic  Representative of the environment to be monitored  Flow aggregation (at transport level)  Feature extraction Manual classification of training flows  Offline analysis of packet contents  Using pattern matching algorithms (e.g. L7-filter)  Manual inspection of the remaining flows Alternative  Generate artificial traffic under a controlled environment  Manual identification is not required  Solves encryption and privacy issues
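
Flow aggregation at transport level can be sketched by grouping packets on their 5-tuple. Treating both directions of a connection as one flow is an assumption of this sketch, as are the packet field names; the addresses come from documentation ranges.

```python
from collections import defaultdict


def flow_key(pkt):
    """Direction-independent 5-tuple key (assumes bidirectional flows)."""
    a = (pkt["src"], pkt["sport"])
    b = (pkt["dst"], pkt["dport"])
    return (pkt["proto"],) + (a + b if a <= b else b + a)


def aggregate(packets):
    """Group packets into flows keyed by their 5-tuple."""
    flows = defaultdict(list)
    for pkt in packets:
        flows[flow_key(pkt)].append(pkt)
    return dict(flows)


packets = [
    {"proto": "tcp", "src": "192.0.2.1", "sport": 51234,
     "dst": "198.51.100.7", "dport": 80},
    {"proto": "tcp", "src": "198.51.100.7", "sport": 80,
     "dst": "192.0.2.1", "dport": 51234},
    {"proto": "udp", "src": "192.0.2.1", "sport": 40000,
     "dst": "198.51.100.7", "dport": 53},
]
flows = aggregate(packets)
print(len(flows))
```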

Training Phase (II) Construction of the classification tree  C4.5 algorithm  Input: Classified training flows  Output: classification tree (contains flow features only) Software employed: Weka  University of Waikato (New Zealand)  GNU GPL license  Written in Java  http://www.cs.waikato.ac.nz/ml/weka
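
Weka reads training data in its ARFF text format, so the classified flows need to be exported before a tree can be built (Weka's C4.5 implementation is named J48). The sketch below writes such a file body; the relation name, feature names, and values are assumptions, and the slides do not say how the authors prepared their input.

```python
def to_arff(relation, feature_names, rows):
    """Serialize (feature vector, class label) pairs as ARFF text."""
    classes = sorted({label for _, label in rows})
    lines = ["@relation " + relation, ""]
    lines += ["@attribute %s numeric" % name for name in feature_names]
    lines.append("@attribute class {%s}" % ",".join(classes))
    lines += ["", "@data"]
    for features, label in rows:
        lines.append(",".join(str(v) for v in features) + "," + label)
    return "\n".join(lines) + "\n"


# Invented training pairs: (avg_size, duration) and the manual label.
rows = [((70.0, 0.1), "dns"), ((1200.0, 0.8), "http")]
arff = to_arff("flows", ["avg_size", "duration"], rows)
print(arff)
```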

Deployment Implementation in SMARTxAC  Flow aggregation  Real-time feature extraction (requirement)  Classification of each flow using the classification tree  Computationally lightweight and applicable in real time There is no need to:  Analyze packet contents  Rely only on port numbers  Apply pattern search algorithms  Manually inspect the packets But it is required to:  Retrain the system occasionally –New applications –Changes to existing ones
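
Once trained, classifying a flow amounts to walking the tree from the root to a leaf, one threshold comparison per level. The tree below is entirely hypothetical (features, thresholds, and classes are invented) and only shows why this step is cheap enough for real-time use.

```python
# Hypothetical tree: (feature, threshold, subtree_if_<=, subtree_if_>).
# A plain string is a leaf holding the predicted application.
TREE = ("avg_size", 200.0,
        "dns",
        ("duration", 5.0, "http", "p2p"))


def classify(tree, flow):
    """Walk the tree until a leaf (a class name) is reached."""
    while not isinstance(tree, str):
        feature, threshold, low, high = tree
        tree = low if flow[feature] <= threshold else high
    return tree


print(classify(TREE, {"avg_size": 80.0, "duration": 0.2}))
print(classify(TREE, {"avg_size": 900.0, "duration": 600.0}))
```

The cost per flow is bounded by the depth of the tree, independent of how many applications it distinguishes.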

Accuracy

Application breakdown Port-based Machine learning

Application breakdown timeseries Port-based Machine learning

Summary 1) Collection of the training set Representative flows of the environment to be monitored Alternatively artificially generated 2) Feature extraction from the training flows 3) Manual flow classification → application class Pattern matching and manual inspection It can be simplified if an artificial training set is used in 1) 4) Construction of a C4.5 classification tree E.g. using Weka 5) Deployment of the tree obtained in 4) in the monitoring system 6) Retraining of the system Starting from phase 1)

Conclusions Traditional method based on well-known ports  Low accuracy due to dynamic ports Identification based on pattern matching  Not feasible in high-speed links due to computational cost  Depends on packet content  Does not work with encryption Identification based on machine learning  Feasible in high-speed links  Does not require packet content  Experimental results show accuracy > 95%  Requires occasional retraining Future work  Retraining the method with the new scenario  Make the training phase as automatic as possible

Thank you for your attention Questions?
