
1 Centre de Comunicacions Avançades de Banda Ampla (CCABA) Universitat Politècnica de Catalunya (UPC) Identification of Network Applications based on Machine Learning Techniques Terena Networking Conference 2008 Valentín Carela-Español Pere Barlet-Ros Josep Solé-Pareta {vcarela, pbarlet, pareta}@ac.upc.edu

2 Outline Motivations and objectives Scenario and requirements Existing solutions • Well-known ports • Payload based (pattern matching) • Machine Learning –Supervised –Unsupervised Proposed method Summary and conclusions

3 Motivations and objectives The typical method based on well-known ports is no longer valid for identifying applications Network administration and management tasks • Network dimensioning, capacity planning, network performance evaluation, … QoS monitoring • Class-of-Service mapping • Quality-of-Service policies • Possible basis for QoS pricing

5 Scenario: SMARTxAC SMARTxAC: Traffic Monitoring and Analysis System for the Anella Científica • Operational since July 2003 • Developed under a CESCA-UPC collaboration agreement • Tailor-made traffic monitoring system for the Anella Científica Main objectives • Low-cost platform • Continuous monitoring of high-speed links without packet loss • Detection of network anomalies and irregular usage • Multi-user system: network operators and institutions Measurement of two full-duplex GigE links • Connection between Anella Científica and RedIRIS

6 Measurement scenario

7 Requirements Real-time classification Independent from packet contents High-speed links Without packet loss High accuracy Method implemented in SMARTxAC

9 Well-known ports Characteristics • Use of well-known ports from IANA • Packet inspection is not needed • Computationally lightweight Limitations (especially due to new P2P applications) • Dynamic ports • HTTP requests • New applications do not register their ports with IANA Consequence: Very low accuracy
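
As a rough illustration of the port-based approach, the lookup below maps a transport protocol and port to an application name. It is only a sketch: the table is a tiny hand-picked excerpt, and the names `WELL_KNOWN_PORTS` and `classify_by_port` are hypothetical, not part of SMARTxAC.

```python
# Hypothetical excerpt of an IANA-style port table (a real table has thousands of entries).
WELL_KNOWN_PORTS = {
    ("tcp", 22): "ssh",
    ("tcp", 25): "smtp",
    ("udp", 53): "dns",
    ("tcp", 80): "http",
    ("tcp", 443): "https",
}


def classify_by_port(protocol, src_port, dst_port):
    """Return the application registered for either port, or "unknown"."""
    # Check the destination port first, then the source port (for reply traffic).
    for port in (dst_port, src_port):
        app = WELL_KNOWN_PORTS.get((protocol, port))
        if app is not None:
            return app
    return "unknown"


if __name__ == "__main__":
    print(classify_by_port("tcp", 51000, 80))     # registered port
    print(classify_by_port("tcp", 51000, 40001))  # dynamic port, not in the table
```

The lookup is cheap, but a flow on a dynamic port comes back as "unknown", and any application that chooses to run on port 80 is reported as "http", which is the accuracy problem the slide describes.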

10 Well-known ports example

11 Payload based Characteristics • Tries to find characteristic signatures in packet/flow payloads • Very high accuracy Limitations • Packet contents are required • Computationally expensive • Signatures are difficult to keep up to date • Connection encryption • Privacy legislation Consequence: Not a feasible solution in our scenario
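
A minimal sketch of payload-based identification, assuming a handful of simplified signatures written for illustration (they are not the patterns shipped with any real tool, and `SIGNATURES` and `classify_by_payload` are hypothetical names):

```python
import re

# Simplified, illustrative signatures matched against the first payload bytes of a flow.
SIGNATURES = [
    ("http", re.compile(rb"^(GET|POST|HEAD) \S+ HTTP/1\.[01]")),
    ("ssh", re.compile(rb"^SSH-\d\.\d")),
    ("bittorrent", re.compile(rb"^\x13BitTorrent protocol")),
]


def classify_by_payload(payload):
    """Return the first application whose signature matches, or "unknown"."""
    for app, pattern in SIGNATURES:
        if pattern.search(payload):
            return app
    return "unknown"


if __name__ == "__main__":
    print(classify_by_payload(b"GET /index.html HTTP/1.1\r\nHost: example.org\r\n"))
    print(classify_by_payload(b"\x17\x03\x03\x00\x20" + b"\x00" * 32))  # opaque bytes
```

Every flow must be scanned against every signature, the list has to grow with each new application, and an encrypted payload matches nothing, which mirrors the limitations listed above.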

12 Machine Learning Subfield of Artificial Intelligence Process that allows computers to extract knowledge (to learn) from examples (training set) Characteristics • Packet contents are not required • High accuracy • Respects privacy legislation • Computationally viable Limitations • Difficult training phase • Needs to be retrained

13 Supervised learning Classification techniques create knowledge structures that classify new instances into pre-defined classes. The knowledge learnt can be presented as: • Decision tree • Flowchart • Classification rules Training dataset: • Object: Represented as a vector of features • Class: Value to be predicted (label obtained “manually”)

14 Unsupervised learning Clustering methods find the best partition based on similarities among the examples Labels are not available for the training phase Clustering methods: • K-Means algorithm • Incremental algorithm • Probability-based
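
To make the clustering idea concrete, here is a minimal K-Means sketch in plain Python. The two-dimensional flow vectors (average packet size, duration) and the deterministic initialisation from the first k points are assumptions made for the example, not details taken from the slides.

```python
def kmeans(points, k, iterations=20):
    """Plain K-Means on tuples of floats; returns (centroids, assignments)."""
    # Assumption: initialise with the first k points to keep the sketch deterministic.
    centroids = [tuple(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            assignments[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignments


if __name__ == "__main__":
    # Hypothetical flows described as (average packet size in bytes, duration in seconds).
    flows = [(60.0, 0.1), (1400.0, 5.0), (64.0, 0.2),
             (1450.0, 6.0), (70.0, 0.15), (1500.0, 5.5)]
    centroids, assignments = kmeans(flows, k=2)
    print(centroids)
    print(assignments)
```

No labels are used at any point: the groups emerge from the distances alone, and mapping a cluster to an application name is a separate step.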

15 Supervised vs Unsupervised learning Supervised methods: • Need a complete pre-labeled dataset • Better accuracy for predefined classes • No detection of new classes • Difficult to detect when retraining is needed Unsupervised methods: • Do not need a completely labeled dataset • Automatic detection of new classes • Better accuracy for new classes

16 Feature selection Methods to detect irrelevant or redundant features. • Improve the accuracy • Reduce the computational load Wrapper methods • Evaluate the performance of different feature subsets using the ML algorithm of the learning phase • Depend on the ML algorithm • e.g. Best-First search over feature subsets, scored by the classifier Filter methods • Make an independent assessment based on general characteristics of the data • Independent of the ML algorithm • e.g. Correlation-based Feature Selection (CFS)
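
A filter-style assessment can be sketched as a ranking of features by their absolute Pearson correlation with a numeric class label, computed from the data alone and without running any learning algorithm. This is a deliberately simplified stand-in and not CFS itself; the feature names and the 0/1 labels are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def rank_features(rows, labels, names):
    """Return (score, name) pairs sorted from most to least relevant."""
    scores = []
    for j, name in enumerate(names):
        column = [row[j] for row in rows]
        scores.append((abs(pearson(column, labels)), name))
    return sorted(scores, reverse=True)


if __name__ == "__main__":
    rows = [(100, 7, 1), (120, 3, 1), (1400, 5, 1), (1500, 4, 1)]
    labels = [0, 0, 1, 1]  # hypothetical numeric class labels
    for score, name in rank_features(rows, labels, ["avg_pkt_size", "noise", "constant"]):
        print(f"{name}: {score:.3f}")
```

A wrapper method would instead train and evaluate the classifier on each candidate subset, which is more expensive but tuned to that classifier.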

18 Proposed method Supervised identification based on the C4.5 algorithm • Developed by Ross Quinlan as an extension of ID3 • Based on the construction of a classification tree • Feature selection at each node based on maximizing the information gain (normalized as the gain ratio in C4.5) Training set • Actual traffic flows • Pairs of (feature vector, application) • Feature vector contains relevant characteristics of traffic flows • The application of each flow is identified “manually”
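
The split criterion can be sketched for a single numeric feature and a candidate threshold. The code computes the information gain of the split and divides it by the split information, which is the gain ratio used by C4.5. The sample values and labels are made up for the example, and a full C4.5 implementation also handles pruning, missing values and nominal attributes.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of splitting a numeric feature at `threshold` (value <= threshold goes left)."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # the split separates nothing
    total = len(labels)
    remainder = (len(left) / total) * entropy(left) + (len(right) / total) * entropy(right)
    info_gain = entropy(labels) - remainder
    # Split information: entropy of the partition sizes themselves.
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))
    return info_gain / split_info


if __name__ == "__main__":
    avg_pkt_size = [80, 90, 1400, 1500]       # hypothetical feature values
    apps = ["dns", "dns", "http", "http"]     # hypothetical manual labels
    for t in (85, 100):
        print(t, round(gain_ratio(avg_pkt_size, apps, t), 4))
```

The tree builder evaluates candidate thresholds for every feature and keeps the one with the highest score at each node.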

19 Features Requirements • Real-time extraction • Independence from packet contents Feature examples (total: 25) • Packets and bytes per flow • Flow duration • min/avg/max packet size • min/avg/max TCP window size • min/avg/max packet interarrival time • Packets with flags PUSH, URG, DF, … set • Average increase of IPID • OS estimation (source and destination) • Also ports and protocols (but not in the traditional way) • …
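
A few of the listed features can be computed from packet headers alone, as in the sketch below. The `Packet` record and the feature names are hypothetical, only a subset of the 25 features is shown, and the function assumes a non-empty, time-ordered packet list.

```python
from typing import NamedTuple


class Packet(NamedTuple):
    timestamp: float  # capture time in seconds
    size: int         # packet size in bytes
    push: bool        # True if the TCP PUSH flag is set


def extract_features(packets):
    """Compute header-only features for one flow (non-empty, time-ordered packets)."""
    sizes = [p.size for p in packets]
    times = [p.timestamp for p in packets]
    # Interarrival times; a single-packet flow gets one zero gap.
    gaps = [b - a for a, b in zip(times, times[1:])] or [0.0]
    return {
        "packets": len(packets),
        "bytes": sum(sizes),
        "duration": times[-1] - times[0],
        "min_size": min(sizes),
        "avg_size": sum(sizes) / len(sizes),
        "max_size": max(sizes),
        "min_iat": min(gaps),
        "avg_iat": sum(gaps) / len(gaps),
        "max_iat": max(gaps),
        "push_packets": sum(1 for p in packets if p.push),
    }


if __name__ == "__main__":
    flow = [Packet(0.0, 60, False), Packet(0.5, 1500, True), Packet(2.0, 40, False)]
    print(extract_features(flow))
```

Each value can also be maintained incrementally as packets arrive (running count, sum, min and max), which is what makes real-time extraction practical.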

20 Training Phase (I) Collection of training traffic • Representative of the environment to be monitored • Flow aggregation (at transport level) • Feature extraction Manual classification of training flows • Offline analysis of packet contents • Using pattern matching algorithms (e.g. L7-filter) • Manual inspection of the remaining flows Alternative • Generate artificial traffic in a controlled environment • Manual identification is not required • Solves encryption and privacy issues
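
Flow aggregation at transport level can be sketched as grouping packets by their 5-tuple. The dictionary keys used for packets are hypothetical, and the sketch treats each direction as a separate flow and ignores flow timeouts, both of which a production monitor would need to decide on.

```python
from collections import defaultdict


def aggregate_flows(packets):
    """Group packets by (protocol, src_ip, src_port, dst_ip, dst_port)."""
    flows = defaultdict(list)
    for pkt in packets:
        key = (pkt["proto"], pkt["src_ip"], pkt["src_port"],
               pkt["dst_ip"], pkt["dst_port"])
        flows[key].append(pkt)
    return dict(flows)


if __name__ == "__main__":
    # Hypothetical packets using documentation address ranges.
    packets = [
        {"proto": "tcp", "src_ip": "192.0.2.1", "src_port": 51000,
         "dst_ip": "198.51.100.7", "dst_port": 80, "size": 60},
        {"proto": "udp", "src_ip": "192.0.2.1", "src_port": 40000,
         "dst_ip": "198.51.100.9", "dst_port": 53, "size": 72},
        {"proto": "tcp", "src_ip": "192.0.2.1", "src_port": 51000,
         "dst_ip": "198.51.100.7", "dst_port": 80, "size": 1500},
    ]
    for key, pkts in aggregate_flows(packets).items():
        print(key, len(pkts))
```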

21 Training Phase (II) Construction of the classification tree • C4.5 algorithm • Input: Classified training flows • Output: classification tree (contains flow features only) Software employed: Weka • University of Waikato (New Zealand) • GNU GPL license • Written in Java • http://www.cs.waikato.ac.nz/ml/weka

22 Deployment Implementation in SMARTxAC • Flow aggregation • Real-time feature extraction (requirement) • Classification of each flow using the classification tree • Computationally lightweight and applicable in real time There is no need to: • Analyze packet contents • Rely only on port numbers • Apply pattern search algorithms • Manually inspect the packets But it is required to: • Retrain the system occasionally –New applications –Changes to existing ones
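
Once the tree is built, classifying a flow is a short walk from the root to a leaf, which is why this step is cheap enough for real-time use. The tree below is invented for illustration (its features, thresholds and class names are hypothetical and do not come from the trained SMARTxAC model).

```python
# Hypothetical tree: internal nodes are (feature, threshold, left, right);
# leaves are application names. Values <= threshold follow the left branch.
TREE = (
    "avg_size", 200.0,
    ("packets", 4, "dns", "ssh"),
    ("push_packets", 10, "http", "p2p"),
)


def classify(features, node=TREE):
    """Follow threshold tests from the root until a leaf (a string) is reached."""
    while not isinstance(node, str):
        feature, threshold, left, right = node
        node = left if features[feature] <= threshold else right
    return node


if __name__ == "__main__":
    print(classify({"avg_size": 90.0, "packets": 2, "push_packets": 0}))
    print(classify({"avg_size": 900.0, "packets": 50, "push_packets": 30}))
```

The cost per flow is one comparison per tree level, with no payload access and no signature scan.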

23 Accuracy

24 Application breakdown Port-based Machine learning

25 Application breakdown timeseries Port-based Machine learning

27 Summary 1) Collection of the training set Representative flows of the environment to be monitored Alternatively artificially generated 2) Feature extraction from the training flows 3) Manual flow classification → application class Pattern matching and manual inspection It can be simplified if an artificial training set is used in 1) 4) Construction of a C4.5 classification tree E.g. using Weka 5) Deployment of the tree obtained in 4) in the monitoring system 6) Retraining of the system Starting from phase 1)

28 Conclusions Traditional method based on well-known ports • Low accuracy due to dynamic ports Identification based on pattern matching • Not feasible on high-speed links due to computational cost • Depends on packet content • Does not work with encryption Identification based on machine learning • Feasible on high-speed links • Does not require packet content • Experimental results show accuracy > 95% • Requires occasional retraining Future work • Retraining the method with the new scenario • Making the training phase as automatic as possible

29 Thank you for your attention Questions?

