Traffic classification and applications to traffic monitoring Marco Mellia Electronic and Telecommunication Department Politecnico di Torino

Slides:



Advertisements
Similar presentations
Wenke Lee and Nick Feamster Georgia Tech Botnet and Spam Detection in High-Speed Networks.
Advertisements

20.1 Chapter 20 Network Layer: Internet Protocol Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
KISS: Stochastic Packet Inspection for UDP Traffic Classification
Supervised Learning Techniques over Twitter Data Kleisarchaki Sofia.
CCNA – Network Fundamentals
© 2007 Cisco Systems, Inc. All rights reserved.Cisco Public ITE PC v4.0 Chapter 1 1 OSI Transport Layer Network Fundamentals – Chapter 4.
NAT/Firewall Traversal April NAT revisited – “port-translating NAT”
1 TCP - Part I Relates to Lab 5. First module on TCP which covers packet format, data transfer, and connection management.
1 Chapter 3 TCP and IP. Chapter 3 TCP and IP 2 Introduction Transmission Control Protocol (TCP) Transmission Control Protocol (TCP) User Datagram Protocol.
CS3505 The Internet and Info Hiway transport layer protocols : TCP/UDP.
BZUPAGES.COM 1 User Datagram Protocol - UDP RFC 768, Protocol 17 Provides unreliable, connectionless on top of IP Minimal overhead, high performance –No.
The testbed environment for this research to generate real-world Skype behaviors for analyzation is as follows: A NAT-ed LAN consisting of 7 machines running.
Internet Traffic Classification KISS Dario Bonfiglio, Alessandro Finamore, Marco Mellia, Michela Meo, Dario Rossi 1.
Marios Iliofotou (UC Riverside) Brian Gallagher (LLNL)Tina Eliassi-Rad (Rutgers University) Guowu Xi (UC Riverside)Michalis Faloutsos (UC Riverside) ACM.
 Firewalls and Application Level Gateways (ALGs)  Usually configured to protect from at least two types of attack ▪ Control sites which local users.
1 Internet Networking Spring 2004 Tutorial 13 LSNAT - Load Sharing NAT (RFC 2391)
5/1/2006Sireesha/IDS1 Intrusion Detection Systems (A preliminary study) Sireesha Dasaraju CS526 - Advanced Internet Systems UCCS.
Treatment-Based Traffic Signatures Mark Claypool Robert Kinicki Craig Wills Computer Science Department Worcester Polytechnic Institute
1 Summer Report Reporter : Yi-Cheng Lin Data: 2008/09/02.
05/06/2005CSIS © M. Gibbons On Evaluating Open Biometric Identification Systems Spring 2005 Michael Gibbons School of Computer Science & Information Systems.
5/12/05CS118/Spring051 A Day in the Life of an HTTP Query 1.HTTP Brower application Socket interface 3.TCP 4.IP 5.Ethernet 2.DNS query 6.IP router 7.Running.
1 Spring Semester 2007, Dept. of Computer Science, Technion Internet Networking recitation #12 LSNAT - Load Sharing NAT (RFC 2391)
Assessing the Nature of Internet traffic: Methods and Pitfalls Wolfgang John Chalmers University of Technology, Sweden together with Min Zhang Beijing.
Gursharan Singh Tatla Transport Layer 16-May
23 October 2002Emmanuel Ormancey1 Spam Filtering at CERN Emmanuel Ormancey - 23 October 2002.
Tracking down Traffic Dario Bonfiglio Marco Mellia Michela Meo Nicolo’ Ritacca Dario Rossi.
FIREWALL TECHNOLOGIES Tahani al jehani. Firewall benefits  A firewall functions as a choke point – all traffic in and out must pass through this single.
Network Planète Chadi Barakat
1 Transport Layer Computer Networks. 2 Where are we?
Revealing Skype Traffic: When Randomness Plays with You D. Bonfiglio 1, M. Mellia 1, M. Meo 1, D. Rossi 2, P. Tofanelli 3 Dipartimento di Elettronica,
DPNM, POSTECH 1/23 NOMS 2010 Jae Yoon Chung 1, Byungchul Park 1, Young J. Won 1 John Strassner 2, and James W. Hong 1, 2 {dejavu94, fates, yjwon, johns,
What is a Protocol A set of definitions and rules defining the method by which data is transferred between two or more entities or systems. The key elements.
1 Understanding VoIP from Backbone Measurements Marco Mellia, Dario Rossi Robert Birke, and Michele Petracca INFOCOM 07’, Anchorage, Alaska, USA Young.
Data mining and machine learning A brief introduction.
Introduction to Networks CS587x Lecture 1 Department of Computer Science Iowa State University.
DoWitcher: Effective Worm Detection and Containment in the Internet Core S. Ranjan et. al in INFOCOM 2007 Presented by: Sailesh Kumar.
Packet Filtering & Firewalls. Stateless Packet Filtering Assume We can classify a “good” packet and/or a “bad packet” Each rule can examine that single.
1 The Internet and Networked Multimedia. 2 Layering  Internet protocols are designed to work in layers, with each layer building on the facilities provided.
TCP1 Transmission Control Protocol (TCP). TCP2 Outline Transmission Control Protocol.
On the processing time for detection of Skype traffic P.M. Santiago del Río, J. Ramos, J.L. García-Dorado, J. Aracil Universidad Autónoma de Madrid A.
IEEE Communications Surveys & Tutorials 1st Quarter 2008.
Empirical Research Methods in Computer Science Lecture 7 November 30, 2005 Noah Smith.
Network Programming Eddie Aronovich mail:
Data Mining Algorithms for Large-Scale Distributed Systems Presenter: Ran Wolff Joint work with Assaf Schuster 2003.
Wide-scale Botnet Detection and Characterization Anestis Karasaridis, Brian Rexroad, David Hoeflin In First Workshop on Hot Topics in Understanding Botnets,
CINBAD CERN/HP ProCurve Joint Project on Networking 26 May 2009 Ryszard Erazm Jurga - CERN Milosz Marian Hulboj - CERN.
Classification (slides adapted from Rob Schapire) Eran Segal Weizmann Institute.
Presenter: Kuei-Yu Hsu Advisor: Dr. Kai-Wei Ke 2013/4/29 Detecting Skype flows Hidden in Web Traffic.
Transport Protocols.
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
9/29/04 GGF Random Thoughts on Application Performance and Network Characteristics Distributed Systems Department Lawrence Berkeley National Laboratory.
McGraw-Hill Chapter 23 Process-to-Process Delivery: UDP, TCP Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
TCP/IP1 Address Resolution Protocol Internet uses IP address to recognize a computer. But IP address needs to be translated to physical address (NIC).
@Yuan Xue CS 283Computer Networks Spring 2011 Instructor: Yuan Xue.
2: Transport Layer 11 Transport Layer 1. 2: Transport Layer 12 Part 2: Transport Layer Chapter goals: r understand principles behind transport layer services:
A special acknowledge goes to J.F Kurose and K.W. Ross Some of the slides used in this lecture are adapted from their original slides that accompany the.
Skype.
P.Demestichas (1), S. Vassaki(2,3), A.Georgakopoulos(2,3)
The Transport Layer Implementation Services Functions Protocols
On-line Detection of Real Time Multimedia Traffic
Authors – Johannes Krupp, Michael Backes, and Christian Rossow(2016)
Process-to-Process Delivery, TCP and UDP protocols
Chapter 14 User Datagram Program (UDP)
Transport Protocols Relates to Lab 5. An overview of the transport protocols of the TCP/IP protocol suite. Also, a short discussion of UDP.
TCP - Part I Relates to Lab 5. First module on TCP which covers packet format, data transfer, and connection management.
Chapter 14 User Datagram Protocol (UDP)
Process-to-Process Delivery:
Net 323 D: Networks Protocols
Process-to-Process Delivery: UDP, TCP
Internet Traffic Classification Using Bayesian Analysis Techniques
Presentation transcript:

Traffic classification and applications to traffic monitoring Marco Mellia Electronic and Telecommunication Department Politecnico di Torino 1

3° Cost-TMA PhD school – Krakow – Feb Traffic Classification & Measurement ? Why?  Identify normal and anomalous behavior  Characterize the network and its users  Quality of service  Filtering  … How? How?  By means of passive measurement 2

3° Cost-TMA PhD school – Krakow – Feb Scenario Traffic classifier  Deep packet inspection  Statistical methods Persistent and scalable monitoring platform  Round Robin Database (RRD)  Histograms Internal Clients Edge Router External Servers

3° Cost-TMA PhD school – Krakow – Feb Tstat at a Glance

3° Cost-TMA PhD school – Krakow – Feb Worm and Viruses? Did someone open a Christmas card? Happy new year to Windows!!

3° Cost-TMA PhD school – Krakow – Feb Anomalies (Good!) Spammer Disappear McColo SpamNet shut off on Tuesday, November 11th, 2008 Spammer Disappear McColo SpamNet shut off on Tuesday, November 11th, 2008

3° Cost-TMA PhD school – Krakow – Feb New Applications – P2PTV Fiorentina 4 - Udinese 2 Inter 1 - Juventus 0

3° Cost-TMA PhD school – Krakow – Feb Megaupload blocked 19/01/12

3° Cost-TMA PhD school – Krakow – Feb How to monitor traffic? All previous examples rely on the availability of a CLASSIFIER  A tool that can discriminate classes of traffic Classification: the problem of assigning a class to an observation  The set of classes is pre-defined  The output may be correct or not 9

3° Cost-TMA PhD school – Krakow – Feb Some terminology Question: Is this a cat, a rabbit, or a dog? 10

3° Cost-TMA PhD school – Krakow – Feb Some terminology Question: Is this a cat, a rabbit, or a dog? 11

3° Cost-TMA PhD school – Krakow – Feb How to compute performance? Confusion matrix  On rows we have the actual class  On columns we have the predicted class Allows to see if some confusion arises 12 Predicted class CatDogRabbit Actual class Cat530 Dog231 Rabbit0211

3° Cost-TMA PhD school – Krakow – Feb How to compute performance? Confusion matrix True positive  It was classified as a cat, and it was a cat 13 Predicted class CatDogRabbit Actual class Cat530 Dog231 Rabbit0211

3° Cost-TMA PhD school – Krakow – Feb How to compute performance? Confusion matrix False negative  It was classified NOT as a cat, but it was a cat 14 Predicted class CatDogRabbit Actual class Cat530 Dog231 Rabbit0211

3° Cost-TMA PhD school – Krakow – Feb How to compute performance? Confusion matrix True negative  It was classified NOT as a cat, and it was NOT a cat 15 Predicted class CatDogRabbit Actual class Cat530 Dog231 Rabbit0211

3° Cost-TMA PhD school – Krakow – Feb How to compute performance? Confusion matrix False positive  It was classified as a cat, but it was NOT a cat 16 Predicted class CatDogRabbit Actual class Cat530 Dog231 Rabbit0211

3° Cost-TMA PhD school – Krakow – Feb Other metrics Accuracy: is the ratio of the sum of all True Positives to the sum of all tests, for all classes. It is biased toward the most predominant class in a data set.  Consider for example a test to identify patients that suffer from a disease that affects 10 patient over 100 tests. The classifier that always returns ``sane'' will have accuracy of 90%. 17

3° Cost-TMA PhD school – Krakow – Feb Predicted class CatDogRabbit Actual class Cat530 Dog231 Rabbit0211 Other metrics Recall of a class: is the ratio of the True Positives and the sum of True Positives and False Negatives.  Recall(cat)=5/(5+3+0)  It is a measure of the ability of a classifier to select instances of the given class from a data set 18

3° Cost-TMA PhD school – Krakow – Feb Other metrics Precision of a class: is the ratio of True Positives and the sum of True Positives and False Positive  Precision(cat) = 5/(5+2+0)  It is a metric that measure how precise is the classifier in labeling only samples of a given class 19 Predicted class CatDogRabbit Actual class Cat530 Dog231 Rabbit0211

3° Cost-TMA PhD school – Krakow – Feb Traffic classification Look at the packets… Tell me what protocol and/or application generated them

3° Cost-TMA PhD school – Krakow – Feb Port: Port: 4662/4672 Port: Payload: “bittorrent” Payload: E4/E5 Payload: RTP protocol SkypeBittorrent GtalkeMule Typical approach: Deep Packet Inspection (DPI)

3° Cost-TMA PhD school – Krakow – Feb The problem of traffic classification Deep Packet Inspection  Based on looking for some pre-defined payload patterns, deep in the packet Simple at L2-L4  “if ethertype == 0x0800, then there is an IP packet”  Usually done with a set of if-then-else or even switch- case Ambiguous at L7  TCP port 80 does not mean automatically “protocol HTTP”

3° Cost-TMA PhD school – Krakow – Feb DPI: Rule-set complexity Practical rule-sets:  Snort, as of November rules, 5549 Perl Compatible Regular Expressions  OpenDPI as of February protocols  Tstat as of February 2012 Approx 200 classes/services 23 Deep packet inspection Regular expression matching at line rate Finite Automata based techniques =

3° Cost-TMA PhD school – Krakow – Feb Some notes... Protocol identification… … or application verification?  Skype can use the standard HTTP protocol to exchange data  Is that traffic “Skype” or “HTTP”? Today everything is going over HTTP  Is it Facebook? Twitter? YouTube video? Or HTTP?

3° Cost-TMA PhD school – Krakow – Feb The question Which granularity are you interested into ?? 25

3° Cost-TMA PhD school – Krakow – Feb Several approaches to traffic classification Content- based Packet-based (e.g., Spatscheck) Port-based (stateless) Statistical methods Host social behaviour (e.g., Faloutsos) Traffic statistics (e.g., Salgarelli, Baiocchi, Moore, Mellia) Auto-learning methods (e.g. Bayes) Preclassified bins Message- based Protocol behaviour (e.g., BinPac, SML) Traffic classification Pre-computed or auto- learning signatures Payload-based (stateful)

3° Cost-TMA PhD school – Krakow – Feb Some references

3° Cost-TMA PhD school – Krakow – Feb Some references Some critical overview/tutorial  T.T.T.Nguyen, G.Armitage. A survey of techniques for internet traffic classification using machine learning, Communications Surveys & Tutorials, IEEE, V.10, N.4, pp  H.Kim, KC Claffy, M.Fomenkov, D.Barman, M.Faloutsos, K.Lee. Internet traffic classification demystified: myths, caveats, and the best practices. In Proceedings of the 2008 ACM CoNEXT Conference (CoNEXT '08). ACM, New York, NY, USA, For this class:  D. Bonfiglio, M. Mellia, M. Meo, D. Rossi, P. Tofanelli Revealing skype traffic: when randomness plays with you ACM SIGCOMM Kyoto, JP, ISBN: , 27 August  A. Finamore, M. Mellia, M. Meo, D. Rossi KISS: Stochastic Packet Inspection 1st TMA Workshop, Aachen, 11 May  A. Finamore, M. Mellia, M. Meo, D. Rossi, KISS: Stochastic Packet Inspection Classifier for UDP Traffic, IEEE/ACM Transactions on Networking "5", Vol. 18, pp , ISSN: , October  G. La Mantia, D. Rossi, A. Finamore, M. Mellia, M. Meo, Stochastic Packet Inspection for TCP Traffic, IEEE ICC, Cape Town, South Africa, 23 May 2010.

3° Cost-TMA PhD school – Krakow – Feb Port: Port: 4662/4672 Port: Payload: “bittorrent” Payload: E4/E5 Payload: RTP protocol SkypeBittorrent GtalkeMule Typical approach: Deep Packet Inspection (DPI) It fails more and more: P2P Encryption Proprietary solution Many different flavours

3° Cost-TMA PhD school – Krakow – Feb The Failure of DPI :29 eMule 0.49a released :25 eMule 0.49b released

3° Cost-TMA PhD school – Krakow – Feb Possible Solution: Behavioral Classifier Phase 1 Feature Phase 3 Verify 1. Statistical characterization of traffic 2. Look for the behaviour of unknown traffic and assign the class that better fits it 3. Check for possible classification mistakes Phase 2 Decision Traffic (Known) (Training) (Operation)

3° Cost-TMA PhD school – Krakow – Feb Behavioural classifiers Which statistics?  Packet size Average, std, max, min Len of first X pkts  IPG Average, std, max, min IPG of first X+1 pkts  Total size, duration, #data packets  From client, from server, from both  RTT, #concurrent connection, rtx, dups, …  TCP options, flags, signaling, … Feature selection? Which decision process?  Ad Hoc  Bayesian  Neural Networks  Decision trees  SVM  … Which training set?  Supervised techniques 32

The case of Skype Consider a simple example 33

3° Cost-TMA PhD school – Krakow – Feb Our Goal Identify Skype traffic Motivations  Operators need to know what is running in their network New business models, provisioning, TE, etc.  Understand user behaviour  Traffic characterization, security  …  It’s fun

3° Cost-TMA PhD school – Krakow – Feb Skype Overview Skype offers voice, video, chat and data transfer services over IP Closed design, proprietary solutions  P2P technology  Proprietary protocols  Encrypted communications Easy to use, difficult to reveal  It is the perfect example of DPI failure No server No well-known port … No standard No RFC State-of-the-Art Encryption/Obfuscation Mechanisms

3° Cost-TMA PhD school – Krakow – Feb Our Goal Identify Skype traffic  Voice stream first: both E2E and SkypeOut/In streams  Possible video/chat/file transfers/signaling Constraints  Passive observation of traffic  Protocol ignorance

3° Cost-TMA PhD school – Krakow – Feb Three Classifiers Naïve Bayes Classifier Payload Based Classifier Chi Square Classifier Skype? Traffic Flow Skype? ANDAND

Phase 1 – try to understand it

3° Cost-TMA PhD school – Krakow – Feb Skype as VoIP Application Skype selects the voice codec from a list  Low bit rate: kbps  Regular Inter-Packet-Gap (30 ms frames) Redundancy may be added to mitigate packet loss Framing may be modified from the original codec one Multiplexes different source into the same message (voice, video, chat, … )

3° Cost-TMA PhD school – Krakow – Feb Skype Source Model Skype Message TCP/UDP IP

Skype Header Formats (What we guess about it) Can we design a DPI classifier?

3° Cost-TMA PhD school – Krakow – Feb Possible Skype Messages Signaling and data messages  Use TCP, with ciphered payload Login, lookup, signaling… Data flow  Use UDP whenever possible: payload is encrypted … but some header MUST be exposed Question:  Some header MUST be exposed…  Why?!?? Impossible to exploit. Everything is ciphered

3° Cost-TMA PhD school – Krakow – Feb Possible Skype Messages Signaling and data messages  Use TCP, with ciphered payload Login, lookup, signaling… Data flow  Use UDP whenever possible: payload is encrypted … but some header MUST be exposed Source AES Receiver AES Impossible to exploit. Everything is ciphered Unreliable

3° Cost-TMA PhD school – Krakow – Feb Skype Source Model Skype Message TCP/UDP IP

3° Cost-TMA PhD school – Krakow – Feb SoM Format for E2E Messages Start of Message (SoM) of End2End messages carried by UDP has: ID : 16 bits long random identifier FUNC : 5 bits long function (multiplexing?), obfuscated in a Byte | ID | FUNC |

3° Cost-TMA PhD school – Krakow – Feb Function Values 0x01 = ??Query message 0x02 = ??Query 0x0d = Data 0x07 = NAK Voice Video Chat File

3° Cost-TMA PhD school – Krakow – Feb PBC SoM can be used to identify Skype flows carried by UDP  5bits long signature Question:  Which is the chance that you have a false positive? Classic signature based classifier

3° Cost-TMA PhD school – Krakow – Feb PBC SoM can be used to identify Skype flows carried by UDP  5bits long signature Question:  Which is the chance that you have a false positive? We look for 1 string out of 32 possible strings  1/32 of false detection possible Can we improve it by checking multiple packets  Yet we can have a very high false positive rate Classic signature based classifier

3° Cost-TMA PhD school – Krakow – Feb PBC Any other smart way of improving accuracy?  Hint: this is UDP 49

3° Cost-TMA PhD school – Krakow – Feb PBC SoM can be used to identify Skype flows carried by UDP  5bits long signature Classic signature based classifier

3° Cost-TMA PhD school – Krakow – Feb PBC SoM can be used to identify Skype flows carried by UDP  5bits long signature IMPROVE: Identify Skype socket address at clients  The UDP port is FIXED and not random (as in TCP)  Then, look for Skype flows with the same UDP port It works  with UDP only  at edge node only Complex Cannot discriminate VOICE/VIDEO/CHAT/DATA Classic signature based classifier

Skype Encrypts Traffic Can we leverage this?

3° Cost-TMA PhD school – Krakow – Feb Skype Source Model Skype Message TCP/UDP IP

3° Cost-TMA PhD school – Krakow – Feb Randomness Classifier Skype encrypts traffic  payload looks like random Some headers are constant ( FUNC ) Apply randomness test to the payload bits  Chi-Square test: statistic test for random sequences

3° Cost-TMA PhD school – Krakow – Feb CSC Split the payload into groups Apply the test on the values assumed at each group  Each message is an observation Some groups will contain  Random bits  Mixed bits  Deterministic bits | ID | FUNC |

3° Cost-TMA PhD school – Krakow – Feb CSC Set a threshold

Skype is a VoIP Application Which are the features that make it different from a bulk download?

3° Cost-TMA PhD school – Krakow – Feb Skype Source Model Skype Message TCP/UDP IP

3° Cost-TMA PhD school – Krakow – Feb Which features? Question: Which features would you select to differentiate a VoIP stream from a data download? 59

3° Cost-TMA PhD school – Krakow – Feb Sample Trace Regular IPG Small/regular packets

3° Cost-TMA PhD school – Krakow – Feb Naive Baysean Classifier Simple classifier: based on the a-priori prob, evaluate the a-posteriori prob  How similar is this flow to a Skype voice flow? What makes “VoIP” traffic different from other traffic?  Packet size, i.e., small packets (packet NBC)  Inter-Packet-Gap, i.e., small IPG (IPG NBC)

3° Cost-TMA PhD school – Krakow – Feb Design of the behavioral classifier We consider windows of N packets In each window, we check the IPG and the payloadsize distribution We compare against the expected distribution and compute a belief  One belief for each “mode” After K windows, take a decision  Consider the most likely mode in each windows  Average over all K windows to get the average max belief  Compare it against a threshold 62

3° Cost-TMA PhD school – Krakow – Feb Skype Naive Bayes Classifier W (k) W (k+1) W (k+2) W (k+3) W (k+4) X Y max AVG min B s (k,j) E[B s (,j) ] B  (k) E[B  ] B Packet NBC IPG NBC IPG NBC``

3° Cost-TMA PhD school – Krakow – Feb NBC over Time Set a threshold

Results

3° Cost-TMA PhD school – Krakow – Feb Three Classifiers Naïve Bayes Classifier Payload Based Classifier Chi Square Classifier Skype? Traffic Flow Skype? ANDAND UDP benchmark dataset

3° Cost-TMA PhD school – Krakow – Feb Scenario Testbed traces: 100% accuracy Campus  Simple scenario, no P2P, no VoIP Italian ISP  Stiff scenario: lot of P2P, tons of VoIP Results consider  True positive (OK): Skype, and identified  False positive (FP): Not Skype, but identified  False negative (FN): Skype, but discarded

3° Cost-TMA PhD school – Krakow – Feb Performance Evaluation: UDP NBC + CSC Chi Square Classifier Naïve Based Classifier Payload Based Classifier

3° Cost-TMA PhD school – Krakow – Feb CSC Threshold Impact

3° Cost-TMA PhD school – Krakow – Feb NBC Threshold Impact FP [%] E2E FN [%] Minimum Belief E2E

Kiss: chi square stocastich classifier or Stocastic Packet Inspection Generalize it 71

3° Cost-TMA PhD school – Krakow – Feb Phase 1 Feature Phase 3 Verify Phase 2 Decision Traffic (Known) Our Approach Statistical characterization of bits in a flow Do NOT look at the SEMANTIC and TIMING … but rather look at the protocol FORMAT Test  2

3° Cost-TMA PhD school – Krakow – Feb Source PortDestination Port Sequence Number Acknowledgment Number Checksum Options Window Urgent Pointer HLEN ResvControl flag Padding Question: Which protocol is this? CONSTANT COUNTER RANDOM

3° Cost-TMA PhD school – Krakow – Feb CSC Split the payload into groups Apply the test on the values assumed at each group  Each message is an observation Some groups will contain  Random bits  Mixed bits  Deterministic bits | ID | FUNC |

3° Cost-TMA PhD school – Krakow – Feb Chunking and  2 First N payload bytes C chunks Each of b bits  2 1  2 C [], …, Vector of Statistics The provides an implicit measure of entropy or randomness  2 Observed distribution Expected distribution (uniform)

3° Cost-TMA PhD school – Krakow – Feb Consider a chunk of 2 bits: Random Values Deterministic Value Counter OiOi and different beaviour EiEi

Question: Why comparing against a UNIFORM pdf?? 77

3° Cost-TMA PhD school – Krakow – Feb bit long chunks: evolution random xxxx  2

3° Cost-TMA PhD school – Krakow – Feb random Deterministic bit long chunks: evolution  2

3° Cost-TMA PhD school – Krakow – Feb random deterministic mixed x000x0x00xxx 4 bit long chunks: evolution  2 Question: is it dependent on WHICH bits are fixed And WHICH are random???

3° Cost-TMA PhD school – Krakow – Feb KISS

3° Cost-TMA PhD school – Krakow – Feb Question: What is this?!?! 2 byte long counter MSGL2L1LSG Most Significant Group Less Significant Group

3° Cost-TMA PhD school – Krakow – Feb Protocol format as seen from the  2

3° Cost-TMA PhD school – Krakow – Feb Statistical characterization of bits in a flow Decision process Test Minimum distance / maximum likelihood  2 Phase 1 Feature Phase 3 Verify Phase 2 Decision Traffic (Known) Our Approach

3° Cost-TMA PhD school – Krakow – Feb C-dimension space  2 1  2 C [], …, Iperspace Classification Regions Euclidean Distance Support Vector Machine  2 i  2 j Class My Point

3° Cost-TMA PhD school – Krakow – Feb Example considering the  2

3° Cost-TMA PhD school – Krakow – Feb  2 i  2 j Centroid Center of mass Euclidean Distance Classifier

3° Cost-TMA PhD school – Krakow – Feb  2 i  2 j True Negative Are “Far” True Positives Are “Nearby” Centroid Center of mass Euclidean Distance Classifier

3° Cost-TMA PhD school – Krakow – Feb  2 i  2 j False Positives Centroid Center of mass Iper-sphere Euclidean Distance Classifier

3° Cost-TMA PhD school – Krakow – Feb  2 i  2 j Centroid Center of mass Iper-sphere False negatives Radius Euclidean Distance Classifier

3° Cost-TMA PhD school – Krakow – Feb  2 i  2 j Centroid Center of mass Iper-sphere max { True Pos. } min { False Neg. } Confidence The distance is a measure of the condifence of the decision Euclidean Distance Classifier

3° Cost-TMA PhD school – Krakow – Feb Radius True Positive – False positive How to define the sphere radius?

3° Cost-TMA PhD school – Krakow – Feb Space of samples (dim. C) Kernel function Space of feature (dim. ∞) Kernel functions Move point so that borders are simple Support Vector Machine

3° Cost-TMA PhD school – Krakow – Feb Support vectors Kernel functions Move point so that borders are simple Borders are planes Simple surface! Nice math Support Vectors LibSVM Support Vector Machine

3° Cost-TMA PhD school – Krakow – Feb Decision Distance from the border Confidence is a probability p (  class ) Kernel functions Borders are planes Simple surface! Nice math Support Vectors LibSVM Support Vector Machine

3° Cost-TMA PhD school – Krakow – Feb Performance evaluation How accurate is all this? Our Approach Phase 1 Feature Phase 3 Verify Phase 2 Decision Traffic (Known) Statistical characterization of bits in a flow Decision process Test Minimum distance / maximum likelihood  2

3° Cost-TMA PhD school – Krakow – Feb Per flow and per endpoint What are we going to classify?  It can be applied to both single flows  And to endpoints Question:  Do we assume to monitor ALL packets?  Do we assume to monitor since the first packet? 97

3° Cost-TMA PhD school – Krakow – Feb Per flow and per endpoint What are we going to classify?  It can be applied to both single flows  And to endpoints Question:  Do we assume to monitor ALL packets?  Do we assume to monitor since the FIRST packet? NO!  It is robust to sampling  It can start from any point 98

3° Cost-TMA PhD school – Krakow – Feb Per flow and per endpoint 99

3° Cost-TMA PhD school – Krakow – Feb Real traffic traces Internet Fastweb Known + Other Training Known Traffic False Negatives Unknown traffic False Positives Trace RTP eMule DNS Oracle (DPI + Manual ) other Other Unknown Traffic 1 day long trace 20 GByte di UDP traffic

3° Cost-TMA PhD school – Krakow – Feb Definition of false positive/negative Traffic Oracle (DPI) eMule RTP DNS Other Classifing “known” true positives false negatives KISS true negatives false positives Classifing “other” KISS

3° Cost-TMA PhD school – Krakow – Feb Case ACase B Rtp Edk Dns Case ACase B Case ACase B other Euclidean Distance SVM Case ACase B Results Known traffic (False Neg.) [%] Other (False Pos.) [%]

3° Cost-TMA PhD school – Krakow – Feb Tuning trainset size % True positives False positives Samples per class Small training set For “known”: Mbyte For “other”: 300 Mbyte

3° Cost-TMA PhD school – Krakow – Feb  2 packets % True positives False positives Tuning Num. of Packets for (confidence 5%) Protocols with volumes at least pkts per flow

3° Cost-TMA PhD school – Krakow – Feb Real traffic trace RTP errors are oracle mistakes (do not identify RTP v1) DNS errors are due to impure training set (for the oracle all port 53 is DNS traffic) EDK errors are (maybe) Xbox Live (proper training for “other”) FN are always below 3%!!!

3° Cost-TMA PhD school – Krakow – Feb P2P-TV applications P2P-TV applications are becoming popular They heavly rely on UDP at the transport protocol They are based on proprietary protocols They are evolving over time very quickly How to identify them?... After 6 hours, KISS give you results

3° Cost-TMA PhD school – Krakow – Feb Putting all together Now with  9 classes  3 different networks 107

Another example of behavioral classifier Abacus 108

3° Cost-TMA PhD school – Krakow – Feb Abacus: Rationale Applications are like people in a party room  Some prefer brief exchanges with many other people  Some likes long talks with few other people “Attitudes” are different across P2P applications...  Some prefer to download small pieces of data from many peers  Some prefer to download all data from almost the same peers... enough to classify them  Observe a host for a given time  Count the number of peers contacted and the number of packets exchanged which represent the attitude

3° Cost-TMA PhD school – Krakow – Feb Abacus signature definition Consider a host X which in a fixed time- window ΔT = 5s is contacted by N=5 peers Y i for each peer Y i count the number of packets sent to X in ΔT Consider a set of bins of exponential width Divide the peers in bins according to the number of exchanged packets Normalize the bins, i.e. divide for the total number of peers N The final signature is an empirical probability distribution function In the example N=5, bins = (1, 0, 2, 2)‏ Abacus signature (0.2, 0, 0.4, 0.4)‏ X Y1Y1 Y2Y Y3Y3 Y5Y5 Y4Y4 9-16

3° Cost-TMA PhD school – Krakow – Feb Signature comparison PPLiveTVAnts Joost SopCast

3° Cost-TMA PhD school – Krakow – Feb Performance evaluation How accurate is all this? Our Approach Phase 1 Feature Phase 3 Verify Phase 2 Decision Traffic (Known) Statistical characterization of bits in a flow ABACUS signatures Decision process Supervised machine learning based on SVM

3° Cost-TMA PhD school – Krakow – Feb Hyper-space is partitioned  every point is given a label  even “unknown” apps Need a way to recognize them  Define a center for each class  Define a threshold R  Calculate the distance d between the point and the center of the assigned class  If d > R mark the new point as unknown Bhattacharyya distance BD  Distance between p.d.f. Rejection criterion Training points R R Center of the class New points Labeled as “unknown” Labeled as “green”

3° Cost-TMA PhD school – Krakow – Feb Experimental results

3° Cost-TMA PhD school – Krakow – Feb Experimental results For “unknown traffic” the selection of the rejection threshold R is fundamental! For R~0 low TPR low FPR For R~1 high TPR high FPR For R=0.5 high TPR low FPR

3° Cost-TMA PhD school – Krakow – Feb The Failure of DPI

3° Cost-TMA PhD school – Krakow – Feb Question you should ask yourself Which feature to use?  Are those “portable”?  How often should the classifier be retrained?  Can it take a decision after some packets? Open issues  Which is a good training set?  How to get a valid benchmarking set? Never trust other classifiers!  How to make it work at 100Gb/s?  How much traffic can it actually classify (coverage)?

3° Cost-TMA PhD school – Krakow – Feb How Much Traffic can be Classified? 118

3° Cost-TMA PhD school – Krakow – Feb Interesting research topics And for TCP? And for HTTP?  How to get Facebook or Twitter  Over HTTPS?  When served by the same CDN? And for the application?  Is it a video seen on Facebook? Can I get rid of the training set?  Use unsupervised classifiers? 119

120

Ops, wrong key 121

And for TCP? 122

3° Cost-TMA PhD school – Krakow – Feb Chunking and  2 First N payload bytes C chunks Each of b bits  2 1  2 C [], …, Vector of Statistics The provides an implicit measure of entropy or randomness  2 Observed distribution Expected distribution (uniform)

3° Cost-TMA PhD school – Krakow – Feb Results 124

3° Cost-TMA PhD school – Krakow – Feb Results 125