
Presentation transcript:

Slide 1/23: An Effective Similarity Metric for Application Traffic Classification
DPNM, POSTECH, NOMS 2010
Jae Yoon Chung¹, Byungchul Park¹, Young J. Won¹, John Strassner², and James W. Hong¹,²
{dejavu94, fates, yjwon, johns,
April 20
¹ Dept. of Computer Science and Engineering, POSTECH, Korea
² Division of IT Convergence Engineering, POSTECH, Korea

Slide 2/23: Contents
- Introduction
- Related Work
- Research Goal
- Proposed Methodology
- Evaluation
- Conclusion and Future Work

Slide 3/23: Introduction
- Traffic classification for network management
  - Network planning
  - QoS management
  - Security
  - Etc.
- Diversity of today's Internet traffic
  - New types of network applications
  - Increase of P2P traffic
  - Various techniques for avoiding detection
- Document classification → Traffic classification
  - Document classification in natural language processing
  - Comparing packet payload vectors is analogous to document classification

Slide 4/23: Related Work
- Well-known port-based classification
  - Low complexity
  - Low accuracy (approximately 50~70%)
- Signature-based classification
  - High reliability
  - Exhaustive effort needed to search for signatures
  - E.g., Snort, LASER
- Behavior-based classification
  - Focuses on traffic patterns and connection behaviors
  - Questionable accuracy
  - E.g., BLINC
- Machine learning-based classification
  - Utilizes statistical information
  - Heavy consumption of computing resources
  - E.g., SVM, Bayesian Network
- Similarity-based classification
  - Utilizes the document classification approach
  - Questionable scalability
  - E.g., flow similarity calculation [IPOM '09]

Slide 5/23: Summary of IPOM 2009
- Proposed a new traffic classification approach
  - Utilizes the document classification approach with Cosine similarity calculation
  - Proposes a new packet representation using the Vector Space Model
  - Proposes a flow similarity calculation methodology that compares the packets in a flow sequentially
- Methodology validated using real-world traffic on our campus backbone network
  - Cannot classify flows in an asymmetric routing environment
- No comparison of Cosine similarity with other similarity metrics
  - Cosine similarity is a common similarity metric for classifying human-written documents
  - The similarity value varies widely with term frequency

Slide 6/23: Research Goals
- Propose a new traffic classification algorithm
  - Automation of the signature generation step
    - Generate an application vector, which is an alternative signature, using simple vector operations
    - Make groups according to traffic type and operation within single-application traffic
  - Accurate and feasible traffic classification algorithm
    - Classify application traffic using similarity calculation
    - Solve the asymmetric routing classification problem
  - Validation using real-world network traffic to compare similarity metrics
  - Complexity analysis
- Compare three similarity metrics for traffic classification
  - Jaccard similarity: counts fragments of a signature
  - Cosine similarity: weights the signature highly
  - RBF similarity: based on the Euclidean distance between packets

Slide 7/23: Proposed Methodology

Slide 8/23: Vector Space Modeling
- Vector Space Modeling
  - An algebraic model representing text documents as vectors
  - Widely used for document classification
  - Categorizes an electronic document based on its content (e.g., spam filtering)
- Document classification vs. traffic classification
  - Document classification: find the documents among stored text documents that satisfy certain information queries
  - Traffic classification: classify network traffic according to the type of application, based on traffic information

Slide 9/23: Payload Vector Conversion (1/2)
- Definition of a word in a payload
  - Payload data within an i-byte sliding window
  - $|\text{Word set}| = 2^{8 \times \text{sliding window size}}$
- Definition of a payload vector
  - A term-frequency vector, as in NLP
  - $\text{Payload Vector} = [w_1\ w_2\ \dots\ w_n]^T$

Slide 10/23: Payload Vector Conversion (2/2)
- The word size is 2 bytes and the word set size is $2^{16}$
  - The simplest case for representing the order of content in payloads
[Figure label: Word]
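The payload-to-vector conversion described on these two slides can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `payload_vector`, the one-byte stride of the sliding window, and the sparse `Counter` representation (instead of a dense vector with $2^{16}$ entries) are all assumptions made for the example.

```python
from collections import Counter


def payload_vector(payload: bytes, window: int = 2) -> Counter:
    """Build a sparse term-frequency vector from a packet payload.

    Each "word" is the content of a `window`-byte sliding window.
    Assumption: the window advances one byte at a time.
    """
    if window < 1:
        raise ValueError("window must be at least 1 byte")
    # A payload shorter than the window produces no words (empty vector).
    return Counter(
        payload[i:i + window] for i in range(len(payload) - window + 1)
    )


# Hypothetical payload, used only to show the call.
vec = payload_vector(b"GET /index.html HTTP/1.1\r\n")
```

With a 2-byte window, adjacent bytes are counted together, which is how the vector keeps some information about the order of content in the payload.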

Slide 11/23: Similarity Metrics for Traffic Classification
- Jaccard similarity
  - The size of the intersection of the sample sets X and Y divided by the size of the union of the sample sets X and Y
- Cosine similarity
  - Compares two n-dimensional vectors X and Y by finding the cosine of the angle between them
- RBF similarity
  - Radial basis function of the Euclidean distance between two vectors X and Y
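The three metrics can be sketched on sparse term-frequency vectors as shown below. The formulas on the original slide did not survive extraction, so this uses the standard textbook definitions. The Gaussian form `exp(-gamma * d^2)` for the RBF and the default `gamma` value are assumptions, not parameters taken from the talk.

```python
import math
from collections import Counter


def jaccard(x: dict, y: dict) -> float:
    """|X ∩ Y| / |X ∪ Y| over the sets of words present in each vector."""
    a, b = set(x), set(y)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(x: dict, y: dict) -> float:
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(x[w] * y[w] for w in x.keys() & y.keys())
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0


def rbf(x: dict, y: dict, gamma: float = 0.1) -> float:
    """Gaussian RBF of the Euclidean distance; gamma is an assumed parameter."""
    sq_dist = sum((x.get(w, 0) - y.get(w, 0)) ** 2 for w in x.keys() | y.keys())
    return math.exp(-gamma * sq_dist)


# Two toy vectors sharing one word out of three.
x = Counter({"ab": 1, "bc": 1})
y = Counter({"bc": 1, "cd": 1})
scores = (jaccard(x, y), cosine(x, y), rbf(x, y))
```

All three functions return values between 0 and 1, so their scores can be compared against a common threshold, although the RBF score depends strongly on the choice of `gamma`.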

Slide 12/23: Application Vector Heuristics
- Application vector
  - Represents typical packets that are generated by target applications as the center (basis) of each cluster
- Application vector generator
  - Reads packets from the target application trace
  - Divides the packets into several types of clusters without any pre-processing
[Figure: an application trace fed into the application vector generator, producing Application vectors 1, 2, and 3 and Traffic clusters 1 and 2]

Slide 13/23: Application Vector Generation
- Unsupervised grouping within single-application traffic
  - Provides fine-grained classification
  - Classifies single-application traffic according to traffic types
[Figure: packets 1 to 6 of the application traffic grouped into Cluster 1 and Cluster 2, represented by Application vector 1 and Application vector 2]
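The slides do not spell out the grouping procedure, so the following is only one possible sketch of unsupervised application vector generation. It assumes a greedy single-pass grouping with a Jaccard threshold and takes the mean term-frequency vector of each group as its application vector. The names `application_vectors` and `threshold` are hypothetical.

```python
from collections import Counter


def jaccard(x: dict, y: dict) -> float:
    """|X ∩ Y| / |X ∪ Y| over the sets of words present in each vector."""
    a, b = set(x), set(y)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def application_vectors(packets: list, threshold: float = 0.5) -> list:
    """Group payload vectors greedily and return one mean vector per group.

    Assumption: a packet joins the most similar existing cluster if the
    similarity reaches `threshold`; otherwise it starts a new cluster.
    """
    clusters = []  # each cluster: {"sum": Counter of word counts, "n": packets}
    for vec in packets:
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = jaccard(vec, cluster["sum"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"sum": Counter(vec), "n": 1})
        else:
            best["sum"].update(vec)  # element-wise addition of counts
            best["n"] += 1
    # The cluster center is the mean term frequency of its member packets.
    return [
        {word: count / c["n"] for word, count in c["sum"].items()}
        for c in clusters
    ]


# Toy trace: two similar packets and one unrelated packet.
trace = [Counter({"ab": 1, "bc": 1}), Counter({"ab": 1, "bc": 3}), Counter({"xy": 2})]
centers = application_vectors(trace)
```

Only vector addition and scaling are used here, in the spirit of the "simple vector operation" mentioned in the research goals. The result depends on packet order and on the threshold, which is a known limitation of greedy grouping.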

Slide 14/23: Two-stage Traffic Classification
- Packet-level clustering
  - Classifies single packets regardless of flow information
  - Compares payload vectors with application vectors by calculating similarity values
  - Marks each packet with its application and priority
  - Allows permutation of the packet sequence
- Flow-level classification
  - Rearranges packets according to flow information
  - Ignores mis-clustered packets that are caused by protocol ambiguities (e.g., HTTP for Web vs. HTTP for P2P)

Slide 15/23: Two-stage Traffic Classification
[Figure: packets P1 to P4 of Flow 1 (F1) and Flow 2 (F2) from backbone traffic. In Stage 1, packets are compared with Application Vectors 1, 2, and 3 (BitTorrent, FileGuri, Melon) and placed into Clusters 1, 2, and 3, with one packet marked as mis-clustered. In Stage 2, packets are regrouped by flow into BitTorrent traffic and FileGuri traffic.]
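The two stages can be sketched as below. This is an illustration under stated assumptions rather than the algorithm from the talk: Jaccard similarity is used for the packet stage, unmatched packets are dropped below a hypothetical `threshold`, and the flow stage resolves mis-clustered packets with a simple majority vote. The priority marking mentioned on the slide is not modeled.

```python
from collections import Counter, defaultdict


def jaccard(x: dict, y: dict) -> float:
    """|X ∩ Y| / |X ∪ Y| over the sets of words present in each vector."""
    a, b = set(x), set(y)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify_packets(packets, app_vectors, threshold=0.3):
    """Stage 1: label each packet on its own, ignoring flow membership."""
    labelled = []
    for flow_id, vec in packets:
        scores = {app: jaccard(vec, av) for app, av in app_vectors.items()}
        app, score = max(scores.items(), key=lambda kv: kv[1])
        labelled.append((flow_id, app if score >= threshold else None))
    return labelled


def classify_flows(labelled):
    """Stage 2: regroup packets by flow; the majority label wins (assumption)."""
    votes = defaultdict(Counter)
    for flow_id, app in labelled:
        if app is not None:
            votes[flow_id][app] += 1
    return {flow_id: c.most_common(1)[0][0] for flow_id, c in votes.items()}


# Hypothetical application vectors and packets, keyed by made-up words.
app_vectors = {
    "BitTorrent": {"bt": 1, "pr": 1},
    "FileGuri": {"fg": 1, "hd": 1},
}
packets = [
    ("f1", {"bt": 1, "pr": 1}),
    ("f1", {"bt": 1, "zz": 1}),
    ("f1", {"fg": 1, "hd": 1}),  # mis-clustered packet inside flow f1
    ("f2", {"fg": 1, "hd": 1}),
]
flows = classify_flows(classify_packets(packets, app_vectors))
```

Because Stage 1 never looks at packet order or direction, the same code works when only one direction of a flow is observed, which is the situation in asymmetric routing.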

Slide 16/23: Evaluation

Slide 17/23: Classifying Real-world Traffic
- Fixed-port applications
  - Traffic trace captured at one of two Internet junctions at POSTECH using an optical tap
  - Ground-truth traffic: some active flows among the application traffic, distinguished by their use of an active port number
  - Target applications: FileGuri, ClubBox, Melon, BigFile
- Untraceable-port applications
  - Traffic Measurement Agent (TMA)
    - Monitors the network interface of the host
    - Records log data (flow five-tuples, process name, packet count, etc.)
  - Target applications: eMule, BitTorrent
[Figure: backbone traffic, target application traffic, and ground-truth traffic for each of the two cases]

Slide 18/23: Classification Accuracy
- Classification accuracy comparison
  - Fixed-port applications: FileGuri, ClubBox, Melon, BigFile
  - Untraceable-port applications: eMule, BitTorrent
  - Jaccard similarity: reliable, because it counts common segments
  - Cosine similarity: emphasizes common segments, so it cannot distinguish ambiguous packets
  - RBF similarity: the parameter is difficult to set, and a guideline for setting it is needed
- BitTorrent traffic on the backbone network
  - Traffic over-classification by Cosine similarity
  - High false positive rate of Cosine similarity

Slide 19/23: Histogram of Similarity Values

Slide 20/23: CDF of Distance among Payload Vectors

Slide 21/23: Complexity Analysis

Slide 22/23: Conclusion and Future Work
- Developed a new traffic classification approach
  - Applies the document classification approach to traffic classification
  - Unsupervised classification to form clusters within single-application traffic
  - Two-stage classification algorithm to solve the asymmetric routing classification problem
  - Linear time complexity
- Compared three similarity metrics
  - Provides a guideline for selecting similarity metrics for traffic classification
  - Provides soft classification that represents similarity as a numerical value ranging from 0 to 1
- Future work
  - Enhance the unsupervised classification methodology for automated signature generation
  - Extract orthogonal application vectors to improve scalability
