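Network Security (網路安全). Prof. Nen-Fu Huang (黃能富), Department of Computer Science / Institute of Communications Engineering, National Tsing Hua University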


Agenda: Introduction of Network Security; Content Inspection Technologies; Pattern Matching Algorithms; Flow Classification by Stateful Mechanism; Machine Learning Based Application Identification Technologies; Network Security Research Topics; Conclusions

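-- Hackers Are Everywhere -- 2000/3: Hackers used DDoS attacks to paralyze well-known websites such as Yahoo, Amazon, CNN, and eBay.
2001/7: Hackers stole customer credit card data from Bibliofind, a subsidiary of Amazon.com. 2002: China-US hacker war. 2003/1: SQL Slammer attack. 2003/4: The "流光" (Fluxay) backdoor program from mainland China. 2003/8: Blaster worm attack. 2003/9: SoBig virus attack. 2003/9: Attacks by the mainland Chinese cyber army. 2004/3: Netsky virus attack. 2004/4: Sasser worm attack. 2005/5: Hackers tampered with data at Taiwan's College Entrance Examination Center. 2005/6: Diplomatic secrets were stolen from the Ministry of Foreign Affairs website by a backdoor program of the mainland Chinese cyber army. Computer Security Institute's 2000 Computer Crime and Security Survey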

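Network Security Concerns: Attack techniques evolve constantly, and attack tools are easy to obtain, have simple interfaces, and require no advanced skill to launch an attack.
Network attacks are no longer limited to intrusion; many aim to disrupt a website's ability to provide service. Network communication equipment is not secure enough. Routers and switches can inspect only the layer-3 information of a packet. Firewalls focus on checking layer-4 information. Antivirus software is increasingly unable to recognize network attacks.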

Examples of Network Attack Tools

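Basic Concepts of Network Security. Policy: confidentiality of data, integrity (trustworthiness) of information, availability of information.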

Types of Network Attacks: Denial of Service (DoS), Distributed Denial of Service (DDoS)
Network Invasion; Network Scanning; Network Sniffing; Trojan Horse and Backdoors; Worm

(1) DoS/DDoS Prevent another user from using network connection, or disable server or services: e.g. “Smurf” and “Fraggle” attacks, “Land”, “Teardrop”, “NewTear”, “Bonk”, “Boink”, SYN flooding, “Ping of death”, IGMP Nuke, buffer overflow. Caused by protocol fault or program fault. It damages the “Availability”.

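Common DoS Attacks: Ping Flooding, Ping of Death, UDP Flooding (Chargen)
Ping flooding: The attacker sends a large number of ICMP echo packets to the victim host to exhaust its system resources. Ping of Death: The attacker sends an ICMP echo packet carrying 65,536 bytes, one byte more than the 65,535-byte maximum IP packet size, to the victim host, which crashes as a result (a flaw in TCP/IP implementations). UDP flooding (Chargen): The attacker sends a large number of UDP packets to port 19 (Character Generator) on the broadcast address of the victim network, causing every host on that network to send UDP replies and exhausting the network bandwidth.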

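Common DoS Attacks: Smurf Attack, SYN Flooding
Smurf attack: A "kill with a borrowed knife" stratagem. The attacker sends ICMP echo packets to the broadcast address of some network, with the source address set to the intended victim host. Every machine on that network then sends an ICMP reply to the victim, so the network's bandwidth is congested and the victim's system resources are exhausted. SYN flooding: The attacker sends thousands of SYN packets per second (the packets used to establish TCP connections) to the victim host, with forged or nonexistent source addresses. The victim sends SYN-ACKs back to the nonexistent addresses, which of course never respond. The victim then cannot accept any further TCP connections, so legitimate users cannot log in.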

Smurf attack (DoS) Dangerous attacks Network-based, fills access pipes
Uses ICMP echo/reply (smurf) or UDP echo (fraggle) packets with broadcast networks to multiply traffic Requires the ability to send spoofed packets Abuses “bounce-sites” to attack victims Traffic multiplied by a factor of 50 to 200 Low-bandwidth source can kill high-bandwidth connections Similar to ping flooding, UDP flooding but more dangerous due to traffic multiplication

“Smurf” Attack (cont’d)

SYN flooding Attack (DoS)
Goal is to deny access to a TCP service running on a host. Creates a number of half-open TCP connections which fill up a host’s listen queue; host stops accepting connections. Requires the TCP service be open to connections from the victim.
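The effect of half-open connections on the listen queue can be illustrated with a minimal sketch. The function name and the fixed `backlog_size` are hypothetical, and the model assumes half-open entries never time out, which real TCP stacks do not do.

```python
def simulate_syn_flood(backlog_size, spoofed_syns, legit_syns):
    """Return how many legitimate SYNs are accepted when spoofed SYNs arrive first."""
    half_open = 0  # queue slots waiting for a final ACK
    for _ in range(spoofed_syns):
        if half_open < backlog_size:
            # The SYN-ACK goes to a spoofed address, so the ACK never arrives
            # and the slot stays occupied (no timeout is modeled here).
            half_open += 1
    accepted = 0
    for _ in range(legit_syns):
        if half_open < backlog_size:
            half_open += 1
            accepted += 1
    return accepted


print(simulate_syn_flood(backlog_size=128, spoofed_syns=1000, legit_syns=10))
```

Once the spoofed SYNs fill the queue, no legitimate connection gets a slot, which is the denial of service described above.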

SYN flooding (cont'd)
[Figure: the Attacker sends spoofed SYNs to the Victim; the Victim sends SYN-ACKs to the spoofed addresses (The Innocents).]

DDoS Attack
[Figure: the Attacker controls several Handlers, each of which controls several Agents. Control messages may be encrypted or hidden in normal packets. The Agents send spoofed packets to the Victim.]

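DDoS Attack: The attacker remotely controls multiple puppet machines that attack the victim host in large volume at the same time.
The attacks on Yahoo.com, Amazon.com, CNN.com, buy.com, and ebay.com were DDoS attacks.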

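DDoS Attack Examples. Example DDoS attack programs: Trin00 (destructive), Tribe Flood Network (TFN) (destructive), TFN2K, Stacheldraht.
Trin00: Trin00 can be launched from one machine or a group of machines. Once the attack starts, every computer hiding a Trin00 daemon sends UDP packets (carrying four bytes of data) to the victim host while continually changing the destination port. The victim is kept busy returning ICMP port unreachable messages and cannot properly serve legitimate packets and connections. TFN: It is launched in the same way as Trin00, but its attacks are more varied: it can send SYN floods, UDP floods, ICMP floods, or Smurf attacks. The latest version of TFN can change the source address of its attack packets by itself, which makes this type of attack even harder for security mechanisms to inspect and filter.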

(2) Network Invasion Goal is to get into the target system and obtain information Account usernames, passwords Source code, business critical information Usually caused by improper configurations or privilege setting, or program fault. Network invasion is diverse and various, knowledge about attack pattern may help to detect, but it is quite hard to detect all attacks.

Example of network invasion: IIS unicode buffer overflow
For IIS 5.0 on windows 2000 without this security patch, a simple URL string: will show the information of root directory.

(3) Network Scanning: The goal is generally to find openings and to learn the topology of the victim's network: the names and addresses of hosts and network devices, and the opened services. It usually uses techniques such as ICMP scanning, X'mas scan, SYN-FIN scan, and SNMP scan. There is an automatic and powerful tool: Nmap.

(4) Sniffing
The goal is generally to obtain the content of communication: account usernames, passwords, mail accounts, and network topology. A sniffer is usually a program that places an Ethernet adapter into promiscuous mode and saves information for later retrieval. Hosts running the sniffer program (e.g. NetBus) are often compromised using host attack methods.

(5) Backdoor and Trojan horse
Usually, a backdoor or Trojan horse is the consequence of an invasion or of a hostile program. It may open a private communication channel and wait for remote commands. Available toolkits: Subseven, BirdSpy, Dragger. It can be detected by monitoring known control channel activities, but not with 100% precision.

(6) Worm The chief intention of worm is to propagate and survive.
It takes advantages of system vulnerabilities to infect and then tries to infect any possible targets. It may decrease the production of system, leave back doors, steal confidential information and so on.

P2P/IM Security Threats: P2P (Peer-to-Peer) file sharing programs; IM (Instant Messenger)
Spyware; Adware; Tunneling (private tunnels)

P2P: A new paradigm Bottleneck of Server Powerful PC
Flexible, efficient information sharing P2P changes the way of Web (Internet)

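P2P is about to break the existing information security architecture.
Beyond file sharing and instant messaging, P2P has gradually developed other applications, such as SoftEther and Skype. For individual users the benefits outweigh the drawbacks, but for enterprises P2P is a major information security concern. P2P applications carry many hidden risks: leaking confidential internal company information; serving as a channel for spreading viruses and worms; downloading illegal files and infringing copyright; consuming large amounts of network bandwidth and disrupting the normal operation of other systems; distracting employees and lowering productivity.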

Famous P2P Examples
BitTorrent eZpeer Kuro eDonkey eMule MLdonkey Gnutella Kazaa/Morpheus Shareaza Direct-connect Gnutella Soulseek Opennap Worklink Opennext Jelawat PP點點通 SoftEther iMESH MIB WinMix WinMule Skype

Instant Messenger (IM)
MSN Yahoo Messenger ICQ YamQQ AIM (AOL IM)

Evolution of Network Security Technologies: Firewall (Layer-4); VPN; SSL VPN; PKI; IDS/IPS; Defense-in-Depth
Application Firewall (Layer-7); UTM (Unified Threat Management); NAC (Network Access Control)

Intrusion Detection System (IDS)
Intrusion Detection and Prevention System (IPS/IDP)

Intrusion Detection System
Intrusion Detection System: a computer system that attempts to detect any set of actions that try to compromise the integrity, confidentiality, or availability of a resource. An IDS has much more knowledge and many more fine-grained detection functions than a common firewall. (Remember that the main function of a firewall is access control.)

IDS Types: Host based vs. Network based.
Misuse detection vs. Anomaly detection; Active vs. Passive; Centralized vs. Distributed

Host based & Network based IDS
Host based IDS: installed on the target host as a monitoring service. It checks system activity, user privileges, and user behavior. Network based IDS: installed on a network node, usually in promiscuous mode so that it can listen to all passing traffic. It checks network traffic and the interactions between nodes.

Misuse detection & Anomaly detection IDS
Misuse detection (signature-based): based on the assumption that intrusion attempts can be characterized by comparing user activities against a database of known attacks. Anomaly detection (statistical-based): identifies abusive behavior by noting and analyzing audit data that deviates from a predicted norm.
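The contrast between the two detection styles can be sketched as follows. The signature list and the z-score threshold are hypothetical values chosen for illustration, not taken from any real IDS.

```python
from statistics import mean, stdev

# Hypothetical signature database of known attack byte strings.
SIGNATURES = [b"/etc/passwd", b"cmd.exe"]


def misuse_detect(payload):
    """Signature-based: report every known attack string found in the payload."""
    return [sig for sig in SIGNATURES if sig in payload]


def anomaly_detect(history, observed, threshold=3.0):
    """Statistical: flag a value that deviates from the learned norm.

    'history' holds past measurements (e.g. connections per minute); the
    value is anomalous if it lies more than 'threshold' standard
    deviations away from the historical mean.
    """
    mu, sigma = mean(history), stdev(history)
    return sigma > 0 and abs(observed - mu) / sigma > threshold


print(misuse_detect(b"GET /scripts/..%c0%af../cmd.exe"))
print(anomaly_detect([100, 110, 90, 105, 95], 500))
```

The signature check only finds attacks already in its database, while the statistical check can flag unseen behavior at the cost of false alarms when normal traffic shifts.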

Active IDS vs. Passive IDS
Active IDS: an active participant in the system. It not only observes events but also takes part in the necessary operations. Also called IPS or IDP (Intrusion Detection and Prevention System). Passive IDS: works as a monitor or bystander.

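Active IDS vs. Passive IDS
[Figure: (a) Passive IDS: packets are collected for analysis through a port mirror, so intrusion attacks from the ISP side can still pass through to the LAN. (b) Active IDS: packets are intercepted directly for analysis, so attacks are blocked.]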

Centralized IDS v.s. Distributed IDS
Centralized: The sensors are managed by a single analyzer or manager. Distributed: The sensors are managed by multiple automated analyzers or managers. And among analyzers and managers, they can communicate to each other.

Comparison between Firewall and Network based active IDS
Same : Can’t protect insider to insider attack. Can’t protect against connections that don’t go through. Can do ACL and filtering. (For Active IDS) Different : IDS has the ability to detect new threats. IDS focuses on intrusion while Firewall focuses on access control and privacy. Firewalls use address as the passport while IDS will do much more checks.

The Challenge of IDS. Speed limitation: a NIDS cannot keep pace with the network speed (a NIDS needs to check more fields of a packet than a firewall does). The inability to see all the traffic: switched Ethernet is being widely deployed. Fail-open/fail-closed architecture: a "fail-open" NIDS often fails without notifying the central console of the problem, leaving the network "open". A "fail-closed" methodology means the network is out of service until the NIDS is brought back on-line.

IDS False Alarms

Content Inspection Technologies

A Generic Layer-7 Engine
Packet Normalizer Makes sure the integrity of incoming packets Eliminates the ambiguity Decodes URI strings if necessary Pattern-Matching Engine Policy Engine Gather information from pattern-matching engine and issue the verdict to allow/drop the packets

Packet Normalizer: Integrity Checking; IP Fragment Reassembly
TCP Segment Reassembly: TCP segments may arrive out of order; SEQ out of window size; segment overlapping. URI Decode: URI hex code obfuscation ('a' = %61); URI unicode/UTF-8 obfuscation; self-referential directory obfuscation (/././././ = /); directory obfuscation (/abc/a/../a/../a/ = /abc/a)
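The URI decoding step can be sketched with the Python standard library. The function name `normalize_uri_path` is hypothetical, and the sketch performs a single decoding pass only; a real normalizer would also have to handle double encoding and server-specific unicode tricks.

```python
import posixpath
from urllib.parse import unquote


def normalize_uri_path(raw_path):
    """Undo hex-code and directory obfuscation in a URI path."""
    decoded = unquote(raw_path)          # '%61' -> 'a'
    return posixpath.normpath(decoded)   # collapse '/./' and resolve '/../'


print(normalize_uri_path("/%61bc"))             # /abc
print(normalize_uri_path("/././././"))          # /
print(normalize_uri_path("/abc/a/../a/../a/"))  # /abc/a
```

After normalization, obfuscated variants of the same path map to one canonical string, so a single signature can match all of them.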

Pattern-Matching Engine
The most computation-intensive task in packet processing. Normally the PM engine needs to process every single byte in packet payload. In Snort, the PM routine accounts for 31% of the total execution time

Pattern Matching is Expensive!
~50 instructions per 1500-byte packet, compared with ~30 instructions per byte for pattern matching, i.e. 45K instructions per 1500-byte packet. Source: Intel Corp.

Content Inspection Technologies
Pattern-Matching Algorithms Software Based Boyer-Moore Aho-Corasick (AC) Wu-Manber Hardware Based Bloom-Filter Reconfigure Hardware (FSM) TCAM-based

Pattern Matching Problem Definition
Given an input text $T = t_0 t_1 \ldots t_n$ and a finite set of strings $P = \{P_1, P_2, \ldots, P_r\}$, the string matching problem involves locating and identifying the substrings of $T$ that are identical to some $P_j = p^j_0 p^j_1 \ldots p^j_{m-1}$, $1 \le j \le r$, where $t_{s+i} = p^j_i$ for $0 \le i \le m-1$. This condition can also be written as $t_s \ldots t_{s+m-1} = P_j$. [Figure: example text G C A T G C A]

Aho-Corasick (AC) Algorithm
AC is a classic solution to exact set matching. It works in time O(n + m + z), where z is the number of pattern occurrences in T. AC is based on a refinement of a keyword tree. AC is a deterministic algorithm; that is, its performance is independent of the number of patterns.
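A compact sketch of the keyword tree with failure links is given below. It is a plain dictionary-based illustration rather than an optimized implementation, and the function names are hypothetical.

```python
from collections import deque


def build_automaton(patterns):
    """Build the goto, fail, and output tables of an Aho-Corasick automaton."""
    goto, fail, out = [{}], [0], [[]]
    for pat in patterns:                      # phase 1: keyword tree (trie)
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    queue = deque(goto[0].values())           # phase 2: failure links by BFS
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]  # inherit matches ending here
    return goto, fail, out


def search(text, patterns):
    """Return (start_offset, pattern) for every occurrence, in one pass over text."""
    goto, fail, out = build_automaton(patterns)
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits


print(search("ushers", ["he", "she", "his", "hers"]))
```

Each text character is consumed once, and failure links replace restarts, which is why the scan time does not grow with the number of patterns.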

An Example of AC Algorithm
Example: P = {ab, ba, babb, bb}

An example of AC Algorithm
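[Figure: AC automaton for the patterns hers, his, she, with states numbered up to 9 and a self-loop at the root on any character other than h or s. Output sets shown include {he}, {hers}, {his}, and {he, she}. Dashed edges are fail transitions; those not shown lead to the root.]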

An example of AC Algorithm
[Figure: the same automaton tracing the text "h e i s h i s"; a match is reported when "h i s" is recognized.]

Reconfigure Hardware (FSM)
Implement the AC FSM in the configurable Logic Elements (LEs) of an FPGA. This achieves multi-gigabit performance, depending on the FPGA model. A powerful FPGA is necessary to accommodate thousands of patterns, so the approach is not practical and is rarely seen in the commercial market.

FPGA-based pattern matching

Bloom Filter: Given a string X, the Bloom filter computes k hash functions on it, producing k hash values ranging from 1 to m, and sets the corresponding bits in an m-bit vector. The same procedure is repeated for all the members of the pattern set. The input text is verified by generating k hash values in the same way. If at least one of these k bits is found not set, the string is declared impossible to match. Patterns of length n are grouped into Bloom filter Bn.
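The procedure above can be sketched as follows. Deriving the k hash functions from salted SHA-256 and indexing bits from 0 to m-1 are assumptions made for this illustration; hardware designs use much cheaper hash functions.

```python
import hashlib


class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, item):
        # k hash values derived by salting one hash (an illustrative choice).
        for i in range(self.k):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # Any unset bit proves the item was never added (no false negatives).
        return all(self.bits[pos] for pos in self._positions(item))


def scan(payload, filters):
    """Slide one window per pattern length; 'filters' maps length -> BloomFilter."""
    hits = []
    for length, bloom in filters.items():
        for start in range(len(payload) - length + 1):
            window = payload[start:start + length]
            if bloom.might_contain(window):
                hits.append((start, window))  # candidate; needs exact verification
    return hits


filters = {2: BloomFilter(3072, 4), 4: BloomFilter(3072, 4)}
filters[2].add(b"ab")
filters[4].add(b"worm")
print(scan(b"xxwormxx", filters))
```

A hit is only a candidate match because of possible false positives, so an exact comparison is still needed afterwards.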

Bloom Filter (Cont.): k hash functions H1, H2, …, Hk
[Figure: the payload stream (A B C D E F G H I J …) is examined through windows of length 2 to w; each window feeds the Bloom filter for that length (B2, B3, B4, …, Bw), an m-bit vector with positions 1 to m indexed by H1, H2, H3, …, Hk. Signatures are grouped by length: G2(X), G3(X), G4(X).] Minimum false positive rate: $f = (0.5)^k$, when $m = (k \times n) / \ln 2$. So the total space is $\sum_i B_i = m \times (w - 1)$. Examples: k = 1, n = 2048, m = 3072 bits; k = 1, n = 3072, m = 4608 bits. For k = 4, 5, and 6, f follows from the same formula.

TCAM fundamental TCAM stores data with three logic values: ‘0’, ‘1’, ‘X’ (don’t care) Multiple match modes are needed.
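A software sketch of ternary matching is shown below. Entries are written as strings over '0', '1', and 'X'; returning the first matching index is one assumed match mode (priority order), and returning all matches is another.

```python
def ternary_match(entry, key):
    """True if every entry position is 'X' (don't care) or equals the key bit."""
    return len(entry) == len(key) and all(e in ("X", k) for e, k in zip(entry, key))


def tcam_first_match(table, key):
    """Priority mode: index of the first matching entry, or None."""
    for index, entry in enumerate(table):
        if ternary_match(entry, key):
            return index
    return None


def tcam_all_matches(table, key):
    """Multi-match mode: indices of all matching entries."""
    return [i for i, entry in enumerate(table) if ternary_match(entry, key)]


table = ["1010", "10XX", "XXXX"]
print(tcam_first_match(table, "1001"))
print(tcam_all_matches(table, "1010"))
```

A real TCAM compares the key against all entries in parallel in one cycle; the loop here only models the result.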

Policy Engine: Collects the matching events from the Pattern-Matching Engine and clarifies the relationship between matched patterns. Ordered: a policy may consist of more than one pattern, and the patterns should be matched in order. Offset, Depth: the matched position should be within a certain range or location. Distance, Within: the distance between two matched patterns should also be taken into consideration. Trace Application States: some applications are difficult to identify by using only one signature (e.g. P2P), so the Policy Engine needs to track the connection state. [Figure: state diagram with states S0, S1, S2, S3 and transitions labeled Msg Exchange, Data Exchange, Request File]
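The ordering and position constraints can be sketched as a small rule matcher. The rule format and the exact meaning given to `offset`, `depth`, `distance`, and `within` here are simplifying assumptions for illustration and may differ from any particular IDS rule language.

```python
def match_rule(payload, contents):
    """Check that all patterns occur in order and satisfy their position limits.

    Each item is a dict with 'pattern' and optionally either
    'offset'/'depth' (absolute range from the payload start) or
    'distance'/'within' (range relative to the end of the previous match).
    """
    cursor = 0  # end position of the previous match
    for item in contents:
        pattern = item["pattern"]
        if "distance" in item or "within" in item:
            start = cursor + item.get("distance", 0)
            end = start + item["within"] if "within" in item else len(payload)
        else:
            offset = item.get("offset", 0)
            start = max(cursor, offset)          # keeps the patterns ordered
            end = offset + item["depth"] if "depth" in item else len(payload)
        pos = payload.find(pattern, start, end)
        if pos < 0:
            return False
        cursor = pos + len(pattern)
    return True


payload = b"USER anonymous\r\nPASS guest\r\n"
rule = [{"pattern": b"USER", "offset": 0, "depth": 4},
        {"pattern": b"PASS", "distance": 0}]
print(match_rule(payload, rule))
```

The pattern-matching engine reports where each pattern occurs; logic of this kind then decides whether the combination of matches satisfies a policy.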

Fast Pattern Matching Algorithms
A Pattern Matching Coprocessor for Deep and Large Signature Set in Network Security System (IEEE GLOBECOM 2005).
Hierarchical Matching Algorithm (HMA) for Intrusion Detection Systems (IEEE GLOBECOM 2005).
A Time and Memory Efficient String Matching Algorithm for Intrusion Detection Systems (IEEE GLOBECOM 2006).
A non-Computation Intensive Pre-filter for String Pattern Matching in Network Intrusion Detection Systems (IEEE GLOBECOM 2006).
Smart Architecture for High-speed Intrusion Detection and Prevention Systems, International Conference on Cryptology and Network Security (CANS 2006, acceptance rate < 18%).
A Deterministic Cost-effective String Matching Algorithm for Network Intrusion Detection Systems (IEEE ICC 2007).
A Novel Algorithm and Architecture for High Speed Pattern Matching in Resource-limited Silicon Solution (IEEE ICC 2007).
Flow Digest: A State Synchronization Scheme for Stateful High Availability (IEEE ICC 2007).
Performing Packet Content Inspection by Longest Prefix Matching Technology (IEEE GLOBECOM 2007).

Security SoC: BroadWeb Security SoC
ARM922 RISC CPU (250 MHz); Hardware NAT (400 Mbps); Hardware Content Inspection Engine (40 Mbps); two 10/100/1000 RJ-45 ports; Embedded Linux; NSS and ICSA approved IPS signature database; IPS/Anti-virus functions; IM/P2P Management; turn-key solution (ASIC + software module); 1-tier customers

Security SoC (Cont.): BroadWeb Security SoC (2nd Generation)
ARM926EJ RISC CPU (300 MHz); Intelligent Hardware NAT (1 Gbps); Hardware Content Inspection Engine (100 Mbps); embedded GbE Smart Switch and 4-port GPHY core; NSS and ICSA approved IPS technology; IPS/Anti-virus functions; IM/P2P Management; turn-key solution (ASIC + software module); 1-tier customers

Cisco/Linksys Wireless Security Router
IEEE 802.11n 108 Mbps EWC Wireless LAN; IPS protection and IM/P2P management; Firewall/VPN/Routing; Gigabit Ethernet x 5

State Machine Based Technologies

State Machine Based Technologies
The FA Example : FTP

The FAs of BitTorrent protocols.

The FAs of Yahoo Messenger protocol.

We can identify and manage Over 60 Applications
IM: MSN, Yahoo Messenger, AIM, QQ, Google Talk, TM, ICQ, iChat, MIRC, Odigo, Rediff, Gadu-Gadu
Web-IM: Meebo.com, eBuddy.com, iLoveIM.com, MSN, AIM, Yahoo, ICQ
P2P: eDonkey, BitTorrent, Gnutella, Foxy, FastTrack, Vagaa, Winny, BitComet, DirectConnect, PiGo, PP365, WinMX, POCO, iMesh, ClubBox
Streaming Media: QQLive, Podcast Bar, PPLive, RealPlayer, Windows Media Player, iTunes, WinAMP, Player 365, QuickTime, FlashMedia Video, TVAnts
Webmail: Yahoo, Hotmail, Gmail
VoIP: Skype (3.6)
File Transfer: FTP, Web File Transfer, Thunder, GetRight, FlashGet
VPN: VNN, SoftEther, Hamachi, TinyVPN, PacketiX, HTTP-Tunnel, Tor, Ping-Tunnel
Terminal Control: VNC, PCAnywhere
Online Game: QQGame, OurGame, Cga.com.cn, QQFO

Machine Learning Based Technologies

Application Traffic identification
Traffic identification (or traffic classification) has received attention in recent years because: the introduction of P2P applications greatly impacts the network management task; the port number is no longer a good or efficient discriminator for identifying this prevalent traffic. How about the string matching method? Accurate! But: it cannot identify encrypted traffic; manually maintaining protocol signatures is costly; matching strings in a very high speed network is costly; the privacy issue is still under debate.

How to resolve the problem?
Heuristic methods (2004~2005): Based on some intrinsically different behaviors, rules can be constructed, e.g. (# of dest IPs) == (# of dest ports) → the host is running P2P. They differentiate P2P from non-P2P traffic. Machine learning based techniques (2004~): Construct "statistical signatures" for different categories/application protocols. Most machine learning techniques are employed directly to construct the traffic signature. Application field.
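The example rule can be sketched as a per-host counter. The `min_peers` cutoff and the flow tuple layout are hypothetical additions, included only so that hosts with one or two connections are not flagged.

```python
from collections import defaultdict


def p2p_suspects(flows, min_peers=3):
    """Flag hosts whose number of distinct destination IPs equals the number
    of distinct destination ports.

    'flows' is an iterable of (src_host, dst_ip, dst_port) tuples.
    """
    dst_ips = defaultdict(set)
    dst_ports = defaultdict(set)
    for src, dst_ip, dst_port in flows:
        dst_ips[src].add(dst_ip)
        dst_ports[src].add(dst_port)
    return sorted(
        src for src in dst_ips
        if len(dst_ips[src]) >= min_peers
        and len(dst_ips[src]) == len(dst_ports[src])
    )


flows = [
    ("hostA", "1.1.1.1", 40001), ("hostA", "2.2.2.2", 51234), ("hostA", "3.3.3.3", 6881),
    ("hostB", "4.4.4.4", 80), ("hostB", "5.5.5.5", 80), ("hostB", "6.6.6.6", 80),
]
print(p2p_suspects(flows))
```

The intuition is that a web client contacts many servers on the same few ports, while a P2P peer contacts many peers that each listen on an arbitrary port.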

The Milestone of Researches on Application Traffic Identification
Before 2003: String matching and port number. 2003~2005: Heuristics Machine learning method. 2006~ : Machine learning method for real-time based traffic classification. First k data packet sizes and direction of TCP connection. Stage-based classification(Statistical data in each stage)

Different Objects of Application Traffic Identification
At different levels: Category level or QoS class (bulk data transfer such as FTP and P2P, interactive, mail, web, streaming). Protocol level (Kazaa, eMule/eDonkey, BitTorrent, MSN, FTP, POP3, SMTP, HTTP, Skype, Winny, Share, …). Behavior level (FTP control, FTP data, MSN file transfer, MSN message chatting, MSN VoIP, Skype chatting, Skype VoIP, Skype file transfer, Skype video conference, …). All existing research focuses on classification at the protocol or category level. Application field: Offline: traffic trend analysis. Online: traffic shaping, traffic engineering, security management.

The Classes of Applied Machine Learning Algorithms
Supervised machine learning: The model of traffic characteristics is constructed from training instances with previously defined class labels. Unsupervised machine learning (clustering): The model of traffic characteristics is constructed from training instances without previously defined class labels. However, the training sets used in existing work for both approaches include pre-classified labels, because each cluster may contain several different classes/protocols.

The Discriminators (Attributes)
The key issues for machine-learning based traffic identification are: What are the most distinguishable characteristics (attributes/discriminators)? How to remove the expensive cost on training? Different discriminators: From L3/L4 layer—packet inter-arrival time, total packet size, number of packets,…,etc. Combination of L3/L4 attributes with different perspectives. e.g. upload/download size ratio.

The Milestone of Researches (Applying Machine Learning techniques)
2003~2004: [Matthew Roughan, IMC’04] Class-of-Service Mapping for QoS. 2005: [Sebastian Zander] Automated Traffic Classification. [Andrew W. Moore] Using Bayesian Analysis Techniques. 2006: [Sebastian Zander] Internet Archeology: Estimating Individual Application Trends in Incomplete Historic Traffic Traces. [Laurent Bernaille] Traffic classification on the fly. (first 5 packets of TCP with k-means clustering). [Jeffrey Erman] Internet Traffic Identification using Machine Learning (k-means, EM clustering).

The Milestone of Researches (Applying Machine Learning techniques)
2006 (cont.): [Laurent Bernaille] Early Application Identification.(first 4 packets of TCP with k-means, GMM , and HMM clustering) 2007: Real time based methods [Zhu Li] Accurate Classification of the Internet Traffic Based on the SVM Method. (TCP and UDP flow classification) [Laurent Bernaille] Early Recognition of Encrypted Application. (first 3 packets of TCP with GMM clustering) [Jeffrey Erman] Semi-Supervised Network Traffic Classification. (Stage-based classification)

Class-of-Service Mapping for QoS: A Statistical Signature-based Approach to IP Traffic. ACM SIGCOMM Internet Measurement Conference (IMC '04). Matthew Roughan (1), Subhabrata Sen (2), Oliver Spatscheck (2), Nick Duffield (2). (1) School of Mathematical Sciences, University of Adelaide, Australia. (2) AT&T Labs – Research, Florham Park, NJ, USA

Introduction
Before this paper: Traditional research tried to find models for traditional protocols (FTP, web, mail). Most research on traffic characteristics modeling that focuses on P2P and IM consists of case studies. Features: This paper studied the requirements and proposed a QoS framework for traffic that consists of traditional and novel P2P/IM applications at the QoS class level. Classification is based on using the statistics of particular applications to form "signatures".

Ideas
The statistical attributes are aggregated with respect to server ports and server IP addresses, separately. Machine learning techniques, Nearest Neighbor (NN) and Linear Discriminant Analysis (LDA), are employed to construct the mapping from the server port aggregation/server IP aggregation to different QoS classes. Then, the port number of an aggregation that belongs to a particular QoS class can form one rule. Disadvantage: Applications that require different QoS might use the same server port number (e.g. P2P).

Nearest Neighbor: To classify a data point x, find its nearest neighbor; points with the same property should lie close together. The class of the nearest neighbor is assigned to the data point x. K-Nearest Neighbor: find the k nearest neighbors and let them "vote": the class that holds the majority among these k points is assigned to the point being classified. More information: K-nearest-neighbor Rule
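The voting rule can be sketched in a few lines. The training points and class names below are made up for illustration; `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_classify(train, x, k=3):
    """Assign x the majority class among its k nearest training points.

    'train' is a list of (feature_vector, class_label) pairs; distance is
    Euclidean. With k=1 this reduces to the plain nearest-neighbor rule.
    """
    neighbors = sorted(train, key=lambda item: math.dist(item[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((9, 8), "B"), ((8, 9), "B")]
print(knn_classify(train, (2, 2)))      # A
print(knn_classify(train, (8.5, 8.5)))  # B
```

In practice the features should be scaled to comparable ranges first, since Euclidean distance is dominated by the attribute with the largest numeric spread.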

Linear Discriminant Analysis
To find a good "projection" of the original points. Linear discriminant analysis finds a linear transformation ("discriminant function") of the two predictors, X and Y, that yields a new set of transformed values that provides a more accurate discrimination than either predictor alone: Transformed Target = C1*X + C2*Y. [Figure panels: 2 features; 3 features] More information:

Evaluation Example Attributes for this evaluation: the average packet size, flow duration, bytes per flow, packets per flow, and Root Mean Square (RMS) packet size.
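The listed attributes can be computed from a packet trace as sketched below. The sketch works on a single flow given as (timestamp, size) pairs, which is an assumption made for illustration; the paper aggregates such statistics per server port or server IP address.

```python
import math


def flow_attributes(packets):
    """Compute the attributes above for one flow.

    'packets' is a non-empty list of (timestamp_seconds, size_bytes) pairs.
    """
    sizes = [size for _, size in packets]
    times = [ts for ts, _ in packets]
    count = len(sizes)
    total = sum(sizes)
    return {
        "avg_packet_size": total / count,
        "rms_packet_size": math.sqrt(sum(s * s for s in sizes) / count),
        "flow_duration": max(times) - min(times),
        "bytes_per_flow": total,
        "packets_per_flow": count,
    }


print(flow_attributes([(0.0, 300), (0.5, 400), (2.0, 500)]))
```

The RMS packet size weights large packets more heavily than the average does, so the pair of values says something about how spread out the packet sizes are.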

Internet Traffic Classification Using Bayesian Analysis Techniques. ACM SIGMETRICS '05. Andrew W. Moore (1), Denis Zuev (2). (1) University of Cambridge. (2) University of Oxford

Introduction
Features: Only TCP flows are considered. Category-level classification. Supervised machine learning: Naïve Bayesian algorithm. Uniquely uses data that has been hand-classified (based upon flow content) into one of a number of categories. Feature selection was applied to improve the accuracy.

Ideas
Discriminators: About 248 discriminators for each flow, e.g. packet inter-arrival time (mean, variance, ...), payload size (mean, variance, ...), Fourier transform of the packet inter-arrival time, TTL value, flow duration, TCP port, etc. Naïve Bayesian classifier: For a flow with known statistical attributes, which class is most likely? Find the maximum probability Pr(Ci | X), where Ci is the i-th class and X is the attribute vector of the flow to be classified. Only about 65% accuracy at the flow level was achieved.
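A Naïve Bayes classifier with a Gaussian model per attribute can be sketched as follows. This is a generic textbook version, not the authors' code; the two features and the sample values are made up, and a small constant is added to each variance to avoid division by zero.

```python
import math
from collections import defaultdict


def train_gaussian_nb(samples):
    """Estimate a prior and a per-attribute (mean, variance) for each class.

    'samples' is a list of (feature_vector, class_label) pairs.
    """
    by_class = defaultdict(list)
    for features, label in samples:
        by_class[label].append(features)
    model = {}
    for label, rows in by_class.items():
        stats = []
        for column in zip(*rows):
            mu = sum(column) / len(column)
            var = sum((v - mu) ** 2 for v in column) / len(column) + 1e-9
            stats.append((mu, var))
        model[label] = (len(rows) / len(samples), stats)
    return model


def classify(model, x):
    """Return the class Ci maximizing Pr(Ci | X), computed in log space."""
    def log_posterior(label):
        prior, stats = model[label]
        score = math.log(prior)
        for value, (mu, var) in zip(x, stats):
            # Attributes are treated as independent Gaussians (the naive assumption).
            score += -0.5 * math.log(2 * math.pi * var) - (value - mu) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)


samples = [((300, 0.5), "WWW"), ((350, 0.7), "WWW"), ((320, 0.6), "WWW"),
           ((1400, 30), "BULK"), ((1450, 40), "BULK"), ((1500, 35), "BULK")]
model = train_gaussian_nb(samples)
print(classify(model, (330, 0.55)))
```

Working with log probabilities avoids numeric underflow when many attributes are multiplied together.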

Ideas (cont.)
Improvement: Naïve Bayes kernel estimation method: kernel estimation was used instead of the Gaussian distribution model assumed by Naïve Bayes. Discriminator selection and dimension reduction. The accuracy was improved up to 95%. Disadvantages: All the discriminators are available only after the flow is closed. Only TCP flows are considered for classification. Network management might need finer classes (protocol level or behavior level). Naïve Bayes makes certain assumptions on f( . | cj), such as independence of the Ai's and Gaussian behavior of them.

Evaluation for Train and Test sets from traffic of different time
Traffic in Test set was captured after 12 months of capturing training traffic FCBF: Fast Correlation-Based Filter

Traffic Classification on the Fly. ACM SIGCOMM Computer Communication Review, Volume 36, Issue 2. Laurent Bernaille†, Renata Teixeira†, Ismael Akodkenou†, Augustin Soule‡, Kave Salamatian†. † LIP6, Université Pierre et Marie Curie; ‡ Thomson Paris Lab, Paris, France

Introduction
Features: The first paper focused on real-time flow-level application classification. It approximately models the L7 protocol handshake. Protocol-level classification. Unsupervised machine learning: K-means clustering (50 clusters work best). Protocol assignment: for each cluster, the protocol with the largest proportion dominates the cluster. Discriminators: the sizes (payload) and directions of the first q data packets of each TCP connection; q = 5 works best, e.g. (+300, -200, +100, +200, -400).

K-means Clustering For given number of clusters k, to iteratively find k centers of these k clusters and “partition” all the points into these k clusters until the nearest center does not change. Each data point is expressed as a vector, and Euclidean distance is the most common distance computation function.
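The iteration described above can be sketched as follows. Flows are represented as vectors of signed packet sizes (sign encodes direction); the sample flows are made up, and the initial centers are passed in explicitly so that the result is deterministic, whereas real use would pick them randomly.

```python
import math


def kmeans(points, centers, max_iter=100):
    """Alternate assignment and center update until the centers stop moving."""
    centers = [tuple(c) for c in centers]
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        new_centers = [
            tuple(sum(col) / len(col) for col in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:   # partition is stable
            break
        centers = new_centers
    return centers, clusters


# First five data packet sizes of four hypothetical TCP flows.
flows = [(300, -200, 100, 200, -400), (310, -190, 110, 190, -410),
         (-50, 1400, -50, 1400, -50), (-60, 1380, -40, 1420, -50)]
centers, clusters = kmeans(flows, centers=[flows[0], flows[2]])
print(centers)
```

Each resulting cluster is then labeled with an application, and a new flow is classified by finding the nearest center.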

Evaluation Result Above 80% average accuracy can be achieved.
Disadvantages: Only TCP connections are considered. Protocol assignment will result in classification starvation. The protocols which don’t dominate any cluster will be always classified as other protocol.

Early Application Identification. ACM CoNEXT '06 (International Conference on Emerging Networking Experiments and Technologies). Laurent Bernaille, R. Teixeira, and K. Salamatian. Université Pierre et Marie Curie, LIP6, CNRS, Paris, France

Introduction
Features: Three unsupervised machine learning (clustering) algorithms were used to evaluate cluster assignment accuracy and protocol labeling accuracy: K-means; Gaussian Mixture Models (GMM) on a Euclidean space; spectral clustering on Hidden Markov Models (HMM, in order to consider the order of packets). Discriminators: size and direction of the first P data packets. To deal with the starvation problem in each group, a labeling heuristic based on standard server port numbers (e.g. 25 for SMTP, 110 for POP3) is used to classify protocols in each cluster group. Only TCP flows are considered. A wireless traffic trace has been included in the evaluation.
Notes: 1. Hidden Markov Models are used because the Euclidean distance computation does not consider the order of the elements within a vector, so packet sequences in different orders that yield the same Euclidean distance are treated as having the same characteristics, for example (1, -3, 4, 2, 5) and (2, 4, 1, 5, -1). 2. Cluster assignment accuracy is the accuracy of assigning a flow to a cluster that contains the correct protocol of the flow. Protocol labeling accuracy is the accuracy of labeling a flow with the correct protocol after cluster assignment. 3. The labeling heuristic is currently applied only to ordinary protocols that have a standard port number. The basic idea is that a clustered flow that matches a standard port number (SMTP, POP3, etc.) within its cluster is labeled according to that port number. P2P protocols are not covered by this labeling, so they must still be labeled according to their proportion within the cluster. P2P protocols therefore still suffer from the starvation problem.

Discriminators Discussion about the discriminators:
The size and direction of each packet adds more information to distinguish applications than arrival time related metrics. The range of packet sizes for each application is similar across traces. These models can be used to classify the same set of applications at another network. P = 4 packets for the three clustering methods. Clustering number: Kh = 30 for HMM, Kk = 40 for K-Means and Kg = 45 for GMM.

Packet size is a better attribute

On-line Classification

Labeling set of standard server ports
std(S) ={FTP, SSH, SMTP, HTTP, POP3, NNTP, HTTPS, POP3S}.

Labeling Accuracy

Features
Pros: Easy, fast, and simple! Payload size and packet direction of the first P data packets. Unsupervised training → automatic learning mechanism. Cons: In [Jeffrey Erman, HP TR]: "…is unsuccessful classifying application types with variable-length packets in their protocol handshakes such as Gnutella. Neither of these studies assess the byte accuracy of their approaches which makes direct comparisons to our work difficult."

Features
Cons: Only TCP flows are included in the classification. According to the description of the traces, a non-negligible fraction of flows contain fewer than 4 data packets. In addition, the control flow might prevent the identification system from classifying detailed protocol behavior. Classification starvation still exists for protocols that do not use a standard port.

Early Recognition of Encrypted Applications. Passive and Active Measurement Conference (PAM 2007). Laurent Bernaille, Renata Teixeira. Université Pierre et Marie Curie, LIP6-CNRS, Paris, France

Introduction
Features: The classification of SSL-encrypted protocols. Two stages: SSL detection and protocol identification. First 3 packets and 35 clusters for the Gaussian Mixture Model. Size of the original packet: the most accurate method is to look up the encryption method in the handshake packets and transform the size of application packets accordingly. For the five most common ciphers this method is overkill, because the size increase varies only from 21 to 33 bytes. Simple heuristic: subtract 21 from the size of the encrypted packet regardless of the cipher. Extending the Cluster+Port labeling heuristic with SSL-specific ports: 443 for HTTPS, 993 for IMAPS, and 995 for POP3S.
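The size correction and the SSL port labels can be sketched as below. The constant and the port table come from the slide; the function names, the clamping at zero, and the fallback to a cluster-derived label are simplifying assumptions and not the paper's full Cluster+Port heuristic.

```python
SSL_OVERHEAD_BYTES = 21  # flat correction applied regardless of the cipher
SSL_PORT_LABELS = {443: "HTTPS", 993: "IMAPS", 995: "POP3S"}


def estimate_original_sizes(encrypted_sizes):
    """Approximate the unencrypted application packet sizes."""
    return [max(size - SSL_OVERHEAD_BYTES, 0) for size in encrypted_sizes]


def label_flow(server_port, cluster_label):
    """Prefer the SSL-specific port label; otherwise keep the cluster's label."""
    return SSL_PORT_LABELS.get(server_port, cluster_label)


print(estimate_original_sizes([121, 521, 10]))
print(label_flow(993, "unknown"))
```

The corrected sizes can then be fed to the same clustering model that was trained on unencrypted traffic.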

Accurate Classification of the Internet Traffic Based on the SVM Method. IEEE ICC 2007. Zhu Li (1), Ruixi Yuan (1), and Xiaohong Guan (1, 2). (1) Center for Intelligent and Networked Systems (CFINS), Tsinghua University, Beijing, China. (2) SKLMS Lab and MOE Key Lab for Intelligent Networks and Network Security, Xi'an Jiaotong University, Xi'an, China

Introduction
Features: Category-level classification. Supervised machine learning: Support Vector Machine. Feature selection (discriminator selection) is employed to select the best set of attributes. Both TCP and UDP are considered. Discriminators: statistical data of flows. Disadvantage: the discriminators are available only after the flow has finished communicating.

Feature Selection Sequential forward selection
Begin with 0 feature chosen; sequentially append 1 feature which can arrive at the best classification result. Plus-m-minus-r algorithm Begin with 0 feature chosen; sequentially append m features into chosen ones and pop r features from them (m>r) each time. Plus-2-minus-1 was used in this paper.
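Both selection strategies can be sketched with a generic scoring callback. The `score` function stands for "classification result obtained with this feature subset" and is a hypothetical stand-in here (a simple sum of made-up weights); in the paper it would be the accuracy of the trained classifier.

```python
def sequential_forward_selection(features, score, target_size):
    """Start empty; repeatedly append the feature that gives the best score."""
    chosen, remaining = [], list(features)
    while remaining and len(chosen) < target_size:
        best = max(remaining, key=lambda f: score(chosen + [f]))
        chosen.append(best)
        remaining.remove(best)
    return chosen


def plus_m_minus_r(features, score, target_size, m=2, r=1):
    """Append m features, then pop the r least useful ones, until target_size."""
    assert m > r, "each round must make net progress"
    target_size = min(target_size, len(features))
    chosen, remaining = [], list(features)
    while len(chosen) < target_size:
        for _ in range(m):                       # plus step
            if not remaining or len(chosen) >= target_size:
                break
            best = max(remaining, key=lambda f: score(chosen + [f]))
            chosen.append(best)
            remaining.remove(best)
        if len(chosen) >= target_size:
            break
        for _ in range(r):                       # minus step
            if len(chosen) <= 1:
                break
            # Drop the feature whose removal hurts the score the least.
            worst = max(chosen, key=lambda f: score([c for c in chosen if c != f]))
            chosen.remove(worst)
            remaining.append(worst)
    return chosen


weights = {"a": 3, "b": 1, "c": 2, "d": 0.5}   # made-up usefulness values


def score(subset):
    return sum(weights[f] for f in subset)


print(sequential_forward_selection(list(weights), score, 2))
print(plus_m_minus_r(list(weights), score, 3))
```

The minus step lets the search undo an earlier choice that has become redundant, which plain forward selection cannot do.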

Feature Selection (Cont.)

Accuracy After Feature selection
For the data sample set with respect to original proportion in the traffic

Offline/Realtime Traffic Classification Using Semi-Supervised Learning. Technical Report, HP. Presented at Performance 2007, 2-5 October 2007, Cologne, Germany, and published in the Performance Evaluation journal (special issue on Performance 2007 for the Proceedings of IFIP Performance 2007). Jeffrey Erman, Anirban Mahanti, Martin Arlitt, Ira Cohen, Carey Williamson. Enterprise Systems and Software Laboratory, HP Laboratories Palo Alto

Introduction Features:
Semi-supervised learning techniques: allow classifiers to be designed from training data that consists of only a few labeled and many unlabeled flows. Both high byte accuracy and high flow accuracy (i.e., > 90%). Traffic is examined over an extended period of time to assess the longevity of the classifiers. Focused on TCP only; it would likely be advantageous to have a separate classifier for the non-TCP traffic (future work). Consideration of the elements in the training set: elephant vs. mice flows, in order to obtain higher byte accuracy. Note 1: The ideas of clustering and protocol labeling are almost the same as in the 5-packet method. The difference is that the authors of this paper use different discriminators.

Introduction Semi-supervised Learning:
Hypothesis: if a few flows are labeled in each cluster, we have a reasonable basis for creating the mapping from clusters to application types. Step 1: Clustering with K-Means. Step 2: Mapping from the clusters to the q different known applications (Y) according to the fraction of labeled application flows within each cluster. Clusters are left unlabeled if they contain no labeled flows; unlabeled clusters represent new or unknown applications. For most experiments, the number of clusters is K = 400. Note: DBSCAN and EM clustering were employed in the authors' previous work; in this paper, only K-Means is used.
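The two steps can be sketched in plain Python as below. This is only an illustration under stated assumptions: the evenly spaced centroid initialisation, the majority vote used for the cluster-to-application mapping, the "unknown" label, and the toy 2-D points are choices made here, not details taken from the paper.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


def nearest(p: Point, centroids: Sequence[Point]) -> int:
    """Index of the centroid closest to p (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))


def kmeans(points: List[Point], k: int, iters: int = 20) -> List[Point]:
    """Step 1: cluster all flows, labeled or not, with Lloyd's K-Means."""
    # Simplified deterministic initialisation (assumption): evenly spaced points.
    centroids = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iters):
        clusters: List[List[Point]] = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        centroids = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids


def label_clusters(points: List[Point], labels: List[Optional[str]],
                   centroids: List[Point]) -> List[str]:
    """Step 2: map each cluster to the application holding the largest
    fraction of its labeled flows; clusters with no labeled flow stay unknown."""
    votes = [Counter() for _ in centroids]
    for p, lab in zip(points, labels):
        if lab is not None:
            votes[nearest(p, centroids)][lab] += 1
    return [v.most_common(1)[0][0] if v else "unknown" for v in votes]


def classify(p: Point, centroids: List[Point], cluster_labels: List[str]) -> str:
    """A new flow takes the label of its nearest cluster."""
    return cluster_labels[nearest(p, centroids)]


# Toy data: three groups of flows, only two flows carry a label.
points = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10),
          (50, 50), (50, 51), (51, 50)]
labels = ["web", None, None, None, "p2p", None, None, None, None]

centroids = kmeans(points, k=3)
cluster_labels = label_clusters(points, labels, centroids)
```

With these toy points, a single labeled flow is enough to name a whole cluster, and the third cluster stays unknown because none of its flows is labeled.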

Discriminators
11 discriminators (after feature selection from 25 discriminators): total number of packets; average packet size; total bytes; total header (transport plus network layer) bytes; number of caller-to-callee packets; total caller-to-callee bytes; total caller-to-callee payload bytes; total caller-to-callee header bytes; number of callee-to-caller packets; total callee-to-caller payload bytes; total callee-to-caller header bytes.
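As a rough sketch, the 11 discriminators listed above could be computed from a per-flow packet list like this. The `Packet` record, its field names, and the definition of packet size as header plus payload bytes are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Packet:
    """Hypothetical per-packet record for one flow."""
    from_caller: bool    # True if sent by the side that opened the connection
    header_bytes: int    # network-layer plus transport-layer header bytes
    payload_bytes: int


def discriminators(packets: List[Packet]) -> Dict[str, float]:
    """Compute the 11 flow statistics from the packets seen so far."""
    caller = [p for p in packets if p.from_caller]
    callee = [p for p in packets if not p.from_caller]
    total_bytes = sum(p.header_bytes + p.payload_bytes for p in packets)
    return {
        "total_packets": len(packets),
        "avg_packet_size": total_bytes / len(packets) if packets else 0.0,
        "total_bytes": total_bytes,
        "total_header_bytes": sum(p.header_bytes for p in packets),
        "caller_packets": len(caller),
        "caller_bytes": sum(p.header_bytes + p.payload_bytes for p in caller),
        "caller_payload_bytes": sum(p.payload_bytes for p in caller),
        "caller_header_bytes": sum(p.header_bytes for p in caller),
        "callee_packets": len(callee),
        "callee_payload_bytes": sum(p.payload_bytes for p in callee),
        "callee_header_bytes": sum(p.header_bytes for p in callee),
    }


# Toy flow: two handshake packets, a request, and a response.
flow = [
    Packet(True, 40, 0),
    Packet(False, 40, 0),
    Packet(True, 40, 100),
    Packet(False, 40, 1460),
]
features = discriminators(flow)
```

All 11 values are simple counters, so they can be updated per packet while a flow is still in progress.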

On-line Classification
Layered classification system. A packet milestone is reached when the total number of packets a flow has sent or received (SYN/SYNACK packets included) reaches a specific value. Each layer is an independent model that classifies ongoing flows into one of the many class types using the flow statistics available at the chosen milestone. Each milestone's classification model is trained using flows that have reached that specific packet milestone. A flow is reclassified whenever an upper layer is reached; when a flow is reclassified, any previously assigned labels are disregarded.
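The layered scheme might be organised roughly as follows. This is a sketch under assumptions: the class name, the milestone values 2 and 4, and the stand-in lambda models are hypothetical; in the paper each layer would be a trained clustering model.

```python
from typing import Callable, Dict, Optional

# A layer's model maps the flow statistics available at its milestone to a label.
Model = Callable[[Dict[str, float]], str]


class LayeredClassifier:
    """One independent model per packet milestone; later layers overwrite earlier labels."""

    def __init__(self, models: Dict[int, Model]) -> None:
        self.models = models                  # milestone (packet count) -> model
        self.packet_counts: Dict[str, int] = {}
        self.labels: Dict[str, str] = {}

    def on_packet(self, flow_id: str, stats: Dict[str, float]) -> Optional[str]:
        """Count the packet; if a milestone is reached, reclassify the flow."""
        count = self.packet_counts.get(flow_id, 0) + 1
        self.packet_counts[flow_id] = count
        model = self.models.get(count)
        if model is not None:
            # The previously assigned label is disregarded on reclassification.
            self.labels[flow_id] = model(stats)
        return self.labels.get(flow_id)


# Hypothetical milestones (2 and 4 packets) with stand-in models.
clf = LayeredClassifier({
    2: lambda s: "unknown",
    4: lambda s: "web" if s["bytes"] > 1000 else "chat",
})

history = [clf.on_packet("flow-1", {"bytes": b}) for b in (60, 120, 700, 1500)]
```

Between milestones a flow keeps its most recent label, so a decision is always available once the first milestone has been passed.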

Byte Accuracy April 13, 9 am trace
Experiments for online classification. Training source: April 6, 9 am trace with 966,000 flows. Layers: N = 13. For each of the N layers, models were created using 8,000 training flows and K = 400. Test set of Figure 8: the April 13, 9 am trace (the largest 1-hour Campus trace). 78% of the flows had correct labels after classification.

Features Pros: The semi-supervised mechanism reduces the cost of preparing a large training data set. Sampling techniques are considered when forming the training set. Cons: Only TCP traffic is included. Is an exponential "packet milestone" suitable for real-time classification?

A High Accurate Machine-Learning Algorithm for Identifying Application Traffic in Early Stage
Nen-Fu Huang+, Gin-Yuan Jai+, and Han-Chieh Chao* +Department of Computer Science, National Tsing Hua University, Taiwan *Department of Electronics, National Ilan University, Taiwan

Classification in Early Stage
Goal: capture the protocol-handshaking characteristics of each flow from an L7 perspective. Flow id: the tuple (sip, sport, dip, dport, protocol). Statistical information of each flow over the first k rounds: elapsed time, transmitted size, throughput, response time, inter-arrival time.
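A small sketch of the flow id and per-round statistics follows. Several points are assumptions rather than details from the slides: a "round" is taken to be a maximal run of consecutive packets from the same sender, both directions of a connection are mapped to one flow id, and timestamps are integer milliseconds. Throughput can be derived from the size and elapsed time of a round.

```python
from typing import Dict, List, NamedTuple, Tuple


class Pkt(NamedTuple):
    """Hypothetical packet record; timestamps are integer milliseconds."""
    ts_ms: int
    sip: str
    sport: int
    dip: str
    dport: int
    proto: str
    size: int


FlowId = Tuple[str, int, str, int, str]


def flow_id(p: Pkt) -> FlowId:
    """5-tuple key, ordered so both directions of a connection share one id."""
    a, b = sorted([(p.sip, p.sport), (p.dip, p.dport)])
    return (a[0], a[1], b[0], b[1], p.proto)


def first_k_rounds(packets: List[Pkt], k: int) -> List[Dict[str, int]]:
    """Per-round statistics for the first k rounds of a single flow."""
    rounds: List[Dict[str, int]] = []
    sender = None
    prev_ts = None
    for p in sorted(packets, key=lambda q: q.ts_ms):
        src = (p.sip, p.sport)
        if src != sender:                      # direction changed: a new round starts
            if len(rounds) == k:
                break
            response = 0 if prev_ts is None else p.ts_ms - prev_ts
            rounds.append({"start_ms": p.ts_ms, "size": 0,
                           "elapsed_ms": 0, "response_ms": response})
            sender = src
        current = rounds[-1]
        current["size"] += p.size
        current["elapsed_ms"] = p.ts_ms - current["start_ms"]
        prev_ts = p.ts_ms
    return rounds


A, B = ("10.0.0.1", 5000), ("10.0.0.2", 80)
pkts = [
    Pkt(0, *A, *B, "tcp", 60),
    Pkt(100, *B, *A, "tcp", 60),
    Pkt(200, *A, *B, "tcp", 500),
    Pkt(250, *A, *B, "tcp", 500),
    Pkt(500, *B, *A, "tcp", 1500),
]
rounds = first_k_rounds(pkts, k=3)
```

Because only the first k rounds are needed, the statistics are available while the flow is still active, which is what makes early-stage classification possible.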

Rule-based Machine Learning
Rule-based ML (supervised machine learning). The generated rules suit the intrinsic architecture of firewalls and IDS/IPS. Rules generated by an ML algorithm also help in understanding potential characteristics of application protocols. Algorithms: One Rule, PART, Ripple Down, DecisionTable, ConjunctiveRule, Ripper…
ML name: accuracy
PART: 85.58%
Ripple Down: 82.94%
Ripper: 81.8%
One R: 69.19%
Conjunctive Rule: [value not recoverable]

Experiment Architecture
[Figure: experiment pipeline. Stages shown: traffic dump (payload included); flow preprocessing; flow sets; sampling; sample set; random split for 10-fold cross validation into training sets 1…10 and test sets 1…10; machine learning; results 1…10; average result. The protocol signature is an additional input.]
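The random split and 10-fold cross validation step of this architecture can be sketched as below. The function names, the fixed seed, and the toy threshold learner are assumptions for illustration; any rule-based learner could take the place of `train_threshold`.

```python
import random
from typing import Callable, List, Sequence, Tuple

Classifier = Callable[[float], str]
TrainFn = Callable[[List[Tuple[float, str]]], Classifier]


def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> List[List[int]]:
    """Randomly split sample indices 0..n-1 into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(samples: Sequence[float], labels: Sequence[str],
                   train_fn: TrainFn, k: int = 10, seed: int = 0) -> float:
    """Train on k-1 folds, test on the held-out fold, and average the k accuracies."""
    accuracies = []
    for test_idx in k_fold_indices(len(samples), k, seed):
        held_out = set(test_idx)
        train = [(samples[i], labels[i])
                 for i in range(len(samples)) if i not in held_out]
        model = train_fn(train)
        correct = sum(model(samples[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


def train_threshold(train: List[Tuple[float, str]]) -> Classifier:
    """Toy learner: one threshold halfway between the two classes."""
    small = [x for x, y in train if y == "small"]
    large = [x for x, y in train if y == "large"]
    threshold = (max(small) + min(large)) / 2
    return lambda x: "small" if x < threshold else "large"


samples = list(range(10)) + list(range(100, 110))
labels = ["small"] * 10 + ["large"] * 10
folds = k_fold_indices(len(samples))
average_accuracy = cross_validate(samples, labels, train_threshold)
```

Every sample is used for testing exactly once, so the averaged result does not depend on a single lucky or unlucky split.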

Accuracy Comparison with Respect to Sample Set
L. Bernaille 2006

Accuracy Comparison with Respect to Sample Set (cont.)
Zhu Li ICC2007

Accuracy After Discriminators Selection

Conclusions Machine-learning-based techniques for identifying network applications are increasingly important. The focus is on real-time, protocol-level classification of application traffic. No common traffic traces are available for comparing performance against the same baseline. Expensive training is still a problem. Identifying encrypted traffic (e.g. Skype, Winny, encrypted BT) is a new challenge. Identifying the detailed behaviors of encrypted traffic is an even bigger challenge.
