1 KDD Cup '99: Classifier Learning
Predictive Model for Intrusion Detection
Charles Elkan, 1999 Conference on Knowledge Discovery and Data Mining
Presented by Chris Clifton

2 KDD Cup Overview
- Held annually in conjunction with the Knowledge Discovery and Data Mining conference (now ACM-sponsored)
- Challenge problem(s) released well before the conference
- Goal is to give the best solution to the problem
- Relatively informal "contest"
  - Gives a "standard" test for comparing techniques
- Winner announced at the KDD conference
  - Lots of recognition for the winner

3 Classifier Learning for Intrusion Detection
- One of the two KDD '99 challenge problems (the other was a knowledge discovery problem)
- Goal: learn a classifier that labels TCP/IP connections as intrusion or okay
- Data: a collection of features describing a TCP connection
- Class: non-attack, or the type of attack
- Scoring: cost per test sample
  - Wrong answers are penalized according to the type of "wrong"

4 Data: TCP "connection" information
- Dataset developed for the 1998 DARPA Intrusion Detection Evaluation Program
- Nine weeks of raw TCP dump data from a simulated USAF LAN
  - Simulated attacks provide the positive examples
- Processed into 5 million training "connections" and 2 million test connections
  - Some "attributes" are derived from the raw data (a loading sketch follows this slide)
- Twenty-four attack types in the training data, grouped into four classes:
  - DOS: denial of service, e.g., SYN flood
  - R2L: unauthorized access from a remote machine, e.g., guessing a password
  - U2R: unauthorized access to local superuser (root) privileges, e.g., various "buffer overflow" attacks
  - Probing: surveillance and other probing, e.g., port scanning
- The test set includes fourteen attack types not found in the training set
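As a minimal, hedged illustration of working with the processed data, the sketch below loads the connection records with pandas. The file name kddcup.data and the layout assumption (comma-separated, 41 features plus a label column whose values end in a period) describe the publicly distributed files rather than anything stated on the slide.

```python
# Minimal sketch of inspecting the processed KDD Cup '99 connection records.
# File name and layout (comma-separated, 41 features + trailing label column)
# are assumptions about the public distribution, not details from the slide.
import pandas as pd

df = pd.read_csv("kddcup.data", header=None)     # full training file (~5M rows)
labels = df.iloc[:, -1].str.rstrip(".")          # labels like "normal.", "smurf."
print(df.shape)                                  # (rows, 42)
print(labels.value_counts().head())              # attack types plus "normal"
```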

5 Basic features of individual TCP connections

feature name: description [type]
duration: length (number of seconds) of the connection [continuous]
protocol_type: type of the protocol, e.g., tcp, udp, etc. [discrete]
service: network service on the destination, e.g., http, telnet, etc. [discrete]
src_bytes: number of data bytes from source to destination [continuous]
dst_bytes: number of data bytes from destination to source [continuous]
flag: normal or error status of the connection [discrete]
land: 1 if connection is from/to the same host/port; 0 otherwise [discrete]
wrong_fragment: number of "wrong" fragments [continuous]
urgent: number of urgent packets [continuous]

6 Content features within a connection suggested by domain knowledge

feature name: description [type]
hot: number of "hot" indicators [continuous]
num_failed_logins: number of failed login attempts [continuous]
logged_in: 1 if successfully logged in; 0 otherwise [discrete]
num_compromised: number of "compromised" conditions [continuous]
root_shell: 1 if root shell is obtained; 0 otherwise [discrete]
su_attempted: 1 if "su root" command attempted; 0 otherwise [discrete]
num_root: number of "root" accesses [continuous]
num_file_creations: number of file creation operations [continuous]
num_shells: number of shell prompts [continuous]
num_access_files: number of operations on access control files [continuous]
num_outbound_cmds: number of outbound commands in an ftp session [continuous]
is_hot_login: 1 if the login belongs to the "hot" list; 0 otherwise [discrete]
is_guest_login: 1 if the login is a "guest" login; 0 otherwise [discrete]

7 Traffic features computed using a two-second time window

feature name: description [type]
count: number of connections to the same host as the current connection in the past two seconds [continuous]
Note: the following features refer to these same-host connections.
serror_rate: % of connections that have "SYN" errors [continuous]
rerror_rate: % of connections that have "REJ" errors [continuous]
same_srv_rate: % of connections to the same service [continuous]
diff_srv_rate: % of connections to different services [continuous]
srv_count: number of connections to the same service as the current connection in the past two seconds [continuous]
Note: the following features refer to these same-service connections.
srv_serror_rate: % of connections that have "SYN" errors [continuous]
srv_rerror_rate: % of connections that have "REJ" errors [continuous]
srv_diff_host_rate: % of connections to different hosts [continuous]
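As a rough illustration of how such windowed features can be derived (not the actual feature-extraction pipeline used for the DARPA data), the hedged sketch below computes count and serror_rate for one connection from a trailing two-second window. The record fields time, dst_host, and syn_error are invented for the example.

```python
# Hedged sketch: computing two of the windowed traffic features for one connection,
# given the connections seen so far. Field names (time, dst_host, syn_error) are
# assumptions for illustration, not the attributes used to build the dataset.
from dataclasses import dataclass

@dataclass
class Conn:
    time: float        # seconds
    dst_host: str
    syn_error: bool

def window_features(current: Conn, history: list[Conn], window: float = 2.0):
    """count = connections to the same host in the past `window` seconds;
    serror_rate = fraction of those connections with SYN errors."""
    recent = [c for c in history
              if current.time - window <= c.time <= current.time
              and c.dst_host == current.dst_host]
    count = len(recent)
    serror_rate = (sum(c.syn_error for c in recent) / count) if count else 0.0
    return count, serror_rate

# Tiny example: three connections to the same host within two seconds, one SYN error.
hist = [Conn(0.5, "10.0.0.5", True), Conn(1.2, "10.0.0.5", False), Conn(5.0, "10.0.0.9", False)]
print(window_features(Conn(1.9, "10.0.0.5", False), hist + [Conn(1.9, "10.0.0.5", False)]))
# -> (3, 0.333...)
```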

8 Scoring
Each prediction gets a cost:
- Row is the correct answer; column is the prediction made
- Score is the average cost over all predictions

Cost matrix (rows = actual class, columns = predicted class):
         normal  probe  DOS  U2R  R2L
normal      0      1     2    2    2
probe       1      0     2    2    2
DOS         2      1     0    2    2
U2R         3      2     2    0    2
R2L         4      2     2    2    0
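To make the scoring rule concrete, here is a small hedged sketch that computes the average per-prediction cost from the matrix above; the function and variable names, and the tiny example at the end, are illustrative and not part of the contest tooling.

```python
# Sketch of the contest scoring rule: average per-prediction cost looked up in the
# cost matrix (rows = actual class, columns = predicted class).
import numpy as np

CLASSES = ["normal", "probe", "DOS", "U2R", "R2L"]
COST = np.array([
    [0, 1, 2, 2, 2],   # actual: normal
    [1, 0, 2, 2, 2],   # actual: probe
    [2, 1, 0, 2, 2],   # actual: DOS
    [3, 2, 2, 0, 2],   # actual: U2R
    [4, 2, 2, 2, 0],   # actual: R2L
])

def average_cost(actual, predicted):
    """Mean cost over all test predictions (lower is better)."""
    idx = {c: i for i, c in enumerate(CLASSES)}
    costs = [COST[idx[a], idx[p]] for a, p in zip(actual, predicted)]
    return sum(costs) / len(costs)

# Tiny example: one DOS connection mistaken for normal costs 2.
print(average_cost(["normal", "DOS", "R2L"], ["normal", "normal", "R2L"]))  # ~0.667
```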

9 Results
Twenty-four entries were submitted. The slide charted the entries' scores together with the score of a 1-Nearest Neighbor baseline (chart not captured in this transcript).

10 Winning Method: Bagged Boosting
Submitted by Bernhard Pfahringer, ML Group, Austrian Research Institute for AI
- 50 samples drawn from the original set of roughly 5 million examples
- Contrary to standard bagging, the sampling was slightly biased:
  - all of the examples of the two smallest classes, U2R and R2L
  - 4000 examples each of PROBE, NORMAL, and DOS
  - duplicate entries in the original data set were removed
- Ten C5 decision trees were induced from each sample, using both C5's error-cost and boosting options
- Final predictions were computed from the 50 per-sample predictions by minimizing "conditional risk", i.e., the sum of error costs times class probabilities (see the sketch after this slide)
- Training took approximately one day on a two-processor 200 MHz Sparc
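As a rough illustration of the biased bag sampling and the "conditional risk" combination step, here is a hedged Python sketch. It uses invented function names and plain probability arrays rather than Pfahringer's C5-based code, and is only meant to show the arithmetic.

```python
# Hedged sketch of two ideas from the winning entry: (1) class-biased sampling for
# each bag, and (2) choosing the final label by minimizing expected cost under the
# averaged class-probability estimates ("conditional risk"). Illustrative only.
import numpy as np
import pandas as pd

CLASSES = ["normal", "probe", "DOS", "U2R", "R2L"]
COST = np.array([          # KDD'99 cost matrix: rows = actual, columns = predicted
    [0, 1, 2, 2, 2],
    [1, 0, 2, 2, 2],
    [2, 1, 0, 2, 2],
    [3, 2, 2, 0, 2],
    [4, 2, 2, 2, 0],
])

def biased_sample(df, label_col="class", per_class=4000, seed=0):
    """One training bag: keep every U2R and R2L example, subsample the big classes.
    (In the winning entry, duplicate rows were also removed beforehand.)"""
    parts = []
    for cls, group in df.groupby(label_col):
        if cls in ("U2R", "R2L") or len(group) <= per_class:
            parts.append(group)
        else:
            parts.append(group.sample(per_class, random_state=seed))
    return pd.concat(parts)

def min_risk_labels(prob_list):
    """prob_list: per-model arrays of shape (n_samples, 5), columns in CLASSES order.
    Average the probabilities, then pick the class with the lowest expected cost."""
    probs = np.mean(prob_list, axis=0)          # (n_samples, 5)
    expected_cost = probs @ COST                # expected cost of predicting each class
    return [CLASSES[j] for j in expected_cost.argmin(axis=1)]

# Example: a connection judged 60% DOS / 40% normal is labelled DOS, since predicting
# "normal" risks an expected cost of 0.6 * 2 = 1.2 versus 0.4 * 2 = 0.8 for "DOS".
print(min_risk_labels([np.array([[0.4, 0.0, 0.6, 0.0, 0.0]])]))   # -> ['DOS']
```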

11 Confusion Matrix (Breakdown of score)

12 Analysis of winning entry
- Result comparable to 1-NN except on the "rare" classes
  - The winner's training sample was biased toward the rare classes
  - Does this give us a general principle?
- Misses badly on some attack categories
  - True for 1-NN as well
  - Problem with the feature set?

13 Second and Third Places (difference probably not statistically significant)
- Itzhak Levin, LLSoft, Inc.: Kernel Miner (link broken?)
- Vladimir Miheev, Alexei Vopilov, and Ivan Shabalin, MP13, Moscow, Russia:
  - Verbal rules constructed by an expert
  - A first echelon of voting decision trees
  - A second echelon of voting decision trees
  - The steps are applied sequentially: a connection falls through to the next step only when the current one fails to recognize it (see the cascade sketch after this slide)
  - Trees were constructed using their own (previously developed) tree-learning algorithm
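As a rough sketch of the cascade idea in the MP13 entry (stages tried in order, with a connection falling through to the next stage only when the current one fails to recognize it), here is a hedged Python illustration; the stage functions and the toy rule inside them are invented placeholders, not the authors' rules or trees.

```python
# Hedged sketch of a sequential cascade of classifiers. Each stage returns a class
# label, or None when it fails to recognize the record; the first answer wins.
from typing import Callable, Iterable, Optional

Stage = Callable[[dict], Optional[str]]

def cascade_classify(record: dict, stages: Iterable[Stage], default: str = "normal") -> str:
    """Run the stages in order; fall through to the next stage on None."""
    for stage in stages:
        label = stage(record)
        if label is not None:
            return label
    return default  # nothing recognized the connection

# Placeholder stages, purely for illustration (invented rule, empty tree stand-ins).
def expert_rules(record):
    return "DOS" if record.get("src_bytes", 0) == 0 and record.get("count", 0) > 100 else None

def first_tree_echelon(record):   # stand-in for the first voting-tree ensemble
    return None

def second_tree_echelon(record):  # stand-in for the second voting-tree ensemble
    return None

print(cascade_classify({"src_bytes": 0, "count": 511},
                       [expert_rules, first_tree_echelon, second_tree_echelon]))  # -> DOS
```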

