Analyzing Behavioral Features for Email Classification
Steve Martin, Anil Sewani, Blaine Nelson, Karl Chen, and Anthony Joseph
{steve0, anil, nelsonb, quarl, …}
University of California at Berkeley
The Problem: Email abuse has become globally ubiquitous.
–By 2006, email traffic is expected to surge to 60 billion messages daily. However, spam accounts for half of the email sent on a daily basis worldwide.
–Nearly all of the most virulent worms of 2004 spread by email.
–Email system abuse results in huge damage costs.
Current Email Analysis
Many current methods for detecting email abuse examine characteristics of incoming email.
Example: Spam Detection
–Calculate statistical features on received mail and classify each message separately.
Example: Virus Scanning
–Generate a hash value for each incoming message and compare it with a stored database of values (sketched below).
–Signatures must be predetermined by a human analyst.
These methods can be effective, but there is room for improvement.
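For illustration, a toy sketch of the hash-based scanning described above (the signature set is a hypothetical placeholder, not a real database):

import hashlib

# Hex SHA-1 digests of known-malicious messages (hypothetical placeholders).
KNOWN_SIGNATURES = set()

def is_known_virus(raw_message: bytes) -> bool:
    # Hash the incoming message and look it up in the signature database.
    digest = hashlib.sha1(raw_message).hexdigest()
    return digest in KNOWN_SIGNATURES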
Our Approach
A huge corpus of data is being ignored: outgoing email!
–Can't profile user behavior with incoming email.
–Outgoing email contains this information.
Calculate features on outgoing email.
–Observe a wide variety of statistics.
Build a statistical understanding of user behavior.
–Use it to classify email sent by individual users.
–Can detect sudden changes in behavior, such as worm/spam activity.
Example Outgoing Email Features
Per-Email Features
Contains HTML?
Contains Scripts?
Contains Images?
Contains Links?
MIME Types of Attachments
Number of Attachments
Number of Words in Body
Number of Words in Subject
Number of Chars in Subject
...
Per-User Features (calculated over a window of emails)
Frequency of Sending
No. of Unique 'To' Addresses
No. of Unique 'From' Addresses
Ratio of Emails w/ Attachments
Average Word Length
Avg. No. of Words per Body
Avg. No. of Words per Subject
Variance in Word Length
Variance in No. of Words per Body
...
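As an illustration, a minimal sketch of how a few of the per-email features above might be computed with Python's standard email parser (the feature names and the link heuristic are assumptions, not the poster's code):

from email import message_from_string

def per_email_features(raw_message: str) -> dict:
    """Compute a few of the per-email features listed above."""
    msg = message_from_string(raw_message)
    body_parts, attachment_types = [], []
    for part in msg.walk():
        if part.get_content_maintype() == "multipart":
            continue
        if part.get_filename():               # named part -> treat as attachment
            attachment_types.append(part.get_content_type())
        elif part.get_content_maintype() == "text":
            body_parts.append(part.get_payload())
    body = "\n".join(body_parts)
    subject = msg.get("Subject", "")
    return {
        "contains_html": any(p.get_content_type() == "text/html"
                             for p in msg.walk()),
        "contains_links": "http://" in body or "https://" in body,
        "num_attachments": len(attachment_types),
        "attachment_mime_types": attachment_types,
        "num_words_body": len(body.split()),
        "num_words_subject": len(subject.split()),
        "num_chars_subject": len(subject),
    }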
1. Histogram Analysis
Histograms of separate users over specific features allow similarity estimation.
Example below: on the left, two users over the same feature; on the right, the difference between their histogram values.
–Shows how these users differ over this feature.
–Can be used to detect differences in behavior between these two users.
[Figure: Per-Feature Histograms]
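A minimal sketch of the per-feature histogram comparison pictured above, assuming NumPy and shared bin edges for the two users:

import numpy as np

def histogram_difference(values_a, values_b, bins=20):
    """Normalized histograms of one feature for two users, plus the
    per-bin absolute difference used to compare their behavior."""
    lo = min(values_a.min(), values_b.min())
    hi = max(values_a.max(), values_b.max())
    edges = np.linspace(lo, hi, bins + 1)        # shared bins for both users
    h_a, _ = np.histogram(values_a, bins=edges, density=True)
    h_b, _ = np.histogram(values_b, bins=edges, density=True)
    return h_a, h_b, np.abs(h_a - h_b)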
2. Covariance Analysis
Goal: identify the features that vary most significantly with the labels.
Method 1: Principal Component Analysis (PCA)
–Determines a linear combination of relevant features that maximizes variance.
–Does not take labels or redundancy into account.
Method 2: Directions of Maximum Covariance
–Determines directions in feature space that maximize the covariance between data and labels.
–Modified to take potential feature redundancy into account.
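For concreteness, a small sketch contrasting the two directions (illustrative only; with a single label column, the direction of maximum covariance reduces to the normalized cross-covariance vector):

import numpy as np

def pca_direction(X):
    """First principal component: the direction of maximum variance
    in the data, computed without looking at the labels."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]

def max_covariance_direction(X, y):
    """Unit direction w maximizing cov(Xw, y); for a scalar label this
    is proportional to the cross-covariance vector X^T y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    w = Xc.T @ yc
    return w / np.linalg.norm(w)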
Greedy Feature Ranking
Rank features with a simple greedy approach using Directions of Maximum Covariance:
–Rank features by their contribution to the first principal component of the covariance matrix cov[data, labels].
Feature Ranking Algorithm
Set F = all features
While F is not empty:
    CovMat = empirical covariance matrix
    V = principal component vector of CovMat via SVD
    Select feature f from principal component V
    Modify (deflate) CovMat to eliminate redundancy
    F = F - f
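A minimal runnable sketch of the ranking loop above. The poster does not specify the exact deflation rule, so Hotelling deflation is assumed here:

import numpy as np

def rank_features(X, y):
    """Greedily rank the columns of X by their contribution to the
    first principal component of the empirical covariance of the
    data and labels."""
    Z = np.column_stack([X, y]).astype(float)
    cov = np.cov(Z, rowvar=False)            # empirical covariance matrix
    d = X.shape[1]
    remaining = set(range(d))
    ranking = []
    while remaining:
        # Principal component of the (deflated) covariance matrix via SVD.
        U, s, _ = np.linalg.svd(cov)
        v = U[:, 0]
        # Select the unranked feature with the largest loading.
        f = max(remaining, key=lambda j: abs(v[j]))
        ranking.append(f)
        remaining.discard(f)
        # Deflate to eliminate redundancy with the direction just used
        # (Hotelling deflation; an assumption, not the poster's exact rule).
        cov = cov - s[0] * np.outer(v, v)
    return ranking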
[Figure: Feature Ranking Results]
Application: Worm Detection
Statistical learning on outgoing email can be applied to detect/prevent novel worm propagation.
–Success depends on the ability of the features to identify anomalous behavior.
Constructed training/test sets of real email traffic artificially 'infected' with viruses.
Applied feature selection techniques, then tested with different models.
Example Results
Features were added greedily using the selection algorithm.
The graphs show that there exists an optimal set of features, beyond which performance decreases.
[Figures: Support Vector Machines; Naïve Bayes Classifier]
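A hedged sketch of the evaluation loop these results describe: train a Support Vector Machine and a Naïve Bayes classifier on growing prefixes of the ranked feature list and record test accuracy (scikit-learn usage and all variable names are assumptions, not the poster's actual code):

from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

def accuracy_vs_num_features(X_train, y_train, X_test, y_test, ranking):
    """For each prefix of the ranked features, fit both models and
    record (num_features, model_name, test_accuracy)."""
    results = []
    for k in range(1, len(ranking) + 1):
        cols = ranking[:k]                      # top-k ranked features
        for name, model in [("SVM", SVC()), ("NB", GaussianNB())]:
            model.fit(X_train[:, cols], y_train)
            acc = accuracy_score(y_test, model.predict(X_test[:, cols]))
            results.append((k, name, acc))
    return results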
Conclusions and Future Work
Conclusion: analysis of email behavior could have many applications.
–Feature selection is extremely important to model performance.
Future work:
–Study the effects of feature selection on classification accuracy for other statistical models.
–Try similar analysis on existing anti-spam solutions.
–Cluster user behavior into sets of common models describing general behavior patterns.