Analyzing Behavioral Features for Email Classification.


1 Analyzing Behavioral Features for Email Classification

2 Steve Martin, Anil Sewani, Blaine Nelson, Karl Chen, and Anthony Joseph {steve0, anil, nelsonb, quarl, adj}@cs.berkeley.edu University of California at Berkeley

3 The Problem: Email Abuse Email has become globally ubiquitous –By 2006, email traffic is expected to surge to 60 billion messages daily. However, spam accounts for half the email sent on a daily basis worldwide. Nearly all of the most virulent worms of 2004 spread by email. Email system abuse results in huge damage costs.

4 Current Email Analysis Many current methods for detecting email abuse examine characteristics of incoming email. Example: Spam Detection –Calculate statistical features on received mail and classify each message separately. Example: Virus Scanning –Generate a hash value on each incoming message, compare with stored database of values. –Signatures must be predetermined by human analyst. Can be effective, but room for improvement.
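The hash-lookup idea in the virus-scanning example can be sketched roughly as follows. This is an illustration only: the slides do not say which hash function or storage is used, so SHA-256 and the in-memory set `KNOWN_BAD` are assumptions, and the "malicious" payload is a placeholder.

```python
import hashlib

# Hypothetical signature database: digests of payloads a human analyst
# has already flagged. Real scanners use far richer signatures.
KNOWN_BAD = {hashlib.sha256(b"malicious attachment bytes").hexdigest()}


def is_known_virus(message_bytes, signatures=KNOWN_BAD):
    """Return True if the message's digest matches a stored signature."""
    return hashlib.sha256(message_bytes).hexdigest() in signatures


exact_copy = b"malicious attachment bytes"
variant = b"malicious attachment bytes!"  # one extra byte
```

An exact copy is caught while the one-byte variant is not, which is one reason the slide notes room for improvement: a signature must exist before a threat can be matched.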

5 Our Approach Huge corpus of ignored data: outgoing email! –Can’t profile user email behavior with incoming email. –Outgoing email contains this information. Calculate features on outgoing email. –Observe a wide variety of statistics. Build a statistical understanding of user behavior. –Use to classify email sent by individual users. –Can detect sudden changes in behavior, such as worm/spam activity.
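As a rough illustration of detecting a sudden change in behavior, the sketch below compares a new observation against a user's own history with a z-score. The slides do not specify a model at this point, so the z-score test, the threshold of 3.0, and the sample numbers are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def is_anomalous(history, current, threshold=3.0):
    """Flag `current` if it lies more than `threshold` sample standard
    deviations from the mean of the user's past values."""
    mu = mean(history)
    sigma = stdev(history)  # needs at least two past observations
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > threshold


# Hypothetical history: emails sent per hour by one user.
sent_per_hour = [4, 6, 5, 7, 5, 6, 4, 5]
```

With this history, a value of 6 is unremarkable, while a burst of 120 messages in an hour (as a mass-mailing worm might produce) is flagged.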

6 Ex. Outgoing Email Features
Per-Email Features:
–Email Contains HTML?
–Email Contains Scripts?
–Email Contains Images?
–Email Contains Links?
–MIME Types of Attachments
–Number of Attachments
–Number of Words in Body
–Number of Words in Subject
–Number of Chars in Subject
–...
Per-User Features (calc’d over a window of email):
–Frequency of Email Sending
–No. of Unique ‘To’ Addr.
–No. of Unique ‘From’ Addr.
–Ratio Emails w/ Attachments
–Average Word Length
–Avg. No. of Words/Body
–Avg. No. of Words/Subject
–Variance in Word Length
–Variance in No. Words/Body
–...
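A few of the features listed above could be computed along these lines. The `SentEmail` record, its field names, and the substring checks for HTML and links are hypothetical simplifications, not the authors' extraction code; a real system would parse MIME messages properly.

```python
from dataclasses import dataclass, field
from statistics import mean, pvariance


@dataclass
class SentEmail:  # hypothetical record for one outgoing message
    sender: str
    recipients: list
    subject: str
    body: str
    attachment_types: list = field(default_factory=list)  # MIME types
    sent_at: float = 0.0  # seconds since some reference time


def per_email_features(e):
    body = e.body.lower()
    return {
        "contains_html": "<html" in body,  # crude substring check
        "contains_links": "http://" in body or "https://" in body,
        "attachment_mime_types": sorted(e.attachment_types),
        "num_attachments": len(e.attachment_types),
        "words_in_body": len(e.body.split()),
        "words_in_subject": len(e.subject.split()),
        "chars_in_subject": len(e.subject),
    }


def per_user_features(window):
    """Statistics over a non-empty window of one user's sent email."""
    word_lengths = [len(w) for e in window for w in e.body.split()]
    body_counts = [len(e.body.split()) for e in window]
    times = [e.sent_at for e in window]
    hours = (max(times) - min(times)) / 3600
    return {
        "emails_per_hour": len(window) / hours if hours > 0 else None,
        "unique_to": len({r for e in window for r in e.recipients}),
        "unique_from": len({e.sender for e in window}),
        "ratio_with_attachments":
            sum(1 for e in window if e.attachment_types) / len(window),
        "avg_word_length": mean(word_lengths) if word_lengths else 0.0,
        "avg_words_per_body": mean(body_counts),
        "var_words_per_body": pvariance(body_counts),
    }


window = [
    SentEmail("a@x", ["b@x", "c@x"], "Lunch plans", "Are you free at noon"),
    SentEmail("a@x", ["b@x"], "Report", "See attached",
              ["application/pdf"], 3600.0),
    SentEmail("a@x", ["d@x"], "Re: Report", "Thanks", [], 7200.0),
]
user_stats = per_user_features(window)
```

Per-email features describe a single message, while per-user features summarize a window of messages, which is what lets a classifier notice that a user's overall pattern has shifted.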

7 1. Histogram Analysis Histograms of separate users over specific features allow similarity estimation. Example below: on left, two users, same feature. On right, difference between values. –Shows how these users differ over this feature. –Can use to detect differences in behavior between these two users.
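The comparison described here can be sketched as below: build normalized histograms for two users over one feature using shared bins, then take the per-bin difference. The bin edges, the sample values, and the use of the summed absolute difference as a single dissimilarity score are assumptions for illustration; the slides do not name a specific distance.

```python
def histogram(values, edges):
    """Normalized histogram over bins [edges[i], edges[i+1]);
    the last bin also includes its right edge."""
    counts = [0] * (len(edges) - 1)
    last = len(counts) - 1
    for v in values:
        for i in range(len(counts)):
            in_bin = edges[i] <= v < edges[i + 1]
            if in_bin or (i == last and v == edges[-1]):
                counts[i] += 1
                break
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


# Hypothetical feature: number of words in body, for two users.
user_a = [10, 12, 15, 40, 42]
user_b = [90, 95, 110, 120, 45]
edges = [0, 50, 100, 150]

hist_a = histogram(user_a, edges)
hist_b = histogram(user_b, edges)
difference = [a - b for a, b in zip(hist_a, hist_b)]
l1_distance = sum(abs(d) for d in difference)
```

Sharing the bin edges between users is what makes the per-bin difference meaningful; a larger total difference suggests the two users behave less alike on that feature.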

8 Per-Feature Histograms

9 2. Covariance Analysis Goal: identify features that vary most significantly with the labels. Method 1: Principal Component Analysis (PCA) –Determines a linear combination of relevant features that maximizes variance. –Does not take labels or redundancy into account. Method 2: Directions of Max Covariance –Determines directions in feature space that maximize the covariance between data and labels. –Modified to take potential feature redundancy into account.
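The contrast between the two methods can be written out as follows. The notation is assumed rather than taken from the slides: x_i is the feature vector of email i, y_i its scalar label, and n the number of emails.

```latex
% PCA: direction of maximum variance in the data alone (labels unused)
w_{\mathrm{PCA}} = \arg\max_{\|w\|=1} \; w^{\top} C_{xx}\, w,
\qquad
C_{xx} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^{\top}

% Max covariance: direction whose projection covaries most with the label
w_{\mathrm{cov}} = \arg\max_{\|w\|=1} \; \bigl| w^{\top} c_{xy} \bigr|,
\qquad
c_{xy} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})

% For a scalar label the maximizer has a closed form (Cauchy-Schwarz)
w_{\mathrm{cov}} = \pm \frac{c_{xy}}{\|c_{xy}\|}
```

Because C_xx never involves y, PCA can favor a high-variance feature that says nothing about the labels, whereas the covariance direction weights each feature by how strongly it moves with the label.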

10 Greedy Feature Ranking Rank features with a simple greedy approach using Directions of Max Covariance: –Rank features by their contribution to the first principal component of the covariance matrix cov[data, labels].
Feature Ranking Algorithm:
  Set F = all features
  While F is not empty:
    CovMat = empirical covariance matrix
    V = principal component vector of CovMat via SVD
    Select the feature f that contributes most to V
    Modify (deflate) CovMat to eliminate redundancy
    F = F - f
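One possible reading of the ranking loop is sketched below under several assumptions that the slides do not confirm: labels are scalar (so the principal direction of cov[data, labels] is just the normalized covariance vector and no SVD call is needed), features are standardized first, and deflation is done by projecting the chosen feature out of the remaining feature columns.

```python
from math import sqrt


def standardize(column):
    """Center a column and scale it to unit variance (if not constant)."""
    n = len(column)
    mu = sum(column) / n
    centered = [v - mu for v in column]
    sd = sqrt(sum(v * v for v in centered) / n)
    return [v / sd for v in centered] if sd > 0 else centered


def greedy_rank(columns, labels):
    """Return feature indices ordered from most to least informative."""
    n = len(labels)
    y_mean = sum(labels) / n
    y = [v - y_mean for v in labels]
    cols = {j: standardize(c) for j, c in enumerate(columns)}
    ranking = []
    while cols:
        # Covariance of each remaining (deflated) feature with the labels.
        cov = {j: sum(a * b for a, b in zip(c, y)) / n
               for j, c in cols.items()}
        best = max(cols, key=lambda j: abs(cov[j]))
        ranking.append(best)
        chosen = cols.pop(best)
        norm_sq = sum(v * v for v in chosen)
        if norm_sq > 0:
            # Assumed deflation: remove the chosen feature's direction
            # from every remaining feature so duplicates lose weight.
            for j in list(cols):
                beta = sum(a * b for a, b in zip(cols[j], chosen)) / norm_sq
                cols[j] = [a - beta * b for a, b in zip(cols[j], chosen)]
    return ranking


labels = [0, 0, 0, 0, 1, 1, 1, 1]
strong = [1, 2, 1, 2, 8, 9, 8, 9]
duplicate = [2 * v for v in strong]  # redundant copy of `strong`
weak = [1, 3, 2, 4, 2, 4, 3, 5]
ranking = greedy_rank([strong, duplicate, weak], labels)
```

In this toy data the redundant copy of the strongest feature drops to last place once its twin has been selected, so the weaker but independent feature is ranked second. That is the effect the deflation step is meant to achieve.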

11 Feature Ranking Results

12 Application: Worm Detection Can apply statistical learning on outgoing email to detect/prevent novel worm propagation. –Success depends on ability of features to identify anomalous behavior. Constructed training/test sets of real email traffic artificially ‘infected’ with viruses. Applied feature selection techniques, then tested with different models.

13 Example Results Features added greedily using selection algorithm. Graphs show that there exists an optimal set of features, beyond which performance decreases. [Figure panels: Support Vector Machines; Naïve Bayes Classifier]

14 Conclusions and Future Work Conclusion: analysis of email behavior could have many applications. –Feature selection is extremely important to model performance. In the future, study effects of feature selection on classification accuracy for other statistical models. Try similar analysis on existing anti-spam solutions. Cluster user behavior into sets of common models describing general behavior patterns.

