Enron Corpus: A New Dataset for Email Classification By Bryan Klimt and Yiming Yang CEAS 2004 Presented by Will Lee.


1 Enron Corpus: A New Dataset for Email Classification By Bryan Klimt and Yiming Yang CEAS 2004 Presented by Will Lee

2 Introduction Motivation Related Works The Enron Corpus Methods Evaluation Thread Information Conclusion

3 Motivation
- Other corpora focus on newsgroups or personal email data
- There is no common data set for evaluating the performance of email classification
  - Previous research uses different personal data sets
- It is difficult to find data on actual email use within a company
  - Obviously, companies do not like to share their internal emails
  - Privacy concerns for people working for the company

4 Related Works
- Other corpora
  - 20 Newsgroups: http://people.csail.mit.edu/people/jrennie/20Newsgroups/
- Related papers
  - Y. Diao, H. Lu, and D. Wu, A Comparative Study of Classification Based Personal E-mail Filtering (PAKDD ’00)
  - I. Androutsopoulos, et al., An Experimental Comparison of Naïve Bayesian and Keyword-Based Anti-Spam Filtering with Personal E-mail Messages (SIGIR ’00)
  - T. Payne, Learning Email Filtering Rules with Magi (Thesis 1994)

5 20 Newsgroups
- Collection of approximately 20,000 newsgroup documents, spread evenly across 20 different newsgroups
- Sample newsgroups: comp.graphics, rec.motorcycles, rec.sport.baseball, sci.electronics, talk.politics.misc, talk.religion.misc, etc.
- Originally used in Ken Lang’s paper “NewsWeeder: Learning to Filter Netnews” (ICML 1995)
- A data set of newsgroup posts, so probably not very useful for research in personal information management

6 Enron Dataset
- 619,446 messages (200,399 after cleaning) by 158 users
- Average of 757 messages per user
- Shows that most users do use folders to organize their emails
- Folder information can be used to evaluate the effectiveness of folder classification

7 Enron Corpus’ Characteristics
- The number of messages per user varies from a few messages to 10K+ messages
- The upper bound on the number of folders seems to correlate with log(# of messages)
- The number of messages does not correlate with the lower bound (a user can have many messages but only a few folders)
- Question: how can we use this kind of information?

8 Email Classification Features
- Constructive text
  - Bag-of-words (BOW) approach; the most frequently used feature
  - Some fields are more important than others
  - Stemming and stop-word removal are used, but their effectiveness is not proven
- Categorical text
  - “to” and “from” fields
  - BOW; useful for classification, but not as useful as constructive text
- Numeric data
  - Size of message, number of replies, number of words, etc.
  - Not very useful
- Thread information
  - Indicates how messages relate to each other
  - Not fully exploited

9 Email Features (Example)

From: Mark Hills
Subject: Re: When is the first lecture? When will the course page be updated?
Date: Thu, 26 Aug 2004 13:41:09 -0500
Lines: 11
Message-ID:
References:
In-Reply-To:

Joshua Blatt wrote:
> When is the first lecture? When will the course page be updated?
>
> Thanks
>
> Josh

The first lecture was today, during the normally scheduled time.

Mark

Labels on the slide: Categorical text, Contextual text, Numeric data, Thread information
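As a rough illustration of how the four feature groups might be pulled out of a message like the one on this slide, here is a minimal sketch using Python's standard `email` module. The addresses and message IDs are made-up placeholders (the slide does not show them), and the grouping into `categorical`, `text`, `numeric`, and `thread` is an assumption about how the slide's labels map onto header fields, not the paper's actual pipeline.

```python
from email import message_from_string

# Hypothetical message modeled on the slide; the address and the
# Message-ID / References / In-Reply-To values are invented placeholders.
RAW = """\
From: Mark Hills <mark@example.edu>
Subject: Re: When is the first lecture?
Date: Thu, 26 Aug 2004 13:41:09 -0500
Message-ID: <reply-1@example.edu>
References: <orig-1@example.edu>
In-Reply-To: <orig-1@example.edu>

Joshua Blatt wrote:
> When is the first lecture?

The first lecture was today, during the normally scheduled time.
"""


def extract_feature_groups(raw):
    """Split one raw message into the four feature groups named on the slides."""
    msg = message_from_string(raw)
    body = msg.get_payload()  # plain string for a non-multipart message
    return {
        # Sender/recipient fields, treated as categorical text.
        "categorical": {"from": msg.get("From", ""), "to": msg.get("To", "")},
        # Free text that a bag-of-words model would consume.
        "text": {"subject": msg.get("Subject", ""), "body": body},
        # Simple numeric descriptors of the message.
        "numeric": {"num_words": len(body.split()), "num_chars": len(body)},
        # Headers that tie the message to its thread.
        "thread": {
            "message_id": msg.get("Message-ID", ""),
            "in_reply_to": msg.get("In-Reply-To", ""),
            "references": msg.get("References", "").split(),
        },
    }


features = extract_feature_groups(RAW)
```

Keeping the groups separate makes it easy to build one feature vector per field and weight the fields differently later.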

10 Classification Method
- Vector space model with SVM
- Vector weight w_i is computed using the SMART “ltc” scheme (http://people.csail.mit.edu/people/jrennie/ecoc-svm/smart.html), which means:
  - l: new-tf = ln(tf) + 1.0
  - t: new-wt = new-tf * log(num-docs / coll-freq-of-term)
  - c: divide each new-wt by sqrt(sum of (new-wts squared))
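The three “ltc” steps can be sketched in a few lines of Python. This is only an illustration of the weighting formula on this slide, not the authors' code; the function name `ltc_weights` and the toy documents are invented, and `coll-freq-of-term` is assumed to mean the number of documents that contain the term.

```python
import math
from collections import Counter


def ltc_weights(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    num_docs = len(docs)
    # Assumption: "coll-freq-of-term" = number of documents containing the term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        term_freq = Counter(doc)
        # l: new-tf = ln(tf) + 1.0
        # t: new-wt = new-tf * log(num-docs / coll-freq-of-term)
        raw = {
            term: (math.log(count) + 1.0) * math.log(num_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
        # c: divide each weight by the Euclidean norm of the document vector.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append(
            {term: (w / norm if norm > 0 else 0.0) for term, w in raw.items()}
        )
    return weighted


# Toy corpus (hypothetical tokens).
docs = [
    ["meeting", "budget", "budget"],
    ["meeting", "lunch"],
    ["gas", "pipeline", "gas"],
]
vectors = ltc_weights(docs)
```

After the cosine step every non-empty document vector has unit length, so the dot product of two vectors is directly their cosine similarity.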

11 Classification Method (Cont.)
- Sort messages in chronological order, then split them into training and test sets
- Run SVM on term-weighted vectors of:
  - From
  - Subject
  - Body
  - To, CC
  - All fields
- Linear regression on all fields seems to have the best performance
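A minimal sketch of the two ideas on this slide, under stated assumptions: the split fraction, the dictionary layout of a message, and the field weights below are all hypothetical, since the slides do not give them. "Linear regression on all fields" is read here as a weighted linear combination of per-field classifier scores, which is one plausible interpretation rather than a confirmed detail.

```python
from datetime import datetime


def chronological_split(messages, train_fraction=0.5):
    """Sort by date and put the earliest messages in the training set."""
    ordered = sorted(messages, key=lambda m: m["date"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def combine_field_scores(field_scores, weights):
    """Weighted linear combination of per-field classifier scores."""
    return sum(weights[field] * score for field, score in field_scores.items())


# Hypothetical messages; only the date matters for the split.
messages = [
    {"id": "m3", "date": datetime(2001, 5, 3)},
    {"id": "m1", "date": datetime(2001, 5, 1)},
    {"id": "m4", "date": datetime(2001, 5, 4)},
    {"id": "m2", "date": datetime(2001, 5, 2)},
]
train, test = chronological_split(messages)

# Hypothetical per-field scores for one (message, folder) pair, and
# hypothetical weights such as a regression might produce.
scores = {"from": 1.0, "subject": -1.0, "body": 0.5}
weights = {"from": 0.25, "subject": 0.25, "body": 0.5}
combined = combine_field_scores(scores, weights)
```

Splitting by time rather than at random mimics real use, where a classifier trained on past mail has to file future mail.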

12 Clustering Effectiveness

13 Number of Messages vs. F1
- The number of messages does not directly correlate with accuracy
- Question: What about the case where the user has only one folder, which makes classification trivial?

14 Number of Folders vs. F1
- There is a correlation between the number of folders and the F1 score.
- Question: Is this trivial as well? Some elements in the messages are not modeled, since the SVM has more messages to train on.

15 Thread Information
- 200,399 messages, 101,786 threads, 71,696 threads with only one message
- 61.63% of the messages in the corpus are in a thread
- Average thread size is 4.1 messages (presumably counting only threads with more than one message, since 200,399 / 101,786 is only about 1.97)
- Average number of folders per thread is 1.37 (meaning most messages of a thread stay in one folder)
- Question: It is not clear how threads are detected. How can we use this information?

16 More Thread
- D. Lewis, et al., Threading Electronic Mail: A Preliminary Study (1997)
- Lewis studied finding the parent message using a BOW, TF/IDF-weighted, vector space approach on constructive text
- Formulas shown on the slide (not reproduced in the transcript): document weight, query weight, similarity
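The general idea of picking a parent by text similarity can be sketched as follows. This is not Lewis's exact method: the slide's weighting formulas are not available in the transcript, so the sketch assumes a plain TF-IDF weight with an added constant of 1.0 and cosine similarity, and all function names and tokens are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(token_lists):
    """Plain TF-IDF vectors; the +1.0 keeps terms found in every message nonzero."""
    n = len(token_lists)
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    return [
        {t: count * (math.log(n / doc_freq[t]) + 1.0)
         for t, count in Counter(tokens).items()}
        for tokens in token_lists
    ]


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def guess_parent(messages, reply_index):
    """messages: token lists in chronological order.

    Returns the index of the earlier message most similar to the reply,
    or None if there is no earlier message.
    """
    vectors = tfidf_vectors(messages)
    return max(
        range(reply_index),
        key=lambda i: cosine(vectors[reply_index], vectors[i]),
        default=None,
    )


# Hypothetical tokenized messages, oldest first.
messages = [
    ["budget", "meeting", "friday"],
    ["first", "lecture", "course", "page"],
    ["first", "lecture", "was", "today"],
]
parent_of_last = guess_parent(messages, 2)
parent_of_first = guess_parent(messages, 0)
```

The reply acts as the query and each earlier message as a candidate document, which matches the document-weight, query-weight, and similarity framing on the slide.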

17 More Thread (Cont.)
- Lewis’ work assumes that the thread information in the message header is incomplete. This may not be the case.
- An algorithm by Jamie Zawinski, used in Netscape Mail and News 2.0 and 3.0 (maybe in recent Mozilla as well?), can group threaded messages effectively: http://www.jwz.org/doc/threading.htm
- Questions
  - How can we leverage the thread information in email messages more effectively?
  - Does this model extend to more recent forms of conversation, such as blogs and web forums?
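To show what header-based threading looks like in practice, here is a much-simplified sketch in the spirit of Zawinski's algorithm. It only links each message to a parent via the last `References` entry, falling back to `In-Reply-To`, and then groups messages by root; the real algorithm also handles missing messages, subject-based grouping, and pruning. The message dictionaries and IDs are hypothetical.

```python
def build_threads(messages):
    """Group message IDs into threads using References / In-Reply-To.

    messages: list of dicts with an 'id' key and optional 'references'
    (list of IDs, oldest first) and 'in_reply_to' keys.
    Returns {root_id: [message ids in input order]}.
    """
    parent = {}
    for m in messages:
        refs = m.get("references") or []
        # The last References entry is the direct parent; otherwise
        # fall back to In-Reply-To (None if neither header is present).
        parent[m["id"]] = refs[-1] if refs else m.get("in_reply_to")

    def root(message_id):
        seen = set()
        # Walk up the parent chain; 'seen' guards against header loops.
        while parent.get(message_id) is not None and message_id not in seen:
            seen.add(message_id)
            message_id = parent[message_id]
        return message_id

    threads = {}
    for m in messages:
        threads.setdefault(root(m["id"]), []).append(m["id"])
    return threads


# Hypothetical messages.
messages = [
    {"id": "<a>"},
    {"id": "<b>", "in_reply_to": "<a>"},
    {"id": "<c>", "references": ["<a>", "<b>"]},
    {"id": "<d>"},
]
threads = build_threads(messages)
```

If a parent ID points to a message that is not in the collection, that ID simply becomes the thread root, so replies to a missing message still end up grouped together.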

18 Conclusion
- Pros
  - Introduces a new corpus that can be useful for evaluating classification performance on a large collection of personal mail
  - Unlike small collections of personal mail, the corpus can also be used to analyze behavior within a company
- Cons
  - Details on how the SVM was run and on the linear weights for the various fields are missing
  - It is not clear how threads are detected

