Introduction to Automatic Email Classification. Shih-Wen (George) Ke, 7th Dec 2005.

Overview: Introduction to Enron Corpus; Traditional Text Classification vs Email Classification; Recent Work on Enron Corpus; Our Work on Enron Corpus; Summary; Future Research Directions in Information Retrieval; Further Discussion.

Overview: Email classification differs in nature from traditional text classification tasks. Email is time-dependent, poorly structured and informally written, and no standard way of preparing and evaluating email datasets has been proposed.

Introduction: Automatic email classification dates back to the mid-1990s. It received little attention until recently because no standard email dataset was available. The Enron Email Corpus became available in March 2004.

Introduction – Enron Corpus: Distributed by William Cohen at Carnegie Mellon University. Consists of 517,431 messages belonging to 150 users of Enron Corporation. Most users use folders to categorise their emails. The upper bound on the number of folders appears to be the log of the number of messages (Klimt & Yang, 2004).

Email Classification: Assumptions. The task is to categorise email into folders, a.k.a. email foldering. Only personal and professional emails are considered here. Users are assumed to organise their emails with folders. Other ways of organising emails, e.g. flags or labels, are not considered here, although they may provide more information for email classification.

Recent Work on Enron Corpus: Bekkerman et al. (2004) studied mono- and multiple-classification and evaluated with accuracy (TP/N). Klimt & Yang (2004) studied multiple-classification and evaluated with precision and recall (P&R) and micro- and macro-averaged F1. Findings: SVM performed best in most cases, but the difference was not statistically significant. Newly created folders adversely affect performance. Performance does not necessarily improve as the training set grows. Incoming emails are more related to those received recently than to those received long ago. Enron is suitable for email classification evaluation. The body field is the most useful feature, followed by 'From'. Email threads can be a valuable asset to email classification, but they are difficult to detect and evaluate. Foldering strategies differ between individuals.
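The evaluation measures named on this slide can be made concrete with a small sketch. It assumes the simplified single-label case (each email has exactly one gold folder and one predicted folder) and macro-averages over every folder that appears in either list; the function names and the toy folder labels are illustrative only and are not taken from either paper.

```python
from typing import Dict, List, Tuple


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to count."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def accuracy(gold: List[str], pred: List[str]) -> float:
    """Fraction of emails filed into the correct folder (TP/N)."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def micro_macro_f1(gold: List[str], pred: List[str]) -> Tuple[float, float]:
    """Return (micro-averaged F1, macro-averaged F1) for single-label foldering."""
    folders = sorted(set(gold) | set(pred))
    counts: Dict[str, List[int]] = {f: [0, 0, 0] for f in folders}  # tp, fp, fn
    for g, p in zip(gold, pred):
        if g == p:
            counts[g][0] += 1
        else:
            counts[p][1] += 1  # false positive for the predicted folder
            counts[g][2] += 1  # false negative for the gold folder
    # Micro: pool the counts over folders, then compute one F1.
    micro = f1(*(sum(c[i] for c in counts.values()) for i in range(3)))
    # Macro: compute F1 per folder, then take the unweighted mean.
    macro = sum(f1(*c) for c in counts.values()) / len(folders)
    return micro, macro


# Hypothetical toy labels, for illustration only.
gold = ["work", "work", "work", "work", "personal", "travel"]
pred = ["work", "work", "work", "personal", "personal", "work"]
micro, macro = micro_macro_f1(gold, pred)
```

Macro-averaging gives every folder equal weight, so small folders count as much as large ones; micro-averaging is dominated by the large folders and, in this single-label setting, equals accuracy.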

Our Work on Enron Corpus - Introduction: Users sometimes forget which folders they have created or which folder an email should be filed under, so they tend to create new (duplicate) folders. Newly created folders adversely affect performance (Bekkerman et al., 2004). The aim is to reduce the likelihood of users creating duplicate folders by improving the accuracy of assigning incoming emails to the folders that were created in the first place. We compare state-of-the-art classifiers (kNN, SVM) with our own classifier, PERC, in a simulation of a real-time situation under various parameter settings.
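One way to simulate a real-time situation is to respect message order: train only on older emails and test on the ones that arrive next. The sketch below is an assumed protocol, not necessarily the one used in this work; the function name `incremental_time_splits`, the `chunk` parameter and the toy messages are hypothetical.

```python
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def incremental_time_splits(
    messages: Sequence[T], chunk: int
) -> Iterator[Tuple[List[T], List[T]]]:
    """Yield (train, test) pairs from messages sorted oldest-first.

    Each round trains on everything seen so far and tests on the next
    `chunk` messages, so no test email is older than a training email.
    """
    if chunk <= 0:
        raise ValueError("chunk must be positive")
    for end in range(chunk, len(messages), chunk):
        yield list(messages[:end]), list(messages[end:end + chunk])


# Hypothetical (timestamp, folder) pairs, deliberately out of order.
inbox = [(3, "work"), (1, "work"), (2, "travel"), (5, "work"), (4, "travel")]
ordered = sorted(inbox, key=lambda m: m[0])  # oldest first
splits = list(incremental_time_splits(ordered, chunk=2))
```

A random split would let the classifier train on emails that arrived after the test emails, which a deployed foldering assistant could never do.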

Our Work on Enron Corpus - The PERC The PERC Classifier (PERsonal email Classifier)  Find a centroid c i for each category C i  For each test document x :  Find k nearest neighbouring training documents to x  Similarity between x and the training document d j is added to similarity between x and c i  Sort similarity scores sim(x,C i ) in descending order  Decision to assign x to C i can be made using various thresholding strategies

Our Work on Enron Corpus - The PERC: In the PERC (PERsonal email Classifier) score sim(x, C_i), y(d_j, C_i) ∈ {0,1} is the classification for training document d_j with respect to category C_i; sim(x, d_j) is the similarity between test document x and training document d_j; and sim(x, c_i) is the similarity between test document x and the centroid c_i of the category that d_j belongs to.
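The steps and definitions above can be sketched in code. The scoring formula itself is not preserved in this transcript, so the sketch assumes one reading consistent with the description: sim(x, C_i) is the sum, over the k nearest neighbours d_j of x, of y(d_j, C_i) · (sim(x, d_j) + sim(x, c_i)). Cosine similarity over term-weight dictionaries, the mean-vector centroid and all function names are likewise assumptions made for illustration.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Vector = Dict[str, float]  # term -> weight


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroids(train: List[Tuple[Vector, str]]) -> Dict[str, Vector]:
    """Mean vector c_i of the training documents in each folder C_i."""
    sums: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    sizes: Dict[str, int] = defaultdict(int)
    for vec, folder in train:
        sizes[folder] += 1
        for term, w in vec.items():
            sums[folder][term] += w
    return {f: {t: w / sizes[f] for t, w in terms.items()} for f, terms in sums.items()}


def perc_scores(
    x: Vector, train: List[Tuple[Vector, str]], cents: Dict[str, Vector], k: int
) -> List[Tuple[str, float]]:
    """Rank folders for x under the assumed hybrid kNN + centroid score."""
    neighbours = sorted(train, key=lambda d: cosine(x, d[0]), reverse=True)[:k]
    scores: Dict[str, float] = defaultdict(float)
    for vec, folder in neighbours:
        # y(d_j, C_i) = 1 only for the neighbour's own folder, so each
        # neighbour contributes sim(x, d_j) + sim(x, c_i) to that folder.
        scores[folder] += cosine(x, vec) + cosine(x, cents[folder])
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy training set, for illustration only.
train = [
    ({"meeting": 1.0, "budget": 1.0}, "work"),
    ({"budget": 1.0, "report": 1.0}, "work"),
    ({"flight": 1.0, "hotel": 1.0}, "travel"),
]
ranked = perc_scores({"budget": 1.0, "meeting": 1.0}, train, centroids(train), k=2)
```

Taking the top-ranked folder is the simplest thresholding strategy; other strategies, such as a score cut-off, would operate on the same ranked list.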

Rationale for the Hybrid Approach: The centroid method overcomes data sparseness, since emails tend to be short. kNN allows the topic of a folder to drift over time. Considering the vector space locally allows matching against the features that are currently dominant.

Our Work on Enron Corpus - Results: Two SVM settings were used, SVM1 (c=1, j=1) and SVM2 (c=0.01, j=1). Micro-averaged and macro-averaged F1 over all users, with standard deviation, were reported for kNN, SVM and PERC. In the macro-averaged evaluation, PERC significantly outperformed kNN (t=2.786, p=0.032), SVM1 (t=2.533, p=0.044) and SVM2 (t=5.926, p=0.001).
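The slide does not say which t-test produced these statistics. Assuming a paired test over per-user scores (each user contributes one score per classifier), the t statistic could be computed as in the sketch below. The function name `paired_t` is hypothetical and the score lists are made-up toy values, not results from this work.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def paired_t(a: Sequence[float], b: Sequence[float]) -> float:
    """Paired t statistic for per-user scores of classifier a against b.

    The statistic has len(a) - 1 degrees of freedom.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 users")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / math.sqrt(len(diffs)))


# Made-up per-user macro-F1 scores, for illustration only.
classifier_a = [0.60, 0.70, 0.50, 0.80]
classifier_b = [0.50, 0.65, 0.50, 0.70]
t = paired_t(classifier_a, classifier_b)
```

Pairing by user removes the large between-user variation in foldering difficulty, so the test looks only at the per-user difference between two classifiers.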

Our Work on Enron Corpus - Conclusions: PERC has the highest accuracy in assigning test documents to small folders. kNN and PERC performed better with smaller k. SVM parameters can be sensitive to the number of training documents available. Next steps: investigate various parameter settings and training/test set splits, and investigate the use of time. A questionnaire-based study is being conducted to establish how real users behave in email management.

Future Research Directions in IR: use of time information; training/test set splits; feature extraction and selection; document representation; qualitative evaluation; thread detection and TDT for email; mining sequential patterns; bursts of activity (Kleinberg, 2002).

References
Bekkerman, R., McCallum, A. and Huang, G. (2004) Automatic Categorization of Email into Folders: Benchmark Experiments on Enron and SRI Corpora. Technical Report IR-418, CIIR, University of Massachusetts.
Kleinberg, J. (2002) Bursty and Hierarchical Structure in Streams. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Klimt, B. and Yang, Y. (2004) The Enron Corpus: A New Dataset for Email Classification Research. In European Conference on Machine Learning.
