Enron Corpus: A New Dataset for Email Classification. Bryan Klimt and Yiming Yang, CEAS 2004. Presented by Will Lee.

Introduction, Motivation, Related Works, The Enron Corpus, Methods, Evaluation, Thread Information, Conclusion

Motivation
- Other email corpora focus on newsgroups or personal data
- There is no common dataset for evaluating email classification performance; previous research used different personal datasets
- It is difficult to study actual email use within a company: companies do not like to share their internal emails, and there are privacy concerns for the people working there

Related Works
- Other corpora: 20 Newsgroups
- Related papers:
  - Y. Diao, H. Lu, and D. Wu, "A Comparative Study of Classification Based Personal E-mail Filtering" (PAKDD '00)
  - I. Androutsopoulos et al., "An Experimental Comparison of Naive Bayesian and Keyword-Based Anti-Spam Filtering with Personal E-mail Messages" (SIGIR '00)
  - T. Payne, "Learning Email Filtering Rules with Magi" (thesis, 1994)

20 Newsgroups
- Collection of approximately 20,000 newsgroup documents, spread evenly across 20 different newsgroups
- Sample newsgroups: comp.graphics, rec.motorcycles, rec.sport.baseball, sci.electronics, talk.politics.misc, talk.religion.misc, etc.
- Originally used in Ken Lang's "Newsweeder: Learning to Filter Netnews" paper (ICML 1995)
- Newsgroup data, so probably not very useful for research in personal information management

Enron Dataset
- 619,446 messages (200,399 after cleaning) from 158 users
- Average of 757 messages per user
- Shows that most users do use folders to organize emails
- The folder information can be used to evaluate effectiveness at folder classification

Enron Corpus' Characteristics
- The number of messages per user varies from a few messages to 10K+ messages
- The upper bound on a user's number of folders seems to correlate with log(# of messages)
- The number of messages does not correlate with the lower bound (a user can have many messages but only a few folders)
- Question: how can we use this kind of information?

Classification Features
- Constructive text: the bag-of-words (BOW) approach, the most commonly used feature; some fields are more important than others; stemming and stop-word removal are used, but their effectiveness is not proven
- Categorical text: the "to" and "from" fields; BOW is useful for classification here, but not as useful as for constructive text
- Numeric data: size of message, number of replies, number of words, etc.; not very useful
- Thread information: indicates how messages relate to each other; not fully exploited

Features (Example)

    From: Mark Hills
    Subject: Re: When is the first lecture? When will the course page be updated?
    Date: Thu, 26 Aug :41:
    Lines: 11
    Message-ID:
    References:
    In-Reply-To:

    Joshua Blatt wrote:
    > When is the first lecture? When will the course page be updated?
    > Thanks
    > Josh

    The first lecture was today, during the normally scheduled time.
    Mark

Labels on the slide: categorical text, contextual text, numeric data, thread information
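The four feature groups can be sketched with Python's standard-library email parser; the message below is illustrative (hypothetical addresses and ids), not from the corpus:

```python
import email

RAW = """\
From: mark.hills@example.com
To: josh@example.com
Subject: Re: When is the first lecture?
In-Reply-To: <abc123@example.com>

The first lecture was today, during the normally scheduled time.
"""

def extract_features(raw):
    msg = email.message_from_string(raw)
    body = msg.get_payload()
    return {
        # constructive text: the free text of the subject and body
        "constructive": (msg["Subject"] or "") + " " + body,
        # categorical text: sender and recipient addresses
        "categorical": {"from": msg["From"], "to": msg["To"]},
        # numeric data: e.g. message size and word count
        "numeric": {"size": len(raw), "words": len(body.split())},
        # thread information: reply linkage taken from the headers
        "thread": {"in_reply_to": msg["In-Reply-To"]},
    }
```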

Classification Method
- Vector space model with SVM
- Vector weight w_i is computed using the SMART "ltc" scheme:
  l: new-tf = ln(tf)
  t: new-wt = new-tf * log(num-docs / coll-freq-of-term)
  c: divide each new-wt by sqrt(sum of (new-wts squared))
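A minimal Python sketch of the "ltc" weighting exactly as written on the slide (note that with new-tf = ln(tf), terms occurring once get weight 0; the common SMART variant uses 1 + ln(tf), and the slide does not specify the log base for the idf step):

```python
import math
from collections import Counter

def ltc_weights(doc_terms, doc_freq, num_docs):
    """SMART 'ltc' weights for one document.

    l: new-tf = ln(tf)
    t: new-wt = new-tf * log(num_docs / doc-freq-of-term)
    c: cosine-normalize: divide each new-wt by sqrt(sum of new-wts squared)
    """
    tf = Counter(doc_terms)
    new_wt = {
        t: math.log(tf[t]) * math.log(num_docs / doc_freq[t])
        for t in tf
    }
    norm = math.sqrt(sum(w * w for w in new_wt.values())) or 1.0
    return {t: w / norm for t, w in new_wt.items()}
```

For example, in a toy collection of 4 documents, `ltc_weights(["price", "price", "gas"], {"price": 1, "gas": 2}, 4)` gives "gas" weight 0 (it occurs once) and all remaining weight to "price".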

Classification Method (Cont.)
- Sort messages in chronological order, then split into train and test sets
- Run SVM on term-weighted vectors of: From; Subject; Body; To and CC; all fields
- Linear regression over all fields seems to give the best performance
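The evaluation protocol can be sketched in plain Python; a cosine-similarity centroid classifier stands in for the authors' SVM so the example has no dependencies, and the toy messages are hypothetical:

```python
import math
from collections import Counter, defaultdict

def chronological_split(messages, train_frac=0.5):
    """Sort by date so the classifier never trains on future messages."""
    msgs = sorted(messages, key=lambda m: m["date"])
    cut = int(len(msgs) * train_frac)
    return msgs[:cut], msgs[cut:]

def cosine(a, b):
    num = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return num / (na * nb) if na and nb else 0.0

def classify_by_field(train, test, field):
    """Folder accuracy using a single message field (e.g. 'subject')."""
    # One raw-tf centroid per folder, built from the chosen field only.
    centroids = defaultdict(Counter)
    for m in train:
        centroids[m["folder"]].update(m[field].lower().split())
    correct = 0
    for m in test:
        vec = Counter(m[field].lower().split())
        pred = max(centroids, key=lambda f: cosine(vec, centroids[f]))
        correct += pred == m["folder"]
    return correct / len(test)
```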

Clustering Effectiveness

Number of Messages vs. F1
- The number of messages does not directly correlate with accuracy
- Question: what about the case where a user has only one folder, which makes classification trivial?

Number of Folders vs. F1
- There is a correlation between the number of folders and the F1 score
- Question: is this trivial as well, since the SVM has more messages to train on? Some elements in the messages are not modeled

Thread Information
- 200,399 messages; 101,786 threads; 71,696 threads with only one message
- 61.63% of the messages in the corpus are part of a thread
- Average thread size: 4.1 messages
- Average folders per thread: 1.37 (meaning most messages in a thread stay in one folder)
- Question: it is not clear how threads are detected; how can we use this information?
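The per-thread statistics above are straightforward to compute once each message carries a thread id (the paper does not say how threads were detected, so thread ids are assumed given in this sketch):

```python
from collections import defaultdict

def thread_stats(messages):
    """messages: iterable of (thread_id, folder) pairs."""
    threads = defaultdict(list)
    for tid, folder in messages:
        threads[tid].append(folder)
    n = len(threads)
    avg_size = sum(len(v) for v in threads.values()) / n
    # how many distinct folders each thread's messages landed in
    avg_folders = sum(len(set(v)) for v in threads.values()) / n
    singletons = sum(1 for v in threads.values() if len(v) == 1)
    return avg_size, avg_folders, singletons
```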

More on Threads
- D. Lewis et al., "Threading Electronic Mail: A Preliminary Study" (1997)
- Lewis studied finding the parent message using a BOW, TF-IDF-weighted, vector-space approach on constructive text
- The slide shows the document-weight, query-weight, and similarity formulas

More on Threads (Cont.)
- Lewis' work assumes the thread information in the message header is incomplete; that may not be the case
- The threading algorithm by Jamie Zawinski, used in the original Netscape 4.x (and perhaps in recent Mozilla as well), can group threaded messages effectively
- Questions: how can we leverage the thread information in email messages more effectively? Does this model extend to more recent forms of conversation, such as blogs and web forums?
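A heavily simplified sketch of the reference-chain grouping at the heart of Zawinski's algorithm (the real algorithm also handles missing parents, dummy containers, and subject-line grouping); the message ids below are made up:

```python
from collections import defaultdict

def build_threads(headers):
    # Link each message to the last id in its References list
    # (its immediate parent), if any.
    parent = {}
    for h in headers:
        refs = h.get("references", [])
        parent[h["message-id"]] = refs[-1] if refs else None

    def root(mid):
        # Walk up the parent chain; the seen-set guards against cycles.
        seen = set()
        while parent.get(mid) and mid not in seen:
            seen.add(mid)
            mid = parent[mid]
        return mid

    # Group every message under the root of its reference chain.
    threads = defaultdict(list)
    for h in headers:
        threads[root(h["message-id"])].append(h["message-id"])
    return dict(threads)
```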

Conclusion
- Pros:
  - Introduces a new corpus useful for evaluating email classification performance on a large collection of personal mail
  - Unlike small collections of personal mail, the corpus can also be used to analyze email behavior within a company
- Cons:
  - Details of the SVM setup and the linear weights for the various fields are missing
  - It is not clear how threads are detected