Enron email datasets LING 575 Fei Xia 01/04/2011.


1 Enron email datasets LING 575 Fei Xia 01/04/2011

2 History of Enron
– Enron was formed in 1985 under the direction of Kenneth Lay.
– In 1999, Enron officials began to use the “special purpose entities” (SPE) trick.
– In Dec 2000, Jeffrey Skilling took over the position of CEO from Kenneth Lay.
– In Aug 2001, Skilling surprisingly resigned. Lay became CEO again. Watkins wrote an anonymous letter to Lay about possible fraud.
– In Oct 2001, the losses transferred from Enron to SPE totaled over $618 million. SEC started an inquiry into Enron.
– In Jan 2002, Lay resigned as chairman and CEO. Enron collapsed in the same year.
– In 2003, Enron emerged from bankruptcy as two separate companies. Most creditors would receive about 1/5 of the $67 billion they were owed.

3 History of Enron email dataset
– Made public by the Federal Energy Regulatory Commission during its investigation in May 2002.
– Later collected and prepared by SRI for the CALO project.
– William Cohen from CMU put the dataset on the web for researchers (the CMU dataset) in March 2004.
– ISI cleaned the CMU dataset and created a MySQL database (the ISI database).
– Various teams did data cleaning and annotation.

4 Several corpora
Raw data: emails between 1998 and 2002
– the CMU dataset
– the ISI database
– …
Annotated data
– Personal vs. business
– Email zoning
– …

5 The CMU dataset

6 Paper: (B. Klimt and Y. Yang, 2004)
Available at http://www.cs.cmu.edu/~enron/
Stored on patas under /corpora/enron_email_dataset/cmu/

7 CMU dataset
Raw corpus:
– 619,446 messages from 158 users
Cleanup:
– remove folders such as “discussion_threads”
– remove duplicates
Cleaned corpus:
– 200,399 messages from 158 users
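The two cleanup steps on this slide can be sketched in Python. This is only an illustration, not the procedure used to build the CMU dataset: the message fields (`folder`, `sender`, `date`, `subject`, `body`) and the duplicate test (identical sender, date, subject, and body) are assumptions, and “discussion_threads” is the only excluded folder the slide names.

```python
import hashlib

# The slide names only "discussion_threads"; other excluded folders are not listed.
EXCLUDED_FOLDERS = {"discussion_threads"}


def message_key(msg):
    """Fingerprint a message. Treating identical sender/date/subject/body
    as a duplicate is an assumption, not the documented criterion."""
    parts = [msg["sender"], msg["date"], msg["subject"], msg["body"]]
    return hashlib.sha1("\x00".join(parts).encode("utf-8")).hexdigest()


def clean(messages):
    """Drop messages in excluded folders, then drop later copies of duplicates."""
    seen = set()
    kept = []
    for msg in messages:
        if msg["folder"] in EXCLUDED_FOLDERS:
            continue
        key = message_key(msg)
        if key in seen:
            continue
        seen.add(key)
        kept.append(msg)
    return kept
```

The first copy of each message is kept, so the result depends on the order in which folders are read.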

8 Messages per user
A few people sent out a lot of messages.

9 Correlation of folders and messages
Most users do use folders to organize their emails, but their usage of folders varies a lot.

10 Distribution of thread sizes
Thread: messages with the same subject line among the same users.
Out of 200,399 messages, 61.6% are in threads (123,501 emails in 30,091 threads).
Most threads are small.
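The thread definition above (same subject line, same users) could be implemented roughly as follows. This is a sketch under stated assumptions: the message fields (`sender`, `recipients`, `subject`) are hypothetical, stripping “Re:”/“Fw:” prefixes before comparing subjects is an added assumption, and a group counts as a thread only when it has at least two messages.

```python
import re
from collections import defaultdict

# Leading reply/forward markers such as "RE: Fwd: " (assumed normalization).
_PREFIX = re.compile(r"^\s*((re|fw|fwd)\s*:\s*)+", re.IGNORECASE)


def normalize_subject(subject):
    """Strip reply/forward prefixes, surrounding whitespace, and case."""
    return _PREFIX.sub("", subject).strip().lower()


def group_threads(messages):
    """Group messages sharing a normalized subject and the same set of users."""
    groups = defaultdict(list)
    for msg in messages:
        participants = frozenset([msg["sender"], *msg["recipients"]])
        groups[(normalize_subject(msg["subject"]), participants)].append(msg)
    # A single message is not counted as a thread.
    return [group for group in groups.values() if len(group) >= 2]
```

With this definition, the share of messages in threads is the total size of the returned groups divided by the number of input messages.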

11 The ISI database

12 Paper: Shetty and Adibi’s report
Report and data are available at http://www.isi.edu/~adibi/Enron/Enron.htm
Stored on patas under $data_dir/isi/
Stored on capuchin as a MySQL database called “enron”.

13 Data cleaning
– Start from the CMU dataset
– Remove duplicate emails
– Remove folders such as “discussion_threads”, “all documents”, and “sent_mail”
– …

14 Cleaned Enron email dataset
– 252,759 emails from 151 employees, distributed in about 3000 user-defined folders
– The dataset has been used by many research groups.

15 MySQL database: four tables
– rtype: TO, CC, or BCC
– rvalue: recipient email value
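To make the rtype and rvalue fields concrete, here is a small Python sketch that counts recipients by type. It uses an in-memory SQLite database as a stand-in for the MySQL server so that it runs anywhere. Only `rtype` and `rvalue` come from the slide; the table name `recipientinfo`, the columns `rid` and `mid`, and the sample rows are assumptions for illustration.

```python
import sqlite3


def build_demo_db():
    """Create a toy recipient table (hypothetical schema and made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE recipientinfo ("
        "rid INTEGER PRIMARY KEY, mid INTEGER, rtype TEXT, rvalue TEXT)"
    )
    conn.executemany(
        "INSERT INTO recipientinfo (mid, rtype, rvalue) VALUES (?, ?, ?)",
        [
            (1, "TO", "alice@example.com"),
            (1, "CC", "bob@example.com"),
            (2, "TO", "bob@example.com"),
            (2, "BCC", "carol@example.com"),
        ],
    )
    conn.commit()
    return conn


def recipients_by_type(conn):
    """Return {rtype: number of recipient rows}."""
    rows = conn.execute(
        "SELECT rtype, COUNT(*) FROM recipientinfo GROUP BY rtype"
    ).fetchall()
    return dict(rows)
```

Against the real database the same `SELECT` would be sent through a MySQL client instead, after checking the actual table and column names.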

16 Distribution of sent emails per user
A few employees sent out a lot of messages.

17 Distribution of email over time
Notice the spike around Nov 2001.
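A distribution like this one can be computed by bucketing message timestamps by month. The sketch below assumes the send dates have already been parsed into `datetime` objects; the function names are hypothetical.

```python
from collections import Counter
from datetime import datetime


def messages_per_month(timestamps):
    """Count messages in each (year, month) bucket."""
    return Counter((ts.year, ts.month) for ts in timestamps)


def busiest_month(timestamps):
    """Return ((year, month), count) for the largest bucket, or None if empty."""
    counts = messages_per_month(timestamps)
    return counts.most_common(1)[0] if counts else None
```

Sorting the counter's items by key gives the series in chronological order for plotting.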

18 Social network

