Enron datasets LING 575 Fei Xia 01/04/2011
History of Enron
– Enron was formed in 1985 under the direction of Kenneth Lay.
– In 1999, Enron officials began to use “special purpose entities” (SPEs) to move losses off the company's books.
– In Dec 2000, Jeffrey Skilling took over the position of CEO from Kenneth Lay.
– In Aug 2001, Skilling unexpectedly resigned and Lay became CEO again. Watkins wrote an anonymous letter to Lay about possible fraud.
– In Oct 2001, the losses transferred from Enron to SPEs totaled over $618 million. The SEC started an inquiry into Enron.
– In Jan 2002, Lay resigned as chairman and CEO. Enron collapsed in the same year.
– In 2003, Enron emerged from bankruptcy as two separate companies. Most creditors would receive about 1/5 of the $67 billion they were owed.
History of Enron dataset
– Made public by the Federal Energy Regulatory Commission during its investigation in May 2002
– Later collected and prepared by SRI for the CALO project
– William Cohen from CMU put the dataset on the web for researchers (the CMU dataset) in March 2004
– ISI cleaned the CMU dataset and created a MySQL database (the ISI database)
– Various teams did data cleaning and annotation
Several corpora
Raw data: emails between 1998 and 2002
– the CMU dataset
– the ISI database
– …
Annotated data
– Personal vs. business
– zoning
– …
The CMU dataset
Paper: (B. Klimt and Y. Yang, 2004)
Available at
Stored on patas under /corpora/enron_ _dataset/cmu/
CMU dataset
Raw corpus:
– 619,446 messages from 158 users
Cleanup:
– remove folders such as “discussion_threads”
– remove duplicates
Cleaned corpus:
– 200,399 messages from 158 users
Messages per user
A few people sent out a lot of messages.
Correlation of folders and messages
Most users do use folders to organize their emails, but their usage of folders varies a lot.
Distribution of thread sizes
Thread: messages with the same subject line exchanged among the same users.
Out of 200,399 messages, 61.6% are in threads (123,501 emails in 30,091 threads).
Most threads are small.
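The thread definition above (same subject line, same users) can be sketched in Python as follows. This is only an illustration under stated assumptions: the record fields `sender`, `recipients`, and `subject` are hypothetical, stripping "Re:"/"Fw:" prefixes before comparing subjects is a choice of this sketch, and a group counts as a thread only if it has at least two messages. The paper's exact procedure may differ.

```python
import re
from collections import defaultdict

# Hypothetical message records; the field names are illustrative only.
messages = [
    {"sender": "alice", "recipients": ["bob"], "subject": "Gas forecast"},
    {"sender": "bob", "recipients": ["alice"], "subject": "RE: Gas forecast"},
    {"sender": "alice", "recipients": ["bob"], "subject": "Re: Gas forecast"},
    {"sender": "carol", "recipients": ["dave"], "subject": "Gas forecast"},
    {"sender": "dave", "recipients": ["erin"], "subject": "Lunch"},
]

# Assumption of this sketch: reply/forward prefixes are ignored.
_PREFIX = re.compile(r"^\s*((re|fw|fwd)\s*:\s*)+", re.IGNORECASE)


def thread_key(msg):
    """Return (normalized subject, set of participants) for one message."""
    subject = _PREFIX.sub("", msg["subject"]).strip().lower()
    participants = frozenset([msg["sender"], *msg["recipients"]])
    return subject, participants


def group_threads(msgs):
    """Group messages by thread key; keep groups with two or more messages."""
    groups = defaultdict(list)
    for m in msgs:
        groups[thread_key(m)].append(m)
    return [g for g in groups.values() if len(g) >= 2]


threads = group_threads(messages)
in_threads = sum(len(t) for t in threads)
share = in_threads / len(messages)  # fraction of messages that sit in a thread
```

The same `share` computation, run on the cleaned corpus, is the kind of figure the slide reports as the percentage of emails in threads.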
The ISI database
Paper: Shetty and Adibi’s report
Report and data are available at
Stored on patas under $data_dir/isi/
Stored on capuchin as a MySQL database called “enron”.
Data cleaning
– Start from the CMU dataset
– Remove duplicate emails
– Remove folders such as “discussion_threads”, “all documents”, and “sent_mail”
– …
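The two cleaning steps named above might look like this in Python. It is a minimal sketch, not the procedure ISI actually used: the record fields are hypothetical, folder names are spelled as on the slide, and treating two messages as duplicates when sender, date, subject, and body all match is an assumption. The sketch drops excluded folders before deduplicating so that a copy in a regular folder survives; that ordering is also a choice made here.

```python
import hashlib

# Folder names as spelled on the slide; on-disk spellings may differ.
EXCLUDED_FOLDERS = {"discussion_threads", "all documents", "sent_mail"}


def fingerprint(msg):
    """Hash the fields this sketch assumes identify a duplicate."""
    parts = (msg["sender"], msg["date"], msg["subject"], msg["body"])
    joined = "\x1f".join(p.strip() for p in parts)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def clean(msgs):
    """Drop messages in excluded folders, then keep one copy per fingerprint."""
    seen = set()
    kept = []
    for m in msgs:
        if m["folder"].lower() in EXCLUDED_FOLDERS:
            continue
        fp = fingerprint(m)
        if fp in seen:
            continue
        seen.add(fp)
        kept.append(m)
    return kept


# Toy records, invented for illustration.
raw = [
    {"folder": "inbox", "sender": "a@x", "date": "2001-05-01", "subject": "Hi", "body": "Hello"},
    {"folder": "discussion_threads", "sender": "a@x", "date": "2001-05-01", "subject": "Hi", "body": "Hello"},
    {"folder": "projects", "sender": "a@x", "date": "2001-05-01", "subject": "Hi", "body": "Hello"},
    {"folder": "inbox", "sender": "b@x", "date": "2001-05-02", "subject": "Re: Hi", "body": "Hey"},
]
cleaned = clean(raw)
```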
Cleaned Enron dataset
252,759 emails from 151 employees, distributed across about 3,000 user-defined folders.
The dataset has been used by many research groups.
MySQL database: four tables
rtype: TO, CC, or BCC
rvalue: recipient value
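To make the `rtype` and `rvalue` columns concrete, here is a small Python sketch that uses an in-memory SQLite database as a stand-in for the MySQL one. Only `rtype` and `rvalue` come from the slide; the table name `recipientinfo`, the `rid` and `mid` columns, and the sample rows are assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL database; the schema is assumed.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE recipientinfo (
        rid    INTEGER PRIMARY KEY,  -- assumed: one row per recipient entry
        mid    INTEGER,              -- assumed: id of the message
        rtype  TEXT,                 -- TO, CC, or BCC
        rvalue TEXT                  -- recipient value
    )
    """
)

# Toy rows, invented for illustration.
rows = [
    (1, "TO", "bob@example.com"),
    (1, "CC", "carol@example.com"),
    (2, "TO", "alice@example.com"),
    (2, "BCC", "dave@example.com"),
]
conn.executemany(
    "INSERT INTO recipientinfo (mid, rtype, rvalue) VALUES (?, ?, ?)", rows
)

# Count recipient entries per type.
counts = dict(
    conn.execute("SELECT rtype, COUNT(*) FROM recipientinfo GROUP BY rtype")
)

# List the direct (TO) recipients of message 1.
to_of_first = [
    r[0]
    for r in conn.execute(
        "SELECT rvalue FROM recipientinfo WHERE mid = ? AND rtype = 'TO'", (1,)
    )
]
```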
Distribution of sent emails per user
A few employees sent out a lot of messages.
Distribution of emails over time
Notice the spike around Nov 2001.
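A distribution like this one can be computed by counting messages per calendar month. The sketch below assumes the sent dates have already been parsed into `datetime.date` objects; the dates are toy values invented for illustration and are not taken from the corpus.

```python
from collections import Counter
from datetime import date

# Toy dates, invented for illustration; they are not corpus data.
sent_dates = [
    date(2001, 10, 30),
    date(2001, 11, 2),
    date(2001, 11, 15),
    date(2001, 11, 28),
    date(2001, 12, 3),
]


def monthly_counts(dates):
    """Count messages per 'YYYY-MM' bucket."""
    return Counter(d.strftime("%Y-%m") for d in dates)


counts = monthly_counts(sent_dates)
peak_month, peak_count = counts.most_common(1)[0]

# Crude text histogram, one row per month in chronological order.
for month in sorted(counts):
    print(month, "#" * counts[month])
```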
Social network