The Enron and W3C Collections. Tamer Elsayed and Douglas W. Oard, University of Maryland. ICAIL 2007, DESI Workshop, June 4th, 2007.


1 The Enron and W3C Collections
Tamer Elsayed and Douglas W. Oard
University of Maryland
ICAIL 2007, DESI Workshop, June 4th, 2007

2 Variants of Email Search
Collection   | Searcher: Participant | Searcher: Non-participant
Personal     | My own emails         | Shneiderman’s, Postel’s
Organization | Help desks            | White House, Enron
Public       | Online communities    | Usenet news, W3C

3 The (Extended) Enron Collection
- Rich multimodal data
  - Emails
  - Phone calls
  - Databases

4 The (Extended) Enron Collection
- “Public” version of Enron collection (CMU)
  - 150 sets of rescued Outlook email folders
  - 517,431 emails, 52% duplicates, 133,581 unique addresses
  - Subset annotated w/genre, speech act, mentioned calls, …
- Extended Enron email collection (Aspen Systems)
  - Attachments, additional email (later release, redaction)
- Phone calls from/to Enron traders (Snohomish PUD)
  - Transcribed subset from 52 DVDs of recorded audio
  - Recovered from scanned transcripts using OCR
  - 93 annotated with date, time, participants, mentioned names, mentioned emails, mentioned meetings, ...
- Relational databases (Aspen Systems)

5 Cross-References
[Figure labels: Email, Phone Calls]

6 Phone Call Transcripts
Message-ID:
Message-Type: PhoneCall
Date: Fri, 26 Jan 2001 19:43:55 -0600 (CST)
From: shari.stack@enron.com
To: greg.wolfe@enron.com
Parties: shari.stack@enron.com, greg.wolfe@enron.com
Subject: Snohornish deal, Houston Chronicle Article, Bonuses e-mail, Houston Chronicle Article, Deal, email to Jane King
Subject-TimePos: 145, 313, 713, 775, 920, 1018
InCallNames: Christian, Ken Lay, Greg, Chris Foster, Stewie, Stewie, Mike, Mike, Laverado, Mike, Kim, Shari, Greg, Forney, Stewie, Jane King, Shari
InCallNames-TimePos: 42, 81, 90, 95, 96, 143, 146, 190, 262, 266, 522, 580, 780, 1007, 1018, 1038, 1067
Keywords: CDWR, email, email
Keywords-TimePos: 55, 689, 1038
X-From: Stack, Shari <>
X-To: Wolfe, Greg <>
X-Parties: Stack, Shari <>, Wolfe, Greg <>
X-AudioFile: 24-20010126-19435570-20020114-R.wav
X-TranscriptFile: 24-20010126-19435570-20020114-R.txt

SHARI STACK: Hey.
GREG WOLFE: All right, let me get my fax machine workin'. Uh - [laughs]
SHARI: [laughs] She's like, it was so easy, I could make you a lot of money [laughs]. She's like, he said it so desperate. She goes I hate to laugh at people, but - [laughs]
GREG: Did you, um, did you, ah, ah tell her about the, ah, that voice mail?
SHARI: Yeah, I said - I said Greg [inaudible] he's got the - they got a mob connection [langhs] - his friend threw away the business card after the meeting. [both laughing]
SHARI: But, my God - my God, and so anyway, have you talked to Chnstian about this 'cause Christian apparently talked to him twice today.
GREG: Oh, he sent a - Christian sent an e-mail shortly after, you know, that, and said we're not doin' business with this guy.
SHARI: [laughs]
GREG: Ah, so I still don't understand why this guy's trying to get in the middle of us and CDWR and I guess -
SHARI: [laughs]
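The transcript header above follows an email-style "Name: value" layout, with each annotated field (Subject, InCallNames, Keywords) paired with a matching "-TimePos" field. A minimal sketch of reading such a record with Python's standard email parser is shown below. The helper name timed_items is hypothetical, the sample is abridged from the slide, and the unit of the TimePos values is not stated in the slides, so they are treated here as opaque integer offsets.

```python
from email.parser import Parser

# Abridged copy of the header shown on the slide (assumed layout:
# one "Name: value" field per line, blank line before the transcript).
SAMPLE = """\
Message-Type: PhoneCall
Date: Fri, 26 Jan 2001 19:43:55 -0600 (CST)
From: shari.stack@enron.com
To: greg.wolfe@enron.com
Keywords: CDWR, email, email
Keywords-TimePos: 55, 689, 1038

SHARI STACK: Hey.
"""


def timed_items(msg, field):
    """Pair each comma-separated value of `field` with its `<field>-TimePos` offset.

    Hypothetical helper: the pairing rule is inferred from the slide,
    where both fields list the same number of entries.
    """
    values = [v.strip() for v in msg[field].split(",")]
    positions = [int(p) for p in msg[field + "-TimePos"].split(",")]
    if len(values) != len(positions):
        raise ValueError(f"{field}: {len(values)} values, {len(positions)} positions")
    return list(zip(values, positions))


msg = Parser().parsestr(SAMPLE)
keywords = timed_items(msg, "Keywords")
transcript = msg.get_payload()
print(keywords)
```

The same helper could be applied to "Subject" and "InCallNames", which on the slide carry 6 and 17 entries respectively, matching their TimePos lists.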

7 Typical Enron Email
Labeled parts: Message Header, Main Body, Salutation, Signature Block, Quoted Header, Quoted Text, Message Body, Quoted Signature, Quoted Main Body

Quoted header:
-----Original Message-----
From: SStack@reliant.com@ENRON
Sent: Monday, July 30, 2001 2:24 PM
To: Sager, Elizabeth; Murphy, Harlan; jcrespo@hess.com; wfhenze@jonesday.com
Cc: ntillett@reliant.com
Subject:Shhhh.... it's a SURPRISE !

Message header:
Message-ID:
Date: Mon, 30 Jul 2001 12:40:48 -0700 (PDT)
From: elizabeth.sager@enron.com
To: sstack@reliant.com
Subject: RE: Shhhh.... it's a SURPRISE !
X-From: Sager, Elizabeth
X-To: 'SStack@reliant.com@ENRON'

Body text (in slide order):
Hope all is well. Count me in for the group present. See ya next week if not earlier
Please call me (713) 207-5233
Liza
Elizabeth Sager 713-853-6349
Hi Shari
Thanks!
Shari
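The slide distinguishes the new message body from the quoted header and quoted text. A rough sketch of that split is given below, assuming the quoted part always begins with the "-----Original Message-----" marker seen in this example and that a blank line separates the quoted header from the quoted text. Both are assumptions; other mail clients quote differently. The function names and the simplified sample body are hypothetical.

```python
ORIGINAL_MARKER = "-----Original Message-----"

# Simplified body modeled on the slide's example (layout is assumed).
BODY = (
    "Hope all is well. Count me in for the group present.\n"
    "Liza\n"
    "\n"
    "-----Original Message-----\n"
    "From: SStack@reliant.com@ENRON\n"
    "Sent: Monday, July 30, 2001 2:24 PM\n"
    "Subject: Shhhh.... it's a SURPRISE !\n"
    "\n"
    "Thanks!\n"
    "Shari\n"
)


def split_reply(body):
    """Return (new_text, quoted_part); quoted_part is '' if no marker is found."""
    new_text, marker, quoted = body.partition(ORIGINAL_MARKER)
    if not marker:
        return body.strip(), ""
    return new_text.strip(), quoted.strip()


def split_quoted_header(quoted):
    """Split the quoted part into a header dict and the quoted text."""
    header_block, _, text = quoted.partition("\n\n")
    headers = {}
    for line in header_block.splitlines():
        # Split on the first colon only, so times like "2:24 PM" stay intact.
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip()] = value.strip()
    return headers, text.strip()


new_text, quoted = split_reply(BODY)
quoted_headers, quoted_text = split_quoted_header(quoted)
print(new_text)
print(quoted_headers)
print(quoted_text)
```

Separating salutations and signature blocks, which the slide also labels, would need further heuristics that this sketch does not attempt.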

8 Research Problems (Enron)
- Threading
- Email Classification
- Social Network Analysis
- Mention Resolution

9
Date: Wed Dec 20 08:57:00 EST 2000
From: Kay Mann
To: Suzanne Adams
Subject: Re: GE Conference Call has be rescheduled

Did Sheila want Scott to participate? Looks like the call will be too late for him.

Who is that “Sheila”?
Sheila ?

10 Rich Evidence about Identity
m..scott@enron.com susan m scott suebob susan scott sue susan m scott scott.susan@enron.com scott susan susan m scott susan scott sscott5@enron.com susan scott friday sscott5 susan sscott susan m scott com members
66,715 models 82,084 addr-name 3,151 addr-nickname 19,708 addr-addr

11 Test Collection of Mention Resolution
Collection   | Emails  | Identities | Queries | Candidates: Min. | Avg. | Max.
Sager        | 1,628   | 627        | 51      | 1                | 4    | 11
Shapiro      | 974     | 855        | 49      | 1                | 8    | 21
Enron-subset | 54,018  | 27,340     | 78      | 1                | 152  | 489
Enron-all    | 248,451 | 123,783    | 78      | 3                | 518  | 1785

12 Evaluation
- Task
  - named mention -> ranked list of people
- Measures
  - Mean Reciprocal Rank
  - Success @ K
    - Success @ 1
  - Confidence-based scoring
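The two rank-based measures named above can be sketched as follows, using their standard definitions: reciprocal rank is 1/rank of the correct person in the ranked list (0 if absent), and Success @ K is the fraction of queries whose correct person appears in the top K. The slides do not define confidence-based scoring, so it is not sketched here. The function names and the toy identifiers are illustrative only, not taken from the collection.

```python
def reciprocal_rank(ranked, correct):
    """1/rank of the correct candidate (ranks start at 1); 0.0 if it is missing."""
    for rank, candidate in enumerate(ranked, start=1):
        if candidate == correct:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_list, correct_answer) pairs."""
    return sum(reciprocal_rank(ranked, correct) for ranked, correct in runs) / len(runs)


def success_at_k(runs, k):
    """Fraction of queries whose correct answer is within the top k results."""
    hits = sum(1 for ranked, correct in runs if correct in ranked[:k])
    return hits / len(runs)


# Toy example with made-up identifiers: one query answered at rank 1,
# one at rank 2, and one whose correct person is not in the list.
runs = [
    (["person_a", "person_b", "person_c"], "person_a"),
    (["person_a", "person_b", "person_c"], "person_b"),
    (["person_a", "person_b", "person_c"], "person_d"),
]
mrr = mean_reciprocal_rank(runs)
s1 = success_at_k(runs, 1)
s2 = success_at_k(runs, 2)
print(mrr, s1, s2)
```

Success @ 1 is simply the accuracy of the top-ranked candidate, which is why it appears as the special case of Success @ K.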

13 Limitations (Mention Resolution)
- Small number of queries
- Only resolved by Enron employees
  - Much easier
  - Most participants are outsiders
- Measures focus only on accuracy

14 Identity-Content Interplay
[Figure labels: Search for People, Search for Content, Social Context, Topical Context]

15 W3C Collection
- Set of mailing lists
  - Public, not private
  - Topically oriented
- ~175,000 emails
- Introduced at TREC 2005
- 50 topics (x 2 years)
- Relevance judgments available for ad-hoc retrieval

16 Research Problems (W3C)
- Expert Finding
  - Topic -> ranked list of experts
- Known-item Retrieval
  - Query -> ranked list of emails
- Discussion Search (i.e., ad-hoc retrieval)
  - Pro/con retrieval
  - Query -> ranked list of emails

17 Topic Type Analysis
Find categories amenable to pro/con classification (TREC 2005 Enterprise Track)

18 Limitations (Pro/Con Retrieval)
- Not private/personal communication
- Mailing lists -> receivers are hidden
- Topical categories are unbalanced
- Developed by researchers, NOT users

19 Related Projects
- Others working with CMU’s Enron emails
  - Berkeley, CMU, U Mass, SIAM Workshop
- University of Southern California ISI/ICT
  - eArchivarius, Postel collection (Anton Leuski)
- Georgia Tech Research Institute
  - PERPOS, Presidential records (Bill Underwood)

20 Conclusion
- Two email test collections
  - Public
  - Hundreds of thousands of emails
  - Annotated emails and transcripts
  - Tasks and ground truth
- Need for “real” user needs
- Development of evaluation measures for utility

21 For More Information
- Joint Institute for Knowledge Discovery: http://www.umiacs.umd.edu/jikd

22 Running System

