Modeling Identity in Archival Collections of Email: A Preliminary Study. Tamer Elsayed and Douglas W. Oard. Conference on Email and Anti-Spam (CEAS), July.

1 Modeling Identity in Archival Collections of Email: A Preliminary Study
Tamer Elsayed and Douglas W. Oard
Conference on Email and Anti-Spam (CEAS), July 28th, 2006
Department of Computer Science; College of Information Studies; Institute for Advanced Computer Studies

2 Real Problem
[Diagram: National Archives; Clinton White House; Tobacco Policy search request; 32 million emails; 200,000; 80,000; hired 25 persons for 6 months …]

3 Email Search
Searcher: Participant | Non-participant
Personal: My own emails | Shneiderman’s, Postel’s
Organizational: CS, UMIACS | White House, Enron
Public: TREC Enterprise, Usenet news, W3C
• Meaning → Modeling Content
• People → Modeling Identity

4 Identity
[Diagram: an Email connects a Sender, Receivers, and Mentioned people. Relation labels: sent, received, mentions, mentioned, sent email to, mentioned to. Each of the three roles carries an Email Address, Name, and Nickname.]

5 Outline
• Problem
• Identity Resolution Architecture
• Evaluation
• Conclusion

6 Entity Example
Email Address: “robert.bruce@enron.com”
Name: “Robert Bruce”
Nickname: “Bob”
Signature Block: Robert E. Bruce, Senior Counsel, Enron North America Corp., T (713) 345-7780, F (713) 646-3393, robert.bruce@enron.com
Association sources (counts as shown on the slide): Static Signature (140), Main Headers (915), Quoted Headers (8), Salutations (7), Free Signatures (9)

7 Enron Collection
• Example of a large organizational collection
• CMU version: about half a million emails, 133,581 unique email addresses
• ~52% of emails are duplicates (same address, subject, body)
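
The duplicate criterion on this slide (same address, subject, body) can be sketched as a hash-based filter. This is only an illustration, not the authors' implementation: the `Email` record, the light normalization, and the choice of SHA-1 are assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Email:
    # Hypothetical record; the field names are assumptions for this sketch.
    sender: str
    subject: str
    body: str


def fingerprint(msg: Email) -> str:
    """Hash the (address, subject, body) triple after light normalization."""
    parts = (
        msg.sender.strip().lower(),
        msg.subject.strip(),
        " ".join(msg.body.split()),  # collapse runs of whitespace
    )
    return hashlib.sha1("\x00".join(parts).encode("utf-8")).hexdigest()


def unique_emails(messages):
    """Keep the first message seen for each fingerprint."""
    seen = set()
    kept = []
    for msg in messages:
        key = fingerprint(msg)
        if key not in seen:
            seen.add(key)
            kept.append(msg)
    return kept
```

Whether whitespace and case should be normalized before comparing is a design choice; an exact-match comparison would flag fewer duplicates.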

8 Typical Enron Email
[Annotated regions: Message Header; Message Body; Main Body; Salutation; Signature Block; Quoted Header; Quoted Text; Quoted Main Body; Quoted Signature]
-----Original Message-----
From: SStack@reliant.com@ENRON
Sent: Monday, July 30, 2001 2:24 PM
To: Sager, Elizabeth; Murphy, Harlan; jcrespo@hess.com; wfhenze@jonesday.com
Cc: ntillett@reliant.com
Subject: Shhhh.... it's a SURPRISE !
Message-ID:
Date: Mon, 30 Jul 2001 12:40:48 -0700 (PDT)
From: elizabeth.sager@enron.com
To: sstack@reliant.com
Subject: RE: Shhhh.... it's a SURPRISE !
X-From: Sager, Elizabeth
X-To: 'SStack@reliant.com@ENRON'
Hope all is well. Count me in for the group present. See ya next week if not earlier
Please call me (713) 207-5233
Liza
Elizabeth Sager 713-853-6349
Hi Shari
Thanks! Shari

9 Identity Resolution Architecture
[Diagram. Processing stages: Duplicate Detection; Extraction from Main Header; Extraction from Quoted Header; Body and Quoted Text Separation; Signature Line Detection; Salutation Line Detection; Nickname Extraction; Clustering. Data passed between stages: Unique emails; Quoted headers; Main body; Salutation lines; Signature lines; Address-Name Associations; Address-Address Associations; Address-Nickname Associations; Associations; Entities.]

10 Extraction From Main Headers
Message-ID:
Date: Wed, 26 Sep 2001 09:25:19 -0700 (PDT)
From: jmathes@nbchamber.com
To: mark.vandini@enron.com, steve.urbon@enron.com, sapienza.tony@enron.com, o'rourke.tom@enron.com, lyons.tom@enron.com
Subject: New Email Address
X-From: Jim Mathes
X-To: Vandini, Mark, Urbon Steve, Tony Sapienza, Tom O'Rourke, Tom Lyons, Tom Hodgson
X-cc:
X-bcc:
We have just launched our "New & Improved Website", www.newbedfordchamber.com and I have a new email address: jmathes@newbedfordchamber.com Please make the appropriate changes in your email address book.
Thank you,
Jim Mathes, President
New Bedford Area Chamber of Commerce
[Callouts: Name-Address Association; Address-Address Association]
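
A name-address association of the kind marked on this slide can be read directly off the main header. The sketch below uses Python's standard `email` package and pairs the `From` address with the `X-From` display name. That pairing rule and the function name are assumptions for illustration; the slide does not spell out the exact rule used.

```python
from email import message_from_string

# Abbreviated copy of the header shown on the slide.
RAW = """\
From: jmathes@nbchamber.com
To: mark.vandini@enron.com
Subject: New Email Address
X-From: Jim Mathes
X-To: Vandini, Mark

We have just launched our "New & Improved Website".
"""


def name_address_from_main_header(raw: str):
    """Pair the sender address (From) with the display name (X-From)."""
    msg = message_from_string(raw)
    address = (msg.get("From") or "").strip().lower()
    name = (msg.get("X-From") or "").strip()
    if address and name:
        return (address, name)
    return None  # one of the two headers is missing or empty


print(name_address_from_main_header(RAW))
```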

11 Extraction From Quoted Headers
Example 1 [callout: Name-Address Association]
Hi Jeff, Did you get our registration packet? If not, stop by and pick one up because you need it. Make sure you get the one for new students.
Shawn
On Wednesday, November 03, 1999 11:18 AM, Jeff Dasovich [SMTP:jdasovic@enron.com] wrote:
>
> ok, don't shoot me, but what's the deadline for scheduling for classes?
>
> signed,
> clueless
Example 2 [callout: Name-Address Association]
---------------------- Forwarded by Elizabeth Sager/HOU/ECT on 02/09/2000 12:02 PM ---------------------------
"Patricia Young" on 02/09/2000 08:50:59 AM
To: Elizabeth Sager/HOU/ECT@ECT
cc:
Subject: If possible, would you forward your resume to me electronically? Thanks.
If possible, would you forward your resume to me electronically? Thanks.
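
The first quoted-header style on this slide ("On <date>, Name [SMTP:address] wrote:") lends itself to a regular expression. The pattern below is a hand-written sketch that covers only this one format; the pattern and the function name are assumptions, not the heuristics actually used in the study.

```python
import re

# Matches lines such as:
#   On Wednesday, November 03, 1999 11:18 AM, Jeff Dasovich [SMTP:jdasovic@enron.com] wrote:
QUOTED_SMTP = re.compile(
    r"On\s+.+?,\s*(?P<name>[^,\[\]]+?)\s*\[SMTP:(?P<address>[^\]\s]+)\]\s+wrote:"
)


def name_address_from_quoted_header(text: str):
    """Return (address, name) from an 'On ..., Name [SMTP:addr] wrote:' line."""
    match = QUOTED_SMTP.search(text)
    if match is None:
        return None
    return (match.group("address").lower(), match.group("name").strip())


line = ("On Wednesday, November 03, 1999 11:18 AM, "
        "Jeff Dasovich [SMTP:jdasovic@enron.com] wrote:")
print(name_address_from_quoted_header(line))
```

Other quoted-header styles, such as the "Forwarded by" banner in the second example, would each need their own pattern.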

12 Signature & Salutation Detection
From: susan.scott@enron.com
Message 1:
The kiddies are going back to school already so now would be a good time to plan a trip to D.C. at last. Maybe early Sept? Also I'd be game for a girls' trip to Destin. Time to work!
Love,
-Sooz
Procurement, Logistics, and Contracts
Enron Broadband Services, Inc.
1400 Smith, Suite EB-4573A
Houston, TX 77002
Message 2:
The week is going OK. All the tennis and swimming has left me with sore muscles so this is my night off. Am planning to do some more house chores so I do not end up with another weekend like the last. I'm still planning on coming to Austin next weekend, I'm just not sure when, but I'll let you know. Call if you get lonely!
Love,
Sooz
Procurement, Logistics, and Contracts
Enron Broadband Services, Inc.
1400 Smith, Suite EB-4573A
Houston, TX 77002
Message 3:
Had another sleepless night Sun. and finally took some Unisom and had a good night's sleep last night. What a relief. I have really never had this problem before. It's good to have a lot of energy, but you have to shut down sometime. Am sending you my travel schedule for next week. The following week (May 29 - June 2) I'm planning to be in SF also, but I'm not sure I'll actually have to be there that long. Have a good afternoon!
love,
sooz
Procurement, Logistics, and Contracts
Enron Broadband Services, Inc.
1400 Smith, Suite EB-4573A
Houston, TX 77002

13 Nickname Extraction
3,151 address-nickname associations
From: susan.scott@enron.com
Had another sleepless night Sun. and finally took some Unisom and had a good night's sleep last night. What a relief. I have really never had this problem before. It's good to have a lot of energy, but you have to shut down sometime. Am sending you my travel schedule for next week. The following week (May 29 - June 2) I'm planning to be in SF also, but I'm not sure I'll actually have to be there that long. Have a good afternoon!
love,
sooz [callout: nickname]
Procurement, Logistics, and Contracts
Enron Broadband Services, Inc.
1400 Smith, Suite EB-4573A
Houston, TX 77002
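
One way to turn detected signature lines into an address-nickname association is to normalize each sign-off and keep the candidate seen most often for a sender. The sketch below is an illustration only: the list of closing words, the regular expression, and the majority-vote rule are all assumptions, since the slides do not give the actual heuristics.

```python
import re
from collections import Counter

# Assumed closing words; purely illustrative.
CLOSINGS = ("love", "thanks", "regards", "cheers")

# Optional closing word, optional comma and dashes, then a single-word name.
SIGNOFF = re.compile(
    r"^\s*(?:(?:%s)\s*,?\s*)?-*\s*(?P<nick>[A-Za-z]+)\s*$" % "|".join(CLOSINGS),
    re.IGNORECASE,
)


def nickname_candidate(signature_line: str):
    """Return a lowercased nickname candidate, or None if the line does not fit."""
    match = SIGNOFF.match(signature_line)
    if match is None or match.group("nick").lower() in CLOSINGS:
        return None
    return match.group("nick").lower()


def address_nickname(address: str, signature_lines):
    """Associate an address with its most frequent nickname candidate."""
    counts = Counter(
        nick for nick in map(nickname_candidate, signature_lines) if nick
    )
    if not counts:
        return None
    nickname, support = counts.most_common(1)[0]
    return (address, nickname, support)


lines = ["Love, -Sooz", "Love,Sooz", "love,sooz"]
print(address_nickname("susan.scott@enron.com", lines))
```

The returned support count is one simple stand-in for strength of evidence.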

14 Identifying Entities
Email Address: “robert.bruce@enron.com”
Name: “Robert Bruce”
Nickname: “Bob”
Signature Block: Robert E. Bruce, Senior Counsel, Enron North America Corp., T (713) 345-7780, F (713) 646-3393, robert.bruce@enron.com
Association sources (counts as shown on the slide): Static Signature (140), Main Headers (915), Quoted Headers (8), Salutations (7), Free Signatures (9)
Email Address: “rbruce@hotmail.com”
Name: “Robert”
Association sources (counts as shown on the slide): Quoted Headers (5), Main Headers (7)
Totals: 82,084 addr-name, 3,151 addr-nickname, and 19,708 addr-addr associations; 66,715 entities

15 Outline
• Problem
• Identity Resolution Architecture
• Evaluation
• Conclusion
• Future Work

16 Stratified Sampling
Columns: Weakest Evidence | Stronger Evidence
Address-Name Associations
  Main headers only: 50 / 29677 | 50 / 31248
  Quoted headers only: 50 / 8042 | 50 / 3828
  Both headers: 50 / 9289
Address-Nickname Associations
  Salutations only: 50 / 272 | 50 / 465
  Signatures only: 50 / 172 | 50 / 1754
  Both: 50 / 490
Address-Address Associations: 50 / 6514 | 50 / 4194
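
Reading each cell as "50 sampled out of N", the sampling step can be sketched as drawing a fixed number of items, without replacement, from every stratum. The function name, the fixed seed, and the dictionary layout are assumptions for the example.

```python
import random


def stratified_sample(strata, per_stratum=50, seed=0):
    """Draw up to `per_stratum` items, without replacement, from each stratum.

    `strata` maps a stratum label to a collection of associations.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = {}
    for label, items in strata.items():
        items = list(items)
        k = min(per_stratum, len(items))  # small strata are taken whole
        sample[label] = rng.sample(items, k)
    return sample
```

Sampling a fixed count per stratum, rather than proportionally, keeps small strata represented in the judged set.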

17 Judgment Process
Example associations:
• kmpresto@msn.com → "home email"
• terrie.james@enron.com → "alexis james-petty"
• june-deadrick@reliantenergy.com → "june deadrick"
• robbie.lewis@enron.com → "robbie lewis"
• terriecovarrubias@hotmail.com → "terrie covarrubias"
• randal.maffett@enron.com → "randy"
• lemelpe@nu.com → "phyllis"
• piazzet@wharton.upenn.edu → "tom"
Judgment categories:
• Incorrect
• Correct but not informative
• Correct and somewhat informative
• Correct and very informative

18 Evaluation Measures
[Diagram labels: Judged Associations; Correct; Informative; Very Informative]

19 Accuracy
[Charts: Address-Name Associations; Address-Nickname Associations; Address-Address Associations]
• 100% accuracy with multiple sources of evidence
• Address-name association was nearly perfect
• 80% minimum accuracy in address-nickname associations
• 96.7% entity accuracy

20 Informativeness
[Charts: Address-Name Associations; Address-Nickname Associations; Address-Address Associations]

21 Outline
• Problem
• Identity Resolution Architecture
• Evaluation
• Conclusion

22 Conclusion
• Introduced a computational model of identity
  - a set of simple techniques that, put together, provide a useful baseline
  - assessed its potential utility in the context of one fairly complex email collection
• Automatic detection of nicknames in salutations and signature lines
• The most informative results came from the weakest, least accurate evidence
• Accuracy and informativeness are both important

23 Limitations
• Email address associated with a single identity
• Strength of evidence not exploited
• Heuristics hand-tuned for the Enron collection
• Focus on personal attributes
• No reconciliation of multiple identities for a single person
• No attempt to classify identities as machines or groups
• Recall?

24 Thank You! Questions?

25 Backup

26 Future Work
• Extend the model to exploit temporal features and behavioral evidence
• Implement machine learning techniques
• Perform ablation studies
• Characterize the coverage of our methods in more detail
• Replicate this work in other contexts
• Integrate these techniques with the ultimate applications for which computational models of identity are needed (e.g., social network analysis)

27 Helping in Judgments

28 Identity Framework
[Diagram labels: Identity; Person; Group; Machine; Entity (six instances); Candidates]

29 Modeling Identity
• Attributes (stable explicit features): email addresses, names, nicknames, contact info
• Associations: link attributes together; based on observations
• Entities: representation of an identity; a set of attributes in an undirected graph, linked by weighted associations

30 Identifying Entities
• First round: limited transitive closure
• Merging associations based on unique attributes
  - Address-address associations
• No use of strength of evidence yet
• 66,715 entities
  - Covering 77,420 unique email addresses (58% of all addresses)
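
Merging on unique attributes can be sketched with a disjoint-set (union-find) structure. Addresses joined by an address-address association fall into one entity, and names and nicknames are then attached to their address's entity without causing further merges. Treating names and nicknames as non-merging is an interpretation of "limited" transitive closure, and all identifiers below are assumptions for the example.

```python
from collections import defaultdict


class DisjointSet:
    """Minimal union-find with path compression."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:  # compress the path to the root
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def build_entities(addr_addr, addr_name, addr_nick):
    """Group attributes into entities; only addresses trigger merging."""
    ds = DisjointSet()
    for a, b in addr_addr:
        ds.union(a, b)

    entities = defaultdict(
        lambda: {"addresses": set(), "names": set(), "nicknames": set()}
    )

    def entity_for(addr):
        entity = entities[ds.find(addr)]
        entity["addresses"].add(addr)
        return entity

    for a, b in addr_addr:
        entity_for(a)
        entity_for(b)
    for addr, name in addr_name:
        entity_for(addr)["names"].add(name)
    for addr, nick in addr_nick:
        entity_for(addr)["nicknames"].add(nick)
    return list(entities.values())
```

Strength of evidence is ignored here, as on the slide; a weighted variant would need a threshold or a clustering step in place of the plain union.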

31 Related Work
• Attribute/association extraction
• Name recognition and reference resolution
• Applications: social network analysis, finding experts

32 Unjudged Associations
[Charts: Address-Name Associations; Address-Nickname Associations; Address-Address Associations]
Only 19 → ~3%

