Identity Resolution in Email Collections Tamer Elsayed and Douglas W. Oard CLIP Colloquium, UMD, Feb 2009 Department of Computer Science, UMIACS, and iSchool.

1 Identity Resolution in Email Collections Tamer Elsayed and Douglas W. Oard CLIP Colloquium, UMD, Feb 2009 Department of Computer Science, UMIACS, and iSchool

2 Real Problem. National Archives, Clinton White House: Tobacco Policy search request. 32 million emails; 200,000; 80,000; hired 25 persons for 6 months …

3 Identity Resolution in Email
Date: Wed Dec 20 08:57:00 EST 2000
From: Kay Mann
To: Suzanne Adams
Subject: Re: GE Conference Call has be rescheduled
Did Sheila want Scott to participate? Looks like the call will be too late for him.
“Sheila”: WHO?

4 Enron Collection
-----Original Message-----
From: SStack@reliant.com@ENRON
Sent: Monday, July 30, 2001 2:24 PM
To: Sager, Elizabeth; Murphy, Harlan; jcrespo@hess.com; wfhenze@jonesday.com
Cc: ntillett@reliant.com
Subject: Shhhh.... it's a SURPRISE !
Message-ID:
Date: Mon, 30 Jul 2001 12:40:48 -0700 (PDT)
From: elizabeth.sager@enron.com
To: sstack@reliant.com
Subject: RE: Shhhh.... it's a SURPRISE !
X-From: Sager, Elizabeth
X-To: 'SStack@reliant.com@ENRON'
Hope all is well. Count me in for the group present. See ya next week if not earlier
Please call me (713) 207-5233
Liza
Elizabeth Sager
713-853-6349
Hi Shari
Thanks! Shari
55 Sheila’s !! weisman, pardo, glover, rich, jones, breeden, huckaby, tweed, mcintyre, chadwick, birmingham, kahanek, foraker, tasman, fisher, petitt, Dombo, Robbins, chang, jarnot, kirby, knudsen, boehringer, lutz, glover, wollam, jortner, neylon, whanger, nagel, graves, mclaughlin, venville, rappazzo, miller, swatek, hollis, maynes, nacey, ferrarini, dey, macleod, howard, darling, watson, perlick, advani, hester, kenner, lewis, walton, whitman, berggren, osowski, kelly
Rank Candidates

5 Generative Model
1. Choose “person” c to mention: p(c)
2. Choose appropriate “context” X to mention c: p(X | c)
3. Choose a “mention” l: p(l | X, c)
Example: the mention “sheila” in the context of the GE conference call
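
Read as a factorization, the three generative steps on this slide correspond to the joint probability below, and Bayes' rule then gives a posterior over candidate persons. This is a sketch derived from the slide's own notation (c for the person, X for the context, l for the mention); the normalized form is a standard consequence of the factorization and is not quoted from the talk.

```latex
p(c, X, l) = p(c)\, p(X \mid c)\, p(l \mid X, c)
\qquad\Longrightarrow\qquad
p(c \mid l, X) =
  \frac{p(c)\, p(X \mid c)\, p(l \mid X, c)}
       {\sum_{c'} p(c')\, p(X \mid c')\, p(l \mid X, c')}
```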

6 3-Step Solution: (1) Identity Modeling; (2) Context Reconstruction; (3) Mention Resolution. Diagram label: Posterior Distribution

7 Outline
• Introduction and Approach Overview
• Computational Model of Identity
• Context Reconstruction
• Mention Resolution
• Evaluation on Existing Collections
• Scalable MapReduce Implementation
• New Test Collection
• Conclusion and Future Work

8 “Easy References” of Identity
-----Original Message-----
From: SStack@reliant.com@ENRON
Sent: Monday, July 30, 2001 2:24 PM
To: Sager, Elizabeth; Murphy, Harlan; jcrespo@hess.com; wfhenze@jonesday.com
Cc: ntillett@reliant.com
Subject: Shhhh.... it's a SURPRISE !
Message-ID:
Date: Mon, 30 Jul 2001 12:40:48 -0700 (PDT)
From: elizabeth.sager@enron.com
To: sstack@reliant.com
Subject: RE: Shhhh.... it's a SURPRISE !
X-From: Sager, Elizabeth
X-To: 'SStack@reliant.com@ENRON'
Hope all is well. Count me in for the group present. See ya next week if not earlier
Please call me (713) 207-5233
Liza
Elizabeth Sager
713-853-6349
Hi Shari
Thanks! Shari
Email Standards; Email-Client Behavior; User Regularities

9 Representational Model of Identity: 77,240 “non-trivial” models. Example: sheila.glover@enron.com; 14 (Quoted Headers); sheila glover; 932 (Main Headers); sheila; 19 (Salutation), 216 (Signature); sg; 19 (Signature); sheila glover; 1170 (User Name)

10 Computational Model of Identity: c (identity), m (observed mention), t (name type)

11 Identity Models. Candidates. Likelihood: p(“sheila” | c)

12 Outline
• Introduction and Approach Overview
• Computational Model of Identity
• Context Reconstruction
• Mention Resolution
• Evaluation on Existing Collections
• Scalable MapReduce Implementation
• New Test Collection
• Conclusion and Future Work

13 Contextual Space: Local Context; Conversational Context; Topical Context

14 Topical Context
Date: Fri Dec 15 05:33:00 EST 2000
From: david.oxley@enron.com
To: vince j kaminski
Cc: sheila walton
Subject: Re: Grant Masson
Great news. Lets get this moving along. Sheila, can you work out GE letter? Vince, I am in London Monday/Tuesday, back Weds late. I'll ask Sheila to fix this for you and if you need me call me on my cell phone.
sheila.walton@enron.com
Date: Wed Dec 20 08:57:00 EST 2000
From: Kay Mann
To: Suzanne Adams
Subject: Re: GE Conference Call has be rescheduled
Did Sheila want Scott to participate? Looks like the call will be too late for him.
Highlighted terms: Sheila, call, GE

15 Contextual Space: Social Context; Local Context; Conversational Context; Topical Context

16 Social Context
Date: Wed Dec 20 08:57:00 EST 2000
From: Kay Mann
To: Suzanne Adams
Subject: Re: GE Conference Call has be rescheduled
Did Sheila want Scott to participate? Looks like the call will be too late for him.
Date: Tue, 19 Dec 2000 07:07:00 -0800 (PST)
From: rebecca.walker@enron.com
To: kay.mann@enron.com
Subject: ESA Option Execution
Kay Can you initial the ESA assignment and assumption agreement or should I ask Sheila Tweed to do it? I believe she is currently en route from Portland. Thanks, Rebecca
Highlighted: Sheila Tweed; kay.mann@enron.com

17 Formally
• A context of an email is a probability distribution over emails. The probability is estimated based on the type of context.
• Contextual Space is a linear combination of 4 contexts
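
The linear combination on this slide can be sketched as a weighted mixture of per-context distributions. The function name `mix_contexts`, the email ids, and the equal weights below are illustrative assumptions; the talk does not give the actual mixing weights.

```python
from collections import defaultdict

def mix_contexts(contexts, weights):
    """Linearly combine per-context probability distributions over emails.

    contexts: {context_name: {email_id: probability}}
    weights:  {context_name: mixing weight}  (hypothetical values)
    """
    total = sum(weights[name] for name in contexts)
    mixed = defaultdict(float)
    for name, dist in contexts.items():
        lam = weights[name] / total  # renormalize so the weights sum to 1
        for email_id, prob in dist.items():
            mixed[email_id] += lam * prob
    return dict(mixed)

# Toy example: four contexts, each a distribution over three emails.
contexts = {
    "local": {"e1": 1.0},
    "conversational": {"e1": 0.5, "e2": 0.5},
    "social": {"e2": 0.25, "e3": 0.75},
    "topical": {"e3": 1.0},
}
weights = {"local": 1.0, "conversational": 1.0, "social": 1.0, "topical": 1.0}
space = mix_contexts(contexts, weights)
```

Because every input is a distribution and the weights are renormalized, the mixture is again a distribution over emails.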

18 Context Expansion. Diagram labels: contexts (local, conversational, social, topical); evidence (time, people, content). Temporal similarity affects social and topical similarity.

19 Temporal Similarity
• Decay over time
• Gaussian and Linear functions
• Time difference / Rank
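
The two decay shapes named on this slide might look like the following. The parameter names `sigma` and `window` and their units are assumptions; `delta` can be either a time difference or a rank distance, matching the slide's "Time difference / Rank".

```python
import math

def gaussian_decay(delta, sigma):
    """Similarity falls off as a Gaussian of the distance delta."""
    return math.exp(-(delta ** 2) / (2 * sigma ** 2))

def linear_decay(delta, window):
    """Similarity falls off linearly and reaches 0 at the edge of the window."""
    return max(0.0, 1.0 - abs(delta) / window)
```

Both functions return 1.0 for emails sent at the same time; the linear form has a hard cut-off, while the Gaussian form only approaches zero.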

20 Social Similarity
• Two sets of participants (email addresses)
• Binary, Overlap, Jaccard, Both
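
A minimal sketch of three of the set measures listed on this slide, applied to two participant sets. The exact definitions used in the talk are not given, so these are assumptions: "Overlap" is taken here as the raw count of shared addresses, and the "Both" variant is not covered. The addresses are placeholders.

```python
def binary_sim(a, b):
    """1.0 if the two participant sets share at least one address."""
    return 1.0 if set(a) & set(b) else 0.0

def overlap_sim(a, b):
    """Number of shared addresses (assumed reading of "Overlap")."""
    return float(len(set(a) & set(b)))

def jaccard_sim(a, b):
    """Shared addresses divided by all distinct addresses."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Placeholder participant sets for two emails.
p1 = {"x@example.com", "y@example.com"}
p2 = {"y@example.com", "z@example.com"}
```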

21 Temporal Effect: Temporal Sim and Pure Social Sim are combined into Social Sim, which is normalized to give the Social Context
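
One way to read this slide is sketched below: a purely social score is scaled by a temporal decay and the results are normalized into a distribution. Combining by multiplication, the Gaussian decay, and the `sigma` default are all assumptions made for illustration; the slide only shows that the two similarities are combined and then normalized.

```python
import math

def social_context(query_time, pure_social, times, sigma=7.0):
    """Combine pure social similarity with temporal decay, then normalize.

    pure_social: {email_id: social similarity to the query email}
    times:       {email_id: timestamp, in the same unit as sigma}
    """
    scores = {}
    for email_id, sim in pure_social.items():
        delta = times[email_id] - query_time
        temporal = math.exp(-(delta ** 2) / (2 * sigma ** 2))
        scores[email_id] = temporal * sim  # assumption: product combination
    total = sum(scores.values())
    if total == 0:
        return {}
    return {email_id: s / total for email_id, s in scores.items()}

# Two equally "social" emails; the one closer in time gets more mass.
ctx = social_context(0.0, {"a": 1.0, "b": 1.0}, {"a": 0.0, "b": 14.0})
```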

22 Topical Similarity
• Standard IR Similarity: BM25
• Email as a DOCUMENT? Subject; Body (+Subject); Root of thread; Concatenated path to root
• Combined similarly with temporal similarity
Diagram labels: email; reply / forward
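
For readers unfamiliar with BM25, a compact scoring sketch follows. It uses a common non-negative IDF variant and the conventional defaults k1 = 1.2 and b = 0.75; these choices, the function name, and the toy token lists are assumptions and not the configuration used in the talk. The document list must be non-empty.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs, k1=1.2, b=0.75):
    """Score each tokenized document in docs against the query tokens."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    df = Counter()
    for d in docs:
        df.update(set(d))  # document frequency of each term
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query_tokens):
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy "emails" represented by a few tokens each.
docs = [["ge", "conference", "call"], ["esa", "option", "execution"], ["ge", "letter"]]
scores = bm25_scores(["ge", "call"], docs)
```

Which text stands in for the email (subject, body, thread root, or path to root) only changes what goes into each token list.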

23 Contextual Space (emails): Social Context; Local Context; Conversational Context; Topical Context

24 Contextual Space (mentions). Graph: the mention “Sheila” is linked by social, conversational, and topical edges to the mentions “Sheila Tweed”, “sheila”, “jsheila@enron.com”, “sg”, “Sheila Walton”, and “Sheila”

25 Outline
• Introduction and Approach Overview
• Computational Model of Identity
• Context Reconstruction
• Mention Resolution
• Evaluation on Existing Collections
• Scalable MapReduce Implementation
• New Test Collection
• Conclusion and Future Work

26 Mention Resolution
Candidates. Likelihood: p(“sheila” | c)
Goal: estimate p(c | m, X(m)) and rank accordingly
Date: Wed Dec 20 08:57:00 EST 2000
From: Kay Mann
To: Suzanne Adams
Subject: Re: GE Conference Call has be rescheduled
Did Sheila want Scott to participate? Looks like the call will be too late for him.
“Sheila”: candidate ranks 1, 2, 3, ?
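
The ranking goal on this slide can be illustrated in its simplest, context-free form: multiply a prior over candidates by the likelihood of the mention and normalize. The candidate ids and probabilities are made up for illustration, and the contextual evidence X(m) is deliberately left out of this sketch.

```python
def rank_candidates(prior, likelihood):
    """Rank candidates by posterior, proportional to p(c) * p(m | c).

    prior:      {candidate: p(c)}
    likelihood: {candidate: p(m | c)} for one observed mention m
    """
    joint = {c: prior[c] * likelihood.get(c, 0.0) for c in prior}
    total = sum(joint.values())
    if total == 0.0:
        return []
    posterior = {c: v / total for c, v in joint.items()}
    return sorted(posterior.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical numbers for three candidates of one mention.
ranking = rank_candidates(
    {"c1": 0.5, "c2": 0.3, "c3": 0.2},
    {"c1": 0.2, "c2": 0.6, "c3": 0.1},
)
```

In this toy case the second candidate wins despite a lower prior, because the mention is much more likely under its identity model.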

27 [1] Context-Free Resolution. Graph: the mention “Sheila” with social, conversational, and topical edges to the mentions “Sheila Tweed”, “sheila”, “jsheila@enron.com”, “sg”, “Sheila Walton”, and “Sheila”. Diagram labels: X; Context-Free Resolution

28 [2] Contextual Resolution. Graph: the mention “Sheila” with social and topical edges to the mentions “Sheila Tweed”, “sheila”, “jsheila@enron.com”, “sg”, “Sheila Walton”, and “Sheila”. Diagram label: Context-Free Resolution

29 Outline
• Introduction and Approach Overview
• Computational Model of Identity
• Context Reconstruction
• Mention Resolution
• Evaluation on Existing Collections
• Scalable MapReduce Implementation
• New Test Collection
• Conclusion

30 Test Collections
Collection     Emails    Identities  Mention Queries  Candidates (Min. / Avg. / Max.)
Sager          1,628     627         51               1 / 4 / 11
Shapiro        974       855         49               1 / 8 / 21
Enron-subset   54,018    27,340      78               1 / 152 / 489
Enron-all      248,451   123,783     78               3 / 518 / 1785

31 Evaluation Measures
Commonly used in “known-item” retrieval:
• Success @1 (i.e., Precision @1): one-best
• MRR (Mean Reciprocal Rank): inverse of the harmonic mean of the ranks r_i of the true answers
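
Both measures are simple to compute from the rank of the true answer for each query. The sketch below assumes 1-based ranks and that every query has its true answer somewhere in the ranked list; the sample ranks are invented.

```python
def success_at_1(ranks):
    """Fraction of queries whose true answer is ranked first."""
    return sum(1 for r in ranks if r == 1) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Average of 1/r_i, i.e. the inverse of the harmonic mean of the ranks."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Ranks of the true answer for four hypothetical mention-queries.
ranks = [1, 2, 1, 4]
s1 = success_at_1(ranks)
mrr = mean_reciprocal_rank(ranks)
```

MRR rewards near misses (a true answer at rank 2 still contributes 0.5), whereas Success @1 counts only exact top-rank hits.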

32 Comparison w/Literature
               MRR                            Success @ 1
Collection     Context Expansion  Lit. Best   Context Expansion  Lit. Best
Sager          0.911              0.889       0.863              0.804
Shapiro        0.913              0.879       0.878              0.779
Enron-subset   0.91               -           0.846              (0.82)
Enron-all      0.89               -           0.821              -
Callouts: “Earlier expansion approach, reported in ACL 2008”; “Improved expansion”; 0.87; 0.92

33 Limitations
• Resolving single mentions
• All mention-queries are sampled from Enron-to-Enron emails
• All mention-queries refer to Enron employees
• Small for train/test split
Next: Scalable Implementation for Resolving All Mentions; New Test Collection

34 Outline
• Introduction and Approach Overview
• Computational Model of Identity
• Context Reconstruction
• Mention Resolution
• Evaluation on Existing Collections
• Scalable MapReduce Implementation
• New Test Collection
• Conclusion

35 Scalable Implementation
Two bottlenecks:
1. Context expansion of ALL emails: for each email, a ranked list of “similar” emails
2. Resolution of ALL mentions: resolution of one mention depends on resolution of all other mentions in context

36 Context Expansion of ALL Emails
• Goal: for each email, a ranked list of “similar” emails
• Needed for BOTH social and topical contexts
• Efficient implementation
Abstract problem: computing pairwise similarity

37 Trivial Solution
• Load each vector O(N) times
• Load each term O(df_t^2) times
Goal: scalable and efficient solution for large collections

38 Better Solution
• Load weights for each term once
• Each term contributes O(df_t^2) partial scores
Each term contributes only if it appears in both documents of a pair

39 MapReduce Framework
(a) Map: (k1, v1) → [(k2, v2)]
(b) Shuffle: group values by key
(c) Reduce: (k2, [v2]) → [(k3, v3)]
Input flows through parallel map tasks, the shuffle, and parallel reduce tasks to the output. The framework handles low-level details transparently.
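
The map, shuffle, and reduce stages on this slide can be imitated in a few lines of single-process code. This is a toy simulation for intuition only, not a distributed framework; `run_mapreduce`, `count_mapper`, and `sum_reducer` are hypothetical names, and the token-counting job is just a stand-in workload.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Simulate map -> shuffle (group by key) -> reduce in one process."""
    groups = defaultdict(list)
    for k1, v1 in records:                # (a) map: (k1, v1) -> [(k2, v2)]
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)         # (b) shuffle: group values by key
    output = []
    for k2 in sorted(groups):             # (c) reduce: (k2, [v2]) -> [(k3, v3)]
        output.extend(reducer(k2, groups[k2]))
    return output

def count_mapper(doc_id, text):
    """Emit (token, 1) for every token in the text."""
    for token in text.lower().split():
        yield token, 1

def sum_reducer(token, counts):
    """Sum the counts collected for one token."""
    yield token, sum(counts)

result = run_mapreduce(
    [("e1", "Sheila call"), ("e2", "call GE")], count_mapper, sum_reducer
)
```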

40 Decomposition
• Load weights for each term once
• Each term contributes O(df_t^2) partial scores
Each term contributes only if it appears in both documents of a pair
Diagram labels: map; reduce
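
The decomposition can be sketched as follows: build a postings list per term, let each term emit a partial product for every pair of documents that contain it (the map side), then sum the partial products per pair (the reduce side). The sketch assumes dot-product similarity over term weights and runs in a single process; names and weights are illustrative, and practical refinements such as a df-cut are omitted.

```python
from collections import defaultdict
from itertools import combinations

def pairwise_similarity(doc_vectors):
    """Dot-product similarity for all document pairs sharing at least one term.

    doc_vectors: {doc_id: {term: weight}}
    """
    # Indexing: term -> postings list of (doc_id, weight); weights loaded once.
    postings = defaultdict(list)
    for doc_id, vec in doc_vectors.items():
        for term, weight in vec.items():
            postings[term].append((doc_id, weight))

    # Map side: a term with df postings emits df*(df-1)/2 partial scores.
    partials = defaultdict(list)
    for term, plist in postings.items():
        for (d1, w1), (d2, w2) in combinations(sorted(plist), 2):
            partials[(d1, d2)].append(w1 * w2)

    # Reduce side: sum the partial scores of each document pair.
    return {pair: sum(values) for pair, values in partials.items()}

sims = pairwise_similarity({
    "e1": {"sheila": 1.0, "call": 2.0},
    "e2": {"call": 3.0, "ge": 1.0},
    "e3": {"sheila": 2.0, "ge": 4.0},
})
```

Pairs with no term in common never appear in the output, which is what makes this cheaper than comparing every vector with every other vector.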

41 Expansion Using MapReduce
• Using generic pairwise-similarity for both topical and social expansion
Diagram labels: doc rep. (topical: body/root/path; social: participants); df-cut; context sim model; temporal sim model; time window; rank cut-off; context graph

42 Context Mention-Graph. Graph: the mention “Sheila” with social, conversational, and topical edges to the mentions “Sheila Tweed”, “sheila”, “jsheila@enron.com”, “sg”, “Sheila Walton”, and “Sheila”. Diagram labels: Context-Free Resolution; map (×4); reduce

43 Resolution System Using MapReduce
Inputs: Emails; Threads; Identity Models
Expansion: Social Expansion; Conv. Expansion; Topical Expansion; Local Expansion
Graphs: Local Graph; Social Graph; Topical Graph; Conv. Graph
Resolution: Mention Recognition and Prior Computation; Prior Resolution; Posterior Resolution; Social Resolution; Conv. Resolution; Topical Resolution; Local Resolution; Merging Context Resolutions; Prior Dict.
Stage labels: Preprocessing; Expansion; Resolution; Packing

44 Outline
• Introduction and Approach Overview
• Computational Model of Identity
• Context Reconstruction
• Mention Resolution
• Evaluation on Existing Collections
• Scalable MapReduce Implementation
• New Test Collection
• Conclusion

45 New Test Collection
• Random sample from CMU-Enron collection
• “Annotation + Search” interface available
• Total annotators: 3
• Annotation time: ~50 hours
• Not only resolutions: time, difficulty, confidence, evidence, and comments
• Total mention-queries: 584
• 80% resolvable, 82% of them to Enron domain
• Overall inter-annotator agreement: ~81%

46 Mention-Query Selection

47 Distribution of Names Based on Resolution

48 Distribution Based on Difficulty

49 Distribution Based on Confidence

50 Distribution of Time Spent

51 Evaluation again …

52 Pairwise Agreement. Diagram of annotators a1, a2, a3 with categories Enron-resolvable, Non-enron-resolvable, and Unresolvable. Values as shown: 195; 50; 190; 199; 16/16 (100%); 2/4 (50%); 4/7 (57%); 50; 27; 23; 12/16 (75%); 1/5 (20%); 2/2 (100%); 35/38 (92%); 2/2 (100%); 6/10 (60%); 24/27 (89%); 5/12 (42%); 11/11 (100%)

53 Individual Annotator Agreement

54 Overall Agreement

55 Agreement Based on Difficulty

56 Agreement Based on Confidence

57 Conclusion and Future Work
• Identity resolution by non-participants is feasible, and automatic systems for it can be built (~90-75% accurate)
• Proposed generative probabilistic model: context expansion using temporal similarity; scalable implementation using “Pairwise Sim with MapReduce”
• Developed largest test collection for the task: 80% resolvable, 82% of them to Enron employees
• Effectiveness scales well to large collections
• Efficiency results
• Evaluation using double-assessments
• Iterative approach for “joint resolution”

58 Thank You!

59 Related Work
• Diehl et al. (SIAM, 2006): developed Enron-subset collection; temporal traffic models; candidates must have communicated with sender
• Minkov et al. (SIGIR, 2006): developed Sager and Shapiro collections; graphical framework; large collections?

