
CONTEXTUAL SEARCH AND NAME DISAMBIGUATION IN EMAIL USING GRAPHS EINAT MINKOV, WILLIAM W. COHEN, ANDREW Y. NG SIGIR’06 Date: 2008/7/17 Advisor: Dr. Koh,


1 CONTEXTUAL SEARCH AND NAME DISAMBIGUATION IN EMAIL USING GRAPHS
EINAT MINKOV, WILLIAM W. COHEN, ANDREW Y. NG
SIGIR’06
Date: 2008/7/17
Advisor: Dr. Koh, JiaLing
Speaker: Li, HueiJyun

2 OUTLINE
Introduction
Email as a graph
Graph similarity
Learning
Evaluation
Conclusion

3 INTRODUCTION
In modern IR settings, documents are frequently connected to other objects via hyperlinks or meta-data.
It is important to understand how text-based document similarity measures can be extended to documents embedded in complex structural settings.
This paper considers schemes for propagating similarity across a graph that naturally models a structured dataset such as an email corpus.

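4 EMAIL AS A GRAPH
A graph consists of nodes and labeled directed edges; an edge from x to y with label l is written $x \xrightarrow{l} y$.
Edge labels determine the source and target node types.
Multiple relations can hold between any particular pair of node types: it could be that $x \xrightarrow{l} y$ or $x \xrightarrow{l'} y$, where $l \neq l'$.
The edges need not denote functional relations: for a given x and l, there may be many distinct nodes y such that $x \xrightarrow{l} y$.
Inverse labels: each edge also has a counterpart with the inverse label in the opposite direction, so the graph will definitely be cyclic.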
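The graph described on this slide can be sketched as a small data structure. This is only an illustration, not the authors' implementation: the class `TypedGraph`, the "-inv" naming of inverse labels, and the node, type, and label names in the example are all assumptions made for the sketch.

```python
from collections import defaultdict


class TypedGraph:
    """Directed graph with typed nodes and labeled edges (illustrative only)."""

    def __init__(self):
        self.node_type = {}  # node -> type name
        # x -> label -> set of target nodes y
        self.out_edges = defaultdict(lambda: defaultdict(set))

    def add_node(self, node, node_type):
        self.node_type[node] = node_type

    def add_edge(self, x, label, y):
        """Add the edge x --label--> y and its inverse y --label-inv--> x."""
        self.out_edges[x][label].add(y)
        self.out_edges[y][label + "-inv"].add(x)

    def neighbors(self, x, label):
        """All nodes y reachable from x over one edge with the given label."""
        return sorted(self.out_edges[x][label])


# Hypothetical toy data: one message and two addresses.
g = TypedGraph()
g.add_node("msg1", "file")
g.add_node("alice@example.com", "email-address")
g.add_node("bob@example.com", "email-address")
g.add_edge("msg1", "sent-from", "alice@example.com")
g.add_edge("msg1", "sent-to", "alice@example.com")  # second relation, same node pair
g.add_edge("msg1", "sent-to", "bob@example.com")    # non-functional: two targets
```

Because every `add_edge` call also stores the inverse edge, any edge immediately creates a cycle of length two, which is the sense in which such a graph is always cyclic.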

5 GRAPH SIMILARITY * EDGE WEIGHTS
Similarity between two nodes is defined by a lazy walk process, and a walk on the graph is controlled by a small set of parameters Θ.
To walk away from a node x, one first picks an edge label l; then, given l, one picks a node y such that $x \xrightarrow{l} y$.
The probability of picking the label l depends only on the type T(x) of the node x.
Once l is picked, y is chosen uniformly from the set of all y such that $x \xrightarrow{l} y$.
That is, the weight of an edge of type l connecting source node x to node y is
$$\Pr(x \xrightarrow{l} y) = \frac{\Pr(l \mid T(x))}{\left|\{y' : x \xrightarrow{l} y'\}\right|}$$
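The two-stage choice described on this slide (label first, then a uniformly chosen target) can be sketched as below. The function name `edge_weights`, the dictionary layout, and the toy values are assumptions for illustration; the renormalization over labels that actually leave x is also an assumption of this sketch, and the "stay in place" part of the lazy walk is left out.

```python
def edge_weights(out_edges, node_type, theta, x):
    """Return {(label, y): weight} for a single step away from node x.

    out_edges: x -> {label: [target nodes]}
    theta:     node type -> {label: Pr(label | type)}
    """
    present = out_edges[x]              # labels that actually leave x
    label_probs = theta[node_type[x]]   # depends only on the type T(x)
    # Assumption: renormalize over the labels present at x so that the
    # weights of x's outgoing edges sum to 1.
    total = sum(label_probs[label] for label in present)
    weights = {}
    for label, targets in present.items():
        for y in targets:
            # Label probability, then a uniform choice among the targets.
            weights[(label, y)] = label_probs[label] / total / len(targets)
    return weights


# Hypothetical toy data.
out_edges = {"msg1": {"sent-from": ["alice"], "sent-to": ["bob", "carol"]}}
node_type = {"msg1": "file"}
theta = {"file": {"sent-from": 0.5, "sent-to": 0.5}}
w = edge_weights(out_edges, node_type, theta, "msg1")
```

In this toy case the single "sent-from" edge gets weight 0.5, while the two "sent-to" edges share the remaining 0.5 equally.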

6 GRAPH SIMILARITY * GRAPH WALKS
Associate nodes with integers, and make M a matrix indexed by nodes; a walk of k steps can then be defined by matrix multiplication.
If $V_0$ is some initial probability distribution over nodes, then the distribution after a k-step walk is proportional to $V_k = V_0 M^k$.
In our framework, a query is an initial distribution $V_q$ over nodes, plus a desired output type $T_{out}$, and the answer is a list of nodes y of type $T_{out}$, ranked by their score in the distribution $V_k$.
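A k-step walk and the typed ranking of its result can be sketched with a sparse dictionary in place of the matrix M. The function names, the toy transition table, and the stay probability of 0.5 used to make the walk lazy are all made-up values for illustration, not figures from the paper.

```python
def walk(v0, M, k):
    """Return V_k = V_0 M^k as a dict node -> probability.

    v0: dict node -> initial probability
    M:  dict x -> {y: transition probability from x to y}
    """
    v = dict(v0)
    for _ in range(k):
        nxt = {}
        for x, px in v.items():
            for y, pxy in M.get(x, {}).items():
                nxt[y] = nxt.get(y, 0.0) + px * pxy
        v = nxt
    return v


def answer(v0, M, k, node_type, t_out):
    """Rank nodes of type t_out by their score in V_k, highest first."""
    vk = walk(v0, M, k)
    hits = [(y, p) for y, p in vk.items() if node_type[y] == t_out]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy graph: a term, two messages, two persons.
node_type = {"andrew": "term", "msg1": "file", "msg2": "file",
             "person_a": "person", "person_b": "person"}
M = {
    "andrew": {"andrew": 0.5, "msg1": 0.25, "msg2": 0.25},
    "msg1": {"msg1": 0.5, "person_a": 0.5},
    "msg2": {"msg2": 0.5, "person_a": 0.25, "person_b": 0.25},
    "person_a": {"person_a": 1.0},
    "person_b": {"person_b": 1.0},
}
v_q = {"andrew": 1.0}  # query: all initial mass on the term node
ranked = answer(v_q, M, 2, node_type, "person")
```

Here `person_a` is reachable through both messages and therefore outranks `person_b`, which is the kind of evidence accumulation the walk is meant to capture.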

7 LEARNING
Reranking: a learned classifier re-orders an initial ranking.
The advantage is that the learned classifier can exploit “global” features that are not easily used in a walk.
Reference: M. Collins and T. Koo. Discriminative reranking for natural language parsing. Computational Linguistics, 2005.
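The reranking step can be pictured with a minimal sketch. It does not implement the boosting procedure from the cited work: the weights here are fixed by hand rather than learned, and the function `rerank`, the feature name, and the scores are hypothetical. The linear form, the log of the initial score plus weighted feature values, is one common shape for a reranker and is assumed here.

```python
import math


def rerank(candidates, weights, w0=1.0):
    """Re-order candidates by w0*log(walk_score) + sum of weighted features."""
    def score(candidate):
        s = w0 * math.log(candidate["walk_score"])
        for name, value in candidate["features"].items():
            s += weights.get(name, 0.0) * value
        return s
    return sorted(candidates, key=score, reverse=True)


# Hypothetical initial ranking, ordered by walk score.
initial = [
    {"name": "person_a", "walk_score": 0.10,
     "features": {"reached_by_several_paths": 0}},
    {"name": "person_b", "walk_score": 0.08,
     "features": {"reached_by_several_paths": 1}},
]
weights = {"reached_by_several_paths": 0.5}  # hand-set, not learned
reranked = rerank(initial, weights)
```

With these made-up numbers the feature bonus outweighs the small gap in walk scores, so the second candidate moves to the top.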

8 EVALUATION * CORPORA
In this paper we evaluate the system on two tasks: person name disambiguation and email threading.
Corpora: the Cspace corpus and the Enron corpus.

9 EVALUATION * PERSON NAME DISAMBIGUATION
Task definition: automatically determining that “Andrew” refers to “Andrew Y. Ng” and not “Andrew McCallum”. This is especially difficult when an informal nickname is used, or when the mentioned person does not appear in the email header.
Based on a name mention in an email message m, we formulate a query distribution $V_q$, and then retrieve a ranked list of person nodes.

10 EVALUATION * PERSON NAME DISAMBIGUATION
Results for person name disambiguation

11 EVALUATION * PERSON NAME DISAMBIGUATION

12 EVALUATION * THREADING
Threading is the problem of retrieving other messages in an email thread given a single message from the thread.
Users make inconsistent use of the “reply” mechanism, and there are frequent irregularities in the structural information that indicates threads.
Thread information can improve message categorization into topical folders.
Given an email file as a query, produce a ranked list of related email files, where the immediate parent and child of the given file are considered to be “correct” answers.
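One way to score a ranked list against a small set of “correct” answers, such as the parent and child of the query message, is average precision. The slide does not state which metric was used, so the choice of metric, the function name, and the message identifiers below are assumptions for illustration.

```python
def average_precision(ranked, correct):
    """Average of precision values at the ranks where a correct item appears."""
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in correct:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(correct) if correct else 0.0


# Hypothetical ranked output for one query message.
ranked_files = ["m7", "m3", "m9", "m4"]
correct_files = {"m3", "m4"}  # immediate parent and child of the query
ap = average_precision(ranked_files, correct_files)
```

In this toy case the correct files appear at ranks 2 and 4, giving precisions of 1/2 and 2/4, so the average precision is 0.5.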

13 EVALUATION * THREADING

14 EVALUATION * THREADING

15 CONCLUSION
A scheme for representing a corpus of email messages with a graph of typed entities.
An extension of the traditional notions of document similarity to documents embedded in a graph.
Using a boosting-based learning scheme to rerank outputs, based on graph-walk related features, provides an additional performance improvement.

16 CONCLUSION
Preserving entity type allows one to formulate a broad range of problems as typed search queries.
Structural relation modeling provides a unified framework for integration of multiple types of information.

