
1 The Problem
Finding information about people in huge text collections or on-line repositories on the Web is a common activity
Person names, however, are highly ambiguous: in the US, 90,000 different names are shared by 100 million people
Cross-document coreference resolution is the task of identifying whether two mentions of the same (or a similar) name in different sources refer to the same individual
Solving this problem is important not only for better access to information but also in practical applications

2 SemEval 2007 Web People Search Task
A search engine user types in a person name as a query
Instead of ranking the Web pages, an ideal system should organize the results into as many clusters as there are different individuals sharing the name in the returned results
The system receives a set of documents matching a person name and returns clusters, each of which refers to a single individual

3 SemEval 2007 Web People Search Data
Training data (100 documents per person name)
–10 person names from the European Conference on Digital Libraries
–7 person names from Wikipedia
–32 person names from a previous study (Gideon&Mann’03)
Testing: 30 person names; pages returned by Yahoo!
System output is compared to a gold standard produced by humans

4 Examples
TRAINING: Donna Harman
–STATE AGENT: Boston MA Homes for Sale; Contact Us; Donna Harman; Phone: 617-269-3227; donna@kiley.com
–SCIENTIST: Donna Harman, Chris Buckley: The NRRC reliable information access (RIA) workshop. SIGIR 2004: 528-529
TESTING: George Foster
–SPORTSMAN: George Arthur Foster; Bats Right, Throws Right; Height 6' 1", Weight 185 lb.
–ACTOR: All George Foster Movies; Centipede (2004) PG-13; Killer Me (2001) NR

5 Metrics
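
The results later in the deck are reported as purity, inverse purity and F-score. The following is a minimal sketch of the usual definitions of these measures, not the official scorer. It assumes each document belongs to exactly one cluster and that the F-score is the equally weighted harmonic mean (alpha = 0.5). The function names and the dict-based input format are illustrative assumptions.

```python
from collections import Counter


def purity(system, gold):
    """Fraction of documents that fall in the majority gold class of their cluster.

    system, gold: dicts mapping document id -> cluster label
    (assumed format; one label per document).
    """
    clusters = {}
    for doc, label in system.items():
        clusters.setdefault(label, []).append(doc)
    matched = 0
    for docs in clusters.values():
        # Size of the largest overlap between this cluster and any gold class.
        matched += max(Counter(gold[d] for d in docs).values())
    return matched / len(system)


def inverse_purity(system, gold):
    """Purity computed with the roles of system and gold swapped."""
    return purity(gold, system)


def f_score(p, ip, alpha=0.5):
    """Weighted harmonic mean of purity and inverse purity."""
    return 1.0 / (alpha / p + (1.0 - alpha) / ip)


system = {"d1": "A", "d2": "A", "d3": "A", "d4": "B"}
gold = {"d1": "x", "d2": "x", "d3": "y", "d4": "y"}
p = purity(system, gold)
ip = inverse_purity(system, gold)
print(p, ip, f_score(p, ip))
```

Putting every document in one cluster maximises inverse purity, while one cluster per document maximises purity, which is why the two are combined in a single F-score.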

6 Clustering
Given a set of documents and a threshold:
1. Initially there are as many clusters as documents
2. All clusters are compared using a similarity metric
3. At each iteration the two most similar clusters are merged if their similarity is greater than the threshold (otherwise stop and return the clusters)
4. Go to step 2
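
The loop above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the slide does not say how the similarity of two multi-document clusters is derived, so the `linkage` parameter (defaulting to the maximum pairwise document similarity) is an assumption, as are the function and parameter names.

```python
from itertools import combinations


def agglomerative_cluster(docs, sim_d, threshold, linkage=max):
    """Merge the two most similar clusters until none exceeds the threshold.

    docs: list of document representations
    sim_d: function(doc_a, doc_b) -> similarity score
    linkage: how pairwise document similarities are combined into a
             cluster similarity (assumption: max, i.e. single link).
    """
    # Step 1: one cluster per document.
    clusters = [[d] for d in docs]

    def sim_c(a, b):
        return linkage(sim_d(x, y) for x in a for y in b)

    while len(clusters) > 1:
        # Step 2: compare all pairs of clusters.
        (i, j), best = max(
            (((i, j), sim_c(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        # Step 3: stop when the best pair is not similar enough.
        if best <= threshold:
            break
        clusters[i].extend(clusters[j])
        del clusters[j]  # j > i, so index i is still valid
    return clusters


# Toy usage: "documents" are numbers, similarity decays with distance.
toy_docs = [1.0, 1.1, 5.0, 5.2]
toy_sim = lambda x, y: 1.0 / (1.0 + abs(x - y))
toy_clusters = agglomerative_cluster(toy_docs, toy_sim, threshold=0.5)
print(toy_clusters)
```

The number of clusters is not fixed in advance; it falls out of the threshold, which is why the threshold has to be estimated from training data.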

7 Document Representation
term frequency (tf) of term t in document d = the number of times t occurs in d
inverted document frequency (idf) of term t in collection c = the number of documents in c containing t (a raw count, more commonly called the document frequency)
Bag-of-words approach = terms are words
–text = (word_1 = w_1, …)
Semantic-based approach = terms are named entities (person, location, organization, date, address)
–text = (ne_1 = w_1, …)
Two approaches to extract terms:
–terms belong to the full document (full document condition)
–terms belong to personal summaries (summary condition)
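
As a rough illustration of the two counts defined on this slide, the sketch below computes per-document term frequencies and a per-collection table of document counts. It assumes the terms (words or named-entity strings) have already been extracted into lists; the function names and sample terms are hypothetical.

```python
from collections import Counter


def term_frequencies(terms):
    """tf: how many times each term occurs in one document."""
    return Counter(terms)


def document_counts(docs_terms):
    """Number of documents in the collection that contain each term."""
    counts = Counter()
    for terms in docs_terms:
        counts.update(set(terms))  # count each term once per document
    return counts


# Toy collection: each document is a list of already-extracted terms.
collection = [
    ["darpa", "mit press", "darpa"],
    ["mit press", "aaai"],
]
tf_first = term_frequencies(collection[0])
counts = document_counts(collection)
print(tf_first, counts)
```

The same two functions serve both representations: only the term extraction step differs between the bag-of-words and the named-entity approach.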

8 Examples of terms
Organization: DARPA; MIT Press; Artificial Intelligence Center; AAAI; Department of Computer Science; etc.
Person: Douglas E. Appelt; David J. Israel; Jean-Claude Martin; etc.
Location: Menlo Park; Las Palmas; Clearwater Beach; etc.
Date: 1995-2007; 15 February 2007; 20:34; etc.
Address: http://acl.ldc.upenn.edu/J/J87/; Los Angeles Area; ontherecord@foxnews.com; 105 Chamber Street; etc.

9 Implementation Details
local IDF tables are computed for each set of documents
weights are tf*log(N/idf), where N is the size of the document set
sim_C is the cluster similarity; sim_D is the document similarity, which is the cosine metric
threshold estimated over training data
–the algorithm was run over the ECDL training data and the similarity value giving the optimal f-score was recorded for each instance
–the threshold for testing is set to the average of the optimal thresholds (for the word-based representation and the semantic-based representation)
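
A minimal sketch of the weighting and the document similarity described above, assuming the slide's "idf" is the number of documents containing the term and that the logarithm is natural (the base is not stated). Vectors are plain dicts from term to weight; all names and the sample terms are illustrative.

```python
import math
from collections import Counter


def build_vectors(docs_terms):
    """Weight each term as tf * log(N / n_t), using a table local to this set.

    docs_terms: list of term lists, one per document.
    n_t is the number of documents in the set that contain the term.
    """
    n = len(docs_terms)
    doc_count = Counter()
    for terms in docs_terms:
        doc_count.update(set(terms))
    vectors = []
    for terms in docs_terms:
        tf = Counter(terms)
        vectors.append({t: tf[t] * math.log(n / doc_count[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)


sample = [
    ["darpa", "sri", "menlo park"],
    ["darpa", "sri", "aaai"],
    ["boston", "homes", "menlo park"],
]
vecs = build_vectors(sample)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Because the table is local to the documents retrieved for one name, a term that occurs in every document of the set gets weight log(1) = 0 and does not influence the clustering.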

10 [System architecture diagram; labels only: PERSON NAME, WEB, SEARCH ENGINE, DOCS, GATE ANNIE SYSTEM, ANNOTATED DOCS, SUMMARIZATION TOOLKIT, PERSONAL SUMMARIES, IDF TABLES, VECTOR EXTRACTOR, VECS, THRESHOLD, CLUSTERING, CLUSTERS]

11 NLP Components
Use ANNIE system, a GATE information extraction system (http://gate.ac.uk)
–Tokeniser
–Sentence splitter
–Gazetteer list lookup
–Regular expressions over annotations (JAPE)
–Part-of-speech tagging
–Coreference resolution
Use in-house Summarization Toolkit (http://www.dcs.shef.ac.uk/~saggion)
–term frequency statistics; Vector Space Model representation; IDF tables computation
–Personal summaries

12 Analysed Document

13 Personal Summaries
Coreference chains are identified (in each document)
All elements in a coreference chain containing the target person are marked
Sentences containing a marked person name are selected for the summary
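
The selection step can be sketched as follows. This is an illustration only: the real system relies on GATE's coreference output, whereas the chain format here (lists of `(sentence_index, mention_text)` pairs) and the case-insensitive substring match on the target name are assumptions made for the example, and the sample text is invented.

```python
def personal_summary(sentences, chains, target):
    """Select the sentences that mention the target person.

    sentences: list of sentence strings for one document
    chains: list of coreference chains; each chain is a list of
            (sentence_index, mention_text) pairs (assumed format)
    target: the person name used as the query
    """
    target_lower = target.lower()
    selected = set()
    for chain in chains:
        # A chain is about the target if any of its mentions contains the name.
        if any(target_lower in mention.lower() for _, mention in chain):
            # Mark every element of the chain, pronouns included.
            selected.update(index for index, _ in chain)
    return [sentences[i] for i in sorted(selected)]


sentences = [
    "Jane Doe joined the lab in 1999.",
    "She later led the retrieval group.",
    "John Roe wrote the report.",
    "The workshop took place in July.",
]
chains = [
    [(0, "Jane Doe"), (1, "She")],
    [(2, "John Roe")],
]
summary = personal_summary(sentences, chains, "Jane Doe")
print(summary)
```

Following the whole chain is what lets the summary keep sentences that refer to the person only by a pronoun or a partial name.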

14 SemEval Results
4 configurations for SemEval 2007
–System 1 = full document & words
–System 2 = full document & NEs (system submitted for official evaluation)
–System 3 = summary & words
–System 4 = summary & NEs

Config.   Purity  I-Purity  F-Score
System 1  0.68    0.85      0.74
System 2  0.62    0.85      0.68
System 3  0.84    0.70      0.74
System 4  0.65    0.75      0.64

–best system obtained f-score = 0.78; our system ranked 5th out of 16 participants; all our system configurations obtained an f-score above that of the average system

15 The Effect of Semantic Information
Post-SemEval experiments studied the effect of each type of information: vectors were created for each type of NE and the documents re-clustered

FULL TEXT CONDITION
NE type       Purity  I-Purity  F-score
Organization  0.90    0.72      0.78
Person        0.81    0.72      0.75
Address       0.82    0.64      0.69
Date          0.58    0.85      0.67
Location      0.55    0.85      0.64

SUMMARY CONDITION
NE type       Purity  I-Purity  F-score
Person        0.85    0.64      0.70
Organization  0.97    0.57      0.69
Date          0.87    0.60      0.68
Location      0.82    0.63      0.67
Address       0.93    0.54      0.65

16 Conclusions
Presented an approach to cross-document coreference based on available robust extraction and summarization technology
Approach is largely unsupervised; it needs some training data to set parameters
System demonstrated good performance in SemEval 2007 Web People Search Task
Special attention should be given to the type of information used for representing vectors in order to achieve optimal performance

