Using Encyclopedic Knowledge for Named Entity Disambiguation Razvan Bunescu Machine Learning Group Department of Computer Sciences University of Texas.

Using Encyclopedic Knowledge for Named Entity Disambiguation Razvan Bunescu Machine Learning Group Department of Computer Sciences University of Texas at Austin razvan@cs.utexas.edu Marius Pasca Google Inc. 1600 Amphitheatre Parkway Mountain View, CA mars@google.com

Introduction: Disambiguation
Some names denote multiple entities:
– “John Williams and the Boston Pops conducted a summer Star Wars concert at Tanglewood.” John Williams → John Williams (composer)
– “John Williams lost a Taipei death match against his brother, Axl Rotten.” John Williams → John Williams (wrestler)
– “John Williams won a Victoria Cross for his actions at the battle of Rorke’s Drift.” John Williams → John Williams (VC)

Introduction: Normalization
Some entities have multiple names:
– John Williams (composer) → John Williams
– John Williams (composer) → John Towner Williams
– John Williams (wrestler) → John Williams
– John Williams (wrestler) → Ian Rotten
– Venus (planet) → Venus
– Venus (planet) → Morning Star
– Venus (planet) → Evening Star

Introduction: Motivation Web searches –Queries about Named Entities (NEs) constitute a significant portion of popular web queries. –Ideally, search results are clustered such that: In each cluster, the queried name denotes the same entity. Each cluster is enriched by querying the web with alternative names of the corresponding entity. Web-based Information Extraction (IE) –Aggregating extractions from multiple web pages can lead to improved accuracy in IE tasks (e.g. extracting relationships between NEs). –Named entity disambiguation is essential for performing a meaningful aggregation.

Introduction: Approach
Build a dictionary D of named entities:
– Use information from a large-coverage encyclopedia: Wikipedia.
– Each name d ∈ D is mapped to d.E, the set of entities that d can refer to in Wikipedia.
Design a method that takes as input a proper name in its document context, and can be trained to:
1) Detect when a proper name refers to an entity from D. [Detection]
2) Find the named entity referred to in that context. [Disambiguation]

Introduction: Example
[Figure: the Dictionary links the names John Williams, John Towner Williams, and Ian Rotten to the entities John Williams (composer), John Williams (VC), John Williams (wrestler), and John Williams (other). The Document mention “… this past weekend. John Williams and the Boston Pops conducted a summer Star Wars concert at Tanglewood …” has to be resolved to one of these entities.]

Outline Introduction Wikipedia Structures –Named Entity Dictionary –Disambiguation Dataset Disambiguation & Detection Experimental Evaluation Future Work Conclusions

Wikipedia – A Wiki Encyclopedia Wikipedia – a free online encyclopedia written collaboratively by volunteers, using wiki software. 200 language editions, with varying levels of coverage. Very dynamic and quickly growing resource: –May 2005: 577,860 articles –Sep. 2005: 751,666 articles

Wikipedia Articles & Titles
Each article describes a specific entity or concept. An article is uniquely identified by its title.
– Usually, the title is the most common name used to denote the entity described in the article.
– If the title name is ambiguous, it may be qualified with an expression between parentheses.
– Example: John Williams (composer)
Notation:
– E: the set of all named entities from Wikipedia.
– e ∈ E: an arbitrary named entity.
– e.title: the title name.
– e.T: the text of the article.

Wikipedia Structures In general, there is a many-to-many relationship between names and entities, captured in Wikipedia through: –Redirect articles. –Disambiguation articles. Hyperlinks: An article may contain links to other articles in Wikipedia. Categories: each article belongs to at least one Wikipedia category.

Redirect Articles
A redirect article exists for each alternative name used to refer to an entity in Wikipedia.
Example: The article titled John Towner Williams consists of a pointer to the article John Williams (composer).
Notation:
– e.R: the set of all names that redirect to e.
Example:
– e.title = United States.
– e.R = {USA, US, Estados Unidos, Untied States, Yankee Land, …}.

Disambiguation Articles
A disambiguation article lists all Wikipedia entities (articles) that may be denoted by an ambiguous name.
Example: The article titled John Williams (disambiguation) lists 22 entities (articles).
Notation:
– e.D: the set of names whose disambiguation pages contain a link to e.
Example:
– e.title = Venus (planet).
– e.D = {Venus, Morning Star, Evening Star}.

Named Entity Dictionary
Named entities: entities with a proper name title.
All Wikipedia titles begin with a capital letter, so capitalization of the first letter alone is not informative ⇒ 3 heuristics for detecting proper name titles:
1) If e.title is a multiword title, then e is a named entity only if all content words are capitalized (e.g. The Witches of Eastwick).
2) If e.title is a one-word title that contains at least two capital letters, then e is a named entity (e.g. NATO).
3) If at least 75% of the title occurrences inside the article are capitalized, then e is a named entity.
Notation:
– d ∈ D is a proper name entry in the dictionary D (≈ 500K entries).
– d.E is the set of entities that may be denoted by d in Wikipedia.
– e ∈ d.E ⟺ d = e.name ∨ d ∈ e.R ∨ d ∈ e.D (e.name = e.title without the expression between parentheses).
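The three heuristics and the membership rule for d.E can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function-word list, the regular expressions, and the input record layout (`title`, `text`, `redirects`, `disambig`) are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical stop list: function words that may stay lowercase in a title.
FUNCTION_WORDS = {"a", "an", "the", "of", "in", "on", "at", "for",
                  "and", "or", "to", "by", "with"}

def strip_qualifier(title):
    """e.name: the title without a trailing parenthesized qualifier."""
    return re.sub(r"\s*\([^)]*\)\s*$", "", title)

def is_named_entity(title, article_text=""):
    name = strip_qualifier(title)
    words = name.split()
    if len(words) > 1:
        # Heuristic 1: all content words of a multiword title are capitalized.
        content = [w for w in words
                   if w.lower() not in FUNCTION_WORDS and w[0].isalpha()]
        return all(w[0].isupper() for w in content)
    # Heuristic 2: a one-word title with at least two capital letters.
    if sum(ch.isupper() for ch in name) >= 2:
        return True
    # Heuristic 3: at least 75% of the occurrences in the article are capitalized.
    hits = re.findall(r"\b%s\b" % re.escape(name), article_text, flags=re.IGNORECASE)
    if not hits:
        return False
    return sum(h[0].isupper() for h in hits) / len(hits) >= 0.75

def build_dictionary(entities):
    """Map each name d to d.E, using e.name, e.R (redirects) and e.D (disambig)."""
    dictionary = defaultdict(set)
    for e in entities:
        if not is_named_entity(e["title"], e.get("text", "")):
            continue
        names = {strip_qualifier(e["title"])}
        names |= set(e.get("redirects", ()))
        names |= set(e.get("disambig", ()))
        for d in names:
            dictionary[d].add(e["title"])
    return dictionary

entities = [
    {"title": "John Williams (composer)", "redirects": ["John Towner Williams"],
     "disambig": ["John Williams"]},
    {"title": "John Williams (wrestler)", "redirects": ["Ian Rotten"],
     "disambig": ["John Williams"]},
    {"title": "Venus (planet)", "text": "Venus is a planet. Venus is bright.",
     "disambig": ["Venus", "Morning Star", "Evening Star"]},
    {"title": "Planet",
     "text": "A planet orbits a star. Every planet is round. Planet formation is slow."},
]
D = build_dictionary(entities)
```

In this toy input, the ambiguous name John Williams ends up with two candidate entities, while Planet is rejected by the third heuristic because only one of its three occurrences is capitalized.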

Hyperlinks
Mentions of entities in Wikipedia articles are often linked to their corresponding article, by using links or piped links.
Wiki source: The [[Vatican City|Vatican]] is now an enclave surrounded by [[Rome]].
Display string: The Vatican is now an enclave surrounded by Rome.
Here [[Vatican City|Vatican]] is a piped link and [[Rome]] is a link.

Disambiguation Dataset
Hyperlinks in Wikipedia provide disambiguated named entity queries q.
The [[Vatican City|Vatican]] is now an enclave surrounded by [[Rome]].
– q1 = [[Vatican City|Vatican]]: the title is Vatican City, the display name is Vatican.
– q2 = [[Rome]]: the display name is the same as the title.
Notation:
– q.E: the set of entities that are associated in the dictionary D with the display name from the link.
– q.e ∈ q.E: the true entity associated with the query, given by the title included in the link.
– q.T: the text contained in a window of size 55 words [Gooi & Allan, 2004] centered on the link.
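A rough sketch of how queries could be pulled out of wiki source: each link yields a display name, a title, and a word window around the link. The regular expression, the whitespace tokenization, and the way the window is centered are assumptions for illustration; real wiki markup has more cases (templates, nested links, namespaces) than this handles.

```python
import re

# Matches [[Title]] and [[Title|Display]]; a simplification of real wiki markup.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]+))?\]\]")

def extract_queries(wiki_source, window=55):
    """Return one query per link: display name, title (q.e), and context (q.T)."""
    tokens, links = [], []
    pos = 0
    for m in LINK_RE.finditer(wiki_source):
        tokens.extend(wiki_source[pos:m.start()].split())
        title = m.group(1).strip()
        display = (m.group(2) or m.group(1)).strip()
        start = len(tokens)
        tokens.extend(display.split())          # the link is read as its display string
        links.append((start, len(tokens), display, title))
        pos = m.end()
    tokens.extend(wiki_source[pos:].split())

    queries = []
    half = window // 2
    for start, end, display, title in links:
        center = (start + end) // 2
        lo = max(0, center - half)
        hi = min(len(tokens), lo + window)
        queries.append({"display": display, "title": title, "context": tokens[lo:hi]})
    return queries

source = "The [[Vatican City|Vatican]] is now an enclave surrounded by [[Rome]]."
queries = extract_queries(source)
```

For the example sentence this produces two queries: one with display name Vatican and title Vatican City, and one where both are Rome.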

Disambiguation Dataset
Every entity e_k ∈ q.E contributes a disambiguation example, labeled 1 if and only if e_k = q.e.
Query q: “… this past weekend. [[John Williams]] and the Boston Pops conducted a summer Star Wars concert at Tanglewood …”
Label   Query Text (q.T)                                Entity Title (e_k.title)
1       Boston Pops conducted concert Star Wars …       e_1: John Williams (composer)
0       Boston Pops conducted concert Star Wars …       e_2: John Williams (wrestler)
0       Boston Pops conducted concert Star Wars …       e_3: John Williams (VC)
The resulting dataset contains 1,783,868 queries.
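The labeling rule can be illustrated with a few lines of Python. The function name `make_examples` and the dictionary layout (display name mapped to a set of titles) are hypothetical; only the rule "label 1 if and only if the candidate is the linked title" comes from the slide.

```python
def make_examples(query_context, display_name, true_title, dictionary):
    """One (label, context, candidate title) triple per entity in q.E."""
    candidates = sorted(dictionary.get(display_name, ()))
    return [(int(title == true_title), query_context, title) for title in candidates]

dictionary = {
    "John Williams": {
        "John Williams (composer)",
        "John Williams (wrestler)",
        "John Williams (VC)",
    }
}
context = "Boston Pops conducted concert Star Wars".split()
examples = make_examples(context, "John Williams", "John Williams (composer)", dictionary)
```

Each query therefore contributes as many examples as its display name has candidates, with exactly one positive example when the linked title is among them.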

Categories
Each article in Wikipedia is required to be associated with at least one category.
Categories form a directed acyclic graph, which allows multiple categorization schemes to co-exist.
59,759 categories in the Wikipedia taxonomy.
Notation:
– e.C: the set of categories to which e belongs (ancestors included).
Example:
– e.title = Venus (planet).
– e.C = {Venus, Planets of the Solar System, Planets, Solar System}.

Outline Introduction Wikipedia Structures Named Entity Dictionary Disambiguation Dataset Disambiguation & Detection Experimental Evaluation Future Work Conclusions

NE Disambiguation: Two Approaches
1) Classification:
– Train a classifier for each proper name in the dictionary D.
– Not feasible: 500K proper names ⇒ need 500K classifiers!
2) Ranking:
– Design a scoring function score(q, e_k) that computes the compatibility between the context of the proper name occurring in a query q, and any of the entities e_k ∈ q.E that may be referred to by that proper name.
– For a given named entity query q, select the highest ranking entity:
$$\hat{e} = \arg\max_{e_k \in q.E} \mathrm{score}(q, e_k)$$

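Context-Article Similarity
NE disambiguation ⇒ ranking problem.
Use cosine similarity between query context and article, based on the tf × idf formulation:
$$\mathrm{score}(q, e_k) = \cos(q.T, e_k.T) = \frac{q.T}{\lVert q.T \rVert} \cdot \frac{e_k.T}{\lVert e_k.T \rVert}$$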
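A small, self-contained sketch of the cosine baseline. The exact tf × idf weighting is not given on the slide; the version below (raw term frequency times the natural log of N over document frequency, computed over the candidate articles only) is an assumption chosen for brevity.

```python
import math
from collections import Counter

def document_frequencies(docs):
    """Number of documents each word occurs in."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df

def tfidf(tokens, df, n_docs):
    """Sparse tf x idf vector; words unseen in the collection are dropped."""
    tf = Counter(tokens)
    return {w: f * math.log(n_docs / df[w]) for w, f in tf.items() if df.get(w, 0) > 0}

def cosine(u, v):
    dot = sum(x * v[w] for w, x in u.items() if w in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_by_cosine(query_tokens, articles):
    """Return the title whose article text is most similar to the query context."""
    df = document_frequencies(articles.values())
    n_docs = len(articles)
    q_vec = tfidf(query_tokens, df, n_docs)
    return max(articles, key=lambda t: cosine(q_vec, tfidf(articles[t], df, n_docs)))

articles = {
    "John Williams (composer)": "composer conducted film score concert orchestra".split(),
    "John Williams (wrestler)": "wrestler match ring death match".split(),
}
query = "boston pops conducted concert".split()
best = rank_by_cosine(query, articles)
```

The toy query shares "conducted" and "concert" with the composer article and nothing with the wrestler article, so the composer is ranked first.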

Word-Category Correlations
Problem: In many cases, given a query q, the true entity q.e fails to rank first because cue words from the query context do not occur in q.e’s article.
– The article may be too short, or incomplete.
– Relevant concepts from the query context are captured in the article through synonymous words or phrases.
Approach: Use correlations between words in the query context, w ∈ q.T, and categories to which the named entity belongs, c ∈ e.C.

Word-Category Correlations
“John Williams and the Boston Pops conducted a summer Star Wars concert at Tanglewood.”
[Figure: the context word “conducted” is connected to the categories of the two candidate entities. John Williams (composer): Film score composers, Composers, Musicians. John Williams (wrestler): Professional wrestlers, Wrestlers, People known in connection with sports and hobbies. Top of the hierarchy: People by occupation.]

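Ranking Formulation
Redefine q.E: the set of named entities from D that may be denoted by the display name in the query, plus an out-of-Wikipedia entity e_out.
Use a linear ranking function:
$$\mathrm{score}(q, e_k) = \mathbf{w} \cdot \Phi(q, e_k)$$
One feature for the context-article similarity:
$$\phi_{\cos}(q, e_k) = \cos(q.T, e_k.T)$$
Each word-category pair ⟨w, c⟩ ∈ V × C is translated into a feature:
$$\phi_{w,c}(q, e_k) = \begin{cases} 1, & \text{if } w \in q.T \text{ and } c \in e_k.C \\ 0, & \text{otherwise} \end{cases}$$
One special feature for out-of-Wikipedia entities:
$$\phi_{out}(q, e_k) = \begin{cases} 1, & \text{if } e_k = e_{out} \\ 0, & \text{otherwise} \end{cases}$$
The full feature vector is Φ = [φ_cos | φ_{w,c} | φ_out].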

Ranking Formulation: Example
“… this past weekend. John Williams and the Boston Pops conducted a summer Star Wars concert at Tanglewood.”
Candidates: e_1 = John Williams (composer), e_2 = John Williams (wrestler).
[Figure: the context word “conducted” is connected to the categories Film score composers, Composers, Musicians (composer) and Professional wrestlers, Wrestlers, People known in connection with sports and hobbies (wrestler), below People by occupation.]
– q.T = {past, weekend, Boston, Pops, conducted, summer, Star, Wars, concert, Tanglewood, …}
– e_1.C = {Film score composers, Composers, Musicians, People by occupation, …}
– e_out.C = ∅
– φ_{w,c}(q, e_1) = 1 if (w, c) ∈ q.T × e_1.C, and 0 otherwise.
– φ_{w,c}(q, e_out) = 0
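The feature map and the linear score can be sketched with sparse dictionaries. The weights below are made-up numbers for illustration, not learned values, and setting the cosine feature of e_out to zero is an assumption (e_out has no article text to compare against).

```python
def features(query_words, entity_categories, cos_sim, is_out=False):
    """Sparse Phi(q, e_k): keys are 'cos', 'out', or (word, category) pairs."""
    if is_out:
        # Assumption: e_out has no article and no categories, so only phi_out fires.
        return {"out": 1.0}
    phi = {"cos": cos_sim}
    for w in query_words:
        for c in entity_categories:
            phi[(w, c)] = 1.0
    return phi

def score(weights, phi):
    """Linear ranking function: dot product of the weight vector and Phi(q, e_k)."""
    return sum(weights.get(key, 0.0) * value for key, value in phi.items())

def disambiguate(weights, candidates):
    """Pick the candidate with the highest score (the arg max over q.E)."""
    return max(candidates, key=lambda title: score(weights, candidates[title]))

q_words = {"conducted", "concert"}
candidates = {
    "John Williams (composer)": features(q_words, {"Composers", "Musicians"}, 0.1),
    "John Williams (wrestler)": features(q_words, {"Wrestlers"}, 0.2),
    "e_out": features(q_words, set(), 0.0, is_out=True),
}
# Hypothetical weights, for illustration only.
weights = {
    "cos": 1.0,
    ("conducted", "Composers"): 2.0,
    ("conducted", "Wrestlers"): -1.0,
    "out": 0.5,
}
answer = disambiguate(weights, candidates)
```

Here the wrestler has the higher cosine similarity, but the positive weight on the pair (conducted, Composers) lets the composer win, which is the effect the word-category features are meant to capture.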

NE Disambiguation: Overview 1
[Diagram, data structures: Redirect Pages and Disambig Pages feed the NE Dictionary; Hyperlinks feed the Disambiguation Dataset.]

NE Disambiguation: Overview 2
[Diagram. Training: Disambiguation Dataset → Ranking Examples, features Φ(q, e_k) → SVM training → Ranking Model, weights w. Testing: NE query q and NE Dictionary → Ranking Instances, features Φ(q, e_k) → Ranking Model, weights w → Answer.]

Outline Introduction Wikipedia Structures Named Entity Dictionary Disambiguation & Detection Experimental Evaluation Future Work Conclusions

Experimental Evaluation
The normalized ranking kernel is trained and evaluated against cosine similarity in 4 scenarios:
1) Disambiguation between entities with different categories in the set of 110 top-level categories under People by Occupation.
2) Disambiguation between entities with different categories in the set of 540 most popular (size > 200) categories under People by Occupation.
3) Disambiguation between entities with different categories in the set of 2847 most popular (size > 20) categories under People by Occupation.
4) Detection & Disambiguation between entities with different categories in the set of 540 most popular (size > 200) categories under People by Occupation.
Use SVMlight with the max-margin ranking approach from [Joachims 2002].

Experimental Evaluation: S2
The set of Wikipedia categories is restricted to C_2: the 540 categories under People by Occupation that have at least 200 articles.
Train & Test only on ambiguous queries ⟨q, e_k⟩ such that:
– e_k.C ∩ C_2 ≠ ∅ (i.e. matching entities have categories in C_2)
– e_k.C ∩ C_2 ≠ q.e.C ∩ C_2 (i.e. the true entity does not have exactly the same categories as other matching entities)
Statistics & Results:
        Training dataset                  Test dataset            Test accuracy
#Cat    #Queries   #Pairs    #Constr.     #Queries   #Pairs       Kernel   Cosine
540     17,970     55,452    37,482       70,468     235,290      68.4%    55.8%
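One way to read the two filtering conditions, sketched in Python. The interpretation is an assumption: a query is kept only if every candidate has a category in the restricted set and no wrong candidate has the same restricted categories as the true entity. The sketch also generates one ranking constraint per wrong candidate, which matches the reported statistics, where #Constr. equals #Pairs minus #Queries (55,452 − 17,970 = 37,482).

```python
def keep_query(true_title, candidates, categories, allowed):
    """Apply the two category conditions to an ambiguous query (assumed reading)."""
    if len(candidates) < 2 or true_title not in candidates:
        return False                              # not an ambiguous, answerable query
    true_cats = categories[true_title] & allowed
    for title in candidates:
        cats = categories[title] & allowed
        if not cats:
            return False                          # e_k.C and the allowed set are disjoint
        if title != true_title and cats == true_cats:
            return False                          # indistinguishable from the true entity
    return True

def ranking_constraints(true_title, candidates):
    """One constraint per wrong candidate: the true entity must outrank it."""
    return [(true_title, title) for title in candidates if title != true_title]

allowed = {"Composers", "Wrestlers", "Soldiers"}
categories = {
    "John Williams (composer)": {"Composers", "Musicians"},
    "John Williams (wrestler)": {"Wrestlers"},
    "John Williams (VC)": {"Soldiers"},
}
candidates = sorted(categories)
kept = keep_query("John Williams (composer)", candidates, categories, allowed)
constraints = ranking_constraints("John Williams (composer)", candidates)
```

With three candidate pairs for the query, the sketch yields two constraints, i.e. pairs minus one per query.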

Experimental Evaluation: S4
The set of Wikipedia categories is restricted to C_4: the 540 categories under People by Occupation that have at least 200 articles.
Train & Test:
– Treat all entities that are not under People by Occupation as out-of-Wikipedia.
– Randomly select queries such that 10% have their true answer out-of-Wikipedia.
Statistics & Results:
        Training dataset                  Test dataset            Test accuracy
#Cat    #Queries   #Pairs    #Constr.     #Queries   #Pairs       Kernel   Cosine
540     38,726     102,553   63,827       80,386     191,227      84.8%    82.3%

Future Work Use weight vector w explicitly – reduce its dimensionality by considering only features occurring frequently in training data. Augment article text with context from hyperlinks that point to it. Use correlations between categories and traditional WSD features such as (syntactic) bigrams and trigrams centered on the ambiguous proper name.

Conclusion
A novel approach to Named Entity Disambiguation based on knowledge encoded in Wikipedia.
Learned correlations between Wikipedia categories and context words substantially improve disambiguation accuracy.
Potential applications:
– Clustering the results of web searches for popular named entities.
– NE disambiguation is essential for aggregating corpus-level results from Information Extraction.

Questions?

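Ranking Kernel
The corresponding kernel is the dot product of the feature vectors of two query-entity pairs:
$$K(\langle q, e_k \rangle, \langle q', e'_k \rangle) = \Phi(q, e_k) \cdot \Phi(q', e'_k) = \phi_{\cos}(q, e_k)\,\phi_{\cos}(q', e'_k) + |q.T \cap q'.T| \cdot |e_k.C \cap e'_k.C| + \phi_{out}(q, e_k)\,\phi_{out}(q', e'_k)$$
The normalized version: [equation not preserved in the transcript]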
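Because the word-category features are binary indicators over q.T × e_k.C, the dot product of two feature vectors can be computed from set intersections without materializing the features. The sketch below checks that identity on a toy pair. The `normalized_kernel` function uses the generic cosine normalization K(x, y) / sqrt(K(x, x) K(y, y)); that is an assumption and may differ from the normalized version shown on the original slide.

```python
import math

def kernel(x, y):
    """Closed form: cosine product + shared words * shared categories + out product."""
    return (x["cos"] * y["cos"]
            + len(x["words"] & y["words"]) * len(x["cats"] & y["cats"])
            + x["out"] * y["out"])

def explicit_dot(x, y):
    """Same quantity computed from the explicit sparse feature vectors."""
    def phi(z):
        f = {"cos": z["cos"], "out": float(z["out"])}
        f.update({(w, c): 1.0 for w in z["words"] for c in z["cats"]})
        return f
    fx, fy = phi(x), phi(y)
    return sum(value * fy[key] for key, value in fx.items() if key in fy)

def normalized_kernel(x, y):
    """Generic cosine normalization (an assumption, see the lead-in)."""
    denom = math.sqrt(kernel(x, x) * kernel(y, y))
    return kernel(x, y) / denom if denom else 0.0

x = {"words": {"conducted", "concert", "boston"},
     "cats": {"Composers", "Musicians"}, "cos": 0.3, "out": 0}
y = {"words": {"conducted", "orchestra", "concert"},
     "cats": {"Composers", "Film score composers"}, "cos": 0.5, "out": 0}
k_closed = kernel(x, y)
k_explicit = explicit_dot(x, y)
```

The closed form matters in practice because the explicit feature space has one dimension per word-category pair, which is far too large to enumerate for every instance.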

Experimental Evaluation: S1
The set of Wikipedia categories is restricted to C_1: the 110 top-level categories under People by Occupation.
Train & Test only on ambiguous queries ⟨q, e_k⟩ such that:
– e_k.C ∩ C_1 ≠ ∅ (i.e. matching entities have categories in C_1)
– e_k.C ∩ C_1 ≠ q.e.C ∩ C_1 (i.e. the true entity does not have exactly the same categories as other matching entities)
Statistics & Results:
        Training dataset                  Test dataset            Test accuracy
#Cat    #Queries   #Pairs    #Constr.     #Queries   #Pairs       Kernel   Cosine
110     12,288     39,880    27,592       48,661     147,165      77.2%    61.5%

Experimental Evaluation: S3
The set of Wikipedia categories is restricted to C_3: the 2847 categories under People by Occupation that have at least 20 articles.
Train & Test only on ambiguous queries ⟨q, e_k⟩ such that:
– e_k.C ∩ C_3 ≠ ∅ (i.e. matching entities have categories in C_3)
– e_k.C ∩ C_3 ≠ q.e.C ∩ C_3 (i.e. the true entity does not have exactly the same categories as other matching entities)
Statistics & Results:
        Training dataset                  Test dataset            Test accuracy
#Cat    #Queries   #Pairs    #Constr.     #Queries   #Pairs       Kernel   Cosine
2847    21,185     64,560    43,375       75,190     261,723      68.0%    55.4%
