
Using Wikipedia for Hierarchical Finer Categorization of Named Entities
Aasish Pappu
Language Technologies Institute, Carnegie Mellon University
PACLIC 2009

Outline
1 Introduction
2 Related Work
3 Corpus Creation
4 Features
5 Experiments
6 Conclusion

1 Introduction
A structured and organized encyclopedic corpus is a suitable training corpus.
–a wide range of topics
–provides hyperlinks

1 Introduction
In this paper, we:
1) Discuss the usability of Wikipedia
2) Induce the WordNet and Wikipedia domain taxonomies into the feature space
3) Use Maximum Entropy and SVM classifiers

2 Related Work
Kazama and Torisawa (2007)
–extracted gloss text
Dakka and Cucerzan (2008)
–tagged the Wikipedia data
Bunescu and Pasca (2006)
–built a disambiguation system

3 Corpus Creation
10-18-2007 English version of Wikipedia
2 million articles
292,384 categories
a taxonomy with a depth of about 10
–5882 Wikipedia Stub categories
–105 domains

3.1 Categories in Wikipedia
3.2 Named entity categories
3.3 Procedure

3.1 Categories in Wikipedia
taxonomy
–constituted by categories
–linked to other categories across depth and breadth
contains cycles
–tackled by Zesch and Gurevych (2007)
the Wikipedia taxonomy is not a tree
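
Because the category graph contains cycles, a traversal that assumes a tree can loop forever. The sketch below shows one common way to detect a cycle with a depth-first search. It is only an illustration on a toy graph: the category names are hypothetical, and it is not the method of Zesch and Gurevych (2007).

```python
def find_cycle(graph):
    """Return one cycle as a list of nodes, or None if the graph is acyclic.

    graph maps a category name to a list of its sub-categories.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {node: WHITE for node in graph}

    def visit(node, path):
        color[node] = GREY
        path.append(node)
        for child in graph.get(node, ()):
            state = color.get(child, WHITE)
            if state == GREY:
                # child is already on the current path: we closed a loop
                return path[path.index(child):] + [child]
            if state == WHITE:
                found = visit(child, path)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in list(graph):
        if color[node] == WHITE:
            cycle = visit(node, [])
            if cycle:
                return cycle
    return None


# Hypothetical toy taxonomies, not real Wikipedia data.
cyclic = {"Science": ["Physics"], "Physics": ["Mechanics"], "Mechanics": ["Science"]}
tree = {"Science": ["Physics", "Biology"], "Physics": [], "Biology": []}

print(find_cycle(cyclic))
print(find_cycle(tree))
```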


3.2 Named entity categories
the domain hierarchy
–17 basic domains
–88 sub-domains

3.2 Named entity categories
to avoid bias towards any particular domain
rules for choosing the set of categories
–ensure diversity in the categorization task
–ensure we select balanced categories
–consider the category with each parameter closest to the mean value under that domain
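
The "closest to the mean" rule can be sketched as follows. The slide does not name the parameters or say how distances over several parameters are combined, so the parameter names (`pages`, `subcats`), the category names, and the summed relative deviation used here are all assumptions made for illustration.

```python
from statistics import mean


def pick_balanced_category(categories):
    """Return the category whose parameters lie closest to the domain means.

    categories maps a category name to a dict of numeric parameters.
    """
    params = sorted(next(iter(categories.values())))
    means = {p: mean(stats[p] for stats in categories.values()) for p in params}

    def distance(stats):
        # Assumed combination rule: sum of relative deviations from the mean.
        return sum(
            abs(stats[p] - means[p]) / means[p] if means[p] else abs(stats[p])
            for p in params
        )

    return min(categories, key=lambda name: distance(categories[name]))


# Hypothetical categories under one domain, with hypothetical parameters.
domain_categories = {
    "Category A": {"pages": 10, "subcats": 1},
    "Category B": {"pages": 50, "subcats": 5},
    "Category C": {"pages": 120, "subcats": 9},
}

print(pick_balanced_category(domain_categories))
```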


3.3 Procedure
extract named entity phrases
–using the Stanford POS tagger
extract typed dependency relationships
extract the content words around a named entity
–collect the NPs (noun phrases) and VPs (verb phrases)

3.3 Procedure
1) First, look for redirected and disambiguated article titles matching the first name of the named entity.
2) If there is more than one such title, choose the target title using the minimum edit distance metric.
3) Pick all articles that fall under the same category as the target article.
4) Look for those articles that fall under the special categories chosen for the classification task.
5) Find the article that shares the maximum number of categories with the target article, and label the target article with its special category.
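
Steps 2 to 5 above can be sketched on toy data as follows. This is a minimal illustration, not the author's implementation: the function names, the article titles, the category sets, and the tie-breaking choices are all assumptions, and step 1 (retrieving candidate titles) is taken as already done.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def label_entity(name, candidate_titles, article_categories, special_categories):
    """Return (target_title, label); label is None if no article qualifies."""
    # Step 2: choose the candidate title closest to the entity name.
    target = min(candidate_titles,
                 key=lambda t: edit_distance(name.lower(), t.lower()))
    target_cats = article_categories[target]

    best_label, best_overlap = None, 0
    for title, cats in article_categories.items():
        if title == target:
            continue
        shared = len(cats & target_cats)      # step 3: shares a category
        special = cats & special_categories   # step 4: has a special category
        # Step 5: keep the article with the largest category overlap.
        if shared and special and shared > best_overlap:
            best_label, best_overlap = sorted(special)[0], shared
    return target, best_label


# Hypothetical data for illustration only.
article_categories = {
    "Paris": {"Capitals in Europe", "Cities in France"},
    "Paris Hilton": {"American socialites"},
    "Lyon": {"Cities in France", "Cities"},
    "Rome": {"Capitals in Europe", "Cities in Italy"},
}
special_categories = {"Cities"}

print(label_entity("Paris", ["Paris", "Paris Hilton"],
                   article_categories, special_categories))
```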

3.3 Procedure
About 10,000 samples
–Training 75%
–Testing 25%


4 Features
four types of feature sets
–a syntactic feature set
–three semantic feature sets

4 Features
4.1 Typed Dependency Feature
4.2 Hypernyms
4.3 Domain based features

4.1 Typed Dependency Feature
phrase structure parse
–nesting of multi-word constituents
dependency parse
–dependencies between individual words
Dependency relations give a clue about the probable semantic relations that can be associated with the named entity.
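
As a rough illustration of how typed dependencies might be turned into features, the sketch below keeps only the relations that touch the named entity. The slides do not specify the feature format, so the triple layout `(relation, governor, dependent)`, the feature strings, and the example sentence are assumptions.

```python
def dependency_features(entity, triples):
    """Collect one feature string per dependency relation involving the entity."""
    features = []
    for relation, governor, dependent in triples:
        if dependent == entity:
            features.append(f"{relation}:gov={governor}")
        elif governor == entity:
            features.append(f"{relation}:dep={dependent}")
    return features


# Hypothetical parse of "Microsoft acquired the startup".
triples = [
    ("nsubj", "acquired", "Microsoft"),
    ("dobj", "acquired", "startup"),
    ("det", "startup", "the"),
]

print(dependency_features("Microsoft", triples))
```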

4.2 Hypernyms
preferred to have a hypernym feature that is semantically specific
–hypernyms of all synsets are ordered by decreasing depth in the hypernym tree
–the deepest hypernym in the lot is chosen as the target feature for that content word
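
The deepest-hypernym selection can be sketched on a toy hypernym tree. The tree, the sense identifiers, and the alphabetical tie-break below are hypothetical stand-ins for WordNet, chosen only to keep the example self-contained.

```python
# Hypothetical child -> parent hypernym links; "entity" is the root.
HYPERNYM_OF = {
    "organism": "entity",
    "animal": "organism",
    "carnivore": "animal",
    "canine": "carnivore",
    "person": "organism",
    "scoundrel": "person",
    "dog#animal": "canine",      # hypothetical sense identifiers
    "dog#person": "scoundrel",
}


def hypernym_chain(synset):
    """Return all hypernyms of a synset, nearest first, up to the root."""
    chain = []
    node = HYPERNYM_OF.get(synset)
    while node is not None:
        chain.append(node)
        node = HYPERNYM_OF.get(node)
    return chain


def depth(node):
    """Depth of a node, counted as the number of ancestors above it."""
    return len(hypernym_chain(node))


def deepest_hypernym(synsets):
    """Rank the hypernyms of all synsets by decreasing depth; return the first."""
    candidates = {h for s in synsets for h in hypernym_chain(s)}
    if not candidates:
        return None
    ranked = sorted(candidates, key=lambda h: (-depth(h), h))
    return ranked[0]


print(deepest_hypernym(["dog#animal", "dog#person"]))
```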

4 Features
4.1 Typed Dependency Feature
4.2 Hypernyms
4.3 Domain based features
4.3.1 WordNet domains
4.3.2 Wikipedia domains
4.3.3 WDH vs. Wikipedia Domain System

4.3.1 WordNet domains
Every synset in WordNet is associated with a domain label in the WordNet Domain Hierarchy (WDH).
There are 5 top-level domains and 46 basic domains in WDH.

4.3.2 Wikipedia domains
indexed Wikipedia
search for content words in the index
look for the categories that contain the largest number of pages containing a content word
In particular, pages where the word is hyperlinked are weighted double compared with pages that contain the word without a hyperlink.
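
The weighting rule can be sketched as a simple scoring loop. The page records, field names, and categories below are hypothetical, and the sketch assumes "pages with links" means pages where the content word itself appears as a hyperlink.

```python
from collections import Counter


def score_categories(word, pages):
    """Score each category by the pages under it that contain the word.

    A page where the word is hyperlinked counts 2; a plain mention counts 1.
    """
    scores = Counter()
    for page in pages:
        if word in page["linked_words"]:
            weight = 2
        elif word in page["words"]:
            weight = 1
        else:
            continue
        for category in page["categories"]:
            scores[category] += weight
    return scores


# Hypothetical indexed pages for illustration only.
pages = [
    {"categories": ["Music"], "words": {"guitar", "band"}, "linked_words": {"guitar"}},
    {"categories": ["Music", "Film"], "words": {"guitar"}, "linked_words": set()},
    {"categories": ["Sports"], "words": {"ball"}, "linked_words": set()},
]

scores = score_categories("guitar", pages)
print(scores.most_common(1))
```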

4.3.3 WDH vs. Wikipedia Domain System

5 Experiments

Outline
1 Introduction
2 Related Work
3 Corpus Creation
4 Features
5 Experiments
5.1 Experiment 1: Feature-wise model
5.2 Experiment 2: Feature combination model
5.3 Experiment 3: Error analysis

6 Conclusion
presented a named entity categorization system
–employs Wikipedia categories as classes
adapted the hierarchical categorization of Wikipedia
–mine relations among named entities
