Published by Britton Long. Modified over 8 years ago.
Using Wikipedia for Hierarchical Finer Categorization of Named Entities
Aasish Pappu
Language Technologies Institute, Carnegie Mellon University
PACLIC 2009
Outline
1 Introduction
2 Related Work
3 Corpus Creation
4 Features
5 Experiments
6 Conclusion
1 Introduction
A structured and organized encyclopedic corpus is a suitable training corpus.
– covers a wide range of topics
– provides hyperlinks
1 Introduction
In this paper, we:
1) Discuss the usability of Wikipedia
2) Induce the WordNet and Wikipedia domain taxonomies into the feature space
3) Use Maximum Entropy and SVM classifiers
2 Related Work
Kazama and Torisawa (2007)
– extracted gloss text
Dakka and Cucerzan (2008)
– tagged the Wikipedia data
Bunescu and Pasca (2006)
– built a disambiguation system
3 Corpus Creation
10-18-2007 English version of Wikipedia
2 million articles
292,384 categories
a taxonomy with a depth of about 10
– 5882 Wikipedia Stub categories
– 105 domains
3.1 Categories in Wikipedia
3.2 Named entity categories
3.3 Procedure
3.1 Categories in Wikipedia
taxonomy
– constituted by categories
– linked to other categories across depth and breadth
contains cycles
– tackled by Zesch and Gurevych (2007)
the Wikipedia taxonomy is not a tree
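Because the category graph contains cycles, a traversal that assumes a tree can loop forever. The following is a minimal sketch of one generic way to obtain an acyclic graph, by dropping the edges that close a cycle during a depth-first traversal. It is not necessarily the method of Zesch and Gurevych; the function name `break_cycles`, the dictionary representation and the category names are assumptions made for illustration.

```python
def break_cycles(graph):
    """Return a copy of `graph` without the edges that close a cycle.

    `graph` maps a category name to a list of its subcategory names
    (an assumed representation, not taken from the slides).
    """
    UNSEEN, ACTIVE, DONE = 0, 1, 2   # not visited, on the current DFS path, finished
    state = {}
    acyclic = {}

    def visit(node):
        state[node] = ACTIVE
        acyclic.setdefault(node, [])
        for child in graph.get(node, []):
            child_state = state.get(child, UNSEEN)
            if child_state == ACTIVE:
                continue             # back edge: it would close a cycle, so drop it
            acyclic[node].append(child)
            if child_state == UNSEEN:
                visit(child)
        state[node] = DONE

    for node in graph:
        if state.get(node, UNSEEN) == UNSEEN:
            visit(node)
    return acyclic


# Toy category graph (made-up data) in which "Physics" links back to "Science".
toy = {"Science": ["Physics", "Chemistry"], "Physics": ["Science"], "Chemistry": []}
print(break_cycles(toy))
```

Which edge gets dropped depends on the traversal order, and the result is a directed acyclic graph rather than a tree, since a category may still have several parents. For a graph of Wikipedia's size, an iterative traversal would avoid Python's recursion limit.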
3.2 Named entity categories
the domain hierarchy
– 17 basic domains
– 88 sub-domains
3.2 Named entity categories
Goal: avoid bias towards any particular domain
Rules for choosing the set of categories
– ensure diversity in the categorization task
– ensure we select balanced categories
– consider the category whose parameters are each closest to the mean value under that domain
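The last rule can be sketched as follows. The slides do not say which parameters are measured or how closeness across several parameters is combined, so the parameter tuples, the relative-deviation distance and the name `pick_balanced_category` are all assumptions for illustration.

```python
def pick_balanced_category(categories):
    """Pick the category whose parameters lie closest to the per-domain means.

    `categories` maps a category name to a tuple of numeric parameters
    (hypothetical, e.g. number of articles and number of subcategories).
    """
    names = list(categories)
    n_params = len(categories[names[0]])
    means = [sum(categories[name][i] for name in names) / len(names)
             for i in range(n_params)]

    def distance(name):
        # Assumed combination rule: sum of relative deviations from the mean.
        return sum(abs(categories[name][i] - means[i]) / (means[i] or 1.0)
                   for i in range(n_params))

    return min(names, key=distance)


# Made-up categories of one domain, described by two hypothetical parameters.
domain = {"A": (10, 2), "B": (50, 6), "C": (120, 10)}
print(pick_balanced_category(domain))
```

Here the means are 60 and 6, so category "B" is selected because both of its parameters are nearest to them.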
3.3 Procedure
extract named entity phrases
– using the Stanford POS tagger
extract typed dependency relationships
extract the content words around a named entity
– collect the NPs (noun phrases) and VPs (verb phrases)
3.3 Procedure
1) First, look for redirected and disambiguated article titles that match the first name of the named entity.
2) If there is more than one such title, choose the target title using the minimum edit distance metric.
3) Pick all articles that fall under the same category as the target article.
4) Look for those articles that fall under the special categories chosen for the classification task.
5) Find the article that shares the maximum number of categories with the target article, and label the target article with its special category.
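Steps 2 to 5 of this procedure can be sketched as below, under stated assumptions: titles are compared with a plain Levenshtein distance, each article's categories are available as a set, and ties are broken arbitrarily. The function names and the toy data are hypothetical and do not come from the paper.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def choose_target_title(entity, candidate_titles):
    """Step 2: the candidate title with the minimum edit distance to the entity."""
    return min(candidate_titles,
               key=lambda title: edit_distance(entity.lower(), title.lower()))


def label_target(target, article_categories, special_categories):
    """Steps 3-5: label the target with the special category of the article
    that shares the most categories with it (None if no article qualifies)."""
    target_cats = article_categories[target]
    best_label, best_overlap = None, 0
    for article, cats in article_categories.items():
        if article == target:
            continue
        shared = target_cats & cats          # step 3: articles in a common category
        special = cats & special_categories  # step 4: articles in a special category
        if shared and special and len(shared) > best_overlap:
            best_overlap = len(shared)       # step 5: keep the largest overlap
            best_label = sorted(special)[0]  # assumed tie-break among special categories
    return best_label


# Toy data (made up for illustration).
articles = {
    "Target": {"1963 births", "League players"},
    "Neighbour": {"League players", "1965 births", "Basketball players"},
    "Unrelated": {"1961 births", "Politicians"},
}
print(choose_target_title("Jordan", ["Michael Jordan", "Jordan (country)", "Jordan"]))
print(label_target("Target", articles, {"Basketball players", "Politicians"}))
```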
3.3 Procedure
About 10,000 samples
– Training: 75%
– Testing: 25%
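A 75/25 split of this kind can be produced as in the sketch below. The slides do not say how samples were assigned to each part, so the random shuffle, the fixed seed and the name `split_samples` are assumptions.

```python
import random


def split_samples(samples, train_fraction=0.75, seed=0):
    """Shuffle the samples and split them into training and testing parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # seeded for a reproducible split
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train, test = split_samples(range(10000))
print(len(train), len(test))
```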
4 Features
four types of feature sets
– a syntactic feature set
– three semantic feature sets
4 Features
4.1 Typed Dependency Feature
4.2 Hypernyms
4.3 Domain based features
4.1 Typed Dependency Feature
phrase structure parse
– nesting of multi-word constituents
dependency parse
– dependencies between individual words
Dependency relations give a clue about the probable semantic relations that can be associated with the named entity.
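As an illustration of how typed dependencies might become features, the sketch below keeps every relation that connects a token of the named entity to a word outside it. The triple format (relation, governor, dependent), the feature strings and the name `dependency_features` are assumptions; the slides do not specify the actual encoding.

```python
def dependency_features(triples, entity_tokens):
    """Collect features from the typed dependencies that touch the named entity.

    `triples` is a list of (relation, governor, dependent) tuples, as a
    dependency parser might produce them (assumed input format).
    """
    entity = set(entity_tokens)
    features = []
    for relation, governor, dependent in triples:
        if dependent in entity and governor not in entity:
            features.append(f"{relation}:gov={governor}")
        elif governor in entity and dependent not in entity:
            features.append(f"{relation}:dep={dependent}")
    return features


# Hand-written parse of "Einstein developed relativity" (toy example).
parse = [("nsubj", "developed", "Einstein"), ("dobj", "developed", "relativity")]
print(dependency_features(parse, ["Einstein"]))
```

In this toy parse the entity is the subject of "developed", which hints at an agent-like role.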
4.2 Hypernyms
We prefer a hypernym feature that is semantically specific
– the hypernyms of all synsets are ordered inversely by their depth in the hypernym tree
– the deepest hypernym in the lot is chosen as the target feature for that content word
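The selection of the deepest hypernym can be sketched on a toy hypernym tree stored as a child-to-parent dictionary. The tree, the sense labels such as `bank#1` and the choice to consider only direct hypernyms are assumptions for illustration; this is not real WordNet data or a WordNet API.

```python
def depth(node, parent):
    """Number of edges from `node` up to the root of the hypernym tree."""
    d = 0
    while node in parent:
        node = parent[node]
        d += 1
    return d


def deepest_hypernym(synsets, parent):
    """Among the direct hypernyms of all synsets of a word, return the deepest one."""
    hypernyms = {parent[s] for s in synsets if s in parent}
    if not hypernyms:
        return None
    # Deepest first; the name is only a tie-break to keep the result deterministic.
    ranked = sorted(hypernyms, key=lambda h: (-depth(h, parent), h))
    return ranked[0]


# Toy hypernym tree (child -> parent), made up for illustration.
toy_parent = {
    "object": "entity",
    "organization": "entity",
    "institution": "organization",
    "financial_institution": "institution",
    "slope": "object",
    "bank#1": "financial_institution",
    "bank#2": "slope",
}
print(deepest_hypernym(["bank#1", "bank#2"], toy_parent))
```

In the toy tree "financial_institution" lies three edges below the root and "slope" two, so the former is chosen as the more specific feature.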
4 Features
4.1 Typed Dependency Feature
4.2 Hypernyms
4.3 Domain based features
4.3.1 Wordnet domains
4.3.2 Wikipedia domains
4.3.3 WDH vs Wikipedia Domain System
4.3.1 Wordnet domains
Every synset in WordNet is associated with a domain label in the WordNet Domain Hierarchy (WDH).
There are 5 top-level domains and 46 basic domains in WDH.
4.3.2 Wikipedia domains
indexed Wikipedia
search the index for content words, looking for the categories with the largest number of pages containing a content word
In particular, pages in which the word is hyperlinked are weighted double relative to pages that contain the word without a hyperlink.
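The weighting described above can be sketched as a simple scoring function: each page containing the content word adds 1 to the score of its categories, or 2 if the word is hyperlinked on that page. The page representation, the name `rank_categories` and the toy pages are assumptions; a real system would query an index rather than scan a list.

```python
from collections import Counter


def rank_categories(word, pages):
    """Rank categories by the weighted number of pages that contain `word`.

    Each page is a dict with "categories", "words" and "linked_words"
    (an assumed structure standing in for an indexed Wikipedia).
    """
    scores = Counter()
    for page in pages:
        if word in page["linked_words"]:
            weight = 2          # word appears as a hyperlink: counted double
        elif word in page["words"]:
            weight = 1          # word appears as plain text only
        else:
            continue
        for category in page["categories"]:
            scores[category] += weight
    # Highest score first; the name is only a tie-break for determinism.
    return [c for c, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))]


# Toy pages (made up for illustration).
toy_pages = [
    {"categories": ["Physics"], "words": {"quantum"}, "linked_words": {"quantum"}},
    {"categories": ["Music"], "words": {"quantum"}, "linked_words": set()},
    {"categories": ["Physics", "Chemistry"], "words": {"quantum"}, "linked_words": set()},
]
print(rank_categories("quantum", toy_pages))
```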
4.3.3 WDH vs Wikipedia Domain System
5 Experiments
5.1 Experiment 1: Feature wise model
5.2 Experiment 2: Feature combination model
5.3 Experiment 3: Error analysis
6 Conclusion
presented a named entity categorization system
– employs Wikipedia categories as classes
adapted the hierarchical categorization of Wikipedia
– mine relations among named entities