Using Wikipedia for Hierarchical Finer Categorization of Named Entities Aasish Pappu Language Technologies Institute Carnegie Mellon University PACLIC.

1 Using Wikipedia for Hierarchical Finer Categorization of Named Entities Aasish Pappu Language Technologies Institute Carnegie Mellon University PACLIC 2009

2 Outline
1 Introduction
2 Related Work
3 Corpus Creation
4 Features
5 Experiments
6 Conclusion

3 1 Introduction
A structured and organized encyclopedic corpus is a suitable training corpus.
– covers a wide range of topics
– provides hyperlinks

4 1 Introduction
In this paper:
1) Discuss the usability of Wikipedia
2) Induce the WordNet and Wikipedia domain taxonomies into the feature space
3) Use Maximum Entropy and SVM classifiers

6 2 Related Work
Kazama and Torisawa (2007)
– extracted gloss text
Dakka and Cucerzan (2008)
– tagged the Wikipedia data
Bunescu and Pasca (2006)
– built a disambiguation system

8 3 Corpus Creation
English version of Wikipedia, 10-18-2007
2 million articles
292,384 categories
a taxonomy with a depth of about 10
– 5882 Wikipedia Stub categories
– 105 domains

9 3 Corpus Creation
3.1 Categories in Wikipedia
3.2 Named entity categories
3.3 Procedure

10 3.1 Categories in Wikipedia
taxonomy
– constituted by categories
– categories are linked to other categories across depth and breadth
contains cycles
– tackled by Zesch and Gurevych (2007)
the Wikipedia taxonomy is not a tree
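Because the category graph contains cycles, any traversal of it has to detect them. The sketch below is only an illustration of cycle detection by depth-first search on a toy category graph. It is not the method of Zesch and Gurevych (2007), and the category names and the `find_cycle` helper are hypothetical.

```python
from typing import Dict, List, Optional

WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished


def find_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle as a node list (first node repeated at the end), or None."""
    color: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GRAY
        path.append(node)
        for child in graph.get(node, []):
            state = color.get(child, WHITE)
            if state == GRAY:
                # Back edge: the child is already on the current path.
                return path[path.index(child):] + [child]
            if state == WHITE:
                found = visit(child)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Hypothetical toy graph: parent category -> subcategories.
toy_graph = {
    "Science": ["Physics"],
    "Physics": ["Mechanics"],
    "Mechanics": ["Science"],  # link back up the hierarchy creates a cycle
    "Arts": [],
}
print(find_cycle(toy_graph))
```

The recursive form is fine for a toy graph; a graph with hundreds of thousands of categories would need an iterative traversal to stay within Python's recursion limit.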

12 3.2 Named entity categories
the domain hierarchy
– 17 basic domains
– 88 sub-domains

13 3.2 Named entity categories
Categories are chosen so as to avoid bias towards any particular domain.
Rules for choosing the set of categories:
– ensure diversity in the categorization task
– ensure the selected categories are balanced
– within each domain, consider the category whose parameters are each closest to the mean value for that domain
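The last rule can be read as a nearest-to-mean selection within each domain. The slide does not say which parameters are used or how distances are combined, so the following is a minimal sketch under assumptions: the parameters (`pages`, `subcats`), the category names, and the relative-deviation distance are all hypothetical.

```python
from statistics import mean
from typing import Dict, List, Tuple

# A category is a (name, parameters) pair; parameter names are assumptions.
Category = Tuple[str, Dict[str, float]]


def closest_to_mean(categories: List[Category]) -> str:
    """Pick the category whose parameters deviate least from the domain means."""
    params = list(categories[0][1].keys())
    means = {p: mean(cat[1][p] for cat in categories) for p in params}

    def distance(cat: Category) -> float:
        total = 0.0
        for p in params:
            scale = means[p] if means[p] else 1.0  # avoid division by zero
            total += abs(cat[1][p] - means[p]) / scale
        return total

    return min(categories, key=distance)[0]


# Hypothetical categories under one domain.
sports_domain = [
    ("Cricketers", {"pages": 100, "subcats": 2}),
    ("Footballers", {"pages": 400, "subcats": 6}),
    ("Olympians", {"pages": 1000, "subcats": 10}),
]
print(closest_to_mean(sports_domain))
```

Dividing each deviation by the mean keeps a parameter with large raw values, such as a page count, from dominating the choice.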

15 3.3 Procedure
extract named entity phrases
– using the Stanford POS tagger
extract typed dependency relationships
extract the content words around a named entity
– collect the NPs (noun phrases) and VPs (verb phrases)

16 3.3 Procedure
1) First, look for redirected and disambiguated article titles that match the first name of the named entity.
2) If there is more than one such title, choose the target title using the minimum edit distance metric.
3) Pick all articles that fall under the same category as the target article.
4) Look for those articles that fall under the special categories chosen for the classification task.
5) Find the article that shares the maximum number of categories with the target article, and label the target article with its special category.
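Steps 2 and 5 of the procedure above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Levenshtein variant of edit distance, the lowercase comparison, the helper names, and all titles and categories in the toy data are assumptions.

```python
from typing import Dict, List, Set, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def pick_target_title(name: str, titles: List[str]) -> str:
    """Step 2: choose the candidate title closest to the entity name."""
    return min(titles, key=lambda t: edit_distance(name.lower(), t.lower()))


def label_target(target_cats: Set[str],
                 labeled: Dict[str, Tuple[Set[str], str]]) -> str:
    """Step 5: copy the special category of the article sharing most categories."""
    best = max(labeled, key=lambda title: len(target_cats & labeled[title][0]))
    return labeled[best][1]


# Hypothetical candidate titles for the entity name "Jordan".
target = pick_target_title("Jordan", ["Michael Jordan", "Jordan River"])

# Hypothetical articles: title -> (its categories, its special category).
labeled_articles = {
    "Yarmouk River": ({"Rivers of Asia", "Jordan basin"}, "Rivers"),
    "Amman": ({"Capitals in Asia"}, "Cities"),
}
label = label_target({"Rivers of Asia", "Jordan basin"}, labeled_articles)
print(target, label)
```

Ties in either step are resolved by input order here; the slides do not say how ties are handled.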

17 3.3 Procedure
About 10,000 samples
– Training: 75%
– Testing: 25%
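A 75/25 split of this kind can be sketched as below. The slides do not say whether the split was random or stratified, so the shuffled split, the fixed seed, and the `split_samples` name are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def split_samples(samples: Sequence, train_fraction: float = 0.75,
                  seed: int = 0) -> Tuple[List, List]:
    """Shuffle a copy of the samples and cut it into train and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Integers stand in for the roughly 10,000 labeled samples.
train, test = split_samples(range(10000))
print(len(train), len(test))
```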

18 3.3 Procedure [slide content not included in the transcript]

20 4 Features
four types of feature sets
– one syntactic feature set
– three semantic feature sets

21 4 Features
4.1 Typed Dependency Feature
4.2 Hypernyms
4.3 Domain based features

22 4.1 Typed Dependency Feature
phrase structure parse
– nesting of multi-word constituents
dependency parse
– dependencies between individual words
Dependency relations give a clue about the probable semantic relations that can be associated with the named entity.
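One way to turn typed dependencies into classifier features is to keep every relation that touches the named entity, together with the word at the other end. The sketch below assumes the parser output is already available as (relation, governor, dependent) triples. The feature string format, the `dependency_features` name, and the example sentence are hypothetical and not taken from the paper.

```python
from typing import List, Tuple

# (relation, governor, dependent), as a dependency parser might emit them.
Triple = Tuple[str, str, str]


def dependency_features(entity: str, triples: List[Triple]) -> List[str]:
    """Collect one feature per dependency relation involving the entity."""
    features = []
    for relation, governor, dependent in triples:
        if dependent == entity:
            features.append(f"{relation}:gov={governor}")
        elif governor == entity:
            features.append(f"{relation}:dep={dependent}")
    return features


# Hand-written triples for the toy sentence "Einstein developed relativity".
toy_triples = [
    ("nsubj", "developed", "Einstein"),
    ("dobj", "developed", "relativity"),
]
print(dependency_features("Einstein", toy_triples))
```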

24 4.2 Hypernyms
A semantically specific hypernym feature is preferred.
– hypernyms of all synsets are ordered by decreasing depth in the hypernym tree
– the deepest hypernym in the lot is chosen as the target feature for that content word
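The deepest-hypernym selection can be illustrated on a small hand-built tree. The tree, the word-to-hypernym mapping, and the function names below are hypothetical stand-ins for WordNet, used only so the sketch runs without external data.

```python
from typing import Dict, List, Optional

# Hypothetical hypernym tree: node -> parent (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "entity": None,
    "organism": "entity",
    "person": "organism",
    "athlete": "person",
    "artifact": "entity",
    "instrument": "artifact",
}

# Hypothetical: hypernyms gathered from all synsets of a content word.
HYPERNYMS_OF_WORD: Dict[str, List[str]] = {
    "player": ["athlete", "instrument"],
}


def depth(node: str) -> int:
    """Number of edges between the node and the root of the tree."""
    steps = 0
    parent = PARENT[node]
    while parent is not None:
        steps += 1
        parent = PARENT[parent]
    return steps


def deepest_hypernym(word: str) -> str:
    """Order the candidate hypernyms by decreasing depth and take the first."""
    ordered = sorted(HYPERNYMS_OF_WORD[word], key=depth, reverse=True)
    return ordered[0]


print(deepest_hypernym("player"))
```

A deeper node sits further from the root and is therefore more specific, which is why depth serves as a proxy for semantic specificity here.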

26 4.3 Domain based features
4.3.1 WordNet domains
4.3.2 Wikipedia domains
4.3.3 WDH vs Wikipedia Domain System

27 4.3.1 WordNet domains
Every synset in WordNet is associated with a domain label in the WordNet Domain Hierarchy (WDH).
There are 5 top-level domains and 46 basic domains in WDH.

29 4.3.2 Wikipedia domains
Wikipedia is indexed.
Each content word is searched in the index to find the categories with the largest number of pages containing that word.
Pages in which the word appears as a hyperlink are weighted double compared with pages that contain the word without a hyperlink.
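The weighting rule can be sketched as a simple score per category: a page counts 2 when the word appears as a hyperlink and 1 when it appears only as plain text. The page records, field names, and categories below are hypothetical toy data; the real system queries an index over Wikipedia.

```python
from collections import Counter
from typing import Dict, List


def score_categories(word: str, pages: List[Dict]) -> Counter:
    """Sum page weights per category for pages that contain the word."""
    scores: Counter = Counter()
    for page in pages:
        if word in page["linked"]:
            weight = 2  # word occurs as a hyperlink on this page
        elif word in page["plain"]:
            weight = 1  # word occurs only as plain text
        else:
            continue
        for category in page["categories"]:
            scores[category] += weight
    return scores


# Hypothetical pages with their categories and word occurrences.
toy_pages = [
    {"categories": ["Sports"], "linked": {"cricket"}, "plain": set()},
    {"categories": ["Insects"], "linked": set(), "plain": {"cricket"}},
    {"categories": ["Sports"], "linked": set(), "plain": {"cricket"}},
]
scores = score_categories("cricket", toy_pages)
print(scores.most_common(1)[0][0])
```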

31 4.3.3 WDH vs Wikipedia Domain System [slide content not included in the transcript]

33 5 Experiments [slide content not included in the transcript]

34 5 Experiments
5.1 Experiment 1: Feature wise model
5.2 Experiment 2: Feature combination model
5.3 Experiment 3: Error analysis

43 6 Conclusion
presented a named entity categorization system
– employs Wikipedia categories as classes
adapted the hierarchical categorization of Wikipedia
– mine relations among named entities

