Using Wikipedia for Hierarchical Finer Categorization of Named Entities Aasish Pappu Language Technologies Institute Carnegie Mellon University PACLIC.

Slides:



Advertisements
Similar presentations
The Application of Machine Translation in CADAL Huang Chen, Chen Haiying Zhejiang University Libraries, Hangzhou, China
Advertisements

Specialized models and ranking for coreference resolution Pascal Denis ALPAGE Project Team INRIA Rocquencourt F Le Chesnay, France Jason Baldridge.
Recognizing Implicit Discourse Relations in the Penn Discourse Treebank Ziheng Lin, Min-Yen Kan, and Hwee Tou Ng Department of Computer Science National.
Query Dependent Pseudo-Relevance Feedback based on Wikipedia SIGIR ‘09 Advisor: Dr. Koh Jia-Ling Speaker: Lin, Yi-Jhen Date: 2010/01/24 1.
Graph Data Management Lab, School of Computer Science Put conference information here.
Automatic Metaphor Interpretation as a Paraphrasing Task Ekaterina Shutova Computer Lab, University of Cambridge NAACL 2010.
Mining Wiki Resources for Multilingual Named Entity Recognition Alexander E. Richman & Patrick Schone Reporter: Chia-Ying Lee Advisor: Prof. Hsin-Hsi Chen.
1 Entity Ranking Using Wikipedia as a Pivot (CIKM 10’) Rianne Kaptein, Pavel Serdyukov, Arjen de Vries, Jaap Kamps 2010/12/14 Yu-wen,Hsu.
Shallow Processing: Summary Shallow Processing Techniques for NLP Ling570 December 7, 2011.
April 22, Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Doerre, Peter Gerstl, Roland Seiffert IBM Germany, August 1999 Presenter:
Faculty Of Applied Science Simon Fraser University Cmpt 825 presentation Corpus Based PP Attachment Ambiguity Resolution with a Semantic Dictionary Jiri.
Ch 10 Part-of-Speech Tagging Edited from: L. Venkata Subramaniam February 28, 2002.
Information Extraction and Ontology Learning Guided by Web Directory Authors:Martin Kavalec Vojtěch Svátek Presenter: Mark Vickers.
1 Noun Homograph Disambiguation Using Local Context in Large Text Corpora Marti A. Hearst Presented by: Heng Ji Mar. 29, 2004.
Designing clustering methods for ontology building: The Mo’K workbench Authors: Gilles Bisson, Claire Nédellec and Dolores Cañamero Presenter: Ovidiu Fortu.
Learning syntactic patterns for automatic hypernym discovery Rion Snow, Daniel Jurafsky and Andrew Y. Ng Prepared by Ang Sun
Extracting Interest Tags from Twitter User Biographies Ying Ding, Jing Jiang School of Information Systems Singapore Management University AIRS 2014, Kuching,
Enhance legal retrieval applications with an automatically induced knowledge base Ka Kan Lo.
Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Dijrre, Peter Gerstl, Roland Seiffert Presented by Huimin Ye.
Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Dijrre, Peter Gerstl, Roland Seiffert Presented by Drew DeHaas.
1 Prototype Hierarchy Based Clustering for the Categorization and Navigation of Web Collections Zhao-Yan Ming, Kai Wang and Tat-Seng Chua School of Computing,
Learning Table Extraction from Examples Ashwin Tengli, Yiming Yang and Nian Li Ma School of Computer Science Carnegie Mellon University Coling 04.
Extracting Opinions, Opinion Holders, and Topics Expressed in Online News Media Text Soo-Min Kim and Eduard Hovy USC Information Sciences Institute 4676.
1 Text Categorization  Assigning documents to a fixed set of categories  Applications:  Web pages  Recommending pages  Yahoo-like classification hierarchies.
ELN – Natural Language Processing Giuseppe Attardi
MediaEval Workshop 2011 Pisa, Italy 1-2 September 2011.
Fast Webpage classification using URL features Authors: Min-Yen Kan Hoang and Oanh Nguyen Thi Conference: ICIKM 2005 Reporter: Yi-Ren Yeh.
C OLLECTIVE ANNOTATION OF WIKIPEDIA ENTITIES IN WEB TEXT - Presented by Avinash S Bharadwaj ( )
Probabilistic Model for Definitional Question Answering Kyoung-Soo Han, Young-In Song, and Hae-Chang Rim Korea University SIGIR 2006.
Classifying Tags Using Open Content Resources Simon Overell, Borkur Sigurbjornsson & Roelof van Zwol WSDM ‘09.
Exploiting Wikipedia as External Knowledge for Document Clustering Sakyasingha Dasgupta, Pradeep Ghosh Data Mining and Exploration-Presentation School.
1 A study on automatically extracted keywords in text categorization Authors:Anette Hulth and Be´ata B. Megyesi From:ACL 2006 Reporter: 陳永祥 Date:2007/10/16.
Authors: Ting Wang, Yaoyong Li, Kalina Bontcheva, Hamish Cunningham, Ji Wang Presented by: Khalifeh Al-Jadda Automatic Extraction of Hierarchical Relations.
Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large- scale Data Collections Xuan-Hieu PhanLe-Minh NguyenSusumu Horiguchi GSIS,
Text Classification, Active/Interactive learning.
Jennie Ning Zheng Linda Melchor Ferhat Omur. Contents Introduction WordNet Application – WordNet Data Structure - WordNet FrameNet Application – FrameNet.
Annotating Words using WordNet Semantic Glosses Julian Szymański Department of Computer Systems Architecture, Faculty of Electronics, Telecommunications.
Finding High-frequent Synonyms of a Domain- specific Verb in English Sub-language of MEDLINE Abstracts Using WordNet Chun Xiao and Dietmar Rösner Institut.
Universit at Dortmund, LS VIII
Learning to Link with Wikipedia David Milne and Ian H. Witten Department of Computer Science, University of Waikato CIKM 2008 (Best Paper Award) Presented.
1 Yang Yang *, Yizhou Sun +, Jie Tang *, Bo Ma #, and Juanzi Li * Entity Matching across Heterogeneous Sources *Tsinghua University + Northeastern University.
Mining Topic-Specific Concepts and Definitions on the Web Bing Liu, etc KDD03 CS591CXZ CS591CXZ Web mining: Lexical relationship mining.
Opinion Holders in Opinion Text from Online Newspapers Youngho Kim, Yuchul Jung and Sung-Hyon Myaeng Reporter: Chia-Ying Lee Advisor: Prof. Hsin-Hsi Chen.
NYU: Description of the Proteus/PET System as Used for MUC-7 ST Roman Yangarber & Ralph Grishman Presented by Jinying Chen 10/04/2002.
Enhancing Cluster Labeling Using Wikipedia David Carmel, Haggai Roitman, Naama Zwerdling IBM Research Lab (SIGIR’09) Date: 11/09/2009 Speaker: Cho, Chin.
Discovering Descriptive Knowledge Lecture 18. Descriptive Knowledge in Science In an earlier lecture, we introduced the representation and use of taxonomies.
Blog Summarization We have built a blog summarization system to assist people in getting opinions from the blogs. After identifying topic-relevant sentences,
2015/12/121 Extracting Key Terms From Noisy and Multi-theme Documents Maria Grineva, Maxim Grinev and Dmitry Lizorkin Proceeding of the 18th International.
1 Masters Thesis Presentation By Debotosh Dey AUTOMATIC CONSTRUCTION OF HASHTAGS HIERARCHIES UNIVERSITAT ROVIRA I VIRGILI Tarragona, June 2015 Supervised.
Named Entity Disambiguation on an Ontology Enriched by Wikipedia Hien Thanh Nguyen 1, Tru Hoang Cao 2 1 Ton Duc Thang University, Vietnam 2 Ho Chi Minh.
Ranking Definitions with Supervised Learning Methods J.Xu, Y.Cao, H.Li and M.Zhao WWW 2005 Presenter: Baoning Wu.
4. Relationship Extraction Part 4 of Information Extraction Sunita Sarawagi 9/7/2012CS 652, Peter Lindes1.
1 Yang Yang *, Yizhou Sun +, Jie Tang *, Bo Ma #, and Juanzi Li * Entity Matching across Heterogeneous Sources *Tsinghua University + Northeastern University.
Mining Wiki Resoures for Multilingual Named Entity Recognition Xiej un
1 Fine-grained and Coarse-grained Word Sense Disambiguation Jinying Chen, Hoa Trang Dang, Martha Palmer August 22, 2003.
Exploiting Named Entity Taggers in a Second Language Thamar Solorio Computer Science Department National Institute of Astrophysics, Optics and Electronics.
Event-Based Extractive Summarization E. Filatova and V. Hatzivassiloglou Department of Computer Science Columbia University (ACL 2004)
CONTEXTUAL SEARCH AND NAME DISAMBIGUATION IN USING GRAPHS EINAT MINKOV, WILLIAM W. COHEN, ANDREW Y. NG SIGIR’06 Date: 2008/7/17 Advisor: Dr. Koh,
Virtual Examples for Text Classification with Support Vector Machines Manabu Sassano Proceedings of the 2003 Conference on Emprical Methods in Natural.
Overview of Statistical NLP IR Group Meeting March 7, 2006.
Bringing Order to the Web : Automatically Categorizing Search Results Advisor : Dr. Hsu Graduate : Keng-Wei Chang Author : Hao Chen Susan Dumais.
Maximum Entropy techniques for exploiting syntactic, semantic and collocational dependencies in Language Modeling Sanjeev Khudanpur, Jun Wu Center for.
Semantic search-based image annotation Petra Budíková, FI MU CEMI meeting, Plzeň,
COMP423: Intelligent Agent Text Representation. Menu – Bag of words – Phrase – Semantics Semantic distance between two words.
Data Mining and Text Mining. The Standard Data Mining process.
Sentiment analysis algorithms and applications: A survey
Text Categorization Assigning documents to a fixed set of categories
Automatic Detection of Causal Relations for Question Answering
Date: 2012/11/15 Author: Jin Young Kim, Kevyn Collins-Thompson,
Text Mining Application Programming Chapter 9 Text Categorization
Presentation transcript:

Using Wikipedia for Hierarchical Finer Categorization of Named Entities Aasish Pappu Language Technologies Institute Carnegie Mellon University PACLIC 2009

outline 1 Introduction 2 Related Work 3 Corpus Creation 4 Features 5 Experiments 6 Conclusion 2

Structured and organized encyclopedic corpus is a suitable training corpus. –a wide range of topics –provides hyperlinks 31 Introduction

In this paper 1)Discuss the usability of Wikipedia 2)Induce WordNet and Wikipedia domain taxonomy into the feature space 3)Using Maximum Entropy and SVM classifier 41 Introduction

outline 1 Introduction 2 Related Work 3 Corpus Creation 4 Features 5 Experiments 6 Conclusion 5

Kazama and Torisawa (2007) –extracted gloss text Dakka and Cucerzan (2008) –tagging the Wikipedia data Bunescu and Pasca (2006) –built a disambiguation system 2 Related Work6

outline 1 Introduction 2 Related Work 3 Corpus Creation 4 Features 5 Experiments 6 Conclusion 7

English version of Wikipedia 2 million articles 292,384 categories a taxonomy with a depth about 10 –5882 Wikipedia Stub categories –105 domains 83 Corpus Creation

3.1 Categories in Wikipedia 3.2 Named entity categories 3.3 Procedure 9

taxonomy –constituted by categories –linked to other categories across depth and breadth contains cycles –Tackled by Zesch and Gurevych, 2007 wikipedia taxonomy is not a tree 3.1 Categories in Wikipedia10

3 Corpus Creation 3.1 Categories in Wikipedia 3.2 Named entity categories 3.3 Procedure 11

the domain hierarchy –17 basic domains –88 sub-domains Named entity categories

to avoid the bias towards any particular domain rules to choose set of categories –To ensure diversity in the categorization task –To ensure we select balanced categories –consider category with each parameter closest to mean value under that domain Named entity categories

3 Corpus Creation 3.1 Categories in Wikipedia 3.2 Named entity categories 3.3 Procedure 14

extract named entity phrases –using Stanford POS tagger extract typed dependency relationships extract the content words around a named entity –collect the NPs (noun phrases) and VPs (verb phrases) 3.3 Procedure15

1)Firstly, we look for redirected and disambiguated article titles matching with first name of the named entity. 2)If, there are more than one such titles, consider the target title using minimum edit distance metric. 3)Pick all articles that fall under the same category as the target article. 4)Look for those articles that fall under the special categories that are chosen for the classification task. 5)Find the article that shares maximum number of categories with the target article and label the target article with the its special category Procedure

About 10,000 samples –Training 75% –Testing 25% Procedure

183.3 Procedure

outline 1 Introduction 2 Related Work 3 Corpus Creation 4 Features 5 Experiments 6 Conclusion 19

four types of feature sets –a syntactic feature set –three semantic features 4 Features20

4 Features 4.1 Typed Dependency Feature 4.2 Hypernyms 4.3 Domain based features 21

phrase structure parse –nesting of multi-word constituents dependency parse –dependencies between individual words dependency relations gives a clue about probable semantic relations that can be associated with the named entity. 4.1 Typed Dependency Feature22

4 Features 4.1 Typed Dependency Feature 4.2 Hypernyms 4.3 Domain based features 23

preferred to have a hypernym feature which is semantically specific –hypernyms of all synsets are inversely ordered according to their depth in the hypernym tree –deepest hypernym in the lot is choosen as the target feature for that content word 4.2 Hypernyms24

4 Features 4.1 Typed Dependency Feature 4.2 Hypernyms 4.3 Domain based features 25

4 Features 4.1 Typed Dependency Feature 4.2 Hypernyms 4.3 Domain based features Wordnet domains Wikipedia domains WDH vsWikipedia Domain System 26

Every synset in WordNet is associated a domain label in Wordnet Domain Hierarchy (WDH) There are 5 top-level domains and 46 basic domains in WDH Wordnet domains27

4 Features 4.1 Typed Dependency Feature 4.2 Hypernyms 4.3 Domain based features Wordnet domains Wikipedia domains WDH vsWikipedia Domain System 28

indexed Wikipedia search content words in the index for the categories that contain more number of pages containing a content word Especially, pages with links are weighed double the pages that contains the word without a hyperlink Wikipedia domains

4 Features 4.1 Typed Dependency Feature 4.2 Hypernyms 4.3 Domain based features Wordnet domains Wikipedia domains WDH vsWikipedia Domain System 30

4.3.3 WDH vsWikipedia Domain System 31

outline 1 Introduction 2 Related Work 3 Corpus Creation 4 Features 5 Experiments 6 Conclusion 32

335 Experiments

outline 1 Introduction 2 Related Work 3 Corpus Creation 4 Features 5 Experiments 5.1 Experiment 1: Feature wise model 5.2 Experiment 2: Feature combination model 5.3 Experiment 3: Error analysis 34

35

outline 1 Introduction 2 Related Work 3 Corpus Creation 4 Features 5 Experiments 5.1 Experiment 1: Feature wise model 5.2 Experiment 2: Feature combination model 5.3 Experiment 3: Error analysis 36

37

38

39

outline 1 Introduction 2 Related Work 3 Corpus Creation 4 Features 5 Experiments 5.1 Experiment 1: Feature wise model 5.2 Experiment 2: Feature combination model 5.3 Experiment 3: Error analysis 40

41

outline 1 Introduction 2 Related Work 3 Corpus Creation 4 Features 5 Experiments 6 Conclusion 42

presented a named entity categorization system –employs Wikipedia categories as classes adapted hierachial categorization of Wikipedia –mine relations among named entities 43