Automatically Extending NE coverage of Arabic WordNet using Wikipedia


Automatically Extending NE coverage of Arabic WordNet using Wikipedia
Musa Alkhalifa (2), Horacio Rodríguez (1)
(1) Talp Research Center, UPC, Barcelona, Spain
(2) UB, Barcelona, Spain
Citala 2009

Index of the presentation
- Introduction & motivation
  - AWN
  - NEs
  - Wikipedia
- Collecting NEs in AWN
- Collecting NEs from Wikipedia
- Our system
- Empirical evaluation
- Conclusions

Introduction & motivation: AWN
- Funded by the USA REFLEX program (2005-2007)
- Partners:
  - Universities: Princeton, Manchester, UPC, UB
  - Companies: Articulate Software, Irion
- Description:
  - Black et al., 2006
  - Elkateb et al., 2006
  - Rodríguez et al., 2008a
  - Rodríguez et al., 2008b

Introduction & motivation: AWN
Objectives
- 10,000 synsets, including some amount of domain-specific data
- linked to PWN 2.0, finally to PWN 3.0
- linked to SUMO
- + 1,000 NEs
- manually built (or revised)
- vowelized entries
- including the root of each entry

Introduction & motivation: AWN
Current figures

  Arabic synsets   11270
  Arabic words     23496

  POS      DB content
  adj      661
  nouns    7961
  adv      110
  verbs    2538

  Named entities
  Synsets that are named entities             1142
  Synsets that are not named entities         10028
  Words in synsets that are named entities    1656

Introduction & motivation: NEs
- Importance of NEs for NLP tasks & applications
  - Mention detection, coreference resolution, textual entailment, ...
  - IR, Q&A, summarization, ...
- Lack of sufficient coverage in WN (and AWN)
- Additional sources
  - The Web
  - Wikipedia

Introduction & motivation: Wikipedia
Importance of Wikipedia
- Size
  - English: 2 683 000+ articles
  - Deutsch: 847 000+
  - Español: 431 000+
  - Français: 746 000+
  - Italiano: 527 000+
  - Português: 449 000+
  - ...
  - > 200 languages
- Collaborative effort
- Exponential growth

Introduction & motivation: Wikipedia
- The Arabic version (AWP) has over 65,000 articles (about 1% of the total size of WP)
- Among all the languages, Arabic ranks 29th, just above Serbian and Slovenian
- AWP is growing very fast (more than 100% over the last year)

Collecting NEs in AWN
- Objectives
  - 1,000 synsets
  - variety of types (locations, persons, organizations, ...)
- Approach
  - Selection of the candidates
  - Manual validation

Collecting NEs in AWN
Selection of the candidates: sources
- GEONAMES
- FAO
- NMSU Arabic/English lexicon

Collecting NEs in AWN
Selection of the candidates
- Identifying synsets corresponding to instances
- Obtaining the generic types
  - 371 generic types such as 'capitals', 'cities', 'countries', 'inhabitants' or 'politicians'
- Filtering out those not linked to AWN
- Obtaining NMSU entries corresponding to the variants in instance synsets
- Formatting and merging the results of the three sources
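The filtering step in this selection can be pictured with a small sketch. It is only an illustration on toy data, not the authors' code: the names `select_candidates`, `instance_types` and `awn_linked_types` are hypothetical, and the real process works on PWN instance synsets and AWN links rather than plain dictionaries.

```python
def select_candidates(instance_types, awn_linked_types):
    """Group instances by generic type, keeping only types already linked to AWN."""
    candidates = {}
    for instance, generic_type in instance_types.items():
        if generic_type in awn_linked_types:
            candidates.setdefault(generic_type, []).append(instance)
    # Sort each group so the output is deterministic.
    return {t: sorted(names) for t, names in candidates.items()}


# Toy data, assumed purely for illustration.
instance_types = {
    "Cairo": "capitals",
    "Paris": "capitals",
    "Nile": "rivers",
    "Zeus": "deities",
}
awn_linked_types = {"capitals", "rivers"}

selected = select_candidates(instance_types, awn_linked_types)
```

Here 'deities' is dropped because, in the toy data, it has no link to AWN; the surviving groups are what would go on to manual validation.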

[Figure: a fragment of the GEONAMES database]

Collecting NEs in AWN
Manual validation
- Deciding the acceptance or rejection of the pair
- Modifying the Arabic form if needed
- Adding diacritics
- Completing attachments to PWN 2.0 if possible

Collecting NEs in AWN
Results
- 1,147 synsets
- 1,659 variants
- 31 generic types

Collecting NEs in AWN

Collecting NEs from Wikipedia
- Using Wikipedia for NLP tasks
  - see a tutorial on my page: http://www.lsi.upc.edu/horacio ...
- Multilingual tasks using interwiki links
  - Richman and Schone, 2008
  - Ferrández et al., 2007
- Software
  - Iryna Gurevych's (U. Darmstadt) JWPL system

Collecting NEs from Wikipedia
- Crude approach:
  - English NE -> Arabic interwiki link -> Arabic NE
- But ...
  - Which English NEs have to be looked for?
  - How to deal with polysemy?
  - Vowelization (recovering diacritics)
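The crude approach amounts to a single lookup through the interwiki links. The sketch below assumes the links have already been extracted into a dictionary; `crude_lookup` and `interwiki_links` are hypothetical names, and the entries are toy data rather than output of JWPL or any other Wikipedia API.

```python
def crude_lookup(english_ne, interwiki_links):
    """Return the Arabic article title linked from the English article, or None."""
    return interwiki_links.get(english_ne, {}).get("ar")


# Toy interwiki table: English title -> {language code: title}.
interwiki_links = {
    "Cairo": {"ar": "القاهرة", "es": "El Cairo"},
    "Barcelona": {"ar": "برشلونة"},
    "Springfield": {"es": "Springfield"},  # no Arabic link available
}

cairo_ar = crude_lookup("Cairo", interwiki_links)
springfield_ar = crude_lookup("Springfield", interwiki_links)
```

The lookup says nothing about which sense of an ambiguous title such as "Springfield" is meant, and the Arabic title comes back without diacritics, which is why the open questions listed on this slide matter.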

Collecting NEs from Wikipedia
Our approach:
- Which English NEs have to be looked for?
  - Same approach used in building AWN
- How to deal with polysemy?
  - use of disambiguation pages when available in EWP
  - comparing (using the Vector Space Model) with:
    - the set of variants (senses) of each generic type
    - the set of words occurring in the gloss (after removing stopwords and examples)
    - the topic signature
- Vowelization (recovering diacritics)
  - comparison with other interwiki links
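A Vector Space Model comparison of this kind can be sketched with bag-of-words vectors and cosine similarity. This is a minimal illustration under stated assumptions, not the system described in the slides: the weighting is raw term counts, and `cosine`, `best_sense` and the toy word lists are all hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_sense(context_words, candidate_articles):
    """Pick the candidate article whose words are closest to the synset context."""
    context = Counter(context_words)
    scores = {
        title: cosine(context, Counter(words))
        for title, words in candidate_articles.items()
    }
    return max(scores, key=scores.get), scores


# Toy context: words from the generic type and the gloss of the PWN synset.
context_words = ["capital", "city", "france", "national"]
# Toy candidates taken from a disambiguation page.
candidate_articles = {
    "Paris": ["capital", "city", "france", "seine"],
    "Paris (mythology)": ["trojan", "prince", "helen", "troy"],
}

winner, scores = best_sense(context_words, candidate_articles)
```

In this toy case the city article shares three of four context words and wins; a topic signature could be added to the context vector in the same way.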

Our approach

Results
Our approach:
- We started with 16,873 English NEs occurring as instances in PWN 2.0
- Of these, 14,904 also occur in EWP as article titles. This is very good coverage (88%)
- 3,854 Arabic words corresponding to 2,589 English synsets were recovered following our approach. The coverage (26%) is high, taking into account the small size of AWP
- Of the recovered synsets, only 496 belonged to the set of NEs already included in AWN
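The two coverage figures can be reproduced from the counts on this slide. The 88% is 14,904 out of 16,873; the 26% is consistent with dividing the 3,854 recovered Arabic words by the 14,904 EWP titles, although that reading of the denominator is an assumption. The variable names are illustrative only.

```python
def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


pwn_instances = 16873   # English NEs occurring as instances in PWN 2.0
found_in_ewp = 14904    # of those, also article titles in EWP
arabic_words = 3854     # Arabic words recovered through AWP

ewp_coverage = percent(found_in_ewp, pwn_instances)
# Assumed denominator: the EWP titles, not the PWN instances.
awp_coverage = percent(arabic_words, found_in_ewp)

print(f"EWP coverage: {ewp_coverage:.0f}%")
print(f"AWP coverage: {awp_coverage:.0f}%")
```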

Results
Our approach:
- Automatic evaluation
  - Of the 496 synsets included in both sets, 464 were the same and 32 differed
  - 93.4% accuracy
- Manual validation
  - Of the 3,854 proposed assignments, 3,596 (93.3%) were considered correct, 67 (1.7%) were considered wrong and 191 (5%) were not known
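The evaluation percentages follow directly from the counts. The short check below uses only the numbers on this slide; the variable names are illustrative. Note that 464/496 works out to roughly 93.5%, marginally above the 93.4% reported, which may simply reflect rounding or slightly different underlying counts.

```python
# Manual validation counts from the slide.
manual = {"correct": 3596, "wrong": 67, "unknown": 191}
total = sum(manual.values())
shares = {label: round(100 * count / total, 1) for label, count in manual.items()}

# Automatic evaluation: synsets found both by the system and in AWN.
same, differed = 464, 32
auto_accuracy = 100 * same / (same + differed)

print(total, shares)
print(f"automatic accuracy: {auto_accuracy:.1f}%")
```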

Conclusions
- We have presented an approach for automatically attaching Arabic NEs to English NEs using AWN, PWN, AWP and EWP as knowledge sources
- The system is fully automatic, quite accurate, and has been applied to a substantial enrichment of the NE set in AWN

Thank you for your attention