Automatically Extending NE coverage of Arabic WordNet using Wikipedia

Musa Alkhalifa2, Horacio Rodríguez1 1 Talp Research Center, UPC, Barcelona, Spain 2 UB, Barcelona, Spain Citala 2009

Index of the presentation
Introduction & motivation
  AWN
  NEs
  Wikipedia
Collecting NEs in AWN
Collecting NEs from Wikipedia
Our system
Empirical evaluation
Conclusions

Introduction & motivation: AWN
Funded by the USA REFLEX program ( )
Partners:
  Universities: Princeton, Manchester, UPC, UB
  Companies: Articulate Software, Irion
Description:
  Black et al, 2006
  Elkateb et al, 2006
  Rodríguez et al, 2008a
  Rodríguez et al, 2008b

Objectives
  10,000 synsets, including some amount of domain-specific data
  linked to PWN 2.0, and finally to PWN 3.0
  linked to SUMO
  + 1,000 NEs
  manually built (or revised)
  vowelized entries
  including the root of each entry

Current figures
  Arabic synsets   11270
  Arabic words     23496
  POS        DB content
    adj         661
    nouns      7961
    adv         110
    verbs      2538
  Named entities:
    Synsets that are named entities              1142
    Synsets that are not named entities         10028
    Words in synsets that are named entities     1656

Introduction & motivation: NEs
Importance of NEs for NLP tasks & applications
  Mention detection, Coreference resolution, Textual Entailment, ...
  IR, Q&A, Summarization, ...
Lack of sufficient coverage in WN (and AWN)
Additional sources
  The Web
  Wikipedia

Introduction & motivation: Wikipedia
Importance of Wikipedia
  Size
    English: articles
    Deutsch:
    Español:
    Français:
    Italiano:
    Português:
    ...
  > 200 languages
  Collaborative effort
  Exponential growth

Introduction & motivation: Wikipedia
The Arabic version (AWP) has over 65,000 articles (about 1% of the total size of WP).
Among all languages, Arabic ranks 29th, just above Serbian and Slovenian.
AWP is growing very fast (by more than 100% over the last year).

Collecting NEs in AWN
Objectives
  1,000 synsets
  variety of types (locations, persons, organizations, ...)
Approach
  Selection of the candidates
  Manual validation

Collecting NEs in AWN: selection of the candidates
Sources
  GEONAMES
  FAO
  NMSU Arabic/English lexicon

Collecting NEs in AWN: selection of the candidates
Identifying synsets corresponding to instances
Obtaining the generic types
  371 generic types such as 'capitals', 'cities', 'countries', 'inhabitants' or 'politicians'
  Filtering out those not linked to AWN
Obtaining NMSU entries corresponding to the variants in instance synsets
Formatting and merging the results of the three sources
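
The grouping and filtering steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the instance list, the set of linked types, and the function name `select_candidates` are hypothetical toy stand-ins, not the data or code used for AWN.

```python
from collections import defaultdict

# Toy stand-in for instance synsets: (English NE, generic type). Hypothetical data.
INSTANCES = [
    ("Cairo", "capital"),
    ("Nile", "river"),
    ("Baghdad", "capital"),
    ("Avicenna", "philosopher"),
]

# Generic types assumed to be linked to AWN already (hypothetical).
TYPES_LINKED_TO_AWN = {"capital", "river"}


def select_candidates(instances, linked_types):
    """Group instance NEs by generic type, dropping types not linked to AWN."""
    by_type = defaultdict(list)
    for name, generic_type in instances:
        if generic_type in linked_types:
            by_type[generic_type].append(name)
    return dict(by_type)


candidates = select_candidates(INSTANCES, TYPES_LINKED_TO_AWN)
print(candidates)
```

In this toy run "Avicenna" is dropped because its generic type is not in the linked set; the surviving candidates would then go on to lexicon lookup and manual validation.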

A fragment of the GEONAMES database

Collecting NEs in AWN: manual validation
Deciding the acceptance or rejection of the pair.
Modifying the Arabic form if needed.
Adding diacritics.
Completing attachments to PWN 2.0 if possible.

Collecting NEs in AWN: results
  1,147 synsets
  1,659 variants
  31 generic types

Collecting NEs in AWN

Collecting NEs from Wikipedia
Using Wikipedia for NLP tasks
  see a tutorial in my page: ...
Multilingual tasks using interwiki links
  Richman and Schone, 2008
  Ferrández et al, 2007
Software
  Iryna Gurevych's (U. Darmstadt) JWPL system

Crude approach: English NE -> Arabic interwiki link -> Arabic NE
But ...
  Which English NEs have to be looked for?
  How to deal with polysemy?
  Vowelization (recovering diacritics)
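
A minimal sketch of the crude lookup, assuming the English article is available as raw wikitext in which interlanguage links appear inline as `[[ar:Title]]` (the markup used at the time). The function name `arabic_title` and the sample text are hypothetical; the slides mention the JWPL system, which is not used here.

```python
import re

# Assumption: interlanguage links are written inline as [[ar:Title]] in the wikitext.
INTERWIKI_AR = re.compile(r"\[\[ar:([^\]|]+)\]\]")


def arabic_title(wikitext):
    """Return the Arabic interwiki title found in the wikitext, or None."""
    match = INTERWIKI_AR.search(wikitext)
    return match.group(1).strip() if match else None


# Hypothetical fragment of an English article about Cairo.
sample = "'''Cairo''' is the capital of Egypt.\n[[ar:القاهرة]]\n[[de:Kairo]]\n"
print(arabic_title(sample))
```

The lookup alone says nothing about which sense of an ambiguous English name the article covers, and the Arabic title comes back without diacritics.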

Our approach:
Which English NEs have to be looked for?
  Same approach used in building AWN
How to deal with polysemy?
  use of disambiguation pages when available in EWP
  comparing (using the Vector Space Model) with:
    the set of variants (senses) of each generic type
    the set of words occurring in the gloss (after removing stopwords and examples)
    the topic signature
Vowelization (recovering diacritics)
  comparison with other interwiki links
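
The Vector Space Model comparison can be illustrated with a plain bag-of-words cosine similarity. This is a sketch under assumptions, not the system's actual weighting: the stopword list, the toy gloss, the candidate texts, and the function names are all hypothetical.

```python
import math
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the list used in the system).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "at"}


def bag_of_words(text):
    """Lowercase, split on whitespace, and count the non-stopword tokens."""
    return Counter(w for w in text.lower().split() if w not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_article(gloss, candidates):
    """Pick the candidate article whose text is closest to the gloss."""
    target = bag_of_words(gloss)
    return max(candidates, key=lambda t: cosine(target, bag_of_words(candidates[t])))


# Toy gloss and toy article openings for an ambiguous name.
gloss = "a state in the southeastern United States"
candidates = {
    "Georgia (U.S. state)": "Georgia is a state in the southeastern region of the United States",
    "Georgia (country)": "Georgia is a country in the Caucasus region",
}
print(best_article(gloss, candidates))
```

The same scoring could be repeated with the variants of the generic type or a topic signature in place of the gloss.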

Our approach

Results
We started with 16,873 English NEs occurring as instances in PWN 2.0.
Of these, 14,904 also occur in EWP as article titles, a very good coverage (88%).
Following our approach, 3,854 Arabic words corresponding to 2,589 English synsets were recovered.
This coverage (26%) is high considering the small size of AWP.
Of the recovered synsets, only 496 belonged to the set of NEs already included in AWN.

Results: automatic evaluation and manual validation
Automatic evaluation
  Of the 496 synsets included in both sets, 464 were the same and 32 differed (93.4% accuracy).
Manual validation
  Of the 3,854 proposed assignments, 3,596 (93.3%) were considered correct, 67 (1.7%) were considered wrong, and 191 (5%) were not known.
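
The manual validation percentages can be re-derived from the counts with a few lines of arithmetic. The helper name `share` and the dictionary labels are illustrative; only the counts come from the slide.

```python
def share(part, total):
    """Return part/total as a percentage rounded to one decimal place."""
    return round(100.0 * part / total, 1)


# Counts from the manual validation of the proposed assignments.
manual = {"correct": 3596, "wrong": 67, "unknown": 191}
total = sum(manual.values())
shares = {label: share(count, total) for label, count in manual.items()}
print(total, shares)
```

Applied to the automatic evaluation counts, `share(464, 496)` gives 93.5, close to the 93.4% reported.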

Conclusions
We have presented an approach for automatically attaching Arabic NEs to English NEs using AWN, PWN, AWP and EWP as knowledge sources.
The system is fully automatic and quite accurate, and has been applied to substantially enrich the NE set in AWN.

Thank you for your attention
