1
Arabic WordNet: What has been done, what could we do, what should we do?
Horacio Rodríguez
TALP Research Center, UPC, Barcelona, Spain
NOOJ 2009
2
Index of the talk
Introduction
  Ontologies
  Wordnets
Arabic WordNet
  What has been done
  What could we do
  What should we do
3
Introduction
Semantic components used in NLP applications:
  Ontologies
  Large-scale knowledge bases
Need for (or convenience of) developing wide-coverage, domain-independent lexico-conceptual ontologies:
  WordNet
4
Ontologies
Ontologies represent static domain knowledge in a form that multiple knowledge agents can use efficiently.
Acquiring domain knowledge for building ontologies is highly costly and time-consuming. For this reason, many methods and techniques have been developed to reduce that effort.
5
Ontologies
What an ontology is:
  An ontology is a formal, explicit specification of a shared conceptualization (Gruber, 1993; Studer et al., 1998).
  A conceptualization is an abstract, simplified view of the world, represented for some purpose.
  An ontology is a description (formal specification) of a set of concepts and relationships that enables knowledge sharing and reuse (through ontological commitments).
  An ontological commitment is an agreement to use a vocabulary in a way that is consistent with the theory specified by the ontology.
6
Lexico-Conceptual Ontologies
7
Ontologies
The mapping between lexical items (words or multiwords) and concepts can be complex.
Due to polysemy, many lexical items can be mapped to more than one concept.
Due to synonymy, more than one word can be mapped to the same concept.
Usually the mapping is split into two steps: from words to word senses (i.e. different word meanings), and from word senses to concepts.
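The two-step mapping can be sketched in a few lines of Python. This is only an illustration with toy data: the sense identifiers (`bank#1`, ...) and concept labels are invented for the example and do not come from any real resource.

```python
# Step 1: words -> word senses (toy data; sense ids are hypothetical).
WORD_TO_SENSES = {
    "bank": ["bank#1", "bank#2"],       # polysemy: one word, two senses
    "car": ["car#1"],
    "automobile": ["automobile#1"],
}

# Step 2: word senses -> concepts (toy concept labels).
SENSE_TO_CONCEPT = {
    "bank#1": "FINANCIAL_INSTITUTION",
    "bank#2": "RIVER_SIDE",
    "car#1": "MOTOR_VEHICLE",           # synonymy: two senses, one concept
    "automobile#1": "MOTOR_VEHICLE",
}


def concepts_of(word):
    """Return the concepts reachable from a word through its senses."""
    return [SENSE_TO_CONCEPT[s] for s in WORD_TO_SENSES.get(word, [])]


def words_of(concept):
    """Return, in sorted order, the words that have a sense mapped to the concept."""
    return sorted(
        word
        for word, senses in WORD_TO_SENSES.items()
        if any(SENSE_TO_CONCEPT[s] == concept for s in senses)
    )
```

Here `concepts_of("bank")` returns two concepts (polysemy), while `words_of("MOTOR_VEHICLE")` returns two words (synonymy).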
8
Wordnets
Princeton's English WordNet (Miller et al., 1990; Fellbaum, 1998)
Semantic information:
  more than 123,000 words organised in 117,000 synsets (WN 3.0)
  more than 235,000 relations between synsets
Freely available:
9
Wordnets
Princeton's English WordNet
Lexicalised concepts (words, compounds, multiwords)
Synset: synonym set (of words)
Large semantic net connecting synsets:
  synonymy, antonymy, hypernymy, hyponymy, meronymy, implication, causation ...
Structure:
  Noun hierarchy: depth ~12
  Verb hierarchy: depth ~3
  Adjectives/adverbs: not in a hierarchy, but in a star structure
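As a rough illustration of this structure, the sketch below stores a handful of synsets with typed relations and computes the depth of a synset in the hypernym hierarchy. The synset ids and the tiny network are invented for the example; they are not taken from WordNet.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    lemmas: list                                    # synonymous words
    relations: dict = field(default_factory=dict)   # relation name -> synset ids


# Toy semantic net (hypothetical ids and relations).
NET = {
    "entity": Synset(["entity"]),
    "vehicle": Synset(["vehicle"], {"hypernym": ["entity"]}),
    "car": Synset(["car", "automobile"],
                  {"hypernym": ["vehicle"], "meronym": ["wheel"]}),
    "wheel": Synset(["wheel"], {"hypernym": ["entity"]}),
}


def depth(synset_id):
    """Length of the longest hypernym path from the synset up to a root."""
    parents = NET[synset_id].relations.get("hypernym", [])
    if not parents:
        return 0
    return 1 + max(depth(parent) for parent in parents)
```

In this toy net, `depth("car")` is 2 and `depth("entity")` is 0.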
10
Example of WN relations
11
Wordnets
12
Wordnets
Beyond WN
EuroWordNet (Vossen, 1998): EU-funded project
Integrated local wordnets in several languages:
  English: Sheffield
  Dutch: Amsterdam
  Italian: Pisa
  Spanish: UB, UPC, UNED
13
Wordnets
14
Wordnets
EuroWordNet Architecture
Core:
  Inter-Lingual-Index (ILI)
  Top Concept Ontology (TCO)
  Domain Ontology (DO)
Extensions:
  Local wordnets
  Domain wordnets
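One way to picture the role of the ILI: each local wordnet links its own synsets to shared, language-neutral index records, so two wordnets are connected only through the index. The sketch below assumes a simplified one-record-per-synset mapping, and all ids are made up for illustration.

```python
# Each local wordnet maps its synset ids to a shared ILI record id.
# All identifiers here are hypothetical.
LOCAL_TO_ILI = {
    "en": {"en-dog-n": "ili-0001", "en-house-n": "ili-0002"},
    "es": {"es-perro-n": "ili-0001", "es-casa-n": "ili-0002"},
    "nl": {"nl-hond-n": "ili-0001"},
}


def equivalent_synsets(synset_id, source, target):
    """Find target-language synsets that share the ILI record of a source synset."""
    ili = LOCAL_TO_ILI[source].get(synset_id)
    if ili is None:
        return []
    return sorted(s for s, record in LOCAL_TO_ILI[target].items() if record == ili)
```

With this layout, a new local wordnet only needs links to the index to become connected to every other wordnet already linked to it.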
15
Wordnets
Beyond WN
EWN2: German (GermaNet), French, Czech, Swedish, Estonian
ITEM, CREL: Spanish, Catalan, Basque (UB, UPC)
EuroTerm, Jur-Wordnet: extending EWN in particular domains
Balkanet: extending EWN for the Balkan languages
Hownet: Chinese WN
16
Wordnets
Macro ontologies based on WN:
  MCR
  Yago
  Omega
17
Arabic WordNet
Funded by the USA REFLEX program (2005-2007)
Partners:
  Universities: Princeton, Manchester, UPC (Barcelona), UB (Barcelona)
  Companies: Articulate Software, Irion
18
Arabic WordNet papers
  Introducing the Arabic WordNet Project (Black et al., 2006)
  Building a WordNet for Arabic (Elkateb et al., 2006)
  Arabic WordNet: Current State and Future Extensions (Rodríguez et al., 2008)
  Arabic WordNet: Semi-automatic Extensions using Bayesian Inference
  Automatically Extending NE coverage of Arabic WordNet using Wikipedia (AlKhalifa, Rodríguez, 2009)
19
Arabic WordNet
Objectives:
  10,000 synsets, including some amount of domain-specific data
  linked to PWN 2.0, finally to PWN 3.0
  linked to SUMO
  + 1,000 NEs
  manually built (or revised)
  vowelized entries, including the root of each entry
20
Arabic WordNet
Criteria for selecting synsets to be covered:
  Connectivity
    as densely connected as possible
    most of them connected to English WN counterparts
    the overall topology of both wordnets is expected to be similar
  Relevance
    frequent and salient concepts
  Generality
    synsets on the highest levels of WN
21
Arabic WordNet
Approach described at the 3rd GWC (Elkateb et al., 2006):
  Manually built
  2 lexicographic interfaces: Manchester, Barcelona
  guided by automatically generated suggestions of <Arabic word, English synset> pairs coming from bilingual resources
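To make the idea of suggested <Arabic word, English synset> pairs concrete, here is a minimal sketch. It assumes a very simple scoring rule (count how many English translations of the Arabic word appear in a synset); this is an illustrative stand-in, not the heuristics actually used in the project, and the dictionary entries and synset ids are toy data.

```python
# Toy bilingual dictionary: Arabic word -> English translations.
BILINGUAL = {
    "كتاب": ["book"],
    "بيت": ["house", "home"],
}

# Toy English synsets: hypothetical synset id -> member words.
ENGLISH_SYNSETS = {
    "book.n.01": {"book", "volume"},
    "house.n.01": {"house"},
    "home.n.01": {"home", "house", "dwelling"},
}


def suggest_pairs(bilingual, synsets):
    """Return (arabic_word, synset_id, score) triples for a lexicographer to review.

    score = number of translations of the Arabic word found in the synset
    (an assumed, deliberately naive ranking criterion).
    """
    suggestions = []
    for arabic, translations in bilingual.items():
        for synset_id, members in synsets.items():
            overlap = len(members.intersection(translations))
            if overlap:
                suggestions.append((arabic, synset_id, overlap))
    # Best-scoring synsets first for each Arabic word.
    suggestions.sort(key=lambda s: (s[0], -s[2], s[1]))
    return suggestions
```

The output is only a ranked list of candidates; in the approach described on the slide, the final decision stays with the lexicographer.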
22
Arabic WordNet
Approach:
  BCs: covering the EWN & Balkanet Base Concepts
  Filling gaps
  Building Arabic-specific synsets
  Covering domain-specific synsets
  Adding NEs
  (Semi-)automatic extensions
    heuristic-based
    Bayesian networks
23
Arabic WordNet
Resources used:
  LOGOS database of Arabic verbs: contains 944 fully conjugated Arabic verbs
  Bilingual (Arabic-English) dictionaries: NMSU bilingual Arabic-English lexicon, Salmoné, University of Barcelona, Effel
  Corpora:
    Arabic GigaWord Corpus (from LDC)
    UN ( ) bilingual Arabic-English Corpus (from LDC)
24
Arabic WordNet
Representation:
  database (implemented in MySQL)
  interchange format (XML)
The database structure comprises four principal entity types: item, word, form and link.
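The four entity types could be laid out roughly as below. This is a sketch under stated assumptions: the slide names only the four entities, so every column name is hypothetical, the reading of each entity (item = concept/synset, word = word sense, form = lexical form such as a root, link = relation between items) is an assumption, and SQLite is used in place of MySQL only so the example runs without a server.

```python
import sqlite3


def build_toy_db():
    """Create an in-memory database with the four entity types (assumed columns)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE item (itemid TEXT PRIMARY KEY, pos TEXT);
        CREATE TABLE word (wordid TEXT PRIMARY KEY, value TEXT, itemid TEXT);
        CREATE TABLE form (wordid TEXT, kind TEXT, value TEXT);
        CREATE TABLE link (kind TEXT, source TEXT, target TEXT);
    """)
    conn.executemany("INSERT INTO item VALUES (?, ?)",
                     [("awn_book", "n"), ("pwn_book", "n")])
    conn.execute("INSERT INTO word VALUES (?, ?, ?)", ("w1", "كتاب", "awn_book"))
    conn.execute("INSERT INTO form VALUES (?, ?, ?)", ("w1", "root", "كتب"))
    conn.execute("INSERT INTO link VALUES (?, ?, ?)",
                 ("equivalent", "awn_book", "pwn_book"))
    return conn


def roots_for_item(conn, item_id):
    """Roots of all words attached to an item, via the word and form tables."""
    rows = conn.execute(
        "SELECT form.value FROM form JOIN word ON form.wordid = word.wordid "
        "WHERE word.itemid = ? AND form.kind = 'root' ORDER BY form.value",
        (item_id,),
    ).fetchall()
    return [row[0] for row in rows]


def linked_items(conn, item_id, kind):
    """Targets of links of a given kind that start at an item."""
    rows = conn.execute(
        "SELECT target FROM link WHERE source = ? AND kind = ? ORDER BY target",
        (item_id, kind),
    ).fetchall()
    return [row[0] for row in rows]
```

For example, after `conn = build_toy_db()`, `linked_items(conn, "awn_book", "equivalent")` follows the toy Arabic item to its English counterpart.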
25
AWN: What has been done
Current (final? we hope not!) figures, up-to-date statistics:

Arabic synsets   11270
Arabic words     23496

DB content by POS:
a     661
n    7961
r     110
v    2538

Named entities:
Synsets that are named entities              1142
Synsets that are not named entities         10028
Words in synsets that are named entities     1656
26
AWN: What has been done
Software:
  Lexicographer's Web Interface
  User's Web Interface
  The Arabic Word Spotter
  AWN browser
  AWN to SUMO mapping, including automatic generation of Arabic paraphrases of SUMO formal axioms
27
AWN: What could we do
AWN has relatively small coverage compared with PWN.
  But because of the way it was built, its coverage of the most important concepts is comparable.
AWN has a lower density of relations than PWN.
  But many uses of PWN rely only on the hypernymy/hyponymy relation, and the coverage of this relation is similar in both WNs.
28
AWN: What could we do
AWN is fully linked to PWN and, through PWN, to many other WNs.
Existing WNs, especially PWN, are used today in almost all NLP tasks that need (or involve) semantic (lexical) knowledge.
29
AWN: What could we do
So, the moral is: USE AWN FOR NLP TASKS
For instance, look at Csomai's bibliography on the WN page.
NLP tasks: IR, IE, MT, WSD, Coreference Resolution, Summarization, Textual Entailment, WN mappings, NER, NERC, Language Models, Semantic distances
30
AWN: What should we do
AWN is far from complete:
  only 10,000 regular synsets
  only 1,000 NE synsets
  low density of relations
  lack of appropriate APIs for interfacing with computer applications
So, the moral is: IMPROVE AND EXTEND AWN
31
AWN: What should we do
Lines of improvement:
  Extend AWN coverage:
    Manual
    Semi-automatic
      Heuristic-based approach: GWC 2008 (Rodríguez et al., 2008a)
      Bayesian networks: LREC 2008 (Rodríguez et al., 2008b)
  Improve the relation density:
    Finding new relations
    Manually revising relations that exist for other languages (especially English)
    Using roots as a way of suggesting new relations
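The last point, using roots to suggest new relations, could look something like the sketch below: synsets whose words share a root become candidate pairs for a lexicographer to check, unless they are already linked. The grouping rule is an assumption made for illustration, and the synset labels and existing links are toy data.

```python
from collections import defaultdict
from itertools import combinations

# (word, root, synset label): toy entries; synset labels are hypothetical.
ENTRIES = [
    ("كتاب", "كتب", "book"),
    ("كاتب", "كتب", "writer"),
    ("مكتبة", "كتب", "library"),
    ("مدرسة", "درس", "school"),
    ("مدرس", "درس", "teacher"),
]

# Pairs of synsets that already have some relation (toy data).
EXISTING_LINKS = {frozenset(("book", "library"))}


def suggest_by_root(entries, existing_links):
    """Propose unlinked synset pairs whose words share the same root."""
    by_root = defaultdict(set)
    for _word, root, synset in entries:
        by_root[root].add(synset)
    candidates = []
    for root in sorted(by_root):
        for a, b in combinations(sorted(by_root[root]), 2):
            if frozenset((a, b)) not in existing_links:
                candidates.append((root, a, b))
    return candidates
```

The candidates carry no relation type; deciding whether a pair is related, and how, is left to manual revision.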
32
AWN: What should we do
Lines of improvement:
  Building APIs to make the database easier to use (Perl, Python, Prolog, C, Java, ...), including the computation of semantic distances
  Extending the coverage of NEs
    e.g. using Wikipedia as a knowledge source: Citala 2009 (AlKhalifa, Rodríguez, 2009)
  Linking AWN with other already available resources: Wikipedia, Cyc, Geonames, ...
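As an example of what such an API might offer, the sketch below computes a simple path-based semantic distance: the number of hypernym edges on the shortest path between two synsets through a common ancestor. The function names and the toy hierarchy are hypothetical, and path length is only one of several distance measures an API could expose.

```python
from collections import deque

# Toy hypernym graph: synset -> list of direct hypernyms (hypothetical data).
HYPERNYMS = {
    "dog": ["canine"],
    "canine": ["carnivore"],
    "cat": ["feline"],
    "feline": ["carnivore"],
    "carnivore": ["mammal"],
    "mammal": ["animal"],
    "animal": [],
}


def ancestor_distances(synset):
    """Map the synset and each of its ancestors to the minimum number of edges."""
    dist = {synset: 0}
    queue = deque([synset])
    while queue:
        current = queue.popleft()
        for parent in HYPERNYMS.get(current, []):
            if parent not in dist:
                dist[parent] = dist[current] + 1
                queue.append(parent)
    return dist


def path_distance(a, b):
    """Shortest hypernym path between a and b via a common ancestor, or None."""
    dist_a = ancestor_distances(a)
    dist_b = ancestor_distances(b)
    common = set(dist_a) & set(dist_b)
    if not common:
        return None
    return min(dist_a[c] + dist_b[c] for c in common)
```

Because this measure needs only the hypernymy/hyponymy relation, it would work on a wordnet with a low density of other relations.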
33
Thank you for your attention