Arabic WordNet: What has been done, what could we do, what should we do? Horacio Rodríguez, TALP Research Center, UPC, Barcelona, Spain horacio@lsi.upc.edu http://lsi.upc.edu/horacio NOOJ 2009
Index of the talk: Introduction (Ontologies, Wordnets); Arabic WordNet (What has been done, What could we do, What should we do)
Introduction Semantic components used in NLP applications: ontologies, large-scale knowledge bases. Need for (or convenience of) developing wide-coverage, domain-independent lexico-conceptual ontologies: WordNet
Ontologies Ontologies represent static domain knowledge, allowing efficient use by multiple knowledge agents. Acquiring domain knowledge for building ontologies is costly and time-consuming. For this reason, many methods and techniques have been developed to reduce the effort.
Ontologies What an ontology is: an ontology is a formal, explicit specification of a shared conceptualization (Studer et al, 1998). Following Gruber, 1993 and Studer et al, 1998: A conceptualization is an abstract, simplified view of the world represented for some purpose. An ontology is a description (formal specification) of a set of concepts and relationships for enabling knowledge sharing and reuse (to perform logical commitments). An ontological commitment is an agreement to use a vocabulary in a way that is consistent with respect to the theory specified by the ontology.
Lexico-Conceptual Ontologies
Ontologies The mapping between lexical items (words or multiwords) and concepts can be complex. Due to polysemy, most lexical items can be mapped to more than one concept. Due to synonymy, more than one word can be mapped to the same concept. Usually the mapping is split into two steps: from words to word-senses (i.e. different word meanings) and from word-senses to concepts.
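The two-step mapping can be sketched in a few lines of Python. This is only an illustration: the words, sense numbers and concept names are invented toy data, and `senses_of`, `concepts_of` and `words_for` are hypothetical helper names, not part of any WordNet API.

```python
# Hypothetical toy data: (word, sense number) -> concept identifier.
SENSE_TO_CONCEPT = {
    ("car", 1): "motor_vehicle",
    ("car", 2): "railway_carriage",
    ("automobile", 1): "motor_vehicle",
}


def senses_of(word):
    """Step 1: map a word to its word-senses."""
    return sorted(sense for sense in SENSE_TO_CONCEPT if sense[0] == word)


def concepts_of(word):
    """Step 2: map each word-sense of the word to its concept."""
    return [SENSE_TO_CONCEPT[sense] for sense in senses_of(word)]


def words_for(concept):
    """Inverse view: every word that lexicalises the concept (synonymy)."""
    return sorted({w for (w, _), c in SENSE_TO_CONCEPT.items() if c == concept})
```

Here `concepts_of("car")` shows polysemy (one word, two concepts) and `words_for("motor_vehicle")` shows synonymy (one concept, two words).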
Wordnets Princeton's English WordNet (Miller et al, 1990; Fellbaum, 1998). Semantic information: more than 123,000 words organised in 117,000 synsets (WN3.0); more than 235,000 relations between synsets. Freely available: http://wordnet.princeton.edu/
Wordnets Princeton's English WordNet. Lexicalised concepts (words, compounds, multiwords). Synset: synonym set (of words). Large semantic net connecting synsets: synonymy, antonymy, hypernymy, hyponymy, meronymy, implication, causation ... Structure: noun hierarchy, depth ~12; verb hierarchy, depth ~3; adjectives/adverbs not in a hierarchy, but in a star structure
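As a rough illustration of what "hierarchy depth" means, the sketch below walks a hypernym chain up to the root. The fragment is invented toy data, and it assumes a single hypernym per synset, which real wordnets do not guarantee.

```python
# Hypothetical toy fragment of a noun hierarchy: synset -> its (single) hypernym.
HYPERNYM = {
    "dog": "canine",
    "canine": "carnivore",
    "carnivore": "mammal",
    "mammal": "animal",
    "animal": "entity",
}


def hypernym_chain(synset):
    """Return the synset followed by all its ancestors up to the root."""
    chain = [synset]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def depth(synset):
    """Number of hypernymy links between the synset and the root."""
    return len(hypernym_chain(synset)) - 1
```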
Example of WN relations
Wordnets Beyond WN: EuroWordNet (Vossen, 1998). EU-funded project. Integrated local wordnets in several languages: English (Sheffield), Dutch (Amsterdam), Italian (Pisa), Spanish (UB, UPC, UNED). http://www.hum.uva.nl/~ewn/
Wordnets EuroWordNet Architecture. Core: Inter-Lingual-Index (ILI), Top Concept Ontology (TCO), Domain Ontology (DO). Extensions: local wordnets, domain wordnets
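The role of the ILI as a pivot between local wordnets can be sketched as follows. All identifiers are invented for illustration, and `translate` is a hypothetical helper, not a EuroWordNet tool.

```python
# Hypothetical records: each local wordnet maps its own synset ids to an ILI record.
SPANISH_TO_ILI = {"es-001": "ili-dog", "es-002": "ili-cat"}
DUTCH_TO_ILI = {"nl-101": "ili-dog", "nl-102": "ili-horse"}


def invert(mapping):
    """Group local synset ids by the ILI record they point to."""
    inverse = {}
    for local_id, ili_id in mapping.items():
        inverse.setdefault(ili_id, []).append(local_id)
    return inverse


def translate(local_id, source_to_ili, target_to_ili):
    """Find target-language synsets sharing the ILI record of a source synset."""
    ili_id = source_to_ili.get(local_id)
    if ili_id is None:
        return []
    return invert(target_to_ili).get(ili_id, [])
```

Because every local wordnet links only to the ILI, adding a new language requires one mapping rather than one mapping per language pair.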
Wordnets Beyond WN: EWN2: German (GermaNet), French, Czech, Swedish, Estonian. ITEM, CREL: Spanish, Catalan, Basque (UB, UPC). EuroTerm, Jur-Wordnet: extending EWN in particular domains. Balkanet: extending EWN for the Balkan languages. Hownet Chinese WN
Wordnets Macro ontologies based on WN: MCR, Yago, Omega
Arabic WordNet Funded by the USA REFLEX program (2005-2007). Partners: Universities: Princeton, Manchester, UPC (Barcelona), UB (Barcelona). Companies: Articulate Software, Irion
Arabic WordNet papers: Introducing the Arabic WordNet Project (Black et al, 2006); Building a WordNet for Arabic (Elkateb et al, 2006); Arabic WordNet: Current State and Future Extensions (Rodríguez et al, 2008a); Arabic WordNet: Semi-automatic Extensions using Bayesian Inference (Rodríguez et al, 2008b); Automatically Extending NE coverage of Arabic WordNet using Wikipedia (AlKhalifa, Rodríguez, 2009)
Arabic WordNet Objectives: 10,000 synsets, including some amount of domain-specific data; linked to PWN 2.0, finally to PWN 3.0; linked to SUMO; + 1,000 NEs; manually built (or revised); vowelized entries, including the root of each entry
Arabic WordNet Criteria for selecting synsets to be covered. Connectivity: as densely connected as possible; most of them connected to English WN counterparts; the overall topology of both wordnets is expected to be similar. Relevance: frequent and salient concepts. Generality: synsets on the highest levels of WN
Arabic WordNet Approach described at the 3rd GWC (Elkateb et al, 2006). Manually built, with 2 lexicographic interfaces (Manchester, Barcelona), guided by automatically generated suggestions of <Arabic word, English synset> pairs coming from bilingual resources.
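How such suggestions might be produced can be sketched as below. This is not the project's actual procedure: the dictionary entries (transliterated for readability), the synset identifiers and the `suggest` function are all assumptions made for illustration.

```python
# Hypothetical toy resources; real input would come from bilingual dictionaries and PWN.
BILINGUAL = {
    "kitab": ["book"],
    "bayt": ["house", "home"],
}
ENGLISH_SYNSETS = {
    "book": ["book.n.01", "book.v.01"],
    "house": ["house.n.01"],
    "home": ["home.n.01", "house.n.01"],
}


def suggest(arabic_word):
    """Propose <Arabic word, English synset> pairs for a lexicographer to review."""
    pairs = []
    for english_word in BILINGUAL.get(arabic_word, []):
        for synset in ENGLISH_SYNSETS.get(english_word, []):
            pair = (arabic_word, synset)
            if pair not in pairs:  # the same synset may be reached via two translations
                pairs.append(pair)
    return pairs
```

The output is only a list of candidates; in the approach described on the slide, a lexicographer accepts or rejects each one.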
Arabic WordNet Approach. BCs: covering EWN & Balkanet Base Concepts. Filling gaps: building Arabic-specific synsets, covering domain-specific synsets, adding NEs. (Semi-)automatic extensions: heuristic-based, Bayesian networks
Arabic WordNet Resources used. LOGOS database of Arabic verbs: contains 944 fully conjugated Arabic verbs. Bilingual (Arabic-English) dictionaries: NMSU bilingual Arabic-English lexicon: Salmoné University of Barcelona Effel. Corpora: Arabic GigaWord Corpus (from LDC), UN (2000-2002) bilingual Arabic-English Corpus (from LDC).
Arabic WordNet Representation: database (implemented in MySQL), interchange format (XML). The database structure comprises four principal entity types: item, word, form and link.
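A minimal sketch of a four-table layout follows. Only the table names item, word, form and link come from the slide. The column names are hypothetical, and so is the reading of item as a concept, word as a word sense, form as a derived form such as a root, and link as a relation between items. SQLite stands in for MySQL here so the example runs without a server.

```python
import sqlite3

# Hypothetical column names; only the four table names come from the slide.
SCHEMA = """
CREATE TABLE item (item_id TEXT PRIMARY KEY, pos TEXT);
CREATE TABLE word (word_id TEXT PRIMARY KEY, item_id TEXT, value TEXT);
CREATE TABLE form (word_id TEXT, kind TEXT, value TEXT);
CREATE TABLE link (kind TEXT, source_item TEXT, target_item TEXT);
"""


def build_demo():
    """Create an in-memory database with two toy synsets sharing a root."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO item VALUES (?, ?)", [("s1", "n"), ("s2", "n")])
    conn.executemany(
        "INSERT INTO word VALUES (?, ?, ?)",
        [("w1", "s1", "kitab"), ("w2", "s2", "maktaba")],
    )
    conn.executemany(
        "INSERT INTO form VALUES (?, ?, ?)",
        [("w1", "root", "ktb"), ("w2", "root", "ktb")],
    )
    conn.execute("INSERT INTO link VALUES (?, ?, ?)", ("related_to", "s1", "s2"))
    conn.commit()
    return conn


def words_with_root(conn, root):
    """Return all word values whose recorded root form equals the given root."""
    rows = conn.execute(
        "SELECT word.value FROM word "
        "JOIN form ON form.word_id = word.word_id "
        "WHERE form.kind = 'root' AND form.value = ? "
        "ORDER BY word.value",
        (root,),
    )
    return [row[0] for row in rows]
```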
AWN: What has been done Current figures (final? we hope not!). Up-to-date statistics: http://www.lsi.upc.edu/~mbertran/arabic/awn/query/sug_statistics.php. Arabic synsets: 11270. Arabic words: 23496. Synsets by POS in the DB: a 661, n 7961, r 110, v 2538. Named entities: synsets that are named entities 1142; synsets that are not named entities 10028; words in synsets that are named entities 1656
AWN: What has been done Software Lexicographer's Web Interface http://www.lsi.upc.edu/~mbertran/arabic/awn/update/synset_browse.php User's Web Interface http://www.lsi.upc.edu/~mbertran/arabic/awn/index.html The Arabic Word Spotter http://www.lsi.upc.edu/~mbertran/arabic/wwwWn7/ AWN browser http://sourceforge.net/projects/awnbrowser/ AWN to SUMO mapping including automatic generation of Arabic paraphrases of SUMO formal axioms
AWN: What could we do AWN has relatively small coverage compared with PWN, but because of the way it was built, its coverage of the most important concepts is comparable. AWN has a lower density of relations than PWN, but many uses of PWN rely only on the hypernymy/hyponymy relation, and the coverage of this relation is similar in both WNs.
AWN: What could we do AWN is fully linked to PWN and, through PWN, to many other WNs. Existing WNs, especially PWN, are used today in almost all NLP tasks that need (or involve) semantic (lexical) knowledge.
AWN: What could we do So, the moral is: USE AWN FOR NLP TASKS. For instance: look at Csomai's bibliography on the WN page. NLP tasks: IR, IE, MT, WSD, Coreference Resolution, Summarization, Textual Entailment, WN mappings, NER, NERC, Language Models, Semantic distances
AWN: What should we do AWN is far from complete: only 10,000 regular synsets; only 1,000 NE synsets; low density of relations; lack of appropriate APIs for interfacing with computer applications. So, the moral is: IMPROVE AND EXTEND AWN
AWN: What should we do Lines of improvement: Extend AWN coverage: manually; semi-automatically: heuristic-based approach, GWC 2008 (Rodríguez et al, 2008a); Bayesian networks, LREC 2008 (Rodríguez et al, 2008b). Improve the relation density: finding new relations; manually revising relations existing for other languages (especially English); using roots as a way of suggesting new relations
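One possible reading of "using roots to suggest new relations" is to group entries by shared root and propose every pair of their synsets as a candidate for manual revision. The sketch below assumes that reading. The entries (transliterated) and the function name `candidates_by_root` are invented for illustration.

```python
from itertools import combinations

# Hypothetical entries: (word, root, synset the word belongs to).
ENTRIES = [
    ("kitab", "ktb", "book.n.01"),
    ("katib", "ktb", "writer.n.01"),
    ("maktaba", "ktb", "library.n.01"),
    ("bayt", "byt", "house.n.01"),
]


def candidates_by_root(entries):
    """Propose (root, synset, synset) triples for synsets whose words share a root."""
    by_root = {}
    for _word, root, synset in entries:
        by_root.setdefault(root, []).append(synset)
    pairs = []
    for root in sorted(by_root):
        for first, second in combinations(sorted(set(by_root[root])), 2):
            pairs.append((root, first, second))
    return pairs
```

The candidates carry no relation type; deciding whether a pair is actually related, and how, is left to the lexicographer.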
AWN: What should we do Lines of improvement: Building APIs to make the database easier to use (Perl, Python, Prolog, C, Java, ...), including computation of semantic distances. Extending the coverage of NEs, e.g. from Wikipedia as a knowledge source, Citala 2009 (AlKhalifa, Rodríguez, 2009). Linking AWN with other already available resources: Wikipedia, Cyc, Geonames, ...
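As a hint of what such an API could offer, here is a sketch of one simple semantic distance: the length of the shortest path between two synsets over hypernymy links. The graph is invented toy data and `path_distance` is a hypothetical function name. Other distance measures exist and may be more appropriate.

```python
from collections import deque

# Hypothetical toy hypernymy links: (hyponym, hypernym).
HYPERNYM_EDGES = [
    ("dog", "canine"),
    ("cat", "feline"),
    ("canine", "carnivore"),
    ("feline", "carnivore"),
    ("carnivore", "mammal"),
]


def build_graph(edges):
    """Build an undirected adjacency map so paths can go up and down the hierarchy."""
    graph = {}
    for child, parent in edges:
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    return graph


def path_distance(graph, source, target):
    """Breadth-first search: number of links on the shortest path, or None."""
    if source not in graph or target not in graph:
        return None
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, distance = queue.popleft()
        if node == target:
            return distance
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None
```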
Thank you for your attention