BalkaNet project overview Dan Tufiş Dan Cristea Sofia Stamou RACAI UAIC DBLAB
Overview of the talk
- Goals and design principles
- Validation methodologies
  - Well-formedness checking (syntactic validation)
  - Cross-lingual validation of the ILI mapping (semantic validation)
- Current state of the BalkaNet wordnets
- Applications (WSD)
- Standard BalkaNet tools (VisDic, WMS)
BalkaNet
- An EU-funded project (IST-2000-29388) for the development of a (core) multilingual semantic lexicon along the principles of EuroWordNet.
- Started in September 2001; will end in August 2004.
- Languages concerned: Bulgarian, Czech, Greek, Romanian, Serbian, Turkish.
Teams
- Bulgarian – DCMB (Sofia) & PU (Plovdiv)
- Czech – FI MU (Brno)
- Greek – DBLAB (Patras, coordinator) & CTI (Athens)
- Romanian – RACAI (Bucharest) & UAIC (Iaşi)
- Turkish – Sabancı University (Istanbul)
- Memodata (Caen) – commercial partner: evaluation studies
Subcontractors
- Serbian – MATF (Belgrade)
- OTE (Athens) – industrial partner: user studies
Goals and Design Principles (1)
Goals:
- g1) at least 8000 synsets per partner;
- g2) maximal interlingual overlap (> 80%);
- g3) building tools to efficiently exploit the multilingual wordnet (the sum of the ILI-based aligned monolingual wordnets);
- g4) development of free software for the use and management of the BalkaNet wordnets;
- g5) building various applications (WSD, intelligent document indexing, CLIR, etc.).
Goals and Design Principles (2)
Design principles:
- d1) ensuring as much compatibility as possible with the EuroWordNet approach (e.g. an unstructured ILI based on Princeton WordNet);
- d2) structuring synsets (relations) inside each wordnet (lots of redundancy, but much more powerful);
- d3) keeping up with Princeton WordNet (PWN) developments;
- d4) ensuring conceptually dense wordnets.
Goals and Design Principles (3)
- d5) defining a reusable methodology for data acquisition and validation (open to further development);
- d6) linguistically motivated development (reference language resources, with human experts actively involved in all decision making and validation);
- d7) minimizing development time and costs.
Maximisation of the cross-lingual coverage (1)
- ILI = the set of PWN synsets (labelled by their offsets in the database) taken as interlingual concepts, e.g. 07766677-n, 02564241-v, 00933364-a, 00087007-b.
- The consortium selected a common set of ILI codes to be implemented for all languages. The selection took place in three steps:
  - BCS1 (essentially the BC set of EuroWordNet): 1218 concepts
  - BCS2: 3471 concepts
  - BCS3: 3827 concepts
Maximisation of the cross-lingual coverage (2)
Selection criteria for BCS1, BCS2 and BCS3 (8516 ILI codes in total):
- number of languages in EuroWordNet linked to an ILI code (imperative);
- conceptual density: once a concept was selected, all its ancestors (for nouns and verbs), up to the top level, were also selected (imperative); adjectives were selected so that they would typically be related to nominal concepts in the selection (be_in_state);
- language-specific criteria: each team proposed a set of concepts of interest, and the maximum intersection set among these proposals became imperative.
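The conceptual density criterion amounts to closing the selected set under the ancestor relation. A minimal sketch, assuming a hypothetical `hypernyms` dictionary that maps each concept to its direct parents; the identifiers below are placeholders, not real PWN offsets, and the function name is illustrative only:

```python
def close_under_hypernyms(selected, hypernyms):
    """Return the selection plus every ancestor reachable via hypernym links."""
    closed = set()
    stack = list(selected)
    while stack:
        code = stack.pop()
        if code in closed:
            continue  # already processed, avoids revisiting shared ancestors
        closed.add(code)
        stack.extend(hypernyms.get(code, ()))
    return closed


# Toy hierarchy (placeholder identifiers): c3 -> c2 -> c1 (top), c4 -> c1
hypernyms = {"c3-n": ["c2-n"], "c2-n": ["c1-n"], "c4-n": ["c1-n"]}
selection = close_under_hypernyms({"c3-n"}, hypernyms)
print(sorted(selection))
```

Selecting only `c3-n` pulls in `c2-n` and `c1-n`, while the unrelated sibling `c4-n` stays out of the selection.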
Synset structuring (1)
At the level of each individual wordnet:
- a common set of relations (the semantic relations), as used in PWN;
- language-specific relations (the lexical relations, such as derivative, usage_domain, region_domain).
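Synset structuring (2)
Principle of hierarchy preservation:
$$M_1^{L1} \; H^{+} \; M_2^{L1} \;\wedge\; M_1^{L1} = N_1^{L2} \;\wedge\; M_2^{L1} = N_2^{L2} \;\Rightarrow\; N_1^{L2} \; H^{+} \; N_2^{L2}$$
That is, if two synsets of language L1 are linked by a chain of hierarchical relations ($H^{+}$), the synsets of language L2 aligned with them should be linked in the same way.
This allows for importing taxonomic relations and for checking interlingual alignments. When taxonomic relations were imported, they were validated by hand.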
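The hierarchy preservation principle can be turned into an automatic check on a pair of aligned wordnets. The sketch below is only an illustration under stated assumptions: each wordnet is represented by a hypothetical dictionary of direct hypernym links and a dictionary from synset identifiers to ILI codes, and all identifiers are placeholders.

```python
def ancestors(synset, hypernyms):
    """All synsets reachable from `synset` through one or more hypernym links (H+)."""
    seen, stack = set(), list(hypernyms.get(synset, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(hypernyms.get(node, ()))
    return seen


def preservation_violations(hyper_l1, hyper_l2, ili_l1, ili_l2):
    """Pairs (N1, N2) in L2 that are not H+-related although their L1 counterparts are."""
    # Invert the L2 alignment: ILI code -> L2 synset
    l2_by_ili = {ili: syn for syn, ili in ili_l2.items()}
    violations = []
    for m1, ili1 in ili_l1.items():
        n1 = l2_by_ili.get(ili1)
        if n1 is None:
            continue  # M1 has no counterpart in L2
        n1_ancestors = ancestors(n1, hyper_l2)
        for m2 in ancestors(m1, hyper_l1):
            n2 = l2_by_ili.get(ili_l1.get(m2))
            if n2 is not None and n2 not in n1_ancestors:
                violations.append((n1, n2))
    return violations


# Toy data (placeholder identifiers)
ili_l1 = {"M1": "ili-1", "M2": "ili-2"}
ili_l2 = {"N1": "ili-1", "N2": "ili-2"}
hyper_l1 = {"M1": ["M2"]}

print(preservation_violations(hyper_l1, {}, ili_l1, ili_l2))              # link missing in L2
print(preservation_violations(hyper_l1, {"N1": ["N2"]}, ili_l1, ili_l2))  # link present in L2
```

A reported pair is either a candidate taxonomic relation to import into L2 or a hint that one of the two alignments is wrong, which is why a human validation step remains necessary.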
Keeping up with PWN developments
- When the project started, the ILI was based on PWN 1.5 (as in EuroWordNet).
- The BalkaNet ILI was updated following the new releases of PWN: PWN 1.5 => PWN 1.7.1, then PWN 1.7.1 => PWN 2.0.
- As the automatic remapping is not always deterministic, the partners manually resolved the remaining ambiguities in their wordnets.
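The split between automatic remapping and manual resolution can be sketched as follows. This is an illustration only: the mapping table, its format and the identifiers are hypothetical and do not reproduce the actual PWN mapping files.

```python
def remap(synset_ids, mapping):
    """Apply one-to-one mappings; collect everything else for manual review."""
    remapped, unresolved = {}, {}
    for old in synset_ids:
        candidates = mapping.get(old, [])
        if len(candidates) == 1:
            remapped[old] = candidates[0]   # deterministic case
        else:
            unresolved[old] = candidates    # no candidate or several candidates
    return remapped, unresolved


# Hypothetical mapping from old identifiers to candidate new identifiers
mapping = {"old-1": ["new-1"], "old-2": ["new-2a", "new-2b"]}
done, todo = remap(["old-1", "old-2", "old-3"], mapping)
print(done)
print(todo)
```

Only the entries left in `todo` would need a lexicographer's attention.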
Defining a reusable methodology for data acquisition and validation
- Each partner developed its own tools for acquisition and validation, with a commonly agreed set of functionalities.
- These tools were documented for a lay computer user.
- The language-specific tools differ mainly because of the set of language resources available to each partner; depending on those resources, each partner chose an appropriate balance between principles d6) and d7), which is the next issue.
Trading effort and development time for language centricity (1)
Each partner addressed this issue differently, depending mainly on the available manpower and language resources. For instance, if relevant (encoded) electronic dictionaries (bilingual, explanatory, synonym, antonym dictionaries, etc.) were available, the development effort concentrated to a large extent on the interlingual equivalence mappings. This approach allowed a more language-centric development (merge model).
Trading effort and development time for language centricity (2)
If reliable dictionaries other than bilingual ones (which every partner had) were not available (e.g. because the copyright holders were reluctant to release their data or to allow its use), the literals in PWN were generally translated (approximately an expand model). In this case additional effort was needed to check the translated synsets and their adequacy for the language.
Validation methodologies
- Syntactic validation (wordnet well-formedness checking)
- Semantic validation (word sense alignment in parallel corpora)
Syntactic validations
Validation of syntactically well-formed wordnets:
- compliance with the DTD for the VisDic editor
- no duplicate literals in the same synset
- no sense duplications (literal & sense number)
- valid set of semantic relations
- no dangling nodes (conceptual density)
- no loops
- valid synset identifiers
- … and many others
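Several of these checks are easy to automate. The sketch below assumes a hypothetical in-memory representation (a list of dictionaries with `id`, `literals` and `relations` keys) rather than the actual VisDic XML format, and an illustrative set of relation names. It interprets a dangling node as a relation whose target does not exist, which is only one possible reading of that check.

```python
from collections import Counter

# Illustrative relation inventory, not the project's actual list
ALLOWED_RELATIONS = {"hypernym", "holo_part", "near_antonym", "be_in_state"}


def check_wordnet(synsets):
    """Return a list of (error_type, synset_id, detail) tuples."""
    errors = []
    ids = {s["id"] for s in synsets}
    sense_owner = {}  # (literal, sense number) -> first synset that uses it
    for s in synsets:
        literal_counts = Counter(lit for lit, _ in s["literals"])
        for lit, n in literal_counts.items():
            if n > 1:
                errors.append(("duplicate_literal", s["id"], lit))
        for lit_sense in s["literals"]:
            owner = sense_owner.setdefault(lit_sense, s["id"])
            if owner != s["id"]:
                errors.append(("duplicate_sense", s["id"], lit_sense))
        for rel, target in s["relations"]:
            if rel not in ALLOWED_RELATIONS:
                errors.append(("invalid_relation", s["id"], rel))
            if target not in ids:
                errors.append(("dangling_target", s["id"], target))
    return errors


def has_hypernym_loop(synsets):
    """Depth-first search for a cycle along hypernym links."""
    parents = {s["id"]: [t for r, t in s["relations"] if r == "hypernym"]
               for s in synsets}
    state = {}  # 1 = on the current path, 2 = fully explored

    def visit(node):
        if state.get(node) == 1:
            return True
        if state.get(node) == 2:
            return False
        state[node] = 1
        found = any(visit(p) for p in parents.get(node, ()))
        state[node] = 2
        return found

    return any(visit(n) for n in parents)


# Toy wordnet containing one error of each kind
synsets = [
    {"id": "s1", "literals": [("lamp", 1), ("lamp", 2)], "relations": [("hypernym", "s2")]},
    {"id": "s2", "literals": [("device", 1)], "relations": [("hypernym", "s1")]},
    {"id": "s3", "literals": [("device", 1)], "relations": [("likes", "s9")]},
]
errors = check_wordnet(synsets)
for error in errors:
    print(error)
print("loop:", has_hypernym_loop(synsets))
```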
Consistency checking
Sense conflicts (the same literal & sense label in two or more synsets) are either:
- easy to solve (obvious human errors in sense assignment), or
- hard to solve (they provide evidence that some WordNet sense distinctions are hard to make in other languages, and hints for ILI soft clustering).
Cross-lingual validation of the ILI mapping
- A bilingual lexicon might say $TR(w^{L1}) = w_1^{L2}, w_2^{L2}, \ldots$ (not enough).
- A lexical alignment process might give contextual translation information: the $m$-th word in language L1 ($w_m^{L1}$) is translated by the $n$-th word in language L2 ($w_n^{L2}$).
- (step 1) $\text{TR-EQ}(w_m^{L1}) = w_n^{L2}$ (not enough, but better).
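Cross-lingual validation of the ILI mapping
- A sense clustering procedure might give information on similar senses of different occurrences of the same word (step 2):
  $\mathrm{Sense}(\mathrm{Occ}(w_i^{L1}, p), \mathrm{Occ}(w_i^{L1}, q), \ldots) = \alpha$
  $\mathrm{Sense}(\mathrm{Occ}(w_j^{L2}, m), \mathrm{Occ}(w_j^{L2}, n), \ldots) = \beta$
  with $\alpha = ?$, $\beta = ?$ (sense labeling).
- (step 3) $\mathrm{synset}(w_i^{L1})$ TR-EQV $\mathrm{synset}(w_j^{L2})$, where $\alpha$ and $\beta$ are ILI codes (ideally $\alpha = \beta$).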
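Cross-lingual validation of the ILI mapping (idealistic view)
$$\mathrm{Translation}(W_i^{L1}) = W_j^{L2} \;\Rightarrow\; \exists\, Syn_1^{L1}, Syn_2^{L2} \text{ such that } W_i^{L1} \in Syn_1^{L1},\; W_j^{L2} \in Syn_2^{L2} \text{ and } \text{EQ-SYN}(Syn_1^{L1}) = \text{EQ-SYN}(Syn_2^{L2}) = ILI_k$$
[Diagram: $W_i^{L1}$ in WN1 and $W_j^{L2}$ in WN2 are linked by TR-EQ; EQ-SYN links connect both wordnets to $ILI_k$ in the ILI.]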
Cross-lingual validation of the ILI mapping (more realistic view)
[Diagram: $W_i^{L1}$ in WN1 and $W_j^{Lk}$ in WN2 are linked by TR-EQ; EQ-SYN links connect both wordnets to the ILI.]
Checking interlingual mappings by translations in parallel corpora
Sense assignment example (I): the aligned words are ‘lamp’ and ‘lampă’. The common sense of lamp and lampă is ENG20-03500733-n, which corresponds to lamp(2) and lampă(1).
Checking interlingual mappings by translations in parallel corpora
Sense assignment example (II): the aligned words are ‘lamp’ and ‘felinar’. The closest conceptual match for lamp and felinar is the pair ENG20-03500372-n and ENG20-03505057-n, which corresponds to lamp(1) and felinar(1).
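The first step of this sense assignment, looking for an ILI code shared by the two aligned words, can be sketched as below. The toy dictionaries contain only the senses named in the two examples, so they are not the full sense inventories of these words, and the function name is hypothetical. The fallback to the closest conceptual match needs the ILI hierarchy and is not implemented here.

```python
def common_senses(word_l1, word_l2, wn_l1, wn_l2):
    """Return (ILI code, sense number in L1, sense number in L2) for shared ILI codes."""
    senses_l1 = wn_l1.get(word_l1, {})
    senses_l2 = wn_l2.get(word_l2, {})
    return sorted(
        (ili_1, s1, s2)
        for s1, ili_1 in senses_l1.items()
        for s2, ili_2 in senses_l2.items()
        if ili_1 == ili_2
    )


# Sense number -> ILI code, restricted to the senses mentioned in the examples
wn_en = {"lamp": {1: "ENG20-03500372-n", 2: "ENG20-03500733-n"}}
wn_ro = {
    "lampă": {1: "ENG20-03500733-n"},
    "felinar": {1: "ENG20-03505057-n"},
}

print(common_senses("lamp", "lampă", wn_en, wn_ro))    # shared code: lamp(2), lampă(1)
print(common_senses("lamp", "felinar", wn_en, wn_ro))  # no shared code: fallback needed
```

An empty result, as for lamp and felinar, signals either a gap in one of the wordnets or a translation that is not sense-preserving, which is the kind of case that is handed to a human validator.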
Current status of the BalkaNet wordnets (2)
Language  Synsets  Literals  Senses  Avg. synset length  Avg. senses/literal
BG        15007    20431     26821   1.79                1.31
CZ        26525    28892     39617   1.49                1.37
GR        15781    15756     20989   1.33                1.33
RO        14407    16080     27696   1.92                1.72
SR        4772     6464      8237    1.73                1.27
TR        10280    11581     15588   1.52                1.35
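The two average columns appear to be derived from the count columns: average synset length as senses divided by synsets, and average senses per literal as senses divided by literals. This reading is an assumption, but it reproduces the figures reported for every language. A short check, with a hypothetical helper name:

```python
# Counts copied from the status table: (synsets, literals, senses)
counts = {
    "BG": (15007, 20431, 26821),
    "CZ": (26525, 28892, 39617),
    "GR": (15781, 15756, 20989),
    "RO": (14407, 16080, 27696),
    "SR": (4772, 6464, 8237),
    "TR": (10280, 11581, 15588),
}


def averages(synsets, literals, senses):
    """Return (avg. synset length, avg. senses per literal), rounded to 2 decimals."""
    return round(senses / synsets, 2), round(senses / literals, 2)


for lang, row in counts.items():
    print(lang, *averages(*row))
```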
BalkaNet Common Set coverage
Language  BCS1  BCS2  BCS3  BCS total
BG        1218  3471  3827  8516
CZ        1218  3471  3506  8195
GR        1218  3463  1252  5933
RO        1218  3471  3795  8484
TR        1218  3471  3827  8516
Cross-lingual coverage (2)
Languages  CZ     GR    RO     TR
BG         12682  7250  11489  9076
CZ         -      8871  12391  8755
GR         -      -     7336   6652
RO         -      -     -      9076
BG ∩ CZ ∩ RO = 10688
BG ∩ CZ ∩ GR ∩ RO ∩ TR = 6035 (75.43%)
We hope to reach 100% by the end of the project!
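Because every synset carries an ILI code, coverage figures of this kind reduce to set intersections over the codes implemented in each wordnet. A minimal sketch with placeholder codes and hypothetical function names; it does not use the project's data:

```python
from itertools import combinations


def pairwise_overlap(wordnets):
    """Number of ILI codes shared by each pair of wordnets."""
    return {
        (a, b): len(wordnets[a] & wordnets[b])
        for a, b in combinations(sorted(wordnets), 2)
    }


def common_core(wordnets):
    """ILI codes implemented in every wordnet."""
    return set.intersection(*wordnets.values())


# Placeholder ILI codes per language
wordnets = {
    "BG": {"i1", "i2", "i3"},
    "CZ": {"i2", "i3", "i4"},
    "RO": {"i3", "i4"},
}
print(pairwise_overlap(wordnets))
print(common_core(wordnets))
```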
Applications (WSD)
WSDtool (presented in the morning session):
- Initially designed as a tool for semantic validation of the BalkaNet wordnets (the interactive regime).
- In the autonomous regime, WSDtool works as a word-sense disambiguator based on parallel corpora.
- For the WSD task, it was evaluated on the EN-RO bitext (the “1984” parallel corpus).
Applications (WSD)
- The sense labels assigned to words in both parts of the bitext are ILI codes.
- Very promising results: for a set of 211 target words with 1411 occurrences in the parallel corpus, the accuracy was above 80%.
- User-friendly interface.