
1 Exploiting Multilinguality in Developing Training Data for Statistics-Based NLP
Dan Tufiş RACAI - Institute for Artificial Intelligence, Bucharest NATO-ASI, October, 2007, Batumi - Georgia

2 Overview of the talks (I: approx. 8h)
The BLARK and ELARK concepts
Monolingual and multilingual BLARK/ELARK
Morpho-Syntactic Tagging
Tagset design; case study: tiered tagging
Creating training data; validation and correction of the training data
Gold standards, tagset mapping & cross-tagging
Combining language models for further improvements

3 Overview of the talks (II: approx. 8h)
Brief notes on lemmatization, chunking and dependency linking
Alignment:
Sentence alignment
Word alignment; case study: reified alignment
Multilingual lexical repository alignment; case study: wordnets (EuroWordNet, BalkaNet)

4 Overview of the talks (III: approx. 18h)
Applications:
Exploiting annotations in one part of a parallel text and inducing similar annotations in the other part of the bitext; frame structures and grammar induction
Semantic validation of parallel semantic lexicons
Checking term translation consistency
Automatic extraction of bilingual dictionaries (words and MWEs)
Cross-lingual and monolingual Question Answering
Statistical Machine Translation systems

5 BLARK: Tokenisation Identifying the lexical units in arbitrary texts;
Copes with multiword unit recognition (look_for, with_respect_to, used_to, back_and_forth, etc.), splitting of multi-token concatenations (e.g. cliticization: damelo = da+me+lo; dându-mi-se = dându+mi+se), abbreviations, and the interpretation of non-alphabetic tokens ($100, 5/4/2005, etc.)
The difficulty of this task is language dependent.
Various tools for doing the proper job:
- language-dependent tools (not very attractive, but possibly more efficient)
- multilanguage tools, which need language-specific resources; an example: MtSeg
Named Entity Recognition (NER): “White House”, “Standard & Poor’s”, etc.

6 An example: MTSeg
There are three input formats: plain, normalized SGML and tabular. We will use the plain format. Consider “infile” containing the plain (Romanian) text:
Într-un cuvânt, acesta este un exemplu.
The segmenter can be invoked in three ways, depending on the input format; for plain text:
mtseg -lang ro -input plain <infile >ofile

7 Output format (black&red = full format; red = filtered format)
[CHUNK <DIV FROM="1">;
(PAR <P FROM="1">;
(SENT <S>;
1\ TOK Într
1\ PROC un
1\ TOK cuvânt
1\ PUNCT ,
1\ TOK acesta
1\ TOK este
1\ TOK un
1\ TOK exemplu
1\ PTERM_P .
)SENT </S>;
)PAR </P>
]CHUNK </DIV>;

8 BLARK: MSD/POS Tagging
Given an input tokenized string token_1 token_2 … token_k, produce token_1/description_1 token_2/description_2 … token_k/description_k. This process, which we would like to be as fast and accurate as possible, is generically called tagging. If the description is done in terms of morpho-syntactic properties, it is called morpho-syntactic tagging (MS-tagging) or (less accurately) POS-tagging.

9 The descriptions are called morpho-syntactic descriptors, MSDs (or simply MS-tags); the set of all tags needed to describe the words in a lexicon is called a tagset. Tagsets may have various granularities: the finer the granularity, the larger the number of tags. Tagset design is a language-specific activity, but … to promote multilinguality, we need standardized ways of describing tagsets. There are various initiatives towards the standardization of MSDs; one of the most influential is EAGLES (see “Synopsis and comparison of morpho-syntactic phenomena encoded in lexicons and corpora. A common proposal and applications to European languages”).

10 Part-of-Speech Code Attributes
___________________________________________
Part-of-Speech   Code   Attributes
Noun             N      10
Verb             V      15
Adjective        A      12
Pronoun          P      17
Determiner       D      10
Article          T
Adverb           R
Adposition       S      4
Conjunction      C      7
Numeral          M      12
Interjection     I      2
Residual         X      0
Abbreviation     Y      5
Particle         Q

11 Verb description in MULTEXT-EAST
P   Attribute   Values (code)
1   Type        main (m), auxiliary (a), modal (o), copula (c), base (b; l.s.)
2   VForm       i, s, m, c, n, p, g, u, t (l.s.), q (l.s.)
3   Tense       present (p), i, f, l (l.s.), a (l.s.)
4   Person      first (1), second (2), third (3)
5   Number      singular (s), plural (p), dual (d; l.s.)
6   Gender      masculine (m), feminine (f), neuter (n)
7   Voice       active (a), passive (p)
8   Negative    no (n), yes (y)
9   Definite    no (n), short_art (s; l.s.), full_art (f; l.s.)
10  Clitic      no (n), …
11  Case        n, g, d, a, l, i, x, 2, e, 4, 5
12  Animate     no (n), …
13  Clitic_s    no (n), …
(l.s. = language specific)
Example: Vmmp2s--y (“don’t sing!”)

12 In practice, the learning program builds a data structure called a Language Model (LM), which is fed into an interpretation program (the tagger proper) that produces the tagged text.
Training: THE GORY DETAILS
3-gram HMM = (S, P, A, B)
S = finite set of states (a state corresponds to a sequence of 2 tags)
P = initial state probabilities
A = transition matrix (probabilities of moving from state s_ij to s_jk)
B = emission matrix (lexical probabilities - the lexicon)
A_ijk = probability of moving from state s_ij to s_jk
B_ij(w) = probability of emitting w in state s_ij: P(w | s_ij)
The estimation of the LM parameters is usually done by a method called EM (expectation-maximization), or the Baum-Welch algorithm, which chooses the maximum likelihood model parameters (a self-organising technique).
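As a minimal illustration of the data structures above (not the actual RACAI implementation; the function name train_lm and the corpus layout are assumptions), the following Python sketch counts transitions and emissions from a tagged corpus and turns them into trigram HMM language model probabilities:

    from collections import defaultdict

    def train_lm(tagged_sentences):
        # tagged_sentences: list of sentences, each a list of (word, tag) pairs.
        # A state is a pair of consecutive tags; returns (P, A, B) as probability dicts.
        init = defaultdict(int)        # P: counts of the first state of each sentence
        trans = defaultdict(int)       # A: counts of (previous_state, next_tag)
        emit = defaultdict(int)        # B: counts of (state, word)
        state_total = defaultdict(int)
        emit_total = defaultdict(int)
        for sent in tagged_sentences:
            if not sent:
                continue
            tags = ["<s>"] + [t for _, t in sent]
            words = [w for w, _ in sent]
            states = [(tags[i], tags[i + 1]) for i in range(len(words))]
            init[states[0]] += 1
            for i, (state, word) in enumerate(zip(states, words)):
                emit[(state, word)] += 1
                emit_total[state] += 1
                if i > 0:
                    trans[(states[i - 1], state[1])] += 1   # move from s_ij to s_jk
                    state_total[states[i - 1]] += 1
        n = float(len(tagged_sentences))
        P = {s: c / n for s, c in init.items()}
        A = {k: c / state_total[k[0]] for k, c in trans.items()}
        B = {k: c / emit_total[k[0]] for k, c in emit.items()}
        return P, A, B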

13 Tagging: an example
plain text:
“Senatul este în vacanţă extraordinară", spunea un parlamentar şi uimit, şi amuzat de ceea ce se întâmplă. ...
tokenized text:
” Senatul este în vacanţă extraordinară ” , spunea un parlamentar şi uimit , şi amuzat de ceea_ce se întâmplă . ...
tagged text:
”/DBLQ Senatul/NSRY este/V3 în/S vacanţă/NSRN extraordinară/ASN ”/DBLQ ,/COMMA spunea/V3 un/TSR parlamentar/NSN şi/CR uimit/ASN ,/COMMA şi/CR amuzat/ASN de/S ceea_ce/RELR se/PXA întâmplă/V3 ./PERIOD

14 Morpho-Syntactic Tagging: the Hidden Side of the Story
Tagset design issues and training data cleaning-up:
- cross-lingual standardization
- richness versus distributional adequacy and training data availability
- lexical tagsets vs. corpus tagsets
- reducing lexical tagset cardinality: information-lossless reduction; controlled information-loss reduction and recovering the left-out information
- validation and correction of training data
Tagset mapping:
- biased tagging
- direct tagging vs. cross tagging
- double cross tagging & improvement of the original annotations
- case study: SemCor

15 A) Designing Tagsets: Lexical Tagsets
Lexical tagsets: encodings of the morpho-lexical properties of the lexical stock.
Word-form lexicons (usually very large) contain the paradigmatic families of the lemmas they cover; each member of a paradigmatic family is associated with its representative lemma and a comprehensive description of the morpho-lexical properties of the word-form in question.
The need for standardization in the multilingual electronic world generated various encoding proposals.

16 Lexical tagsets (II) The benefits of standardized morpho-lexical descriptions are manifold; one of the advantages is the possibility of using common tools for corpus tagset generation. One of the most influential proposals is EAGLES/ISLE, further extended by MULTEXT and MULTEXT-East (…/morphsyn/morphsyn.html). Highly inflectional languages have large lexical tagsets (2000, 3000 or even 4000 tags are not unusual).

17 Lexical tagsets (III) Fortunately, not every feature-value combination is possible; still, the lexical tagsets, although maximally informative, are hardly adequate for a statistics-based approach to automatic POS-tagging (in spite of some attempts to do so). Supervised learning methods (the most accurate) would be hampered by training data sparseness.
Some fundamental observations:
- feature values are not independent of each other within a legal combination (lexical tag)
- some features (and all their values) are insensitive to morpho-syntactic distribution and as such cannot be reliably distinguished by distributional analysis methods
A proper design process can reduce the lexical tagsets to manageable corpus tagsets, much more appropriate for fast and accurate POS-tagging of large documents.

18 Corpus tagsets
The quality of the tagging process depends on many factors:
The quantity and quality of the training data: insufficient discriminative examples for each considered class are very harmful to the reliability of the classification system; a training corpus containing errors induces wrong generalizations in the LM. If the learner is given wrong examples, it will learn bad classifications.
The adequacy of the tagset: a poorly designed tagset will destroy the performance of any tagger; on the contrary, a good tagset will allow even a simple-minded tagger to get reasonable results. A too small tagset (say, only parts of speech) will be too coarse and won’t allow the learner to abstract over various collocational restrictions. A too large tagset will most likely generate a data sparseness problem; there is a non-linear relation between the tagset size and the size of the needed training corpora.

19 Good practices:
- leaving out attributes/features that are irrelevant for word distribution (animate/inanimate), require more context or higher-level knowledge (transitive/intransitive), or are recoverable from the form of the word; this is particularly relevant for inflectional languages. The PennTB tagset omits a number of the distinctions made in the LOB and Brown tagsets (on which it is based); M. Marcus et al., 1994, CL(19/2)
- morpho-lexical syncretism is also a source of noise (case system)
- clustering tags without information loss: eliminate attributes without reducing the cardinality of any lexical item’s ambiguity class; Thorsten Brants, 1995 (ACL’95)
- “portmanteau” tags = tags assigned to one or a few words with idiosyncratic ambiguity classes (e.g. the English LOB and Brown tagsets use different tags for the auxiliaries “be”, “do” and “have”; Roger Garside et al., 1987)

20 - observing frequent tagging errors: this is a signal that limited distributional analysis is insufficient for discriminating among the frequently confused tags; in such a case, merge the tags. With the LOB and Brown tagsets, one of the most frequent errors is mistagging a word as a subordinating conjunction (CS) rather than a preposition and vice-versa; the PennTB uses a single tag for both cases, leaving the resolution - if required - to other processes; E. Macklovitch, 1992 (4th ANLP)
- using multiple layers of tagging: tiered tagging; D. Tufiş, 1999 (TSD), 2000 (LREC), 2004 (LREC)
Building training corpora is expensive (hand-made), error-prone, boring and time consuming. Fortunately, there are some effective ways to detect most of the unsystematic human errors!

21 Biased evaluation of the taggers is the simplest way to catch such errors:
a) train the tagger on the training corpus
b) tag the same corpus with the learnt language model
c) compare the hand-tagged and the machine-tagged versions: the non-systematic errors in the hand-tagged corpus are very likely to show up.
There are several other ways to spot the human-made errors in training corpora (“gold standards”): Geoffrey Leech, David Elworthy (1994), Simone Teufel (1995), Jean-Pierre Chanod, Tapanainen (1995), Hans van Halteren (2000), Karel Oliva (2001), Yuji Matsumoto (2002), Dickinson, M., Meurers, W. D. (2003), etc.
We will discuss a recent approach (“Cross Tagging”, Pîrvan & Tufiş (2006)) and its results on a well-known corpus: SemCor (Brown).

22 Tiered Tagging (TT)
Given:
- a large lexical tagset (MSD), a properly reduced corpus tagset (TTAG) and a mapping between the two tagsets (MAP)
- a training corpus (TC) annotated in terms of MSD
TT:
Training:
- transform TC into TC’, where each tag from MSD is replaced by the corresponding tag from TTAG (using MAP)
- build a LM from TC’
Tagging:
- any new text T is tagged as T_TTAG by means of the LM built from TC’
- T_TTAG is transformed into T_MSD, recovering (RECOVER) the information contained in MSD but absent from TTAG
Critical elements: TTAG, MAP, RECOVER (a pipeline sketch follows)
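To make the two-step flow concrete, here is a minimal Python sketch of the tiered tagging pipeline, under the assumptions that MAP is a dict from MSD tags to corpus tags, that RECOVER resolves the (rare) ambiguous inverse mappings, and that train/tag are any tagger's training and tagging callables; all names are hypothetical.

    def tiered_tag(training_corpus, new_text, MAP, train, tag, recover):
        # training_corpus: sentences of (word, msd_tag) pairs; new_text: lists of words.
        # 1. Training: project the MSD annotation onto the reduced corpus tagset (TTAG).
        tc_prime = [[(w, MAP[msd]) for w, msd in sent] for sent in training_corpus]
        lm = train(tc_prime)                      # LM over the small TTAG tagset

        # Inverse map: each TTAG tag corresponds to one or more MSD tags.
        inverse_map = {}
        for msd, ttag in MAP.items():
            inverse_map.setdefault(ttag, set()).add(msd)

        # 2. Tagging: tag with TTAG, then recover the full MSD annotation.
        tagged = []
        for sent in new_text:
            ttag_sent = tag(lm, sent)             # [(word, ttag), ...]
            msd_sent = []
            for word, ttag in ttag_sent:
                candidates = inverse_map[ttag]
                if len(candidates) == 1:          # deterministic recovery (BTTAG case)
                    msd_sent.append((word, next(iter(candidates))))
                else:                             # RECOVER disambiguates using context/lexicon
                    msd_sent.append((word, recover(word, ttag, candidates, msd_sent)))
            tagged.append(msd_sent)
        return tagged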

23 Properties of the TTAG and of the MAP mapping
TTAG design
If the cardinality of the intersection IMAP(T_i) ∩ AMB_MSD(w_k) is always equal to 1 (EQ1), then recoverability is 100% and deterministic; in this case TTAG is called a Baseline Tiered Tagging tagset (BTTAG) and MAP is an information-lossless transformation of the MSD tagset into TTAG; MAP takes care of feature-value redundancy elimination.

24 TTAG design (cntd.) A baseline TTAG may significantly diminish the data sparseness threat induced by a large MSD tagset; it can be constructed automatically! But for a given MSD tagset one can find many different BTTAGs, so one needs additional criteria to select one BTTAG (e.g. “minimal cardinality”, “best performing induced LM”, etc.).
In the general case of TT, the RECOVER procedure is non-deterministic (unless additional knowledge sources are used), but the reduction of the MSD cardinality is more significant than in the deterministic case (BTTAG).
RECOVER may be either a rule-based disambiguation procedure for the limited number of situations where the intersection in EQ1 is not 1 (needs human expertise, is language dependent, relies on a word-form lexicon; LREC 2000/2002) or an ME-based procedure (language independent, needs training data for high accuracy, can work w/wo a word-form lexicon; ESSLLI 2006).

25 Building an “optimal” BTTAG
If the MSD tagset contains a lot of redundant information, an “optimal” BTTAG could suffice for TT.
Deriving an “optimal” BTTAG is a highly intensive computation (it could take several days of running time on an average PC), but it is done only once; it pays off!

26 An algorithm for BTTAG (LREC 2000)
extract all ambiguity classes from the MSD-lexicon
  ;e.g. MSD-ACj = (Ncfp-n Ncfson Vmis3s Vmm-2s Vmnp)
for each ambiguity class ACi
  preserve only the intra-categorical ambiguities ICAi
  ;e.g. ICAj1 = (Ncfp-n Ncfson), ICAj2 = (Vmis3s Vmm-2s Vmnp)
endfor
for each ICAi repeat
  for each MSDij repeat
    for each attribute Ak in MSDij repeat
      if eliminating Ak does not reduce the cardinality of any of the ICAs
        then mark Ak as removable
      endif
    endfor
  endfor
endfor
for all Ak marked as removable
  compute the sets of attributes whose collective removal would still preserve the cardinality of all ICAs (not a unique solution) and generate the BTTAGs
endfor
choose the “optimal” BTTAG
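A minimal Python sketch of the attribute-removal idea above, assuming MSD tags are strings whose positions encode attribute values ('-' meaning unspecified) and that the ambiguity classes have already been split into intra-categorical ones; all names and the toy data are illustrative, not the actual LREC 2000 implementation.

    def removable_attributes(icas, n_attributes):
        # icas: list of sets of MSD tags belonging to the same category.
        # Returns the attribute positions whose individual removal leaves the
        # cardinality of every intra-categorical ambiguity class unchanged.
        def drop(msd, pos):
            # Neutralize the attribute at position pos (position 0 is the category).
            return msd[:pos] + "-" + msd[pos + 1:] if pos < len(msd) else msd

        removable = []
        for pos in range(1, n_attributes + 1):
            if all(len({drop(m, pos) for m in ica}) == len(ica) for ica in icas):
                removable.append(pos)
        return removable

    # Toy usage with two ICAs taken from the example above:
    icas = [{"Ncfp-n", "Ncfson"}, {"Vmis3s", "Vmm-2s", "Vmnp"}]
    print(removable_attributes(icas, 6))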

27 Step 5: the “optimal” BTTAG (LREC2004):
set Best-Accuracy = prec(MSD); MinSize = size(MSD); MinSize4MSDprec = size(MSD)
for each BTTAG do
  turn the MSD tags of the corpus into BTTAG tags (deterministic)
  use randomly 90% of the data for training and 10% for evaluation
  compute the Average-Accuracy of a ten-fold validation run
  if Average-Accuracy > Best-Accuracy then Best-Accuracy <- Average-Accuracy endif
  if size(BTTAG) < MinSize then MinSize <- size(BTTAG) endif
  if (Average-Accuracy >= prec(MSD) and size(BTTAG) < MinSize4MSDprec)
    then MinSize4MSDprec <- size(BTTAG) endif
endfor

28 The “optimal” BTTAG (cntd.)
Determining the “optimal” BTTAG:
- is language independent and does not require human intervention (thus no language skills are needed);
- requires a corpus annotated in terms of MSDs and word-form lexica, as delivered by the MULTEXT-EAST project for various languages.
The multilingual corpus we used is “1984”; we made experiments for Czech, English, Estonian, Hungarian, Romanian and Slovene.

29 Results and evaluation for various languages and BTTAGs
(no external lexicon; the tagger’s lexicon is learnt from the training corpus)

Lang   MSD: #msds / Prec   Smallest BTTAG for (~)MSD Prec: #tags / Prec   Best Prec BTTAG: #tags / Prec   BTTAG closest to MSD Prec: #tags / Prec
RO     615  / 95.8         56  / 95.1                                     174 / 96.1                      81  /
SI     2083 / 90.3         385 / 89.7                                     691 / 90.9                      585 / 90.4
HU     618  / 94.4         44  / 94.7                                     84  / 95.0
EN     133  / 95.5         45  / 95.0                                     52  / 95.6
CZ     1428 / 89.0         299 /                                          735 / 90.2                      319 / 89.2
ET     639  / 93.0         208 / 92.8                                     335 / 93.5                      246 / 93.1

BTTAGs for the 6 languages are available at:

30 BTTAG is not the Hidden Tagset of Tiered Tagging
The full recoverability property of a BTTAG does not necessarily ensure a spectacular improvement of the language model's accuracy; it significantly improves its robustness on new data because it may drastically reduce data sparseness.
To take full advantage of the Tiered Tagging methodology, one should go from the BTTAG to a Hidden Tagging Tagset (HTTAG).

31 HTTAG & RECOVER Converting MSD into HTTAG is an information-loss transformation! The MAP mapping is not deterministic; it will sometimes replace one HTTAG tag with two or (rarely) more MSD tags, in which case a RECOVER procedure is necessary.
The initial RECOVER procedure in TT was a rule-based module; writing appropriate rules requires language skills; the local grammars focus on the ambiguities remaining in IMAP(T_i) ∩ AMB_MSD(w_k), computed for each tagged word; it works only for words in the tagger’s lexicon (AMB_MSD(w_k) is read from the lexicon).

32 RULE-based RECOVER Ex.: the rule to distinguish between determiners and pronouns (Romanian):
Ps|Ds {Ds. : (-1 Ncy) || (-1 Af.y) || (-1 Mo.y) || (-2 Af.n and -1 Ts) || (-2 Ncn and -1 Ts) || (-2 Np and -1 Ts) || (-2 D.. and -1 Ts)
Ps. : true}
The reading is as follows: choose the determiner interpretation when the previous word is tagged as a definite noun, a definite adjective or a definite (ordinal) numeral, or when the previous two words are tagged as an indefinite noun followed by a possessive article, a proper noun followed by a possessive article, or a determiner followed by a possessive article. Choose the pronoun interpretation if none of the above holds.

33 ME-based RECOVER (I) The main disadvantage of the rule-based RECOVER approach is that it works only for words present in the MSD lexicon (the one from which the hidden tagset was derived). The effort of writing the disambiguation rules may also be invoked (but it is not really a big deal).
The ME-based RECOVER algorithm uses contextual predicates relating feature values of the current wordform and feature values of the tags in context to decide on the current feature values. The tagset converter uses more contextual information than a usual tagger (it runs on the already tagged corpus).
If the word w is not in the MSD lexicon, the MSD_k tag that the model predicts may not be among IMAP(T_i) = {MSD_1 … MSD_k-1}. In this case, the MSD_k tag that the model predicted is taken into account in the K-breadth-first search. This way, the converter can correct tagging errors on unknown words.

34 ME-based RECOVER Builds on SharpEntropy (Richard Northedge). It uses an a-priori mapping list containing the complete correspondences between the HTTAG and MSD tagsets. Based on this additional resource, the tagset converter generates, with high accuracy, MSD tags even for unknown or partially known words (i.e. words either missing from the learnt lexicon or learnt with an incomplete ambiguity class).

35 Contextual predicates: fi(tag, history)
ME framework
Wordform-related features:
- character length
- prefix (1-2)
- suffix (1-4)
- upper case (all, initial)
- is abbreviation
- has underscore
- has number
- hyphen position (start, middle, end, none)
Context-related features:
- previous MSD features
- previous MSD 1-gram, 2-gram, 3-gram
- previous Ctag 1-gram and 2-gram
- next Ctag 1-gram and 2-gram
- end-of-sentence punctuation mark
The ME-based tagging platform ensures all the Tiered Tagging steps and complements our previous HMM-based platform (a feature-extraction sketch follows).
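As an illustration only (not the actual RACAI feature set nor the SharpEntropy API), a Python sketch of contextual predicates f_i(tag, history) of the kind listed above; the layout of history (tokens, previously assigned tags, current position) is an assumption.

    def contextual_predicates(history):
        # history: dict with 'tokens' (word forms), 'tags' (tags assigned so far)
        # and 'i' (index of the current token). Returns string-valued features.
        tokens, tags, i = history["tokens"], history["tags"], history["i"]
        w = tokens[i]
        feats = {
            "len": str(len(w)),
            "prefix1": w[:1], "prefix2": w[:2],
            "suffix2": w[-2:], "suffix4": w[-4:],
            "all_upper": str(w.isupper()),
            "init_upper": str(w[:1].isupper()),
            "has_underscore": str("_" in w),
            "has_number": str(any(c.isdigit() for c in w)),
            "hyphen": ("start" if w.startswith("-") else
                       "end" if w.endswith("-") else
                       "middle" if "-" in w else "none"),
            # context-related predicates
            "prev_tag": tags[i - 1] if i > 0 else "<s>",
            "prev_bigram": "+".join(tags[max(0, i - 2):i]) or "<s>",
            "eos_punct": str(i + 1 < len(tokens) and tokens[i + 1] in {".", "!", "?"}),
        }
        return [f"{k}={v}" for k, v in feats.items()]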

36 Accuracy of the ME-based RECOVER
Unknown word accuracy: 95.20%
Total word accuracy without word-form lexicon: 98.66%
Total word accuracy with word-form lexicon: 99.04%

37 Evaluation of the HMM tagging accuracy
Tagging method (HMM)                               MSD-tagger   Tiered tagging   Ctag-tagger
Unknown word accuracy without word-form lexicon    77.96%       78.57%           81.93%
Total word accuracy without word-form lexicon      95.79%       96.08%           96.78%
Total word accuracy with word-form lexicon         98.01%       98.42%           98.59%
The test data is from the same register as the training data and contains few unknown words; all MSDs in the test data were seen in the training data. These facts explain the small differences between the MSD-tagger and tiered tagging. Tiered tagging is more robust to register diversity and unknown words.

38 Evaluation of the ME tagging accuracy
Tagging method (ME)                                MSD-tagger   Tiered tagging   Ctag-tagger
Unknown word accuracy without word-form lexicon    78.65%       78.76%           82.24%
Total word accuracy without word-form lexicon      96.56%       96.22%           96.81%
Total word accuracy with word-form lexicon         98.35%       98.58%           98.62%
All figures are slightly better for the ME tagger, and the validity of tiered tagging is confirmed. We expect the difference between the accuracy of tiered tagging and that of direct MSD tagging to increase on texts from other registers.

39 The TT methodology is useful not only when dealing with large tagsets, but also when the interest is only in simple grammatical distinctions (say, only part of speech). We ran an experiment using a tagset with only 14 categories (the MULTEXT-East grammar categories), and the average precision (ten-fold evaluation procedure) was very poor: 92.3%!! Using TT (with combined classifiers) the accuracy was always over 99%.
Evaluation for POS-only TT & CLAM: hidden tagset = 92 tags; final tagset = 14 tags

40 Remarks on TT The models based on HTTAG ensure not only an increase in tagging accuracy but also robustness with respect to text-type variation. Experiments on various languages (e.g. Hungarian, German) and by various researchers (Oravecz, Dienes, Hinrichs, Truschina, etc.) proved the validity of the approach. For Romanian, the average accuracy (10-fold validation on three different register corpora) is around 98.6%.

41 Validation & Correction
Requires a coherent methodology for automatically identifying as many POS-annotation and lemmatization errors as possible; we relied on three main techniques for the automatic detection of potential errors:
1. when lemmatizing the corpus, we extracted all the triples <word-form, POS tag, lemma> that were unknown to the tagger (the tagger’s lexicon included a large word-form lexicon);
2. we checked the correctness of the POS annotation for closed-class lexical categories, a technique described by (Dickinson & Meurers, 2003);
3. we exploited the Biased Evaluation Conjecture.

42 Lemmatization and re-tokenization of unknown tokens (I)
If the current token is marked by the tagger as unknown, it is checked whether its POS annotation is NP (proper noun), in which case the lemma is considered to be identical to the occurrence form of the token; sequences of NPs are joined together (separated by an underscore), with the new lemma similarly constructed. Otherwise, the current token is processed by the probabilistic lemmatizer. The result, a triple <word-form tag lemma> (together with a backward reference), is saved in the file NotInTheLexicon.
The content of the NotInTheLexicon file was classified and analyzed in decreasing order of the triples’ frequencies. It revealed multiple error patterns, and for some of these errors the correction was easy:

43 Lemmatization and re-tokenization of unknown tokens (II)
1. more than 20,000 errors due to the wrong conversion of some diacritical characters into SGML entities, or to misspellings
2. tokenization errors (misinterpretation of the period character, incomplete or incorrect specification of several frequent compounds)
3. incomplete specification of the cases where two or more consecutive numerals should have been taken together
4. web and e-mail addresses (we added NNWEB and NNMAIL to our tagset, concatenating and retagging accordingly)

44 Using closed class analysis for identifying errors (I)
Lexical categories:
- closed classes are those that are enumerable (e.g. determiners, prepositions, conjunctions, pronouns, modal verbs, or auxiliaries)
- open classes are the large, productive categories such as verbs, nouns, adjectives, adverbs
Dickinson and Meurers (2003): in the majority of the known tagsets, almost half of the tags correspond to closed-class categories of words. A closed-class category contains a reduced number of words, not very difficult to enumerate.
D&M idea: search a corpus for all occurrences of a closed-class tag and check whether each word is actually a member of its proper closed class.

45 Using closed class analysis for identifying errors (II)
Our procedure (see the sketch below):
1. extracted from the word-form lexicon a list, L1, of closed-class tags, each of them indexing the set of words that could receive that tag;
2. from this list, we computed another list, L2, containing the words in L1 that index two or more closed-class tags;
3. then, we extracted from the corpus under investigation all pairs <word tag> such that tag was a closed-class tag;
4. if word was not in the set of L1 indexed by tag, we checked the respective word occurrence in its context.
[Obs.: in the majority of cases we found a tagging error, but occasionally we also found errors in the word-form lexicon (a possible closed-class tag was not recorded for some words).]
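A minimal Python sketch of steps 1, 3 and 4 above, assuming the word-form lexicon and the annotated corpus are available as simple in-memory structures (all variable and function names are illustrative):

    def closed_class_suspects(lexicon, corpus, closed_class_tags):
        # lexicon: dict word -> set of possible tags (word-form lexicon);
        # corpus: list of (word, tag) pairs as annotated;
        # closed_class_tags: set of tags considered closed-class.
        # Returns the corpus positions whose closed-class tag is not licensed.

        # Step 1: L1 = closed-class tag -> set of words that may receive it
        L1 = {}
        for word, tags in lexicon.items():
            for tag in tags & closed_class_tags:
                L1.setdefault(tag, set()).add(word)

        # Steps 3-4: flag occurrences whose closed-class tag is not licensed for the word
        suspects = []
        for i, (word, tag) in enumerate(corpus):
            if tag in closed_class_tags and word not in L1.get(tag, set()):
                suspects.append((i, word, tag))   # to be checked in context
        return suspects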

46 Using closed class analysis for identifying errors (III)
5. based on L2, we extracted all the words that were seen in the corpus with only a subset of their possible closed-class tags.
[Obs.: some errors were again found in the word-form lexicon (words that were wrongly associated with some closed-class tags).]
- We used a few regular expressions (defined in terms of the tags surrounding each target word) to extract sentences in which the unseen tags could have been licensed.
- Although this approach was not very precise (most of the extracted sentences contained correct tags), almost two hundred new errors were corrected.

47 Using biased evaluation for better error identification (I)
Biased evaluation conjecture (Tufiş, 1999): an accurately and consistently tagged corpus, re-tagged with the language model learnt from it (biased tagging), should reproduce the initial tagging in the vast majority of cases (usually more than 98%).
Reference corpus: the version re-tokenized using the preceding techniques.

48 Procedure description
1. trained the tagger on the newly acquired corpus, building a new language model;
2. retagged the new corpus with this new language model and compared the new tagging against the reference annotation;
3. found 96.8% identically tagged tokens and extracted the differences;
4. sorted the differences by their frequency (see the sketch below);
5. the first 100 difference types (accounting on average for 8-10,000 difference occurrences) were examined in context, one by one, and the validation expert decided which of the tags was correct (if any);
[Obs.: 1. a time-consuming procedure, given the dimension of the corpus; 2. the differences are explained by inconsistent or partial corrections in the previous phases and by the modification of the context for tokens neighbouring the corrected ones.]
6. correcting all the errors discovered in the analysis of the first 100 difference types ends the procedure.
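Steps 2-4 amount to a diff between two annotations of the same token sequence, aggregated by difference type; a minimal Python sketch, under the assumption that both annotations are parallel lists of (word, tag) pairs over identical tokenization (names illustrative):

    from collections import Counter

    def difference_types(reference, retagged):
        # reference, retagged: parallel lists of (word, tag) pairs.
        # Returns the identity rate and the difference types sorted by frequency.
        assert len(reference) == len(retagged)
        diffs = Counter()
        identical = 0
        for (word, ref_tag), (_, new_tag) in zip(reference, retagged):
            if ref_tag == new_tag:
                identical += 1
            else:
                # a difference type: the word together with the two competing tags
                diffs[(word, ref_tag, new_tag)] += 1
        identity_rate = identical / len(reference)
        return identity_rate, diffs.most_common()

    # The top 100 difference types would then be inspected in context by the validator.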

49 Using biased evaluation for better error identification (III)
We repeated this procedure several times, with a continuous decrease in the number of differences. After months of analysis/correction cycles, the number of differences stabilized at around 1.2% of the entire corpus (identity approx. 98.8%). At present, there are approximately 85,000 differences (22,500 distinct) between the reference corpus and its biased-tagging version.

50 Using biased evaluation for better error identification (IV)
The analysis of the first 200 difference types (containing 168 different word-forms and accounting for 18,929 differences) outlined 96 distinct confusion pairs. For each confusion pair we constructed the set of different words affected by the respective confusion.
[Obs.: the more words affected by a confusion pair, the less worrying it is - inherent statistical noise. When a confusion pair is specific to a reduced number of words, and if these words are frequent, it might be useful to have a closer look at the respective confusion pair. We discovered a few words that were responsible for significantly more differences than the rest.]

51 Using biased evaluation for better error identification (V)
Not surprisingly, the first four frequent words responsible for almost 3,500 tag confusions are closed-class words. The weak forms of the personal pronouns (le, ne, vă, îi) show the highest error rate: out of 11,345 occurrences, 3,368 had the wrong case label (accusative instead of dative and vice-versa).
[Obs.: the correct case assignment for these pronouns is very hard when only distributional properties are taken into account.]
Another confusion, very difficult to avoid and relatively frequent in the RoCo-News corpus (210 occurrences), is vor tagged as Vm instead of Vaux, followed by words that could be interpreted either as infinitives (as they should have been) or as nouns (as they actually were in the respective contexts). Ex.: vízaN / vizáVinf, cífraN / cifráVinf, plácaN / placáVinf, recóltaN / recoltáVinf, sărbătóriN / sărbătoríVinf, táxaN / taxáVinf, adrésaN / adresáVinf, etc.

52 Conclusions We described a semi-automatic procedure by means of which we constructed a highly accurate annotated journalistic corpus for Romanian. Although it is language, tagger and tagset dependent, this approach is easy to adapt to a different setting, to other languages or other linguistic registers. The type of analysis we described gives strong indications about which words might be unreliably tagged and does not require word-by-word inspection of the corpus. It does not ensure the elimination of all existing errors, but the accuracy gain is substantial.

53 PART B: Tagset Mapping In TT, the two considered tagsets (the hidden and the lexical one) are related by a subsumption relation, and the mapping (MAP) between the tagsets is a function. A more interesting case is represented by two unrelated tagsets, such as the Penn tagset and the MULTEXT-East tagset. Why should we bother?

Because, for instance, given two corpora (gold standards), each tagged with its own tagset, we might want to:
- merge the two corpora and have the resulting corpus confidently tagged with either of the tagsets
- improve the gold standards’ tagging

55 Definitions
Biased tagging (BT): tagging a corpus with a language model learnt from the same corpus
Direct tagging (DT): tagging a corpus with a language model learnt from another corpus
Cross-tagging (CT): tagging two corpora, each already tagged with its own tagset, with the other one’s tagset, using an automatically built mapping data structure for the two tagsets

56 Cross-Tagging System Architecture
[Architecture diagram] The gold standards A_GS(X) and B_GS(Y) are used to train the language models LM(X) and LM(Y); LM(X)∘B => B_DT(X) and LM(Y)∘A => A_DT(Y) (direct tagging). The Mapping System builds the tagset MAPS and turns the direct taggings into the cross-tagged versions B_CT(X) and A_CT(Y) (legend: tagset mapping; improving the direct tagging).

57 Double Cross-Tagging (DCT)
A_GS(X) is replaced by B_CT(X) and B_GS(Y) is replaced by A_CT(Y), and the CT procedure is repeated: new models LM’(X), LM’(Y) and maps MAPS’ are built; LM’(Y)∘B => B’_DT(Y) and LM’(X)∘A => A’_DT(X), yielding A_DCT(X) and B_DCT(Y).
Compare: A_GS(X) : A_DCT(X) and B_GS(Y) : B_DCT(Y).

58 Major Claims: A_CT(X) and B_CT(Y) are more accurately tagged than A_DT(X) and B_DT(Y), respectively. By comparing A_GS(X) against A_DCT(X) and B_GS(Y) against B_DCT(Y), one can accurately identify and fix errors in the gold standards, significantly improving their quality.

59 Partial Maps, Global Map (non-lexicalized)
PSet(x_i) = { p(x_i | y_j) | y_j ∈ Y }
Y_x = { y ∈ Y | p(x|y) ∈ MSC(PSet(x)) }
PM_X = { (x, y) ∈ X×Y | y ∈ Y_x }
X_y = { x ∈ X | p(y|x) ∈ MSC(PSet(y)) }
PM_Y = { (x, y) ∈ X×Y | x ∈ X_y }
M_A(X,Y) = PM_A^X ∪ PM_A^Y ;  M_B(X,Y) = PM_B^X ∪ PM_B^Y
M(X,Y) = M_A(X,Y) ∪ M_B(X,Y)

60 Token Maps (lexicalized maps)
Several possibly correct mappings are left out of the global map, either because of insufficient data or because of the idiosyncratic behaviour of some lexical items.
Token maps are built only for the token types common to both corpora (except for hapax legomena).
The global map will be used only for the tokens without a token map (hapax legomena, tokens occurring in one corpus but not in both).
Initially the token maps are built the same way the global map was built; therefore some tags associated with a token in one corpus or the other might remain unmapped in the token’s map; these tags are subject to further processing in order to decide whether they are likely to be tagging errors or not (in the latter case a mapping will eventually be constructed).

61 Token Maps: Unmapped Tags
An unmapped tag y_i for w may mean:
a) the contexts in GS(Y) where w has been tagged y_i are dissimilar to all the contexts of w in GS(X) (the ambiguity class observed in GS(X) for w is incomplete); the tag will be mapped using the global map;
b) the tag y_i is wrongly assigned to w in GS(Y); the tag will be left unmapped.
[Diagram: the token map M_w of w, linking tags x_1 … x_4 to tags y_1 … y_3.]
If w occurs frequently in GS(X), explanation a) is unlikely; therefore in such a case the decision is that y_i is not the correct tag for w. Otherwise, the decision needs more evidence: tag sympathies!

62 Token Maps: Tag Sympathies
The sympathy between two tags from the same tagset is defined as the number of token types they commonly tag in a certain corpus; it is a generalization of the notion of ambiguity class.
M_w may be extended by the global mappings of the unmapped tag y_m iff at least one tag x_k mapped to y_m in the global map is sympathetic enough to all x_i in M_w: x_k extends the ambiguity class of w.
[Diagram: the token map M_w of w and a global-map fragment linking x_5, x_6 to y_3.]
(A sympathy-computation sketch follows.)
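A minimal Python sketch of the sympathy measure as defined above, assuming the corpus is available as a mapping from token types to the set of tags they received (names and the threshold idea are illustrative):

    def tag_sympathy(corpus_tags_by_type, tag_a, tag_b):
        # corpus_tags_by_type: dict token_type -> set of tags observed for it.
        # Sympathy = number of token types that both tags label in the corpus.
        return sum(1 for tags in corpus_tags_by_type.values()
                   if tag_a in tags and tag_b in tags)

    # x_k would be "sympathetic enough" to every x_i in the token map M_w
    # if tag_sympathy(corpus, x_k, x_i) exceeds some empirically chosen threshold.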

63 Improving the Direct Tagging: Error Identification
Let M_k be the map (either the token map M_wk or the global map M) used for token k.
If x_k is mapped and <x_k, y_k> ∉ M_k, replace y_k with the tags mapped to x_k, obtaining A*(Y); similarly for B*(X).

64 Improving the Direct Tagging: Retagging
Build trigram HMM language models from A_GS(X) and B_GS(Y), with provision for unseen ambiguity classes.
Use a Viterbi-like algorithm that is constrained in the following way: if A* says that token w_k should be tagged by one of {y_k1, ..., y_kn}, then the only legal transition states from <y_k-2, y_k-1> are <y_k-1, y_k1>, …, <y_k-1, y_kn> (see the sketch below).
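An illustrative Python sketch of such a constrained Viterbi decoder (bigram states here, for brevity, rather than the trigram model of the slide): at each position only the tags allowed by A* are expanded, which is exactly the coercion described above; trans and emit are assumed to be probability dicts, and all names are hypothetical.

    import math

    def constrained_viterbi(words, allowed_tags, trans, emit, eps=1e-9):
        # words: tokens; allowed_tags[i]: the tag candidates A* allows for words[i];
        # trans[(prev_tag, tag)], emit[(tag, word)]: probabilities (missing keys ~ eps).
        def logp(d, key):
            return math.log(d.get(key, eps))

        # delta[tag] = best log-probability of a path ending in tag; back for recovery
        delta = {t: logp(emit, (t, words[0])) for t in allowed_tags[0]}
        back = [{}]
        for i in range(1, len(words)):
            new_delta, back_i = {}, {}
            for t in allowed_tags[i]:                  # only legal states are expanded
                best_prev = max(delta, key=lambda p: delta[p] + logp(trans, (p, t)))
                new_delta[t] = (delta[best_prev] + logp(trans, (best_prev, t))
                                + logp(emit, (t, words[i])))
                back_i[t] = best_prev
            delta, back = new_delta, back + [back_i]

        # Backtrack from the best final state
        tag = max(delta, key=delta.get)
        path = [tag]
        for i in range(len(words) - 1, 0, -1):
            tag = back[i][tag]
            path.append(tag)
        return list(reversed(path))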

65 Dealing with Unseen Events
Lexical probabilities: smooth the p(w_k, x_i) probabilities using the Simple Good-Turing estimation; the probability mass reserved for unseen events is distributed so that the probability of a token w being tagged with a tag x not previously observed for w is directly proportional to the number of token types tagged x.
Contextual probabilities: linear interpolation of unigram, bigram and trigram probabilities; lambda estimation: the greater the observed frequency of an (n-1)-gram and the fewer n-grams beginning with that (n-1)-gram, the more reliable such an n-gram is.

66 Experiments: Resources
the 1984 corpus, about 120,000 tokens, automatically tagged and then human-validated, MTE tagset
the SemCor corpus, about 780,000 tokens, tagged with the Brill tagger, Penn tagset
the TnT tagger developed by T. Brants, as the tagger of reference

67 Experiment 1: Cross-tagging two corpora
Starting with 1984_GS(MTE) and a comparable-size fragment of SemCor_GS(Penn):
- by direct tagging we got 1984_DT(Penn) & PSemCor_DT(MTE)
- by cross-tagging the same texts we got 1984_CT(Penn) & PSemCor_CT(MTE)
We randomly selected 100 differences between the direct and cross-tagged versions for each corpus; cross-tagging was correct in 69 out of 100 cases for the 1984 corpus and in 59 out of 100 cases for the PSemCor corpus => the LM built from SemCor_GS(Penn) is less accurate than the LM built from 1984_GS(MTE).

68 Experiment 2: Improving the tagged SemCor corpus
We used biased tagging as a means of evaluation: BTScore(SemCor, SemCor_BT) = 93.81%.
We double cross-tagged the entire SemCor corpus, getting a new version SemCor+, and then repeated the biased evaluation: BTScore(SemCor+, SemCor+_BT) = 96.4%.
By analyzing the differences between SemCor+ and SemCor+_BT we noticed several tokenization inconsistencies (e.g. double quotes, formulae symbols) and 6 tokens (am, are, is, was, were, and that) and 2 tag pairs (NN/NNS, NN/NNP) that caused most of the differences; we identified tagging patterns and adjusted the tagging to match those patterns, thus getting SemCor++: BTScore(SemCor++, SemCor++_BT) = 98.08%.

69 Experiment 2: Improvement Assessment
57,905 differences between SemCor and SemCor++, of 10,216 types; the 200 most frequent types account for 25,136 differences; the SemCor++ version is evaluated to be right in 84.44% of those 25,136 differences.
[The slide also lists the 10 most frequent difference types.]

70 Comments on Cross-tagging
- the direct tagging of a corpus can be improved
- two tagsets can be compared from a distributional point of view
- errors in the training data may be spotted and corrected
- successively applying the method to different pairs of corpora tagged with different tagsets would permit building a much larger corpus, reliably tagged in parallel with all those tagsets
- the mapping system applies not only to POS tags, but to other types of tags as well

71 Improving the accuracy of a tagging process, beyond what a single classifier can do.
Is there any way to improve the accuracy of a tagging process, once we have decided on a tagset and we have one or more training corpora and one or more tagging systems (learner + tagger)? YES! By using combined classifiers.
Classifier = a trained tagger (in this context).
T: (C_i, W^n) → (W×T)^n
  w_1 w_2 … w_n → w_1/t_i1 w_2/t_i2 … w_n/t_in
CC: ({C_1 … C_k}, W^n) → (W×T^k)^n
  w_1 w_2 … w_n → w_1/{t_11 … t_k1} w_2/{t_12 … t_k2} … w_n/{t_1n … t_kn} → w_1/t_a1 w_2/t_b2 … w_n/t_zn

72 Combining classifiers = defining a decision method for selecting one of the results provided by the various classifiers. The basic assumption: different classifiers, even of comparable accuracy (cf. McNemar’s test), do not make similar errors (cf. Brill and Wu’s complementarity measure).

73 a) McNemar’s test for classifier comparison
n = n_00 + n_01 + n_10 + n_11, where n_01 and n_10 count the examples misclassified by only one of the two classifiers. Under the null hypothesis, the two classifiers should have the same error rate, which means that n_01 = n_10 (with the common count estimated as (n_01 + n_10)/2). McNemar’s test considers the statistic
(|n_01 - n_10| - 1)^2 / (n_01 + n_10)
which is (approximately) distributed as χ² with 1 degree of freedom (the term -1 in the numerator is a “continuity” correction term and accounts for the fact that the considered statistic is discrete while χ² is continuous). If the null hypothesis is correct, then the probability that this statistic is greater than χ²_{1,0.95} = 3.841 is less than 0.05.
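A small Python sketch of the test as stated above, computing the contingency counts from two classifiers' outputs against a gold standard (the function name and data layout are illustrative):

    def mcnemar_statistic(gold, pred_a, pred_b):
        # gold, pred_a, pred_b: parallel lists of tags.
        # Returns the continuity-corrected McNemar statistic; values above 3.841
        # reject (at the 0.05 level) the hypothesis of equal error rates.
        n01 = sum(1 for g, a, b in zip(gold, pred_a, pred_b) if a != g and b == g)
        n10 = sum(1 for g, a, b in zip(gold, pred_a, pred_b) if a == g and b != g)
        if n01 + n10 == 0:
            return 0.0
        return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)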

74 Testing the statistical hypotheses:
b) local error complementarity (Brill & Wu, 1998): A, B = classifiers
COMP(A,B) = 100 * (1 - N_common / N_A); in general COMP(A,B) ≠ COMP(B,A)
N_common = number of common errors; N_A = number of errors made by classifier A
If A made the same errors as B, then COMP(A,B) = 0, and therefore combining A with B would be useless; COMP(A,B) shows, as a percentage, how frequently B is right when A is wrong.
COMP(A,B) > COMP(B,A) => B is better than A with respect to the current text; the linguistic register of the analysed text is closer to that of the corpus from which B was built.
If COMP(A,B) ≈ COMP(B,A), then the two classifiers are neutral to the analysed text (similar accuracy).
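A minimal Python sketch of the complementarity measure under the formula above (the error sets are computed against a common gold standard; names illustrative):

    def complementarity(gold, pred_a, pred_b):
        # Returns COMP(A,B) and COMP(B,A) in percent:
        # COMP(A,B) = 100 * (1 - |common errors| / |errors of A|)
        errors_a = {i for i, (g, a) in enumerate(zip(gold, pred_a)) if a != g}
        errors_b = {i for i, (g, b) in enumerate(zip(gold, pred_b)) if b != g}
        common = errors_a & errors_b
        comp_ab = 100.0 * (1 - len(common) / len(errors_a)) if errors_a else 0.0
        comp_ba = 100.0 * (1 - len(common) / len(errors_b)) if errors_b else 0.0
        return comp_ab, comp_ba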

75 The usual approach in combining classifiers: combine different taggers
trained on the same data; H. van Halteren, J. Zavrel, W. Daelemans, 1998 (COLING-ACL)
corpus: LOB
taggers:
- HMM 3-gram tagger (Steetskaamp - TOSCA tagger)
- Brill’s tagger (ftp://ftp.cs.jhu.edu/pub/brill/Programs/RULE_BASED_TAGGER_V.1.14.tar.Z)
- MaxEnt tagger (ftp://ftp.cis.upenn.edu/pub/~adwait/jmx)
- MBL (licensed from Walter Daelemans)

76 E. Brill & J. Wu, 1998 (COLING-ACL)
training corpus: WSJ
taggers:
- HMM 3-gram tagger (undocumented)
- Brill’s tagger (ftp://ftp.cs.jhu.edu/pub/brill/Programs/)
- MaxEnt tagger (ftp://ftp.cis.upenn.edu/~adwait/jmx)
CLAM - contrasting our approach with the previous approaches:
a) Tagger_i + Given_Training_Corpus = Classifier_i: differences in tagging accuracy are due just to the software (tells you which program is better)
b) Given_Tagger + Training_Corpus_i = Classifier_i (CLAM): differences in tagging accuracy are due just to the data

77

78

79 Evaluation
4 training corpora ==> 4 basic classifiers
Unseen test data:
Text (Amb2)      #Sentences   Text “classification”                 Occurrences
1994 (2.68)                   fiction (1984 follow-up)
Barnes (2.73)                 scientific essay (not seen before)    20,120
ziarNou (2.79)                journalism (another newspaper)
Additional lexicon = all the words in the test data that are not in the basic lexicon.
Types of experiments:
l1) partial lexicon (basic lexicon)
l2) full lexicon (basic lexicon + additional lexicon => no unknown words)
A basic experiment: 24 runs = 3 text chunks * 4 classifiers * 2 experiment types
Resampling the test texts and averaging (bagging): Acc_unk = 98.6%, Acc_no-unk = 99.0%

80

81 References on Morpho-Syntactic Tagging
Felix Pîrvan, Dan Tufiş: Tagsets Mapping and Statistical Training Data Cleaning-up. In Proceedings of the 5th LREC Conference, Genoa, Italy, May 2006.
Dan Tufiş, Elena Irimia: RoCo_News - A Hand Validated Journalistic Corpus of Romanian. In Proceedings of the 5th LREC Conference, Genoa, Italy, May 2006.
Dan Tufiş, Liviu Dragomirescu: Tiered Tagging Revisited. In Proceedings of the 4th LREC Conference, Lisbon, 2004.
Dan Tufiş: High Accuracy Tagging with Large Tagsets. In Proceedings of ACCIDA’2000, Tunisia, 2000.
Dan Tufiş: Using a Large Set of Eagles-compliant Morpho-Syntactic Descriptors as a Tagset for Probabilistic Tagging. Second International Conference on Language Resources and Evaluation (LREC), Athens, May 2000.
Dan Tufiş, P. Dienes, C. Oravecz, T. Váradi: Principled Hidden Tagset Design for Tiered Tagging of Hungarian. Second International Conference on Language Resources and Evaluation (LREC), Athens, May 2000, pp. 1421-1426.
Dan Tufiş: Tiered Tagging and Combined Classifiers. In F. Jelinek, E. Nöth (eds.) Text, Speech and Dialogue, Lecture Notes in Artificial Intelligence 1692, Springer, 1999.
Tomaz Erjavec, Nancy Ide, Dan Tufiş: Development and Assessment of Common Lexical Specifications for Six Central and Eastern European Languages. ALLC-ACH '98 Conference, Debrecen, Hungary, July 1998.
Dan Tufiş, Nancy Ide, Tomaz Erjavec: Standardised Specifications, Development and Assessment of Large Morpho-Lexical Resources for Six Central and Eastern European Languages. First International Conference on Language Resources and Evaluation (LREC), Granada, May 1998.
Dan Tufiş, Oliver Mason: Tagging Romanian Texts: A Case Study for QTAG, a Language Independent Probabilistic Tagger. First International Conference on Language Resources and Evaluation (LREC), Granada, May 1998.
Ludmila Dimitrova, Tomaz Erjavec, Nancy Ide, Heiki Kaalep, Vladimir Petkevic, Dan Tufiş: Multext-East: Parallel and Comparable Corpora and Lexicons for Six Central and Eastern European Languages. COLING, Montreal, 1998.
Tomaz Erjavec, Nancy Ide, Dan Tufiş: Encoding and Parallel Alignment of Linguistic Corpora in Six Central and Eastern European Languages. In Michael Levison (ed.) Proceedings of the Joint ACH/ALLC Conference, Queen's University, Kingston, Ontario, June 1997.

82 BLARK: Lemmatization
Lemmatization after tagging is very easy, provided a word-form lexicon (à la MULTEXT) is available (other solutions are possible as well).
A word-form lexicon entry: <wordform> <lemma> <tag>
For most languages, in the vast majority of cases: <wordform> + <tag> → <lemma>
When this is not the case, one could use statistical evidence: <wordform> + <tag> → <lemma1> P1, <lemma2> P2 … (see the sketch below)
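A minimal Python sketch of this lookup-based lemmatization, assuming the word-form lexicon has been loaded into a dict and that ties are broken by the statistical evidence mentioned above (counts from a lemmatized corpus); all names are illustrative.

    from collections import Counter

    def lemmatize(wordform, tag, lexicon, lemma_counts=None):
        # lexicon: dict (wordform, tag) -> list of candidate lemmas;
        # lemma_counts: optional Counter over (wordform, tag, lemma) triples
        # giving the statistical evidence used when the pair is ambiguous.
        candidates = lexicon.get((wordform, tag), [])
        if len(candidates) == 1:              # the usual case: <wordform>+<tag> -> <lemma>
            return candidates[0]
        if candidates and lemma_counts:
            # pick the lemma most frequently seen for this <wordform, tag> pair
            return max(candidates, key=lambda l: lemma_counts[(wordform, tag, l)])
        return candidates[0] if candidates else None   # unknown word: handled elsewhere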

83 Lemmatization (cntd.) The previous approach works for the words in the word-form lexicon; otherwise, for inflectional languages, a retrograde ending analysis could be one way to solve the problem.
[Figure: an ending (suffix) tree with nodes such as *A, *E, *L, *I, *U, *R, *O.]
The nodes of the ending tree store sets {(ending_i, tag_i)+}; wordform + tag_i => root + ending_i.
The ending tree (ET) can be learnt from a core word-form lexicon; root → lemma.

84 Lemmatization (cntd.): root → lemma
If the root is common to all inflectional variants, then root + the standard affix for the category = lemma.
Otherwise, use knowledge about the regular root alternation rules:
{sub-category='common-noun' & gender='feminine' & <X> = >e|a|< & number='singular' & case='nom-acc'}
{ [number='singular' & case='nom-acc'] → >e< }

85 BLARK: Chunking
Chunking - dividing a text into syntactically correlated sequences of words; an intermediate step towards full parsing, easier to achieve. For example, the sentence
He reckons the current account deficit will narrow to only # 1.8 billion in September .
can be divided as follows:
[NP He ] [VP reckons ] [NP the current account deficit ] [VP will narrow ] [PP to ] [NP only # 1.8 billion ] [PP in ] [NP September ] .
Chunking is usually implemented by means of regular expressions defined over the tags of the tagged input text (language dependent), e.g.:
NP -> Indef-art[gender=α; number=β] Det[gender=α; number=β]* Noun[gender=α; number=β]
VP -> Aux1 {Aux2} {Adv} Vpart|Vmain
The goal of this task is to come forward with machine-learning methods which, after a training phase, can recognize the chunk segmentation of the test data as well as possible. The chunkers are evaluated with the F rate, a combination of the precision and recall rates, F = 2*precision*recall / (precision+recall), computed over all types of chunks. This was the shared task for CoNLL-2000. Training and test data consist of the same partitions of the Wall Street Journal corpus (WSJ) as the widely used data for noun phrase chunking: sections 15-18 as training data (211,727 tokens) and section 20 as test data (47,377 tokens). (A toy regular-expression chunker is sketched below.)
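To illustrate the regex-over-tags idea (a toy example, not the actual RACAI chunker; the Penn-style tag names and the NP pattern are assumptions), here is a Python sketch that brackets simple NP chunks in a tagged sentence:

    import re

    def np_chunk(tagged_sentence):
        # tagged_sentence: list of (word, tag) pairs with toy tags (DT, JJ, NN, ...).
        # NP pattern over the tag sequence: optional determiner, adjectives, nouns.
        # (Toy: assumes no tag is a suffix of another tag.)
        tags = " ".join(tag for _, tag in tagged_sentence) + " "
        np_pattern = re.compile(r"((?:DT )?(?:JJ )*(?:NN[SP]* )+)")
        boundaries = []
        for m in np_pattern.finditer(tags):
            start = tags[:m.start()].count(" ")                 # token index of chunk start
            end = start + m.group(1).strip().count(" ") + 1     # token index past chunk end
            boundaries.append((start, end))
        out, i = [], 0
        for start, end in boundaries:
            out.extend(w for w, _ in tagged_sentence[i:start])
            out.append("[NP " + " ".join(w for w, _ in tagged_sentence[start:end]) + " ]")
            i = end
        out.extend(w for w, _ in tagged_sentence[i:])
        return " ".join(out)

    print(np_chunk([("He", "PRP"), ("reckons", "VBZ"), ("the", "DT"),
                    ("current", "JJ"), ("account", "NN"), ("deficit", "NN"),
                    ("will", "MD"), ("narrow", "VB")]))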

86 Dependency Linker A processing module that links words which are likely to depend on each other (syntactically or semantically).
A slightly modified IBM-1 model (EM algorithm), fed with the same text as both source and target, is able to detect interesting dependencies among the words of a sentence (not necessarily adjacent).
The collocation pairs which comply with a set of restrictions (Constrained Lexical Attraction Model - e.g. graph planarity and syntactic rules defined over the POS tagset) also identify interesting dependencies.

87 Dependency Linker There are two basic approaches to computing the statistical dependency of the tokens in a text:
- compute the collocation scores among all the adjacent words and retain the pairs with a score beyond a given threshold;
- use IBM model 1 (EM algorithm) to detect dependencies between the words of each sentence, ignoring the identity relation (a word being related to itself); since this approach does not impose word adjacency, it is able to detect non-contiguous related words.
We used both approaches and combined the results.

88 CLAM Linker
Lexical Attraction Model (LAM): Yuret, D. (1998). Discovery of linguistic relations using lexical attraction. PhD thesis, Dept. of Computer Science and Electrical Engineering, MIT.
Constrained LAM: a link is rejected if it violates any of the linking rules of a language, for instance number agreement.
Radu Ion, Verginica Barbu Mititelu: Constrained Lexical Attraction Models. In Proc. Trends in Natural Language Processing, Special Track at the 19th FLAIRS, Melbourne Beach, Florida, May 11-13, 2006.
Verginica Barbu Mititelu, Radu Ion: Cross-language Transfer of Syntactic Relations Using Parallel Corpora. In Proceedings of the Workshop on Cross-Language Knowledge Induction, Cluj-Napoca, July 2005.

89 IBM-1 and CLAM linkers

90 References on tokenisation, lemmatization and dependency linking
Dan Tufiş, Radu Ion, Elena Irimia, Verginica Barbu Mititelu, Alexandru Ceauşu, Dan Ştefănescu, Luigi Bozianu, Cătălin Mihăilă: Resources, Tools and Algorithms for the Semantic Web. In Proceedings of the “IST - Multidisciplinary Approaches” Workshop, Romanian Academy, 2006.
Radu Ion: Methods for Semantic Disambiguation; Applications for English and Romanian. PhD Thesis, Romanian Academy, May 2007.
Radu Ion, Verginica Barbu Mititelu: Constrained Lexical Attraction Models. In Proceedings of the Nineteenth International Florida Artificial Intelligence Research Society Conference, pages 297-302, Menlo Park, Calif., USA, AAAI Press.
Dan Tufiş, Radu Ion: Parallel Corpora, Alignment Technologies and Further Prospects in Multilingual Resources and Technology Infrastructure. In C. Burileanu, H. N. Teodorescu (eds.): Proceedings of the 4th International Conference on Speech and Dialogue Systems (SPED 2007), May 2007, Romanian Academy Publishing House.
Radu Ion, Alexandru Ceauşu, Dan Tufiş: Dependency-Based Phrase Alignment. In Proceedings of the 5th LREC Conference, Genoa, Italy, May 2006.
Dan Tufiş, Ana-Maria Barbu: Extracting Multilingual Lexicons from Parallel Corpora. In Proceedings of the ACH-ALLC Conference, New York, June 2001.
Tomaz Erjavec, Dan Tufiş, Tamas Varadi: Developing TEI-Conformant Lexical Databases for CEE Languages. In Proceedings of the 4th International Workshop on Computational Lexicography COMPLEX, Pécs, Hungary, 1999.
Vladimir Petkevic, Dan Tufiş: The Multext-East Lexicon. ALLC-ACH '98 Conference, Debrecen, Hungary, July 1998.
Dan Tufiş, Octav Popescu: A Unified Management and Processing of Word-Forms, Idioms and Analytical Compounds. In Jurgen Kunze, Dorothy Reinman (eds.), Proceedings of the 5th European Conference of the Association for Computational Linguistics, Berlin, 1991.
Dan Tufiş: It Would Be Much Easier If WENT Were GOED. In Harry Somers, Mary McGee Wood (eds.), Proceedings of the 4th European Conference of the Association for Computational Linguistics, Manchester, 1989.

91 ELARK: Alignment Reification
An alignment: a set of mapping links between the entities of two sets of information representations (N to M).
To reify is “to regard or treat (an abstraction) as if it has a concrete or material existence”; in a reified alignment, each link is treated as an object characterized by its own features.

92 Types of Aligned Entities
Multilingual parallel texts (bitexts or multitexts) - implicit representation structures of the same underlying meaning: paragraph, sentence, phrase, word level;
Multilingual thesauri (concept names, relation names, hierarchical structures);
Ontologies (concept names and properties, relation names and properties, hierarchical structures);
Text to semantic or conceptual structures (semantic dictionaries, thesauri, ontologies): WSD, document classification/clustering, IR, QA;
Multimedia data (text-speech, text-video, video-speech);
… etc.

93 Texts alignment Regards the texts to be aligned as implicitly structured representations of the underlying meaning.
Revealing the linguistic structure requires minimal preprocessing such as segmentation (paragraph, sentence, lexical token), POS-tagging, lemmatization (stemming), chunking (parsing), WSD; the more fine-grained the preprocessing, the more precise the alignment and the better its benefits.
If two texts encode the same meaning, text alignment enables the automatic identification of the matching meaning blocks and the computation of their common meaning; typical examples of texts encoding the same meaning are paraphrase corpora (for the monolingual case) and parallel corpora (for the multilingual case).

94 Approaches to Alignments
Symbolic approaches, usually rule-based, rely on a priori knowledge about the entities to be aligned (knowledge intensive).
Statistical approaches rely on large data; the ML approaches may require (human-prepared) training data, based on which an alignment statistical model is computed; based on this model, new unseen data can be aligned.
Mixed approaches combine statistical and symbolic methods. Few approaches are purely symbolic or purely statistical.

95 Why is parallel text alignment so important?
Multilingual lexicography
Multilingual terminology
Annotation import in parallel corpora and automatic induction of annotation models
Multilingual information retrieval
Multilingual question-answering
Machine translation

96 Sentence alignment (I)
The aligner is designed for the alignment of large parallel corpora.
Initial step: create a (hand-validated) reference sentence alignment of a sample of the corpus (about 1000 sentences per language pair). We used Moore’s aligner for this step, and the result was hand-validated. This is a cheap process, requiring only a matter of minutes.
This set of sentences, supplemented with 1000 wrong alignments (generated automatically, based on the correct ones), is used to train an SVM classifier. The features used for this initial model are: sentence length (both word- and character-based), the correlation factor for the number of non-lexical tokens contained in the candidate sentences, and the rank correlation for the words in the candidate sentences.

97 Sentence Alignment (II)
The proper sentence alignment of the current bitext is a two-step procedure:
1. The first step produces an initial alignment of the bitext and uses the initial SVM classifier to label the paired sentences as GOOD or BAD. The pairs classified as GOOD with a probability higher than 90% are used to retrain the classifier, this time with a highly discriminative feature: translation equivalence (there is supposed to be enough data to extract reliable translation equivalence tables - aka IBM-1).
2. The second phase is an iterative procedure that generates a reduced set of sentence-pair candidates, which are evaluated by the new SVM classifier. The GOOD ones are retained in the final alignment. The set of GOOD links gets larger and larger from one iteration to the next, and these links act as restrictors and/or supporters of the new candidate links.

98 Sentence Alignment (example)
En: Done at Brussels, 14 August 2003. For the Commission, Franz Fischler, Member of the Commission. (1) OJ L 198, p. 1. (2) OJ L 85, p. 15. (3) OJ L 169, p. 1. ANNEX The Commission is currently...
Ro: Adoptat la Bruxelles, 14 august 2003. Pentru Comisie, Franz FISCHLER, Membru al Comisiei. 1 JO L 198, p. 1. 2 JO L 85, p. 15. 3 JO L 169, p. 1. ANEXĂ Comisia studiază în prezent...
State-of-the-art sentence aligner: Moore, R. C. (2002). Fast and Accurate Sentence Alignment of Bilingual Corpora. In Machine Translation: From Research to Real Users (Proceedings of the 5th Conference of the Association for Machine Translation in the Americas, Tiburon, California), Springer-Verlag, Heidelberg, Germany.

99 Evaluation of the Sentence Aligner
~1000 sentences per language pair (4 files from the AcqComm corpus)

Aligner        Precision   Recall    F-measure
Moore En-It    100%        97.76%    98.87%
RACAI En-It    98.93%      98.99%    98.96%
Moore En-Fr                98.62%    99.31%
RACAI En-Fr    99.46%      99.60%    99.53%
Moore En-Ro    99.80%      93.93%    96.78%
RACAI En-Ro    99.24%      99.04%    99.14%

Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž Erjavec, Dan Tufiş, Daniel Varga: The JRC-Acquis: A Multilingual Aligned Parallel Corpus with 20+ Languages. In Proceedings of the 5th LREC Conference, Genoa, Italy, 2006.
Alexandru Ceauşu, Dan Ştefănescu, Dan Tufiş: Acquis Communautaire Sentence Alignment Using Support Vector Machines. In Proceedings of the 5th LREC Conference, Genoa, Italy, 2006.

100 Word Alignment An automatic procedure identifying, for each word in one part of a bitext, the word in the other part of the bitext that translates it.
In general, the correspondence between the words in the two parts of a bitext is not always 1-1: 1-1 links are easier to detect; N-M links require special treatment.

101 Requirements Irrespective of the granularity, the alignment should be as accurate as possible; any alignment error in an early stage of bitext processing would generate many other errors in the next steps (or, in the best case, would generate “silence” in the output).
When the quantity of textual data is large (the interesting case), the alignment should be as fast as possible.
With the current computing paradigms and technologies, a trade-off between speed and accuracy must be accepted.

102 Word Alignment: an example
En: The patrols did not matter , however .
Ro: Şi totuşi , patrulele nu contau .
[Figure: the word alignment links between the two sentences.]

103 Our Aligners YAWA, MEBA&COWAL
We developed various aligners at the sentence, chunk and word levels; they can work either independently (pipe-lined) or coupled in a feedback-based architecture, providing much better alignment results at the price of a longer response time.
Sentence alignment and monolingual chunking precede word alignment; however:
- if the number of aligned words in the current pair of sentences is below the expected number, it is highly probable that the sentence alignment is wrong; this is an extremely precise and valuable hint for the automatic correction of the sentence alignment;
- if the links starting from the words in a chunk of one language point to words in different chunks of the other language, it is highly probable that either the chunking or the word alignment is locally wrong; this is an extremely precise and valuable hint for the automatic correction of the chunking, the word alignment or both.
Word alignment is the most critical processing phase!

104 Lexical alignment of parallel texts
Method: reified lexical alignment (very high accuracy). Hub language - English: from the En-X and En-Y alignments we automatically derive the X-Y alignment. The derived alignments are used to generate translation models for the X-Y pair of languages and, by bootstrapping (if necessary and if linguistic expertise is available), to improve the initial X-Y alignment. Reified alignments (COWAL): a bitext alignment is a set of lexical token pairs (links), each of them characterized by a weighted feature structure (feat1: val1, feat2: val2, ..., featn: valn) whose score should be higher than the acceptability threshold.

105 Features characterizing a link
A link <Token1 Token2> is characterized by a set of features, the values of which are real numbers in the [0,1] interval: context independent features (CIF) refer to the tokens of the current link (cognates, translation equivalents (TE), POS affinity, "obliqueness", TE entropy); context dependent features (CDF) refer to the properties of the current link with respect to the rest of the links in a bitext (strong and/or weak locality, number of crossed links, collocations). Based on the values of a link's features we compute, for each possible link, a global reliability score which is used to license or reject the link in the final result. ACL2005 Word Alignment Shared Task (En-Ro): 1. COWAL (F=73.90%, AER=26.10%). The current version (July 2007), evaluated against our corrected ACL2005 GS: COWAL (F=85.22%, AER=14.78%).

106 Technical details (1) n11 n12 n21 n22
Building a translation model (TM): estimating the probabilities of each possible translation pair from the training corpora and retaining only the most promising ones. The search space is a subset of {TT1-n} x {TS1-m}; TLi = <lemma_tag>. A pair is retained only if Threshold < LL(TT, TS), where, over the contingency counts below, LL(TT, TS) = 2 Σij nij log((nij * n**) / (ni* * n*j)). The search space is dramatically reduced by sentence alignment, tagging and lemmatization, and by imposing a minimal LL threshold. For the "promising" translation pairs we estimate their translation probabilities: P(TT|TS) & P(TS|TT).

        TT    ØTT
  TS    n11   n12   n1*
  ØTS   n21   n22   n2*
        n*1   n*2   n**
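A minimal sketch of the log-likelihood filtering step, assuming the standard G² (log-likelihood ratio) over the 2x2 contingency counts above; the threshold value and the input dictionaries (co-occurrence and marginal counts over aligned sentence pairs) are illustrative:

import math

def log_likelihood(n11, n12, n21, n22):
    # Standard G^2 log-likelihood ratio over a 2x2 contingency table.
    n = n11 + n12 + n21 + n22
    row1, row2 = n11 + n12, n21 + n22
    col1, col2 = n11 + n21, n12 + n22
    total = 0.0
    for observed, expected in [(n11, row1 * col1 / n), (n12, row1 * col2 / n),
                               (n21, row2 * col1 / n), (n22, row2 * col2 / n)]:
        if observed > 0 and expected > 0:
            total += observed * math.log(observed / expected)
    return 2.0 * total

def promising_pairs(cooc, src_count, tgt_count, n_pairs, threshold=9.0):
    # Keep only <target, source> lemma_tag pairs whose LL exceeds the threshold.
    kept = {}
    for (t, s), n11 in cooc.items():
        n12 = tgt_count[t] - n11          # t occurs, s does not
        n21 = src_count[s] - n11          # s occurs, t does not
        n22 = n_pairs - n11 - n12 - n21   # neither occurs
        score = log_likelihood(n11, n12, n21, n22)
        if score > threshold:
            kept[(t, s)] = score
    return kept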

107 Technical details (2) A multiple-step algorithm, controlled by several parameters (linearly interpolated): P1: translation probabilities: P(lemmaL1i_tagL1i | lemmaL2k_tagL2k); P2: fertility: P(n | lemmaL2k_tagL2k) / collocation score; P3: translation equivalence entropy (the entropy of the word's translation-equivalent distribution, see slide 110); P4: distortions: P(pozL1i | pozL2k); P5: POS affinity: P(POSL1i | POSL2k); P6: string similarity: SS(lemmaL1i, lemmaL2k); P7: local and/or global localities.

108 Technical details (3)
AL(lemmaL1i | lemmaL2k) = argmax over the candidate links of Σp λp Pp (the linear interpolation of parameters P1-P7); N-M alignments.
The lambdas differ from one step to the next; additionally, there are minimal thresholds (also varying from step to step) for the value of each parameter (if a value is below its threshold, its contribution is nil).
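A sketch of the linearly interpolated scoring with per-feature thresholds described above; the concrete weights, thresholds and feature values are placeholders, not the settings used by MEBA/COWAL:

def link_score(features, weights, thresholds):
    # Weighted sum of feature values; a feature below its threshold contributes nothing.
    # features, weights, thresholds: dicts keyed by feature name, values in [0, 1].
    score = 0.0
    for name, value in features.items():
        if value >= thresholds.get(name, 0.0):
            score += weights.get(name, 0.0) * value
    return score

# Illustrative usage: one iteration keeps the highest-scoring candidate link.
candidates = {("patrols", "patrulele"): {"TE": 0.8, "PA": 0.9, "cognate": 0.0},
              ("patrols", "contau"):    {"TE": 0.1, "PA": 0.2, "cognate": 0.0}}
weights = {"TE": 0.5, "PA": 0.3, "cognate": 0.2}
thresholds = {"TE": 0.05}
best = max(candidates, key=lambda link: link_score(candidates[link], weights, thresholds))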

109 Translation equivalents (TE)
YAWA and TREQ-AL use competitive linking based on LL scores, plus the Ro-En aligned wordnets. MEBA uses GIZA++ generated candidates filtered with a log-likelihood threshold (11). The TE candidate search space is limited by lemmatization and POS meta-classes (e.g. meta-class 1 includes only N, V, Aj and Adv; meta-class 8 includes only proper names). For a pair of languages, translation equivalents are computed in both directions. The value of the TE feature of a candidate link <TOKEN1 TOKEN2> is 1/2 (PTR(TOKEN1, TOKEN2) + PTR(TOKEN2, TOKEN1)).

110 Entropy Score (ES) The entropy of a word's translation-equivalents distribution proved to be an important hint for identifying highly reliable links (anchoring links). Skewed distributions are favored against uniform ones. For a link <A B>, the link feature value is 0.5(ES(A)+ES(B)).
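A small sketch of an entropy-based score over a word's translation distribution; normalizing the entropy and inverting it so that skewed (reliable) distributions score close to 1 is an assumption about how ES could be defined, not the exact formula of the aligner:

import math

def entropy_score(translation_probs):
    # 1 - normalized entropy of a word's translation-equivalent distribution.
    # Skewed distributions (one dominant translation) score close to 1.
    probs = [p for p in translation_probs if p > 0]
    if len(probs) <= 1:
        return 1.0
    h = -sum(p * math.log(p) for p in probs)
    return 1.0 - h / math.log(len(probs))

# ES for a link <A B> as on the slide: 0.5 * (ES(A) + ES(B))
es_link = 0.5 * (entropy_score([0.7, 0.2, 0.1]) + entropy_score([0.5, 0.5]))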

111 Cognates (COGN) The cognates feature assigns a string-similarity score to the tokens of a candidate link. We estimated the probability that a pair of orthographically similar words appearing in aligned sentences are cognates, for different string-similarity thresholds. For the threshold 0.6 we did not find any exception. Therefore, the value of this feature is either 1 (if the similarity score is above the threshold) or 0 (otherwise). Before computing the string similarity score the words are normalized (duplicate letters are removed, diacritics are removed, some suffixes are discarded).
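A sketch of the normalize-then-threshold idea; the similarity measure used here (difflib's ratio) and the suffix list are illustrative stand-ins, since the actual measure is sketched on the next slide:

import re
import unicodedata
from difflib import SequenceMatcher

SUFFIXES = ("tion", "ism", "ist")  # illustrative suffix list

def normalize(word):
    # Strip diacritics, collapse duplicate letters, drop a few suffixes.
    word = unicodedata.normalize("NFD", word.lower())
    word = "".join(c for c in word if unicodedata.category(c) != "Mn")
    word = re.sub(r"(.)\1+", r"\1", word)
    for suf in SUFFIXES:
        if word.endswith(suf):
            word = word[: -len(suf)]
            break
    return word

def cognate_feature(w1, w2, threshold=0.6):
    sim = SequenceMatcher(None, normalize(w1), normalize(w2)).ratio()
    return 1 if sim > threshold else 0

print(cognate_feature("exemplu", "example"))  # -> 1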

112 COGN TS=1 k ; TT=1 m i and are j the matching characters, & (i) is the distance (in chars of TS) from the previous matching , & ( i) is the distance (in chars of TT) from the previous matching 

113 Part-of-speech affinity (PA)
An important clue in word alignment is that translated words tend to keep their part of speech, and when they have different POSes this is not arbitrary. The information was computed from a gold standard (the revised NAACL2003), in both directions (source-target and target-source). For a link <A,B>: PA = 0.5*(P(cat(A)|cat(B)) + P(cat(B)|cat(A))).

114 Collocation "A collocation is an expression consisting of two or more words that correspond to some conventional way of saying things" (Manning & Schütze, 1999). "Collocations of a given word are statements of the habitual or customary places of that word" (Firth, 1957). "A recurrent combination of words that co-occur more often than expected by chance and that correspond to arbitrary word usages" (Smadja, 1993). "A sequence of two or more consecutive words, that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components." (Choueka, 1988). An n-gram analysis program (such as Ted Pedersen's) is extremely useful in discovering collocations; only adjacent words are considered. Out of the different available collocation tests we found LL (with a threshold of 9) to work best.

115 Refining Collocation Extraction
Considers not only adjacent words; the variance of the co-occurrence distances is σ² = (1/n) Σi (di - μ)², where n is the total number of distances, di are the distances and μ is the mean. If two words always appear together at the same distance, the variance is 0. If the distribution of the distances is random (the case of words which appear together by chance), the variance has high values. Smadja (1990) shows that one can find collocations by looking for pairs of words for which the standard deviation (σ) of the distances is small.
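A compact sketch of the distance-based filtering described above (mean, variance and the σ < 1.5 cut-off used on the next slides); the corpus representation, a list of lemmatized and tagged sentences, is an assumption for the illustration:

import math
from collections import defaultdict

def distance_stats(sentences, pos_pairs=(("V", "N"), ("N", "N"), ("N", "A"))):
    # Collect co-occurrence distances for selected POS pairs within each sentence.
    # sentences: list of sentences, each a list of (lemma, pos) tuples.
    distances = defaultdict(list)
    for sent in sentences:
        for i, (l1, p1) in enumerate(sent):
            for j in range(i + 1, min(i + 11, len(sent))):   # ~11-word window
                l2, p2 = sent[j]
                if (p1, p2) in pos_pairs or (p2, p1) in pos_pairs:
                    distances[((l1, p1), (l2, p2))].append(j - i)
    stats = {}
    for pair, ds in distances.items():
        mu = sum(ds) / len(ds)
        var = sum((d - mu) ** 2 for d in ds) / len(ds)
        stats[pair] = (mu, math.sqrt(var))
    return stats

def collocation_candidates(stats, max_std=1.5):
    # Keep pairs whose distance standard deviation is below the cut-off.
    return {pair: (mu, sd) for pair, (mu, sd) in stats.items() if sd < max_std}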

116 Refining collocation extraction
We were interested in finding V-N (N-V), N-N and N-A (A-N) collocations. Note that while V-N collocations usually characterize verb sub-categorization structures, the N-N and N-A ones are ordinarily terminological compounds. The method was applied on a tagged and lemmatized version of the Acquis Communautaire (AC) corpus. The size of the corpus is around 350 MB. We computed the standard deviation for all V-N, N-N and N-A pairs (from the AC corpus) within a window of 11 non-functional words for French, German and Romanian. We considered as good all the pairs for which the standard deviation was smaller than 1.5 (a reasonable threshold according to the examples in Manning & Schütze (1999)) and kept them in a list along with their mean. This method finds good candidates for collocations, but not good enough: we want to further filter out some of the pairs so that we keep only those composed of words which appear together more often than expected by chance. This can be done using log-likelihood. The idea behind the LL score is finding the hypothesis which better describes the data obtained by analyzing a text. The two hypotheses considered are: H0: P(w2|w1) = p = P(w2|¬w1) (null hypothesis - independence); H1: P(w2|w1) = p1 ≠ p2 = P(w2|¬w1) (non-independence hypothesis).

117 Refining collocation extraction
We computed the LL score for all the pairs obtained using Smadja's method. In the computation for a certain (for example) V-N pair at distance d (the rounded mean of all distances between the words of this pair), we used only the V-N pairs of words for which the distance is the same (d). We kept in a final list the pairs for which the LL score was higher than 9; for this threshold the probability of error is less than 0.004. If neither token of a candidate link has a relevant collocation score with the tokens in its neighborhood, the link value of this feature is 0; otherwise the value is the maximum of the collocation probabilities of the link's tokens. Competing links (starting or finishing in the same token) are licensed if and only if at least one of them has a non-null collocation score.

118 Refining collocation extraction
Collocation analysis in a parallel corpus: on a large parallel corpus (Acq-Com), University Marc Bloch from Strasbourg, IMS Stuttgart University and RACAI independently extracted the collocations in Fr, Ge, Ro and En (hub). We identified the equivalent collocations in the four languages. SURE-COLLOCX = COLLOCX ∩ TRX-COLLOCY (EQ1): member states, European Communities, international treaty, etc. INT-COLLOCZ = COLLOCZ \ SURE-COLLOCZ (EQ2): adversely affect <-> a aduce atingere[1]; legal remedy <-> cale de atac[2]; to make good the damage <-> a compensa daunele[3] etc. [1] A word-for-word translation would be "to bring a touch". [2] A word-for-word translation would be "way to attack". [3] A word-for-word translation would be "to compensate the damages".

119 Localization This feature is relevant with or without chunking or dependency parsing modules. It accounts for the degree of cohesion of the links. When the chunking module is available, and the chunks are aligned via the linking of their respective heads, the links starting in one chunk should finish in the aligned chunk. When chunking information is not available, the link localization is judged against a window, the span of which depends on the aligned sentences' length. Maximum localization (1) is reached when all the tokens in the source window are linked to tokens in the target window.

120 Combining classifiers
Weak locality: when chunking/dependency-link information is not available, the link localization is judged against a window containing m links, centered on the candidate link; the value of m depends on the lengths of the aligned sentences. Combining classifiers: if multiple classifiers are comparable, and if they do not make similar errors, combining their classifications is always better than the individual classifications.

121 Distortion/Relative position
Each token in both sides of a bitext is characterized by a position index, computed as the ratio between its relative position in the sentence and the length of the sentence. The absolute value of the difference between the tokens' position indexes gives the link's "obliqueness". The distortion feature of a link is its obliqueness: D(link) = OBL(SWi, TWj).
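A tiny worked illustration of the position-index and obliqueness computation described above (plain arithmetic; the variable names are only for the example):

def position_index(token_position, sentence_length):
    # Relative position of a token in its sentence, in [0, 1].
    return token_position / sentence_length

def obliqueness(src_pos, src_len, tgt_pos, tgt_len):
    # Distortion feature of a link: |position_index(source) - position_index(target)|.
    return abs(position_index(src_pos, src_len) - position_index(tgt_pos, tgt_len))

# e.g. the 2nd of 7 English tokens linked to the 4th of 7 Romanian tokens
print(obliqueness(2, 7, 4, 7))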

122 Crossed links The crossed-links feature counts (for a window whose size depends on the categories of the candidates and the sentences' lengths) the links that would be crossed. The normalization factor (the maximum number of crossable links) is set empirically, based on the categories of the link's tokens.

123 Heuristics for improving the alignment (1)
The words left unaligned in the previous step may get links via their aligned dependents (HLP: Head Linking Projection heuristic): if b is aligned to c and b is linked to a, link a to c, unless there exists d in the same chunk as c, linked or not to it, such that the POS category of d has a significant affinity with the category of a. Alignment of sequences of words surrounded by aligned chunks. Filtering out improbable links (e.g. links that cross many other links).

124 Heuristics for improving the alignment (2)
Unaligned chunks surrounded by aligned chunks get a probable phrase alignment: with Wsi ↔ Wtj and Wsk ↔ Wtm already aligned, the source span Wsp Wsp+1 ... (SL) is aligned to the target span Wtq Wtq+1 ... (TL).

125 Dependency Links Alignment (I)
If instead of taking lexical tokens as alignment units, one considers dependency links, COWAL produces dependency links alignment; from the links alignment => word alignment Radu Ion, Alexandru Ceauşu and Dan Tufiş: Dependency-Based Phrase Alignment, in Proceedings of the LREC 2006, Genoa, Italy

126 Dependency Links Alignment (II)

127 Dependency chunks & Translation Model
Regular expressions defined over the POS tags and dependency links Non-recursive chunks Chunk alignment based on their aligned constituents (one or more).

128 Final Word Alignment

129 YAWA YAWA starts with all plausible links (licensed by the translation model). Then, using a competitive linking strategy, it retains the links that maximize the sentence translation-equivalence score while minimizing the number of crossing links. This way it generates only 1-1 alignments. N-M alignments are possible only when chunking and/or dependency linking is available; in that case, the unaligned words may get links via their heads' links. Very good recall!
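A minimal sketch of greedy competitive linking as described above (highest-scoring links first, each token used at most once); the link scores are assumed to come from the translation model and feature combination discussed earlier, and the crossing-link minimization is omitted from this simplified version:

def competitive_linking(candidate_links):
    # Greedy 1-1 link selection.
    # candidate_links: dict {(src_index, tgt_index): score}.
    selected = []
    used_src, used_tgt = set(), set()
    for (i, j), score in sorted(candidate_links.items(),
                                key=lambda item: item[1], reverse=True):
        if i not in used_src and j not in used_tgt:
            selected.append((i, j))
            used_src.add(i)
            used_tgt.add(j)
    return selected

# e.g. "The patrols did not matter" vs "patrulele nu contau"
links = {(1, 0): 0.9, (3, 1): 0.8, (4, 2): 0.85, (1, 2): 0.3}
print(competitive_linking(links))   # [(1, 0), (4, 2), (3, 1)]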

130 MEBA Unlike YAWA, MEBA iterates several times over each pair of aligned sentences, at each iteration adding only the highest-scoring link. The links already established in previous iterations give support to, or create restrictions for, the links to be added in a subsequent iteration. It generates N-M alignments (no competitive-linking filtering). MEBA uses different weights and different significance thresholds for each feature and iteration step; they were set manually. Very good precision!

131 Combining the Alignments
The simplest method: just compute the union of YAWA and MEBA, remove the duplicates and eliminate impossible multiple links (i-j and i-0). The winner in the latter case was heuristically determined by the properties of the language pair (En-Ro). This method has better recall, but the precision deteriorates significantly unless the language-specific filtering is used. Solution? Try to remove as many bad links as possible from the union. We used an SVM binary classifier (Good/Bad) trained on our version of the GS2003 (for positive examples) and on the differences between the basic alignments (YAWA, MEBA) and the GS2003 (for negative examples). The SVM classifier (LIBSVM, Fan et al., 2005) uses the default parameters: C-SVC classification (soft-margin classifier) and an RBF (Radial Basis Function) kernel. Features used for the training (10-fold validation; about 7000 good examples and 7000 bad examples): TE(S,T), TE(T,S), OBL(S,T), LOC(S,T), PA(S,T), PA(T,S). The links labeled as incorrect were removed from the merged alignment.
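A sketch of the union-then-filter combination, using scikit-learn's SVC (C-SVC with an RBF kernel) as a stand-in for LIBSVM; the six feature names match the list above, but the feature dictionary and the null-link convention (target index 0 = no translation) are assumptions for the illustration:

from sklearn.svm import SVC

FEATURES = ["TE_st", "TE_ts", "OBL", "LOC", "PA_st", "PA_ts"]

def link_vector(link, feats):
    # feats: dict mapping each link to its six feature values (placeholder source).
    return [feats[link][name] for name in FEATURES]

def train_filter(good_vectors, bad_vectors):
    clf = SVC(kernel="rbf")          # C-SVC with RBF kernel, default parameters
    X = good_vectors + bad_vectors
    y = [1] * len(good_vectors) + [0] * len(bad_vectors)
    return clf.fit(X, y)

def combine_alignments(yawa_links, meba_links, feats, clf):
    # Union of the two alignments; the set union itself removes duplicates.
    merged = set(yawa_links) | set(meba_links)
    kept = {link for link in merged if clf.predict([link_vector(link, feats)])[0] == 1}
    # Drop a null link i-0 when token i also has a real link (the original system
    # decided such conflicts heuristically; here we simply keep the non-null link).
    real_linked = {i for (i, j) in kept if j != 0}
    return {(i, j) for (i, j) in kept if j != 0 or i not in real_linked}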

132 COWAL An integrated platform that takes two parallel raw texts and produces their alignment. Basic preprocessing modules: sentence aligner, tokenizers, lemmatizers, POS-taggers, dependency linkers, chunkers, translation-model builder; two or more comparable word aligners (YAWA, which superseded TREQ-AL, and MEBA); an alignment combiner; an XML generator (XCES-schema compliant); an alignment viewer & editor; optional modules: bilingual lexical ontologies (Ro-En aligned wordnets).

133 En-Ro Word Alignment

134

135 Word Alignment Competitions
There is a general interest in increasing WA accuracy (mainly in the Statistical Machine Translation community). International competitions (very similar to TREC, CLEF, etc.) for evaluating the state of the art in WA: 2003 NAACL in Edmonton, Canada; 2005 ACL in Ann Arbor, Michigan, USA. Although WA is far from perfect, its accuracy is rapidly improving! Our word aligners, rated the best in the two competitions, progressed by almost 10% in two years: TREQ-AL, 2003: F-measure 73.39% (highest F-measure for the Ro-En track); COWAL, 2005: F-measure 82.52% (highest F-measure for the Ro-En track). The volume of the training data was approximately the same and the texts were more difficult in 2005 => real technological improvement. In the next few years it is very likely that we will see results above 90%!

136 ACL 2005 Word Alignment Competition
J. Martin, R. Mihalcea, T. Pedersen, Word Alignment for Languages with Scarce Resources, in Proceedings of the ACL Workshop on Building and Using Parallel Texts, June 2005, Ann Arbor, Michigan, Association for Computational Linguistics, pp. 65-74. Three pairs of languages, with different quantities of training data: Inuktitut-English (1.6 Mw-3.4 Mw), Romanian-English (0.85 Mw-0.9 Mw), Hindi-English (0.07 Mw-0.06 Mw). Major differences in preparing the Gold Standard.

137 Evaluation measures Let us consider an alignment A; each link is labeled either as S(ure) or P(ossible). If we have a Gold Standard (G) with S and P links, then AS / GS represents the subset of A / G containing only Sure links. As any Sure link is also a Possible link, AP = A and GP = G.
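For reference, the standard measures used in these shared tasks (the commonly published Och & Ney / Mihalcea & Pedersen definitions, reproduced here as an aid and not copied from the slide):

\[
P = \frac{|A \cap G_P|}{|A|}, \qquad
R = \frac{|A \cap G_S|}{|G_S|}, \qquad
F = \frac{2PR}{P+R}, \qquad
AER = 1 - \frac{|A \cap G_S| + |A \cap G_P|}{|A| + |G_S|}.
\]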

138 Comparing different pairs of languages against differently constructed GS
The way AER is defined makes it difficult to compare alignment performance for pairs of languages with different strategies for building their GS. If one pair of languages has only S links and the other has both S and P links, the results (in terms of AER) will always be better for the second pair. This is because adding P links to a GS always produces a better AER for the alignments judged against such a GS. The demonstration is very simple if one observes that, in the definition of AER in (Mihalcea & Pedersen, 2003), AP means both P links and S links (any Sure link is also a Possible link). Adding P links to the GS inexorably decreases AER...!

139 Official Ranking 1. RACAI.COWAL (F=73.90%, AER=26.10%)
2. ISI.Run5.vocab.grow (F=73.45%, AER=26.55%) but correcting the obvious errors in the GS, actually: COWAL (F=79.79%, AER=21.21%) and considering our GS (a different tokenization): COWAL (F=82.52%, AER=17.48%)

140 References on Sentence and Word Alignment
Dan Tufiş, Radu Ion. Parallel Corpora, Alignment Technologies and Further Prospects in Multilingual Resources and Technology Infrastructure. In C. Burileanu, H. N. Teodorescu (eds.): Proceedings of the 4th International Conference on Speech and Dialogue Systems (SPED2007), Romanian Academy Publishing House, 2007.
Alexandru Ceauşu, Dan Ştefănescu, Dan Tufiş: Acquis Communautaire sentence alignment using Support Vector Machines. In Proceedings of the 5th LREC Conference, Genoa, Italy, May 2006, pp.
Dan Tufiş, Radu Ion, Alexandru Ceauşu, Dan Ştefănescu: Improved Lexical Alignment by Combining Multiple Reified Alignments. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL2006), Trento, Italy, 3-7 April 2006, pp.
Dan Tufiş, Radu Ion, Alexandru Ceauşu, Dan Ştefănescu: Combined Aligners. In Proceedings of the ACL2005 Workshop on "Building and Using Parallel Corpora: Data-driven Machine Translation and Beyond", Ann Arbor, Michigan, June 2005, Association for Computational Linguistics, pp.
Dan Tufiş, Ana-Maria Barbu, Radu Ion: Extracting Multilingual Lexicons from Parallel Corpora, Computers and the Humanities, Volume 38, Issue 2, May 2004, pp. 163-189.
Dan Tufiş, Ana-Maria Barbu, Radu Ion: "TREQ-AL: A word-alignment system with limited language resources", in Proceedings of the NAACL 2003 Workshop on Building and Using Parallel Texts; Romanian-English Shared Task, Edmonton, Canada, 2003, pp.
Dan Tufiş: "A cheap and fast way to build useful translation lexicons", in Proceedings of the 19th International Conference on Computational Linguistics (COLING2002), Taipei, August 2002, pp.
Dan Tufiş, Ana-Maria Barbu: "Revealing translators' knowledge: statistical methods in constructing practical translation lexicons for language and speech processing", in International Journal of Speech Technology, Kluwer Academic Publishers, no. 5, pp., 2002.

141 Lexical ontologies Princeton Wordnet (PWN)= semantic lexicon for En
PWN + SUMO + MILO + Domains = lexical ontology for En. EuroWordNet = multilingual lexical ontology with PWN as ILI. BalkaNet = multilingual lexical ontology with PWN as ILI. Recently, SentiWordNet, a subjectivity marked-up version of PWN2.0.

142 The BalkaNet project (2001-2004)
An EU funded project (IST ) for the development of a (core, approx 8000 synsets/language) multilingual semantic lexicon along the principles of EuroWordNet; Languages concerned: Bulgarian, Czech, Greek, Romanian, Serbian, Turkish EuroWordnet-BalkaNet liaison: Piek Vossen (consultant) and MemoData (industrial partner)

143 Balkanet 30 (Li->Lj) At the end of the project more than 8000 ILIs
implemented in all 6 languages (more than 336,000 virtual translation pairs of synsets; or almost 1 million word translation pairs) Most of the wordnets contained more than 18,000 synsets.

144 Main features of the BalkaNet wordnets
Compatibility with the EuroWordNet wordnets Structured ILI = PWN+BKN BKN = Balkan specific concepts Relations defined within a monolingual wordnet have precedence over the relations in the ILI (redundancy, but more expressive power) SUMO/MILO & DOMAINS which are aligned with PWN2.0 are available in the monolingual wordnets. Some monolingual wordnets (CZ, RO and BG) contain valency frames for a common number of verbal synsets A single interlingual relation (EQ-SYN); the rest may be emulated; less expressive power but extremely efficient management of the multilingual displays of entries (VISDIC)

145 Design Principles
d1) ensuring as much as possible compatibility with the EuroWordNet approaches (e.g. unstructured ILI based on Princeton WordNet) and maximisation of cross-lingual coverage
d2) synset structuring (relations) inside each wordnet (lots of redundancy, but much more powerful)
d3) keeping up with Princeton WordNet (PWN) developments
d4) ensuring conceptually dense wordnets
d5) defining a reusable methodology for data acquisition and validation (open for further development)
d6) linguistically motivated (reference language resources, with human experts actively involved in all decision making and validation)
d7) minimizing the development time and costs

146 Maximisation of the cross-lingual coverage (1)
ILI= the set of PWN synsets (labeled by their offsets in the database) taken as interlingual concepts: ( n; v; a; b) The consortium selected a common set of ILI codes to be implemented for all languages; this selection took place in three steps: BCS1 (essentially the BC set of EuroWordnet):1218 concepts BCS2: 3471 concepts BCS3: 3827 concepts

147 Maximisation of the cross-lingual coverage (2)
Selection criteria for BCS1,2,3…(8516 ILI-codes) number of languages in EuroWordNet linked to an ILI code (imperative) conceptual density: once a concept was selected, all its ancestors (nouns and verbs), up to the top level were also selected (imperative); adjectives were selected so that they would typically be related to nominal concepts in the selection (be_in_state) language specific criteria: each team proposed a set of concepts of interest and the maximum intersection set among these proposals became imperative

148 Synsets structuring (1)
At the level of each individual wordnet Common set of relations (the semantic relations) as used in the PWN Language specific relations (the lexical relations: such as derivative, usage_domain, region_domain)

149 Synsets structuring(2)
Principle of hierarchy preservation: if M1L1 H+ M2L1 and M1L1 = N1L2 and M2L1 = N2L2, then N1L2 H+ N2L2. Allows for importing taxonomic relations and checking interlingual alignments. When taxonomic relations were imported, they were hand validated.

150 Keeping up with PWN developments
When the project started, ILI was based on PWN1.5 (as EuroWordNet was). The BalkaNet ILI was updated following the new releases of PWN: PWN1.5 => PWN1.7.1, PWN1.7.1 => PWN2.0. As the automatic remapping is not always deterministic, the partners manually solved the remaining ambiguities in their wordnets.

151 Defining a reusable methodology for data acquisition and validation
Each partner developed its own specific tools for acquisition and validation, with a commonly agreed set of functionalities. These tools were documented for a lay computer user. The language-specific tools differ mainly because of the set of language resources available to each partner; depending on the available resources, each partner chose the appropriate balance between d6) and d7) => next issue.

152 Trading effort and development time for language centricity (1)
This issue has been addressed differently by each partner, basically depending on the available manpower and language resources. For instance, if relevant (encoded) electronic dictionaries (2-lang. dicts + expl. dicts + syn. dicts + antonym dicts, etc.) were available, the development effort concentrated to a large extent on interlingual equivalence mappings. This approach allowed a more language-centric development (merge model).

153 Trading effort and development time for language centricity (2)
if reliable dictionaries other than bilingual dictionaries (which every partner had) were not available (e.g. because of the reluctance of the copyright holders to release or to allow the use of their data) a translation approach of the literals in the PWN was generally followed (approximately an expand model); additional efforts were necessary in this case to check out the translated synsets as well as their language adequacy.

154 Validation methodologies
Syntactic validations (wordnet well-formedness checking) Semantic validation (word sense alignment in parallel corpora)

155 Syntactic validations
Validation of syntactically well-formed wordnets: -compliance with the dtd for the VISDIC editor. -no duplicate literals in the same synset -no sense duplications (literal&sense number) -valid set of semantic relations -no dangling nodes (conceptual density) -no loops -valid synsets identifiers … and many others

156 Consistency checking Sense conflicts (a literal & sense label occurring in two or more synsets): easy to solve (obvious human errors in sense assignment); hard to solve (they provide evidence for WordNet sense distinctions that are hard to make in other languages; hints for ILI soft clustering).

157 Cross-lingual validation of the ILI mapping
A bilingual lexicon might say TR(wL1) = w1L2, w2L2, ... (not enough). A lexical alignment process might give you contextual translation information: the mth word in language L1 (wmL1) is translated by the nth word in language L2 (wnL2): (step 1) TR-EQ(wmL1) = wnL2 (not enough, but better). One of the main objectives of the BalkaNet project (which adopted a merge-model approach) is to ensure as much overlap as possible between the concepts lexicalized in the concerned languages. A significant overlap may be hampered either by conceptually different lexical stocks for the different languages or by inconsistent projection of the monolingual concepts onto the ILI concepts. In order to achieve this objective, we propose a cross-lingual validation of the ILI mapping, based on a principle, a conjecture, that we called the hierarchy preservation principle.

158 Cross-lingual validation of the ILI mapping
A sense clustering procedure might give you information on similar senses of different occurrences of the same word: Sense(Occ(wiL1, p), Occ(wiL1, q), ...) = α (step 2); Sense(Occ(wjL2, m), Occ(wjL2, n), ...) = β; α, β = ? (sense labeling); synset(wiL1) TR-EQV synset(wjL2) (step 3); α, β are ILI codes (ideally α = β).

159 Cross-lingual validation of the ILI mapping (idealistic view)
Translation(WiL1) = WjL2 => there exist Syn1L1 and Syn2L2 such that WiL1 ∈ Syn1L1 and WjL2 ∈ Syn2L2 and EQ-SYN(Syn1L1) = EQ-SYN(Syn2L2) = ILIk.

160 Cross-lingual validation of the ILI mapping (more realistic view)
(Diagram: WiL1 in WN1 and WjLk in WN2 are linked by TR-EQ, while their synsets are mapped via EQ-SYN onto the ILI.)

161 Monolingual wordnets construction
Expand model (essentially, based on translating the PWN synsets and importing the relations) Merge model (mapping independently built synsets already related onto the best matching PWN synsets and relations) Combined model (*Expand+*Merge) Interlingual relation: EQ-SYN; the other types of interlingual relations emulated via the non-lexicalized synsets.

162 Interlingual relations emulation
years years 50years 50years EQ-HAS-HYPO(jubilee:1, jubileu:1); EQ-HAS-HYPER(jubileu:1, jubilee:1) EQ-NEAR-SYN(jubileu:1, silver jubilee:1); EQ-NEAR-SYN(jubileu:1, diamond jubilee:1) 60years 50years 25years NL1 Romanian EQ-SYN jubilee:1 ILI jubileu:1 diamond jubilee:1 silver jubilee:1 Hyper NL2 NL3 NL4 EQ-HAS-HYPO EQ-NEAR-SYN Hyper

163 RoWn Development strategy:
Merge model for the synsets and lexical relations, and expand model for the semantic relations. Facilitated by the availability of resources and tools: EXPD - XML encoding of the reference explanatory dictionary for Romanian (includes, among others, sense definitions, grammatical information, expressions, examples, synonyms, etymology, derivation (for derivatives)); SYND - XML encoding of the reference dictionary of synonyms in Romanian; an English-Romanian bilingual dictionary (automatically extracted, using WA technology, from very large parallel corpora); various statistical tools for corpus processing and information extraction; lexicographers' tools (language independent, but dependent on the annotation schema (XCES, CONCEDE) of the language resources).

164 Example of an EXPD-entry (CONCEDE schema)
<entry id="TUFIS_"> <hw>TUFIŞ</hw> <stress>TUF`IŞ</stress> <alt> <brack> <gram> nom_neut_sing_indef</gram><orth>tufiş</orth> </brack> <brack><gram> nom_neut_pl_indef</gram><orth>tufişuri</orth></brack> </alt> <pos>substantiv</pos> <gen>neutru</gen> <struc> <def>Desiş de tufe sau de arbuşti</def> <def>mulţime de copaci tineri, stufoşi</def> <syn>tufăriş, tufărie </syn> </struc> <etym> <m>tufă</m>+suf.<m>-iş</m> </etym> </entry>

165 XCES annotation in the parallel corpus <tu id="Ozz20">
<seg lang="en"> <s id="Oen "> <w lemma="the" ana="Dd">The</w> <w lemma="patrol" ana="Ncnp" sn="3" oc="Group" dom="military"> patrols</w> <w lemma="do" ana="Vais">did</w> <w lemma="not" ana="Rmp" sn="1" oc="not" dom="factotum"> not</w> <w lemma="matter" ana="Vmn" sn="1" oc="SubjAssesAttr" dom="factotum"> matter</w><c>,</c> <w lemma="however" ana="Rmp" sn="1" oc="SubjAssesAttr|PastFn" dom="factotum"> however</w><c>.</c></s></seg> <seg lang="ro"> <s id="Oro "> <w lemma="şi" ana="Crssp">Şi</w> <w lemma="totuşi" ana="Rgp" sn="1" oc="SubjAssesAttr|PastFn" dom="factotum"> totuşi</w><c>,</c> <w lemma="patrulă" ana="Ncfpry" sn="1.1.x" oc="Group" dom="military"> patrulele</w> <w lemma="nu" ana="Qz" sn="1.x" oc="not" dom="factotum"> nu</w> <w lemma="conta" ana="Vmii3p" sn="2.x" oc="SubjAssesAttr" dom="factotum"> contau</w><c>.</c></s></seg> </tu>

166 Selection criteria for the BalkaNet lexical stock
All the partners implemented the synsets in BCS1, BCS2 and BCS3 (BalkaNet Common Synsets): 8516 synsets they were selected based on the same criteria as in EuroWordNet (base concepts, number of hyponyms, position in the PWN hierarchies, etc) Partner specific criteria for other synsets BUT conceptually dense!

167 Selection criteria for the RoWn lexical stock (I)
Frequency (extracted from a 100-million-word corpus): a ranked list of more than 50,000 words (nouns, verbs, adjectives and adverbs with at least 3 occurrences); all the 5,000 words in the Frequency Dictionary of Romanian (Alphonse Juilland) were in our list. Number of senses listed in EXPD. Definition fertility (the number of definitions in EXPD containing a given word). Based on the three criteria, a list of approx. 56,000 words has been re-ranked. They were looked up in SYND & EXPD and clustered into about 51,000 candidate synsets.

168 Selection criteria for the RoWn lexical stock (II)
The 51,000 candidate Romanian synsets were ranked based on their literal ranking in the previous mentioned list; The synsets corresponding to the 8516 obligatory synsets (BCS1,2,3) were among the first 15,112 synsets computed as above. We keep up the development of the RoWn with new synsets, selected according to their rank, from this ordered list (currently approaching its end).

169 Selection criteria for the RoWn lexical stock (III)
Adding new synsets is done according to the conceptual density criterion: if SRO is aligned to SPWN, then all ancestors of SPWN must be aligned to ancestors of SRO (some of them might be NL synsets).

170 Hierarchy Preservation Principle
Notation: H - the hypernymy relation; + - the Kleene operator (H+ meaning at least one H relation); ML1 = NL2 - synset M in L1 and synset N in L2 are mapped onto the same ILI concept.

171 Hierarchy preservation principle
If M1L1 H+ M2L1 and M1L1 = N1L2 and M2L1 = N2L2, then N1L2 H+ N2L2. The hierarchy preservation principle is based on the belief that the hypernymy relation (and its opposite, hyponymy) has a language-independent definition (according to the classics, Miller, Fellbaum, etc.) and reflects a subsumption relation from a more particular linguistic concept to a more general one. So, the principle says: if synset M1 is a hypernym of synset M2 in language L1, and M1L1 = N1L2 and M2L1 = N2L2, then it is a must that N1L2 H+ N2L2, even if the chains of hypernymy relations in the two languages could be of different lengths. The difference in lengths could be induced by the existence of meanings in the chain of language L1 which are not lexicalized in language L2.
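A sketch of how the principle can be checked automatically, assuming each wordnet is given as a child-to-hypernym map and the interlingual mapping as a dict from language-specific synset ids to ILI codes (these data structures are assumptions for the illustration, not the BalkaNet storage format):

def hypernym_closure(synset, hypernym_of):
    # All ancestors of a synset via the hypernymy relation (H+).
    ancestors, current = set(), hypernym_of.get(synset)
    while current is not None and current not in ancestors:
        ancestors.add(current)
        current = hypernym_of.get(current)
    return ancestors

def check_hierarchy_preservation(pairs, hyper_l1, hyper_l2, ili_l1, ili_l2):
    # Report pairs of L1 synsets that violate the hierarchy preservation principle.
    # pairs: iterable of (m1, m2) with m1 H+ m2 in L1 (m1 is an ancestor of m2).
    # hyper_l1/hyper_l2: child -> hypernym maps; ili_l1/ili_l2: synset -> ILI code.
    l2_of_ili = {ili: syn for syn, ili in ili_l2.items()}  # assumes one L2 synset per ILI code
    violations = []
    for m1, m2 in pairs:
        n1 = l2_of_ili.get(ili_l1.get(m1))
        n2 = l2_of_ili.get(ili_l1.get(m2))
        if n1 is None or n2 is None:
            continue                     # no equivalent synsets in L2: nothing to check
        if n1 not in hypernym_closure(n2, hyper_l2):
            violations.append((m1, m2, n1, n2))
    return violations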

172 Examples of inconsistencies
(Diagram: fragments of the Romanian and English hierarchies around Mingredient/Mflavorer, Mcondiment, Mspice, Msos/Msauce, Mmuştar/Mmustard, Mketchup, Mmaioneză/Mmayonnaise, Mdafin and Maromaizant.) A cross-lingual validation using the hierarchy preservation principle would signal inconsistencies of the kind illustrated by the fragments of the Ro wordnet and WN1.5 shown in the figure. The arrows represent hyponymy relations in the two wordnets. The blue heavy lines represent translational links between the synsets in the two languages, meaning that the respective synsets are mapped onto the same ILI concept. The plum heavy line represents a translational link that is reported as wrong during the cross-validation of the two wordnets. The reason comes from the violation of the hierarchy preservation principle: the inconsistency is signaled because in RO the hierarchical (hyponym) relations MmirodenieRO H McondimentRO and MketchupRO H MsosRO are not verified in EN by the equivalent pairs of meanings (MspiceEN - McondimentEN) and (MketchupEN - MsauceEN) (in EN they are sisters).

173 Examples of inconsistencies
(Diagram, repeated on slides 173-176 as animation steps: the RO hierarchy fragment (Mingredient, Maromaizant, Mcondiment, Msos, Mmirodenie, Mmuştar, Mdafin, Mketchup, Mmaioneză) and the EN hierarchy fragment (Mingredient, Mflavorer, Mcondiment, Mspice, Msauce, Mmustard, Mketchup, Mmayonnaise), with the translational links between them.)

177 The first inconsistency
(Diagram: the RO synsets Maromaizant, Mcondiment, Msos, Mmirodenie and the EN synsets Mflavorer, Mcondiment, Mspice, Msauce.) McondimentRO = name given to some (spicy) substances of mineral, vegetal, animal or synthetic origin which, added to some alimentary products, give them a specific, enjoyable taste or flavor (DEX). McondimentEN = a powder or liquid, such as salt or ketchup, that you use to give a special taste to food (Longman). Such cases of inconsistency must be presented to the consortium in order to be resolved. We think that the first case of inconsistency is due to the wrong projection of the monolingual synsets onto the ILI concepts. Yet, any native speaker of Romanian will accept with difficulty sauces as condiments, although the DEX definition does not explicitly specify such a differentiation. So, Romanians do not consider sauce to be a condiment, while English speakers do. Given that by sauce they all understand the same thing, the only conclusion is that the Romanian synset McondimentRO is not eq-synonym with the English synset McondimentEN, so they should not be mapped onto the same ILI concept.

178 The first inconsistency - solved
(Diagram: the alignment after removing the McondimentRO - McondimentEN link, as argued on the previous slide using the DEX and Longman definitions.)

179 The second inconsistency
(Diagram: MsosRO, MketchupRO and McondimentEN, MsauceEN, MketchupEN.) 1 sense of ketchup. Sense 1: catsup, ketchup, cetchup, tomato ketchup -- (thick spicy sauce made from tomatoes) => condiment -- (a preparation (a sauce or relish or spice) to enhance flavor or enjoyment: "mustard and ketchup are condiments") => flavorer, flavourer, flavoring, flavouring, seasoner, seasoning -- (something added to food primarily for the savor it imparts) => ingredient -- (food that is a component of a mixture in cooking). In this case there is no doubt that the synsets MsosRO and MsauceEN must be mapped onto the same ILI concept, and the same holds for ketchup. But, as this excerpt from WordNet 1.5 shows, the first hypernym of ketchup is condiment and not sauce, while the gloss of the synset is "thick spicy sauce made from tomatoes". We think this permits us to conclude that something is wrong here: ketchup is a sauce!

180 The second inconsistency -solved
(Diagram: the resolved alignment for sos/sauce and ketchup, reflecting the conclusion of the previous slide that ketchup is a sauce.)

181 Successful overlapping
(Diagram: the full RO and EN hierarchy fragments, with all translational links now consistent.) Thus, we achieved a successful overlap; every portion of these fragments respects the hierarchy preservation principle, and therefore the cross-validation is positive. If this philosophy is accepted, then there are good reasons to believe that the hierarchy should stay within the ILI and not within the individual wordnets; what should stay in an individual wordnet are then only those parts of the hierarchy which are not replicated in the ILI.

182 Valency frames (I) An interesting experiment: the Czech partner gave us access to about 600 Czech verbal synsets associated with valency frames extracted from the Czech National Corpus. Via the translation equivalence relations between the Ro and Cz wordnets we imported the original valency frames and manually checked their applicability to Romanian. About 84% of the imported valency frames were valid (sometimes with minor modifications)! Only 98 valency frames needed significant modifications. Very promising. The semantic restrictions on the frame elements are wordnet-endogenous.

183 Valency frames (II) The synset v: (a_se_afla:3.1, a_se_găsi:9.1, a_fi:3.1) with the gloss "be located or situated somewhere; occupy a certain position": (nom*AG(fiinţă:1.1) | nom*PAT(obiect_fizic:1)) = prep-acc*LOC(loc:1). fiinţă:1.1 =: a living thing that has (or can develop) the ability to act or function independently; obiect_fizic:1 =: a tangible and visible entity; an entity that can cast a shadow; loc:1 =: a point or extent in space. The reading is the following: any verb in the synset v subcategorizes for two arguments: the first one, which usually precedes the verb, is an NP in the nominative case with the semantic role of either AGent or PATient, depending on the category of the filler; the second one, which usually follows the verb, is a PP in the accusative case with the semantic role of LOCation (its filler must be loc:1 or a hyponym of it).

184 Development tools (I) WnBuilder (Java)
Graphical user interface putting together all the lexical resources (EXPD, SYND, PWN, RO-EN dictionary), based on which the lexicographer selects the best matching synsets and assigns sense numbers to the literals in the RO synset. Allows for distributed and independent work of different lexicographers.

185 Development tools (II)
Distributed work makes room for inherent mapping errors; they are dealt with in a centralized way by means of WnCorrect. WnCorrect (Java) Graphical user interface putting together all the independently developed portions of the wordnet. Allows for immediate spotting of most mapping errors (two or more synsets in the target wordnet mapped onto the same PWN synset; same literal with the same sense number occurring in two or more synsets, etc.)

186

187

188

189 Development tools (III)
Relations-Import (Perl) Allows for automatic import from PWN of the semantic relations and assists the user in defining lexical relations among the target synsets Alternative solution VISDIC (Brno University); we use it as the standard multilingual viewer

190

191 Development tools (IV) WSDTool
WSDTool is a Java application whose GUI allows the user to edit the semantic annotation of an XCES-compliant corpus. In editing mode, the resources involved (translation dictionaries and wordnets) can be validated and corrected in accordance with the corpus findings. Explain what XCES means, how that format is obtained (aligner, lemmatizer and tagger) and how it is represented in XML.

192

193 Some Quantitative Data about RoWordNet (October, 2007)
Synsets: 43,302 Relations: 57,178 Literals: 67,270 SUMO/MILO labels: 39,538 (1821 concepts) DOMAINS labels: 49,563 (165 domains) Sentiment labeled synsets: 43,302 Incorporation into Alexandria (Memodata) multilingual reading-support system Incorporation into MultiWordNet

194 Some Direct Applications of These Technologies and Resources
Word Alignment (translation models building) Best results in the shared task on aligning En-Ro parallel corpora (NAACL 2003 Edmonton, Canada; ACL 2005 Ann Arbor, USA) Word Sense Disambiguation Multilingual Thesauri Alignment Semantic annotation import (valency frames, frame semantics) Terminology consistency over translated texts Opinion mining Cross-lingual QA in open domains

195 Word Alignment The alignment tool has been incorporated into a complex processing platform which put together most of the tools presented so far: sentence segmentation, tokenisation, POS tagging (monolingual) sentence alignment, word and phrase alignment, word sense disambiguation with multiple sense inventories: PWN, SUMO, Domains (bitexts) D. Tufis, A.M.Barbu, R. Ion: „TREQ-AL: A word-alignment system with limited language resources”, Proc. of the NAACL 2003 Workshop on Building and Using Parallel Texts; Romanian-English Shared Task, Edmonton, Canada, 2003, pp D. Tufis, R. Ion, A. Ceauşu, D. Ştefănescu: Combined Aligners. In Proc. of the ACL2005 Workshop on “Building and Using Parallel Corpora: Data-driven Machine Translation and Beyond”, June, 2005, Ann Arbor, Michigan, June, Association for Computational Linguistics, pp D. Tufis, R. Ion, A. Ceauşu, D. Ştefănescu: Improved Lexical Alignment by Combining Multiple Reified Alignments. In Proc. of the EACL, Trento, 3-7 April, 2006.

196 En-Ro Word Alignment

197 Word Sense Disambiguation
Setting: parallel texts. Uses: word alignment (the major part of the WSD task in our setting); WordNet (as ILI) and the Xx-Wn (for an En-Xx bitext); SUMO, Domains. Model: if <WL1 WL2> is a link, then at least one sense of WL1 and one sense of WL2 must be closely conceptually related.

198 Parallel Corpora We made several experiments on three parallel corpora, all of them represented in the same format (Multext-East XCES-ANA). They are tokenized, POS-tagged, lemmatized and sentence-aligned. "1984": corpus based on Orwell's novel; contains 9 parallel translations (BG, CZ, ET, GR, HU, RO, SL, SR, TR) plus the EN original; the validation procedure is currently applied on all the bitexts pairing BalkaNet languages with English (BG-EN, CZ-EN, GR-EN, RO-EN, SR-EN and TR-EN) for the validation of the respective wordnets. "VAT": corpus based on the Sixth VAT Directive (77/388/EEC); contains three sentence-aligned language variants (EN, NL, FR). NAACL parallel corpus (EN, RO). Similar work is ongoing with a much larger corpus (JRC-Acquis): 22 languages, more than 50,000,000 words per language; ever growing.

199 Parallel Corpus Representation
<tu id="Ozz.113"> <seg lang="en"> <s id="Oen "> <w lemma="Winston" ana="Np">Winston</w> <w lemma="be" ana="Vais3s">was</w> ... </s> </seg> <seg lang="ro"> <s id="Oro "> <w lemma="Winston" ana="Np">Winston</w> <w lemma="fi" ana="Vmii3s">era</w> ... <seg lang="cs"> <s id="Ocs "> <w lemma="Winston" ana="Np">Winston</w> <w lemma="se" ana="Px---d--ypn--n">si</w> ... ... </tu>

200 Traditional WSD: monolingual data
Various conceptual problems (continuum vs. discrete senses, how many senses a word has, independence of the intended application - the right level of granularity, etc.). A typical classification of WSD solutions: Unsupervised (with raw or pre-processed texts): the cheapest way; sense inventories are established ad-hoc, depending on applications. Supervised (hand-annotated WSD training data): expensive, requires lots of hand-annotated data; the sense inventory is dictated by the one used by the human annotators of the training data. Knowledge-based (based on MRD dictionaries, ontologies, domain taxonomies); they can be built on either supervised or unsupervised methods; the sense inventory is biased towards the one used in the supporting KS.

201 WSD based on parallel texts and aligned wordnets (I)
A very different approach from the one used for monolingual data, with fewer conceptual problems. The first approaches to using parallel corpora for WSD ( ) were overshadowed by a lack of interest, by virtue of the lack (in those days) of sufficient parallel data (this cannot be an argument nowadays). Relatively cheap (provided aligned wordnets exist): a mixture of unsupervised and KB approaches; the sense inventory is biased by the interlingual index. Given that these approaches can use any of the methods of monolingual WSD systems, but additionally have access to an invaluable extra KS (the translators' linguistic decisions), any decent implementation of a WSD system based on parallel texts and aligned wordnets is "doomed" to be more accurate than any competing monolingual system.

202 WSD based on parallel texts and aligned wordnets (II)
The advantage of using an interlingual index (PWN, the backbone of the aligned wordnets) is manifold: PWN has been aligned with various other conceptual structures (SUMO, MILO, DOMAINS, various domain-specific ontologies, topic signatures, synset clusters, etc.) which become available at no cost in the other aligned wordnets; multiple sense inventories thus become available for any WSD application in the languages of a multilingual wordnet; multilingual R&D can benefit, in a much more controlled and principled way, from any advancement achieved in the interlingually connected languages.

203 WSD based on parallel texts and aligned wordnets(III)
Our approach: a mixture of unsupervised and KB approaches with multiple processing steps: word alignment of the parallel corpora (COWAL) and translation equivalents extraction (translation model); sense labeling using (BalkaNet): the Princeton word-sense inventory, the SUMO/MILO ontology, the IRST Domains classes and EXPD labels, based on aligned wordnets (covers ~75% of the word occurrences in a corpus); sense clustering based on translation equivalents extracted from parallel corpora (takes care of the cases not covered by the previous step); generation of the WSD annotation in the parallel corpus.

204 WSD MAIN STEPS 1.Word Alignment & Filtering of the Translation Equivalents
The word alignment system (COWAL) produces N-M cross-POS alignment links with an average F-measure of more than 80% (F=82.52%, AER=17.48%). Only the translation links preserving the major POS (V, N, A, R) are retained; in this case F is better than 92% (F=92.04%, AER=7.96%).

205 Word Sense Disambiguation

206 WSD MAIN STEPS 2. Sense Labeling
Aligned wordnets (lexical ontologies); conceptual knowledge structuring (upper and mid-level ontologies): SUMO/MILO; domain taxonomies (UDC, the librarians' taxonomy): IRST-DOMAINS; the Explanatory Dictionary of Romanian (3 sense levels). Coverage heuristic: if one of the words in a translation pair is not a member of any synset in the respective language wordnet, but the other word is present in the aligned wordnet and, moreover, is monosemous, the first word gets the sense given by the monosemous word. If one of the languages is English, any other language can benefit from this heuristic (approx. 80% of the PWN literals are monosemous). Ex: hilarious <-> hilar => ENG a; burp <-> râgâi => ENG v; prospicience <-> clarviziune => ENG n
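A sketch of the aligned-wordnet sense-labeling idea, including the monosemy coverage heuristic; the wordnets are assumed to be given as dicts from a literal to the set of ILI codes of the synsets containing it (an assumed representation, not the BalkaNet storage format), and the ILI codes below are toy values:

def label_link(src_word, tgt_word, src_wn, tgt_wn):
    # Return the candidate ILI codes for an aligned pair <src_word, tgt_word>.
    # src_wn, tgt_wn: dict literal -> set of ILI codes of the synsets containing it.
    src_senses = src_wn.get(src_word, set())
    tgt_senses = tgt_wn.get(tgt_word, set())
    common = src_senses & tgt_senses
    if common:
        return common                      # senses supported by both wordnets
    # Coverage heuristic: one word missing from its wordnet, the other monosemous.
    if not src_senses and len(tgt_senses) == 1:
        return tgt_senses
    if not tgt_senses and len(src_senses) == 1:
        return src_senses
    return set()                           # left for the clustering step

# Illustrative usage with toy ILI codes
en_wn = {"lamp": {"ILI-lamp-1", "ILI-lamp-2"}, "hilarious": {"ILI-hilarious-1"}}
ro_wn = {"lampă": {"ILI-lamp-2"}}
print(label_link("lamp", "lampă", en_wn, ro_wn))       # {'ILI-lamp-2'}
print(label_link("hilarious", "hilar", en_wn, ro_wn))  # {'ILI-hilarious-1'}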

207 Example (I): <lamp lampă>
PWN2.0 (lamp) = { n, n}; RoWN (lampă) = { n, n}. ILI = n => <lamp(2) lampă(1)>; SUMO( n) = +Device; DOMAINS( n) = furniture. (Diagram: the WN1 and WNk synsets of W(i)L1 and w(j)Lk are linked by TR-EQV and mapped via EQ-SYN onto the ILI concept ILIk, which carries the SUMO and Domains labels.)

208 Example (II): <lamp felinar>
PWN2.0 (lamp) = { n, n}; RoWN (felinar) = { n}. δ( n, n) = 0.5; δ( n, n) = 0.125. ILI = n => <lamp(1) felinar(1.1)>; SUMO( n) = IlluminationDevice; DOMAINS( n) = factotum.

209 Example (III): <contain conţine>
PWN2.0 (contain) = { v, v, v, v, v, v}; RoWN (conţine) = { v, v, v, v, v}. contain:2 ( v) EQ-SYN conţine:1.1 <-verb_group-> contain:5 ( v) EQ-SYN conţine:1.1.1. SUMO( v, v) = contains; SUMO( v) = part; (disjointRelation contains part).

210 Example (IV): <contain conţine>
SUMO definition for contains (The relation of spatial containment for two separable objects): (subrelation contains partlyLocated) (instance contains SpatialRelation) (instance contains AsymmetricRelation) (domain contains 1 SelfConnectedObject) (domain contains 2 Object) (subclass SelfConnectedObject Object) (<=> (contains ?OBJ1 ?OBJ2) (exists (?HOLE) (and (hole ?HOLE ?OBJ1) (properlyFills ?OBJ2 ?HOLE)))) ILI= v => <contain(2) conţine(1.1)>

211 WSD MAIN STEPS 3.Word Sense Clustering
(Dendrogram of the resulting sense clusters omitted.) An agglomerative, hierarchical algorithm using a vector-space model, Euclidean distance and cardinality-weighted computation of the centroid (the "center of weight" of a new class). The undecidable problem of how many classes to use gets hints from the work already done in steps 1 and 2.
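A compact sketch of agglomerative clustering with Euclidean distance and cardinality-weighted centroids, as described above; the vector representation of the word occurrences (e.g. built from their translation equivalents) and the stopping distance are assumptions:

import math

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerative_clustering(vectors, max_distance):
    # Merge the closest clusters until no pair is closer than max_distance.
    # vectors: list of equal-length feature vectors (one per word occurrence).
    # Returns a list of clusters, each a list of occurrence indices.
    clusters = [[i] for i in range(len(vectors))]
    centroids = [list(v) for v in vectors]
    while len(clusters) > 1:
        dist, i, j = min(((euclid(centroids[i], centroids[j]), i, j)
                          for i in range(len(clusters))
                          for j in range(i + 1, len(clusters))),
                         key=lambda t: t[0])
        if dist > max_distance:
            break
        # Cardinality-weighted centroid of the merged cluster.
        ni, nj = len(clusters[i]), len(clusters[j])
        centroids[i] = [(ni * x + nj * y) / (ni + nj)
                        for x, y in zip(centroids[i], centroids[j])]
        clusters[i].extend(clusters[j])
        del clusters[j], centroids[j]
    return clusters

occurrences = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(agglomerative_clustering(occurrences, max_distance=0.5))  # two clusters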

212 Evaluation (I) "Lexical sample" and 1-tag annotation evaluation (with k-tag, the performance would essentially be that of the filtered word alignment, i.e. 92.04%). 216 English ambiguous words (at least two senses/POS), with 2081 occurrences in "1984", were semantically disambiguated by three experts in terms of the PWN2.0 sense inventory. The experts negotiated all the disagreement cases, thus resulting in the Gold Standard annotation (GS); this is the "lexical sample/lexical choice" evaluation type of SENSEVAL (much harder than "all words", which includes monosemous words and homographs as well). For each PWN2.0 sense number, the GS was deterministically enriched with the SUMO category and the DOMAINS label. Thus, we had three sense inventories in the GS and could evaluate the system's WSD accuracy in terms of each of them.

213 Evaluation (II) Automatic WSD was performed three ways:
using only the RO-EN aligned BalkaNet wordnets (AWN); combining AWN with clustering (AWN+C); combining AWN+C with a simple heuristic (AWN+C+MFS). Out of the 2081 occurrences, 61 (34 words) could not receive a sense tag, because the target literal was wrongly aligned by the translation-equivalents extractor module of WSDtool, or because it was not translated or was wrongly translated by the human translator. In these cases we used MFS, a simple heuristic assigning the most frequent sense label (42 occurrences were correctly tagged).

214 WSD based on WN2.0+RoWN (PWN2.0 id)
Evaluation (III): WSD based on WN2.0+RoWN (PWN2.0 sense ids)

  WSD annotation   Precision   Recall    F-measure
  AWN              84.86%      62.22%    71.80%
  AWN+C            80.29%      77.94%    79.10%
  AWN+C+MFS        79.96% (precision = recall = F-measure at full coverage)

215 Evaluation depends on the sense inventories
PWN2.0+RoWN (AWN+C+MFS):

  SENSE INVENTORY                 PRECISION = RECALL
  PWN meanings ( categories)      79.96%
  SUMO/MILO (2066 categories)     86.54%
  IRST DOMAINS (163 categories)   93.46%

216 WSD MAIN STEPS 4. WSD annotation in the parallel corpus
<tu id="Ozz.1"> <seg lang="en">... </seg> <seg lang="ro">... </seg>... </tu>

217 Thesauri Alignment Eurovoc (the multilingual thesaurus used for indexing the Acquis Communautaire corpus). En version: the reference. Ro version: partial, unmapped (about 600 terms missing, some problematic translations). Alignment (translation equivalents (lemma-based) have been extracted from Acq-Com): a) full topological matching => translation-equivalence checking (editable); b) partial topological matching => select the identity & translation-equivalence suggestions (semi-automatic: a human selects from the system's suggestions and edits the translations). Dan Ştefănescu, Dan Tufiş: "Aligning multilingual thesauri", in Proceedings of LREC 2006, Genoa, Italy.

218 Topological Alignment (full)

219 Topological Alignment (partial)

220 Term Translations in Parallel Texts: Discovery and Consistency Check - Background
FF-POIROT (IST ): consistent multilingual lexicalization of ontological concepts and relations, ensuring a common understanding of the legal stipulations. The domain area is VAT: the VAT 6th Directive of the EEC (77/388/EEC of 17 May 1977). Different cross-country interpretations of the VAT directive favour fiscal fraud.

221 A sample of XCES-Ana Aligned Encoding
TITLE I – INTRODUCTORY PROVISIONS Hoofdstuk I : Inleidende bepalingen Titre I er : Dispositions introductives <tu id="Ozz.21"> <seg lang="en"><s id="Oen.21"><w lemma="title" ana="Vm">TITLE</w> <w lemma="I" ana="Pp">I</w> <c>-</c> <w lemma="introductory" ana="Adj">INTRODUCTORY</w> <w lemma="provision" ana="Nc">PROVISIONS</w> </s></seg> <seg lang="nl"><s id="Onl.21"><w lemma="hoofd#stuk" ana="Nc">Hoofdstuk</w> <w lemma="i" ana="M">I</w> <c>:</c> <w lemma="inleiden" ana="A">Inleidende</w> <w lemma="bepaling" ana="Nc">bepalingen</w> </s></seg> <seg lang="fr"><s id="Ofr.21"><w lemma="titre" ana="Nc_sg">Titre</w> <w lemma="i" ana="M">I</w> <w lemma="er" ana="Nc_sg">er</w> <c>:</c> <w lemma="disposition" ana="Nc_pl">Dispositions</w> <w lemma="introductif" ana="Adj_pl">introductives</w> </s></seg> </tu>

222 VAT Corpus Overview
LANGUAGE             EN      FR      NL
No. of occurrences   41722   45458   40594
No. of word forms    3473    3961    3976
No. of lemmas        2641    2755    3165
Additional resource: a list of EN terms, manually extracted by an expert in VAT legislation from the English variant of the VAT directive; 1043 (inflected) forms; after lemmatization and duplicate removal, only 900 terms remained.

223 [Screenshot with callouts: the POS of the French word "membres" (common noun, plural) and its English translation equivalent ("member"); the English and the French sentences of the Ozz.2 translation unit; English words and their French translations; the main functions]

224 Finding Multiword Terms in a Parallel Corpus
A) Extract the 1-1 translation equivalents;
B) Using a "witness" monolingual term collection, identify the term translations in the other parts of the parallel corpus. This exploits the distribution of the indexes of the aligned words and defines a target span of the text where the candidates are looked for; the identified spans are checked for the longest common sequence of translations. The ranking of the possible translation equivalents is based on the DICE score and takes into account the number of words in the source term, the number and adjacency score of the translated words from the source term, and the length of the target candidate (a simplified sketch of this projection follows below).
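A much-simplified sketch of this projection and ranking, in Python and with hypothetical data structures; the real WSDtool module also uses the longest common sequence of translations and a length factor, which are omitted here:

def dice(a, b):
    """Dice coefficient between two sets."""
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0

def candidate_translation(term_idx, alignment, tgt_tokens):
    """Project a source multiword term onto the target sentence.

    term_idx   -- positions of the term's words in the source sentence
    alignment  -- list of (src_pos, tgt_pos) 1-1 word-alignment links
    tgt_tokens -- the target sentence as a list of tokens
    Returns (target candidate, score) or None if nothing is aligned.
    """
    term_idx = set(term_idx)
    tgt_idx = sorted(t for s, t in alignment if s in term_idx)
    if not tgt_idx:
        return None
    lo, hi = tgt_idx[0], tgt_idx[-1]
    # penalise candidates whose aligned words are scattered (low adjacency)
    adjacency = len(tgt_idx) / (hi - lo + 1)
    # DICE between the source term and the source words aligned into the span
    coverage = dice(term_idx, {s for s, t in alignment if lo <= t <= hi})
    return " ".join(tgt_tokens[lo:hi + 1]), adjacency * coverage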

225 Translation equivalents:
EX: Community transit procedure = procédure de transit communautaire
Translation equivalents:
Community → communautaire (cross-part-of-speech equivalent)
transit → transit
procedure → procédure
Main motivation for this process: checking consistency in cross-lingual term usage, ensuring a correct projection of a multilingual terminological database onto the concepts of a language-independent ontology.

226

227

228 A (rough) evaluation of the French and Dutch Terms
Number of English terms: 900
Number of French terms: 871; Recall: 96.73%; Precision: %
Number of missed French multiword terms due to data preprocessing errors: 18 (the other 11 missed terms were single-word terms, occurring only once)
Number of Dutch terms: 861; Recall: 95.67%; Precision: ?? (no idea!)
Number of missed Dutch multiword terms due to data preprocessing errors: 32

229 Hypotheses for text mining (term discovery):
a) a multiword term has a simple constituent structure;
b) if a "witness" term is translated in different ways in other languages, its usage in those languages is not terminologically "clean";
c) a "witness" term which is not systematically translated in the same way into the other languages is probably not a proper term;
d) a significant term should re-occur in a representative document.
Considering these hypotheses reasonable, we developed a multilingual term discovery tool, thus removing the requirement for a witness monolingual term glossary.

230 Term discovery with translation equivalents
1. for each language in the parallel corpus we extract (by means of NSP, using log-likelihood scoring and DICE-score ranking) the statistically meaningful collocations;
2. to loosen the effect of data sparseness, lemmatisation is needed;
3. lists of stop-words and 18 regular expressions describing the term constituent structure (e.g. a term cannot begin with a number, a term cannot contain a conjunction, etc.) filter out parasitic high-scored n-grams (a hedged sketch follows below).
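Purely for illustration, a sketch of steps 1 and 3 using NLTK's collocation module instead of NSP (NSP is the tool actually used here); the two filters below stand in for the 18 regular expressions of the real system:

import re
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

def collocations(lemmas, stopwords, min_freq=3, top=500):
    """Return bigram collocation candidates over a lemmatised text,
    ranked by log-likelihood (DICE is also available as measures.dice)."""
    measures = BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(lemmas)
    finder.apply_freq_filter(min_freq)
    starts_with_digit = re.compile(r"^\d")
    # illustrative constituent-structure filters: drop candidates with
    # stop-words, a leading number, or a conjunction in second position
    finder.apply_ngram_filter(
        lambda w1, w2: w1 in stopwords or w2 in stopwords
        or starts_with_digit.match(w1) or w2 in {"and", "or", "but"})
    return finder.nbest(measures.likelihood_ratio, top)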

231
4. With each such list of monolingual collocations, used as a source "witness term collection", the translations into the other (target) languages are extracted as before.
5. If the translations in the target languages are also found in the language-specific collocation lists, they are taken to represent terms; with N languages in the parallel corpus, we have N lists of monolingual collocations and N*(N-1) bilingual translation-equivalence extraction exercises (L1 source with L2 target is not the same as L2 source with L1 target; they reinforce each other).

232 Evaluation of the term miner
Based on collocational analysis and grammar well-formedness rules (we used 18 rules for English).
Several terms in the human-extracted VAT term list were not compliant with our grammar well-formedness rules: out of the 755 multiword terms, only 357 passed the well-formedness filters.
144 terms were fully discovered, 79 were discovered partially and 144 were not found at all; the term extractor also found about 1500 candidates which, in our view, could be terms.

233 Knowledge Induction/Transfer
dependency relations
word senses
valency frames
semantic relations: paradigmatic (wordnet) and syntagmatic (framenet)

234 Annotation Import: the Romance FrameNet initiative
We started translating the SemCor corpus (about 1000 sentences so far), word-aligned the Ro-En bitext and WSDed the Romanian part; this complements the Italian initiative (MultiSemCor).
Based on the alignment, the annotations were imported from English into Romanian (no evaluation done yet).
The results (outdated now) can be seen at:

235 Dependency Relation Transfer
Setting: an En-Ro bitext in which the English part is parsed with the FDG parser and the Romanian part is linked with the CLAM linker; the two parts of the bitext are word- and dependency-aligned.
Import the orientation and labelling of an English dependency relation if the English relation is properly aligned with the corresponding Romanian link, i.e. if the paired words of the English dependency relation are aligned with the paired words of the Romanian link (a sketch follows below).
The filter component controls the import and labelling (e.g. translations of active voice as passive voice).
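A minimal sketch of the import rule, assuming 1-1 word-alignment links and leaving out the filter component that handles divergences such as active/passive translations (all names are hypothetical):

def transfer_dependencies(en_deps, ro_links, align):
    """Project English dependency labels onto Romanian links.

    en_deps  -- set of (head, dependent, label) over English token positions
    ro_links -- set of unlabeled (a, b) links over Romanian token positions
    align    -- dict: English position -> Romanian position (1-1 links only)
    Returns labeled Romanian dependencies for the cases where both ends
    of an English relation are aligned to an existing Romanian link.
    """
    ro_deps = []
    for head, dep, label in en_deps:
        h, d = align.get(head), align.get(dep)
        if h is None or d is None:
            continue                       # unaligned word: nothing to transfer
        if (h, d) in ro_links or (d, h) in ro_links:
            ro_deps.append((h, d, label))  # import orientation and label
    return ro_deps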

236 Tagged, Lemmatized and Parsed English side of the “1984” multilingual corpus
1  The      the      det:>2    @DN+       2+,Dd
2  hallway  hallway  subj:>3   @SUBJ      1+,Ncns
3  smelt    smell    main:>0   @+FMAINV   1+,Vmis
4  of       of       phr:>3    @ADVL      5+,Sp
5  boiled   boil     attr:>6   @A+        1+,Afp
6  cabbage  cabbage                       1+,Ncns
7  and      and      cc:>6     @CC        31+,Cc-n
8  old      old      attr:>9   @A+        1+,Afp
9  rag      rag      attr:>10  @A+        1+,Ncns
10 mats     mat                           1+,Ncnp
$.
In order to provide maximally accurate source data for knowledge induction and transfer, the EN annotated data has been hand-validated.

237 Dependency & Chunking Annotation

238 Dependency Relations Transfer Cases
1. Perfect transfer
2. Transfer with amendments
3. Language-specific phenomena
4. Impossibility of transfer

239 1. Perfect Transfer

240 2. Transfer with amendments
The dummy anticipatory 'It' is the subject and 'book' is the complement in English; yet in Romanian, 'carte' is the subject.

241 3. Language Specific Phenomena
Pro-drop phenomenon

242 4. Impossibility of transfer (I)
Equivalent verbs with different syntactic behaviour: ‘like’ – ‘plăcea’

243 Valency frames (I)
A preliminary experiment: the Czech partner gave us access to about 600 Czech verbal synsets associated with valency frames extracted from the Czech National Corpus.
Via the translation equivalence relations among the Ro-Cz wordnets we imported the original valency frames and manually checked their applicability to Romanian.
About 84% of the imported valency frames were valid (sometimes with minor modifications)! Only 98 valency frames needed significant modifications. Very promising.
The semantic restrictions on the frame elements are wordnet-endogenous.

244 Valency frames (II)
The synset v: (a_se_afla:3.1, a_se_găsi:9.1, a_fi:3.1), with the gloss "be located or situated somewhere; occupy a certain position", has the frame:
(nom*AG(fiinţă:1.1) | nom*PAT(obiect_fizic:1)) = prep-acc*LOC(loc:1)
fiinţă:1.1 =: a living thing that has (or can develop) the ability to act or function independently
obiect_fizic:1 =: a tangible and visible entity; an entity that can cast a shadow
loc:1 =: a point or extent in space
The reading is the following: any verb in the synset v subcategorizes for two arguments.
The first one, which usually precedes the verb, is an NP in the nominative case, with the semantic role of either AGent or PATient, depending on the category of the filler.
The second one, which usually follows the verb, is a PP in the accusative case, with the semantic role of LOCation (its filler must be loc:1 or a hyponym of it). A sketch of the frame as a data structure follows below.
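A minimal sketch of how such a frame could be represented and checked programmatically; the data structures and the is_hyponym_of test are assumptions for illustration, not the actual encoding used in the project:

from dataclasses import dataclass

@dataclass
class Argument:
    case: str             # morphological case, e.g. "nom" or "prep-acc"
    role_by_filler: dict  # selectional restriction (synset) -> semantic role

@dataclass
class ValencyFrame:
    verbs: tuple          # literals of the verbal synset
    arguments: tuple      # Argument instances

# The frame from the slide, transcribed by hand:
be_located = ValencyFrame(
    verbs=("a_se_afla:3.1", "a_se_găsi:9.1", "a_fi:3.1"),
    arguments=(
        Argument("nom", {"fiinţă:1.1": "AG", "obiect_fizic:1": "PAT"}),
        Argument("prep-acc", {"loc:1": "LOC"}),
    ),
)

def role_of(arg, filler, is_hyponym_of):
    """Return the semantic role licensed for a filler synset, or None.
    `is_hyponym_of(a, b)` is assumed to test (reflexive) wordnet hyponymy."""
    for restriction, role in arg.role_by_filler.items():
        if is_hyponym_of(filler, restriction):
            return role
    return None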

245 Opinion Mining
Goal: to assess the overall sentiment of an opinion holder with respect to a subject matter
Different granularities (document, sentence)
Identify the opinionated sentences (OpinionFinder) and the opinion holder
Select those referring to the subject matter of interest
Classify the opinionated sentences on the subject matter, according to their polarity (positive, negative, undecided) and force.

246 SentiWordNet
Andrea Esuli, Fabrizio Sebastiani. SentiWordNet: A Publicly Available Lexical Resource for Opinion Mining, LREC 2006
The basic assumptions: words have graded polarities along two orthogonal axes: Subjective-Objective (SO) & Positive-Negative (PN)
The SO and PN polarities depend on the particular sense of a given word (context)

247

248 © Andrea Esuli 2005 – andrea.esuli@isti.cnr.it

249 State of the art
Monolingual research: more and more numerous, and in more and more languages
Multilingual comparative studies (different comparable text-data, different languages): not very many, but their number is increasing
We are not aware of cross-lingual studies (parallel texts). Why? Possible answers:
The original opinions are those expressed in the source language; the target language contains (presumably faithful) translations of the holders' opinions;
Expressing opinions is a cultural matter: most translations are concerned with preserving the factual content

250 Questions (Case 1)
Assume a collection of original documents in Japanese, SJP, and two translations of it into English, TEN1 and TEN2, the first translation done by a Japanese native with a perfect command of English and the second by an American with a perfect command of Japanese.
Would opinions in SJP and TEN1 be "the same"?
Would opinions in SJP and TEN2 be "the same"?
Would opinions in TEN1 and TEN2 be "the same"?
("the same" = number of opinionated sentences, polarity)
Answers: No idea! Possible guesses: YES?, YES?, YES?

251 Questions (Case 2)
Assume a collection of original documents in Japanese, SJP, containing reports (newspaper articles, news agency briefs, official statements) on specific international events, and a collection of documents in English, SEN, containing reports of similar lengths, from corresponding sources, on the same international events.
Would opinions in SJP and SEN be "the same"?
("the same" = number of opinionated sentences, polarity)
Answer: Probably not! Why? Due to cultural differences. For instance (cf. Kim & Myaeng, NTCIR 2007), "a sentence in Japanese, reporting on a merge of two companies should be judged to have negative sentiment whereas the same kind of activities in the US would be a positive event".

252 Opinion analysis across languages NTCIR6
David Kirk Evans, Lun-Wei Ku, Yohei Seki, Hsin-Hsi Chen, Noriko Kando (2007)
Case 1.5 experiments (comparable texts) in Japanese, English and Chinese (the English translations were probably done by Japanese and Chinese employees of the local news agencies)
Japanese data ( ): Mainichi Daily News, Yomiuri
English data ( ): Mainichi Daily News, Korea Times, Xinhua
Chinese data ( ): United Daily News, China Times, China Time Express, China Daily News, etc.
Language   Topics   Documents   Sentences   Opinionated (lenient/strict)   Relevant (lenient/strict)
Chinese    32       843         8546        62% / 25%                      39% / 16%
English    28       439         8528        30% / 7%                       69% / 37%
Japanese   30       490         12525       29% / 22%                      64% / 49%

253 What does this experiment show?
Big differences across languages in the Gold Standards
Despite using similar approaches, big differences in the performance of the competing systems with respect to the processed language (best in Chinese, worst in English)
Are these differences explained by the existing differences in the annotation? Partly! Annotator training could be a better explanation (big differences between the annotators in the three languages)
Language and cultural differences probably also mattered significantly

254 Why so much interest in subjectivity analysis?
Social, community-oriented websites and user-generated content are becoming an extremely valuable source of information for the everyday information consumer, but also for various kinds of decision makers.
Two main types of textual information on the web: facts (objective) and opinions (subjective)
Current search engines search for facts, not opinions (the current search-ranking strategy is not appropriate for opinion search/retrieval)

255 Word-of-mouth on the web is sometimes perceived as more trustworthy than the regular mass-media sources!
In user-generated content (review sites, forums, discussion groups, blogs etc.) one can find descriptions of personal experiences and opinions on almost anything:
valuable for common people in practical daily decisions (buying products or services, going to a movie/show, travelling somewhere, finding opinions on political topics or on various events...);
valuable for professional decision makers in many areas, so they support this trend.

256 Feature-based opinion and sentiment analysis
The building blocks: sentiment words
Words become sentiment words in context
The bag-of-words approach works pretty badly (but it works!) and there are various, possibly expensive, ways to improve opinion and sentiment analysis
Syntax and punctuation (usually discarded) also play an important role in judging the subjectivity of a piece of text.

257 Resources
Polarity is a matter of context; lexical resources, however, give you only prior polarities.
Sentence/phrase polarity is compositional, based on prior polarities which can be altered by valence shifters (such as negation).
Most researchers came to the conclusion that prior polarity is a matter of word senses!
Having good resources creates the premises for building accurate opinion miners:
Princeton WordNet 2.0
SUMO&MILO
SENTIWORDNET
Domains

258 Annotating WordNet for prior polarities
Starting with a set of words hand-annotated for their prior polarities, most sentiment resources are built by applying ML techniques and inducing prior polarities for the lexical items stored in lexico-semantic repositories. As WordNet is a highly praised repository of this kind, its structure and content are, not surprisingly, the backbone of such enterprises.

259 Word-Senses and Subjectivity
SentiWordNet associates subjectivity scores (P, N, O) with WordNet synsets, i.e. with the word-senses.
Lexical semantics is very important here. WSD would be highly instrumental (e.g. JW&RM).
Dependency linking (which is less than parsing, but easier to obtain) is much more appropriate than bag-of-words.

260 Cross-lingual opinion and sentiment analysis
A parallel text (EN-XX), e.g. Orwell's "1984", MultiSemCor, Euro-Parl, JRC-Acquis etc.
Word-align and WSD the EN-XX bitext
Use a scoring method for the senti-words and valence-shifter words in each part of the bitext (based on the SentiWordNet scores) to classify the opinionated sentences
Try to answer Question 1
Evaluate monolingually (both in EN and XX) whether the mark-ups hold true; for EN you might use OpinionFinder (Wiebe, Riloff et al.) and compare its classification with the SentiWordNet-based classification
Write immediately a breakthrough paper (whatever the results of the evaluation)

261 What would you need to do it?
Quality multilingual lexical and sentiment marked-up resources (multilingual lexical and sentiment ontologies are probably the best)
A list of valence shifters and rules defining their scope and their effect on sentiment words (Polanyi & Zaenen, 2006)
Preprocessing tools (sentence aligners, tokenizers, POS taggers, lemmatizers, dependency linkers)
Alignment tools (e.g. COWAL)
WSD tools (WSDTool, SynWSD)
A sentence opinion scorer and classifier
Annotation transfer tools

262 English Lexical and Sentiment Ontology (ELSO)
Princeton WordNet 2.0 (Fellbaum)
SUMO/MILO (Niles, Pease)
DOMAINS (Magnini, Cavaglià)
SentiWordNet (Esuli, Sebastiani)
=> English Lexical & Sentiment Ontology (ELSO)

263 Multilingual Lexical and Sentiment Ontologies (MLSO)
English Lexical & Sentiment Ontology (ELSO)
BalkaNet wordnets (see D. Tufiş (ed.), ROMJIST Special Issue on BWN)
EuroWordNet wordnets (see P. Vossen (ed.), CHUM Special Issue on EWN)
=> Multilingual Lexical & Sentiment Ontology (MLSO)
The EuroWordNet and BalkaNet multilingual wordnets use the Princeton WordNet as the InterLingual Index (ILI) => any sentiment and ontological mark-up in PWN is available in the aligned monolingual wordnets; altogether they make up an MLSO.

264 The encoding of a (sentiment) synset in RoWordNet
<SYNSET>
 <ID>ENG n</ID>
 <BCS>3</BCS>
 <DOMAIN>factotum</DOMAIN>
 <SUMO>SubjectiveAssessmentAttribute<TYPE>+</TYPE></SUMO>
 <POS>n</POS>
 <SYNONYM>
  <LITERAL>bine<SENSE>16</SENSE></LITERAL>
  <LITERAL>bun<SENSE>51</SENSE></LITERAL>
  <LITERAL>virtute<SENSE>2</SENSE></LITERAL>
 </SYNONYM>
 <DEF>Înclinaţie statornică specială către un anumit fel de îndeletniciri sau acţiuni frumoase.</DEF>
 <ILR>ENG n<TYPE>hypernym</TYPE></ILR>
 <ILR>ENG n<TYPE>near_antonym</TYPE></ILR>
 <SENTIWN><P>0.75</P><N>0</N><O>0.25</O></SENTIWN>
</SYNSET>
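Assuming the XML layout shown above, a short sketch of how the literals and the prior polarities could be read out (xml.etree is a stock Python module; the function name is ours, not part of the RoWordNet tools):

import xml.etree.ElementTree as ET

def senti_scores(synset_xml):
    """Extract the literals and the (P, N, O) prior polarities from one <SYNSET>."""
    syn = ET.fromstring(synset_xml)
    literals = [lit.text for lit in syn.iter("LITERAL")]   # e.g. ['bine', 'bun', 'virtute']
    swn = syn.find("SENTIWN")
    p, n, o = (float(swn.find(tag).text) for tag in ("P", "N", "O"))
    return literals, (p, n, o)                             # e.g. (..., (0.75, 0.0, 0.25))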

265

266 How much trust can one have in such an approach?
Pretty high, but where there are different viewpoints, one had better harmonize them.
Domains: psychological_features, psychology, quality, military etc.
SUMO/MILO: EmotionalState, Happiness, PsychologicalProcess, SubjectiveAssessmentAttribute, StateOfMind, TraitAttribute, Unhappiness, War etc.
The SENTIWN, DOMAINS, SUMO&MILO and General Inquirer annotations should, intuitively, match! Do they? Not really!

267 Some statistics
2637 synsets labeled with the SUMO concepts SubjectiveAssessmentAttribute or EmotionalState have the SentiWN annotation P:0, N:0, O:1.
E.g. (SAA): prosperously, impertinently, best, victimization, oppression, honeymoon, prettify, beautify, curse, threaten, justify, waste, cry, experience...
E.g. (ES): unsatisfactorily, lonely, sordid, kindness, disappointment, frustration...

268 Some statistics (cont'd)
28434 synsets are marked for subjectivity. Many of them are questionably marked, e.g.:
Abstract, BodyPart, Building, Device, EngineeringComponent (most of the time with negative polarity), FieldOfStudy (nonparametric statistics is bad: <P>0.0</P><N>0.625</N><O>0.375</O>, while gastronomy is much better: <P>0.5</P><N>0.0</N><O>0.5</O>)
Happiness: "happy, pleased", which is similar to glad(1), is not good (<P>0.0</P><N>0.75</N><O>0.25</O>)
Human instances (both real persons and literary characters)
LinguisticExpression (extralinguistic is very bad: <P>0.0</P><N>0.75</N><O>0.25</O>)
PrimeNumber (<P>0.0</P><N>0.375</N><O>0.625</O>)
Proposition (conservation of momentum: <P>0.0</P><N>0.25</N><O>0.75</O>)
DiseaseOrSyndrome (influenza, flu, grippe: <P>0.75</P><N>0.0</N><O>0.25</O>)
Prison (jail is not bad at all, it's even a little fun: <P>0.25</P><N>0.0</N><O>0.75</O>)
etc.

269 Some other statistics
We took Wiebe's hand-crafted lists of positive-polarity and negative-polarity words (based on the General Inquirer):
the PolPman file contains 657 words; the PolMman file contains 679 words.
We extracted all the synsets in PWN2.0 containing the literals in Wiebe's files:
the PwNPolPman file contains 2305 synsets: 817 synsets are marked as entirely objective (O:1), 239 synsets have non-positive subjectivity (P:0), 486 synsets have P ≥ 0.5 (corresponding to 293 literals);
the PwPolMman file contains 1803 synsets: 461 synsets are marked as entirely objective (O:1), 213 synsets have non-negative subjectivity (N:0), 656 synsets have N ≥ 0.5 (corresponding to 356 literals).

270 Why does this happen?
Assuming WordNet structuring is perfect & assuming the SUMO&MILO classification is perfect & assuming Wiebe's polarity lists are perfect:
Taxonomic generalization does not always work: nightmare is bad! A nightmare is a dream, but a dream is not bad (per se). An emotion is something good (P:0.5) and so is love, but hate and envy are not!
Glosses are full of valence shifters (bag-of-words is not sufficient):
honest, honorable: not disposed to CHEAT- or DEFRAUD-, not DECEPTIVE- or FRAUDULENT-
intrepid: invulnerable to FEAR- or INTIMIDATION-
superfluous: serving no USEFUL+ purpose; having no EXCUSE+ for being
Majority voting is democratic, but not the best solution.

271 Why does this happen (cont'd)?
But Wiebe's polarity lists are not perfect & the SUMO&MILO classification is not perfect & the WordNet structuring is not perfect.

272 Sentence Subjectivity Scorer
A very naïve implementation: for each sentence, in each language, sum the P, N and O figures of each (scored) word and normalize (a sketch follows below).
He has(1) no(1) merits(1).
  word scores: P:0.0;N:0.0;O:1.0 / P:0.25;N:0.25;O:0.5 / P:0.625;N:0.0;O:0.375
  Sentence_1 score: P:0.292;N:0.083;O:0.625
He has(1) all the merits(1).
  Sentence_2 score: P:0.208;N:0.0;O:0.792
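A sketch of this naive scorer; the averaging reproduces the slide's figures, and which triple belongs to which word is left to the caller:

def sentence_scores(word_scores):
    """Average the per-word (P, N, O) prior polarities over a sentence.

    word_scores -- list of (P, N, O) triples for the scored words
    (content words with a SentiWordNet entry; unscored words are skipped).
    """
    n = len(word_scores)
    return tuple(round(sum(w[i] for w in word_scores) / n, 3) for i in range(3))

# Reproducing the slide's first example ("He has(1) no(1) merits(1)."):
print(sentence_scores([(0.0, 0.0, 1.0), (0.25, 0.25, 0.5), (0.625, 0.0, 0.375)]))
# -> (0.292, 0.083, 0.625)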

273 What’s wrong with this naïve scorer?
It doesn't consider the valence shifters.
The stuff(1) was_like nitric_acid(1)... P:0;N:0;O:1 => P:0.5;N:0.5;O:0
... had a _sensation of being hit(4) on the back_of_the_head(1) with a rubber(1) club(3).
With valence shifters considered, either the SO polarity or the PN polarity, or both, are switched.
Sentence_1 score: P:0.063;N:0.563;O:0.375
Now this is in line with OpinionFinder!
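For illustration, a crude sketch of a negation-type valence shifter that switches the positive/negative polarity of the next scored word; the shifter list and the one-word scope are assumptions, not the actual CONAN rules:

NEGATIONS = {"no", "not", "never", "n't"}

def apply_shifters(tokens, scores):
    """Swap the P and N scores of the word in the scope of a negation.

    tokens -- the sentence tokens
    scores -- parallel list of (P, N, O) triples, or None for unscored tokens
    Very crude scope model: a negation affects the next scored word only.
    """
    out = list(scores)
    for i, tok in enumerate(tokens):
        if tok.lower() in NEGATIONS:
            for j in range(i + 1, len(tokens)):
                if out[j] is not None:
                    p, n, o = out[j]
                    out[j] = (n, p, o)   # switch the positive/negative polarity
                    break
    return out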

274 Exploiting SentiWordNet
Connotation analyzer (CONAN): given a sentence, check whether it may carry an unwanted subjective/objective connotation (and to what extent, if any).
The experimental data: SemCor (English and its Romanian translation), tagged, lemmatized, chunked, word-aligned and WSDed.

275 A Translation Unit from the
En-Ro SEMCOR parallel corpus (tagged, lemmatized, chunked, word aligned and WSDed)

276 Three runtime modes:
a) Get the most objective reading: for each word, the sense with the highest O score is selected
b) Get the most positive subjective reading: for each word, the sense with the highest P score is selected
c) Get the most negative subjective reading: for each word, the sense with the highest N score is selected
If one of the three sentence scores is significantly higher than the others, there is no risk of inducing an unwanted connotation; otherwise, one can spot the word(s) whose senses determined the subjectivity-polarity variation and perhaps make other lexical choices.
The system works on tagged and lemmatized texts, whether WSDed or not. If the words are WSDed, CONAN returns the Subjectivity-Objectivity scores computed for the already assigned senses (the same scores, irrespective of the runtime mode). A sketch of the sense-selection logic follows below.
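A sketch of the sense selection behind the three modes, using the SentiWordNet interface shipped with NLTK; note that NLTK bundles SentiWordNet 3.0, aligned to PWN 3.0, whereas the slides use PWN 2.0, so this only illustrates the mode logic, not CONAN itself:

from nltk.corpus import sentiwordnet as swn  # needs the 'sentiwordnet' and 'wordnet' NLTK data

MODES = {"objective": lambda s: s.obj_score(),
         "positive":  lambda s: s.pos_score(),
         "negative":  lambda s: s.neg_score()}

def reading(lemma, pos, mode="objective"):
    """Pick the sense of `lemma` (with WordNet POS tag) that maximises the
    requested prior polarity; returns (synset name, P, N, O) or None."""
    senses = list(swn.senti_synsets(lemma, pos))
    if not senses:
        return None
    best = max(senses, key=MODES[mode])
    return best.synset.name(), best.pos_score(), best.neg_score(), best.obj_score()

# e.g. reading("boil", "v", "negative") vs. reading("boil", "v", "objective")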

277 What else could CONAN be good for?
Janyce Wiebe & Rada Mihalcea (ACL 2006) state the following hypothesis: instances of subjective senses are more likely to occur in subjective sentences, so sentence subjectivity is an informative feature for the WSD of words with both subjective and objective senses.
Computing S-O sentence polarity is a cheap process with pretty high accuracy. If the sentence's S-O polarity is known, this might be a very strong clue for sense disambiguation:
Subjective sentence: He was boiling with anger => boil(5): ENG v; anger(3): ENG n
Objective sentence: The water boiled => boil(1): ENG v

278

279 Personal opinions
SentiWordNet is one of the best resources for Opinion Mining; we can bring evidence that, via aligned wordnets, it works cross-linguistically (almost) as well as for English (see my talk on Thursday).
It should be supported by a concerted validation effort; the different synset labellings pertaining to subjectivity should be reconciled.
Multilingual experiments could bring strong evidence for prior-polarity assignment to lexical items.
The argument structure of the verb (and of the deverbal noun) is essential in finding out who or what is good or bad.
The polarity of several adjectives and adverbs (modifiers) is head-dependent: long- response time vs. long+ life battery; high- pollution vs. high+ standard.
Prior polarities should be assigned only to head-independent modifiers; for the others, WN relations such as typically-modifies or is-characteristic-of, carrying attached prior polarities, would be very useful.

280 Language Web Services
This has just started; it was fostered by the need to cooperate more closely with our partners at UAIC in various projects (ROTEL, CLEF, LT4L, etc.).
Currently we have added basic text processing for Romanian and English: tokenisation, tiered tagging, lemmatization, RoWordNet (SOAP/WSDL/UDDI).
Some others, for parallel corpora (sentence aligner, word aligner, dependency linker, etc.), will be there soon.

281

282

283

284

285

286

287

288

289

290

291 CONCLUSIONS: Availability
The basic tools for preprocessing Romanian texts (sentence splitting, tokenisation, lemmatization and tagging) are already available as web services (SOAP/WSDL/UDDI).
Some others, for parallel corpora (sentence aligner, word aligner, dependency-links aligner, etc.), will be there soon...
Some reference corpora for English (1984, Brown, NAACL-news) were re-tagged and freely distributed.

292 Thank you!

