Presentation on theme: "SPED 2007, IaşiDan Tufis, Radu Ion1 Parallel Corpora, Alignment Technologies and Further Prospects in Multilingual Resources and Technology Infrastructure."— Presentation transcript:
Parallel Corpora, Alignment Technologies and Further Prospects in Multilingual Resources and Technology Infrastructure
Dan TUFIŞ, Radu ION
Research Institute for Artificial Intelligence, Romanian Academy
SPED 2007, Iaşi
Parallel corpora
– More and more data is available (Hansard, EuroParl, JRC-Acquis), in the range of tens of millions of tokens per language.
– These corpora contain a lot of implicit multilingual knowledge on lexicons, word senses, grammars, collocations, idioms, phraseology, etc.
– This knowledge, once revealed, is fundamental in supporting cross-lingual and cross-cultural studies, communication and cooperation.
– Computer applications: cross-lingual comprehension aids, machine(-aided) translation, evaluation of machine translation, language learning aids, cross-language information retrieval, cross-language question answering, etc.
Corpora alignment
Fully exploiting such linguistic information sources requires parallel corpora alignment. (High-accuracy alignment frequently requires basic language pre-processing: e.g. sentence splitting, tokenization, POS-tagging and lemmatization, chunking, dependency linking/parsing, and word sense disambiguation.)
– Sentence alignment
– Phrase alignment
– Word alignment
Immediate outcomes: translation lexicons, translation memories, translation models, annotation transfer facilities, cross-lingual induction facilities, support for evidence-based cross-linguistic studies, etc.
Reified Alignments (1)
– A bitext alignment is a set of lexical token pairs (links), each of them characterized by a feature structure.
– By merging two or more comparable alignments of the same bitext and using a trained link classifier, one can obtain a better alignment.
– COWAL is a wrapper/merger of the alignments produced by the independent word aligners YAWA and MEBA.
– The classifier’s decisions are based entirely on the links’ feature structures, and the improbable links (competing or not) are removed from the union of the initial alignments.
Reified Alignments (2)
Features characterizing a link
– The feature values are real numbers in the [0,1] interval.
– Context-independent features (CIF) refer to the tokens of the current link: cognate, translation equivalents (TE), POS-affinity, “obliqueness”, TE entropy.
– Context-dependent features (CDF) refer to the properties of the current link with respect to the rest of the links in a bitext: strong and/or weak locality, number of links crossed, collocations.
– Based on the values of a link’s features, we compute for each possible link a global reliability score, which is used to license (or not) the link in the final result.
Translation equivalents (TE)
– YAWA uses an external bilingual lexicon (TREQ + RO & EN wordnets).
– MEBA uses GIZA++ generated candidates filtered with a log-likelihood threshold (11).
– For a pair of languages, translation equivalents are computed in both directions. The value of the TE feature of a candidate link is 1/2 (P_TR(TOKEN1, TOKEN2) + P_TR(TOKEN2, TOKEN1)).
Translation Entropy Score (ES)
– The entropy of a word's translation equivalents distribution proved to be an important hint for identifying highly reliable links (anchoring links).
– Skewed distributions are favored over uniform ones.
– For a link (A, B), the feature value is 0.5 (ES(A) + ES(B)).
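The two features on this slide can be sketched in a few lines of Python. The TE feature follows the slide directly. The slide does not give the definition of ES, so the sketch assumes a normalized-entropy form, ES(W) = 1 - H(W) / log N over the N translation equivalents of W, chosen only because it scores skewed distributions higher than uniform ones. The function names are hypothetical.

```python
import math

def te_feature(p_src_to_tgt, p_tgt_to_src):
    """TE feature from the slide: mean of the two directional translation probabilities."""
    return 0.5 * (p_src_to_tgt + p_tgt_to_src)

def entropy_score(translation_probs):
    """ASSUMED form of ES: 1 - H / log(N), so skewed distributions score close to 1.

    translation_probs: probabilities (or weights) of a word's translation equivalents.
    """
    probs = [p for p in translation_probs if p > 0]
    n = len(probs)
    if n <= 1:
        return 1.0  # a single translation equivalent is maximally skewed
    total = sum(probs)
    entropy = -sum((p / total) * math.log(p / total) for p in probs)
    return 1.0 - entropy / math.log(n)

def es_feature(probs_a, probs_b):
    """Link feature from the slide: 0.5 * (ES(A) + ES(B))."""
    return 0.5 * (entropy_score(probs_a) + entropy_score(probs_b))
```

Under this assumed definition a word with one dominant translation contributes a score near 1, and a word whose translations are equally likely contributes a score near 0.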
Cognates (COGN)
T_S = α_1 α_2 … α_k; T_T = β_1 β_2 … β_m
If α_i and β_j are the matching characters, δ(α_i) is the distance (in characters of T_S) from the previous matching α, and δ(β_j) is the distance (in characters of T_T) from the previous matching β, then [the COGN formula is missing from the transcript]
Part-of-speech affinity (PA)
– Translated words tend to keep their part of speech, and when they have different POSes, this is not arbitrary.
– The information was computed from a gold standard (GS2003), in both directions (source-target and target-source).
– For a link (A, B): PA = 0.5 * (P(cat(A)|cat(B)) + P(cat(B)|cat(A)))
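As a minimal sketch of the PA feature, the conditional probabilities can be estimated by relative frequency over the POS pairs of gold-standard links. The relative-frequency estimate, the input format and the function name are assumptions; the slide only says the probabilities were computed from GS2003.

```python
def pos_affinity(cat_a, cat_b, gold_links):
    """PA = 0.5 * (P(cat(A)|cat(B)) + P(cat(B)|cat(A))).

    gold_links: iterable of (source_pos, target_pos) pairs, one per gold-standard link
    (hypothetical format). Probabilities are plain relative frequencies.
    """
    links = list(gold_links)
    pair = sum(1 for s, t in links if (s, t) == (cat_a, cat_b))
    n_a = sum(1 for s, _ in links if s == cat_a)  # links whose source POS is cat_a
    n_b = sum(1 for _, t in links if t == cat_b)  # links whose target POS is cat_b
    p_b_given_a = pair / n_a if n_a else 0.0
    p_a_given_b = pair / n_b if n_b else 0.0
    return 0.5 * (p_a_given_b + p_b_given_a)

# Toy gold standard: three noun links, one of which maps to a verb, and one verb link.
toy_gold = [("N", "N"), ("N", "N"), ("N", "V"), ("V", "V")]
noun_noun_affinity = pos_affinity("N", "N", toy_gold)
```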
Collocation
– Bi-gram lists (content words only) were built from each monolingual part of the training corpus, using the log-likelihood score (threshold of 10) and a minimal occurrence frequency (3) for candidate filtering. Collocation probabilities are estimated for each surviving bi-gram.
– If neither token of a candidate link has a relevant collocation score with the tokens in its neighborhood, the value of this feature for the link is 0. Otherwise the value is 1. For YAWA, competing links (starting or finishing in the same token) are licensed if and only if at least one of them has a non-null collocation score.
Obliqueness
– Each token on both sides of a bitext is characterized by a position index, computed as the ratio between its position in the sentence and the length of the sentence. The absolute value of the difference between the two tokens’ position indexes, subtracted from 1, gives the link’s “obliqueness”, OBL.
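The obliqueness feature described above amounts to OBL = 1 - |i/len(S) - j/len(T)|. A minimal sketch, assuming 1-based token positions (the slide does not say whether counting starts at 0 or 1) and hypothetical function names:

```python
def position_index(position, sentence_length):
    """Ratio between a token's position (assumed 1-based) and the sentence length."""
    return position / sentence_length

def obliqueness(src_pos, src_len, tgt_pos, tgt_len):
    """1 minus the absolute difference of the two tokens' position indexes."""
    return 1.0 - abs(position_index(src_pos, src_len) - position_index(tgt_pos, tgt_len))
```

A link between tokens at the same relative position (for example token 2 of 4 and token 3 of 6) gets the maximal value 1, and the value decreases as the link becomes more "oblique".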
Locality
– When the dependency chunking module is available, and the chunks are aligned via the linking of their constituents, new candidate links starting in one chunk should finish in the aligned chunk (strong locality).
Strong locality: EM and CLAM combined linkers
– We modified the EM algorithm of IBM-1 to work on a “bitext” that contains the source sentence and a replica of it as the target:
» disregard the NULL alignment links
» disregard words that are in the same position
– LAM was introduced by Yuret, D. (1998). Discovery of linguistic relations using lexical attraction. PhD thesis, Dept. of Computer Science and Electrical Engineering, MIT (subject to a planarity restriction).
– Constrained LAM (CLAM): a link is rejected if it does not pass any of the linking rules of a language, for instance number agreement.
– When dependency chunking is not available, locality is judged in a variable-length window whose size depends on the length of the currently aligned sentences (weak locality).
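The modified IBM-1 training mentioned above can be illustrated with a small EM loop in which every sentence is paired with a copy of itself, no NULL word is added, and a word is never linked to its own position. This is only a sketch of the idea under those two stated restrictions, not the authors' implementation; the function name, the uniform initialisation and the number of iterations are assumptions.

```python
from collections import defaultdict

def monolingual_ibm1(sentences, iterations=5):
    """IBM Model 1 EM on sentences paired with themselves.

    Returns t[(target_word, source_word)], an estimate of P(target_word | source_word).
    No NULL source word is used, and position j is never linked to position j.
    """
    t = defaultdict(lambda: 1.0)  # uniform (unnormalised) starting point
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for sent in sentences:
            for j, tgt in enumerate(sent):
                # Candidate sources: every other position of the same sentence.
                sources = [src for i, src in enumerate(sent) if i != j]
                norm = sum(t[(tgt, src)] for src in sources)
                if norm == 0:
                    continue  # one-word sentence: nothing to link to
                for src in sources:
                    frac = t[(tgt, src)] / norm  # E-step: expected link count
                    count[(tgt, src)] += frac
                    total[src] += frac
        # M-step: renormalise the expected counts per source word.
        t = defaultdict(float, {pair: c / total[pair[1]] for pair, c in count.items()})
    return t

toy_sentences = [["black", "cat"], ["black", "dog"], ["white", "cat"]]
probs = monolingual_ibm1(toy_sentences)
```

The resulting table can be read as lexical attraction scores between words of the same language, from which dependency-like links could then be proposed.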
Weak Locality
– When chunking/dependency link information is not available, the link localization is judged against a window containing m links.
– The value of m depends on the length of the aligned sentences.
– The window is centered on the candidate link.
[Figure: source tokens s1 s2 … sm above target tokens t1 t2 … tm, with the window centered on the candidate link]
Combining classifiers
– If multiple classifiers are comparable, and if they do not make similar errors, combining their classifications is always better than the individual classifications.
COWAL
An integrated platform that takes two parallel raw texts and produces their alignment.
– Basic modules: collocations detector, tokenizers, lemmatizers, POS-taggers, two or more comparable word aligners (YAWA, MEBA), GIZA++ translation model builder, alignment combiner.
– Optional modules: sentence aligner, dependency linkers, chunkers and bilingual dictionaries (Ro-En aligned wordnets).
– The platform also includes an XML generator (XCES schema compliant), an alignment viewer and editor, and a WSD module based on WA and aligned wordnets.
Combining the Alignments
COWAL filters the union of the alignments.
– The filtering is done by an SVM classifier (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) trained on our version of the GS2005 (for positive examples) and on the differences between the basic alignments (YAWA, MEBA) and the GS2005 (for negative examples).
– The SVM classifier (LIBSVM; Fan et al., 2005) uses the default parameters: C-SVC classification (soft-margin classifier) and an RBF (Radial Basis Function) kernel.
– Features used for training (10-fold validation; about 7000 good examples and 7000 bad examples): TE(S,T), TE(T,S), OBL(S,T), LOC(S,T), PA(S,T), PA(T,S).
– The links labeled as incorrect were removed from the merged alignments.
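The merge-then-filter step can be sketched as a set union followed by a per-link classification. In the slides the classifier is an SVM trained with LIBSVM; the sketch below swaps in a toy threshold rule so that it stays self-contained, and the link representation, feature vectors and function names are all hypothetical.

```python
def merge_and_filter(alignment_a, alignment_b, link_features, is_correct):
    """Union two alignments, then keep only the links the classifier accepts.

    alignment_a, alignment_b: sets of (source_index, target_index) links.
    link_features: dict mapping each link to its feature vector.
    is_correct: callable standing in for the trained link classifier.
    """
    union = set(alignment_a) | set(alignment_b)
    return {link for link in union if is_correct(link_features[link])}

# Toy alignments from two aligners and made-up feature vectors in [0, 1].
yawa_links = {(0, 0), (1, 2)}
meba_links = {(0, 0), (1, 1), (2, 2)}
features = {
    (0, 0): [0.9, 0.8, 1.0],
    (1, 1): [0.7, 0.6, 1.0],
    (1, 2): [0.1, 0.0, 0.8],
    (2, 2): [0.6, 0.7, 1.0],
}

def toy_classifier(vector):
    # Stand-in for the SVM decision: accept when the mean feature value is at least 0.5.
    return sum(vector) / len(vector) >= 0.5

merged = merge_and_filter(yawa_links, meba_links, features, toy_classifier)
```

Keeping the classifier behind a plain callable means a trained model could replace the toy rule without touching the merging logic.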
Heuristics for improving the alignment (1)
– The words left unaligned in the previous step may get links via their aligned dependents (HLP: Head Linking Projection heuristics): if b is aligned to c and b is linked to a, link a to c, unless there exists a d in the same chunk as c, linked or not to it, and the POS category of d has a significant affinity with the category of a.
– Alignment of sequences of words surrounded by the aligned chunks.
– Filtering out improbable links (e.g. links that cross many other links).
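The HLP rule can be sketched as follows. The data structures, the affinity function and the 0.5 threshold for "significant affinity" are assumptions made for illustration; the slide does not say what happens to a when such a d exists, so the sketch simply leaves a unaligned in that case.

```python
def head_linking_projection(alignment, dependencies, source_pos, target_pos,
                            target_chunk_of, affinity, threshold=0.5):
    """Project links to unaligned source words via their aligned dependency partners.

    alignment: dict source_index -> target_index (existing links, b -> c).
    dependencies: list of (a, b) source index pairs, meaning a is linked to b.
    source_pos, target_pos: POS tags per token; target_chunk_of: chunk id per target token.
    affinity: callable (source_tag, target_tag) -> value in [0, 1] (e.g. a PA score).
    """
    new_links = {}
    for a, b in dependencies:
        if a in alignment or b not in alignment:
            continue  # a must be unaligned and b aligned
        c = alignment[b]
        # Block the projection if another token d of c's chunk fits a's category well.
        blocked = any(
            d != c
            and target_chunk_of[d] == target_chunk_of[c]
            and affinity(source_pos[a], target_pos[d]) >= threshold
            for d in range(len(target_pos))
        )
        if not blocked:
            new_links[a] = c
    return new_links

def toy_affinity(source_tag, target_tag):
    # Made-up affinity: identical tags attract strongly, different tags weakly.
    return 1.0 if source_tag == target_tag else 0.1

# Source "the house" (DET N), with "house" aligned to target token 0 and "the" linked to "house".
projected = head_linking_projection({1: 0}, [(0, 1)], ["DET", "N"], ["N"], [0], toy_affinity)
# Same case, but the target chunk also holds a determiner, which blocks the projection.
blocked_case = head_linking_projection({1: 0}, [(0, 1)], ["DET", "N"], ["N", "DET"], [0, 0], toy_affinity)
```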
Heuristics for improving the alignment (2)
Unaligned chunks surrounded by aligned chunks get a probable phrase alignment:
SL ↔ TL
Ws_i ↔ Wt_j
Ws_k ↔ Wt_m
Ws_p Ws_p+1 … ↔ Wt_q Wt_q+1 …
Dependency chunks & Translation Model
– Regular expressions defined over the POS tags and dependency links
– Non-recursive chunks
– Chunk alignment based on their aligned constituents (one or more)
Exploiting the alignments (1)
Applying the same methodology and the same assumption (two aligned words MUST have at least one cross-lingual equivalent meaning, i.e. an ILI code):
Aligned wordnets validation (“1984”)
– Identifying the wrong ILI sense mappings
– Identifying missing synsets from the commonly agreed set of synsets (BCS1, BCS2, BCS3, …)
Extending the wordnets (“Ro-En-SemCor”)
– Identifying missing literals in the existing aligned synsets
– Automatically adding new synsets (monosemous literals, instances)
WSD (arbitrary Ro-En bitexts)
Exploiting the alignments (2)
Annotation transfer as a cross-lingual collaboration task
– “1984” parallel corpus; word-aligned; the English part was dependency-parsed (Wolverhampton), then validated and corrected (Univ. A.I. Cuza, Iaşi); the Romanian part imported the parsing.
No   Rel     RO    Lost   EN    Acc
1    qn      10    0      12    83.3%
2    neg     10    0      13    76.9%
3    oc      3     0      4     75.0%
4    dat     3     0      4     75.0%
5    cnt     8     0      11    72.7%
6    ad      25    0      35    71.4%
7    pcomp   218   9      316   71.0%
8    det     126   173    355   69.2%
9    comp    70    1      112   63.0%
10   attr    151   4      245   62.7%
11   cc      94    2      155   59.4%
12   pm      44    1      75    58.5%
13   obj     79    2      137   58.5%
14   mod     114174 (RO, Lost and EN digits fused in the transcript)   55.4%
15   cla     8     0      15    53.3%
16   tmp     23    9      46    50.0%
17   man     16    0      32    50.0%
18   subj    121   72     319   48.9%
Exploiting the alignments (3)
Collocations analysis in a parallel corpus
– Large parallel corpus (Acq-Com)
– University Marc Bloch from Strasbourg, IMS Stuttgart University and RACAI independently extracted the collocations in Fr, Ge, Ro and En (hub).
– We identified the equivalent collocations in the four languages.
SURE-COLLOC_X = COLLOC_X ∩ TR_X-COLLOC_Y (EQ1)
» member states, European Communities, international treaty, etc.
INT-COLLOC_Z = COLLOC_Z \ SURE-COLLOC_Z (EQ2)
» adversely affect ↔ a aduce atingere (a word-for-word translation would be “to bring a touch”)
» legal remedy ↔ cale de atac (a word-for-word translation would be “way to attack”)
» to make good the damage ↔ a compensa daunele (a word-for-word translation would be “to compensate the damages”)
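EQ1 and EQ2 are plain set operations, which a short sketch makes concrete. It reads the operator in EQ1 as a set intersection and TR_X-COLLOC_Y as the word-for-word translations into language X of the collocations extracted for language Y; both readings, the function names and the toy data are assumptions.

```python
def sure_collocations(colloc_x, translated_colloc_y):
    """EQ1: collocations of X that are also word-for-word translations of Y's collocations."""
    return set(colloc_x) & set(translated_colloc_y)

def int_collocations(colloc_z, sure_colloc_z):
    """EQ2: collocations of Z that are not in the SURE set."""
    return set(colloc_z) - set(sure_colloc_z)

# Toy English collocations and toy translations into English of Romanian collocations.
colloc_en = {"member states", "legal remedy", "adversely affect"}
tr_en_colloc_ro = {"member states", "way to attack", "to bring a touch"}

sure_en = sure_collocations(colloc_en, tr_en_colloc_ro)
int_en = int_collocations(colloc_en, sure_en)
```

The second set is where collocations such as "legal remedy" end up: their equivalents in the other language are not word-for-word translations.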
Language Web Services
– This work has just started; it was fostered by the need to cooperate more closely with our partners at UAIC, University of Texas, University of Strasbourg and University of Stuttgart in various projects (ROTEL, CLEF, LT4L, AUF, etc.).
– Currently we have added basic text processing for Romanian and English: tokenisation, tiered tagging, lemmatization (SOAP/WSDL/UDDI).
– Other services, for parallel corpora (sentence aligner, word aligner, dependency linker, RoWordNet, etc.), will be available soon.
Initiatives Towards Language Infrastructures
– Global Wordnet Association
– Language Grid
– CLARIN (including DAMLR)
Major goal: the construction and operation of a shared distributed infrastructure that aims at making language resources and technology available to anybody.
– An infrastructure has to offer persistent services that allow users to operate on language resources and technologies, with high availability and proper security.
– The automatic processing of language material is of a complexity that cannot be tackled with the current fragmented approaches.
– What is needed, primarily, is to turn existing, fragmented technology and resources into accessible and stable services so that users can use them the way they want.