Yuliya Morozova, Institute for Informatics Problems of the Russian Academy of Sciences, Moscow.


1 Yuliya Morozova, Institute for Informatics Problems of the Russian Academy of Sciences, Moscow

2 Distributional semantics
A new area of linguistic research: inferring semantic properties of linguistic units from corpora.
Theoretical foundations: distributional methodology by Z. Harris, F. de Saussure, L. Wittgenstein.
Distributional hypothesis: semantically similar words occur in similar contexts.
J. R. Firth: “You shall know a word by the company it keeps”.

3 Vector space
drink coffee: occurred 1 time
drink tea: occurred 2 times

4 Cosine measure of vector similarity
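The formula on this slide did not survive in the transcript. For reference, the cosine measure of two vectors is their dot product divided by the product of their Euclidean norms. A minimal Python sketch follows; the function name `cosine`, the toy vectors, and the convention of returning 0 for an all-zero vector are illustrative assumptions, not taken from the slides.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: treat an all-zero vector as similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)

# Toy vectors (hypothetical counts, not data from the presentation).
print(cosine([1, 2], [2, 4]))  # same direction: value close to 1
print(cosine([1, 0], [0, 1]))  # no shared dimension: 0.0
```

The measure depends only on the direction of the vectors, not on their length, so a frequent and a rare word can still score as similar.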

5 Main application areas
lexical ambiguity resolution
information retrieval
dictionaries of semantic relations
multilingual dictionaries
semantic maps of different domains
modelling of synonymy
document topic detection
sentiment analysis

6 The present research
Goal: to apply distributional semantics models to the extraction of translation correspondences from a parallel corpus.
Vector space model + test corpus

7 Test corpus
Patent texts in French translated into Russian
Texts split into sentences
Alignment at the sentence level, manually verified (in the visual editor MakeBilingua)
Uploaded to the Sketch Engine corpus manager

8 Preprocessing
Lemmatization
Frequent words removed (prepositions, conjunctions, etc.)
Punctuation marks removed
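The three preprocessing steps can be sketched roughly as below. This is only an illustration: the slides do not name the lemmatizer or the stop list, so `LEMMAS` and `STOP_WORDS` are tiny hypothetical stand-ins, and the input sentence is a made-up surface form chosen to yield the lemmas shown in the presentation's example. Punctuation is stripped first here simply because it makes tokenization easier.

```python
import string

# Hypothetical lemma table and stop list; a real pipeline would use a
# proper lemmatizer and a fuller list of function words.
LEMMAS = {
    "présente": "présent",
    "concerne": "concerner",
    "liants": "liant",
    "minéraux": "minéral",
    "hydrauliques": "hydraulique",
}
STOP_WORDS = {"la", "le", "les", "des", "de", "et", "en"}
PUNCTUATION = string.punctuation + "«»“”…"

def preprocess(sentence):
    """Return the content lemmas of a sentence as a list of strings."""
    tokens = [t.strip(PUNCTUATION) for t in sentence.lower().split()]
    tokens = [t for t in tokens if t]              # drop empty tokens
    lemmas = [LEMMAS.get(t, t) for t in tokens]    # lemmatize by lookup
    return [l for l in lemmas if l not in STOP_WORDS]

# Hypothetical input sentence (not quoted from the corpus).
sentence = "La présente invention concerne des liants minéraux, notamment hydrauliques."
print(preprocess(sentence))
```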

9 Vector space model
type of linguistic units: single words;
type of context: aligned regions;
frequency measure: Boolean frequency (equal either to 1 or 0);
method used to compute the distance between vectors: cosine measure.

10 Example (aligned region as a context)
Aligned region #1
French: présent invention concerner liant minéral notamment hydraulique
Russian: настоящий изобретение касаться неорганический связующий частность гидравлический связующий

11 Example (vector space)
Aligned region   #1   #2   #3
présent          1    …    …
invention        1    …    …
concerner        1    …    …
настоящий        1    …    …
изобретение      1    …    …
касаться         1    …    …
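A Boolean vector space of this kind can be built as in the following sketch, where each word gets one dimension per aligned region, set to 1 if the word occurs in that region and 0 otherwise. Region 1 reuses the lemmas from the example; region 2 and the helper name `boolean_vectors` are hypothetical additions for illustration.

```python
# Each aligned region is a pair: (French lemmas, Russian lemmas).
# Region 1 follows the slide example; region 2 is made up.
regions = [
    (["présent", "invention", "concerner", "liant", "minéral",
      "notamment", "hydraulique"],
     ["настоящий", "изобретение", "касаться", "неорганический",
      "связующий", "частность", "гидравлический", "связующий"]),
    (["liant", "hydraulique", "comprendre", "ciment"],
     ["гидравлический", "связующий", "содержать", "цемент"]),
]

def boolean_vectors(regions, side):
    """Map each word on one side (0 = French, 1 = Russian) to a 0/1 vector."""
    vocab = sorted({word for region in regions for word in region[side]})
    return {
        word: [1 if word in region[side] else 0 for region in regions]
        for word in vocab
    }

fr = boolean_vectors(regions, 0)
ru = boolean_vectors(regions, 1)
print(fr["invention"])   # [1, 0]
print(fr["liant"])       # [1, 1]
print(ru["связующий"])   # [1, 1]  (repeated occurrences still count as 1)
```

Because both languages share the same region index, a French vector and a Russian vector have the same length and can be compared directly with the cosine measure.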

12 Results
A list of translation correspondences.
Linguistic filter: the same part of speech.
Precision: 78%.
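The slides do not say how candidate pairs are selected or how precision was measured. One plausible reading, sketched below, is to pair each French lemma with the Russian lemma of the same part of speech whose vector has the highest cosine, then count the share of extracted pairs confirmed by a manually checked list. All vectors, POS tags, the gold list, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

# Hypothetical data: lemma -> (POS tag, Boolean vector over 3 aligned regions).
fr = {
    "invention": ("NOUN", [1, 1, 0]),
    "concerner": ("VERB", [1, 0, 0]),
    "liant": ("NOUN", [1, 0, 1]),
}
ru = {
    "изобретение": ("NOUN", [1, 1, 0]),
    "касаться": ("VERB", [1, 0, 0]),
    "связующий": ("NOUN", [1, 0, 1]),
}

def best_correspondences(fr, ru, same_pos=True):
    """For each French lemma, pick the highest-cosine Russian candidate."""
    result = {}
    for f_word, (f_pos, f_vec) in fr.items():
        candidates = [
            (cosine(f_vec, r_vec), r_word)
            for r_word, (r_pos, r_vec) in ru.items()
            if not same_pos or r_pos == f_pos   # the linguistic filter
        ]
        if candidates:
            score, best = max(candidates)
            result[f_word] = (best, score)
    return result

def precision(pairs, gold):
    """Share of extracted pairs that appear in a manually checked gold list."""
    if not pairs:
        return 0.0
    correct = sum(1 for f, (r, _) in pairs.items() if gold.get(f) == r)
    return correct / len(pairs)

pairs = best_correspondences(fr, ru)
gold = {"invention": "изобретение", "concerner": "касаться", "liant": "связующий"}
for f_word, (r_word, score) in pairs.items():
    print(f_word, "->", r_word, round(score, 2))
print("precision:", precision(pairs, gold))
```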

13 Correspondences with different POS: syntactic transformations
verbal infinitive (French) → noun (Russian): traiter (“to process”) → обработка (“processing”)
noun (French) → adjective (Russian): crochet (“hook”) → крюкообразный (“hook-shaped”)
verbal infinitive (French) → adjective (Russian): connaître (“to know”) → известный (“well-known”)

14 Correspondences with different POS: parts of multi-word expressions
au moins (“at least”) → по меньшей мере (“at least”)
The output of the program: moins → мера

15 Evaluation
Eduardo Cendejas, Grettel Barceló, Alexander Gelbukh, Grigori Sidorov. Incorporating Linguistic Information to Statistical Word-Level Alignment // Proceedings of the 14th Iberoamerican Conference on Pattern Recognition, CIARP 2009, Guadalajara, Jalisco, Mexico, November 15-18, 2009.
Vector space model + similarity measures: PMI, t-score, log-likelihood ratio and Dice coefficient.
Precision: 53%.
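For orientation, two of the association measures named on this slide can be computed from simple co-occurrence counts over aligned regions. The sketch below uses the textbook definitions of the Dice coefficient and pointwise mutual information (PMI); it is not the cited paper's implementation, and the parameter names and toy counts are assumptions.

```python
import math

def dice(n_xy, n_x, n_y):
    """Dice coefficient: 2 * joint count / (sum of individual counts)."""
    return 2 * n_xy / (n_x + n_y)

def pmi(n_xy, n_x, n_y, n):
    """Pointwise mutual information in bits: log2(P(x, y) / (P(x) * P(y)))."""
    if n_xy == 0:
        return float("-inf")   # the words never share a region
    return math.log2((n_xy * n) / (n_x * n_y))

# Toy counts (hypothetical): 8 aligned regions in total; the source word
# and the target word each occur in 2 regions, and always together.
print(dice(2, 2, 2))     # 1.0
print(pmi(2, 2, 2, 8))   # 2.0
```

Here `n` is the number of aligned regions, `n_x` and `n_y` are the numbers of regions containing the source and target word, and `n_xy` is the number of regions containing both.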

16 Conclusion
Distributional semantics methodology can be used to extract translation correspondences from a parallel corpus with a high level of precision (78% on the test corpus).
It can be used to study productive syntactic transformations occurring in translation.
The present vector space model needs to be enhanced to take into account multi-word expressions.

17 Thank you!

