Yuliya Morozova Institute for Informatics Problems of the Russian Academy of Sciences, Moscow.

Distributional semantics
A new area of linguistic research that infers the semantic properties of linguistic units from corpora.
Theoretical foundations: the distributional methodology of Z. Harris, F. de Saussure and L. Wittgenstein.
Distributional hypothesis: semantically similar words occur in similar contexts.
J. R. Firth: “You shall know a word by the company it keeps”.

Vector space
drink coffee – occurred 1 time
drink tea – occurred 2 times

Cosine measure of vector similarity
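
The formula from this slide did not survive in the transcript. For reference, the standard definition of the cosine measure for two context vectors a and b of equal length is given below; the symbols a, b, A and B are notation introduced here, not taken from the slides.

$$
\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert} = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2} \, \sqrt{\sum_i b_i^2}}
$$

When every component is 0 or 1 (Boolean frequency), this reduces to a count of shared contexts, where A and B are the sets of contexts in which each word occurs:

$$
\cos(\mathbf{a}, \mathbf{b}) = \frac{|A \cap B|}{\sqrt{|A| \cdot |B|}}
$$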

Main application areas
lexical ambiguity resolution
information retrieval
dictionaries of semantic relations
multilingual dictionaries
semantic maps of different domains
modelling of synonymy
document topic detection
sentiment analysis

The present research
Goal: to apply distributional semantics models to the extraction of translation correspondences from a parallel corpus.
Vector space model + test corpus

Test corpus
Patent texts in French translated into Russian
Texts split into sentences
Alignment at the sentence level, manually verified (in the visual editor MakeBilingua)
Uploaded to the Sketch Engine corpus manager

Preprocessing
Lemmatization
Frequent words removed (prepositions, conjunctions etc.)
Punctuation marks removed
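
The filtering steps above can be sketched roughly as follows. This is a minimal illustration, not the pipeline used in the talk: it assumes the text has already been lemmatized by an external tool, and the stop list `STOP_WORDS` and the function name `preprocess` are hypothetical.

```python
import string

# Hypothetical stop list of frequent function words (assumption: the slides
# do not give the actual list that was used).
STOP_WORDS = {"le", "la", "de", "et", "à", "в", "и", "к"}


def preprocess(lemmas):
    """Drop punctuation tokens and frequent function words.

    `lemmas` is a sentence that has already been lemmatized elsewhere.
    """
    kept = []
    for lemma in lemmas:
        token = lemma.lower()
        # Skip empty tokens and tokens made only of ASCII punctuation.
        if all(ch in string.punctuation for ch in token):
            continue
        if token in STOP_WORDS:
            continue
        kept.append(token)
    return kept


sentence = ["le", "présent", "invention", "concerner", ",", "notamment", "."]
cleaned = preprocess(sentence)
```

Note that `string.punctuation` covers ASCII marks only, so a real French or Russian corpus would also need entries such as guillemets.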

Vector space model
type of linguistic units: single words;
type of context: aligned regions;
frequency measure: Boolean frequency (equal to either 1 or 0);
method used to compute the distance between vectors: cosine measure.

Example (aligned region as a context)
Aligned region #1
French: présent invention concerner liant minéral notamment hydraulique
Russian: настоящий изобретение касаться неорганический связующий частность гидравлический связующий

Example (vector space)
Aligned region   #1   #2   #3
présent          1    …    …
invention        1    …    …
concerner        1    …    …
настоящий        1    …    …
изобретение      1    …    …
касаться         1    …    …
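
A minimal sketch of this model, under stated assumptions: each lemma is represented by the set of aligned regions in which it occurs (a Boolean vector), and French and Russian lemmas are compared with the cosine measure. Only the first region comes from the slides; regions 2 and 3 and the helper names `boolean_vectors`, `cosine` and `best_match` are invented for illustration.

```python
import math
from collections import defaultdict

# Toy parallel corpus of (French lemmas, Russian lemmas) per aligned region.
# Region 1 is the example from the slides; regions 2 and 3 are hypothetical.
aligned_regions = [
    (["présent", "invention", "concerner", "liant", "minéral",
      "notamment", "hydraulique"],
     ["настоящий", "изобретение", "касаться", "неорганический",
      "связующий", "частность", "гидравлический", "связующий"]),
    (["invention", "procédé"], ["изобретение", "способ"]),
    (["procédé", "concerner"], ["способ", "касаться"]),
]


def boolean_vectors(regions, side):
    """Map each lemma to the set of region indices where it occurs.

    A set of indices is a sparse Boolean vector: repeated occurrences
    inside one region still count as 1.
    """
    vectors = defaultdict(set)
    for index, pair in enumerate(regions):
        for lemma in pair[side]:
            vectors[lemma].add(index)
    return vectors


def cosine(a, b):
    """Cosine of two Boolean vectors stored as sets of region indices."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


french = boolean_vectors(aligned_regions, 0)
russian = boolean_vectors(aligned_regions, 1)


def best_match(french_lemma):
    """Return the Russian lemma whose vector is closest to the French one."""
    return max(russian, key=lambda ru: cosine(french[french_lemma], russian[ru]))
```

On this toy data `best_match("invention")` selects "изобретение", because the two lemmas occur in exactly the same regions. Lemmas seen in only one region (such as "présent") tie with every other lemma of that region, which is why a corpus of realistic size is needed.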

Results
A list of translation correspondences.
Linguistic filter: the same part of speech.
Precision: 78%.
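
One possible reading of the linguistic filter and of the precision figure is sketched below. It assumes the filter keeps a candidate pair only when both lemmas carry the same part-of-speech tag, and that precision is the share of extracted pairs judged correct; the slides state neither definition explicitly. The tag names and the functions `filter_same_pos` and `precision` are hypothetical.

```python
def filter_same_pos(pairs, pos_fr, pos_ru):
    """Keep only candidate pairs whose two lemmas have the same POS tag."""
    kept = []
    for fr, ru in pairs:
        tag_fr, tag_ru = pos_fr.get(fr), pos_ru.get(ru)
        # Pairs with an unknown tag on either side are discarded.
        if tag_fr is not None and tag_fr == tag_ru:
            kept.append((fr, ru))
    return kept


def precision(extracted, correct):
    """Share of extracted pairs that were judged correct."""
    if not extracted:
        return 0.0
    return sum(1 for pair in extracted if pair in correct) / len(extracted)


# Candidate pairs from the talk; the POS tags are illustrative assumptions.
candidates = [
    ("invention", "изобретение"),
    ("traiter", "обработка"),
    ("crochet", "крюкообразный"),
]
pos_fr = {"invention": "NOUN", "traiter": "VERB", "crochet": "NOUN"}
pos_ru = {"изобретение": "NOUN", "обработка": "NOUN", "крюкообразный": "ADJ"}

kept = filter_same_pos(candidates, pos_fr, pos_ru)
```

Here only the pair invention / изобретение passes the filter; the other two pairs have different parts of speech on the two sides.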

Correspondences with different POS
Syntactic transformations
verbal infinitive (French) → noun (Russian): traiter (“to process”) → обработка (“processing”)
noun (French) → adjective (Russian): crochet (“hook”) → крюкообразный (“hook-shaped”)
verbal infinitive (French) → adjective (Russian): connaître (“to know”) → известный (“well-known”)

Correspondences with different POS
Parts of multi-word expressions
au moins (“at least”) → по меньшей мере (“at least”)
The output of the program: moins → мера

Evaluation
Eduardo Cendejas, Grettel Barceló, Alexander Gelbukh, Grigori Sidorov. Incorporating Linguistic Information to Statistical Word-Level Alignment // Proceedings of the 14th Iberoamerican Conference on Pattern Recognition, CIARP 2009, Guadalajara, Jalisco, Mexico, November 15-18, 2009.
Vector space model + similarity measures: PMI, t-score, log-likelihood ratio and Dice coefficient.
Precision: 53%.

Conclusion
Distributional semantics methodology can be used to extract translation correspondences from a parallel corpus with a high level of precision.
It can be used to study productive syntactic transformations occurring in translation.
The present vector space model needs to be enhanced to take into account multi-word expressions.

Thank you!