Dennis Zhao,1 Dragomir Radev, PhD1 (LILY Lab)


Evaluating Word Embeddings for Cross-Lingual Information Retrieval in Low Density Languages

Dennis Zhao,1 Dragomir Radev, PhD1
1Department of Computer Science, Yale University, New Haven, CT (LILY Lab)

Introduction

This project is part of Project MATERIAL (MAchine Translation for English Retrieval of Information in Any Language). Over the course of this project, we will develop methods that aid information retrieval from documents in low density languages. Both the queries provided to the system and the summaries it outputs will be in English. We will use Tagalog and Swahili to research and develop our models initially; to evaluate our model and methodology, we will be assigned another low density language at a later date. To aid with information retrieval, the IR system will use cross-lingual word embeddings. As a preliminary evaluation, I find the "most similar" words to each query term, as measured by cosine similarity. Similarity is important because we will use a similar metric when we perform query expansion, the process of reformulating a seed query to improve retrieval performance. Finding similar words to replace or add to a query depends on where words are located relative to other words in the embedding space.

Results

Many of the most similar words are simply plural forms of the query terms. However, there are also many similar or related words that will help in query expansion, especially for query terms like "buildings", "bird", "picture", and "eye". We also report the 5 closest terms for Swahili (Table 2) and Tagalog (Table 3). For these languages, we first translate the query terms from English into the respective language using our parsed dictionaries, then look up the five closest words in the transformed word embeddings. Looking across the cosine similarity scores, we see that, in general, the scores of the corresponding pairs are not far off from the similarity scores of the query terms.
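The "most similar by cosine similarity" lookup used in this preliminary evaluation can be sketched as follows. This is a minimal, self-contained illustration: the toy vectors and vocabulary below are invented for the example and are not our trained embeddings.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity: dot product of the vectors over the product of their norms.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_similar(word, embeddings, topn=5):
    # Rank every other word in the vocabulary by cosine similarity to `word`.
    q = embeddings[word]
    scores = [(w, cosine(q, vec)) for w, vec in embeddings.items() if w != word]
    return sorted(scores, key=lambda x: x[1], reverse=True)[:topn]

# Toy embedding table (illustrative vectors only).
emb = {
    "building":  np.array([0.90, 0.10, 0.00]),
    "buildings": np.array([0.88, 0.12, 0.05]),
    "tower":     np.array([0.70, 0.30, 0.10]),
    "bird":      np.array([0.00, 0.90, 0.40]),
}
print(most_similar("building", emb, topn=2))
```

In practice we perform this lookup with Gensim's `most_similar` over the real embeddings, restricting the vocabulary to the 200,000 most frequent words as described in Materials and Methods.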
For Tagalog, the similarity scores seem higher across the board, and there were not many "drop-offs" in similarity scores between query terms, especially in the first or second most similar terms. For Swahili, there are more deviations from the similarity scores of the query terms; however, most of these deviations were in the third, fourth, and fifth most similar terms. This suggests that our embedding space is fairly well set up, although we may want to consider ways to obtain higher similarity scores for translations.

Table 1. The 5 closest words to each query term. The first column gives the query term; column i gives the i-th closest word.

Table 4. Cosine similarity scores for query terms and their 5 closest words in English-Swahili. The first column gives the query-term pair; column i gives the i-th closest word pair across translations.

Materials and Methods

After creating the word embeddings, I was also tasked with a preliminary evaluation of their quality. Using the Gensim and Word2Vec packages in Python, I first computed the five most similar words to those listed in the sample queries. I restricted the vocabulary to the 200,000 most frequent words in order to filter out rarely occurring words that could result from imperfect parsing or unclean data. Next, we find the cosine similarity between the original query terms in English and in one of the low density languages, and then compare the similarity between each of the i-th closest words. That is, I take the first closest word to "buildings" and the first closest word to the Swahili translation of "buildings", and compute the cosine similarity between those two words. This lets us see how consistent the constructed embedding spaces are.

Conclusion and Future Work

Next steps for this project involve making further improvements to the cross-lingual embeddings.
We can consider "relevance-based" word embeddings, where word vectors are placed close to each other by relevance rather than by proximity. We can also integrate our word embeddings to perform query expansion in our information retrieval systems. Following this, we have to create relevance judgments for evaluating the information retrieval system. We can use French to train the system and determine what improvements need to be made, and then simulate a "surprise" language using Tagalog.

Table 2. The 5 closest words to each query term in Swahili. The first column gives the query term; column i gives the i-th closest word.

Table 3. The 5 closest words to each query term in Tagalog. The first column gives the query term; column i gives the i-th closest word.

Table 5. Cosine similarity scores for query terms and their 5 closest words in English-Tagalog. The first column gives the query-term pair; column i gives the i-th closest word pair across translations.

Acknowledgements

I would like to thank Professor Dragomir Radev for providing guidance and helping me with this project. I would also like to thank the other members of the LILY Lab for their willingness to answer all my questions.
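The cross-language consistency check described in Materials and Methods (comparing the i-th closest word of an English query term with the i-th closest word of its translation) can be sketched as below. The vectors, words, and dictionary entry are toy stand-ins for illustration, not our actual embeddings or parsed dictionaries; the check assumes both languages have already been mapped into one shared embedding space.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two vectors in the shared embedding space.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def closest(word, emb, topn=5):
    # The topn nearest neighbors of `word` within one language's vocabulary.
    q = emb[word]
    ranked = sorted(((w, cosine(q, vec)) for w, vec in emb.items() if w != word),
                    key=lambda x: x[1], reverse=True)
    return [w for w, _ in ranked[:topn]]

# Toy shared space: English vectors and (already mapped) Swahili vectors.
en = {"buildings": np.array([1.00, 0.10]),
      "houses":    np.array([0.95, 0.20]),
      "towers":    np.array([0.80, 0.40])}
sw = {"majengo":   np.array([0.98, 0.12]),
      "nyumba":    np.array([0.90, 0.25]),
      "minara":    np.array([0.75, 0.45])}
translation = {"buildings": "majengo"}  # stand-in for the parsed dictionary

term = "buildings"
en_neighbors = closest(term, en, topn=2)
sw_neighbors = closest(translation[term], sw, topn=2)

# Compare the i-th closest pair across the two languages.
for e, s in zip(en_neighbors, sw_neighbors):
    print(e, s, round(cosine(en[e], sw[s]), 3))
```

A consistent pair of embedding spaces should give high cross-language similarity for each i-th pair, which is what Tables 4 and 5 report for the real data.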