

Semantic Matching by Non-Linear Word Transportation for Information Retrieval
Jiafeng Guo*, Yixing Fan*, Qingyao Ai+, W. Bruce Croft+
* CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
+ Center for Intelligent Information Retrieval, University of Massachusetts Amherst, MA, USA

Outline
- Introduction
- Non-Linear Word Transportation
- Model Discussion
- Experiments
- Conclusions

Introduction
- Effective retrieval models rely on the Bag-of-Words (BoW) representation
- Problem: vocabulary mismatch
- The relevance score rewards exact matching of words but misses semantically related words

Techniques
- Query Expansion
- Latent Models
- Translation Models
- Word Embedding
- Word Mover's Distance

Query Expansion
- Global method: expansion terms come from the corpus being searched or a hand-crafted thesaurus
- Local method: expansion terms come from the top-ranked documents (pseudo-relevance feedback, PRF)
- Problem: query drift

Latent Models
- Queries and documents are represented in a latent space of reduced dimensionality (e.g., LDA-based document model)
- Problem: loss of many detailed matching signals over words
- Problem: do not improve performance on their own (need to be combined with other models)

Translation Models
- Documents are translated into queries, capturing word dependency
- Mixture model and binomial model (Berger et al.)
- Title and document pairs (Jin et al.)
- Mutual information between words (Karimzadehgan et al.)
- Problem: how to formalize and estimate the translation probability

Word Embedding
- Semantic representations of words, capturing semantic and syntactic regularities
- The potential in IR needs to be further explored
- Bag of Word Embeddings (BoWE)
- Monolingual and bilingual retrieval (Vulic et al.)
- Generalized language model (Ganguly et al.)

Word Mover's Distance
- Transportation problem: urban planning and civil engineering
- Earth Mover's Distance: image retrieval and multimedia search
- Word Mover's Distance: document classification

Non-Linear Word Transportation
- Bag of Word Embeddings (BoWE) representation
- Non-linear transportation (inspired by WMD)
- Fixed document capacity and non-fixed query capacity
- Efficient approximation: neighborhood pruning and indexing strategies

Bag of Word Embeddings (BoWE)
- Richer representation: similarity between words (e.g., "car" and "auto")
- Word embedding matrix: $W \in \mathbb{R}^{K \times |V|}$
- Document: $D = \{(w_1^d, tf_1), \ldots, (w_m^d, tf_m)\}$
- Query: $Q = \{(w_1^q, qtf_1), \ldots, (w_n^q, qtf_n)\}$
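The BoWE representation on this slide can be illustrated with a small sketch: each distinct word is paired with its embedding vector and its term frequency. The toy 3-dimensional embedding table and the helper name `to_bowe` are hypothetical, introduced only for illustration; the slides do not specify how out-of-vocabulary words are handled, so skipping them is an assumption.

```python
from collections import Counter

# Hypothetical toy embedding table (K = 3). Real vectors would come from a
# trained model such as CBOW, Skip-Gram or GloVe.
EMBEDDINGS = {
    "car": [0.9, 0.1, 0.0],
    "auto": [0.8, 0.2, 0.1],
    "fish": [0.0, 0.9, 0.3],
}


def to_bowe(tokens, embeddings):
    """Return a bag of word embeddings as (word, vector, tf) triples.

    Tokens without an embedding are skipped (an assumption of this sketch).
    """
    counts = Counter(t for t in tokens if t in embeddings)
    return [(word, embeddings[word], tf) for word, tf in counts.items()]


doc = to_bowe(["car", "auto", "car", "boat"], EMBEDDINGS)
```

Here `doc` holds one entry per distinct in-vocabulary word, so "car" appears once with a term frequency of 2.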

Non-Linear Word Transportation
- Information capacity
  - Document word: fixed
  - Query word: unlimited, reflecting the vague nature of query intent
- Information gain (profit): follows the law of diminishing marginal returns
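The law of diminishing marginal returns can be made concrete with any concave gain function. The sketch below uses `log(1 + x)` purely as an illustrative choice; the transcript does not preserve the function used in the model, so both the function and the name `gain` are assumptions.

```python
import math


def gain(received):
    """Illustrative concave gain, log(1 + x). The function choice is an assumption."""
    return math.log1p(received)


# Marginal gain of each additional unit of information a query word receives.
marginal = [gain(k + 1) - gain(k) for k in range(4)]
```

Each entry of `marginal` is smaller than the one before it: the first unit of matched information is worth more than the second, and so on.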

Non-Linear Word Transportation
- Find the optimal flows $F = \{f_{ij}\}$

Non-Linear Word Transportation
- Document word capacity: $c_i = \dfrac{tf_i + u \, \frac{cf_i}{|C|}}{|D| + u}$
- Transportation profit: $r_{ij} = \widehat{\cos}(w_i^d, w_j^q) = \max(\cos(w_i^d, w_j^q), 0)$
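The two quantities on this slide translate almost directly into code. The following is a minimal sketch, not the authors' implementation: the function and parameter names are hypothetical, `cf` is read as the collection frequency of the word, `collection_len` as $|C|$, `doc_len` as $|D|$, and `u` as the smoothing parameter.

```python
import math


def capacity(tf, cf, collection_len, doc_len, u):
    """Document word capacity: (tf + u * cf / |C|) / (|D| + u)."""
    return (tf + u * cf / collection_len) / (doc_len + u)


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def profit(doc_vec, query_vec):
    """Transportation profit: cosine similarity clipped at zero."""
    return max(cosine(doc_vec, query_vec), 0.0)
```

Clipping at zero means that a document word pointing away from a query word contributes no profit rather than a negative one.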

Transportation Profit
- Risk parameter $\alpha$
- Matching an exact word should bring more profit than matching a semantically related word multiple times, e.g., "salmon" and "fish" (similarity 0.72)
- The higher $\alpha$, the less profit the transportation can bring
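The effect of the risk parameter can be sketched numerically. The transcript does not show how $\alpha$ enters the profit, so the form below, the clipped similarity raised to the power $\alpha$, is an assumption chosen because it matches the stated behaviour: profit shrinks as $\alpha$ grows, while an exact match (similarity 1.0) is unaffected. The name `damped_profit` is hypothetical.

```python
def damped_profit(similarity, alpha):
    """Assumed form: similarity clipped at zero, raised to the power alpha."""
    return max(similarity, 0.0) ** alpha


# Compare a related pair such as "salmon"/"fish" (0.72) with an exact match (1.0).
for alpha in (1, 2, 4):
    print(alpha, damped_profit(0.72, alpha), damped_profit(1.0, alpha))
```

Under this assumed form, raising $\alpha$ widens the gap between exact and merely related matches, which is one way to make several related-word matches count for less than a single exact match.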

Model Summary
- Non-linear word transportation model: damping effect
- Exact and semantic matching signals: document word capacity and transportation profit
- Neighborhood pruning: $|V| \times |Q|$ (e.g., kNN)
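Neighborhood pruning with kNN can be sketched as follows: for each query word, only the k most similar vocabulary words are kept as candidate sources of flow. The slides do not describe the actual pruning or indexing procedure, so this brute-force version and the name `knn_neighbors` are assumptions for illustration only.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_neighbors(query_vec, vocab_embeddings, k):
    """Return the k vocabulary words most similar to query_vec, best first."""
    scored = ((cosine(vec, query_vec), word) for word, vec in vocab_embeddings.items())
    return [word for _, word in heapq.nlargest(k, scored)]


vocab = {"car": [1.0, 0.0], "auto": [0.9, 0.1], "fish": [0.0, 1.0]}
neighbors = knn_neighbors([1.0, 0.0], vocab, k=2)
```

Because the neighbor lists depend only on the vocabulary and not on any particular document, they could be computed once per query word, or precomputed and indexed.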

Model Discussion
- Word alignment effect: due to the relaxation of constraints on the query side and the diminishing marginal effect
- A document is assigned a higher score if it can interpret more distinct query words

Semantic Matching
- Query expansion: local analysis methods are orthogonal to our work
- Latent models: our model instead represents the document as a bag of word embeddings
- Statistical translation models: our model offers more flexibility and can use multiple features in estimation

Word Mover's Distance
- NWT: relevance between queries and documents; maximum profit, non-linear problem
- WMD: dissimilarity between documents; minimum cost, linear transportation problem
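The contrast can be written out side by side. The WMD program below is the standard linear transportation problem, with $d_i$ and $d'_j$ the normalized word weights of the two documents and $x_i$, $x'_j$ their word embeddings. The NWT program is only a sketch: the transcript does not preserve its objective, so the logarithm is an assumed concave function chosen to reflect diminishing marginal returns, with $i$ indexing document words and $j$ indexing query words as in the capacity and profit definitions.

```latex
% WMD: minimum cost, linear objective, constraints on both sides
\min_{T \ge 0} \; \sum_{i,j} T_{ij} \, \lVert x_i - x'_j \rVert_2
\quad \text{s.t.} \quad
\sum_{j} T_{ij} = d_i \;\; \forall i, \qquad
\sum_{i} T_{ij} = d'_j \;\; \forall j

% NWT (assumed form): maximum profit, concave objective,
% fixed document capacities, no constraint on the query side
\max_{F \ge 0} \; \sum_{j \in Q} \log \sum_{i \in D} f_{ij} \, r_{ij}
\quad \text{s.t.} \quad
\sum_{j \in Q} f_{ij} = c_i \;\; \forall i \in D
```

Dropping the query-side constraint is what lets a query word absorb as much flow as is profitable, while the concave outer function discourages sending all flow to a single query word.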

Experiments

Word Embedding and Evaluation
- Word embeddings
  - Corpus-specific (CBOW and Skip-Gram)
  - Corpus-independent (GloVe)
- Evaluation measures: MAP, NDCG@20 and P@20
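For reference, the three evaluation measures can be sketched as below. These are common textbook definitions rather than the exact evaluation scripts used in the experiments; the function names are hypothetical, the DCG variant uses a linear gain with a log2 rank discount, and the ideal ranking is taken over the supplied list only. Both of those choices are assumptions.

```python
import math


def precision_at_k(rels, k):
    """P@k: fraction of the top-k results that are relevant (rels are 0/1, best rank first)."""
    return sum(rels[:k]) / k


def average_precision(rels, num_relevant):
    """AP for one query: mean of the precision values at each relevant rank."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / num_relevant if num_relevant else 0.0


def ndcg_at_k(gains, k):
    """NDCG@k with linear gain and log2 discount; ideal order is taken from `gains`."""
    def dcg(values):
        return sum(g / math.log2(rank + 1) for rank, g in enumerate(values[:k], start=1))

    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0
```

MAP is then the mean of `average_precision` over all queries in the test set.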

Retrieval Performance and analysis

Case Studies
- Named entities
  - Query: "brazil america relation"
  - "argentina" and "spain" matched for "brazil"
  - "europe" and "africa" matched for "america"
- Ambiguous acronyms
  - Query: "Find information on taking the SAT college entrance exam"
  - "fri", "tue" and "wed" matched for "SAT"

Impact of Word Embeddings

Different Dimensionality

Indexed Neighbor Size

Linear vs. Non-Linear

Conclusions
- Transportation based on the BoWE representation: captures detailed semantic matching signals
- The non-linear formulation: relaxation of constraints and the diminishing marginal effect
- Flexibility in model definition: word capacity and transportation profit