1 Semantic Matching by Non-Linear Word Transportation for Information Retrieval
Jiafeng Guo*, Yixing Fan*, Qingyao Ai+, W. Bruce Croft+
*CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
+Center for Intelligent Information Retrieval, University of Massachusetts Amherst, MA, USA

2 Outline
Introduction
Non-Linear Word Transportation
Model Discussion
Experiments
Conclusions

3 Introduction
Effective retrieval models have largely been built on the Bag-of-Words (BoW) representation.
Vocabulary mismatch: the relevance score relies on exact matching of words, while semantically related words are ignored.

4 Techniques
Query Expansion
Latent Models
Translation Models
Word Embedding
Word Mover's Distance

5 Query Expansion
Global method: the corpus being searched or a hand-crafted thesaurus
Local method: top-ranked documents (pseudo-relevance feedback, PRF)
Problem: query drift

6 Latent Models
Map queries and documents into a latent space of reduced dimensionality (e.g., LDA-based document models)
Problem: loss of many detailed matching signals over words; does not improve performance on its own (needs to be combined with exact-matching models)

7 Translation Models
Translate documents to queries (word dependency)
Mixture model and binomial model (Berger et al.)
Title-document pairs (Jin et al.)
Mutual information between words (Karimzadehgan et al.)
Problem: how to formalize and estimate the translation probabilities

8 Word Embedding
Semantic representations of words, capturing semantic and syntactic regularities
The potential in IR needs to be further explored
Bag of Word Embeddings (BoWE): monolingual and bilingual retrieval (Vulic et al.), generalized language model (Ganguly et al.)

9 Word Mover's Distance
Transportation problem: urban planning and civil engineering
Earth Mover's Distance: image retrieval and multimedia search
Word Mover's Distance: document classification

10 Non-Linear Word Transportation
Bag of Word Embeddings (BoWE) representation
Non-linear transportation (inspired by WMD)
Fixed document-word capacity and non-fixed (unlimited) query-word capacity
Efficient approximation: neighborhood pruning and indexing strategies

11 Bag of Word Embeddings (BoWE)
Richer representation: captures similarity between words (e.g., "car" and "auto")
Word embedding matrix $W \in \mathbb{R}^{K \times V}$
$D = \{(\vec{w}_1^d, tf_1), \ldots, (\vec{w}_m^d, tf_m)\}$
$Q = \{(\vec{w}_1^q, qtf_1), \ldots, (\vec{w}_n^q, qtf_n)\}$
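As an illustration (not from the paper), a minimal Python sketch of building such BoWE representations, assuming a dict of pre-trained word vectors and simple whitespace tokenization:

from collections import Counter

def bowe(text, embeddings):
    """Bag of Word Embeddings: a list of (vector, term frequency) pairs.

    `embeddings` maps word -> K-dimensional vector (e.g., word2vec or GloVe);
    out-of-vocabulary words are skipped.
    """
    counts = Counter(text.lower().split())
    return [(embeddings[w], tf) for w, tf in counts.items() if w in embeddings]

# Usage (hypothetical vectors): both D and Q become bags of (embedding, tf) pairs.
# D = bowe("the dealership sells used autos and one new car", embeddings)
# Q = bowe("cheap car", embeddings)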

12 Non-Linear Word Transportation
Information capacity: document words are fixed, query words are unlimited (reflecting the vague nature of query intent)
Information gain (profit): follows the law of diminishing marginal returns

13 Non-Linear Word Transportation
Find the optimal flows $F = \{f_{ij}\}$ that transport information from document words to query words.
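As a sketch of the formulation implied by the two surrounding slides (exact notation may differ from the paper), with $r_{ij}$ the transportation profit, $c_i$ the document word capacity defined on the next slide, and a logarithmic gain modeling the diminishing marginal returns:

\[
\mathrm{rel}(Q, D) \;=\; \max_{F=\{f_{ij}\}} \; \sum_{j=1}^{n} \log\!\Big( \sum_{i=1}^{m} f_{ij}\, r_{ij} \Big)
\quad \text{s.t.} \quad \sum_{j=1}^{n} f_{ij} \le c_i \;\; \forall i, \qquad f_{ij} \ge 0 \;\; \forall i, j.
\]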

14 Non-Linear Word Transportation
Document word capacity (Dirichlet-smoothed): $c_i = \dfrac{tf_i + \mu \, cf_i / |C|}{|D| + \mu}$
Transportation profit: $r_{ij} = \cos^{+}(\vec{w}_i^d, \vec{w}_j^q) = \max(\cos(\vec{w}_i^d, \vec{w}_j^q), 0)$
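A minimal NumPy sketch of these two quantities (variable names are illustrative; the default value of $\mu$ is an assumption, a typical Dirichlet prior):

import numpy as np

def capacity(tf_i, cf_i, corpus_size, doc_len, mu=2000.0):
    """Dirichlet-smoothed document word capacity c_i."""
    return (tf_i + mu * cf_i / corpus_size) / (doc_len + mu)

def profit(w_d, w_q):
    """Transportation profit r_ij: cosine similarity truncated at zero."""
    cos = np.dot(w_d, w_q) / (np.linalg.norm(w_d) * np.linalg.norm(w_q))
    return max(cos, 0.0)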

15 Transportation Profit
Risk parameter $\alpha$: an exact word match should be worth more than a semantically related word matched multiple times
Example: "salmon" and "fish" (cosine similarity 0.72)
The higher $\alpha$, the less profit a semantic (non-exact) transportation can bring
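The slide does not show exactly how $\alpha$ enters the profit; one plausible instantiation, assumed here purely for illustration, is to raise the truncated cosine to the power $\alpha$, so that repeated semantic matches stop outweighing a single exact match as $\alpha$ grows:

# Illustrative only: damp the cosine-based profit with a risk parameter alpha.
def damped_profit(cos_sim, alpha):
    return max(cos_sim, 0.0) ** alpha

for alpha in (1, 3, 5):
    exact = damped_profit(1.0, alpha)          # one exact match, e.g. "salmon"
    related = 5 * damped_profit(0.72, alpha)   # five related matches, e.g. "fish" (cos 0.72)
    print(alpha, exact, round(related, 3))
# alpha=1: 1.0 vs 3.6; alpha=5: 1.0 vs ~0.967 -- higher alpha, less semantic profit.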

16 Model Summary
Non-linear word transportation model: captures both exact and semantic matching signals
Damping effect: document word capacity and transportation profit
Neighborhood pruning: avoids scoring all $|V| \times |Q|$ word pairs (e.g., via kNN in embedding space)
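A sketch of one possible pruning step, assuming gensim-style KeyedVectors; the paper's actual indexing strategy may differ:

def prune_candidates(query_terms, keyed_vectors, k=10):
    """Map each query term to its k nearest embedding neighbors (plus itself)."""
    candidates = {}
    for q in query_terms:
        if q in keyed_vectors:
            neighbors = [w for w, _ in keyed_vectors.most_similar(q, topn=k)]
            candidates[q] = set(neighbors) | {q}
        else:
            candidates[q] = {q}
    return candidates

# Only document words in candidates[q] are considered when matching query word q.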

17 Model Discussion
Word alignment effect: due to the relaxation of constraints on the query side and the diminishing marginal gain, a document that covers (interprets) more distinct query words will be assigned a higher score.

18 Semantic Matching
Query expansion: local analysis (PRF) is orthogonal to our work
Latent models: our model instead represents the document as a bag of word embeddings
Statistical translation models: our model offers more flexibility, allowing multiple features in the estimation

19 NWT vs. Word Mover's Distance
NWT: relevance between queries and documents; a maximum-profit, non-linear transportation problem
WMD: dissimilarity between documents; a minimum-cost, linear transportation problem

20 Experiments

21 Word Embedding and Evaluation
Word embeddings: corpus-specific (CBOW and Skip-Gram) and corpus-independent (GloVe)
Evaluation measures: MAP, and …
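For reference, a minimal sketch of training corpus-specific embeddings with gensim's Word2Vec (an assumed toolchain; the paper's training configuration is not given on this slide):

from gensim.models import Word2Vec

# Toy corpus; in practice, the tokenized retrieval collection would be used.
corpus = ["the dealership sells used autos", "salmon is a popular fish"]
sentences = [doc.split() for doc in corpus]

cbow = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=0)       # CBOW
skipgram = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=1)   # Skip-Gram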

22 Retrieval Performance and Analysis

23 Case Studies
Named entities: for the query "brazil america relation", embedding neighbors include "argentina" and "spain" for "brazil", and "europe" and "africa" for "america"
Ambiguous acronyms: for "Find information on taking the SAT college entrance exam", the neighbors of "sat" include "fri", "tue" and "wed"
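Such neighbor lists can be inspected directly from pre-trained vectors; a hypothetical check using gensim's downloader (the GloVe model chosen here is an illustrative stand-in for the embeddings used in the paper):

import gensim.downloader as api

kv = api.load("glove-wiki-gigaword-50")  # small pre-trained GloVe vectors
for term in ("brazil", "america", "sat"):
    print(term, [w for w, _ in kv.most_similar(term, topn=5)])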

24 Impact of Word Embeddings

25 Different Dimensionality

26 Indexed Neighbor Size

27 Linear vs. Non-Linear

28 Conclusions
A transportation model based on the BoWE representation captures detailed semantic matching signals
The non-linear formulation: relaxation of query-side constraints and the diminishing marginal gain effect
Flexibility in the model definition: word capacity and transportation profit

