Improved Search for Socially Annotated Data
Authors: Nikos Sarkas, Gautam Das, Nick Koudas
Presented by: Amanda Cohen Mostafavi

Introduction
Social Annotation: a process where users collaboratively assign a short sequence of keywords (tags) to a number of resources
▫Each tag sequence is a concise and accurate summary of the resource's content
▫Meant to aid navigation through a collection
Leads to searching via tags
▫Enables relevant text retrieval
▫Allows accurate retrieval of non-textual objects
▫Presents a need for an efficient retrieval and ranking method based on user tags

RadING: Ranking annotated data using Interpolated N-Grams
A searching and ranking method based exclusively on user tags
Uses interpolated n-grams to model the tag sequences associated with every resource
How does it rank?

Probabilistic Foundations
Goal: rank resources by the probability that they will be relevant to the query
Given a keyword query Q and a collection of resources R, applying Bayes' theorem gives:
▫p(R is relevant | Q) = p(Q | R is relevant) · p(R is relevant) / p(Q)
where p(R is relevant) is the probability that R is relevant, independent of the query posed, and p(Q) is the probability of the query being issued

Probabilistic Foundations
Both p(R is relevant) and p(Q) are constant across the resource collection for a fixed query
▫Meaning: ranking resources by p(R is relevant | Q) is equivalent to ranking them by p(Q | R is relevant)
To estimate the probability of the query being "generated" by each resource, resources need to be modeled based on knowledge of the social annotation process
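Written out as a worked step (restating the slide's claim, nothing beyond it):

```latex
% Since p(R is relevant) and p(Q) do not vary across resources for a fixed query Q,
% the posterior is rank-equivalent to the query likelihood:
\[
  p(R \text{ is relevant} \mid Q)
  = \frac{p(Q \mid R \text{ is relevant})\, p(R \text{ is relevant})}{p(Q)}
  \;\propto\; p(Q \mid R \text{ is relevant})
\]
```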

Dynamics and Properties of the Social Annotation Process
The goal of the tagging process is to describe the resource's content
User opinions crystallize quickly: annotation trends emerge after observing only a small number of assignments
Therefore we assume the following:
▫p(Q | R is relevant) = p(Q is used to tag R)
▫In English: users will use keyword sequences derived from the same distribution to both tag and search for a resource

Social Annotation Process: Things to Consider
Resources are rarely given assignments with only one tag
Tag positions are not random: they progress from left to right, from more general to more specific
Tags representing different perspectives on a resource are less likely to occur together in the same assignment
N-gram models are used to capture these co-occurrence patterns

N-gram Models
Given an assignment made up of a sequence s of l tags t_1 … t_l, the probability of this sequence being assigned to a resource is:
▫p(t_1, …, t_l) = p(t_1) p(t_2 | t_1) … p(t_l | t_1, …, t_{l-1})
An n-gram model approximates each conditional probability using only the last n-1 tags of the context
▫In the case of a bigram model, p(t_k | t_1, …, t_{k-1}) is approximated by p(t_k | t_{k-1})
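To make the bigram approximation concrete, here is a small Python sketch (not from the paper); the tag sequence and all probability values are invented for illustration.

```python
def sequence_prob_bigram(tags, unigram_p, bigram_p):
    """Bigram approximation of p(t_1, ..., t_l) = p(t_1) * prod_k p(t_k | t_{k-1})."""
    if not tags:
        return 1.0
    prob = unigram_p.get(tags[0], 0.0)        # p(t_1)
    for prev, curr in zip(tags, tags[1:]):    # p(t_k | t_{k-1}) for k = 2..l
        prob *= bigram_p.get((prev, curr), 0.0)
    return prob

# Toy example: probability of the assignment ["photography", "tutorial", "lighting"].
unigram_p = {"photography": 0.02, "tutorial": 0.05, "lighting": 0.01}
bigram_p = {("photography", "tutorial"): 0.10, ("tutorial", "lighting"): 0.07}
print(sequence_prob_bigram(["photography", "tutorial", "lighting"], unigram_p, bigram_p))
# 0.02 * 0.10 * 0.07 = 0.00014
```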

N-gram Models
Bigram probabilities are calculated using the maximum likelihood estimate:
▫p_ML(t_2 | t_1) = c(t_1, t_2) / Σ_t c(t_1, t)
where c(t_1, t_2) is the number of occurrences of the bigram and the summation is the sum of the occurrences of all bigrams involving t_1 as the first tag
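A minimal sketch of this maximum-likelihood estimate computed from a resource's tag assignments; the assignment data below is hypothetical.

```python
from collections import Counter

def train_ml_bigrams(assignments):
    """p_ML(t2 | t1) = c(t1, t2) / sum over t of c(t1, t), per the definition above."""
    bigram_counts = Counter()
    context_counts = Counter()   # total bigram occurrences with t1 as the first tag
    for tags in assignments:
        for t1, t2 in zip(tags, tags[1:]):
            bigram_counts[(t1, t2)] += 1
            context_counts[t1] += 1
    return {pair: c / context_counts[pair[0]] for pair, c in bigram_counts.items()}

# Hypothetical assignments attached to a single resource.
assignments = [["python", "tutorial"], ["python", "tutorial", "beginner"], ["python", "code"]]
print(train_ml_bigrams(assignments))
# {('python', 'tutorial'): 0.666..., ('tutorial', 'beginner'): 1.0, ('python', 'code'): 0.333...}
```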

Interpolation
Interpolation is used to compensate for sparse data: it redistributes probability mass from frequently observed events to rarely observed ones
RadING uses the Jelinek-Mercer interpolation technique, which, applied to a bigram model, mixes the bigram maximum-likelihood estimate with lower-order estimates using weights λ_1 and λ_2
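The interpolated formula on the original slide did not survive the transcript; the sketch below uses a common textbook Jelinek-Mercer form that mixes the bigram ML estimate with the unigram ML estimate and a uniform background, weighted by λ_1 and λ_2. The exact components and constraints used by RadING may differ.

```python
def jm_bigram_prob(t1, t2, lam1, lam2, bigram_ml, unigram_ml, vocab_size):
    """Jelinek-Mercer style interpolation for p(t2 | t1) (assumed form, see note above).

    lam2 weights the bigram ML estimate, lam1 the unigram ML estimate, and the
    remaining mass goes to a uniform background so unseen tags keep nonzero probability.
    """
    p_bigram = bigram_ml.get((t1, t2), 0.0)
    p_unigram = unigram_ml.get(t2, 0.0)
    p_background = 1.0 / vocab_size
    return lam2 * p_bigram + lam1 * p_unigram + (1.0 - lam1 - lam2) * p_background
```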

Parameter Optimization
Goal: maximize the likelihood function L(λ_1, λ_2) in order to find the ideal interpolation parameters
Definitions:
▫D*: the constrained domain of λ_1 and λ_2
▫λ*: the global maximum of L(λ_1, λ_2)
▫λ_c: the point at which L(λ_1, λ_2) attains its maximum value within D*, which must be found to optimize the parameters

RadING Optimization Framework
Step 1: If L(λ_1, λ_2) is unbounded, perform 1D optimization to locate λ_c
Step 2: If L(λ_1, λ_2) is bounded, apply 2D optimization to find λ*
Step 3: If λ* is not in D*, locate λ_c
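The paper's framework locates λ_c through the 1D/2D optimization steps above; as a rough, brute-force stand-in, the sketch below grid-searches an assumed constrained domain D* = {λ_1, λ_2 ≥ 0, λ_1 + λ_2 ≤ 1} for the weights that maximize held-out log-likelihood. It reuses the hypothetical jm_bigram_prob sketch from the interpolation slide.

```python
import math

def grid_search_lambdas(heldout_bigrams, bigram_ml, unigram_ml, vocab_size, step=0.05):
    """Brute-force stand-in for the RadING optimization steps (illustration only)."""
    best, best_ll = (0.0, 0.0), -math.inf
    lam1 = 0.0
    while lam1 <= 1.0:
        lam2 = 0.0
        while lam1 + lam2 <= 1.0 + 1e-9:
            # Held-out log-likelihood of the interpolated model under (lam1, lam2).
            ll = sum(
                math.log(max(jm_bigram_prob(t1, t2, lam1, lam2,
                                            bigram_ml, unigram_ml, vocab_size), 1e-12))
                for t1, t2 in heldout_bigrams
            )
            if ll > best_ll:
                best, best_ll = (lam1, lam2), ll
            lam2 += step
        lam1 += step
    return best
```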

Searching Process
Step 1: Train a bigram model for each resource
▫Compute the bigram and unigram probabilities and optimize the interpolation parameters
Step 2: At query time, compute the probability of the query keyword sequence being generated by each resource's bigram model
▫Use the Threshold Algorithm to compute the top-k results
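A simplified sketch of the query-time step; the resource_models interface (a prob(prev_tag, tag) method returning the interpolated probability) is my assumption, and the top-k here is a naive full scan rather than the Threshold Algorithm used in the paper.

```python
import heapq

def query_likelihood(query_tags, model):
    """Probability that a resource's bigram model generates the query keyword sequence."""
    prob, prev = 1.0, None
    for tag in query_tags:
        prob *= model.prob(prev, tag)   # prev is None for the first tag (unigram case)
        prev = tag
    return prob

def top_k(query_tags, resource_models, k=10):
    """Naive top-k by scoring every resource; RadING uses the Threshold Algorithm instead."""
    return heapq.nlargest(k, resource_models.items(),
                          key=lambda item: query_likelihood(query_tags, item[1]))
```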

Searching Example

Experimental Evaluation
Test data: web crawl of del.icio.us
▫70,658,851 assignments
▫Posted by 567,539 users
▫Attached to 24,245,248 unique URLs
▫Average length of assignment: 2.77 tags
▫Standard deviation: 2.70
▫Median: 2

Optimization Efficiency

Ranking Effectiveness
Compares the RadING ranking method to two adaptations of tf/idf ranking
▫Tf/Idf: concatenates each resource's assignments into a single document and ranks resources by tf/idf similarity to that document
▫Tf/Idf+: computes the tf/idf similarity of each individual assignment and ranks resources by average similarity
10 judges recruited through Amazon Mechanical Turk measured precision
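For reference, a rough sketch of the two tf/idf baselines as described above; the cosine/tf-idf weighting details are assumed, not taken from the paper.

```python
import math
from collections import Counter

def tfidf_vector(tokens, idf):
    """Simple tf-idf weights for a bag of tags; the exact weighting scheme is assumed."""
    tf = Counter(tokens)
    return {t: count * idf.get(t, 0.0) for t, count in tf.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def score_tfidf(query, assignments, idf):
    """Tf/Idf baseline: concatenate all assignments into one document, then compare."""
    document = [tag for assignment in assignments for tag in assignment]
    return cosine(tfidf_vector(query, idf), tfidf_vector(document, idf))

def score_tfidf_plus(query, assignments, idf):
    """Tf/Idf+ baseline: average similarity to each individual assignment."""
    q_vec = tfidf_vector(query, idf)
    sims = [cosine(q_vec, tfidf_vector(a, idf)) for a in assignments]
    return sum(sims) / len(sims) if sims else 0.0
```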

Ranking Effectiveness