1 Relevance Feedback
Sampath Jayarathna, Cal Poly Pomona
Credit for some of the slides in this lecture goes to Prof. Ray Mooney at UT Austin and Prof. Rong Jin at MSU.

2 Queries and Information Needs
An information need is the underlying cause of the query that a person submits to a search engine; the information need is generally related to a task.
A query can be a poor representation of the information need:
the user may find it difficult to express the information need
the user is encouraged to enter short queries, both by the search engine interface and by the fact that long queries don't work

3 Paradox of IR
If the user knew the question to ask, there would often be no work to do.
"The need to describe that which you do not know in order to find it" (Roland Hjerppe)
It may be difficult to formulate a good query when you don't know the collection well, but it is easy to judge particular documents returned from a query.

4 Interaction
Key aspect of effective retrieval:
users can't change the ranking algorithm, but they can change the results through interaction
interaction helps refine the description of the information need
Interaction with the system occurs during query formulation and reformulation, and while browsing the results.
The system can help with query refinement, either fully automatically or with the user in the loop.

5 Global vs. Local Methods
Global:
Query expansion / reformulation with a thesaurus or WordNet
Query expansion via automatic thesaurus generation
Techniques like spelling correction
Local:
Relevance feedback
Pseudo relevance feedback (blind relevance feedback)

6 Relevance Feedback
After initial retrieval results are presented, allow the user to provide feedback on the relevance of one or more of the retrieved documents.
The user identifies relevant (and maybe non-relevant) documents in the initial result list.
Use this feedback information to reformulate the query: the system modifies the query using terms from those documents and re-ranks the documents.
Produce new results based on the reformulated query.
Allows a more interactive, multi-pass process.

7 Relevance Feedback Architecture
[Architecture diagram: the user's query string goes through query reformulation into the IR system, which ranks documents from the document corpus; the user marks ranked results as relevant or non-relevant, and that feedback produces a revised query and a re-ranked result list.]

8 Query Reformulation
Revise the query to account for feedback:
Query expansion: add new terms to the query from relevant documents.
Term reweighting: increase the weight of terms in relevant documents and decrease the weight of terms in irrelevant documents.

9 Query Reformulation
Change the query vector using vector algebra:
Add the vectors for the relevant documents to the query vector.
Subtract the vectors for the irrelevant documents from the query vector.
This adds both positively and negatively weighted terms to the query, as well as reweighting the initial terms.

10–12 Relevance Feedback in Vector Model
[Figure, repeated across three slides: documents D1–D6 and the query plotted in a term space whose axes are Java, Microsoft, and Starbucks; the sequence illustrates the query vector being moved by feedback toward the relevant documents and away from the irrelevant ones.]

13 Relevance Feedback in Vector Space
Goal: Move the new query closer to the relevant documents and, at the same time, away from the irrelevant documents.
Approach: The new query is a weighted average of the original query and the relevant and non-relevant document vectors.
Since the full set of relevant documents is unknown, just use the known relevant (R) and irrelevant (NR) sets of documents and include the initial query Q.
The reformulated (Rocchio) query: $Q' = \alpha Q + \frac{\beta}{|R|}\sum_{D_j \in R} D_j - \frac{\gamma}{|NR|}\sum_{D_j \in NR} D_j$, where $\alpha$ weights the original query.

14 Relevance Feedback in Vector Space
Goal: Move the new query closer to the relevant documents and, at the same time, away from the irrelevant documents.
Approach: The new query is a weighted average of the original query and the relevant and non-relevant document vectors.
$\beta$ weights the relevant documents; R is the set of relevant docs and |R| is the number of relevant docs.

15 Relevance Feedback in Vector Space
Goal: Move the new query closer to the relevant documents and, at the same time, away from the irrelevant documents.
Approach: The new query is a weighted average of the original query and the relevant and non-relevant document vectors.
$\gamma$ weights the irrelevant documents; NR is the set of irrelevant docs and |NR| is the number of irrelevant docs.
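The update on slides 13–15 can be sketched in a few lines of Python. This is an illustrative sketch, not code from the lecture: queries and documents are represented as term-to-weight dictionaries, and the α=1.0, β=0.75, γ=0.15 defaults are common textbook choices rather than values given on the slides.

```python
from collections import defaultdict

def rocchio(query, relevant_docs, irrelevant_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Q' = alpha*Q + (beta/|R|)*sum(R) - (gamma/|NR|)*sum(NR)."""
    new_query = defaultdict(float)
    for term, weight in query.items():            # original query, weighted by alpha
        new_query[term] += alpha * weight
    for doc in relevant_docs:                     # R: known relevant documents
        for term, weight in doc.items():
            new_query[term] += beta * weight / len(relevant_docs)
    for doc in irrelevant_docs:                   # NR: known irrelevant documents
        for term, weight in doc.items():
            new_query[term] -= gamma * weight / len(irrelevant_docs)
    # Terms whose weight goes negative are usually dropped rather than kept.
    return {t: w for t, w in new_query.items() if w > 0}

q = {"java": 1.0}
relevant = [{"java": 0.8, "programming": 0.6}, {"java": 0.5, "jvm": 0.9}]
irrelevant = [{"java": 0.4, "coffee": 0.9, "starbucks": 0.7}]
print(rocchio(q, relevant, irrelevant))
```

Dropping terms whose weight goes negative, as done here, is one common practical choice; keeping them as negatively weighted terms is also possible.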

16 Evaluating Relevance Feedback
By construction, the reformulated query will rank explicitly marked relevant documents higher and explicitly marked irrelevant documents lower.
The method should not get credit for improvement on these documents, since it was told their relevance.
In machine learning, this error is called "testing on the training data."
Evaluation should focus on generalizing to other, un-rated documents.

17 Fair Evaluation of Relevance Feedback
Remove from the corpus any documents for which feedback was provided.
Measure recall/precision performance on the remaining residual collection.
Compared to the complete corpus, specific recall/precision numbers may decrease, since relevant documents were removed.
However, relative performance on the residual collection provides fair data on the effectiveness of relevance feedback.
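A small sketch of this residual-collection measurement, assuming rankings are lists of document ids and relevance judgments are sets; all names here are illustrative, not from the lecture.

```python
def residual_precision_at_k(ranking, relevant, feedback_docs, k=10):
    """Precision@k on the residual collection: documents already judged
    during feedback are removed from the ranking before measuring."""
    residual = [d for d in ranking if d not in feedback_docs]
    hits = sum(1 for d in residual[:k] if d in relevant)
    return hits / k

# Doc1 and Doc2 were shown to the user for feedback, so they are excluded.
print(residual_precision_at_k(
    ranking=["Doc1", "Doc4", "Doc2", "Doc7", "Doc9"],
    relevant={"Doc1", "Doc4", "Doc9"},
    feedback_docs={"Doc1", "Doc2"},
    k=3))
```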

18 Relevance Feedback Problems
Users are sometimes reluctant to provide explicit feedback, and relevance feedback is not used in many applications.
There are reliability issues, especially with queries that don't retrieve many relevant documents.
Some applications do use relevance feedback: filtering, and "more like this" features.
Query suggestion is more popular; it may be less accurate, but it can work when the initial query fails.

19 Pseudo Relevance Feedback
What if users only mark relevant documents?
What if users only mark irrelevant documents?
What if users do not provide any relevance judgments?

20 Pseudo Relevance Feedback
What if users only mark relevant documents? Assume documents ranked at the bottom to be irrelevant.
What if users only mark irrelevant documents? Let the query itself serve as the relevant document.
What if users do not provide any relevance judgments? Treat top-ranked documents as relevant and bottom-ranked documents as irrelevant.
Implicit relevance feedback: user click-through data.
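Pseudo (blind) feedback can be sketched on top of the rocchio() function from the earlier sketch: with no judgments at all, treat the top k of the initial ranking as relevant and, optionally, the bottom k as irrelevant. The value of k and the function name below are illustrative choices, not from the slides.

```python
def pseudo_feedback_query(query, ranked_docs, k=10, use_bottom_as_irrelevant=False):
    """ranked_docs: document vectors (term -> weight dicts), best first.
    Reuses rocchio() from the earlier sketch; assumes the ranking is
    longer than 2*k so the two assumed sets do not overlap."""
    assumed_relevant = ranked_docs[:k]                                        # top k treated as relevant
    assumed_irrelevant = ranked_docs[-k:] if use_bottom_as_irrelevant else [] # bottom k optionally treated as irrelevant
    return rocchio(query, assumed_relevant, assumed_irrelevant)
```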

21 Pseudo Feedback Architecture
[Architecture diagram: the same loop as the relevance feedback architecture, except that no user judgments are collected; the top results of the initial ranking are fed back automatically to reformulate the query and produce the re-ranked list.]

22 Pseudo Feedback Results
Found to improve performance on the TREC ad hoc retrieval task.
Works even better if top documents must also satisfy additional Boolean constraints in order to be used in feedback.

23 Thesaurus
A thesaurus provides information on synonyms and semantically related words and phrases.
Example: physician
syn: ||croaker, doc, doctor, MD, medical, mediciner, medico, ||sawbones
rel: medic, general practitioner, surgeon

24 Thesaurus-based Query Expansion
For each term t in a query, expand the query with synonyms and related words of t from the thesaurus.
May weight added terms less than the original query terms.
Generally increases recall.
May significantly decrease precision, particularly with ambiguous terms: "interest rate" → "interest rate fascinate evaluate"
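A sketch of the idea with a toy thesaurus dictionary; the 0.5 down-weight for added terms is an illustrative choice, not a value from the slides.

```python
# Toy thesaurus for illustration only; a real system would load a full thesaurus.
TOY_THESAURUS = {
    "physician": ["doctor", "medico", "sawbones"],
    "interest": ["fascinate", "pastime"],
    "rate": ["evaluate", "charge"],
}

def expand_query(terms, thesaurus=TOY_THESAURUS, added_weight=0.5):
    """Original query terms keep weight 1.0; thesaurus terms get a smaller weight."""
    weighted = {t: 1.0 for t in terms}
    for t in terms:
        for related in thesaurus.get(t, []):
            weighted.setdefault(related, added_weight)
    return weighted

# Illustrates the ambiguity problem from the slide: "interest rate" picks up
# "fascinate" and "evaluate", which belong to the wrong word senses.
print(expand_query(["interest", "rate"]))
```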

25 WordNet A more detailed database of semantic relationships between English words. Developed by famous cognitive psychologist George Miller and a team at Princeton University. About 144,000 English words. Nouns, adjectives, verbs, and adverbs grouped into about 109,000 synonym sets called synsets.

26 WordNet Synset Relationships
Antonym: front  back Attribute: benevolence  good (noun to adjective) Pertainym: alphabetical  alphabet (adjective to noun) Similar: unquestioning  absolute Cause: kill  die Entailment: breathe  inhale Holonym: chapter  text (part to whole) Meronym: computer  cpu (whole to part) Hyponym: plant  tree (specialization) Hypernym: apple  fruit (generalization)

27 WordNet Query Expansion
Add synonyms in the same synset.
Add hyponyms to add specialized terms.
Add hypernyms to generalize a query.
Add other related terms to expand the query.
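A sketch using NLTK's WordNet interface; it assumes the nltk package is installed and the WordNet corpus has been downloaded. Which relations to include, and how to weight the added terms, are choices the slide leaves open.

```python
from nltk.corpus import wordnet as wn  # requires nltk and: nltk.download('wordnet')

def wordnet_expand(term, add_hyponyms=True, add_hypernyms=False):
    """Collect synonyms from every synset of `term`, plus optional hyponyms
    (specializations) and hypernyms (generalizations)."""
    expanded = set()
    for synset in wn.synsets(term):
        expanded.update(l.name().replace("_", " ") for l in synset.lemmas())
        if add_hyponyms:
            for hypo in synset.hyponyms():
                expanded.update(l.name().replace("_", " ") for l in hypo.lemmas())
        if add_hypernyms:
            for hyper in synset.hypernyms():
                expanded.update(l.name().replace("_", " ") for l in hyper.lemmas())
    expanded.discard(term)
    return sorted(expanded)

# "apple" with hypernyms should surface generalizations such as "edible fruit".
print(wordnet_expand("apple", add_hypernyms=True))
```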

28 Context and Personalization
If a query has the same words as another query, the results will be the same regardless of:
who submitted the query
why the query was submitted
where the query was submitted
what other queries were submitted in the same session
These other factors (the context) could have a significant impact on relevance, but they are difficult to incorporate into ranking.

29 User Models
Generate user profiles based on documents that the person looks at, such as web pages visited, messages, or word processing documents on the desktop.
Modify queries using words from the profile.
Generally not effective: profiles are imprecise, and information needs can change significantly.

30 Query Logs
Query logs provide important contextual information that can be used effectively.
Context in this case is:
previous queries that are the same
previous queries that are similar
query sessions including the same query
Query history for individuals could be used for caching.

31 Local Search
Location is context.
Local search uses geographic information to modify the ranking of search results:
location derived from the query text
location of the device where the query originated
e.g., "cpp"

32 Local Search
Identify the geographic region associated with web pages:
location metadata, or automatically identifying locations such as place names, city names, or country names in the text
Identify the geographic region associated with the query:
10–15% of queries contain some location reference
Rank web pages using location information in addition to text and link-based features.
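One simple way to use location as an additional ranking feature is a score adjustment when the page's region matches the region inferred for the query. The function below is a toy illustration under that assumption; the boost value and names are made up, and real systems combine many more features.

```python
def local_score(text_score, page_region, query_region, location_boost=0.3):
    """Toy ranking feature: add a fixed boost when the page's geographic
    region matches the region inferred for the query (e.g., from "cpp"
    issued near Pomona); otherwise fall back to the text score alone."""
    if page_region is not None and page_region == query_region:
        return text_score + location_boost
    return text_score

print(local_score(1.2, "Pomona, CA", "Pomona, CA"))  # boosted
print(local_score(1.2, "Austin, TX", "Pomona, CA"))  # not boosted
```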

33 Query Expansion Conclusions
Expansion of queries with related terms can improve performance, particularly recall. However, must select similar terms very carefully to avoid problems, such as loss of precision.

