Search Engines: Information Retrieval in Practice. All slides ©Addison Wesley, 2008.

Chapter 6 Queries and Interfaces

Keyword Queries
- Simple, natural language queries were designed to enable everyone to search
- Current search engines do not perform well (in general) with natural language queries
- People have been trained (in effect) to use keywords
  - Compare an average of about 2.3 words per Web query to an average of 30 words per Community-based Question Answering (CQA) query
- Keyword selection is not always easy
  - Query refinement techniques can help

Query-Based Stemming
- Make the stemming decision at query time rather than during indexing
  - Improved flexibility and effectiveness
- Query is expanded using word variants
  - Documents are not stemmed
  - e.g., "rock climbing" expanded with "climb", not stemmed to "climb"
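The expansion step above can be sketched as follows. This is a minimal illustration, not the book's implementation: the variant table here is a tiny hypothetical example, whereas a real system would derive it from a stemmer run over the collection.

```python
# Query-time stemming: expand the query with word variants instead of
# stemming the indexed documents.
# VARIANTS is a hypothetical, hand-made table for illustration only.
VARIANTS = {
    "climbing": ["climb", "climbs", "climbed"],
    "rock": ["rocks"],
}

def expand_query(query):
    """Return the query terms plus any known variants of each term."""
    expanded = []
    for term in query.lower().split():
        expanded.append(term)
        expanded.extend(VARIANTS.get(term, []))
    return expanded

print(expand_query("rock climbing"))
# ['rock', 'rocks', 'climbing', 'climb', 'climbs', 'climbed']
```

Because the documents themselves are untouched, the same index can serve both stemmed and unstemmed retrieval, which is the flexibility benefit the slide mentions.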

Stem Classes
- Stemming generates stem classes
- A stem class is the group of words that the stemming algorithm transforms into the same stem
  - Generated by running the stemmer on a large corpus
  - e.g., the Porter stemmer run on TREC news data (the original slide shows three example stem classes, with the first entry being the stem)

Stem Classes
- Stem classes are often too big and inaccurate
- Modify them using analysis of word co-occurrence
- Assumption: word variants that could substitute for each other should co-occur often in documents
  - e.g., "meeting"/"assembly", "adult"/"grown-up"

Modifying Stem Classes
1. For all pairs of words in the stem classes, count how often they co-occur in text windows of W words
2. Compute a co-occurrence or association metric for each pair, which measures the degree of association between the words
3. Construct a graph where the vertices represent words and the edges connect words whose co-occurrence metric is above a threshold value, which is set empirically
4. Find the connected components of the graph; these are the new stem classes

Modifying Stem Classes
- Dice's coefficient is an example of a term association measure:
  - Dice(x, y) = 2 · n_xy / (n_x + n_y), where n_x is the number of windows (documents) containing x, n_y the number containing y, and n_xy the number containing both
  - A measure of the proportion of term co-occurrence
- Two vertices are in the same connected component of a graph if there is a path between them
  - Forms word clusters based on a threshold value
- (The original slide shows sample output of the modification.)
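The whole modification pipeline (Dice scores, thresholded graph, connected components) can be sketched in a few lines. This is an illustrative sketch under simplifying assumptions: text windows are given as sets of words, and the threshold value is arbitrary rather than empirically tuned.

```python
from collections import defaultdict
from itertools import combinations

def dice(n_x, n_y, n_xy):
    """Dice's coefficient: 2 * n_xy / (n_x + n_y)."""
    return 2 * n_xy / (n_x + n_y)

def refine_stem_class(stem_class, windows, threshold=0.2):
    """Split a stem class into the connected components of its
    co-occurrence graph. `windows` is a list of sets of words."""
    counts = {w: sum(1 for win in windows if w in win) for w in stem_class}
    # Add an edge between two words if their Dice score clears the threshold
    adj = defaultdict(set)
    for x, y in combinations(stem_class, 2):
        n_xy = sum(1 for win in windows if x in win and y in win)
        if counts[x] and counts[y] and dice(counts[x], counts[y], n_xy) >= threshold:
            adj[x].add(y)
            adj[y].add(x)
    # Connected components via depth-first search
    seen, classes = set(), []
    for w in stem_class:
        if w in seen:
            continue
        stack, comp = [w], set()
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u] - comp)
        seen |= comp
        classes.append(sorted(comp))
    return classes
```

Words that never co-occur end up in different components, so an over-large stem class like {policy, policies, police, policed} splits into the two clusters a reader would expect.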

Spell Checking
- Important part of query processing
  - 10-15% of all Web queries have spelling errors
- There are many types of errors (the original slide shows example errors extracted from query logs)

Spell Checking
- Basic approach: suggest corrections for words not found in the spelling dictionary
  - Many spelling errors are related to websites, products, companies, people, etc. that are unlikely to be found in a dictionary
- Suggestions are found by comparing the word to words in the dictionary using a similarity measure
- The most common similarity measure is edit distance
  - The number of operations required to transform one word into the other

Edit Distance
- Damerau-Levenshtein distance
  - Counts the minimum number of insertions, deletions, substitutions, or transpositions of single characters required to transform one word into another
  - e.g., Damerau-Levenshtein distance 1 covers single-character errors; distance 2 covers two such errors (the original slide shows example word pairs at each distance)
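A standard dynamic-programming sketch of the (restricted) Damerau-Levenshtein distance described above:

```python
def dl_distance(a, b):
    """Damerau-Levenshtein distance: minimum number of insertions,
    deletions, substitutions, and adjacent transpositions (restricted
    variant, where no substring is edited more than once)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all of a's prefix
    for j in range(n + 1):
        d[0][j] = j  # insert all of b's prefix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

print(dl_distance("lawers", "lawyers"))  # 1 (one insertion)
print(dl_distance("hte", "the"))         # 1 (one transposition)
```

Counting a transposition as a single operation is what distinguishes Damerau-Levenshtein from plain Levenshtein distance, and it matters for spelling because swapped adjacent letters are a very common typing error.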

Edit Distance
- Different techniques are used to speed up the calculation of edit distances; restrict comparisons to words that:
  - start with the same character (spelling errors rarely occur in the first letter)
  - have similar length (spelling errors rarely change the length of the word by much)
  - sound the same (homophones; rules map words to phonetic codes)
- The last option groups words that share the same phonetic code
  - e.g., Soundex

Soundex Code
(The original slide shows the Soundex coding rules. Note that the correct word may not always have the same Soundex code as the misspelling.)
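A compact sketch of the standard American Soundex algorithm: keep the first letter, map the remaining consonants to digit classes, drop vowels, collapse adjacent duplicate codes (with h and w not separating duplicates), and pad or truncate to four characters.

```python
def soundex(word):
    """Standard American Soundex: a letter followed by three digits."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    first = word[0].upper()
    digits = []
    prev = codes.get(word[0], "")  # code of the first letter, for duplicate collapsing
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w are ignored and do not separate duplicate codes
        d = codes.get(ch, "")  # vowels and y map to "" and reset the run
        if d and d != prev:
            digits.append(d)
        prev = d
    return (first + "".join(digits) + "000")[:4]

print(soundex("robert"))   # R163
print(soundex("rupert"))   # R163 -- same code, different word
print(soundex("ashcraft")) # A261
```

Words with the same code become candidate corrections for each other, which is exactly the grouping the speed-up bullet above describes; as the slide notes, the catch is that the intended word may not share the misspelling's code.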

Spelling Correction Issues
- Ranking corrections (there may be more than one possible correction for an error)
  - The "Did you mean..." feature requires accurate ranking of possible corrections (the most likely correction should be the top suggestion)
- Context
  - Choosing the right suggestion depends on context (the other words)
  - e.g., lawers → lowers, lawyers, layers, lasers, lagers; but trial lawers → trial lawyers
- Run-on errors (word boundaries are skipped or mistyped)
  - e.g., "mainscourcebank"
  - Missing spaces can be treated as another single-character error in the right framework

Noisy Channel Model
- Addresses the issues of ranking, context, and run-on errors
- User chooses word w based on a probability distribution P(w)
  - Called the language model
  - Can capture context information about the frequency of occurrence of a word in text, e.g., P(w1 | w2), the probability of observing a word given that another one has just been observed
- User intends word w, but the noisy channel causes word e to be written instead, with probability P(e | w)
  - Called the error model
  - Represents information about the frequency of spelling errors

Noisy Channel Model
- Need to estimate the probability of a correction, representing information about the frequency of different types of errors
  - P(w | e) ∝ P(e | w)P(w), i.e., the probability that, given a written word e, the intended word is w
- Estimate the language model probability using context: P(w) = λP(w) + (1 − λ)P(w | wp), where wp is the word preceding w and λ is a parameter specifying the relative importance of P(w) and P(w | wp)
- Example: for "fish tink", "tank" and "think" are both likely corrections (similar error model probability and probability of word occurrence), but the co-occurrence probability P(tank | fish) > P(think | fish), so "tank" is ranked higher

Noisy Channel Model
- Language model probabilities are estimated using a corpus and the query log
- Both simple and complex methods have been used for estimating the error model
  - Simple approach: assume that all words within the same edit distance have the same probability; only edit distances 1 and 2 are considered
  - More complex approach: incorporate estimates based on common typing errors
    - Estimates are derived from large collections of text by finding many pairs of correctly and incorrectly spelled words
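The noisy channel ranking can be sketched as below. All of the probability tables here are made-up toy numbers for the "fish tink" example; real systems estimate them from a corpus, query logs, and pairs of misspelled/corrected words as the slide describes.

```python
# Hypothetical toy language model and error model (illustration only).
P_WORD = {"tank": 0.004, "think": 0.010}          # P(w): "think" is more frequent
P_PREV = {("tank", "fish"): 0.02,                 # P(w | previous word)
          ("think", "fish"): 0.001}
P_ERROR = {("tink", "tank"): 0.01,                # P(e | w): both edit distance 1,
           ("tink", "think"): 0.01}               # so same error probability

def score(candidate, error, prev=None, lam=0.5):
    """Noisy channel score P(e|w) * P(w), where P(w) is interpolated
    with the context probability P(w|prev) when a previous word is given."""
    p_w = P_WORD.get(candidate, 0.0)
    if prev is not None:
        p_w = lam * p_w + (1 - lam) * P_PREV.get((candidate, prev), 0.0)
    return P_ERROR.get((error, candidate), 0.0) * p_w

def best_correction(error, candidates, prev=None):
    return max(candidates, key=lambda w: score(w, error, prev))

# Without context the more frequent "think" wins; with "fish" before it,
# "tank" wins because P(tank | fish) > P(think | fish).
print(best_correction("tink", ["tank", "think"]))          # think
print(best_correction("tink", ["tank", "think"], "fish"))  # tank
```

This shows why the model handles context: the interpolated language model lets the preceding query word override raw word frequency.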

Relevance Feedback
- A query expansion and refinement technique
- User identifies relevant (and possibly non-relevant) documents in the initial result list
- System modifies the query using terms from those documents and re-ranks the documents
- Pseudo-relevance feedback
  - Assumes the top-ranked documents are relevant; no user input is required
  - Keywords are added or dropped, or their weights increased or decreased

Pseudo-relevance Feedback Example
(The original slide shows the top 10 documents and their snippets for the query "tropical fish".)

Relevance Feedback Example
- If we assume the top 10 documents are relevant, the most frequent terms are (with frequencies): a (926), td (535), href (495), http (357), width (345), com (343), nbsp (316), www (260), tr (239), htm (233), class (225), jpg (221)
  - Too many stopwords and HTML expressions
- For query expansion, use only the snippets and remove stopwords: tropical (26), fish (28), aquarium (8), freshwater (5), breeding (4), information (3), species (3), tank (2), Badman's (2), page (2), hobby (2), forums (2)
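The snippet-based term selection in this example can be sketched as a frequency count over the snippets with stopwords removed. The stopword list here is a tiny illustrative one, not a real system's list.

```python
from collections import Counter

# Minimal illustrative stopword list (real lists are much larger).
STOPWORDS = {"a", "the", "of", "for", "and", "in", "to", "is", "are"}

def expansion_terms(snippets, k=5):
    """Count term frequencies across the snippets of the top-ranked
    documents, drop stopwords, and return the k most frequent terms."""
    counts = Counter()
    for snippet in snippets:
        for term in snippet.lower().split():
            term = term.strip(".,;:!?")
            if term and term not in STOPWORDS:
                counts[term] += 1
    return [t for t, _ in counts.most_common(k)]

snippets = ["Tropical fish in the aquarium.",
            "Breeding tropical fish.",
            "Tropical fish species information."]
print(expansion_terms(snippets, 2))  # ['tropical', 'fish']
```

Restricting the count to snippets rather than full documents is what keeps HTML artifacts like `td` and `href` out of the expansion terms.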

Relevance Feedback - Query Logs
- Drawback of the pseudo-relevance feedback strategy: when the initial ranking does not contain many relevant documents, the expanded query is unlikely to be helpful
- Query logs provide important contextual information that can be used effectively for query expansion
- Context in this case is:
  - Previous queries that are the same
  - Previous queries that are similar
  - Query sessions including the same query
- Query history for individuals could be used for caching

Relevance Feedback
- Rocchio algorithm
  - Based on the concept of an optimal query
  - Maximizes the difference between (i) the average vector representing the relevant documents and (ii) the average vector representing the non-relevant documents
- Modifies the query according to:
  - q' = α·q + (β/|Rel|)·Σ_{d in Rel} d − (γ/|Nonrel|)·Σ_{d in Nonrel} d
  - α, β, and γ are parameters; typical values are 8, 16, and 4
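A minimal sketch of the Rocchio update, representing query and document vectors as sparse term-weight dictionaries; clipping negative weights to zero is a common practical convention, not something the slide specifies.

```python
def rocchio(query, relevant, nonrelevant, alpha=8.0, beta=16.0, gamma=4.0):
    """Rocchio query modification over sparse term-weight dicts:
    q' = alpha*q + beta*mean(relevant) - gamma*mean(nonrelevant).
    Negative weights are clipped to zero (common convention)."""
    terms = set(query)
    for d in relevant + nonrelevant:
        terms |= set(d)
    new_query = {}
    for t in terms:
        w = alpha * query.get(t, 0.0)
        if relevant:
            w += beta * sum(d.get(t, 0.0) for d in relevant) / len(relevant)
        if nonrelevant:
            w -= gamma * sum(d.get(t, 0.0) for d in nonrelevant) / len(nonrelevant)
        new_query[t] = max(w, 0.0)
    return new_query

q = rocchio({"fish": 1.0},
            relevant=[{"fish": 1.0, "tropical": 1.0}],
            nonrelevant=[{"fishing": 1.0}])
print(q)  # {'fish': 24.0, 'tropical': 16.0, 'fishing': 0.0} (some key order)
```

With the typical parameter values (8, 16, 4), terms from relevant documents get a strong boost while terms from non-relevant documents are pushed down or out of the query.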

Snippet Generation
- A successful search engine interface depends on users understanding the contents of the query results
- Snippets are query-dependent document summaries
- Snippet generation is a simple form of text summarization
  - Rank each sentence in a document using a significance factor, first proposed by H. P. Luhn in the 1950s
  - Select the top sentences, those containing significant words, for the summary

Sentence Selection
- The significance factor for a sentence is calculated based on the occurrence of significant words
- If f_d,w is the frequency of word w in document d, then w is a significant word if it is not a stopword (i.e., a high-frequency word) and
  f_d,w ≥ 7 − 0.1 · (25 − s_d)   if s_d < 25
  f_d,w ≥ 7                      if 25 ≤ s_d ≤ 40
  f_d,w ≥ 7 + 0.1 · (s_d − 40)   otherwise
  where s_d is the number of sentences in document d
- Text is bracketed by significant words (with a limit on the number of non-significant words in a bracket)

Sentence Selection
- The significance factor for a bracketed text span is computed by (i) dividing the square of the number of significant words in the span by (ii) the total number of words in the span
- Example: a span of 7 words containing 4 significant words has significance factor = 4²/7 ≈ 2.3
- The limit set for non-significant words in a bracket is typically 4
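The bracketing and scoring above can be sketched directly. The significance test uses the sentence-count-dependent thresholds described on the previous slide; sentences and spans are represented simply as lists of lowercase words.

```python
def significant_words(words, stopwords, freqs, s_d):
    """Luhn's test: a non-stopword w with document frequency f_d,w is
    significant if f_d,w meets a threshold depending on s_d, the
    number of sentences in the document."""
    if s_d < 25:
        threshold = 7 - 0.1 * (25 - s_d)
    elif s_d <= 40:
        threshold = 7
    else:
        threshold = 7 + 0.1 * (s_d - 40)
    return {w for w in words if w not in stopwords and freqs.get(w, 0) >= threshold}

def significance_factor(span, significant):
    """Square of the number of significant words in a bracketed span,
    divided by the total number of words in the span."""
    count = sum(1 for w in span if w in significant)
    return count ** 2 / len(span)

# A 7-word span with 4 significant words scores 4^2 / 7, about 2.3
span = ["the", "fish", "fish", "swam", "fish", "past", "fish"]
print(significance_factor(span, {"fish"}))
```

Sentences are then ranked by the best significance factor of any bracketed span they contain, and the top sentences form the summary.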

Snippet Generation
- Involves more features than just the significance factor
- A typical sentence-based snippet-generation approach uses:
  1. Whether the sentence is a heading
  2. Whether it is the first or second line of the document
  3. The total number of query terms occurring in the sentence
  4. The number of unique query terms in the sentence
  5. The longest contiguous run of query words in the sentence
  6. A density measure of query words (i.e., the significance factor computed on query words in the sentence)
- A weighted combination of these features is used to rank sentences
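Features 3-6 and the weighted combination can be sketched as below; the feature weights are hypothetical placeholders, since learning or tuning them is outside the scope of the slide.

```python
def sentence_score(sentence, query_terms, weights):
    """Score one sentence (a list of lowercase words) for snippet
    selection using a weighted combination of query-based features.
    `weights` maps feature names to hypothetical weight values."""
    qset = set(query_terms)
    total = sum(1 for w in sentence if w in qset)   # total query-term occurrences
    unique = len(qset & set(sentence))              # unique query terms present
    run = best = 0                                  # longest contiguous run of query words
    for w in sentence:
        run = run + 1 if w in qset else 0
        best = max(best, run)
    # density: significance factor computed on query words in the sentence
    density = total ** 2 / len(sentence) if sentence else 0.0
    features = {"total": total, "unique": unique, "run": best, "density": density}
    return sum(weights[f] * v for f, v in features.items())

w = {"total": 1.0, "unique": 1.0, "run": 1.0, "density": 1.0}
print(sentence_score(["tropical", "fish", "are", "popular"], ["tropical", "fish"], w))
```

The heading and line-position features (1 and 2) would be supplied by the document parser and simply added to the same weighted sum.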

Clustering Results
- Result lists often contain documents related to different aspects of the query topic
- Clustering is used to group related documents to simplify browsing
- (The original slide shows example clusters for the query "tropical fish".)

Result List Example
(The original slide shows the top 10 documents for the query "tropical fish".)