Gertjan van Noord, 2014. Search Engines Lecture 7: relevance feedback & query reformulation.

Query expansion
- Global methods: e.g. add (near) synonyms
  - Using a thesaurus
  - Using automatically constructed resources
- Local methods: based on initial results of query
  - Relevance feedback
  - Pseudo-relevance feedback
  - Indirect relevance feedback

Relevance feedback
1. The user formulates a query
2. The system gives a list of results
3. The user marks one or more documents of the result list as relevant
4. The system uses this information to modify the original query
5. The system gives a new (or reordered) list of results

Which document information?
- Query and documents are both seen as vectors
- Positive feedback: add terms from the document vector to the query / give some original query terms more weight
- Negative feedback: give some query terms less weight

Negative feedback
- Not commonly asked from the user
- The highest ranked document that is NOT marked relevant can be considered not-relevant by default

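Rocchio feedback formula

$$\vec{q}_{mod} = \alpha\,\vec{q}_0 + \beta\,\frac{1}{|D_r|}\sum_{\vec{d}_j \in D_r}\vec{d}_j - \gamma\,\frac{1}{|D_{nr}|}\sum_{\vec{d}_j \in D_{nr}}\vec{d}_j$$

- q_mod / q_0: modified query / original query
- D_r: relevant docs
- D_nr: not-relevant docs
- d_j: document vector
- α, β, γ: size of effect for each factor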

Exercise
- alpha = 1, beta = 1, gamma = 0.5
- Original query: query retrieval interface
- Results:
  - Document 1: query interface (relevant)
  - Document 2: query text (relevant)
  - Document 3: gps interface (not relevant)

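Answer

Term       q0  d1  d2  d3  Value in qmod
gps        0   0   0   1   0 + 0 - 0.5 = -0.5, set to 0
interface  1   1   0   1   1 + 0.5 - 0.5 = 1
query      1   1   1   0   1 + 1 - 0 = 2
retrieval  1   0   0   0   1 + 0 - 0 = 1
text       0   0   1   0   0 + 0.5 - 0 = 0.5

The new vector qmod is: (interface, query, retrieval, text) = (1, 2, 1, 0.5)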
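
The exercise can be checked with a few lines of Python. This is a minimal sketch, assuming the Rocchio variant in which the relevant and not-relevant document vectors are averaged before being weighted, and in which negative term weights are clipped to 0 (as the answer does for gps). The helper names `bag` and `rocchio` are illustrative only.

```python
from collections import Counter


def bag(text):
    """Turn a whitespace-separated text into a term -> count dict."""
    return dict(Counter(text.split()))


def rocchio(q0, relevant, non_relevant, alpha=1.0, beta=1.0, gamma=0.5):
    """Return the modified query as a term -> weight dict.

    Relevant and not-relevant vectors are averaged (centroids);
    negative weights are clipped to 0.
    """
    terms = set(q0)
    for doc in relevant + non_relevant:
        terms |= set(doc)

    q_mod = {}
    for term in sorted(terms):
        pos = sum(d.get(term, 0) for d in relevant) / len(relevant) if relevant else 0.0
        neg = sum(d.get(term, 0) for d in non_relevant) / len(non_relevant) if non_relevant else 0.0
        weight = alpha * q0.get(term, 0) + beta * pos - gamma * neg
        q_mod[term] = max(weight, 0.0)
    return q_mod


q0 = bag("query retrieval interface")
d1 = bag("query interface")   # marked relevant
d2 = bag("query text")        # marked relevant
d3 = bag("gps interface")     # not relevant

q_mod = rocchio(q0, [d1, d2], [d3])
print(q_mod)
```

With the exercise's settings this gives weight 2 for query, 1 for interface and retrieval, 0.5 for text, and 0 for gps.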

Changing the query vector

Evaluation of RF
- At least 5 docs should be marked for good results
- Calculating the effect: what to do with the first result list?
- User studies with a time-based comparison are fairest

When is RF not sufficient?
- Insufficient relevance judgements (min. 5)
- If the first query fails:
  - spelling
  - cross-language
  - vocabulary mismatch
- If the result includes more than one cluster of relevant documents:
  - subsets with different vocabulary
  - disjunctive answer set (examples)

Problems of RF
- Users don’t like to go on with a search
- Results incomprehensible
- High computing costs
- In web search, recall is not very important

Relevance FB and the web
- Simple: "more like this"
- But: users are not interested in high recall, only in precision

Feedback without explicit action from user
- Pseudo relevance feedback
  - Just assume that the top n docs are relevant
  - Works well, but sometimes causes topic drift
- Indirect relevance feedback
  - Use of clickstream data (general or individual)
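
Pseudo relevance feedback can be sketched as follows. This is a toy illustration under stated assumptions: documents are ranked by a simple count of query-term occurrences (not a real ranking function), the top `n` are assumed relevant, and the `k` most frequent new terms from them are appended to the query. The documents and the function names are made up for the example.

```python
from collections import Counter


def score(query_terms, doc_terms):
    """Toy ranking score: number of query-term occurrences in the document."""
    counts = Counter(doc_terms)
    return sum(counts[t] for t in query_terms)


def pseudo_relevance_feedback(query, docs, n=2, k=1):
    """Expand the query with the k most frequent new terms from the top n docs."""
    q_terms = query.split()
    ranked = sorted(docs, key=lambda d: score(q_terms, d.split()), reverse=True)
    top_docs = ranked[:n]  # assumed relevant without asking the user

    candidates = Counter(
        term for doc in top_docs for term in doc.split() if term not in q_terms
    )
    # Sort by descending count, then alphabetically, so ties are deterministic.
    best = sorted(candidates.items(), key=lambda item: (-item[1], item[0]))[:k]
    return q_terms + [term for term, _ in best]


docs = [
    "jaguar car engine speed",
    "jaguar car engine price",
    "jaguar cat habitat jungle",
    "python snake habitat",
]

expanded = pseudo_relevance_feedback("jaguar car", docs, n=2, k=1)
print(expanded)
```

If the top n documents happen to be about the wrong sense of the query, the added terms pull the query further in that direction, which is the topic drift mentioned above.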

Global methods for query reformulation
- Independent of results
- Show query processing
- Let the user browse term lists
- Suggest terms from a thesaurus
- Show related user queries (log mining)
- So a user can reformulate their query

Building a thesaurus
- To suggest or just include related terms
- Thesaurus with controlled vocabulary in which each concept has a canonical form, like the UMLS (human editors), PubMed
- Manually built thesaurus with synonyms, broader and narrower terms, without canonical terms, like WordNet (human editors)

Building it automatically
- Automatic derivation of a thesaurus from a set of documents
  - using simply word cooccurrence
  - using grammatical analysis or relations (what is eaten is food)
  - query log mining
- Demos: Dutch similar words, English
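
The simplest of these options, word cooccurrence, might look like the sketch below. It assumes document-level cooccurrence (two terms are related if they appear in the same document) and ranks neighbours by raw count; the four toy documents and the function names are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_counts(docs):
    """Count, for every pair of terms, in how many documents they co-occur."""
    counts = defaultdict(lambda: defaultdict(int))
    for doc in docs:
        terms = sorted(set(doc.split()))
        for a, b in combinations(terms, 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def related_terms(term, counts, k=3):
    """Return the k terms that co-occur most often with `term`."""
    neighbours = counts.get(term, {})
    ranked = sorted(neighbours.items(), key=lambda item: (-item[1], item[0]))
    return [t for t, _ in ranked[:k]]


docs = [
    "aircraft engine wing",
    "plane engine wing",
    "aircraft wing pilot",
    "plane pilot",
]

counts = cooccurrence_counts(docs)
print(related_terms("aircraft", counts, k=2))
```

Raw cooccurrence finds associated terms (aircraft, wing) rather than true synonyms (aircraft, plane); in this toy collection "aircraft" and "plane" never share a document, which is one reason the grammatical-relation and log-mining approaches are listed as alternatives.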

Word Embedding
- Words are represented by a vector of e.g. 200 dimensions
- Similar words have similar vectors
- Vectors are trained on large amounts of text
- Popular, free implementation by Google: word2vec
- Other operations on vectors are possible, e.g. v(Madrid) – v(Spain) + v(France) yields a vector that is close to v(Paris)
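
The vector arithmetic in the last point can be illustrated with hand-made vectors. This is only a sketch: the 4-dimensional vectors below are hypothetical and chosen so the analogy works; real embeddings are learned from text and have far more dimensions. Closeness is measured with cosine similarity, which is a common choice but an assumption here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors; dimensions: (capital, Spain, France, Germany).
vectors = {
    "Madrid":  [1.0, 1.0, 0.0, 0.0],
    "Spain":   [0.0, 1.0, 0.0, 0.0],
    "Paris":   [1.0, 0.0, 1.0, 0.0],
    "France":  [0.0, 0.0, 1.0, 0.0],
    "Berlin":  [1.0, 0.0, 0.0, 1.0],
    "Germany": [0.0, 0.0, 0.0, 1.0],
}


def analogy(a, b, c, vectors):
    """Return the word whose vector is closest to v(a) - v(b) + v(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))


nearest = analogy("Madrid", "Spain", "France", vectors)
print(nearest)
```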

Brief (non-technical) history
- Early keyword-based engines ca Altavista, Excite, Infoseek, Inktomi, Lycos
- Paid search ranking: Goto (morphed into Overture.com → Yahoo!)
  - Your search ranking depended on how much you paid
  - Auction for keywords: casino was expensive!

Brief (non-technical) history
- 1998+: Link-based ranking pioneered by Google
  - Blew away all early engines save Inktomi
  - Great user experience in search of a business model
  - But: Goto/Overture’s annual revenues were nearing $1 billion
- Result: Google added paid search “ads” to the side, independent of search results
  - Yahoo followed suit, acquiring Overture (for paid placement) and Inktomi (for search)
- 2005+: Google gains search share, dominating in Europe and very strong in North America

Algorithmic results. Paid Search Ads

Web search basics
[Figure: the Web, Web spider, indexer, indexes, ad indexes, search, user]

The Web document collection (Sec. 19.2)
- No design/co-ordination
- Distributed content creation, linking, democratization of publishing
- Content includes truth, lies, obsolete information, contradictions …
- Unstructured (text, html, …), semi-structured (XML, annotated photos), structured (databases) …
- Scale much larger than previous text collections … but corporate records are catching up
- Growth: slowed down from initial “volume doubling every few months” but still expanding
- Content can be dynamically generated

Indexing anchor text
- When indexing a document D, include (with some weight) anchor text from links pointing to D.
[Figure: pages linking to D; recoverable text includes "Sun HP IBM" and "Big Blue today announced record profits for the quarter"]

Indexing anchor text
- Thus: anchor text is often a better description of a page’s content than the page itself.
- Anchor text can be weighted more highly than document text.
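
A minimal sketch of indexing a page together with the anchor text of its incoming links is given below. Assumptions: the index is a simple term-weight table per page, body terms count 1.0 each, and anchor terms count `anchor_weight` each (2.0 is an arbitrary value chosen for the example). The page URLs and the function name are hypothetical.

```python
from collections import Counter


def index_with_anchor_text(pages, links, anchor_weight=2.0):
    """Build a term -> weight table per page.

    pages: dict mapping url -> body text
    links: list of (source_url, target_url, anchor_text) tuples
    """
    index = {url: Counter() for url in pages}

    # Terms from the page's own text.
    for url, text in pages.items():
        for term in text.lower().split():
            index[url][term] += 1.0

    # Terms from anchor text of links pointing to the page, weighted higher.
    for _source, target, anchor in links:
        if target in index:
            for term in anchor.lower().split():
                index[target][term] += anchor_weight
    return index


pages = {"ibm.example": "Welcome to our home page"}
links = [
    ("news.example", "ibm.example", "Big Blue"),
    ("hardware.example", "ibm.example", "IBM"),
]

index = index_with_anchor_text(pages, links)
print(index["ibm.example"])
```

In this example the page itself never mentions "IBM" or "Big Blue", yet both become searchable for it through the anchor text.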

Google bombs
- A Google bomb is a search with “bad” results due to maliciously manipulated anchor text.
- Google introduced a new weighting function in January 2007 that fixed many Google bombs.
- Still some remnants: [gengszterek] (gangsters)
  - Coordinated link creation by those who dislike Fidesz, the main ruling political party
- Defused Google bombs: [miserable failure], [dangerous cult]

Indexing anchor text
- Can sometimes have unexpected side effects
- Solution: score anchor text with a weight depending on the authority of the anchor page’s website
  - E.g., if we were to assume that content from cnn.com or yahoo.com is authoritative, then trust the anchor text from them
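
Authority-dependent weighting of anchor text can be sketched like this. The authority table, the default weight of 0.1, and the example links are all assumptions made for illustration; how a real engine estimates authority is not specified here.

```python
from collections import Counter
from urllib.parse import urlparse

# Assumed authority scores per site; unknown sites get a low default.
AUTHORITY = {"cnn.com": 1.0, "yahoo.com": 1.0}
DEFAULT_AUTHORITY = 0.1


def anchor_scores(links):
    """Score anchor terms for one target page.

    links: list of (source_url, anchor_text) tuples pointing to the page.
    Each anchor term contributes the authority of the linking site.
    """
    scores = Counter()
    for source_url, anchor in links:
        host = urlparse(source_url).hostname or ""
        if host.startswith("www."):
            host = host[4:]
        weight = AUTHORITY.get(host, DEFAULT_AUTHORITY)
        for term in anchor.lower().split():
            scores[term] += weight
    return scores


links = [
    ("http://www.cnn.com/tech", "computer maker"),
    ("http://spam.example/a", "miserable failure"),
    ("http://spam.example/b", "miserable failure"),
]

scores = anchor_scores(links)
print(scores)
```

Two links from an unknown site still weigh less than one link from a trusted site, which limits the effect of coordinated link creation.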

The trouble with paid search ads …
- It costs money. The alternative?
- Search Engine Optimization:
  - “Tuning” your web page to rank highly in the algorithmic search results for select keywords
  - Alternative to paying for placement
  - Thus, intrinsically a marketing function
- Performed by companies, webmasters and consultants for their clients
- Some perfectly legitimate, some very shady

Simplest forms
- First generation engines relied heavily on tf/idf
  - The top-ranked pages for the query ‘maui resort’ were the ones containing the most ‘maui’s and ‘resort’s
- SEOs responded with dense repetitions of chosen terms
  - e.g., “maui resort maui resort maui resort”
  - Often, the repetitions would be in the same color as the background of the web page
    - Repeated terms got indexed by crawlers
    - But not visible to humans on browsers
- Variant: repeated/misleading meta tags
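
Why repetition worked against term-frequency ranking can be shown with a toy scorer. This sketch uses raw term frequency only (no idf and no length normalisation), which is a simplification of what first generation engines did; the page texts are invented.

```python
from collections import Counter


def tf_score(query, page_text):
    """Sum of raw term frequencies of the query terms in the page."""
    counts = Counter(page_text.lower().split())
    return sum(counts[term] for term in query.lower().split())


honest = "Our maui resort offers beach views and a pool"
# Keyword stuffing: append 20 repetitions of the chosen terms.
stuffed = honest + " maui resort" * 20

honest_score = tf_score("maui resort", honest)
stuffed_score = tf_score("maui resort", stuffed)
print(honest_score, stuffed_score)
```

The stuffed page scores 42 against 2 for the honest page, although a human reader sees the same visible content when the repetitions are hidden.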

Cloaking
- Serve fake content to search engine spider
- DNS cloaking: Switch IP address. Impersonate
[Figure: decision "Is this a search engine spider?" Y: serve SPAM, N: serve the real doc]

More spam techniques
- Doorway pages (pages optimized for a single keyword that re-direct to the real target page)
- Link spamming
  - mutual admiration societies, hidden links, awards (more on these later)
  - domain flooding (numerous domains that point or re-direct to a target page)
- Robots
  - Fake query stream: rank checking programs
  - Millions of submissions via Add-Url