
1 Gertjan van Noord (2014), Search Engines, Lecture 7: Relevance feedback & query reformulation

2 Query expansion
Global methods: e.g. add (near) synonyms
- using a thesaurus
- using automatically constructed resources
Local methods: based on the initial results of the query
- relevance feedback
- pseudo-relevance feedback
- indirect relevance feedback

3 Relevance feedback
1. The user formulates a query
2. The system returns a list of results
3. The user marks one or more documents in the result list as relevant
4. The system uses this information to modify the original query
5. The system returns a new (or reordered) list of results

4 Which document information?
The query and the documents are both represented as vectors.
Positive feedback: add terms from the document vector to the query, or give some original query terms more weight.
Negative feedback: give some query terms less weight.

5 Negative feedback
Not commonly asked of the user.
The highest-ranked document that is NOT marked relevant can be treated as non-relevant by default.

6 Rocchio feedback formula
\vec{q}_{mod} = \alpha \vec{q}_0 + \frac{\beta}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \frac{\gamma}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j
where
- \vec{q}_{mod} / \vec{q}_0: modified query / original query
- D_r: set of relevant documents
- D_{nr}: set of not-relevant documents
- \vec{d}_j: document vector

7 Rocchio feedback formula
α, β, γ: weights that control how strongly each factor contributes.

8 Exercise
alpha = 1, beta = 1, gamma = 0.5
Original query: query retrieval interface
Results:
- Document 1: query interface (relevant)
- Document 2: query text (relevant)
- Document 3: gps interface (not relevant)

9 Answer

Term        q0  d1  d2  d3  Value in qmod
gps          0   0   0   1  0 + 0   - 0.5 = -0.5 -> 0
interface    1   1   0   1  1 + 0.5 - 0.5 = 1
query        1   1   1   0  1 + 1   - 0   = 2
retrieval    1   0   0   0  1 + 0   - 0   = 1
text         0   0   1   0  0 + 0.5 - 0   = 0.5

The new vector qmod is: (interface, query, retrieval, text) = (1, 2, 1, 0.5)
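The computation above can be checked with a short Python sketch; this is my own illustration, not code from the lecture, and the function and variable names are made up for the example.

```python
# Minimal Rocchio feedback sketch: query and documents are bags of words
# turned into term-weight dictionaries; negative weights are clipped to 0.
from collections import Counter

def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=1.0, gamma=0.5):
    """Return the modified query vector q_mod as a term -> weight dict."""
    terms = set(q0) | {t for d in relevant + nonrelevant for t in d}
    q_mod = {}
    for t in terms:
        w = alpha * q0.get(t, 0)
        if relevant:
            w += beta * sum(d.get(t, 0) for d in relevant) / len(relevant)
        if nonrelevant:
            w -= gamma * sum(d.get(t, 0) for d in nonrelevant) / len(nonrelevant)
        q_mod[t] = max(w, 0.0)
    return q_mod

# The exercise from the slides:
q0 = Counter("query retrieval interface".split())
d1 = Counter("query interface".split())   # relevant
d2 = Counter("query text".split())        # relevant
d3 = Counter("gps interface".split())     # not relevant
print(sorted(rocchio(q0, [d1, d2], [d3]).items()))
# [('gps', 0.0), ('interface', 1.0), ('query', 2.0), ('retrieval', 1.0), ('text', 0.5)]
```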

10 Changing the query vector

11 Evaluation of RF
- At least 5 documents should be marked for good results
- Measuring the effect: what do you do with the first result list?
- User studies with a time-based comparison are the fairest

12 When is RF not sufficient?
- Insufficient relevance judgements (minimum of 5)
- If the first query fails: spelling errors, cross-language search, vocabulary mismatch
- If the result includes more than one cluster of relevant documents: subsets with different vocabulary, a disjunctive answer set

13 Problems of RF
- Users don't like to keep interacting with a search
- The modified results can be hard to understand
- High computational costs
- In web search, recall is not very important

14 Relevance FB and the web
Simple form: "more like this".
But: users are not interested in high recall, only in precision.

15 Feedback without explicit action from the user
Pseudo-relevance feedback: simply assume that the top n documents are relevant. Works well on average, but sometimes causes topic drift (see the sketch below).
Indirect relevance feedback: use clickstream data (aggregated or per individual user).
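A hedged sketch of pseudo-relevance feedback, reusing the rocchio() function from the earlier sketch; the toy scoring function, the top-n cutoff and the parameter values are assumptions for illustration only.

```python
# Pseudo-relevance feedback: run the original query, assume the top-n
# results are relevant (no user input), and expand the query with Rocchio.
def score(query, doc):
    # Toy overlap score; a real engine would use tf-idf or BM25.
    return sum(query.get(t, 0) * doc.get(t, 0) for t in query)

def pseudo_relevance_feedback(q0, collection, n=2, alpha=1.0, beta=0.75):
    ranked = sorted(collection, key=lambda d: score(q0, d), reverse=True)
    pseudo_relevant = ranked[:n]      # assumed relevant without asking the user
    return rocchio(q0, pseudo_relevant, [], alpha=alpha, beta=beta, gamma=0.0)
```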

16 Global methods for query reformulation
Independent of the results:
- show how the query was processed
- let the user browse term lists
- suggest terms from a thesaurus
- show related user queries (log mining)
So the user can reformulate the query themselves.

17 Building a thesaurus
To suggest related terms, or to add them to the query directly.
- A thesaurus with a controlled vocabulary in which each concept has a canonical form, like the UMLS used by PubMed (human editors)
- A manually built thesaurus with synonyms, broader and narrower terms but without canonical forms, like WordNet (human editors)

18 Building it automatically
Automatic derivation of a thesaurus from a set of documents:
- using simple word co-occurrence (sketched below)
- using grammatical analysis or relations (what is eaten is food)
- query log mining
(demos: Dutch similar words, English similar words)
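A minimal sketch of the co-occurrence approach, assuming documents are plain strings; this is my own illustration of the idea, not the lecturer's implementation.

```python
# Derive "related terms" from word co-occurrence within documents,
# then rank candidate terms by cosine similarity of their co-occurrence vectors.
import math
from collections import defaultdict

def cooccurrence_vectors(docs):
    """Map each term to a dict of co-occurrence counts with other terms."""
    vectors = defaultdict(lambda: defaultdict(int))
    for doc in docs:
        terms = set(doc.lower().split())
        for t in terms:
            for u in terms:
                if t != u:
                    vectors[t][u] += 1
    return vectors

def cosine(v, w):
    dot = sum(v[k] * w.get(k, 0) for k in v)
    norm = math.sqrt(sum(x * x for x in v.values())) * math.sqrt(sum(x * x for x in w.values()))
    return dot / norm if norm else 0.0

def similar_terms(term, vectors, k=5):
    """Return the k terms whose co-occurrence vectors are most similar to term's."""
    return sorted(((cosine(vectors[term], vectors[u]), u)
                   for u in vectors if u != term), reverse=True)[:k]
```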

19 Word embeddings
- Words are represented by a vector of e.g. 200 dimensions
- Similar words have similar vectors
- Vectors are trained on large amounts of text
- Popular, free implementation by Google: word2vec
- Other operations on vectors are possible, e.g. v(Madrid) - v(Spain) + v(France) yields a vector that is close to v(Paris)
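The analogy example can be reproduced with gensim's word2vec support; the vectors file name below is a placeholder for whatever pretrained word2vec model is available locally.

```python
# Word-vector analogy sketch: v(Madrid) - v(Spain) + v(France) ~ v(Paris).
from gensim.models import KeyedVectors

# "pretrained-vectors.bin" is a placeholder path, not a file shipped with the lecture.
vectors = KeyedVectors.load_word2vec_format("pretrained-vectors.bin", binary=True)

# The nearest neighbours of the combined vector should include "Paris".
print(vectors.most_similar(positive=["Madrid", "France"], negative=["Spain"], topn=3))
```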

20 Brief (non-technical) history
Early keyword-based engines, ca. 1995-1997: Altavista, Excite, Infoseek, Inktomi, Lycos
Paid search ranking: Goto (morphed into Overture.com, later part of Yahoo!)
- Your search ranking depended on how much you paid
- Auction for keywords: casino was expensive!

21 Brief (non-technical) history
1998+: link-based ranking pioneered by Google
- Blew away all early engines save Inktomi
- Great user experience in search of a business model
- But: Goto/Overture's annual revenues were nearing $1 billion
- Result: Google added paid search "ads" to the side, independent of the search results
Yahoo followed suit, acquiring Overture (for paid placement) and Inktomi (for search)
2005+: Google gains search share, dominating in Europe and very strong in North America

22 [Screenshot: algorithmic results shown alongside paid search ads]

23 Web search basics (Sec. 19.4.1)
[Diagram: the Web, a web spider, the indexer, the indexes, ad indexes, the search interface, and the user]

24 The Web document collection (Sec. 19.2)
- No design/co-ordination
- Distributed content creation, linking, democratization of publishing
- Content includes truth, lies, obsolete information, contradictions, ...
- Unstructured (text, HTML, ...), semi-structured (XML, annotated photos), structured (databases), ...
- Scale much larger than previous text collections, but corporate records are catching up
- Growth slowed down from the initial "volume doubling every few months", but still expanding
- Content can be dynamically generated

25 Indexing anchor text
When indexing a document D, include (with some weight) anchor text from links pointing to D.
[Figure: links from pages at Sun and HP pointing to www.ibm.com, with anchor texts such as "IBM" and "Big Blue today announced record profits for the quarter"]

26 Indexing anchor text
Thus: anchor text is often a better description of a page's content than the page itself.
Anchor text can be weighted more highly than document text.

27 Google bombs
A Google bomb is a search with "bad" results due to maliciously manipulated anchor text.
Google introduced a new weighting function in January 2007 that fixed many Google bombs.
Still some remnants: [gengszterek] (gangsters), coordinated link creation by those who dislike Fidesz, the main ruling political party.
Defused Google bombs: [miserable failure], [dangerous cult]

28 Indexing anchor text (Sec. 21.1.1)
Can sometimes have unexpected side effects.
Solution: score anchor text with a weight that depends on the authority of the anchor page's website.
E.g., if we assume that content from cnn.com or yahoo.com is authoritative, then trust the anchor text from those sites. A small sketch follows below.
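A minimal sketch of anchor-text indexing with authority-dependent weights; the authority scores, boost factor and index layout are assumptions made up for this illustration, not part of the lecture.

```python
# When indexing document D, also index the anchor text of links pointing to D,
# weighted by the (assumed) authority of the linking site.
from collections import defaultdict

SITE_AUTHORITY = {"cnn.com": 1.0, "yahoo.com": 0.9}   # hypothetical scores
DEFAULT_AUTHORITY = 0.1
ANCHOR_BOOST = 2.0   # anchor text weighted more highly than body text

def index_document(url, body_text, inlinks, index):
    """index maps term -> {url: weight}; inlinks is a list of (source_site, anchor_text)."""
    for term in body_text.lower().split():
        index[term][url] = index[term].get(url, 0.0) + 1.0
    for source_site, anchor_text in inlinks:
        weight = ANCHOR_BOOST * SITE_AUTHORITY.get(source_site, DEFAULT_AUTHORITY)
        for term in anchor_text.lower().split():
            index[term][url] = index[term].get(url, 0.0) + weight

index = defaultdict(dict)
index_document("www.ibm.com", "services and consulting",
               [("cnn.com", "IBM"), ("example.org", "Big Blue")], index)
```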

29 The trouble with paid search ads ... (Sec. 19.2.2)
It costs money. The alternative?
Search Engine Optimization (SEO):
- "Tuning" your web page to rank highly in the algorithmic search results for selected keywords
- An alternative to paying for placement, thus intrinsically a marketing function
- Performed by companies, webmasters and consultants for their clients
- Some perfectly legitimate, some very shady

30 Simplest forms (Sec. 19.2.2)
First-generation engines relied heavily on tf-idf:
- The top-ranked pages for the query 'maui resort' were the ones containing the most occurrences of 'maui' and 'resort'
SEOs responded with dense repetitions of chosen terms, e.g. "maui resort maui resort maui resort"
- Often the repetitions were in the same color as the background of the web page: repeated terms got indexed by crawlers but were not visible to humans in browsers
Variant: repeated/misleading meta tags
A tiny illustration follows below.
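A tiny illustration (my own, not from the lecture) of why pure term-frequency ranking invites keyword stuffing: repeating a term inflates its score.

```python
# Naive term-frequency scoring: a page wins simply by repeating the query terms.
def tf_score(query_terms, doc_text):
    tokens = doc_text.lower().split()
    return sum(tokens.count(t) for t in query_terms)

honest  = "maui resort with ocean view and great snorkeling"
stuffed = "maui resort " * 50        # invisible repetitions on a spam page

print(tf_score(["maui", "resort"], honest))    # 2
print(tf_score(["maui", "resort"], stuffed))   # 100
```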

31 Cloaking (Sec. 19.2.2)
Serve fake content to the search engine spider.
DNS cloaking: switch IP address, impersonate.
[Diagram: "Is this a search engine spider?" If yes, serve the SPAM page; if no, serve the real document]
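The decision in the diagram amounts to a single check on the incoming request; a minimal sketch of that logic, with a made-up spider detection rule, purely to clarify the mechanism being described.

```python
# Cloaking logic from the diagram: serve different content depending on
# whether the requester looks like a search engine spider (illustration only).
KNOWN_SPIDER_AGENTS = ("googlebot", "bingbot")   # hypothetical detection rule

def serve_page(user_agent, real_doc, spam_doc):
    is_spider = any(bot in user_agent.lower() for bot in KNOWN_SPIDER_AGENTS)
    return spam_doc if is_spider else real_doc
```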

32 More spam techniques (Sec. 19.2.2)
- Doorway pages (pages optimized for a single keyword that redirect to the real target page)
- Link spamming (mutual admiration societies, hidden links, awards; more on these later)
- Domain flooding (numerous domains that point or redirect to a target page)
- Robots: fake query streams (rank-checking programs), millions of submissions via Add-URL

