
1 Improvements and extras Paul Thomas CSIRO

2 Overview of the lectures
1. Introduction to information retrieval (IR)
2. Ranked retrieval
3. Probabilistic retrieval
4. Evaluating IR systems
5. Improvements and extras

3 Problems matching terms “It is impossibly difficult for users to predict the exact words, word combinations, and phrases that are used by all (or most) relevant documents and only (or primarily) by those documents” (Blair and Maron 1985)

4 Query refinement
It's hard to get queries right, especially if you don't know:
- What you're searching for; or
- What you're searching in
We can refine a query:
- Manually
- Automatically

5 Automatic refinement

6 Relevance feedback
Assume that relevant documents have something in common. Then if we have some documents we know are relevant, we can find more like those.
1. Return the documents we think are relevant;
2. User provides feedback on one or more;
3. Return a new set, taking that feedback into account.

7 An example

8 In vector space
A query can be represented as a vector; so can all documents, relevant or not. We want to adjust the query vector so it's:
- Closer to the centroid of the relevant documents
- And away from the centroid of the non-relevant documents

9 Moving a query vector

10 Rocchio's algorithm
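The formula on slide 10 isn't in the transcript; the usual statement of Rocchio's update is q_m = α·q_0 + (β/|D_r|)·Σ_{d in D_r} d − (γ/|D_nr|)·Σ_{d in D_nr} d. A minimal NumPy sketch of that update, assuming queries and documents are already term-weight vectors; the default weights here are illustrative, not values from the lecture:

    import numpy as np

    def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
        """Move the query vector towards the centroid of the relevant documents
        and away from the centroid of the non-relevant ones (Rocchio)."""
        q_new = alpha * query
        if len(relevant) > 0:
            q_new = q_new + beta * np.mean(relevant, axis=0)
        if len(non_relevant) > 0:
            q_new = q_new - gamma * np.mean(non_relevant, axis=0)
        # Negative term weights are usually clipped to zero.
        return np.maximum(q_new, 0.0)

    # e.g. with three-term vectors:
    # rocchio(np.array([1., 0., 0.]), [np.array([1., 1., 0.])], [np.array([0., 0., 1.])])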

11 In probabilistic retrieval
With real relevance judgements, we can make better estimates of probability P(rel|q,d).
p_i ≈ (w + 0.5) / (w + y + 1)
Or, to get smoother estimates:
p_i' ≈ (w + κ·p_i) / (w + y + κ)
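A small sketch of these estimates, on the usual reading (as in the Robertson-Sparck Jones feedback weight) that w counts judged-relevant documents containing term i and y counts those without it; κ and the prior p_i are smoothing choices:

    def p_feedback(w, y):
        """p_i ≈ (w + 0.5) / (w + y + 1): estimate from relevance judgements."""
        return (w + 0.5) / (w + y + 1)

    def p_smoothed(w, y, prior, kappa=5.0):
        """p_i' ≈ (w + κ·p_i) / (w + y + κ): shrink towards a prior estimate."""
        return (w + kappa * prior) / (w + y + kappa)

    # e.g. 4 of 6 judged-relevant documents contain the term:
    print(p_feedback(4, 2))              # ≈ 0.64
    print(p_smoothed(4, 2, prior=0.5))   # ≈ 0.59, pulled towards the prior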

12 In Lucene
Programmatically: Query.setBoost(float b)
In the query syntax: term^boost
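For example, a query reformulated by feedback might give expansion terms lower weight using the term^boost syntax; a hypothetical sketch (the terms and weights are invented):

    # Hypothetical: build a Lucene query string where expansion terms from
    # relevance feedback get lower boosts than the user's original terms.
    original_terms = ["information", "retrieval"]
    expansion_weights = {"feedback": 0.5, "rocchio": 0.3}

    query = " ".join(original_terms +
                     [f"{term}^{weight}" for term, weight in expansion_weights.items()])
    print(query)  # information retrieval feedback^0.5 rocchio^0.3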

13 Pseudo-relevance feedback
We can assume the top k ranked documents are relevant.
- Less accurate (probably);
- But less effort (definitely).
Or an in-between option: use implicit relevance feedback. For example, use clicks to refine future ranking.
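A minimal sketch of pseudo-relevance feedback over tokenised documents: assume the top k of the initial ranking are relevant and add their commonest terms to the query. A real system would weight the expansion terms (e.g. with Rocchio) rather than just appending them:

    from collections import Counter

    def pseudo_feedback(query_terms, ranked_docs, k=10, n_expansion=5):
        """Assume the top k documents are relevant and expand the query
        with their most frequent terms (ignoring terms already in the query)."""
        counts = Counter()
        for doc in ranked_docs[:k]:          # doc: list of tokens
            counts.update(doc)
        for term in query_terms:
            counts.pop(term, None)
        expansion = [term for term, _ in counts.most_common(n_expansion)]
        return list(query_terms) + expansion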

14 When does it work?
- Have to have a good enough first query.
- Have to have relevant documents which are similar to each other.
- Users have to be willing to provide feedback.

15 Web search

16 Why is the web different?
- Scale
- Authorship
- Document types
- Markup
- Link structure

17 The web graph
[Diagram: example pages (Paul's home page, CSIRO, ANU Research School, Collaborative projects, Past projects) joined by hyperlinks, with anchor text such as "…I work at the CSIRO as a researcher in information retrieval…"]

18 Making use of link structure
- Text in (or near) the link: treat this as part of the target document
- Indegree
- Graph-theoretic measures: centrality, betweenness, …
- PageRank
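For PageRank, a minimal power-iteration sketch over a toy link graph, assuming a damping factor of 0.85 and spreading the mass of dangling pages uniformly:

    def pagerank(links, damping=0.85, iterations=50):
        """links: dict mapping each page to the list of pages it links to."""
        pages = list(links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1.0 - damping) / n for p in pages}
            for page, outlinks in links.items():
                if outlinks:
                    share = damping * rank[page] / len(outlinks)
                    for target in outlinks:
                        new_rank[target] += share
                else:
                    # Dangling page: distribute its rank uniformly.
                    for target in pages:
                        new_rank[target] += damping * rank[page] / n
            rank = new_rank
        return rank

    # e.g. pagerank({"home": ["csiro", "anu"], "csiro": ["home"], "anu": []})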

19 [Diagram: the same example web graph, showing the links between Paul's home page, CSIRO, ANU Research School, Collaborative projects, and Past projects]

20 Incorporating PageRank
PageRank is query-independent evidence: it is the same for any query. Can simply combine this with query-dependent evidence such as probability of relevance, cosine distance, term counts, …
score(d,q) = α PageRank(d) + (1-α) similarity(d,q)
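The combination itself is just a weighted sum; a one-line sketch, where α is a tunable mixing weight rather than a value from the lecture:

    def combined_score(pagerank, similarity, alpha=0.2):
        """Blend query-independent evidence (PageRank) with
        query-dependent evidence (e.g. cosine similarity)."""
        return alpha * pagerank + (1 - alpha) * similarity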

21 Other forms of web evidence
- Trust in the host (or domain, or domain owner, or network block)
- Reports of spam or malware
- Frequency of updates
- Related queries which lead to the same place
- URLs
- Page length
- Language
- …

22 Machine learning for IR
Machine learning is a set of techniques for discovering rules from data. In information retrieval, we use machine learning for:
- Choosing parameters
- Classifying text
- Ranking documents

23 Classifiers
- Naive Bayes: find category c such that P(c|d) is maximised
- Support vector machines (SVM): find a separating hyperplane
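A small naive Bayes text-classification sketch, assuming scikit-learn is available; the training examples and labels are invented:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Toy training data (hypothetical labels).
    docs = ["cheap pills buy now", "meeting agenda attached",
            "win money fast", "project report draft"]
    labels = ["spam", "ok", "spam", "ok"]

    vectoriser = CountVectorizer()
    X = vectoriser.fit_transform(docs)           # bag-of-words counts
    classifier = MultinomialNB().fit(X, labels)  # picks c maximising P(c|d)

    print(classifier.predict(vectoriser.transform(["buy cheap pills"])))  # ['spam']

Swapping MultinomialNB for sklearn.svm.LinearSVC gives the SVM variant: a separating hyperplane over the same bag-of-words features.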

24 Learning parameters
[Plot: documents in a two-feature space, feature α (e.g. PageRank) against feature β (e.g. cosine), with a learned boundary score(α,β) = θ]
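One simple way to learn such a boundary is to fit a linear model over the two features using relevance labels; a sketch with invented data:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Each row is a (query, document) pair: [pagerank, cosine]; hypothetical values.
    X = np.array([[0.9, 0.8], [0.7, 0.6], [0.2, 0.3], [0.1, 0.4]])
    y = np.array([1, 1, 0, 0])   # 1 = relevant, 0 = not relevant

    model = LogisticRegression().fit(X, y)
    print(model.coef_, model.intercept_)            # learned feature weights and threshold
    print(model.predict_proba([[0.5, 0.7]])[:, 1])  # score for a new document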

25 Ranking
Ranking SVM: instead of classifying one document into {relevant, not relevant}, classify a pair of documents into {first better, second better}
Other learning-to-rank methods: RankNet, LambdaRank, LambdaMART, …
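A minimal sketch of the pairwise idea behind Ranking SVM: turn graded relevance labels into labelled difference vectors (which document of the pair is better) and train an ordinary linear SVM on them; the feature values are invented:

    import numpy as np
    from sklearn.svm import LinearSVC

    def to_pairs(X, y):
        """Difference vectors x_i - x_j, labelled +1 if doc i is better, -1 otherwise."""
        diffs, labels = [], []
        for i in range(len(y)):
            for j in range(len(y)):
                if y[i] != y[j]:
                    diffs.append(X[i] - X[j])
                    labels.append(1 if y[i] > y[j] else -1)
        return np.array(diffs), np.array(labels)

    X = np.array([[0.9, 0.8], [0.2, 0.3], [0.6, 0.1]])  # per-document features
    y = np.array([2, 0, 1])                             # graded relevance

    pair_X, pair_y = to_pairs(X, y)
    ranker = LinearSVC().fit(pair_X, pair_y)

    # Rank new documents by the learned weight vector:
    scores = X @ ranker.coef_.ravel()
    print(np.argsort(-scores))  # best-first document order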

26 What we covered today
- It's hard to write a good query: query rewriting
  - Manual
  - Automatic: spelling correction, thesauri, relevance feedback, pseudo-relevance feedback
- Web retrieval
  - Has to cope with large scale, antagonistic authors
  - But can make use of new features e.g. web graph
- Machine learning
  - Makes it possible to “learn” how to classify or rank at scale, with lots of features

27 Recap lecture 1
- Retrieval system = indexer + query processor
- Indexer (normally) writes an inverted file
- Query processor uses the index

28 Recap lecture 2
- Ranking search results: why it's important
- Term frequency and “bag of words”
- tf.idf
- Cosine similarity and the vector space model

29 Recap lecture 3
- Probabilistic retrieval uses probability theory to deal with the uncertainty of relevance
- Ranking by P(rel | d) is optimal (under some assumptions)
- We can turn this into a sum of term weights and use an index and accumulators
- Very popular, very influential, and still in vogue

30 Recap lecture 4
- Why should we evaluate? Efficiency and effectiveness
- Some ways to evaluate: observation, lab studies, log files, test collections
- Effectiveness measures

31 Now…
There's a lab starting a bit after 11, in the Computer Science building (N114):
- Getting started with lucene
- Working with trec_eval

32 paul.thomas@csiro.au

