
Improvements and extras Paul Thomas CSIRO

Overview of the lectures
1. Introduction to information retrieval (IR)
2. Ranked retrieval
3. Probabilistic retrieval
4. Evaluating IR systems
5. Improvements and extras

Problems matching terms “It is impossibly difficult for users to predict the exact words, word combinations, and phrases that are used by all (or most) relevant documents and only (or primarily) by those documents” (Blair and Maron 1985)

Query refinement
It's hard to get queries right, especially if you don't know:
what you're searching for; or
what you're searching in.
We can refine a query:
manually
automatically

Automatic refinement

Relevance feedback
Assume that relevant documents have something in common. Then if we have some documents we know are relevant, we can find more like those.
1. Return the documents we think are relevant;
2. User provides feedback on one or more;
3. Return a new set, taking that feedback into account.

An example

In vector space
A query can be represented as a vector; so can all documents, relevant or not. We want to adjust the query vector so it's:
closer to the centroid of the relevant documents, and
away from the centroid of the non-relevant documents.

Moving a query vector

Rocchio's algorithm
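The adjustment described on the previous slide can be sketched in a few lines of Python over plain term-weight vectors. This is an illustrative sketch, not the lecture's own code; the parameter values α = 1.0, β = 0.75, γ = 0.15 are common textbook defaults, not values given in the lecture.

```python
# A minimal sketch of Rocchio's algorithm. Vectors are plain Python
# lists of term weights; alpha, beta, gamma are illustrative defaults.

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward the relevant centroid and away from the
    non-relevant centroid; negative weights are clipped to zero, as is
    conventional."""
    rel_c = centroid(relevant)
    nrel_c = centroid(non_relevant)
    new_q = [alpha * q + beta * r - gamma * s
             for q, r, s in zip(query, rel_c, nrel_c)]
    return [max(0.0, w) for w in new_q]
```

The clipping step reflects that a query with negative term weights has no natural interpretation in most retrieval models.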

In probabilistic retrieval
With real relevance judgements, we can make better estimates of the probability P(rel | q, d).
p_i ≈ (w + 0.5) / (w + y + 1)
Or, to get smoother estimates:
p_i' ≈ (w + κ·p_i) / (w + y + κ)
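The two estimates above are simple enough to compute directly. This sketch assumes, as is conventional for estimates of this shape, that w counts judged-relevant documents containing the term and y those without it; the default for κ is purely illustrative.

```python
# Sketch of the term-probability estimates from the slide, using its
# own symbols. The interpretation of w and y (relevant documents with
# and without the term) is an assumption; kappa=5.0 is illustrative.

def p_i(w, y):
    """Estimate with add-0.5 smoothing: (w + 0.5) / (w + y + 1)."""
    return (w + 0.5) / (w + y + 1)

def p_i_smooth(w, y, prior, kappa=5.0):
    """Smoother estimate that shrinks toward a prior value of p_i:
    (w + kappa * prior) / (w + y + kappa)."""
    return (w + kappa * prior) / (w + y + kappa)
```

With few judgements, the κ term dominates and the estimate stays close to the prior; as judgements accumulate, the observed counts take over.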

In Lucene
Query.setBoost(float b)
term^boost

Pseudo-relevance feedback
We can assume the top k ranked documents are relevant.
Less accurate (probably);
but less effort (definitely).
Or an in-between option: use implicit relevance feedback. For example, use clicks to refine future ranking.
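One common way to act on the "top k are relevant" assumption is query expansion: add the most frequent terms from the top-ranked documents to the query. The sketch below is illustrative only; documents are plain strings, whereas a real system would use the index's term statistics and a proper weighting scheme.

```python
from collections import Counter

# A minimal sketch of pseudo-relevance feedback by query expansion:
# treat the top k ranked documents as relevant and add their most
# frequent unseen terms to the query.

def expand_query(query_terms, ranked_docs, k=2, n_new=2):
    counts = Counter()
    for doc in ranked_docs[:k]:          # assume top k are relevant
        counts.update(doc.lower().split())
    # Skip terms the query already has; take the n_new most frequent.
    new_terms = [t for t, _ in counts.most_common()
                 if t not in query_terms][:n_new]
    return list(query_terms) + new_terms
```

Note how this makes the "good enough first query" requirement on the next slide concrete: if the top k documents are off-topic, the expansion terms will be too.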

When does it work?
Have to have a good enough first query.
Have to have relevant documents which are similar to each other.
Users have to be willing to provide feedback.

Web search

Why is the web different?
Scale
Authorship
Document types
Markup
Link structure

The web graph
[Diagram: a small web graph of pages (Paul's home page, CSIRO, ANU Research School, Collaborative projects, Past projects); one link's anchor text reads "…I work at the CSIRO as a researcher in information retrieval…"]

Making use of link structure
Text in (or near) the link: treat this as part of the target document
Indegree
Graph-theoretic measures: centrality, betweenness, …
PageRank
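PageRank itself can be sketched as a power iteration over a toy link graph. This is an illustrative implementation, not code from the lecture; the damping factor 0.85 is the conventional choice, and the dangling-node handling (spreading rank evenly) is one standard option.

```python
# Illustrative power-iteration PageRank. `links` maps each page to the
# list of pages it links to; damping=0.85 is the conventional value.

def pagerank(links, damping=0.85, iterations=50):
    nodes = list(links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        new = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            out = links[u]
            if not out:                      # dangling page: spread evenly
                for v in nodes:
                    new[v] += damping * rank[u] / n
            else:
                for v in out:
                    new[v] += damping * rank[u] / len(out)
        rank = new
    return rank
```

On a small graph like the one in the diagram, a few dozen iterations are plenty for the ranks to converge.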


Incorporating PageRank
PageRank is query-independent evidence: it is the same for any query. Can simply combine this with query-dependent evidence such as probability of relevance, cosine distance, term counts, …
score(d, q) = α·PageRank(d) + (1 − α)·similarity(d, q)
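The combination above is a one-line helper in code. The value α = 0.2 below is purely illustrative; in practice α would be tuned, for example by the machine-learning approaches later in this lecture.

```python
# The slide's linear combination of query-independent evidence
# (PageRank) and query-dependent evidence (similarity). alpha=0.2
# is an illustrative default, not a value from the lecture.

def score(pagerank_d, similarity_dq, alpha=0.2):
    return alpha * pagerank_d + (1.0 - alpha) * similarity_dq
```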

Other forms of web evidence
Trust in the host (or domain, or domain owner, or network block)
Reports of spam or malware
Frequency of updates
Related queries which lead to the same place
URLs
Page length
Language
…

Machine learning for IR
Machine learning is a set of techniques for discovering rules from data. In information retrieval, we use machine learning for:
choosing parameters
classifying text
ranking documents

Classifiers
Naive Bayes: find the category c such that P(c|d) is maximised
Support vector machines (SVM): find a separating hyperplane
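The Naive Bayes idea (pick the category c maximising P(c|d)) fits in a short sketch. This is an illustrative multinomial Naive Bayes over whitespace-tokenised strings, not the lecture's code; add-one smoothing is a standard choice.

```python
from collections import Counter, defaultdict
import math

# A minimal multinomial Naive Bayes text classifier: choose the class c
# maximising log P(c) + sum over terms of log P(t|c), with add-one
# smoothing over the vocabulary.

class NaiveBayes:
    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        self.vocab = set()
        for doc, c in zip(docs, labels):
            for t in doc.split():
                self.term_counts[c][t] += 1
                self.vocab.add(t)
        self.total = len(labels)
        return self

    def predict(self, doc):
        best_c, best_score = None, float("-inf")
        v = len(self.vocab)
        for c, n_c in self.class_counts.items():
            s = math.log(n_c / self.total)           # log prior P(c)
            denom = sum(self.term_counts[c].values()) + v
            for t in doc.split():                    # log likelihoods
                s += math.log((self.term_counts[c][t] + 1) / denom)
            if s > best_score:
                best_c, best_score = c, s
        return best_c
```

Working in log space avoids underflow when documents have many terms; maximising the log posterior gives the same answer as maximising P(c|d).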

Learning parameters
[Plot: documents in a feature space with axes feature α (e.g. PageRank) and feature β (e.g. cosine), and a learned boundary score(α, β) = θ]

Ranking
Ranking SVM: instead of classifying one document into {relevant, not relevant}, classify a pair of documents into {first better, second better}
RankNet
LambdaRank
LambdaMART
…
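The pairwise transform behind Ranking SVM can be sketched as follows: each pair of documents with different relevance grades becomes one classification instance, represented by the difference of their feature vectors. This is an illustrative sketch; features here are plain tuples, where a real system would use vectors of retrieval features.

```python
from itertools import combinations

# Sketch of the pairwise transform used by Ranking SVM: turn graded
# documents into "first better" / "second better" training instances.

def pairwise_instances(docs):
    """docs: list of (features, relevance_grade) pairs.
    Returns (feature_difference, label) instances, with label +1 if the
    first document is better and -1 otherwise."""
    instances = []
    for (f1, r1), (f2, r2) in combinations(docs, 2):
        if r1 == r2:
            continue                     # equal grades give no preference
        diff = tuple(a - b for a, b in zip(f1, f2))
        instances.append((diff, 1 if r1 > r2 else -1))
    return instances
```

Any binary classifier (such as the SVM from the previous slide) can then be trained on these difference vectors; the learned weights induce a ranking over single documents.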

What we covered today
It's hard to write a good query: query rewriting
Manual
Automatic: spelling correction, thesauri, relevance feedback, pseudo-relevance feedback
Web retrieval
Has to cope with large scale, antagonistic authors
But can make use of new features, e.g. the web graph
Machine learning
Makes it possible to “learn” how to classify or rank at scale, with lots of features

Recap lecture 1
Retrieval system = indexer + query processor
Indexer (normally) writes an inverted file
Query processor uses the index

Recap lecture 2
Ranking search results: why it's important
Term frequency and “bag of words”
tf.idf
Cosine similarity and the vector space model

Recap lecture 3
Probabilistic retrieval uses probability theory to deal with the uncertainty of relevance
Ranking by P(rel | d) is optimal (under some assumptions)
We can turn this into a sum of term weights and use an index and accumulators
Very popular, very influential, and still in vogue

Recap lecture 4
Why should we evaluate? Efficiency and effectiveness
Some ways to evaluate: observation, lab studies, log files, test collections
Effectiveness measures

Now…
There's a lab starting a bit after 11, in the Computer Science building (N114):
Getting started with Lucene
Working with trec_eval