CS246: Information Retrieval

CS246: Information Retrieval
Junghoo “John” Cho, UCLA

Web Search
- User issues a “keyword” query
- System returns “relevant” pages
- This task is known as information retrieval (IR)

Power of Search
- Extremely intuitive: matches “how we think”
- Nothing to “learn”
  - Compare with other “human-computer interfaces”: C, Java, SQL, HTML, GUI, …
- Enormously successful
  - Made the Web useful for practically everyone with Internet access

Challenge
- How can a computer figure out which pages are “relevant”?
  - Both queries and data are fuzzy: unstructured text and natural-language queries
  - What documents are good matches for a query?
  - Computers do not “understand” the documents or the queries
- Fundamentally, isn’t relevance a subjective notion?
  - Is my home page relevant to “handsome man”? AskArturo.com
- We need a formal definition of our “search problem” to have any hope of building a system that solves it
- Q: How can we formalize our problem?

Information-Retrieval Problem
- Data source: a set of text documents
  - Vocabulary (= lexicon): $V = \{w_1, w_2, \ldots, w_M\}$
  - Document: $d_i = (w_{i1}, w_{i2}, \ldots, w_{i l_i})$
  - Collection (= corpus): $C = \{d_1, \ldots, d_N\}$
- Query: a sequence of words $q = (w_1, \ldots, w_l)$
- Output: a set of documents
  - Set of “relevant” documents: $R(q) \subseteq C$
  - Document set returned by the system: $R'(q) \subseteq C$
  - We want to make $R'(q)$ as close to $R(q)$ as possible
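To make the setup concrete, here is a toy instance in Python (all names and data are illustrative, not from the slides):

```python
# A toy instance of the formal setup: documents are word sequences,
# the collection is a list of documents, and a query is a word sequence.
documents = [
    ["ucla", "physics", "admissions"],  # d_1
    ["stanford", "physics"],            # d_2
    ["mit", "admissions"],              # d_3
]
query = ["ucla", "physics"]

# The system's task: compute R'(q), a subset of the collection that
# approximates the true (unknown) relevant set R(q).
```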

What Does “Close” Mean?
- $R$: set of “relevant” documents
- $R'$: set of documents returned by the system
- Precision: among the pages returned, what fraction is relevant?
  $$P = \frac{|R \cap R'|}{|R'|}$$
- Recall: among the relevant pages, what fraction is returned?
  $$\mathit{Recall} = \frac{|R \cap R'|}{|R|}$$

Evaluation: Precision and Recall
- $|R| = 5$, $|R'| = 10$, $|R \cap R'| = 3$. What are the precision and recall?
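Working it out from the definitions above:
$$P = \frac{|R \cap R'|}{|R'|} = \frac{3}{10} = 0.3, \qquad \mathit{Recall} = \frac{|R \cap R'|}{|R|} = \frac{3}{5} = 0.6$$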

Precision-Recall Trade-Off
- Returning more documents gives higher recall but (typically) lower precision
- Depending on where we set this trade-off, we get a curve of precision versus recall
- Q: Given two precision-recall curves, which algorithm is better?

F1 Score
- A single metric combining precision and recall:
  $$F_1 = \frac{2}{\frac{1}{P} + \frac{1}{R}} = \frac{2PR}{P + R} \qquad (0 \le F_1 \le 1)$$
- The “harmonic mean” of precision and recall
- The general F score is a “weighted version” of $F_1$:
  $$F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} = \frac{(\beta^2 + 1)PR}{\beta^2 P + R} \qquad \left(\alpha = \frac{1}{1 + \beta^2}\right)$$
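These formulas translate directly into code; a minimal sketch (the function names are mine):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def f_beta_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """General F score; beta > 1 weights recall more, beta < 1 precision."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (beta ** 2 + 1) * precision * recall / denom

# Numbers from the previous slide: P = 0.3, R = 0.6
print(f1_score(0.3, 0.6))  # 0.4
```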

Where to Get the Relevant Pages $R$?
- How do we obtain the set of relevant pages $R$?
- Option 1: Ask Arturo!!!
- Option 2:
  - Collect a large set of real queries and their “relevant” pages from users
  - Evaluate the system based on the collected “benchmark dataset” (or “golden dataset”)
- Search engines dedicate enormous resources to building a high-quality benchmark dataset
  - The effectiveness of their systems depends heavily on the quality of their benchmark data
- Evaluating an IR system is fundamentally a subjective and empirical task, requiring human judgment

Back to the IR Problem
- Document: $d_i = (w_{i1}, w_{i2}, \ldots, w_{i l_i})$
- Collection: $C = \{d_1, \ldots, d_N\}$
- Query: $q = (w_1, \ldots, w_l)$
- Task: compute $R'(q)$
- Q: How can the system compute $R'(q)$? How can we formulate this task as a computational problem?

IR as a Decision Problem
- Given $q$, for all $d \in C$, decide whether $d \in R(q)$
- More formally, we compute
  $$R'(q) = \{\, d \in C \mid f(d, q) = 1 \,\}$$
  where $f(d, q) \in \{0, 1\}$ is a binary classifier
- Q: How can we build the classifier $f$? (A strawman sketch follows.)
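As a strawman, here is one trivially simple classifier, previewing the Boolean model discussed later. A sketch, assuming “relevant” means “contains every query word”:

```python
def f(d: list[str], q: list[str]) -> int:
    """Strawman binary classifier: d is 'relevant' iff it contains
    every query word (the Boolean model's assumption)."""
    return int(set(q) <= set(d))

def retrieve(C: list[list[str]], q: list[str]) -> list[list[str]]:
    """R'(q) = { d in C : f(d, q) = 1 }"""
    return [d for d in C if f(d, q) == 1]
```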

IR Problem: Ideal Solution
- Apply natural language processing (NLP) algorithms to “understand” the “meaning” of the document and the query, and decide whether they are relevant
- Q: To understand the meaning of a sentence, what do we need to do?

NLP Pipeline
- Example sentence: “A girl saw the dog in the park.”
- Lexical analysis (POS tagging): tag each word, e.g. Det N V Det N P Det N
- Syntactic analysis (parsing): build a parse tree from S, NP, VP, PP constituents
- Semantic analysis: extract predicates, e.g. Girl(a), Dog(b), Park(c), Saw(a, b, c)
- Inference: apply rules such as Saw(a, _, _) ⇒ Had(a, b), Eye(b)
- Legend: S = sentence, NP = noun phrase, VP = verb phrase, PP = prepositional phrase, Det = determiner, N = noun, V = verb, P = preposition
[The slide shows the parse tree built over the example sentence at each stage]
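To see the first stage of this pipeline in action, NLTK’s tokenizer and POS tagger can be run on the example sentence. A sketch, assuming the nltk package is installed (resource names vary slightly across NLTK versions):

```python
import nltk

# One-time download of tokenizer and tagger models (classic resource names;
# newer NLTK versions may use "punkt_tab" / "averaged_perceptron_tagger_eng").
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

sentence = "A girl saw the dog in the park."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# Roughly: [('A', 'DT'), ('girl', 'NN'), ('saw', 'VBD'), ('the', 'DT'),
#           ('dog', 'NN'), ('in', 'IN'), ('the', 'DT'), ('park', 'NN'), ('.', '.')]
```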

NLP: How Well Can We Do It?
- Unfortunately, NLP is very, very hard
- Lack of “background knowledge”
  - A girl is a female, a park is an outdoor space, …: how can a computer know this?
- Many ambiguities
  - Was the girl in the park? Or the dog? Or both?
- Ambiguities, by example:
  - POS: “saw” can be a verb or a noun. Which is it?
  - Word sense: “saw” has many meanings. Which one is meant?
  - Parsing: what does “in the park” modify?
  - Semantics: exactly which park is “the park”?

NLP: Current State of the Art
- POS tagging: ~97% accuracy
- Syntactic parsing: ~90%
- Semantic analysis: partial “success” in very limited domains
  - Named-entity recognition, entity-relation extraction, sentiment analysis, … (~90%)
- Inference: still at a very early stage
  - How do we represent “knowledge” and “inference rules”?
  - Ontology mismatch problem: where do we obtain an ontology?
- NLP is still too immature to fully understand the meaning of a sentence
- For search, only shallow NLP techniques are used as of now (if any)

Need for an “IR Model”
- Q: What can we do? Is there any way to solve the IR problem?
- Idea: formulate the IR problem as a simpler computational problem, based on a simplifying assumption

Simplifying Assumption: Bag of Words
- Consider each document as a “bag of words”
  - “Bag” vs. “set”: ignore word ordering, but keep word counts
- Consider queries as bags of words too
- A great oversimplification, but it works well enough in many cases
  - “John loves only Jane” vs. “Only John loves Jane”: the distinction is lost
  - This limitation still shows up in current search engines
- Still, how do we match documents and queries? (A sketch of the representation follows.)
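In Python, a bag of words is exactly what collections.Counter provides; a minimal sketch:

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Lowercase, split on whitespace, and count; word order is discarded."""
    return Counter(text.lower().split())

# The two sentences from the slide collapse to the same bag:
b1 = bag_of_words("John loves only Jane")
b2 = bag_of_words("Only John loves Jane")
print(b1 == b2)  # True: the ordering distinction is lost
```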

Simplifying Assumption: Boolean Model
- Return the documents that contain the words in the query
- “Boolean assumption”: a document is either “relevant” or “irrelevant”
  - No notion of “ranking”
- The simplest “model” of information retrieval
- Q: How do we find and return the matching documents? Basic algorithm? Useful data structure?
- Scalability problem: a naive scan takes too long. Any way to speed up the computation? (See the sketch below.)
- Q: What is the key information that we need to compute $R'(q)$?
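The basic algorithm is a linear scan over the collection, which also makes the scalability problem concrete: every query touches every document. A sketch, assuming conjunctive (AND) semantics:

```python
def boolean_retrieve(C: list[list[str]], q: list[str]) -> list[int]:
    """Return ids of documents containing every query word.
    Cost is O(total words in C) per query: too slow at web scale."""
    q_words = set(q)
    return [i for i, d in enumerate(C) if q_words <= set(d)]
```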

Inverted Index
- Allows quick lookup of the ids of documents containing a particular word
- Structure: a lexicon/dictionary $V$, where each word points to its postings list (PL) of document ids
  - Stanford → PL(Stanford): 3, 8, 10, 13, 16, 20
  - UCLA → PL(UCLA): 1, 2, 3, 9, 16, 18
  - MIT → PL(MIT): 4, 5, 8, 10, 13, 19, 20, 22
  - …
- Q: How can we use this to answer “UCLA Physics”?
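A minimal sketch of building such an index and answering a conjunctive query like “UCLA Physics” by intersecting postings lists (document ids come from whatever collection the index is built over):

```python
from collections import defaultdict

def build_index(C: list[list[str]]) -> dict[str, list[int]]:
    """Map each word to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, d in enumerate(C):
        for w in d:
            index[w].add(doc_id)
    return {w: sorted(ids) for w, ids in index.items()}

def query_and(index: dict[str, list[int]], q: list[str]) -> list[int]:
    """Answer a conjunctive query by intersecting postings lists,
    starting from the shortest list to keep intersections cheap."""
    postings = sorted((index.get(w, []) for w in q), key=len)
    if not postings:
        return []
    result = set(postings[0])
    for pl in postings[1:]:
        result &= set(pl)
    return sorted(result)

# e.g. query_and(build_index(documents), ["ucla", "physics"])
```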