Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.



Menu
- Information Retrieval (IR) basics
- An example
- Evaluation: precision and recall

Information Retrieval (IR)
The field of information retrieval deals with the representation, storage, organization of, and access to information items.
Search engines: search a large collection of documents to find the ones that satisfy an information need, e.g. find relevant documents.
- Indexing
- Document representation
- Comparison with query
- Evaluation/feedback

IR basics
Indexing
- Manual indexing: using controlled vocabularies, e.g. libraries, early versions of Yahoo
- Automatic indexing: an indexing program assigns keywords, phrases or other features, e.g. words from the text of the document
Popular retrieval models
- Boolean: exact match
- Vector space: best match
- Citation analysis models: best match
- Probabilistic models: best match
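Automatic indexing and Boolean exact match can be sketched in a few lines. This is a minimal illustration, not the course's implementation: the three sample documents and the helper names `build_index` and `boolean_and` are assumptions made up for the example, and indexing is reduced to lowercasing and splitting on whitespace.

```python
from collections import defaultdict

# Hypothetical sample collection (not from the slides).
docs = {
    1: "information retrieval with search engines",
    2: "search engines index web documents",
    3: "libraries use controlled vocabularies",
}

def build_index(collection):
    """Automatic indexing: map every word to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in collection.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_and(index, terms):
    """Boolean exact match: a document is returned only if it contains every term."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()

index = build_index(docs)
both = boolean_and(index, ["search", "engines"])      # documents 1 and 2
neither = boolean_and(index, ["search", "libraries"])  # no document has both terms
```

A document either matches or it does not, so the result is an unordered set; the best-match models in the list return a ranking instead.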

Vector space model
Any text object can be represented by a term vector.
Example
- Doc1: 0.3, 0.1, 0.4
- Doc2: 0.8, 0.5, 0.6
- Query: 0.0, 0.2, 0.0
Similarity is determined by distance in a vector space.
Vector space similarity: cosine of the angle between the two vectors
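The cosine measure can be checked on the example vectors above. The sketch below uses only the three vectors from the slide; the function name `cosine_similarity` is an illustrative choice, and returning 0.0 for a zero vector is an assumption the slide does not address.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two term vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)

doc1 = [0.3, 0.1, 0.4]
doc2 = [0.8, 0.5, 0.6]
query = [0.0, 0.2, 0.0]

sim1 = cosine_similarity(doc1, query)  # about 0.196
sim2 = cosine_similarity(doc2, query)  # about 0.447
```

Under this measure Doc2 is closer to the query than Doc1, because a larger share of its weight lies on the one term the query uses.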

Text representation: term weights
- Term weights reflect the estimated importance of each term.
- The more often a word occurs in a document, the better that term is at describing what the document is about.
- But terms that appear in many documents in the collection are not very useful for distinguishing a relevant document from a non-relevant one.

Term weights: TF*IDF
Term weight: W_ij = TF * IDF
- TF: term frequency
- IDF: inverse document frequency
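The slide does not say how TF and IDF are computed. One common formulation, given here as an assumption rather than as the course's definition, takes i as the term index and j as the document index, uses the raw count of term i in document j as TF, and uses the logarithm of the inverse document fraction as IDF:

```latex
W_{ij} = \mathrm{TF}_{ij} \times \mathrm{IDF}_{i},
\qquad
\mathrm{IDF}_{i} = \log \frac{N}{\mathrm{DF}_{i}}
```

Here N is the number of documents in the collection and DF_i is the number of documents that contain term i. A term that occurs in every document gets IDF = log 1 = 0, which matches the earlier point that such terms do not help separate relevant from non-relevant documents.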

An example
- Document 1: cat eat mouse, mouse eat chocolate
- Document 2: cat eat mouse
- Document 3: mouse eat chocolate mouse
- Query: cat
Indexing terms:
Vector representation for each document
Vector representation of query
Which document is more relevant?
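One way to work through this example is sketched below. It assumes raw term counts for TF, log(N / DF) for IDF, and cosine similarity for ranking; the slides do not fix these choices, so the numbers hold only under those assumptions. The comma in Document 1 is dropped, and all variable and function names are illustrative.

```python
import math
from collections import Counter

docs = {
    "Document 1": "cat eat mouse mouse eat chocolate",
    "Document 2": "cat eat mouse",
    "Document 3": "mouse eat chocolate mouse",
}
query = "cat"

# Indexing terms: every distinct word in the collection.
tokenized = {name: text.split() for name, text in docs.items()}
vocab = sorted({term for tokens in tokenized.values() for term in tokens})

# IDF (assumed variant): log(N / DF), DF = number of documents containing the term.
n_docs = len(docs)
df = {term: sum(term in tokens for tokens in tokenized.values()) for term in vocab}
idf = {term: math.log(n_docs / df[term]) for term in vocab}

def tfidf_vector(tokens):
    """Vector over vocab with weight = raw count * IDF."""
    counts = Counter(tokens)
    return [counts[term] * idf[term] for term in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

query_vec = tfidf_vector(query.split())
scores = {name: cosine(tfidf_vector(tokens), query_vec)
          for name, tokens in tokenized.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Under these assumptions "eat" and "mouse" occur in all three documents and get weight 0, so only "cat" and "chocolate" matter. Document 2 scores 1.0, Document 1 about 0.71, and Document 3 scores 0 because it does not contain "cat".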


Evaluation
Data collection
- TREC
Evaluation criteria
- Recall: percentage of all relevant documents that are found by a search
- Precision: percentage of retrieved documents that are relevant
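The two criteria can be computed directly from a retrieved set and a relevant set. In the sketch below the document IDs are made up for illustration, `precision_recall` is a hypothetical helper name, and returning 0.0 when a set is empty is an assumption.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, as fractions between 0 and 1."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually found
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical search result: 4 documents returned, 5 relevant in the collection.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7", "d8", "d9"]

precision, recall = precision_recall(retrieved, relevant)
# precision = 2/4 = 0.5, recall = 2/5 = 0.4
```

Returning more documents can only keep recall the same or raise it, but it usually lowers precision, which is why the two are reported together.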

Discussion
- How to compare two IR systems?
- Which is more important in Web search: precision or recall?

Next lecture
- PageRank paper
- Discussion on Google and other search engines
- Current solution to search