COMP 423 Intelligent Agents
Xiaoying Gao
Computer Science, Victoria University of Wellington
Menu
- Information Retrieval (IR) basics
- An example
- Evaluation: Precision and Recall
Information Retrieval (IR)
The field of information retrieval deals with the representation, storage, organization of, and access to information items.
Search engines: search a large collection of documents to find the ones that satisfy an information need, e.g. find relevant documents.
- Indexing
- Document representation
- Comparison with query
- Evaluation/feedback
IR basics
Indexing
- Manual indexing: using controlled vocabularies, e.g. libraries, early versions of Yahoo
- Automatic indexing: an indexing program assigns keywords, phrases, or other features, e.g. words from the text of the document
Popular retrieval models
- Boolean: exact match
- Vector space: best match
- Citation analysis models: best match
- Probabilistic models: best match
Vector Space Model
Any text object can be represented by a term vector.
Example:
- Doc1: (0.3, 0.1, 0.4)
- Doc2: (0.8, 0.5, 0.6)
- Query: (0.0, 0.2, 0.0)
Similarity is determined by distance in a vector space.
Vector space similarity: the cosine of the angle between the two vectors.
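As a quick illustration, here is a minimal Python sketch of cosine similarity applied to the example vectors above (the three weights per vector are taken from the slide; the terms they correspond to are not specified):

```python
import math

def cosine(a, b):
    """Cosine of the angle between two term vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

doc1 = [0.3, 0.1, 0.4]
doc2 = [0.8, 0.5, 0.6]
query = [0.0, 0.2, 0.0]

print(cosine(query, doc1))  # ~0.196
print(cosine(query, doc2))  # ~0.447
```

Under this measure the query is closer to Doc2 (~0.447) than to Doc1 (~0.196); because cosine compares angles rather than lengths, it is insensitive to overall document length.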
Text representation
Term weights
- Term weights reflect the estimated importance of each term.
- The more often a word occurs in a document, the better that term describes what the document is about.
- But terms that appear in many documents in the collection are not very useful for distinguishing a relevant document from a non-relevant one.
Term weights: TF*IDF
Term weight: w_ij = tf_ij * idf_i
- tf_ij: term frequency, how often term i occurs in document j
- idf_i: inverse document frequency; a common form is log(N / df_i), where N is the number of documents in the collection and df_i is the number of documents containing term i
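In code, one common formulation looks like the following minimal sketch (raw counts for TF and a log-scaled IDF are assumptions; the slide does not fix the exact variant):

```python
import math

def tf_idf(tf, df, n_docs):
    """TF*IDF weight: term frequency scaled by how rare the term
    is across the collection (inverse document frequency)."""
    idf = math.log(n_docs / df)  # df: number of documents containing the term
    return tf * idf
```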
An example
Document 1: cat eat mouse, mouse eat chocolate
Document 2: cat eat mouse
Document 3: mouse eat chocolate mouse
Query: cat
- Indexing terms?
- Vector representation of each document?
- Vector representation of the query?
- Which document is more relevant? (worked through in the sketch below)
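A minimal Python sketch of the full computation, assuming raw term frequency, idf = log(N/df), and cosine similarity (the slides leave the exact weighting variant open; punctuation is stripped during tokenisation):

```python
import math

docs = {
    "Doc1": "cat eat mouse mouse eat chocolate".split(),
    "Doc2": "cat eat mouse".split(),
    "Doc3": "mouse eat chocolate mouse".split(),
}
query = "cat".split()

terms = sorted({t for d in docs.values() for t in d})  # indexing terms: cat, chocolate, eat, mouse
n = len(docs)
df = {t: sum(t in d for d in docs.values()) for t in terms}
idf = {t: math.log(n / df[t]) for t in terms}

def vector(tokens):
    """TF*IDF vector over the indexing terms."""
    return [tokens.count(t) * idf[t] for t in terms]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

q = vector(query)
for name, d in docs.items():
    print(name, round(cosine(q, vector(d)), 3))
# Doc1 0.707, Doc2 1.0, Doc3 0.0 -> Doc2 ranks highest
```

Doc2 ranks highest: it contains the query term cat, while eat and mouse carry zero weight because they occur in every document (idf = log(3/3) = 0); Doc3 does not mention cat at all.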
Evaluation
Data collection: TREC
Evaluation criteria:
- Recall: percentage of all relevant documents that are found by a search
- Precision: percentage of retrieved documents that are relevant
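For concreteness, a minimal sketch with hypothetical retrieval results (the document IDs and relevance judgements below are invented for illustration):

```python
retrieved = {"d1", "d2", "d3", "d4"}       # what the system returned
relevant = {"d1", "d2", "d5", "d6", "d7"}  # ground-truth relevant documents

hits = retrieved & relevant                 # relevant documents that were found
precision = len(hits) / len(retrieved)      # 2/4 = 0.50
recall = len(hits) / len(relevant)          # 2/5 = 0.40
print(precision, recall)
```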
Discussion
- How to compare two IR systems?
- Which is more important in Web search: precision or recall?
Next lecture
- PageRank paper
- Discussion on Google and other search engines
- Current solutions to search