1 Xiaoying Gao, Computer Science, Victoria University of Wellington. COMP307 NLP 4: Information Retrieval

2 Menu
- Information Retrieval (IR) basics
- An example
- Evaluation: precision and recall

3 Information Retrieval (IR)
The field of information retrieval deals with the representation, storage, organization of, and access to information items.
Search engines: search a large collection of documents to find the ones that satisfy an information need, e.g. find relevant documents.
- Indexing
- Document representation
- Comparison with query
- Evaluation/feedback

4 IR basics
Indexing
- Manual indexing: using controlled vocabularies, e.g. libraries, early versions of Yahoo
- Automatic indexing: an indexing program assigns keywords, phrases or other features, e.g. words from the text of the document
Popular retrieval models
- Boolean: exact match
- Vector space: best match
- Citation analysis models: best match
- Probabilistic models: best match

5 Vector Space Model
Any text object can be represented by a term vector. Similarity is determined by distance in a vector space.
Example
- Doc1: 0.3, 0.1, 0.4
- Doc2: 0.8, 0.5, 0.6
- Query: 0.0, 0.2, 0.0
Vector space similarity: cosine of the angle between the two vectors
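
A minimal sketch in Python of the cosine similarity named on this slide, applied to the example vectors above (the function name cosine is my own):

    import math

    def cosine(a, b):
        # Cosine of the angle between two term vectors: dot(a, b) / (|a| * |b|)
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return dot / (norm_a * norm_b)

    doc1 = [0.3, 0.1, 0.4]
    doc2 = [0.8, 0.5, 0.6]
    query = [0.0, 0.2, 0.0]

    print(cosine(doc1, query))  # ~0.20
    print(cosine(doc2, query))  # ~0.45

Here Doc2 is the closer match, since its vector makes a smaller angle with the query vector.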

6 Text representation: term weights
Term weights reflect the estimated importance of each term.
- The more often a word occurs in a document, the better that term is at describing what the document is about.
- But terms that appear in many documents in the collection are not very useful for distinguishing a relevant document from a non-relevant one.

7 Term weights: TF.IDF
Term weight x_i = TF * IDF
- TF: term frequency
- IDF: inverse document frequency
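
A minimal sketch of one common TF.IDF variant, assuming raw term counts for TF and log(N / df) for IDF (the slide does not fix the exact formula, so this choice is an assumption):

    import math

    def tf_idf(term, doc, docs):
        # TF: number of times the term occurs in this document (a list of tokens)
        tf = doc.count(term)
        # IDF: log of (number of documents / number of documents containing the term)
        df = sum(1 for d in docs if term in d)
        idf = math.log(len(docs) / df) if df else 0.0
        return tf * idf

A high weight therefore needs both a high count in the document and a low document frequency in the collection, matching the two points on the previous slide.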

8 An example
- Document 1: cat eat mouse, mouse eat chocolate
- Document 2: cat eat mouse
- Document 3: mouse eat chocolate mouse
Indexing terms:
Vector representation for each document
Query: cat
Vector representation of the query
Which document is more relevant?

9 Example
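
A minimal sketch of how the example on slide 8 could be worked through, using raw term counts as weights and the cosine similarity from slide 5 (the choice of raw counts rather than TF.IDF weights is an assumption):

    import math

    # Documents and query from slide 8, tokenised by hand.
    docs = [
        ["cat", "eat", "mouse", "mouse", "eat", "chocolate"],  # Document 1
        ["cat", "eat", "mouse"],                               # Document 2
        ["mouse", "eat", "chocolate", "mouse"],                # Document 3
    ]
    query = ["cat"]

    # Indexing terms: the vocabulary of the collection.
    terms = sorted({t for d in docs for t in d})  # ['cat', 'chocolate', 'eat', 'mouse']

    def to_vector(tokens):
        # Term-count vector over the indexing terms.
        return [tokens.count(t) for t in terms]

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

    qv = to_vector(query)
    for i, d in enumerate(docs, 1):
        print(f"Document {i}: {cosine(to_vector(d), qv):.3f}")

With these counts, Document 2 scores highest (about 0.58), Document 1 about 0.32, and Document 3 zero, so Document 2 is the most relevant to the query "cat".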

10 Evaluation
Test collection: TREC
Parameters
- Recall: percentage of all relevant documents that are found by a search
- Precision: percentage of retrieved documents that are relevant
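
A minimal sketch of the two measures over sets of document IDs (the names retrieved and relevant are my own, not from the slides):

    def precision_recall(retrieved, relevant):
        # retrieved: set of document IDs returned by the search
        # relevant:  set of document IDs judged relevant in the test collection
        hits = retrieved & relevant
        precision = len(hits) / len(retrieved) if retrieved else 0.0
        recall = len(hits) / len(relevant) if relevant else 0.0
        return precision, recall

    # Example: 3 of the 4 retrieved documents are relevant,
    # out of 6 relevant documents in the whole collection.
    print(precision_recall({1, 2, 3, 4}, {2, 3, 4, 5, 6, 7}))  # (0.75, 0.5)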

11 Discussion
- How can we compare two IR systems?
- Which is more important in Web search: precision or recall?
- What leads to Google's success in IR?

