Published by Philomena Matthews. Modified over 9 years ago.
1
The College of Saint Rose
CSC 460 / CIS 560 – Search and Information Retrieval
David Goldschmidt, Ph.D.
from Search Engines: Information Retrieval in Practice, 1st edition, by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0
2
An index is a data structure designed to make search (or finding things) fast and efficient.
Text search often requires an inverted index, a name that covers a class of similar data structures.
It is called inverted because we associate documents with words (rather than identifying words within or as part of documents).
3
Each index term or document feature is obtained during text transformation.
A document feature is some feature of the document expressed numerically.
For example, a topical feature estimates the degree to which the document is about a particular topic.
Example quality features include inlink count, number of days since the page was last updated, etc.
4
Regardless of the ranking function, the model below provides a roadmap to implementation
6
Each index term is associated with an inverted list that may contain:
▪ A list of documents
▪ A list of word occurrences in documents
▪ Word counts
▪ Positional information regarding each word
▪ Metadata identifying fields (title, author, etc.)
▪ etc.
7
Each entry in an inverted list is called a posting.
The part of the posting that refers to a specific document or location is called a pointer.
Each document in the collection is given a unique number.
Lists are usually document-ordered:
▪ Sorted by document number
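One common reason for keeping lists document-ordered is that two sorted lists can be intersected in a single linear pass. The sketch below illustrates that idea; the function name `intersect` and the two posting lists are hypothetical and not taken from the slides.

```python
def intersect(a, b):
    """Intersect two document-ordered posting lists in one linear pass."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Same document number in both lists: keep it and advance both.
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a is behind, so advance a
        else:
            j += 1  # b is behind, so advance b
    return out


# Hypothetical posting lists: sorted document numbers for two terms.
tropical = [1, 2, 3]
fish = [1, 2, 4]
print(intersect(tropical, fish))  # [1, 2]
```

Because both lists are sorted by document number, neither pointer ever needs to move backwards.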
8
Assume each sentence is a separate document.
9
Inverted index for documents S1, S2, S3, and S4.
Deduplicates word occurrences.
What does this data structure tell us?
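A minimal sketch of how such a document-only index could be built. The three documents are made up for illustration (they are not the slides' S1 to S4), and tokenizing by lowercasing and splitting on whitespace is a simplifying assumption.

```python
from collections import defaultdict

# Hypothetical documents, keyed by document number.
docs = {
    1: "tropical fish live in warm water",
    2: "fish tanks hold tropical fish",
    3: "aquarium water must stay warm",
}


def build_index(docs):
    """Map each term to a document-ordered list of document numbers."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # set() deduplicates repeated words within one document.
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)


index = build_index(docs)
print(index["fish"])   # [1, 2]
print(index["water"])  # [1, 3]
```

This form of the index only says which documents contain a term, not how often or where.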
10
Inverted index with counts for documents S1, S2, S3, and S4.
What does this data structure tell us?
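A sketch of the same idea with counts, where each posting becomes a (document number, count) pair. The documents and the whitespace tokenizer are assumptions made for illustration only.

```python
from collections import Counter, defaultdict

# Hypothetical documents, keyed by document number.
docs = {
    1: "tropical fish live in warm water",
    2: "fish tanks hold tropical fish",
    3: "aquarium water must stay warm",
}


def build_count_index(docs):
    """Map each term to a document-ordered list of (doc_id, count) postings."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        counts = Counter(docs[doc_id].lower().split())
        for term, count in counts.items():
            index[term].append((doc_id, count))
    return dict(index)


count_index = build_count_index(docs)
print(count_index["fish"])      # [(1, 1), (2, 2)]
print(count_index["tropical"])  # [(1, 1), (2, 1)]
```

The counts let a ranking function distinguish a document that mentions a term once from one that mentions it repeatedly.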
11
Inverted index with word positions.
What does this data structure tell us?
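A sketch of a positional index, where each posting is a (document number, word position) pair. Positions are counted from 1 here; the documents and the whitespace tokenizer are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical documents, keyed by document number.
docs = {
    1: "tropical fish live in warm water",
    2: "fish tanks hold tropical fish",
    3: "aquarium water must stay warm",
}


def build_positional_index(docs):
    """Map each term to a list of (doc_id, position) postings, one per occurrence."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        words = docs[doc_id].lower().split()
        for pos, term in enumerate(words, start=1):
            index[term].append((doc_id, pos))
    return dict(index)


pos_index = build_positional_index(docs)
print(pos_index["fish"])      # [(1, 2), (2, 1), (2, 5)]
print(pos_index["tropical"])  # [(1, 1), (2, 4)]
```

Unlike the deduplicated form, this index keeps one posting per occurrence, so counts can be recovered from it as well.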
12
Proximity matching is a technique used to match multiword phrases.
Proximity matching is also used to match words within a window of size n, e.g. a search for words within five words of “fish” (n = 5) matches “tropical fish”.
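Both kinds of match can be sketched over positional postings. Here each term maps a document number to its sorted word positions; the function names, the sample postings, and the reading of "within n words" as a position difference of at most n are all assumptions made for illustration.

```python
def phrase_match(postings_a, postings_b):
    """Documents where term b occurs immediately after term a."""
    matches = []
    for doc_id in sorted(postings_a.keys() & postings_b.keys()):
        following = set(postings_b[doc_id])
        if any(pa + 1 in following for pa in postings_a[doc_id]):
            matches.append(doc_id)
    return matches


def within_window(postings_a, postings_b, n):
    """Documents where some occurrences of a and b are at most n positions apart."""
    matches = []
    for doc_id in sorted(postings_a.keys() & postings_b.keys()):
        if any(abs(pa - pb) <= n
               for pa in postings_a[doc_id]
               for pb in postings_b[doc_id]):
            matches.append(doc_id)
    return matches


# Hypothetical positional postings: doc_id -> word positions.
tropical = {1: [1], 2: [4], 3: [2]}
fish = {1: [2], 2: [1, 5], 3: [9]}

print(phrase_match(tropical, fish))      # [1, 2]
print(within_window(tropical, fish, 5))  # [1, 2]
print(within_window(tropical, fish, 7))  # [1, 2, 3]
```

The window check compares every pair of positions for clarity; a production index would more likely walk the two sorted position lists in step.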
13
A document field is a section of a document with additional semantic meaning:
▪ e.g. date, from:, to:, etc.
▪ e.g. title, author, copyright, publisher, isbn, etc.
Implementation options:
▪ Use separate inverted lists for each field
▪ Add extra information about fields to postings
▪ Use extent lists....
14
An extent is simply a contiguous region of a document (typically with special meaning).
We can represent extents using word positions.
An inverted list, called an extent list, records all extents for a given field.
Example: “fish” occurs at word position 2 in document S1, and the title occurs at word positions 1 and 2.
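A sketch of how an extent list might be used to restrict a term to a field. Extents are stored as (document number, first position, last position) with an inclusive end, which is an assumption chosen to match the "positions 1 and 2" wording; the function name and all postings are hypothetical.

```python
def in_field(term_postings, extents):
    """Keep only the (doc_id, position) postings that fall inside some extent."""
    hits = []
    for doc_id, pos in term_postings:
        for ext_doc, start, end in extents:
            # Inclusive bounds: an extent (1, 1, 2) covers positions 1 and 2 of doc 1.
            if ext_doc == doc_id and start <= pos <= end:
                hits.append((doc_id, pos))
                break
    return hits


# Hypothetical positional postings for "fish": (doc_id, position).
fish = [(1, 2), (1, 7), (2, 5)]
# Hypothetical extent list for the title field: (doc_id, start, end).
title = [(1, 1, 2), (2, 1, 3)]

print(in_field(fish, title))  # [(1, 2)]
```

Only the occurrence at position 2 of document 1 lies inside a title extent, so only document 1 has "fish" in its title.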
15
Read and study Chapter 5.
Do Exercises 5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7, and 5.8.