Presentation is loading. Please wait.

Presentation is loading. Please wait.

The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval David Goldschmidt, Ph.D. from Search Engines: Information Retrieval in Practice,

Similar presentations


Presentation on theme: "The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval David Goldschmidt, Ph.D. from Search Engines: Information Retrieval in Practice,"— Presentation transcript:

1 The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval David Goldschmidt, Ph.D. from Search Engines: Information Retrieval in Practice, 1st edition by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0

2  An index is a data structure that is designed to make search (or finding things) fast and efficient  Text search often requires an inverted index  Represents a class of similar data structures  Inverted because we associate documents with words (rather than identifying words within or as part of documents)

3  Each index term or document feature is obtained during text transformation  A document feature is some feature of the document expressed numerically  For example, a topical feature estimates the degree to which the document is about a particular topic  Example quality features include inlink count, number of days since page was last updated, etc.

4  Regardless of the ranking function, the model below provides a roadmap to implementation

5

6  Each index term is associated with an inverted list that may contain:  A list of documents  A list of word occurrences in documents  Word counts  Positional information regarding each word  Metadata identifying fields (title, author, etc.)  etc.

7  Each entry in an inverted index is called a posting  The part of the posting that refers to a specific document or location is called a pointer  Each document in the collection is given a unique number  Lists are usually document-ordered ▪ Sorted by document number

8 assume each sentence is a separate document

9  Inverted index for documents S 1, S 2, S 3, and S 4  Deduplicates word occurrences  What does this data structure tell us?

10  Inverted index with counts for documents S 1, S 2, S 3, and S 4  What does this data structure tell us?

11 inverted index with word positions what does this data structure tell us?

12  Proximity matching is a technique used to match multiword phrases  Proximity matching also is used to match words within a window of size n  e.g. words within five words of “fish” (n=5) matches “tropical fish”

13  A document field is a section of a document with additional semantic meaning  e.g. date, from:, to:, etc.  e.g. title, author, copyright, publisher, isbn, etc.  Implementation options:  Use separate inverted lists for each field  Add extra information about fields to postings  Use extent lists....

14  An extent is simply a contiguous region of a document (typically with special meaning)  We can represent extents using word positions  Inverted list records all extents for a given field extent list “fish” occurs at word position 2 in document S 1 the title occurs at word positions 1 and 2

15  Read and study Chapter 5  Do Exercises 5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7, and 5.8


Download ppt "The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval David Goldschmidt, Ph.D. from Search Engines: Information Retrieval in Practice,"

Similar presentations


Ads by Google