The College of Saint Rose
CSC 460 / CIS 560 – Search and Information Retrieval
David Goldschmidt, Ph.D.
from Search Engines: Information Retrieval in Practice, 1st edition, by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0

• An index is a data structure designed to make search (finding things) fast and efficient
• Text search often requires an inverted index
• “Inverted index” names a class of similar data structures
• Inverted because we associate documents with words: each word points to the documents that contain it (rather than each document listing the words within it)

• Each index term or document feature is obtained during text transformation
• A document feature is some property of the document expressed numerically
• For example, a topical feature estimates the degree to which the document is about a particular topic
• Example quality features include inlink count, number of days since the page was last updated, etc.

• Regardless of the ranking function, the model below provides a roadmap to implementation

• Each index term is associated with an inverted list that may contain:
  ▪ A list of documents
  ▪ A list of word occurrences in documents
  ▪ Word counts
  ▪ Positional information regarding each word
  ▪ Metadata identifying fields (title, author, etc.)
  ▪ etc.

• Each entry in an inverted list is called a posting
• The part of the posting that refers to a specific document or location is called a pointer
• Each document in the collection is given a unique number
• Lists are usually document-ordered
  ▪ Sorted by document number
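
One reason document ordering is useful can be sketched in a few lines: two sorted lists can be intersected in a single pass. This is a minimal illustration, not the textbook's implementation; the `Posting` class, the `intersect` function, and the sample lists are all hypothetical names and data chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Posting:
    """One entry of an inverted list (hypothetical layout)."""
    doc_id: int     # the pointer: which document this posting refers to
    count: int = 1  # optional payload, e.g. a word count


def intersect(a, b):
    """Return doc ids present in both document-ordered posting lists."""
    i = j = 0
    shared = []
    while i < len(a) and j < len(b):
        if a[i].doc_id == b[j].doc_id:
            shared.append(a[i].doc_id)
            i += 1
            j += 1
        elif a[i].doc_id < b[j].doc_id:
            i += 1  # advance whichever list is behind
        else:
            j += 1
    return shared


# Made-up posting lists, each sorted by document number.
tropical = [Posting(1), Posting(3)]
fish = [Posting(1), Posting(2), Posting(3)]
print(intersect(tropical, fish))
```

Because both lists are sorted by document number, each pointer only ever moves forward, so the work is proportional to the combined length of the two lists.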

Assume each sentence is a separate document

• Inverted index for documents S1, S2, S3, and S4
• Deduplicates word occurrences: a document appears at most once in each word's list
• What does this data structure tell us?
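
A simple inverted index of this kind might be built as follows. This is a sketch under stated assumptions: the four short documents are invented stand-ins (the slide's actual sentences are not in this transcript), tokenization is plain whitespace splitting, and `build_index` is a hypothetical helper name.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to a document-ordered list of doc ids, one entry per doc."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # set() drops repeated occurrences of a word within one document
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)


# Invented sample documents, numbered like S1..S4.
docs = {
    1: "tropical fish live in warm water",
    2: "fish are popular pets",
    3: "tropical fish tanks keep fish warm",
    4: "pets need care",
}
index = build_index(docs)
print(index["fish"])
print(index["pets"])
```

This structure answers only "which documents contain this word"; it says nothing about how often or where the word occurs.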

• Inverted index with counts for documents S1, S2, S3, and S4
• What does this data structure tell us?
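
Adding counts changes each posting from a bare document number to a (document, count) pair. The sketch below assumes whitespace tokenization and uses invented sample documents; `build_count_index` is a hypothetical name.

```python
from collections import Counter, defaultdict


def build_count_index(docs):
    """Map each term to a document-ordered list of (doc_id, count) postings."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        counts = Counter(docs[doc_id].lower().split())
        for term, n in counts.items():
            index[term].append((doc_id, n))
    return dict(index)


# Invented sample documents, numbered like S1..S4.
docs = {
    1: "tropical fish live in warm water",
    2: "fish are popular pets",
    3: "tropical fish tanks keep fish warm",
    4: "pets need care",
}
count_index = build_count_index(docs)
print(count_index["fish"])
```

With counts available, a ranking function can prefer a document that mentions a word several times over one that mentions it once.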

• Inverted index with word positions
• What does this data structure tell us?
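
A positional index can be sketched by recording, for every document, the word positions at which each term occurs. Positions are counted from 1 here to match the slides' convention; the sample documents and the name `build_positional_index` are assumptions made for illustration.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to a document-ordered list of (doc_id, [positions])."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        positions = defaultdict(list)
        # enumerate from 1 so the first word of a document is position 1
        for pos, term in enumerate(docs[doc_id].lower().split(), start=1):
            positions[term].append(pos)
        for term, plist in positions.items():
            index[term].append((doc_id, plist))
    return dict(index)


# Invented sample documents, numbered like S1..S4.
docs = {
    1: "tropical fish live in warm water",
    2: "fish are popular pets",
    3: "tropical fish tanks keep fish warm",
    4: "pets need care",
}
pos_index = build_positional_index(docs)
print(pos_index["fish"])
```

The count for a document is still recoverable as the length of its position list, so this structure subsumes the previous two.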

• Proximity matching is a technique used to match multiword phrases
• Proximity matching is also used to match words within a window of size n
• e.g. searching for words within five words of “fish” (n = 5) matches “tropical fish”
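
Both uses of proximity matching can be sketched over positional postings. The code assumes postings shaped as (doc_id, [positions]) pairs and interprets "within a window of size n" as "at most n positions apart"; the function names and the sample postings are hypothetical.

```python
def phrase_match(postings_a, postings_b):
    """Doc ids where term b occurs immediately after term a."""
    b_by_doc = dict(postings_b)
    return [
        doc_id
        for doc_id, a_positions in postings_a
        if any(pa + 1 in b_by_doc.get(doc_id, []) for pa in a_positions)
    ]


def within_window(postings_a, postings_b, n):
    """Doc ids where some occurrences of a and b are at most n positions apart."""
    b_by_doc = dict(postings_b)
    hits = []
    for doc_id, a_positions in postings_a:
        b_positions = b_by_doc.get(doc_id, [])
        if any(abs(pa - pb) <= n for pa in a_positions for pb in b_positions):
            hits.append(doc_id)
    return hits


# Made-up positional postings: (doc_id, [word positions]).
tropical = [(1, [1]), (3, [1, 7])]
fish = [(1, [2]), (2, [1]), (3, [4])]

print(phrase_match(tropical, fish))
print(within_window(tropical, fish, 5))
print(within_window(tropical, fish, 2))
```

The nested comparison of positions is quadratic per document; it keeps the sketch short, while a production system would more likely merge the two sorted position lists.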

• A document field is a section of a document with additional semantic meaning
• e.g. date, from:, to:, etc.
• e.g. title, author, copyright, publisher, ISBN, etc.
• Implementation options:
  ▪ Use separate inverted lists for each field
  ▪ Add extra information about fields to postings
  ▪ Use extent lists...
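
The first option, separate inverted lists for each field, might look like the sketch below. Documents are assumed to arrive as dictionaries from field name to text; the field names, sample data, and `build_field_index` are hypothetical.

```python
from collections import defaultdict


def build_field_index(docs):
    """Build one inverted index per field: field -> term -> [doc ids]."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id in sorted(docs):
        for field, text in docs[doc_id].items():
            for term in set(text.lower().split()):
                index[field][term].append(doc_id)
    return index


# Invented documents with two fields each.
docs = {
    1: {"title": "tropical fish", "body": "fish live in warm water"},
    2: {"title": "pet care", "body": "tropical fish are popular pets"},
}
field_index = build_field_index(docs)
print(field_index["title"]["fish"])
print(field_index["body"]["fish"])
```

A query restricted to titles consults only the title index. The trade-off is that a query over all fields has to look in several lists.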

• An extent is simply a contiguous region of a document (typically with special meaning)
• We can represent extents using word positions
• An inverted list records all extents for a given field; such a list is called an extent list
• Example: “fish” occurs at word position 2 in document S1; the title occurs at word positions 1 and 2
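
Checking whether a word falls inside a field then reduces to comparing word positions against extents. In this sketch an extent is a (doc_id, start, end) triple with an inclusive end position, which is an assumption made to match the "positions 1 and 2" wording above; the sample data and the name `in_extent` are hypothetical.

```python
def in_extent(term_postings, extent_list):
    """Doc ids where the term occurs at a position inside one of the extents."""
    extents_by_doc = {}
    for doc_id, start, end in extent_list:
        extents_by_doc.setdefault(doc_id, []).append((start, end))
    hits = []
    for doc_id, positions in term_postings:
        for start, end in extents_by_doc.get(doc_id, []):
            # end is inclusive in this sketch
            if any(start <= p <= end for p in positions):
                hits.append(doc_id)
                break
    return hits


# Made-up data: positional postings for "fish" and an extent list for titles.
fish = [(1, [2]), (2, [5])]
title_extents = [(1, 1, 2), (2, 1, 3)]
print(in_extent(fish, title_extents))
```

Here “fish” at position 2 of document 1 lies within the title extent covering positions 1 and 2, while its occurrence at position 5 of document 2 lies outside that document's title.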

• Read and study Chapter 5
• Do Exercises 5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7, and 5.8
