
The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval David Goldschmidt, Ph.D. from Search Engines: Information Retrieval in Practice,


1 The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval David Goldschmidt, Ph.D. from Search Engines: Information Retrieval in Practice, 1st edition by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0

2
• An index is a data structure that is designed to make search (or finding things) fast and efficient
• Text search often requires an inverted index
• Represents a class of similar data structures
• Inverted because we associate documents with words (rather than identifying words within or as part of documents)

3
• Each index term is associated with an inverted list that may contain:
▪ A list of documents
▪ A list of word occurrences in documents
▪ Word counts
▪ Positional information regarding each word
▪ Metadata identifying fields (title, author, etc.)
▪ etc.
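As an illustration of an inverted list that carries positional information, here is a minimal sketch. The function name `build_positional_index`, the whitespace tokenizer, and the toy documents are assumptions made for this example, not part of the slides.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each term to {doc_id: [positions]}; positions are 1-based word offsets."""
    index = defaultdict(dict)
    for doc_id in sorted(docs):
        tokens = docs[doc_id].lower().split()  # assumed tokenizer: split on whitespace
        for position, term in enumerate(tokens, start=1):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)

# Hypothetical toy collection.
docs = {1: "fish like fish food", 2: "food"}
index = build_positional_index(docs)
print(index["fish"])  # {1: [1, 3]}
print(index["food"])  # {1: [4], 2: [1]}
```

The word count for a term in a document is simply the length of its position list, so this one structure also covers the "word counts" case.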

4
• Each entry in an inverted index is called a posting
• The part of the posting that refers to a specific document or location is called a pointer
• Each document in the collection is given a unique number
• Lists are usually document-ordered
▪ Sorted by document number

5
• Inverted index with counts for documents S1, S2, S3, and S4
• What does this data structure tell us?
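The figure for this slide is not in the transcript, so the following is only a sketch of what an inverted index with counts might look like in code. The documents below are hypothetical stand-ins (not the S1 to S4 sentences), and `build_index` is a name chosen for this example.

```python
from collections import Counter, defaultdict

def build_index(docs):
    """Map each term to a document-ordered list of (doc_id, count) postings."""
    index = defaultdict(list)
    for doc_id in sorted(docs):  # visiting ids in order keeps every list document-ordered
        counts = Counter(docs[doc_id].lower().split())
        for term, count in counts.items():
            index[term].append((doc_id, count))
    return dict(index)

# Hypothetical toy collection.
docs = {
    1: "tropical fish are popular aquarium fish",
    2: "fish live in water",
    3: "tropical water is warm",
}
index = build_index(docs)
print(index["fish"])      # [(1, 2), (2, 1)]
print(index["tropical"])  # [(1, 1), (3, 1)]
```

Each posting tells us which document contains the term and how often, which is enough to rank documents by a simple term-frequency score.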

6
• How?
• Limitations of scale?
• How can we parallelize this?

7
• To handle larger indexes:
▪ Build the inverted list structure until we run out of memory
▪ Write the partial index to disk; repeat
▪ At the end of this process, we have many partial indexes, which must be merged
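The build-and-flush loop above can be sketched as follows. This is a simplified illustration under stated assumptions: a posting budget (`max_postings`) stands in for "running out of memory", partial indexes are written as JSON files, and all names are hypothetical.

```python
import json
import os
import tempfile
from collections import defaultdict

def build_partial_indexes(docs, max_postings, out_dir):
    """Index (doc_id, text) pairs in memory; flush a partial index when the budget is hit."""
    paths = []
    index = defaultdict(list)
    n_postings = 0

    def flush():
        nonlocal index, n_postings
        if not index:
            return
        path = os.path.join(out_dir, f"partial_{len(paths)}.json")
        with open(path, "w") as f:
            # Terms are written in sorted order so partial indexes can be merged later.
            json.dump(sorted(index.items()), f)
        paths.append(path)
        index = defaultdict(list)
        n_postings = 0

    for doc_id, text in docs:
        for term in set(text.lower().split()):
            index[term].append(doc_id)
            n_postings += 1
        if n_postings >= max_postings:  # assumed stand-in for "out of memory"
            flush()
    flush()  # write whatever is left
    return paths

with tempfile.TemporaryDirectory() as tmp:
    docs = [(1, "a b"), (2, "b c"), (3, "c d")]
    paths = build_partial_indexes(docs, max_postings=3, out_dir=tmp)
    print(len(paths))  # 2
```

With a budget of three postings, the first two documents fill one partial index and the third ends up in a second one.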

8
• Partial indexes must be designed so they can be merged in small pieces
• Store tokens/words in alphabetical order
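Because each partial index is stored in alphabetical order, the merge can read a little from each one at a time. A minimal sketch, assuming each partial index is an iterable of `(term, doc_ids)` pairs already sorted by term (the representation and the name `merge_partial_indexes` are assumptions for this example):

```python
import heapq
from itertools import groupby
from operator import itemgetter

def merge_partial_indexes(partials):
    """Merge term-sorted partial indexes, yielding one (term, postings) pair at a time."""
    # heapq.merge only looks at the current head of each input, so the full
    # partial indexes never need to be in memory together.
    merged = heapq.merge(*partials, key=itemgetter(0))
    for term, group in groupby(merged, key=itemgetter(0)):
        postings = []
        for _, doc_ids in group:
            postings.extend(doc_ids)
        yield term, sorted(postings)

p1 = [("apple", [1]), ("fish", [1, 2])]
p2 = [("fish", [3]), ("water", [4])]
print(list(merge_partial_indexes([p1, p2])))
# [('apple', [1]), ('fish', [1, 2, 3]), ('water', [4])]
```

The inputs here are in-memory lists for brevity; in practice they could be generators that stream entries from files on disk.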

9
• Use the merging strategy to parallelize:
▪ Multiple machines build partial indexes
▪ A single machine collects and merges all partial indexes to produce a final index
• Parallelization and distributed computing are required due to the scale of information
▪ Not just for search
▪ Also for analytics and data mining
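A single-process sketch of this pattern, where worker threads stand in for the multiple machines and the calling thread plays the merging machine. The shard layout and function names are assumptions for illustration; a real deployment would distribute shards across hosts.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def index_shard(shard):
    """Worker: build a partial index (term -> sorted doc ids) for one shard."""
    partial = defaultdict(set)
    for doc_id, text in shard:
        for term in text.lower().split():
            partial[term].add(doc_id)
    return {term: sorted(ids) for term, ids in partial.items()}

def build_parallel(shards):
    """Coordinator: index shards concurrently, then merge the partial indexes."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(index_shard, shards))
    final = defaultdict(list)
    for partial in partials:
        for term, ids in partial.items():
            final[term].extend(ids)
    return {term: sorted(ids) for term, ids in sorted(final.items())}

# Hypothetical shards, each a list of (doc_id, text) pairs.
shards = [[(1, "fish water")], [(2, "fish tank")]]
print(build_parallel(shards))
# {'fish': [1, 2], 'tank': [2], 'water': [1]}
```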

10
• First normalize the user query using the same normalization rules applied during text transformation
▪ Convert to lowercase (downcase)
▪ Remove extraneous characters
▪ Perform stemming
▪ etc.
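The normalization steps above might look like the following sketch. The suffix stripper `naive_stem` is a deliberately crude, hypothetical stand-in for a real stemmer (such as Porter's), and the character filter is an assumption. The important point is that the same `normalize` function should be applied to both documents and queries.

```python
import re

def naive_stem(token):
    """Toy suffix stripper; a hypothetical stand-in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def normalize(text):
    """Lowercase, drop non-alphanumeric characters, then stem each token."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # remove extraneous characters
    return [naive_stem(token) for token in text.split()]

print(normalize("Tropical Fishes, swimming!"))
# ['tropical', 'fish', 'swimm']
```

The odd stem "swimm" shows why production systems use a proper stemmer, but it is harmless as long as indexing and querying produce the same form.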

11
• Document-at-a-time query processing:
▪ Calculate complete scores for documents by processing all relevant term lists, one document at a time
• Term-at-a-time query processing:
▪ Accumulate scores for documents by processing term lists in their entirety, one term list at a time
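The two strategies can be contrasted with a small sketch. It assumes an index that maps each term to a document-ordered list of `(doc_id, count)` postings and uses the sum of counts as a placeholder score; both the scoring function and the names are assumptions, not the method from the slides.

```python
from collections import defaultdict

def term_at_a_time(index, query_terms):
    """Process one full term list at a time, accumulating partial scores."""
    accumulators = defaultdict(int)
    for term in query_terms:
        for doc_id, count in index.get(term, []):
            accumulators[doc_id] += count
    return dict(accumulators)

def document_at_a_time(index, query_terms):
    """Advance through all term lists together, finishing one document's score at a time."""
    lists = [index.get(term, []) for term in query_terms]
    positions = [0] * len(lists)
    scores = {}
    while True:
        # The next document is the smallest unprocessed doc id across all lists.
        heads = [lst[p][0] for lst, p in zip(lists, positions) if p < len(lst)]
        if not heads:
            break
        doc_id = min(heads)
        score = 0
        for i, lst in enumerate(lists):
            p = positions[i]
            if p < len(lst) and lst[p][0] == doc_id:
                score += lst[p][1]
                positions[i] += 1
        scores[doc_id] = score  # complete score; no accumulator needed afterwards
    return scores

index = {"tropical": [(1, 1), (3, 1)], "fish": [(1, 2), (2, 1)]}
print(document_at_a_time(index, ["tropical", "fish"]))  # {1: 3, 2: 1, 3: 1}
```

Both functions return the same scores; they differ in memory use and access pattern. Term-at-a-time needs an accumulator per candidate document, while document-at-a-time relies on the lists being document-ordered.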


14
• Read less data from the inverted lists
• A multi-keyword search requires that every result contain all query terms
• Use skipping and skip pointers to speed up multi-keyword searches
• Goal: skip those documents that do not contain the other query term(s)
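A minimal sketch of intersecting two document-ordered lists with skip pointers. The skip spacing of roughly the square root of the list length is a common heuristic assumed here, not something stated on the slide, and the lists are assumed to hold strictly increasing document numbers.

```python
import math

def intersect_with_skips(a, b):
    """Intersect two sorted doc-id lists, jumping ahead via evenly spaced skip pointers."""
    skip_a = max(1, int(math.sqrt(len(a))))  # assumed spacing heuristic
    skip_b = max(1, int(math.sqrt(len(b))))
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Take the skip only if its target does not overshoot b[j].
            if i % skip_a == 0 and i + skip_a < len(a) and a[i + skip_a] <= b[j]:
                i += skip_a
            else:
                i += 1
        else:
            if j % skip_b == 0 and j + skip_b < len(b) and b[j + skip_b] <= a[i]:
                j += skip_b
            else:
                j += 1
    return result

print(intersect_with_skips([1, 3, 5, 7, 9, 11, 13, 15, 17], [9, 17]))  # [9, 17]
```

In an on-disk index the benefit is that skipped postings never have to be read or decompressed; this in-memory version only shows the control flow.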

15
• Calculate scores for fewer documents
• Apply conjunctive processing in which every document must contain all query terms
• Works best when one query term occurs much less frequently than the others
• Modify document-at-a-time and term-at-a-time algorithms to remove documents that do not contain all query terms
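One way to see why a rare term helps: start from the shortest posting list and only keep candidates that appear in every other list. This is a simplified sketch assuming an index that maps terms to sorted document ids; `conjunctive_query` is a hypothetical name, and scoring is omitted.

```python
def conjunctive_query(index, query_terms):
    """Return the doc ids that contain every query term."""
    lists = [index.get(term, []) for term in query_terms]
    if not lists or any(not lst for lst in lists):
        return []  # a term with no postings means no document can match
    lists.sort(key=len)  # rarest term first keeps the candidate set small
    candidates = lists[0]
    for lst in lists[1:]:
        members = set(lst)
        candidates = [doc_id for doc_id in candidates if doc_id in members]
        if not candidates:
            break
    return candidates

index = {"tropical": [1, 3], "fish": [1, 2, 3, 5, 8], "tank": [3, 8]}
print(conjunctive_query(index, ["fish", "tropical", "tank"]))  # [3]
```

Only documents surviving this filter need to be scored, so the work is bounded by the length of the shortest list rather than the longest.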


18
• Read and study Chapter 5
▪ (skim §5.4)
• Do Exercises 5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7, and 5.8

