The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval David Goldschmidt, Ph.D. from Search Engines: Information Retrieval in Practice,

Presentation transcript:

The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval David Goldschmidt, Ph.D. from Search Engines: Information Retrieval in Practice, 1st edition by Croft, Metzler, and Strohman, Pearson, 2010, ISBN

• An index is a data structure designed to make search (finding things) fast and efficient
• Text search often requires an inverted index
  ◦ Represents a class of similar data structures
  ◦ Inverted because we associate documents with words (rather than identifying words within or as part of documents)

• Each index term is associated with an inverted list that may contain:
  ◦ A list of documents
  ◦ A list of word occurrences in documents
  ◦ Word counts
  ◦ Positional information regarding each word
  ◦ Metadata identifying fields (title, author, etc.)
  ◦ etc.

• Each entry in an inverted list is called a posting
• The part of the posting that refers to a specific document or location is called a pointer
  ◦ Each document in the collection is given a unique number
  ◦ Lists are usually document-ordered
    ▪ Sorted by document number
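
The terms above can be made concrete with a minimal sketch. It assumes whitespace tokenization and postings of the form (doc_id, count, positions); the sample documents and the function name build_index are hypothetical, not taken from the textbook.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to a document-ordered inverted list of postings.

    Each posting is (doc_id, count, positions); doc_id plays the role of the pointer.
    """
    index = defaultdict(list)
    for doc_id in sorted(docs):            # visiting ids in order keeps lists document-ordered
        positions = defaultdict(list)
        for pos, term in enumerate(docs[doc_id].lower().split()):
            positions[term].append(pos)
        for term, where in positions.items():
            index[term].append((doc_id, len(where), where))
    return dict(index)

# Hypothetical sample collection; each document gets a unique number.
docs = {
    1: "tropical fish include fish found in tropical environments",
    2: "fish live in water",
    3: "tropical water is warm",
}
index = build_index(docs)
print(index["fish"])   # [(1, 2, [1, 3]), (2, 1, [0])]
```

Storing positions costs space but supports phrase and proximity queries; an index that only needs counts can drop the third field.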

• Inverted index with counts for documents S1, S2, S3, and S4
• What does this data structure tell us?

• How?
• Limitations of scale?
• How can we parallelize this?

• To handle larger indexes:
  ◦ Build the inverted list structure until we run out of memory
  ◦ Write the partial index to disk; repeat
  ◦ At the end of this process, we have many partial indexes, which must be merged

• Partial indexes must be designed so they can be merged in small pieces
  ◦ Store tokens/words in alphabetical order
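
The build-flush-merge cycle can be sketched as follows. This is an illustration under stated assumptions, not the book's algorithm: a posting count (max_postings) stands in for "out of memory", partial indexes are stored as JSON lines, and all function names are hypothetical. Because every partial index is written in sorted term order, the merge only ever holds one term's postings in memory.

```python
import heapq
import json
import os
import tempfile
from collections import defaultdict
from itertools import groupby

def write_partial(index, directory, n):
    """Write one partial index to disk, one term per line, terms in sorted order."""
    path = os.path.join(directory, f"partial_{n}.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for term in sorted(index):
            f.write(json.dumps([term, index[term]]) + "\n")
    return path

def build_partials(docs, directory, max_postings):
    """Index docs in id order, flushing to disk whenever the in-memory index is 'full'."""
    paths, index, size = [], defaultdict(list), 0
    for doc_id in sorted(docs):
        counts = defaultdict(int)
        for term in docs[doc_id].lower().split():
            counts[term] += 1
        for term, count in counts.items():
            index[term].append([doc_id, count])
            size += 1
        if size >= max_postings:           # stand-in for "we ran out of memory"
            paths.append(write_partial(index, directory, len(paths)))
            index, size = defaultdict(list), 0
    if index:                              # flush whatever is left at the end
        paths.append(write_partial(index, directory, len(paths)))
    return paths

def read_partial(path):
    """Stream (term, postings) pairs from one partial index file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            term, postings = json.loads(line)
            yield term, postings

def merge_partials(paths):
    """Stream-merge the sorted partial indexes, one term at a time."""
    streams = [read_partial(p) for p in paths]
    # Order by term, then by first doc id, so merged lists stay document-ordered.
    by_term_then_doc = lambda entry: (entry[0], entry[1][0][0])
    merged = heapq.merge(*streams, key=by_term_then_doc)
    for term, group in groupby(merged, key=lambda entry: entry[0]):
        yield term, [p for _, plist in group for p in plist]

# Hypothetical sample collection.
docs = {1: "fish live in water", 2: "tropical fish", 3: "water is warm", 4: "tropical water"}
with tempfile.TemporaryDirectory() as tmp:
    paths = build_partials(docs, tmp, max_postings=5)
    final = dict(merge_partials(paths))
print(len(paths), final["water"])   # 2 [[1, 1], [3, 1], [4, 1]]
```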

• Use the merging strategy to parallelize:
  ◦ Multiple machines build partial indexes
  ◦ A single machine collects and merges all partial indexes to produce a final index
• Parallelization and distributed computing are required due to the scale of information
  ◦ Not just for search
  ◦ Also for analytics and data mining
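
A small sketch of the same division of labor, with worker threads standing in for separate machines (an assumption made so the example runs on one computer). The function names and the sharding scheme are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def build_partial(shard):
    """Work done by one 'machine': index its own shard of (doc_id, text) pairs."""
    index = defaultdict(list)
    for doc_id, text in sorted(shard):
        for term in set(text.lower().split()):
            index[term].append(doc_id)
    return index

def merge(partials):
    """Work done by the single collecting machine: combine all partial indexes."""
    final = defaultdict(list)
    for partial in partials:
        for term, doc_ids in partial.items():
            final[term].extend(doc_ids)
    # Sort terms and doc ids so the final index is document-ordered.
    return {term: sorted(ids) for term, ids in sorted(final.items())}

# Hypothetical sample collection, split into two shards (one per worker).
docs = [(1, "fish live in water"), (2, "tropical fish"),
        (3, "water is warm"), (4, "tropical water")]
shards = [docs[0::2], docs[1::2]]

with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(build_partial, shards))

final = merge(partials)
print(final["water"])   # [1, 3, 4]
```

In a real deployment the merge step itself can become the bottleneck, which is one reason large systems often keep the index sharded rather than producing a single final index.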

• First normalize the user query using the same normalization rules applied during text transformation
  ◦ Convert to lowercase (downcase)
  ◦ Remove extraneous characters
  ◦ Perform stemming
  ◦ etc.
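
A minimal sketch of query normalization. The suffix rules in crude_stem are a toy assumption for illustration, not a real stemmer such as Porter's; the point is that the same normalize function must be applied to documents at indexing time and to queries at search time.

```python
import re

SUFFIXES = ("ing", "es", "s")   # hypothetical toy rules, not a real stemming algorithm

def crude_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def normalize(text):
    """Lowercase, replace non-alphanumeric characters with spaces, then stem each token."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return [crude_stem(tok) for tok in text.split()]

print(normalize("Tropical Fishes, fishing!"))   # ['tropical', 'fish', 'fish']
```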

• Document-at-a-time query processing:
  ◦ Calculate complete scores for documents by processing all relevant term lists, one document at a time
• Term-at-a-time query processing:
  ◦ Accumulate scores for documents by processing term lists in their entirety, one term list at a time
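
The two strategies can be compared side by side in a short sketch. It assumes a toy scoring function (the sum of term counts) and a hypothetical in-memory index; both functions produce the same scores and differ only in the order in which postings are read.

```python
from collections import defaultdict

# Hypothetical toy index: term -> document-ordered list of (doc_id, count).
INDEX = {
    "tropical": [(1, 2), (3, 1)],
    "fish": [(1, 2), (2, 1), (4, 3)],
}

def term_at_a_time(query_terms, index):
    """Process each inverted list in full, accumulating partial scores per document."""
    accumulators = defaultdict(int)
    for term in query_terms:
        for doc_id, count in index.get(term, []):
            accumulators[doc_id] += count
    return dict(accumulators)

def document_at_a_time(query_terms, index):
    """Advance all lists together, finishing one document's score before the next."""
    lists = [index.get(term, []) for term in query_terms]
    cursors = [0] * len(lists)
    scores = {}
    while True:
        # The next document is the smallest doc_id under any cursor.
        candidates = [lst[c][0] for lst, c in zip(lists, cursors) if c < len(lst)]
        if not candidates:
            break
        doc_id = min(candidates)
        score = 0
        for i, lst in enumerate(lists):
            if cursors[i] < len(lst) and lst[cursors[i]][0] == doc_id:
                score += lst[cursors[i]][1]
                cursors[i] += 1            # this list is done with doc_id
        scores[doc_id] = score             # score is complete at this point
    return scores

query = ["tropical", "fish"]
print(term_at_a_time(query, INDEX))
print(document_at_a_time(query, INDEX))    # {1: 4, 2: 1, 3: 1, 4: 3}
```

Term-at-a-time needs an accumulator for every candidate document, while document-at-a-time needs only one running score but must keep a cursor open on every list.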

• Read less data from the inverted lists
  ◦ A multi-keyword search requires that all query terms appear in each result document
  ◦ Use skipping and skip pointers to speed up multi-keyword searches
[Figure: inverted list for a term, annotated with skip pointers]
• Goal: skip those documents that do not contain the other query term(s)
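
A rough sketch of intersecting two inverted lists with skipping. It assumes sorted lists of unique doc ids, with a skip pointer at every position that is a multiple of the skip distance; the distance (square root of the list length) is a common heuristic chosen here for illustration, not something specified on the slide.

```python
import math

def intersect_with_skips(short, long):
    """Intersect two sorted doc-id lists; returns (matches, entries examined in long)."""
    skip = max(1, int(math.sqrt(len(long))))   # assumed skip distance
    result, j, examined = [], 0, 0
    for doc_id in short:
        while j < len(long) and long[j] < doc_id:
            # Follow the skip pointer only if its target is not past doc_id.
            if j % skip == 0 and j + skip < len(long) and long[j + skip] <= doc_id:
                j += skip
            else:
                j += 1
            examined += 1
        if j < len(long) and long[j] == doc_id:
            result.append(doc_id)
    return result, examined

rare = [50, 90]                    # hypothetical list for a rare term
common = list(range(1, 101))       # hypothetical list for a common term
matches, examined = intersect_with_skips(rare, common)
print(matches, examined < len(common))   # [50, 90] True
```

Skipping pays off most when one list is much shorter than the other, since most of the longer list can then be jumped over without being read.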

• Calculate scores for fewer documents
  ◦ Apply conjunctive processing, in which every document must contain all query terms
  ◦ Works best when one query term occurs much less frequently than the others
  ◦ Modify document-at-a-time and term-at-a-time algorithms to remove documents that do not contain all query terms
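
One way to sketch conjunctive processing in a term-at-a-time style: start from the rarest term's list and discard any document missing from a later list. The toy index and the scoring function (sum of term counts) are assumptions for illustration.

```python
# Hypothetical toy index: term -> document-ordered list of (doc_id, count).
INDEX = {
    "tropical": [(1, 2), (3, 1)],
    "fish": [(1, 2), (2, 1), (3, 1), (4, 3)],
}

def conjunctive_query(query_terms, index):
    """Return {doc_id: score} for documents that contain every query term."""
    lists = [index.get(term, []) for term in query_terms]
    if not lists or any(not lst for lst in lists):
        return {}                          # a missing term means no document can match
    lists.sort(key=len)                    # start with the rarest term
    scores = dict(lists[0])                # only these documents can still qualify
    for postings in lists[1:]:
        counts = dict(postings)
        # Keep a document only if it also appears in this term's list.
        scores = {d: s + counts[d] for d, s in scores.items() if d in counts}
    return scores

print(conjunctive_query(["fish", "tropical"], INDEX))   # {1: 4, 3: 2}
```

Starting with the rarest term keeps the candidate set small from the first step, which is why the approach works best when one term is much less frequent than the others.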

• Read and study Chapter 5
  ◦ (skim §5.4)
• Do Exercises 5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7, and 5.8