Inverted Index Hongning Wang Abstraction of search engine architecture User Ranker Indexer Doc Analyzer Index results Crawler Doc Representation.

Slides:



Advertisements
Similar presentations
Information Retrieval in Practice
Advertisements

Spelling Correction for Search Engine Queries Bruno Martins, Mario J. Silva In Proceedings of EsTAL-04, España for Natural Language Processing Presenter:
Chapter 5: Introduction to Information Retrieval
PrasadL07IndexCompression1 Index Compression Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford) and Christopher Manning.
CpSc 881: Information Retrieval. 2 Why compression? (in general) Use less disk space (saves money) Keep more stuff in memory (increases speed) Increase.
Hinrich Schütze and Christina Lioma Lecture 5: Index Compression
Inverted Index Hongning Wang
Information Retrieval in Practice
Advanced Algorithms Piyush Kumar (Lecture 12: String Matching/Searching) Welcome to COT5405 Source: S. Šaltenis.
Indexing UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze All slides ©Addison Wesley, 2008.
Modern Information Retrieval
1 Query Languages. 2 Boolean Queries Keywords combined with Boolean operators: –OR: (e 1 OR e 2 ) –AND: (e 1 AND e 2 ) –BUT: (e 1 BUT e 2 ) Satisfy e.
Anatomy of a Large-Scale Hypertextual Web Search Engine (e.g. Google)
© nCode 2000 Title of Presentation goes here - go to Master Slide to edit - Slide 1 Anatomy of a Large-Scale Hypertextual Web Search Engine ECE 7995: Term.
Information Retrieval IR 4. Plan This time: Index construction.
Web Search – Summer Term 2006 II. Information Retrieval (Basics) (c) Wolfgang Hürst, Albert-Ludwigs-University.
Modern Information Retrieval Chapter 4 Query Languages.
WMES3103 : INFORMATION RETRIEVAL INDEXING AND SEARCHING.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 5: Index Compression 1.
CS246 Basic Information Retrieval. Today’s Topic  Basic Information Retrieval (IR)  Bag of words assumption  Boolean Model  Inverted index  Vector-space.
Overview of Search Engines
Indexing Debapriyo Majumdar Information Retrieval – Spring 2015 Indian Statistical Institute Kolkata.
Database Systems: Design, Implementation, and Management Eighth Edition Chapter 10 Database Performance Tuning and Query Optimization.
Efficient Minimal Perfect Hash Language Models David Guthrie, Mark Hepple, Wei Liu University of Sheffield.
The Anatomy of a Large-Scale Hypertextual Web Search Engine Presented By: Sibin G. Peter Instructor: Dr. R.M.Verma.
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
Recap Preprocessing to form the term vocabulary Documents Tokenization token and term Normalization Case-folding Lemmatization Stemming Thesauri Stop words.
Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.
Inverted index, Compressing inverted index And Computing score in complete search system Chintan Mistry Mrugank dalal.
Search Engines WS 2009 / 2010 Prof. Dr. Hannah Bast Chair of Algorithms and Data Structures Department of Computer Science University of Freiburg Lecture.
Latent Semantic Analysis Hongning Wang Recap: vector space model Represent both doc and query by concept vectors – Each concept defines one dimension.
Search Result Interface Hongning Wang Abstraction of search engine architecture User Ranker Indexer Doc Analyzer Index results Crawler Doc Representation.
Search engines 2 Øystein Torbjørnsen Fast Search and Transfer.
Chapter 6: Information Retrieval and Web Search
The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin & Lawrence Page Presented by: Siddharth Sriram & Joseph Xavier Department of Electrical.
The Anatomy of a Large-Scale Hypertextual Web Search Engine Kevin Mauricio Apaza Huaranca San Pablo Catholic University.
Search Engine Architecture
Spelling correction. Spell correction Two principal uses Correcting document(s) being indexed Correcting user queries to retrieve “right” answers Two.
Boolean Model Hongning Wang Abstraction of search engine architecture User Ranker Indexer Doc Analyzer Index results Crawler Doc Representation.
Boolean Model Hongning Wang Abstraction of search engine architecture User Ranker Indexer Doc Analyzer Index results Crawler Doc Representation.
Search Result Interface Hongning Wang Abstraction of search engine architecture User Ranker Indexer Doc Analyzer Index results Crawler Doc Representation.
Introduction to Information Retrieval Introduction to Information Retrieval Lecture 4: Index Construction Related to Chapter 4:
Evidence from Content INST 734 Module 2 Doug Oard.
KNN & Naïve Bayes Hongning Wang Today’s lecture Instance-based classifiers – k nearest neighbors – Non-parametric learning algorithm Model-based.
Introduction to Information Retrieval Introduction to Information Retrieval Lecture 4: Index Construction Related to Chapter 4:
Web Information Retrieval Textbook by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze Notes Revised by X. Meng for SEU May 2014.
Introduction to COMP9319: Web Data Compression and Search Search, index construction and compression Slides modified from Hinrich Schütze and Christina.
The Anatomy of a Large-Scale Hypertextual Web Search Engine S. Brin and L. Page, Computer Networks and ISDN Systems, Vol. 30, No. 1-7, pages , April.
Search Engines Information Retrieval in Practice All slides ©Addison Wesley, 2008 Annotations by Michael L. Nelson.
1. 2 Today’s Agenda Search engines: What are the main challenges in building a search engine? Structure of the data index Naïve solutions and their problems.
KNN & Naïve Bayes Hongning Wang
Spelling correction. Spell correction Two principal uses Correcting document(s) being indexed Retrieve matching documents when query contains a spelling.
Information Retrieval in Practice
University of Maryland Baltimore County
Why indexing? For efficient searching of a document
COMP9319: Web Data Compression and Search
Information Retrieval in Practice
Search Engine Architecture
Text Indexing and Search
Indexing UCSB 293S, 2017 Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze All slides ©Addison Wesley,
Information Retrieval in Practice
Modified from Stanford CS276 slides Lecture 4: Index Construction
Inverted Index Hongning Wang
Implementation Issues & IR Systems
CS 430: Information Discovery
Query Languages.
Basic Information Retrieval
Search Engine Architecture
Minwise Hashing and Efficient Search
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Presentation transcript:

Inverted Index Hongning Wang

Abstraction of search engine architecture User Ranker Indexer Doc Analyzer Index results Crawler Doc Representation Query Rep (Query) Evaluation Feedback Information Retrieval2 Indexed corpus Ranking procedure

What we have now Documents have been – Crawled from Web – Tokenized/normalized – Represented as Bag-of-Words Let’s do search! – Query: “information retrieval” informationretrievalretrievedishelpfulforyoueveryone Doc Doc Information Retrieval3

Complexity analysis Information Retrieval4 Solution: linked list for each document

Complexity analysis doclist = [] for (wi in q) { for (d in D) { for (wj in d) { if (wi == wj) { doclist += [d]; break; } return doclist; Bottleneck, since most of them won’t match! Information Retrieval5

Solution: inverted index Build a look-up table for each word in vocabulary – From word to find documents! information retrieval retrieved is helpful for you everyone Doc1Doc2 Doc1 Doc2 Doc1Doc2 Doc1Doc2 Doc1Doc2 Doc1 Query: information retrieval Dictionary Postings Information Retrieval6

Structures for inverted index Dictionary: modest size – Needs fast random access – Stay in memory Hash table, B-tree, trie, … Postings: huge – Sequential access is expected – Stay on disk – Contain docID, term freq, term position, … – Compression is needed “Key data structure underlying modern IR” - Christopher D. Manning Information Retrieval7

Recap: Zipf’s law Remove non-informative words Remove rare words Information Retrieval8 Remove 1s Remove 0s

Statistical property of language A plot of word frequency in Wikipedia (Nov 27, 2006) Word frequency Word rank by frequency Information Retrieval9 discrete version of power law

Recap: Normalization Solution – Rule-based Delete periods and hyphens All in lower case – Dictionary-based Construct equivalent class – Car -> “automobile, vehicle” – Mobile phone -> “cellphone” Stemming: Reduce inflected or derived words to their root form Information Retrieval10

Recap: crawling and document analyzing Doc Analyzer Crawler Doc Representation Information Retrieval11 Indexed corpus 1.Visiting strategy 2.Avoid duplicated visit 3.Re-visit policy 1.HTML parsing 2.Tokenization 3.Stemming/normalization 4.Stopword/controlled vocabulary filter BagOfWord representation!

Recap: inverted index Build a look-up table for each word in vocabulary – From word to find documents! information retrieval retrieved is helpful for you everyone Doc1Doc2 Doc1 Doc2 Doc1Doc2 Doc1Doc2 Doc1Doc2 Doc1 Query: information retrieval Dictionary Postings Information Retrieval12

Sorting-based inverted index construction Term Lexicon: 1 the 2 cold 3 days 4 a... DocID Lexicon: 1 doc1 2 doc2 3 doc3... doc1 doc2 doc …... Sort by docId Parse & Count... …... Sort by termId “Local” sort... …... Merge sort All info about term 1 : Information Retrieval13

Sorting-based inverted index Challenges – Document size exceeds memory limit Key steps – Local sort: sort by termID For later global merge sort – Global merge sort Preserve docID order: for later posting list join Can index large corpus with a single machine! Also suitable for MapReduce! Information Retrieval14

A close look at inverted index information retrieval retrieved is helpful for you everyone Doc1Doc2 Doc1 Doc2 Doc1Doc2 Doc1Doc2 Doc1Doc2 Doc1 Dictionary Postings Approximate search: e.g., misspelled queries, wildcard queries Proximity search: e.g., phrase queries Index compression Dynamic index update Information Retrieval15

Dynamic index update Periodically rebuild the index – Acceptable if change is small over time and penalty of missing new documents is negligible Auxiliary index – Keep index for new documents in memory – Merge to index when size exceeds threshold Increase I/O operation Solution: multiple auxiliary indices on disk, logarithmic merging Information Retrieval16

Index compression Benefits – Save storage space – Increase cache efficiency – Improve disk-memory transfer rate Target – Postings file Information Retrieval17

Basics in coding theory Expected code length – Information Retrieval18 EventP(X)Code a b c d EventP(X)Code a b c d EventP(X)Code a0.750 b c d

Index compression Observation of posting files – Instead of storing docID in posting, we store gap between docIDs, since they are ordered – Zipf’s law again: The more frequent a word is, the smaller the gaps are The less frequent a word is, the shorter the posting list is – Heavily biased distribution gives us great opportunity of compression! Information theory: entropy measures compression difficulty. Information Retrieval19

Index compression Solution – Fewer bits to encode small (high frequency) integers – Variable-length coding Unary: x  1 is coded as x-1 bits of 1 followed by 0, e.g., 3=> 110; 5=>11110  -code: x=> unary code for 1+  log x  followed by uniform code for x-2  log x  in  log x  bits, e.g., 3=>101, 5=>11001  -code: same as  -code,but replace the unary prefix with  -code. E.g., 3=>1001, 5=>10101 Information Retrieval20

Index compression Example Information Retrieval21 Data structureSize (MB) Text collection960.0 dictionary11.2 Postings, uncompressed400.0 Postings  -coded Compression rate: ( )/960 = 11.7% Table 1: Index and dictionary compression for Reuters-RCV1. (Manning et al. Introduction to Information Retrieval)

Search within in inverted index Query processing – Parse query syntax E.g., Barack AND Obama, orange OR apple – Perform the same processing procedures as on documents to the input query Tokenization->normalization->stemming->stopwords removal Information Retrieval22

Search within inverted index Procedures – Lookup query term in the dictionary – Retrieve the posting lists – Operation AND: intersect the posting lists OR: union the posting list NOT: diff the posting list Information Retrieval23

Search within inverted index Example: AND operation Term1 Term scan the postings Trick for speed-up: when performing multi-way join, starts from lowest frequency term to highest frequency ones Information Retrieval24

Phrase query “computer science” – “He uses his computer to study science problems” is not a match! – We need the phase to be exactly matched in documents – N-grams generally does not work for this Large dictionary size, how to break long phrase into N- grams? – We need term positions in documents We can store them in inverted index Information Retrieval25

Phrase query Generalized postings matching – Equality condition check with requirement of position pattern between two query terms e.g., T2.pos-T1.pos = 1 (T1 must be immediately before T2 in any matched document) – Proximity query: |T2.pos-T1.pos| ≤ k Term1 Term2 scan the postings Information Retrieval26

More and more things are put into index Document structure – Title, abstract, body, bullets, anchor Entity annotation – Being part of a person’s name, location’s name Information Retrieval27

Spelling correction Tolerate the misspelled queries – “barck obama” -> “barack obama” Principles – Of various alternative correct spellings of a misspelled query, choose the nearest one – Of various alternative correct spellings of a misspelled query, choose the most common one Information Retrieval28

Spelling correction Proximity between query terms – Edit distance Minimum number of edit operations required to transform one string to another Insert, delete, replace Tricks for speed-up – Fix prefix length (error does not happen on the first letter) – Build character-level inverted index, e.g., for length 3 characters – Consider the layout of a keyboard » E.g., ‘u’ is more likely to be typed as ‘y’ instead of ‘z’ Information Retrieval29

Spelling correction Proximity between query terms – Query context “flew form Heathrow” -> “flew from Heathrow” – Solution Enumerate alternatives for all the query terms Heuristics must be applied to reduce the search space Information Retrieval30

Spelling correction Proximity between query terms – Phonetic similarity “herman” -> “Hermann” – Solution Phonetic hashing – similar-sounding terms hash to the same value Information Retrieval31

What you should know Inverted index for modern information retrieval – Sorting-based index construction – Index compression Search in inverted index – Phrase query – Query spelling correction Information Retrieval32

Today’s reading Introduction to Information Retrieval – Chapter 2: The term vocabulary and postings lists Section 2.3, Faster postings list intersection via skip pointers Section 2.4, Positional postings and phrase queries – Chapter 4: Index construction – Chapter 5: Index compression Section 5.2, Dictionary compression Section 5.3, Postings file compression Information Retrieval33