Indexing & querying text

September 2016, Pierre-Edouard Portier
http://liris.cnrs.fr/pierre-edouard.portier/teaching_2016_2017/

LA-Times corpus: http://trec.nist.gov/data/docs_eng.html
Daily articles from 1989 and 1990. Let's have a look at the file la010189.

Tokenization
Tokenization is language-dependent (compound words, etc.). For your first iteration, keep it simple:
- use the space character as a delimiter;
- remove punctuation;
- or use a library, e.g. http://www.nltk.org/api/nltk.tokenize.html
If you think of minor hacks (e.g., removing the tags <BYLINE>, …, <P>, … <TYPE>), ask yourself: are they necessary?
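As a starting point, the "space delimiter + remove punctuation" recipe above might be sketched as follows (the function name and the exact character class are my own choices, not part of the assignment):

```python
import re

def tokenize(text):
    """Naive tokenizer: lowercase, replace punctuation by spaces,
    then split on whitespace. Good enough for a first iteration."""
    text = text.lower()
    # Anything that is not a lowercase letter, digit, or whitespace
    # (including tag delimiters < and >) becomes a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return text.split()

print(tokenize("Daily articles, 1989 and 1990."))
# ['daily', 'articles', '1989', 'and', '1990']
```

Note that this also shreds markup such as `<P>` into stray tokens (`p`), which is one reason you might consider the tag-removal hack, or a real tokenizer from NLTK.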

Stemming
https://en.wikipedia.org/wiki/Stemming
"stems", "stemmer", "stemming" → "stem"; "marketing", "markets" → "market". Stemming is language-dependent. For your first iteration, keep it simple: use Porter's algorithm, or don't use stemming at all.
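Porter's algorithm itself is too long to reproduce here; the toy below is NOT Porter's stemmer, only a crude longest-suffix-first stripper that illustrates the idea on the slide's examples (the suffix list and the minimum-stem-length of 3 are arbitrary choices of mine). In practice, use `nltk.stem.PorterStemmer`.

```python
# Crude suffix stripping -- an illustration, not Porter's algorithm.
SUFFIXES = ["ming", "mer", "ing", "s"]  # try longest matches first

def crude_stem(word):
    for suf in SUFFIXES:
        # only strip if at least a 3-letter stem remains
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

print([crude_stem(w) for w in ["stems", "stemmer", "stemming"]])
# ['stem', 'stem', 'stem']
print([crude_stem(w) for w in ["marketing", "markets"]])
# ['market', 'market']
```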

Stop-word removal
https://en.wikipedia.org/wiki/Stop_words
Can increase performance, but may limit phrase search.

Inverted file (IF)
Mapping between a vocabulary of terms (VOC) and posting lists (PL):
VOC: …, <t_i, n_i>, …
PL of t_i: …, <d_j, score(t_i, d_j)>, …
VOC maps each pair <term, size-of-PL> to either the offset of its PL in the PL-FILE, or an index into a memory-mapped table. VOC can be a hash map or a search tree (e.g., a B-tree). The PLs are contiguous parts of the PL-FILE, to minimize seek time on rotating drives.
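A minimal in-memory sketch of the VOC → PL mapping (a real IF would write the PLs contiguously to a PL-FILE on disk and keep only offsets in VOC; here a Python dict plays both roles, and the postings carry raw term counts since tf-idf scores come later):

```python
from collections import defaultdict

def build_inverted_file(docs):
    """docs: dict doc_id -> list of tokens.
    Returns VOC as {term: (size-of-PL, PL)}, where the PL is a list of
    (doc_id, n_{t,d}) pairs sorted by doc id."""
    counts = defaultdict(lambda: defaultdict(int))
    for doc_id, tokens in docs.items():
        for t in tokens:
            counts[t][doc_id] += 1
    voc = {}
    for term, postings in counts.items():
        pl = sorted(postings.items())   # PL sorted by doc id
        voc[term] = (len(pl), pl)       # <term, size-of-PL> -> PL
    return voc

docs = {1: ["to", "be", "or", "not", "to", "be"], 2: ["to", "do"]}
voc = build_inverted_file(docs)
print(voc["to"])  # (2, [(1, 2), (2, 1)])
```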

score(t_i, d_j)

Term frequency
tf(t,d) = n(t,d) / Σ_{t'} n(t',d)
The denominator is a normalization factor used to remove the influence of the document's length.

Inverse document frequency
idf(t) = log( |D| / |{d ∈ D : n(t,d) > 0}| )
It measures the discriminative potential of term t.

idf(t) = log( |D| / |{d ∈ D : n(t,d) > 0}| )
The ratio |{d ∈ D : n(t,d) > 0}| / |D| estimates the probability of term t being present in any document.
Independence: p(AB) = p(A) p(B), i.e., log p(AB) = log p(A) + log p(B).
Additivity: idf(t_1 ∧ t_2) = idf(t_1) + idf(t_2).
Since |{d ∈ D : n(t,d) > 0}| / |D| ≤ 1, its log would be negative; hence the inverted ratio.
The log compresses the scale so that large and small quantities can be compared.
The independence assumption matches the Vector Space Model (VSM): idf values can be considered as dimension weights.

score(t_i, d_j) = tf(t_i, d_j) × idf(t_i)
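The tf, idf, and score formulas translate directly to code. In this sketch `counts[d]` holds the raw counts n_{t,d} for document d (the data layout and function names are my own):

```python
import math

def tf(t, d, counts):
    """tf(t,d) = n(t,d) / sum over t' of n(t',d)."""
    total = sum(counts[d].values())
    return counts[d].get(t, 0) / total

def idf(t, counts):
    """idf(t) = log(|D| / |{d : n(t,d) > 0}|)."""
    D = len(counts)
    df = sum(1 for d in counts if counts[d].get(t, 0) > 0)
    return math.log(D / df)

def score(t, d, counts):
    return tf(t, d, counts) * idf(t, counts)

counts = {1: {"to": 2, "be": 2, "or": 1, "not": 1},
          2: {"to": 1, "do": 1}}
print(score("be", 1, counts))  # (2/6) * log(2/1)
print(score("to", 1, counts))  # 0.0 -- "to" occurs in every document
```

Note how a term that appears in every document ("to") gets idf = log(1) = 0: it has no discriminative potential.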

Aggregation function
Use a monotone score-aggregation function (e.g., sum). Cosine similarity reduces to a sum when vector lengths are pre-normalized to 1. For your first iteration, you can focus on sum as the aggregation function and consider only conjunctive (AND) and disjunctive (OR) queries.

Ranked queries, a naïve algorithm
The PLs are ordered by doc id. A parallel scan of the PL(t_i) finds each doc id d_j for which at least one (OR-query) or all (AND-query) of the query terms are present, and computes
score(d_j) = Σ_{i=1}^{n} score(t_i, d_j).
At the end of the parallel scan, the d_j are sorted by their scores. For an AND-query the search is linear in the size of the smallest PL; an OR-query requires a full scan of all the PLs.
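A sketch of the naïve evaluation, using the {term: (size, PL)} layout from the inverted-file slide. For brevity it accumulates scores in a dict rather than doing a true cursor-based parallel scan over the doc-id-sorted PLs (that cursor version is what you would implement against on-disk PLs):

```python
def ranked_query(query_terms, voc, mode="OR"):
    """Naive ranked retrieval. voc: {term: (pl_size, [(doc_id, score), ...])}.
    Returns (doc_id, aggregated score) pairs, best first."""
    pls = [dict(voc[t][1]) for t in query_terms if t in voc]
    if mode == "AND" and len(pls) < len(query_terms):
        return []  # some query term matches no document
    acc = {}
    for pl in pls:  # sum is the monotone aggregation function
        for doc_id, s in pl.items():
            acc[doc_id] = acc.get(doc_id, 0.0) + s
    if mode == "AND":
        acc = {d: s for d, s in acc.items() if all(d in pl for pl in pls)}
    return sorted(acc.items(), key=lambda x: -x[1])

voc = {"a": (2, [(1, 0.5), (2, 0.1)]), "b": (1, [(1, 0.3)])}
print(ranked_query(["a", "b"], voc, "OR"))   # doc 1 then doc 2
print(ranked_query(["a", "b"], voc, "AND"))  # doc 1 only
```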

Fagin's Threshold Algorithm (TA)
"'Simple' threshold algorithm earns Gödel Prize": Optimal Aggregation Algorithms for Middleware (PODS, 2001). TA is an optimal way to retrieve the top-k results for a monotone score-aggregation function.

0. R ← ∅; τ ← +∞
1. DO 0 ≤ i < n →
     d_i ← doc from PL(t_i) (sorted by weights) with the next best score(t_i, d_i);
     score(d_i) ← score(t_i, d_i) + Σ_{j≠i} score(t_j, d_i)
       (NB: score(t_j, d_i) can be obtained from a binary search on PL(t_j) sorted by doc ids);
     IF |R| < k → R ← R ∪ {d_i}
     | |R| ≥ k → dm ← argmin_{d'∈R} score(d');
                 IF score(d_i) > score(dm) → R ← (R \ {dm}) ∪ {d_i} FI
     FI
   OD
2. τ ← Σ_{i=0}^{n−1} score(t_i, d_i)
3. IF |R| < k → GOTO 1 FI
4. IF min_{d'∈R} score(d') < τ → GOTO 1 FI
5. RETURN R
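The steps above can be sketched as follows. This is an illustrative implementation, not the one from the paper: each term has a PL sorted by descending score for the sorted access, and a per-term dict stands in for the binary search on the doc-id-sorted PL for the random access:

```python
def threshold_algorithm(pls_by_score, scores_by_doc, k):
    """Fagin's TA (sketch).
    pls_by_score: one list per query term of (doc_id, score) pairs,
                  sorted by descending score (sorted access).
    scores_by_doc: one dict per query term, doc_id -> score (random access).
    Returns the top-k (doc_id, aggregated score), best first."""
    n = len(pls_by_score)
    R = {}            # current top-k candidates: doc_id -> full score
    pos = [0] * n     # scan depth in each PL
    while True:
        last_seen = [None] * n
        for i in range(n):                      # step 1: one sorted access per PL
            if pos[i] >= len(pls_by_score[i]):
                continue                        # this PL is exhausted
            doc, s = pls_by_score[i][pos[i]]
            pos[i] += 1
            last_seen[i] = s
            if doc not in R:
                # random access to every list for the full aggregated score
                R[doc] = sum(sb.get(doc, 0.0) for sb in scores_by_doc)
        if len(R) > k:                          # keep only the k best candidates
            for d in sorted(R, key=R.get)[: len(R) - k]:
                del R[d]
        # step 2: threshold = aggregate of the scores at the scan frontier
        tau = sum(s for s in last_seen if s is not None)
        exhausted = all(pos[i] >= len(pls_by_score[i]) for i in range(n))
        # steps 3-5: stop when the worst retained score reaches the threshold
        if (len(R) >= k and min(R.values()) >= tau) or exhausted:
            return sorted(R.items(), key=lambda x: -x[1])

pls = [[(1, 0.9), (2, 0.5), (3, 0.1)], [(2, 0.8), (3, 0.7), (1, 0.2)]]
sbd = [dict(pl) for pl in pls]
print(threshold_algorithm(pls, sbd, k=1))  # doc 2, score 0.5 + 0.8
```

No unseen document can beat the threshold τ (by monotonicity of sum), so stopping when min(R) ≥ τ is safe.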

Merge-based algorithm to build the IF
Start building an IF in memory. When memory is full, flush the PLs of this partial IF to disk. When all the documents have been seen, merge the flushed PLs to obtain the final IF.
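The final merge is a classic k-way merge of sorted runs. The sketch below shows it with `heapq.merge` on in-memory lists; in the real setting each "partial" would be a flushed file read sequentially, which is exactly what `heapq.merge` supports since it only needs iterators over sorted (term, doc_id, count) triples:

```python
import heapq

def merge_partial_indexes(partials):
    """partials: sorted runs of (term, doc_id, count) triples,
    one per flushed partial IF. Returns {term: PL} with each PL
    sorted by doc id."""
    merged = {}
    for term, doc_id, count in heapq.merge(*partials):
        merged.setdefault(term, []).append((doc_id, count))
    return merged

run1 = [("be", 1, 2), ("to", 1, 2)]   # flushed after documents 1..1
run2 = [("do", 2, 1), ("to", 2, 1)]   # flushed after documents 2..2
print(merge_partial_indexes([run1, run2])["to"])  # [(1, 2), (2, 1)]
```

Because each run is already sorted, the merge streams through every run once and never needs more than one triple per run in memory.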

PL compression
Advantage: more of the IF can be stored in main memory.
Requirement: the time to read a compressed PL from disk plus the time to decompress it must be less than the time to read the uncompressed PL from disk.
Strategy: sort each PL by doc id and store the deltas between consecutive doc ids; use variable-byte encoding for these deltas.
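The first half of the strategy, delta (gap) encoding, might look like this; it pays off because gaps between consecutive doc ids are much smaller than the ids themselves, so they fit in fewer bytes once variable-byte encoded:

```python
def gaps(doc_ids):
    """Delta-encode a doc-id-sorted PL: first id, then consecutive gaps."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def ungaps(deltas):
    """Invert gaps() by taking a running sum."""
    out, acc = [], 0
    for d in deltas:
        acc += d
        out.append(acc)
    return out

print(gaps([3, 7, 11, 42]))  # [3, 4, 4, 31]
```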

Variable-byte (VByte) encoding
Given an integer v:
IF v < 128: v is encoded on the last 7 bits of a byte b. On the first entry into this branch (i.e., for the terminal, least-significant byte), the first bit of b is set to 1; otherwise it is set to 0. Return b.
ELSE: let v′ and k be such that v = k × 128 + v′; b ← VBYTE(v′); return VBYTE(k) concatenated with b.

VByte decoding
If VBYTE(v) = b_{n−1} … b_1 b_0, then
v = b_{n−1} × 128^{n−1} + … + b_1 × 128 + (b_0 − 128).

VByte example
v = 130
VBYTE(v) = 0000 0001 1000 0010
v = (0000 0001)_2 × 128 + ((1000 0010)_2 − 128) = 128 + 2 = 130
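An iterative equivalent of the recursive definition above (same convention: the terminal, least-significant byte carries the high bit):

```python
def vbyte_encode(v):
    """Encode one non-negative integer; only the last byte has its
    high bit set, continuation bytes have it clear."""
    out = [128 + (v % 128)]   # terminal byte: high bit 1
    v //= 128
    while v > 0:
        out.append(v % 128)   # continuation bytes: high bit 0
        v //= 128
    return bytes(reversed(out))

def vbyte_decode(data):
    """Decode a concatenation of VByte-encoded integers."""
    vals, v = [], 0
    for b in data:
        if b < 128:
            v = v * 128 + b
        else:                 # high bit set: this byte ends a value
            vals.append(v * 128 + (b - 128))
            v = 0
    return vals

print(vbyte_encode(130).hex())  # '0182' = 0000 0001  1000 0010
```

Because the terminal byte is self-delimiting, a whole delta-encoded PL can be stored as one byte stream and decoded in a single pass.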

Zipf's law
https://plus.maths.org/content/mystery-zipf
http://www-personal.umich.edu/~mejn/courses/2006/cmplxsys899/powerlaws.pdf
Do PL lengths follow Zipf's law? What is the best estimate of the parameter a on our dataset (e.g., a least-squares estimate)?
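One way to get the suggested least-squares estimate: if f_r ∝ r^(−a) for rank r, then log f_r = c − a log r, so a is minus the slope of an ordinary least-squares line fit on the log-log plot of frequency against rank (a rough estimator; the second link above discusses its limitations for power laws):

```python
import math

def zipf_exponent(frequencies):
    """Least-squares fit of log f = c - a log r over rank-sorted
    frequencies; returns the estimated exponent a."""
    freqs = sorted(frequencies, reverse=True)       # rank 1 = most frequent
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope

# Sanity check on a perfect Zipf distribution with a = 1: f_r = 1000 / r
print(zipf_exponent([1000 / r for r in range(1, 101)]))  # ≈ 1.0
```

To answer the slide's question, feed it the PL lengths of your inverted file instead of the synthetic data.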