Indices – Tomasz Bartoszewski

Inverted Index: Search, Construction, Compression

Inverted Index In its simplest form, the inverted index of a document collection is a data structure that attaches to each distinct term a list of all documents that contain the term.
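For instance (a toy illustration, not taken from the slides), for two documents d1 = "cold night breeze" and d2 = "the night is cold", the index maps each distinct term to the documents that contain it:

index = {
    "cold":   ["d1", "d2"],
    "night":  ["d1", "d2"],
    "breeze": ["d1"],
    "the":    ["d2"],
    "is":     ["d2"],
}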

Search Using an Inverted Index

Step 1 – vocabulary search: find each query term in the vocabulary. If (single term in query) { goto step 3; } else { goto step 2; }

Step 2 – results merging: the postings lists are merged to find their intersection; use the shortest list as the base; partial match is also possible.
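A minimal Python sketch of the intersection merge of two sorted postings lists (the function name and layout are illustrative assumptions; with a plain linear merge the "shortest list as the base" advice matters most when the longer list is searched with skips or binary search):

def intersect(p1, p2):
    # merge two sorted lists of doc IDs and keep the IDs present in both
    if len(p1) > len(p2):
        p1, p2 = p2, p1            # use the shorter list as the base
    result, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

# intersect([1, 3, 5, 9], [2, 3, 9, 10]) == [3, 9]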

Step 3 – rank score computation: scores are computed with a relevance function (e.g. Okapi BM25, cosine similarity) and used in the final ranking.
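A hedged sketch of step 3 using a simplified TF-IDF weighting (not the exact Okapi or cosine formula from the slides); it assumes postings that also store term frequencies, i.e. term -> list of (doc_id, tf) pairs, which is richer than the doc-ID-only structure sketched earlier:

import math
from collections import Counter

def score(query_terms, index, doc_lengths, n_docs):
    # accumulate tf-idf contributions per document, then length-normalise
    scores = Counter()
    for term in query_terms:
        postings = index.get(term, [])
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings:
            scores[doc_id] += tf * idf
    return {d: s / doc_lengths[d] for d, s in scores.items()}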

Example

Index Construction

Time complexity: O(T), where T is the total number of term occurrences (including duplicates) in the document collection after pre-processing.
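A minimal single-pass construction sketch in Python (the tokenisation and naming are assumptions); it touches every term occurrence once, which is the O(T) behaviour stated above:

from collections import defaultdict

def build_index(docs):
    index = defaultdict(list)                     # term -> sorted list of doc IDs
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):    # one posting per (term, document)
            index[term].append(doc_id)            # doc IDs arrive in increasing order
    return index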

Index Compression

Why? Compression helps avoid disk I/O; the size of an inverted index can be reduced dramatically; the original index can still be reconstructed exactly (the compression is lossless); and all the information is represented with positive integers, so integer compression techniques apply.

Use gaps: store the differences between consecutive document IDs, e.g. 4, 10, 300, 305 -> 4, 6, 290, 5. The numbers become smaller; gaps stay large for rare terms, but that is not a big problem because rare terms have short postings lists.
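A small sketch of the gap (d-gap) transformation and its inverse, reproducing the example above:

def to_gaps(doc_ids):
    # differences between consecutive sorted doc IDs
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    # prefix sums restore the original doc IDs
    ids, total = [], 0
    for g in gaps:
        total += g
        ids.append(total)
    return ids

# to_gaps([4, 10, 300, 305]) == [4, 6, 290, 5]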

All in one

Unary For x: x - 1 bits of 0 followed by a single bit of 1, e.g. 5 -> 00001
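A one-line sketch of this unary code:

def unary(x):
    # x - 1 zero bits followed by a single 1 bit (x >= 1)
    return "0" * (x - 1) + "1"

# unary(5) == "00001"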

Elias Gamma Coding
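A sketch of the textbook Elias gamma code (the presentation may differ from the slide's formula): write floor(log2 x) zero bits followed by the binary representation of x, for x >= 1:

def elias_gamma(x):
    n = x.bit_length() - 1          # n = floor(log2 x)
    return "0" * n + format(x, "b")

# elias_gamma(5) == "00101"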

Elias Delta Coding
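A sketch of the textbook Elias delta code, which gamma-codes the length part and then appends the remaining bits of x (reusing elias_gamma from the previous sketch):

def elias_delta(x):
    n = x.bit_length() - 1                            # n = floor(log2 x)
    return elias_gamma(n + 1) + format(x, "b")[1:]    # drop the leading 1 bit of x

# elias_delta(5) == "01101"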

Golomb Coding
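A sketch of the standard Golomb code for a positive integer x with parameter b >= 2: the quotient is written in unary (using the convention above) and the remainder in truncated binary; with b = 5 this reproduces the coding tree mentioned on the next slides.

def golomb(x, b):
    q, r = divmod(x - 1, b)
    code = "0" * q + "1"                              # quotient q encoded in unary
    c = (b - 1).bit_length()                          # c = ceil(log2 b) for b >= 2
    threshold = (1 << c) - b
    if r < threshold:
        code += format(r, "b").zfill(c - 1)           # short remainder: c - 1 bits
    else:
        code += format(r + threshold, "b").zfill(c)   # long remainder: c bits
    return code

# golomb(9, 5) == "01110"   (q = 1, r = 3)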

Example

The coding tree for b=5

Selection of b A common rule of thumb from the inverted-file compression literature is b ≈ 0.69 × (N / f_t), i.e. roughly 0.69 times the average gap, where N is the number of documents and f_t is the number of documents containing the term.

Variable-Byte Coding seven bits in each byte are used to code an integer; the remaining bit is a continuation flag (0 – end of the number, 1 – more bytes follow).
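A sketch of variable-byte encoding following the slide's convention; the exact bit layout (flag in the low-order bit, most significant 7-bit group first) is an assumption:

def vbyte_encode(x):
    groups = []
    while True:
        groups.append(x & 0x7F)        # take 7 data bits at a time
        x >>= 7
        if x == 0:
            break
    groups.reverse()                   # most significant group first
    out = []
    for i, g in enumerate(groups):
        flag = 1 if i < len(groups) - 1 else 0   # 1 = continue, 0 = last byte
        out.append((g << 1) | flag)
    return out

# vbyte_encode(300) == [5, 88], i.e. 00000101 01011000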

Summary Golomb coding is better than Elias coding, and Gamma coding does not work well. Variable-byte integers are often faster to decode than variable-bit codes, at a higher storage cost. A compression technique can allow retrieval to be up to twice as fast as without compression, while the space requirement averages 20% – 25% of the cost of storing uncompressed integers.

Latent Semantic Indexing

Reason many concepts or objects can be described in multiple ways, so relevant documents may be found only via synonyms of the words in the user query; LSI deals with this problem through the identification of statistical associations of terms.

Singular value decomposition (SVD) is used to estimate the latent structure and to remove the “noise”. The result is a hidden “concept” space which associates syntactically different but semantically similar terms and documents.

LSI LSI starts with an m × n term-document matrix A: each row corresponds to a term, each column to a document, and each value is a weight, e.g. the term frequency.

Singular Value Decomposition
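A minimal numpy sketch of the decomposition and its rank-k truncation (the small matrix is an arbitrary illustration, not the slide's example):

import numpy as np

# m x n term-document matrix: rows = terms, columns = documents
A = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 0],
              [0, 0, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(s) @ Vt

k = 2                                              # number of latent concepts kept
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]
Ak = Uk @ np.diag(sk) @ Vtk                        # best rank-k approximation of A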

Query and Retrieval
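Continuing the sketch above, a common LSI retrieval formulation (an assumption, not necessarily the exact formula shown on the slide) folds the query into the k-dimensional concept space and ranks documents by cosine similarity:

q = np.array([1, 0, 1, 0], dtype=float)   # query as an m-dimensional term vector
q_k = np.diag(1.0 / sk) @ Uk.T @ q        # q_hat = Sigma_k^-1 @ U_k^T @ q

doc_vectors = Vtk.T                       # one row per document, in concept space
cosines = doc_vectors @ q_k / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q_k) + 1e-12)
ranking = np.argsort(-cosines)            # document indices, best match first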

Example

q - “user interface”

Example

Summary The original LSI paper suggests 50–350 dimensions; in practice, k needs to be determined based on the specific document collection. Association rules may be able to approximate the results of LSI.