A fast algorithm for the generalized k-keyword proximity problem given keyword offsets Sung-Ryul Kim, Inbok Lee, Kunsoo Park Information Processing Letters,


A fast algorithm for the generalized k-keyword proximity problem given keyword offsets. Sung-Ryul Kim, Inbok Lee, Kunsoo Park. Information Processing Letters, vol. 91, pp. 115–120, 2004

Abstract When searching for information on the Web, it is often necessary to use one of the available search engines. Because the number of results is quite large for most queries, we need some measure of relevance with respect to the query. One of the most important relevance factors is the proximity score, i.e., how close together the keywords appear in a given document.

Abstract A basic proximity score is given by the size of the smallest range containing all the keywords in the query. We generalize the proximity score to include many practically important cases and present an O(n log k)-time algorithm for the generalized problem, where k is the number of keywords and n is the number of occurrences of the keywords in a document.

Proximity score: used when multiple keywords are given. If the proximity is good, the keywords likely occur within one paragraph or sentence. The score cannot be computed off-line, because there are just too many possible keyword combinations, so the computation must be very efficient.

How to store docs in web search databases. Typical search: a few keywords; look for documents containing all of them. Storing each document as is would not be efficient. Typical scheme: an inverted file, which keeps a list of document IDs for each keyword. Each document ID has a list of offsets, one for each occurrence of the keyword, counted in words.

Example – one document ID
Document: I am Tom. You are Jane. I am a boy. You are a girl. I am a student. You are a dropout. …
Keyword   Offsets
i         0, 6, 14, …
am        1, 7, 15, …
tom       2, …
you       3, 10, 18, …
are       4, 11, 19, …
jane      5, …
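The offset lists in this example can be reproduced with a few lines of Python. This is only a minimal sketch for a single document: the function name `build_offsets` and the tokenization rule (lowercase runs of letters and digits) are assumptions made for illustration, not something the slides specify.

```python
import re
from collections import defaultdict

def build_offsets(text):
    """Map each word of one document to the list of its word offsets.

    Offsets are counted in words, starting at 0, as in the example slide.
    The tokenization rule is an assumption made for this sketch.
    """
    index = defaultdict(list)
    words = re.findall(r"[a-z0-9]+", text.lower())
    for offset, word in enumerate(words):
        index[word].append(offset)
    return dict(index)

doc = ("I am Tom. You are Jane. I am a boy. You are a girl. "
       "I am a student. You are a dropout.")
offsets = build_offsets(doc)
print(offsets["you"])  # word offsets of "you" in the document
```

Because the words are visited in document order, each offset list comes out already sorted, which is what the later merge step relies on.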

Terminology Range Is a continuous area in a document is inclusive and denoted by Size of range The size of range is

The basic proximity problem: given keywords and their lists of offsets, find the smallest range in the document in which all the keywords appear.

Extension #1: not all of the keywords. For the query ‘apple computer support’, all results may have a bad proximity score, yet some may score well on ‘apple’ and ‘computer’ alone. This calls for a proximity score over a partial set of the keywords.

Extension #2: multiple occurrences of keywords. For the query ‘johnson and johnson’, ‘johnson’ must appear at least twice. This calls for a proximity score that requires repetitions of keywords.

Def. of Generalized Prob. Input keywords: Lists of offsets: Thresholds: # keywords in range: Solution The smallest range containing at least keywords Each keyword more than threshold times

Previous works Gonnet et al. Two keywords within a given distance Baeza-Yates and Cunto Logarithmic time alg. with square time construction Manber and Baeza-Yates Logarithmic time alg. Given distance Superlinear space Sadakane and Imai Basic proximity problem time

Our result: the generalized problem in O(n log k) time.

The algorithm Merge phase In time Scan phase In time There can be multiple scans With scans with different thresholds and In time

The merge: the input lists are merged into a single list, denoted by L[0...n − 1]. Each entry has two fields, L[x].offset and L[x].k_i. Merging k sorted lists with n entries in total takes O(n log k) time.
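A k-way merge of this kind can be sketched with Python's standard `heapq.merge`, which keeps a heap of size k and therefore takes O(n log k) time for n entries in total. The function name `merge_offsets` and the representation of an entry as an `(offset, keyword_index)` tuple are assumptions of this sketch; the slides only say that each entry carries an offset and a keyword field.

```python
import heapq

def merge_offsets(offset_lists):
    """Merge k sorted offset lists into one list sorted by offset.

    Each entry of the result is (offset, keyword_index), where
    keyword_index is the position of the source list in offset_lists.
    """
    # Tag every offset with the index of the keyword it belongs to.
    tagged = [[(off, i) for off in lst] for i, lst in enumerate(offset_lists)]
    # heapq.merge performs a heap-based k-way merge of sorted inputs.
    return list(heapq.merge(*tagged))

# Offsets of "you", "are" and "jane" from the example document.
merged = merge_offsets([[3, 10, 18], [4, 11, 19], [5]])
print(merged)
```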

Candidate range. Def.: a candidate range is a range that matches the problem definition. The solution is a candidate range. The number of candidate ranges is less than n × (n − 1) / 2.
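To make the notion of a candidate range concrete, here is a brute-force sketch that enumerates every candidate range of a merged list. It is meant only as a reference for small inputs, not as the algorithm of the talk. Several points are assumptions of the sketch: entries are `(offset, keyword_index)` tuples, a keyword counts as satisfied when it occurs at least `thresholds[i]` times, and the size of a range is taken as the difference of its end offsets.

```python
def is_candidate(window, thresholds, k_required):
    """True if at least k_required keywords meet their threshold in window."""
    counts = [0] * len(thresholds)
    for _, kw in window:
        counts[kw] += 1
    satisfied = sum(1 for c, t in zip(counts, thresholds) if c >= t)
    return satisfied >= k_required

def candidate_ranges(merged, thresholds, k_required):
    """Enumerate all candidate ranges as (left_offset, right_offset) pairs."""
    n = len(merged)
    return [(merged[lo][0], merged[hi][0])
            for lo in range(n)
            for hi in range(lo, n)
            if is_candidate(merged[lo:hi + 1], thresholds, k_required)]

# Merged offsets of "you" (0), "are" (1) and "jane" (2) from the example.
merged = [(3, 0), (4, 1), (5, 2), (10, 0), (11, 1), (18, 0), (19, 1)]
cands = candidate_ranges(merged, [1, 1, 1], 3)
smallest = min(cands, key=lambda r: r[1] - r[0])
print(len(cands), smallest)
```

The quadratic number of index pairs, each checked by a full recount, is exactly the cost that the critical-range scan avoids.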

Critical range. Def.: a critical range is a candidate range that does not properly contain another candidate range. Lemma: the solution is a critical range. Proof sketch: the solution is a candidate range; if it were not a critical range, it would properly contain a smaller range that also matches the problem definition, contradicting minimality.

# critical ranges. Lemma: critical ranges are not nested (immediate from the definition of critical ranges). Lemma: there is a linear number of critical ranges. Critical ranges do not share left ends, since two that did would be nested, and there are only linearly many possible left ends.

Difference between critical ranges and candidate ranges

Scan critical ranges in linear time. Variables used: the current left end pointer L and the current right end pointer R, so that (L, R) is the current range; a counter c_i for each keyword, holding its number of occurrences in the current range; and a threshold counter h, holding the number of keywords over the threshold.

Updating the counters. The counter for each keyword is updated each time L or R is moved (by one), so that it reflects the number of occurrences of that keyword in the range. Only one counter is affected per move. At each move, check whether the current range is a candidate. To avoid looking at all counters, the threshold counter holds the number of counters over the threshold.

The first critical range. Repeatedly move the right pointer R until the current range is a candidate range. The right end pointer is then at the end point of the first critical range, because no range that ends before R is a candidate range. Next, repeatedly move the left pointer L until the current range is not a candidate range. Move L back by one and you have the first critical range.

Illustration: critical ranges, with the pointers L and R marking the ends of the current range as the scan advances.

The next critical range: move L to the right by one place, then repeat as if looking for the first.
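The scan described on the preceding slides can be sketched as follows. This is one possible reading, not the authors' code. The assumptions are that entries of the merged list are `(offset, keyword_index)` tuples, that every threshold is at least 1, that a keyword is satisfied when its count reaches `thresholds[i]`, and that the size of a range is the difference of its end offsets. The names `critical_ranges` and `smallest_range` are hypothetical.

```python
def critical_ranges(merged, thresholds, k_required):
    """Return all critical ranges as (left_offset, right_offset) pairs.

    counts[i] plays the role of the per-keyword counter c_i, and h is the
    threshold counter: the number of keywords whose count meets its threshold.
    """
    counts = [0] * len(thresholds)
    h = 0
    left = 0
    found = []
    for right, (r_off, r_kw) in enumerate(merged):
        # Move R by one: exactly one keyword counter changes.
        counts[r_kw] += 1
        if counts[r_kw] == thresholds[r_kw]:
            h += 1
        if h < k_required:
            continue  # not a candidate yet, keep moving R
        # Move L until the current range stops being a candidate.
        while h >= k_required:
            l_kw = merged[left][1]
            if counts[l_kw] == thresholds[l_kw]:
                h -= 1
            counts[l_kw] -= 1
            left += 1
        # The last candidate started one place back: that range is critical.
        # L already sits one place to its right, ready for the next search.
        found.append((merged[left - 1][0], r_off))
    return found

def smallest_range(merged, thresholds, k_required):
    """Smallest critical range, or None if no candidate range exists."""
    return min(critical_ranges(merged, thresholds, k_required),
               key=lambda r: r[1] - r[0], default=None)

# "johnson" (keyword 0) must occur twice, "and" (keyword 1) once.
merged = [(0, 0), (1, 1), (2, 0), (5, 1), (9, 0)]
print(smallest_range(merged, [2, 1], 2))
```

Both pointers only ever move to the right here, so the total number of pointer moves is at most 2n, and each move touches one keyword counter plus the threshold counter.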

Time complexity - scan. Each movement of a pointer takes constant time, because only two variables are updated per movement: the counter for the affected keyword and the threshold counter. Each pointer makes O(n) movements, so the scan finishes in linear time, O(n).

Conclusions: the algorithm runs in linear time if the number of keywords k is a constant, if the merged form is given, or if working on the original document. Open question: is it optimal?