Published by Jean Tamsyn Newton. Modified over 9 years ago.
1
The CompleteSearch Engine: Interactive, Efficient, and Towards IR&DB Integration
Holger Bast, Ingmar Weber
Conference on Innovative Data Systems Research (CIDR 2007)
2009. 05. 08.
Summarized by Jaehui Park, IDS Lab., Seoul National University
Presented by Jaehui Park, IDS Lab., Seoul National University
2
Copyright 2008 by CEBT
Introduction
  Interactive search engine
  Variety of complex features
    – Automatic query completion (IR perspective)
    – Semi-structured retrieval
    – Semantic search
    – DB-style joins and grouping (DB perspective)
    – Range search (Theorist's perspective)
  Combining IR-style with DB-style querying
    – Example query: ir db integration conference:sigmod author:
  Context-sensitive prefix search and completion
    For a given collection of documents, with a unique id for each document and a unique id for each of the words used in the collection, a context-sensitive prefix search and completion query is a pair (D, W), where D is a set of document ids and W is a range of word ids.
    To process the query means to compute a ranked list of all pairs (d, w), where word w occurs in document d, d is from D, and w is from W.
  Novel index data structure
    – HYB [SIGIR 2006][SPIRE 2006]
      Uses no more space than a state-of-the-art compressed inverted index
      With 10 times faster query processing
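The (D, W) query definition above can be illustrated with a naive sketch. This is not the HYB index, only a brute-force scan showing what the query is supposed to return. The function names, the toy vocabulary, and the per-pair score field are assumptions made for illustration.

```python
from bisect import bisect_left

def prefix_to_word_range(vocab, prefix):
    """Map a prefix to a half-open range [lo, hi) of word ids.

    vocab is a sorted list of words, so a word's id is its position and
    all words sharing a prefix are contiguous.
    """
    lo = bisect_left(vocab, prefix)
    hi = bisect_left(vocab, prefix + "\uffff")  # sentinel sorts after any completion
    return lo, hi

def complete(postings, D, W):
    """Return all (d, w, score) with d in D and w in W, best score first.

    postings is an iterable of (doc_id, word_id, score) triples; the score
    is a hypothetical precomputed relevance value.
    """
    lo, hi = W
    hits = [(d, w, s) for (d, w, s) in postings if d in D and lo <= w < hi]
    return sorted(hits, key=lambda t: (-t[2], t[0], t[1]))

# Toy data (invented): word ids are positions in the sorted vocabulary.
vocab = ["db", "ir", "search", "searching", "sigmod"]
postings = [(1, 2, 0.5), (2, 3, 0.9), (3, 2, 0.7), (2, 0, 0.1)]
W = prefix_to_word_range(vocab, "sea")
result = complete(postings, {1, 2}, W)
```

Here D would be the set of documents matching the earlier query words and W the id range of all words starting with the prefix typed so far.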
3
HYB
  Indexing data structure
    Space usage: HYB == INV (inverted index)
      – Empirical entropy
        The inherent space complexity of an index
    Processing time: HYB faster than INV
      – The number of operations needed
      – The latencies of access to data
  Basic idea
    Precompute inverted lists for unions of words
      – The union of all lists for a word range W
        W: arbitrary word range
    The basic unit of processing is a block
      – Block => a range of words
        A block consists of all pairs (w, d) for the words in its range
    Each block is sorted by document id
      – Effective gap encoding scheme
        Compressed multiset
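The block idea above can be sketched as follows: collect all pairs for a word range, sort them by document id, and store the document ids as gaps. This is a minimal illustration, not the authors' encoding; the function names and the plain-integer gap representation are assumptions, and a real index would further compress the gaps and word ids.

```python
def build_block(postings, word_range):
    """Build one block for the half-open word-id range [lo, hi).

    postings is an iterable of (word_id, doc_id) pairs. The block is kept
    as two parallel lists: gaps between consecutive doc ids, and word ids.
    A doc id may repeat (gap 0), so the doc ids form a multiset.
    """
    lo, hi = word_range
    pairs = sorted((d, w) for (w, d) in postings if lo <= w < hi)
    gaps, words, prev = [], [], 0
    for d, w in pairs:
        gaps.append(d - prev)  # small gaps compress well
        words.append(w)
        prev = d
    return gaps, words

def decode_block(gaps, words):
    """Recover the (doc_id, word_id) pairs, sorted by doc id."""
    pairs, d = [], 0
    for g, w in zip(gaps, words):
        d += g
        pairs.append((d, w))
    return pairs

# Toy postings (invented): (word_id, doc_id)
postings = [(0, 3), (1, 3), (0, 5), (2, 6), (5, 7)]
gaps, words = build_block(postings, (0, 3))
```

Decoding is a single sequential pass, which matches the emphasis on sequential access in the rest of the talk.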
4
HYB (An example)
  10 documents (ids: 3,5,6,7,8,9,11,12,13,15)
  A block for the word range A-D
    HYB consists of a collection of such blocks
    Word-in-document pairs: (w,d)
  Two operations on the block
    Intersection with a sorted list of document ids
    Intersection with a list of word ids
  For example,
    – Query: ontol sem search
      The sorted list of ids of documents matching ontol sem
      The sorted list of document ids from the blocks containing all occurrences of the word search
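The two block operations above can be combined in one sequential pass. The sketch below is an assumption-laden illustration: the function name is hypothetical, the block is held uncompressed as (doc_id, word_id) pairs, and the word ids in the toy block are invented (only the document ids come from the slide).

```python
def intersect_block(block, doc_ids, word_range):
    """Keep pairs whose doc id is in doc_ids and whose word id is in [lo, hi).

    block: list of (doc_id, word_id) sorted by doc_id.
    doc_ids: sorted list of candidate document ids.
    Both inputs are scanned once, left to right (a merge-style intersection).
    """
    lo, hi = word_range
    out, i = [], 0
    for d, w in block:
        while i < len(doc_ids) and doc_ids[i] < d:
            i += 1
        if i == len(doc_ids):
            break  # no candidate documents left
        if doc_ids[i] == d and lo <= w < hi:
            out.append((d, w))
    return out

# Doc ids taken from the slide; word ids 0..3 are made up for the example.
block = [(3, 0), (5, 1), (5, 2), (6, 0), (9, 3), (11, 1), (13, 2), (15, 3)]
candidates = [5, 9, 12, 13]  # e.g. documents matching the earlier query words
matches = intersect_block(block, candidates, (1, 3))
```

No sorting or random access is needed, because both lists are already ordered by document id.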
5
HYB
  Block volumes
    The number of pairs (w,d)
    A small fraction of the total number of documents
      – c < 1 (c ~= 0.2)
  Advantages
    Simple
    Can be compressed extremely well
      – Proven both theoretically and empirically in [SIGIR 2006]
    Enables processing of prefix search and completion queries
      – By mere sequential access
      – Without sorting or other non-linear operations
    Rank by a precomputed score
      – For each word-in-document pair
        Okapi BM25 score + IDF
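One way to picture the block-volume rule is a greedy pass over the vocabulary that closes a block once its volume reaches c times the number of documents. The slide only states the target volume; the greedy splitting rule, the function name, and the toy numbers below are assumptions.

```python
def choose_blocks(list_lengths, num_docs, c=0.2):
    """Split word ids 0..len-1 into half-open ranges of roughly equal volume.

    list_lengths[w] is the number of (w, d) pairs for word id w.
    A block is closed as soon as its volume reaches c * num_docs.
    """
    target = c * num_docs
    ranges, start, volume = [], 0, 0
    for w, n in enumerate(list_lengths):
        volume += n
        if volume >= target:
            ranges.append((start, w + 1))
            start, volume = w + 1, 0
    if start < len(list_lengths):
        ranges.append((start, len(list_lengths)))  # leftover words form a last block
    return ranges

# Toy inverted-list lengths (invented) for six words over eight documents.
blocks = choose_blocks([1, 1, 1, 4, 2, 1], num_docs=8, c=0.25)
```

Rare words end up grouped together, while a frequent word can fill a block on its own.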
6
CompleteSearch's feature set
  Context-sensitive autocompletion search
    Display completions of the last query word that
      – Would lead to good hits, as well as the best hits for any of these completions
    Compute all completions of the last query word
      Google Suggest, Apple's Spotlight, AlltheWeb Live Search
        We remark that a prototype of our engine already existed when Google Suggest and Apple Spotlight were launched.
      Algorithmically easier
7
CompleteSearch's feature set
  DB-style joins
    By adding special words
      – DB-style join functionality
      – Intersect the two lists of completions == attribute-value pairs for the join attributes
      – Ex) query -> table:ABC attr_k: and table:XYZ attr_k:
    Something standard IR-style keyword search cannot handle
      – conference:sigir author:
      – conference:sigmod author:
      – The completions of the two queries
        Intersecting the two lists of authors
        No document is a SIGIR paper and a SIGMOD paper at the same time
    When the answer is spread over several pages
      – Which German chancellors had an audience with the pope?
        Combine information from the following
          One page about Angela Merkel (current German chancellor)
          Another page about the current pope having met Angela Merkel
        Intersection of the following two queries
          german chancellor politician:
          audience pope politician:
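The join above amounts to running two completion queries and intersecting their completion sets. The sketch below shows this on a tiny in-memory collection; the helper name, the document layout (a set of words per document), and the author names are all invented for illustration.

```python
def completions(docs, context_words, prefix):
    """Words starting with prefix that occur in documents containing all context words."""
    found = set()
    for words in docs.values():
        if all(c in words for c in context_words):
            found.update(w for w in words if w.startswith(prefix))
    return found

# Toy collection (invented): each paper carries special words for venue and authors.
docs = {
    1: {"conference:sigir", "author:alice", "author:bob"},
    2: {"conference:sigmod", "author:bob", "author:carol"},
    3: {"conference:sigir", "author:dave"},
}

sigir_authors = completions(docs, ["conference:sigir"], "author:")
sigmod_authors = completions(docs, ["conference:sigmod"], "author:")
both = sorted(sigir_authors & sigmod_authors)
```

No single document matches both venues, so the join has to happen on the completion lists rather than on the document lists.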
8
CompleteSearch's feature set
  Structured search in XML documents
    XML tags as special words
    .. (two dots)
      – Proximity operator
    Ex) tag:email..tag:subj..dbworld
      Retrieve all email messages mentioning dbworld in their subject line
  Semantic search
    Tag documents
      – Ex) politician:tony_blair
      – The necessity of semantic annotation (proactive behavior)
    Which politician had a private audience with a pope?
      – Query: audience pope politician:
        Compute ranked list of completions of politician: which occur in the context of audience pope
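The semantic search step can be sketched as two parts: annotate documents with special words such as politician:tony_blair, then rank the completions of politician: in the context of the other query words. The entity dictionary, the toy sentences, and ranking by number of matching documents are assumptions; the slides do not specify how annotation or ranking is done here.

```python
from collections import Counter

# Hypothetical entity dictionary: surface phrase -> special word.
ENTITIES = {
    "tony blair": "politician:tony_blair",
    "angela merkel": "politician:angela_merkel",
}

def annotate(text):
    """Return the document's word set, extended with entity tags found in it."""
    lowered = text.lower()
    words = set(lowered.split())
    for phrase, tag in ENTITIES.items():
        if phrase in lowered:
            words.add(tag)
    return words

def ranked_completions(docs, context, prefix):
    """Completions of prefix in documents containing all context words, most frequent first."""
    counts = Counter()
    for words in docs:
        if all(c in words for c in context):
            counts.update(w for w in words if w.startswith(prefix))
    return [w for w, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]

# Invented toy sentences, used only to exercise the code.
texts = [
    "Angela Merkel had a private audience with the pope",
    "Tony Blair met the pope for an audience",
    "Angela Merkel visited the pope in a second audience",
    "Tony Blair gave a speech",
]
docs = [annotate(t) for t in texts]
answer = ranked_completions(docs, ["audience", "pope"], "politician:")
```

Because the tags are ordinary words in the index, the same prefix completion machinery answers the semantic query.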
9
Lessons Learned
  Locality of Access
    For efficiency
      – Access data as sequentially as possible
        Faster than random access (100 times)
      – Process as little data as possible per query
        Extensive use of compression
      – Hardware-aware implementation
        When it comes to algorithms highly optimized for sequential access to data, the choice of programming language is critical: C++
    An interactive web application
      AJAX
  User Feedback
    The vast majority of users is not willing to read even the tiniest bit of documentation
      – Make the user interface intuitive and simple
10
Things to do
  We describe CompleteSearch, an interactive search engine that offers the user a variety of complex features
    Automatic query completion
    Semi-structured retrieval
    Semantic search
    DB-style joins
  In CompleteSearch
    Done
      – Limit the sheer amount of data that has to be processed per query
      – Make access to it almost exclusively sequential
    Undone
      – We have not yet fully exploited the potential of top-k retrieval techniques
      – How to deal with dynamic updates