Download presentation

Presentation is loading. Please wait.

Published byGary Marner Modified over 2 years ago

1
Space-Efficient Data Structures for Top-k Completion Giuseppe Ottaviano Università di Pisa Bo-June (Paul) Hsu Microsoft Research WWW 2013

2
String auto-completion

3
Scored string sets Top-k Completion query: Given prefix p, return k strings prefixed by p with highest scores Example: p=“tr”, k=2 (triangle, 9), (trie, 5) three2 trial1 triangle9 trie5 triple4 triply3

4
Space-Efficiency Scored string sets can be very large – Hundreds of millions of queries for web search auto- suggest Must fit in RAM for fast access – Need space-efficient solutions! We compare three solutions – RMQ Trie, based on Range Minimum Queries – Completion Trie, based on a modified trie with variable-sized pointers – Score-Decomposed Trie, based on succinct data structures

5
RMQ TRIE (RT)

6
RMQ Trie Lexicographic order → strings starting with given prefix in a contiguous range If we can find the max in a range, it is top-1 Range is split in two subranges, can proceed recursively using a heap to retrieve top-k three2 trial1 triangle9 trie5 triple4 triply3

7
RMQ Trie To store the strings we can use any data structure that keeps the strings sorted – We use a compressed trie To find max score in a range we use a succinct Range Minimum Query (RMQ) data structure – Needs only 2.6 additional bits per score, answers queries in O(log n) time This is a standard strategy, but not very fast. We use it as a baseline.

8
COMPLETION TRIE (CT)

9
t iree εε l εεεgle hr e p a l n Node label Branching character (Scored) compacted tries ye three2 trial1 triangle9 trie5 triple4 triply3 2 1 4 3 5 9

10
t iree εε l εεεgle hr e p a l n Completion Trie ye three2 trial1 triangle9 trie5 triple4 triply3 2 1 4 3 5 9 4 9 9 9

11
Completion Trie t9t9 t9t9 hree 2 ri 9 a9a9 a9a9 e5e5 e5e5 pl 4 e4e4 e4e4 y3y3 y3y3 ngle 9 l1l1 l1l1 t iree εε l εεεgle hr e p a l n ye 2 1 4 3 5 9 4 9 9 9

12
Completion Trie Scores encoded differentially (either from parent or previous sibling) Pointers and score deltas encoded with variable bytes All node information in the same stream, favoring cache-efficiency t9t9 t9t9 hree 2 ri 9 a9a9 a9a9 e5e5 e5e5 pl 4 e4e4 e4e4 y3y3 y3y3 ngle 9 l1l1 l1l1

13
SCORE-DECOMPOSED TRIE (SDT)

14
Trees as balanced parentheses () (()()()) (()(()()())) 2n bits are sufficient (and necessary) to represent a tree Can support O(1) operations with 2n + o(n) bits

15
Score-decomposed trie Builds on compressed path-decomposed tries [Grossi-Ottaviano ALENEX 2012] Parentheses-based representation of trees Dictionary-compression of node labels

16
Score-decomposed trie t iree εε l εεεgle hr e p a l n ye 2 1 4 3 5 9 h,2e,5p,4l,1 9 L : t1ri2a1ngle BP: ( ((( ) B : h epl R : 2 541

17
three2 trial1 triangle9 trie5 triple4 triply3

18
Score compression... 3 5 1 2 3 0 0 1 2 4 1900 1 1 2 3 2 1 10000... 3 bits/value11 bits/value16 bits/value Data structure to store scores in RT and SDT Packed-blocks array – “Folklore” data structure, similar to many existing packed arrays, Frame-Of-Reference, PFORDelta,… Divide the array into fixed-size blocks Encode the values of each block with the same number of bits Store separately the block offsets

19
Score compression... 3 5 1 2 3 0 0 1 2 4 1900 1 1 2 3 2 1 10000... 3 bits/value11 bits/value16 bits/value Can be unlucky – Each block may contain a large value But scores are power-law distributed Also, tree-wise monotone sorting On average, 4 bits per score

20
Space Datasetgzip CT SDT RT QueriesA 27% 57%30%31% QueriesB25%48%26%27% URLs24%57%26%27% Unigrams39%43%35%37%

21
Space

22
Time Time per returned completion on a top-10 query

23
Thanks for your attention! Questions?

Similar presentations

OK

Mike 66 Sept 20081 Succinct Data Structures: Techniques and Lower Bounds Ian Munro University of Waterloo Joint work with/ work of Arash Farzan, Alex Golynski,

Mike 66 Sept 20081 Succinct Data Structures: Techniques and Lower Bounds Ian Munro University of Waterloo Joint work with/ work of Arash Farzan, Alex Golynski,

© 2017 SlidePlayer.com Inc.

All rights reserved.

Ads by Google

Ppt on viruses and bacteria articles Ppt on central limit theorem example Conceptual architecture view ppt on mac Ppt on asymptotic notation of algorithms and data Ppt on statistics for class 11th Ppt on ar 25-50 Ppt on area of trapezium math Ppt on indian history quiz Ppt on law against child marriage in pakistan File type ppt on cybercrime training