Presentation transcript: "Partitioned Elias-Fano Indexes"
1. Partitioned Elias-Fano Indexes
Giuseppe Ottaviano (ISTI-CNR, Pisa), Rossano Venturini (Università di Pisa)
2. Inverted indexes
Documents:
1: [it is what it is not]
2: [what is a]
3: [it is a banana]
Posting lists (term → docids):
a: 2, 3
banana: 3
is: 1, 2, 3
it: 1, 3
not: 1
what: 1, 2
The inverted index is the core data structure of Information Retrieval. We seek fast and space-efficient encodings of posting lists (index compression).
3. Sequences in posting lists
Generally, a posting list is composed of:
- a sequence of docids: strictly monotonic
- a sequence of frequencies/scores: strictly positive
- a sequence of positions: concatenation of strictly monotonic lists
- additional occurrence data: ???
We focus on docids and frequencies.
5. Elias-Fano encoding
- Data structure from the ’70s, mostly known in the succinct data structures niche
- Natural encoding of monotonically increasing sequences of integers
- Recently successfully applied to inverted indexes [Vigna, WSDM 2013]
- Used by Facebook Graph Search!
6. Elias-Fano representation
Example: 2, 3, 5, 7, 11, 13, 24. Each value is split into its w − l upper bits and its l = 2 lower bits:
2 = 000|10, 3 = 000|11, 5 = 001|01, 7 = 001|11, 11 = 010|11, 13 = 011|01, 24 = 110|00
- Upper bits: count in unary the sizes of the upper-bits "buckets", including empty ones: 110 110 10 10 0 0 10
- Lower bits: concatenate them: 10 11 01 11 11 01 00
Together, these two bitvectors form the Elias-Fano representation of the sequence.
7. Elias-Fano representation
Example: 2, 3, 5, 7, 11, 13, 24
- n: sequence length; U: largest sequence value
- Maximum bucket: ⌊U / 2^l⌋. Example: ⌊24 / 2^2⌋ = 6 = 110
- Upper bits: one 0 per bucket and one 1 per value
- Space: ⌊U / 2^l⌋ + n + nl bits
8. Elias-Fano representation
Example: 2, 3, 5, 7, 11, 13, 24
- Space: ⌊U / 2^l⌋ + n + nl bits
- Can show that l = ⌈log(U/n)⌉ is optimal, giving (2 + log(U/n))n bits
- U/n is the "average gap"
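The encoding on the last two slides can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: l is rounded up so the slide's example gets l = 2 lower bits, and the two bitvectors are plain Python lists rather than packed machine words.

```python
import math

def elias_fano_encode(seq):
    """Encode a strictly increasing sequence of non-negative integers."""
    n, u = len(seq), seq[-1]
    l = max(0, math.ceil(math.log2(u / n)))  # number of lower bits
    lower, upper = [], []
    bucket = 0
    for x in seq:
        for b in range(l - 1, -1, -1):       # l low bits of x, MSB first
            lower.append((x >> b) & 1)
        while bucket < (x >> l):             # one 0 closes each bucket
            upper.append(0)
            bucket += 1
        upper.append(1)                      # one 1 per value
    upper.append(0)                          # close the last bucket
    return l, upper, lower

l, upper, lower = elias_fano_encode([2, 3, 5, 7, 11, 13, 24])
print(l)                         # 2
print(len(upper) + len(lower))   # 28 bits in total
```

On the slide's sequence this reproduces the upper bits 110 110 10 10 0 0 10 and the concatenated lower bits 10 11 01 11 11 01 00, in line with the ⌊U/2^l⌋ + n + nl bound (6 + 7 + 14, plus one closing 0).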
9. Elias-Fano representation
Example: 2, 3, 5, 7, 11, 13, 24
nextGEQ(6) ?= 7
- Bucket of the target: ⌊6 / 2^2⌋ = 1 = 001
- Find the first candidate bucket: find the 1st 0 in the upper bits
- With additional data structures and broadword techniques → O(1)
- Linear scan inside the (small) bucket
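The two steps above (jump to the target's bucket, then scan) can be sketched as follows. This is a readable sketch, not a real implementation: the "find the b-th 0" step is done with a linear scan here, whereas real implementations answer it in O(1) with a select structure and broadword tricks.

```python
def next_geq(upper, lower, l, target):
    """Return the first value >= target, or None if none exists."""
    b = target >> l                  # bucket containing the target
    pos = zeros = rank = 0           # rank = number of 1s (values) passed
    while zeros < b:                 # skip to the start of bucket b
        if upper[pos] == 0:
            zeros += 1
        else:
            rank += 1
        pos += 1
    while pos < len(upper):          # scan from bucket b onwards
        if upper[pos] == 1:
            low = 0
            for j in range(l):       # reassemble the l lower bits
                low = (low << 1) | lower[rank * l + j]
            value = (zeros << l) | low
            if value >= target:
                return value
            rank += 1
        else:
            zeros += 1
        pos += 1
    return None

# Elias-Fano representation of 2, 3, 5, 7, 11, 13, 24 with l = 2
upper = [1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0]
lower = [1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0]
print(next_geq(upper, lower, 2, 6))   # 7
```

nextGEQ is the workhorse of list intersection (AND queries), which is why making the bucket jump O(1) matters.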
10. Elias-Fano representation
Example: 2, 3, 5, 7, 11, 13, 24
The (2 + log(U/n))n-bit space is independent of the distribution of the values!
… is this a good thing?
11. Term-document matrix
Alternative interpretation of the inverted index: a binary matrix with one row per term and one column per document, with an X where the term occurs.
        1  2  3
a          X  X
banana        X
is      X  X  X
it      X     X
not     X
what    X  X
Gaps are the distances between the Xs.
12. Gaps are usually small
Assume that documents from the same domain have similar docids:
unipi.it/
unipi.it/students
unipi.it/research
unipi.it/.../ottaviano
sigir.org/
sigir.org/venue
sigir.org/fullpapers
This creates "clusters" of docids: posting lists (e.g. for "pisa" or "sigir") contain long runs of very close integers, that is, long runs of very small gaps.
13. Elias-Fano and clustering
Consider the following two lists, both with n elements and largest value U:
- 1, 2, 3, 4, …, n − 1, U
- n random values between 1 and U
Elias-Fano compresses both to exactly the same number of bits: (2 + log(U/n))n. But the first list is far more compressible: it is "sufficient" to store n and U, i.e. O(log n + log U) bits. Elias-Fano doesn't exploit clustering.
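A back-of-the-envelope check of this point, using the (2 + log2(U/n))·n space bound from the earlier slides (the concrete n and U below are made-up example values):

```python
import math

def ef_bits(n, u):
    """Elias-Fano space bound, in bits, for n values with universe u."""
    return (2 + max(0.0, math.log2(u / n))) * n

n, u = 1_000_000, 100_000_000
# Elias-Fano charges the same space to both lists:
print(ef_bits(n, u) / n)             # bits per element, same either way
# ...but the run 1, 2, ..., n-1, U is described by n and U alone:
print(math.log2(n) + math.log2(u))   # O(log n + log U) bits in total
```

The gap is dramatic: millions of bits for the Elias-Fano encoding of the run versus a few dozen bits of information actually needed.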
14. Partitioned Elias-Fano
XXX XXXXX XXXXXXXXX X X X X XXXX X XXX
- Partition the sequence into chunks
- Add pointers to the beginning of each chunk
- Represent each chunk, and the sequence of pointers, with Elias-Fano
- If the chunks "approximate" the clusters, compression improves
15. Partition optimization
We want to find, among all possible partitions, the one that takes the least space.
- Exhaustive search is exponential
- Dynamic programming solves it in quadratic time
- Our solution: a (1 + ε)-approximation in linear time, O(n log(1/ε) / log(1 + ε))
- Reduce to a shortest path in a sparsified DAG
16. Partition optimization
- Nodes correspond to sequence elements
- Edges correspond to potential chunks
- Paths correspond to partitions of the sequence
17. Partition optimization
- Each edge weight is the cost of the chunk defined by the edge endpoints
- Shortest path = minimum-cost partition
- Edge costs can be computed in O(1)…
- … but the number of edges is quadratic!
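The exact quadratic dynamic program over this DAG can be sketched as below. The cost model is a stand-in, not the paper's exact one: it uses the (2 + log2(u/m))·m Elias-Fano bound for a chunk of m values spanning a universe of u, plus an assumed fixed pointer cost per chunk (POINTER_BITS is made up for illustration).

```python
import math

POINTER_BITS = 64  # assumed fixed per-chunk pointer overhead

def chunk_cost(seq, i, j):
    """Stand-in Elias-Fano cost of encoding chunk seq[i:j]."""
    m = j - i                                # chunk length
    base = seq[i - 1] if i > 0 else 0
    u = seq[j - 1] - base                    # chunk universe
    return (2 + max(0.0, math.log2(u / m))) * m + POINTER_BITS

def best_partition_cost(seq):
    """Shortest path over all chunkings: O(n^2) edges, O(1) per edge."""
    n = len(seq)
    dist = [math.inf] * (n + 1)
    dist[0] = 0.0
    for j in range(1, n + 1):
        for i in range(j):                   # edge (i, j) = chunk seq[i:j]
            dist[j] = min(dist[j], dist[i] + chunk_cost(seq, i, j))
    return dist[n]

# Two tight clusters: partitioning beats a single Elias-Fano chunk.
seq = list(range(1, 101)) + list(range(100_000, 100_100))
print(best_partition_cost(seq) < chunk_cost(seq, 0, len(seq)))  # True
```

The double loop is exactly the quadratic blow-up the next slides attack: sparsification keeps only O(log(1/ε2) / log(1 + ε1)) of these edges per node.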
19. Sparsification: idea no. 1
- General DAG sparsification technique
- Quantize edge costs into classes: class i contains the costs between (1 + ε1)^i and (1 + ε1)^(i+1)
- For each node and each cost class, keep only one maximal edge
- O(log n / log(1 + ε1)) edges per node!
- The shortest path in the sparsified DAG is at most (1 + ε1) times more expensive than in the original DAG
- The sparsified DAG can be computed on the fly
20. Sparsification: idea no. 2
If we split a chunk at an arbitrary position:
- New cost ≤ Old cost + cost of new pointer
- If the chunk is "big enough", the loss is negligible
- So we keep only edges with cost O(1/ε2)
- At most O(log(1/ε2) / log(1 + ε1)) edges per node
21. Sparsification
- The sparsified DAG has O(n log(1/ε2) / log(1 + ε1)) edges!
- For fixed ε1 and ε2, that is O(n), vs. O(n²) in the original DAG
- The overall approximation factor is (1 + ε1)(1 + ε2)
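Plugging numbers into these bounds shows the scale of the savings; the ε values below are arbitrary examples, not the paper's settings, and constants are ignored.

```python
import math

n = 1_000_000
eps1, eps2 = 0.03, 0.3           # arbitrary example values

edges_full = n * (n - 1) // 2                          # dense chunk DAG
edges_per_node = math.log(1 / eps2) / math.log(1 + eps1)
edges_sparse = n * edges_per_node                      # sparsified DAG
print(edges_per_node)            # a few dozen edges per node
print(edges_full / edges_sparse) # orders of magnitude fewer edges
```

With these values the sparsified DAG keeps roughly 40 edges per node instead of up to n, which is what makes the linear-time shortest path feasible.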
22. Dependency on ε1
[Plot: bits per posting vs. time in seconds; notice the scale!]
23. Dependency on ε2
Here we go from O(n log n) to O(n).
[Plot: bits per posting vs. time in seconds]
24. Results on GOV2 and ClueWeb09
- OR queries
- AND queries
More experiments in the paper; the code is available.