On the Intersection of Inverted Lists Yangjun Chen and Weixin Shen Dept. Applied Computer Science, University of Winnipeg 515 Portage Ave. Winnipeg, Manitoba, Canada R3B 2E9

Outline
- Motivation
- New Index Structure
  - Trie over word sequences
  - Interval sequence assigned to words
  - Sublists assigned to intervals
- Algorithm
  - Basic algorithm based on linear search
  - Algorithm based on binary search
- Experiments
- Summary

Motivation
Evaluation of conjunctive queries in text databases and search engines: w_1 ∧ w_2 ∧ … ∧ w_k, where each w_i is a word. The task is to find all the documents containing these words.
Example: cat ∧ dog

New Index Structures

Documents and word sequences:

DocId   words         sorted ws
1       a, f, d       d, f, a
2       a, d          d, a
3       a, e, d       e, d, a
4       f, b, a       f, a, b
5       c, d, e       e, d, c
6       d, f, e, c    e, d, f, c
7       f, d, e, a    e, d, f, a
8       f, d, e, b    e, d, f, b
9       e, c          e, c
10      a, e, f       e, f, a
11      f, e, c       e, f, c

Inverted lists:

e: {3, 5, 6, 7, 8, 9, 10, 11}
d: {1, 2, 3, 5, 6, 7, 8}
f: {1, 4, 6, 7, 8, 10, 11}
a: {1, 2, 3, 4, 7, 10}
c: {5, 6, 9, 11}
b: {4, 8}

e ∧ d = {3, 5, 6, 7, 8, 9, 10, 11} ∩ {1, 2, 3, 5, 6, 7, 8} = {3, 5, 6, 7, 8}
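The conventional approach on this example can be sketched as follows. This is only an illustration: the function names are hypothetical, and breaking frequency ties alphabetically is an assumption that happens to reproduce the "sorted ws" column (d before f).

```python
from collections import defaultdict

# Example documents from the slide (DocId -> words).
DOCS = {
    1: ["a", "f", "d"], 2: ["a", "d"], 3: ["a", "e", "d"], 4: ["f", "b", "a"],
    5: ["c", "d", "e"], 6: ["d", "f", "e", "c"], 7: ["f", "d", "e", "a"],
    8: ["f", "d", "e", "b"], 9: ["e", "c"], 10: ["a", "e", "f"],
    11: ["f", "e", "c"],
}

def build_inverted_lists(docs):
    """Map each word to the ascending list of DocIds that contain it."""
    lists = defaultdict(list)
    for doc_id in sorted(docs):
        for word in set(docs[doc_id]):
            lists[word].append(doc_id)
    return dict(lists)

def intersect_sorted(xs, ys):
    """Merge-style intersection of two ascending DocId lists."""
    i = j = 0
    out = []
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            out.append(xs[i])
            i += 1
            j += 1
        elif xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return out

def sort_by_frequency(docs, inverted):
    """Reorder the words of each document by descending word frequency.

    Assumption: ties are broken alphabetically.
    """
    order = sorted(inverted, key=lambda w: (-len(inverted[w]), w))
    rank = {w: r for r, w in enumerate(order)}
    return {d: sorted(ws, key=rank.__getitem__) for d, ws in docs.items()}

inverted = build_inverted_lists(DOCS)
sorted_ws = sort_by_frequency(DOCS, inverted)
print(intersect_sorted(inverted["e"], inverted["d"]))  # [3, 5, 6, 7, 8]
```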

New Index Structures

Sorted word sequences:

DocId   sequence
1       d, f, a
2       d, a
3       e, d, a
4       f, a, b
5       e, d, c
6       e, d, f, c
7       e, d, f, a
8       e, d, f, b
9       e, c
10      e, f, a
11      e, f, c

[Fig. 1: Trie over the sorted word sequences. Each node (v_0 to v_19) is labeled with a word and an interval; the root carries the interval [1, 20].]

New Index Structures

Trie. Assume that S = {s_1, …, s_n}. If |S| = 0, the trie is, of course, empty. For |S| = 1, trie(S) is a single node. If |S| > 1, S is split into m (possibly empty) subsets S_1, S_2, …, S_m so that a string is in S_j if its first word is w_j (1 ≤ j ≤ m). The tries trie(S_1), trie(S_2), …, trie(S_m) are constructed in the same way, except that at the kth step the splitting of sets is based on the kth words in the sequences. They are then connected from their respective roots to a single node to create trie(S).

Tree encoding. Label each node v in a trie with an interval I_v = [α_v, β_v], where β_v denotes the rank of v in a post-order traversal of the trie. Here the ranks are assumed to begin with 1, and all the children of a node are assumed to be ordered and fixed during the traversal. Furthermore, α_v denotes the lowest rank for any node u in T[v] (the subtree rooted at v, including v). Thus, for any node u in T[v], we have I_u ⊆ I_v, since the post-order traversal enters a node before all of its children and leaves after having visited all of its children.
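The trie construction and the post-order interval labeling can be sketched as below. This is a minimal sketch under stated assumptions: the trie is stored as nested dicts, children are ordered by first insertion, and the sequences are inserted with document 4 before document 3 so that the root's children come out as d, f, e, which reproduces the intervals used in this example. The function names are hypothetical.

```python
from collections import defaultdict

# Sorted word sequences of the example. Assumption: insertion order fixes the
# child order; document 4 (f, a, b) is inserted before document 3 (e, d, a).
SEQUENCES = [
    ["d", "f", "a"], ["d", "a"], ["f", "a", "b"], ["e", "d", "a"],
    ["e", "d", "c"], ["e", "d", "f", "c"], ["e", "d", "f", "a"],
    ["e", "d", "f", "b"], ["e", "c"], ["e", "f", "a"], ["e", "f", "c"],
]

def build_trie(sequences):
    """Nested dicts: each key is a word, each value the sub-trie below it."""
    root = {}
    for seq in sequences:
        node = root
        for word in seq:
            node = node.setdefault(word, {})  # dicts preserve insertion order
    return root

def interval_sequences(root):
    """Label every node with (alpha, beta) and group the intervals by word.

    beta is the post-order rank of the node (starting at 1); alpha is the
    lowest rank found in the subtree rooted at the node.
    """
    seqs = defaultdict(list)
    counter = 0

    def visit(children):
        """Number all subtrees in `children`; return their lowest rank."""
        nonlocal counter
        lowest = None
        for word, sub in children.items():
            child_low = visit(sub)
            counter += 1
            beta = counter
            alpha = child_low if child_low is not None else beta
            seqs[word].append((alpha, beta))
            if lowest is None:
                lowest = alpha
        return lowest

    low = visit(root)
    counter += 1  # the root is ranked last
    root_interval = (low if low is not None else counter, counter)
    return {w: sorted(v) for w, v in seqs.items()}, root_interval

L, ROOT_INTERVAL = interval_sequences(build_trie(SEQUENCES))
print(L["d"], ROOT_INTERVAL)  # [(1, 4), (8, 14)] (1, 20)
```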

New Index Structures
More than one node may be labeled with the same word. We associate each word w with an interval sequence of the form L_w = I_w^1, I_w^2, …, I_w^k, where k is the number of nodes labeled with w and each I_w^i = [I_w^i[1], I_w^i[2]] (1 ≤ i ≤ k) is the interval associated with one of the nodes labeled with w.

New Index Structures

Interval sequences and the corresponding inverted lists:

word   interval sequence L_w                             inverted list
e      [8, 19]                                           {3, 5, 6, 7, 8, 9, 10, 11}
d      [1, 4][8, 14]                                     {1, 2, 3, 5, 6, 7, 8}
f      [1, 2][5, 7][10, 13][16, 18]                      {1, 4, 6, 7, 8, 10, 11}
a      [1, 1][3, 3][5, 6][8, 8][11, 11][16, 16]          {1, 2, 3, 4, 7, 10}
c      [9, 9][10, 10][15, 15][17, 17]                    {5, 6, 9, 11}
b      [5, 5][12, 12]                                    {4, 8}

In general, an interval sequence is shorter than the corresponding inverted list. The longer an inverted list, the shorter the corresponding interval sequence.

New Index Structures
Assignment of DocIDs to intervals

[Fig. 2: The trie of Fig. 1 with a set of DocIDs assigned to each node.]

Node v_14 is labeled with the interval [10, 13]. The set {6, 7, 8} assigned to v_14 can be considered as the set assigned to [10, 13].

L_d: [1, 4] → {1, 2}, [8, 14] → {3, 5, 6, 7, 8}
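One way to compute these sublists, sketched under the assumption that a node receives every DocID whose sorted sequence passes through it (which matches the sets shown for L_d and for v_14). Nodes are identified here by their path from the root rather than by interval; that representation and the function name are hypothetical.

```python
from collections import defaultdict

# Sorted word sequences of the example, keyed by DocId.
SORTED_SEQS = {
    1: ("d", "f", "a"), 2: ("d", "a"), 3: ("e", "d", "a"), 4: ("f", "a", "b"),
    5: ("e", "d", "c"), 6: ("e", "d", "f", "c"), 7: ("e", "d", "f", "a"),
    8: ("e", "d", "f", "b"), 9: ("e", "c"), 10: ("e", "f", "a"),
    11: ("e", "f", "c"),
}

def sublists_by_path(sorted_seqs):
    """Assign to each trie node the DocIds whose sequence passes through it.

    A node is identified by the tuple of words on the path from the root.
    """
    sets = defaultdict(set)
    for doc_id, seq in sorted_seqs.items():
        for depth in range(1, len(seq) + 1):
            sets[seq[:depth]].add(doc_id)
    return dict(sets)

sublists = sublists_by_path(SORTED_SEQS)

# The two nodes labeled d: path (d) has interval [1, 4], path (e, d) has [8, 14].
print(sorted(sublists[("d",)]), sorted(sublists[("e", "d")]))
```

The sublists of all nodes labeled with the same word are pairwise disjoint, and their union is that word's inverted list.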

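Query Evaluation
Q = w ∧ w′?
Assume that the frequency of w is higher than that of w′.
[Figure: the intervals of L_w′ that are contained in intervals of L_w, with the sublists S_1, S_2, S_3 assigned to them.]
answer: S_1 ⊎ S_2 ⊎ S_3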

BASIC EVALUATION ALGORITHM
conj(L_w, L_w′): find all those intervals in L_w′ that are each contained in some interval in L_w, and store them in a new sequence L.
1. Let L_w = I_1, …, I_k. Let L_w′ = J_1, …, J_k′. L ← ∅.
2. Step through L_w and L_w′ from left to right. Let I_p and J_q be the intervals currently encountered. We do one of the following checks:
   i) If I_p ⊇ J_q, append J_q to the end of L. Move to J_q+1 if q < k′ (then, in the next step, we check I_p against J_q+1). If q = k′, stop.
   ii) If I_p[1] > J_q[2], move to J_q+1 if q < k′. If q = k′, stop.
   iii) If I_p[2] < J_q[1], move to I_p+1 if p < k (then, in the next step, we check J_q against I_p+1). If p = k, stop.

BASIC ALGORITHM
d ∧ b?
L_d: [1, 4][8, 14]
L_b: [5, 5][12, 12]
1st step: p at [1, 4], q at [5, 5]. Since 4 < 5, p moves to [8, 14].
2nd step: p at [8, 14], q at [5, 5]. Since 8 > 5, q moves to [12, 12].
3rd step: p at [8, 14], q at [12, 12]. [12, 12] is contained in [8, 14].
In L_b, only [12, 12] is contained in an interval of L_d, namely [8, 14]. Return the subset associated with [12, 12] as the result. It is {8}.

BASIC ALGORITHM
Q = d ∧ f ∧ a.
L_d = [1, 4][8, 14]
L_f = [1, 2][5, 7][10, 13][16, 18]
L_a = [1, 1][3, 3][5, 6][8, 8][11, 11][16, 16]
L = conj(L_d, L_f) = [1, 2][10, 13].
L = conj(L, L_a) = [1, 1][11, 11].
The sublists assigned to [1, 1] and [11, 11] are {1} and {7}. The result: {1} ⊎ {7} = {1, 7}.
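The basic algorithm and the two examples above can be sketched as follows. This is a minimal sketch, not a reference implementation: `conj` follows the three checks of the basic algorithm, `evaluate` is a hypothetical helper that chains it over a query, and the sublists for the intervals of a and b are derived from the example documents (only [12, 12] → {8} is stated explicitly on the slides).

```python
from functools import reduce

def conj(outer, inner):
    """Return the intervals of `inner` contained in some interval of `outer`.

    Both arguments are ascending lists of pairwise disjoint (start, end) pairs.
    """
    result = []
    p = q = 0
    while p < len(outer) and q < len(inner):
        i_lo, i_hi = outer[p]
        j_lo, j_hi = inner[q]
        if i_lo <= j_lo and j_hi <= i_hi:   # check i: I_p contains J_q
            result.append(inner[q])
            q += 1
        elif i_lo > j_hi:                   # check ii: J_q lies left of I_p
            q += 1
        elif i_hi < j_lo:                   # check iii: I_p lies left of J_q
            p += 1
        else:
            # Assumption: trie intervals are nested or disjoint and `outer`
            # belongs to the more frequent word, so this is never reached.
            raise ValueError("intervals neither nested nor disjoint")
    return result

# Interval sequences of the example.
L = {
    "d": [(1, 4), (8, 14)],
    "f": [(1, 2), (5, 7), (10, 13), (16, 18)],
    "a": [(1, 1), (3, 3), (5, 6), (8, 8), (11, 11), (16, 16)],
    "b": [(5, 5), (12, 12)],
}

# Sublists per interval, derived from the example documents (assumption).
SUBLISTS = {
    "a": {(1, 1): {1}, (3, 3): {2}, (5, 6): {4},
          (8, 8): {3}, (11, 11): {7}, (16, 16): {10}},
    "b": {(5, 5): {4}, (12, 12): {8}},
}

def evaluate(words):
    """Evaluate a conjunctive query; `words` is in descending frequency order."""
    intervals = reduce(conj, (L[w] for w in words[1:]), L[words[0]])
    last = SUBLISTS[words[-1]]
    return sorted(doc for iv in intervals for doc in last[iv])

print(evaluate(["d", "b"]))        # [8]
print(evaluate(["d", "f", "a"]))   # [1, 7]
```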

Algorithm based on binary search
- Let L_1 = I_1, …, I_n and L_2 = J_1, …, J_m be two interval sequences with m = |L_2| ≤ n = |L_1|.
- Let l = ⌊log_2(n/m)⌋. Then 2^l is the largest power of two not exceeding n/m. Let t = n − 2^l + 1.
[Fig. 3: Set intersection based on binary search; position t is marked in L_1.]
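The two quantities l and t can be computed with integer arithmetic only. A small sketch (the function name is hypothetical; the base-2 logarithm follows from the statement that 2^l is the largest power of two not exceeding n/m):

```python
def probe_parameters(n, m):
    """Return (l, t) with l = floor(log2(n / m)) and t = n - 2**l + 1."""
    if not 1 <= m <= n:
        raise ValueError("expected 1 <= m <= n")
    # floor(log2(n / m)) equals floor(log2(n // m)) because 2**l is an integer,
    # so bit_length avoids floating-point rounding.
    l = (n // m).bit_length() - 1
    t = n - 2 ** l + 1
    return l, t

print(probe_parameters(6, 2))  # (1, 5)
print(probe_parameters(3, 1))  # (1, 2)
```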

Algorithm based on binary search
Compare J_m and I_t. If J_m[1] > I_t[2] (J_m appears to the right of I_t), we should look for the intervals (in L_1) covered by J_m somewhere to the right of I_t.
- By using the traditional binary search, we try to find an interval I covered by J_m with l more comparisons.
- Around I, we continue (by a simple linear search) to find the left-most interval x in L_1 that can be covered by J_m; and then, with l more comparisons, we find the right-most interval y covered by J_m in a similar way.
- Obviously, all the intervals between x and y, including x and y, can be covered by J_m.

Algorithm based on binary search
This information allows us to reduce the problem to the situation illustrated in Fig. 3. To complete the whole operation, it is sufficient to apply the above process to L_1′ and L_2′, where L_1′ = I_1, …, I_x-1 and L_2′ = J_1, …, J_m-1.
[Fig. 4: L_1 and L_2 with positions t, x, and y marked, and the reduced sequences L_1′ and L_2′.]

Algorithm based on binary search
If, on the other hand, J_m[2] < I_t[1] (J_m appears to the left of I_t), we should check the intervals to the left of I_t, and the problem immediately reduces to checking L_2′ = L_2 against L_1′ = L_1[1 .. t − 1]. We can complete the operation by applying the above process to L_1′ and L_2′.
[Fig. 5: L_1 and L_2 with position t marked, and the reduced sequences L_1′ and L_2′ = L_2.]

Algorithm based on binary search
However, L_2′ may become larger than L_1′. So in the recursive call the roles of L_2 and L_1 may be reversed, in which case we check the intervals in L_2′ against L_1′ to find an interval I in L_2′ such that I ⊇ the last interval in L_1′.
[Fig. 6: L_1 and L_2 with positions t, x, and y marked, and the reduced sequences L_1′ and L_2′ with x and y marked.]

Algorithm based on binary search
If J_m ⊇ I_t, we check I_t-1, I_t-2, … linearly until we meet the left-most interval x that can be covered by J_m. Then we check I_t+1, I_t+2, … until we meet the right-most interval y that can be covered by J_m. All the intervals encountered, from x to y, are covered by J_m. This reduces the problem to checking L_2′ = L_2[1 .. m − 1] against L_1′ = L_1[1 .. x − 1].
[Fig. 7: L_1 and L_2 with positions t, x, and y marked, and the reduced sequences L_1′ and L_2′.]

Algorithm based on binary search
If J_m ⊆ I_t (we may have this case due to the role interchange), we add J_m to the result, and the problem reduces to checking L_2′ = L_2[1 .. m − 1] against L_1′ = L_1[1 .. t].
[Fig. 8: L_1 and L_2 with position t marked, and the reduced sequences L_1′ and L_2′.]

Algorithm based on binary search
Example 2. Consider L_d = [1, 4][8, 14] and L_a = [1, 1][3, 3][5, 6][8, 8][11, 11][16, 16] (m = 2, n = 6). With our algorithm based on binary search, the following operations are conducted:
Step 1: checking L_d[2] = [8, 14] against L_a. l = ⌊log_2(6/2)⌋ = 1, t = n − 2^l + 1 = 6 − 2 + 1 = 5, L_a[5] = [11, 11]. Since [11, 11] ⊆ [8, 14], we call linearSearch( ) to find x = 4 and y = 5.
Step 2: checking L_d[1] = [1, 4] against L_a[1 .. 3]. l = ⌊log_2(3/1)⌋ = 1, t = 3 − 2^1 + 1 = 2, L_a[2] = [3, 3]. Since [3, 3] ⊆ [1, 4], we call linearSearch( ) to find x = 1 and y = 2.
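The right-to-left reduction used in this example can be illustrated with a simplified variant. Note that this sketch is not the algorithm of the slides: instead of probing I_t and switching roles, it runs two library binary searches (Python's `bisect`) per interval of the shorter sequence, which costs O(m log n) comparisons rather than O(m (log(n/m) + 1)). It only shows how each J yields a covered range x .. y and how the search then continues in L_1[1 .. x − 1]. The function name and the `trace` parameter are hypothetical.

```python
from bisect import bisect_left, bisect_right

def intersect_by_search(long_seq, short_seq, trace=None):
    """Return the intervals of `long_seq` covered by some interval of `short_seq`.

    Both sequences are ascending lists of pairwise disjoint (start, end) pairs.
    If `trace` is a list, the 1-based (x, y) range found for each interval of
    `short_seq` is appended to it.
    """
    starts = [s for s, _ in long_seq]
    ends = [e for _, e in long_seq]
    hi = len(long_seq)          # only long_seq[0:hi] is still of interest
    chunks = []
    for a, b in reversed(short_seq):            # J_m, J_m-1, ..., J_1
        x = bisect_left(starts, a, 0, hi)       # first interval starting at or after a
        y = bisect_right(ends, b, 0, hi)        # one past the last interval ending by b
        if x < y:
            chunks.append(long_seq[x:y])
            if trace is not None:
                trace.append((x + 1, y))        # 1-based x and y as in Example 2
        hi = x                                  # continue with L_1[1 .. x - 1]
    # Chunks were collected right to left; restore ascending order.
    return [iv for chunk in reversed(chunks) for iv in chunk]

L_d = [(1, 4), (8, 14)]
L_a = [(1, 1), (3, 3), (5, 6), (8, 8), (11, 11), (16, 16)]
steps = []
print(intersect_by_search(L_a, L_d, steps))  # [(1, 1), (3, 3), (8, 8), (11, 11)]
print(steps)                                 # [(4, 5), (1, 2)]
```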

IMPROVEMENTS
Search control by using LCAs (least common ancestors)
L_a = [1, 1][3, 3][5, 6][8, 8][11, 11][16, 16]
[Fig. 9: Illustration for T_w with w = a, showing the intervals of L_a together with the intervals [1, 4], [8, 14], [8, 19], and [1, 20].]

IMPROVEMENTS
L_a = [1, 1][3, 3][5, 6][8, 8][11, 11][16, 16]
[Fig. 10: The intervals I_a^1, …, I_a^6 of L_a arranged together with the intervals [1, 4], [8, 14], [8, 19], and [1, 20].]

Experiments
In the experiments, we have tested seven methods:
- Inverted files with melding [6] (IFm for short)
- Inverted files with adaptive [16] (IFa for short)
- Hashing-based (RanGroupScan in [19]; Hb for short)
- Skip-list-based [33] (SkipL for short)
- Interval-based (linear search, discussed in the paper; Ib for short)
- setIntersect (binary search, discussed in the paper; sI for short)
- setIntersect with LCAs (discussed in the paper; sIL for short)

Experiments
For the experiments with real data, we use part of the Wikipedia data, which contains more than 10 million text documents. We numbered the documents as they were stored, assigning each a sequential number indicating its order in the indexing process. The characteristics of this collection are shown in Table 1.

Table 1: Characteristics of Wikipedia Data

Pages                               10,500,000
Size (gigabytes)                    16.25
Word occurrences (without markup)   1,567,324,812
Distinct words (after stemming)     3,603,556

Experiments Two-word queries: Two inverted lists of the same length: 5 million elements on average for 20 queries

Experiments Queries with more than two words: inverted lists of the same length: 5 million elements on average for 20 queries

Summary
An efficient algorithm for intersection of inverted lists:
- Trie over sorted word sequences
- Tree encoding
- Interval sequences associated with words
- Binary search of interval sequences
Computational complexity: time O(m (log(n/m) + 1)) (m ≤ n), where n and m are respectively the lengths of the two interval sequences taking part in the intersection.

Thank you.
