
1
Derrick Coetzee, Microsoft Research
CC0 waiver: To the extent possible under law, I waive all copyright and related or neighboring rights to all content in this presentation.

2
• Consider searching for a subsequence in a collection of genome sequences: …gcaagctttatagtgacaacaataaggtatcactcggtt…
• N-gram inverted indexes are the traditional solution, but they have 10-100 times more terms than ordinary word-based inverted indexes
• TinyLex indexes achieve similar query performance with 7-17 times fewer terms
• TinyLex provides good worst-case query performance
TinyLex: Static N-Gram Index Pruning - Derrick Coetzee

3
1. Each wife had seven sacks,
2. Each sack had seven cats,
3. Each cat had seven kits.
4. Kits, cats, sacks, and wives.
Word-based inverted index:
each: {1, 2, 3}
had: {1, 2, 3}
seven: {1, 2, 3}
wife: {1, 4}
sack: {1, 2, 4}
cat: {2, 3, 4}
kit: {3, 4}

4
1. Each wife had seven sacks,
2. Each sack had seven cats,
3. Each cat had seven kits.
4. Kits, cats, sacks, and wives.
Query: sack and cat
sack: {1, 2, 4}
cat: {2, 3, 4}
{1, 2, 4} ∩ {2, 3, 4} = {2, 4}
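
The word-level lookup on the two slides above can be sketched in a few lines of Python. This is only an illustration, not the author's code: the plural table `SINGULAR` and the stop list `STOP_WORDS` are assumptions, hand-picked so that the output matches the posting lists shown on the slide.

```python
import re

# The four lines quoted on the slide, numbered from 1.
DOCS = {
    1: "Each wife had seven sacks,",
    2: "Each sack had seven cats,",
    3: "Each cat had seven kits.",
    4: "Kits, cats, sacks, and wives.",
}

# Assumed normalization (not from the slides): a tiny plural table and a
# stop list, chosen only to reproduce the slide's index terms.
SINGULAR = {"wives": "wife", "sacks": "sack", "cats": "cat", "kits": "kit"}
STOP_WORDS = {"and"}


def build_word_index(docs):
    """Map each normalized word to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            word = SINGULAR.get(word, word)
            if word not in STOP_WORDS:
                index.setdefault(word, set()).add(doc_id)
    return index


def query_and(index, *words):
    """Return the documents containing every query word (set intersection)."""
    postings = [index.get(w, set()) for w in words]
    return set.intersection(*postings) if postings else set()


word_index = build_word_index(DOCS)
print(sorted(query_and(word_index, "sack", "cat")))  # [2, 4]
```

A conjunctive query is just the intersection of the posting lists, which is why short posting lists matter for query speed.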

5
• Partial word or punctuation queries
  ◦ Searching a dictionary for all words ending in “ment”
  ◦ Searching for a markup tag in HTML files
  ◦ Searching for "%s" in C source files
  ◦ Searching for x^2/2 in LaTeX source files
• Searching East Asian language text
  ◦ No spaces, so word extraction is complex
• Phrase searching

6
Genome sequences:
1. gcaagctttatagtgacaac...
2. aataaggtatcactcggtta...
3. caattacccccacttcccct...
4. cattataaagaaatgatcaa...
Example query: documents containing the subsequence “cact”

7
Simplified example: two-letter alphabet
1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb
3-gram index:
aaa: {2}
aab: {2, 3, 4}
aba: {1, 2, 3}
abb: {1, 2, 4}
baa: {2, 3, 4}
bab: {1, 2, 3}
bba: {1, 4}
bbb: {1, 4}
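
As a minimal sketch, the fixed-length index on this slide can be built by sliding a window of length n over each document. The function name `build_ngram_index` is an illustrative choice, not something taken from the talk.

```python
# The four example documents from the slide, numbered from 1.
DOCS = {1: "babbbbabab", 2: "aababaaabb", 3: "babababaab", 4: "bbbbaabbbb"}


def build_ngram_index(docs, n):
    """Map every length-n substring to the set of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for i in range(len(text) - n + 1):
            index.setdefault(text[i:i + n], set()).add(doc_id)
    return index


index3 = build_ngram_index(DOCS, 3)
for term in sorted(index3):
    print(term, sorted(index3[term]))
```

With n = 3 this yields the eight terms and posting lists listed on the slide.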

8
1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb
Query: aaba
aaba → aab and aba

9
1. babbbbabab
2. aababaaabb
3. babababaab (false positive)
4. bbbbaabbbb
Query: aaba → aab and aba
aab: {2, 3, 4}
aba: {1, 2, 3}
{2, 3, 4} ∩ {1, 2, 3} = {2, 3}
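
The false positive on this slide can be reproduced with a short sketch: split the query into its 3-grams, intersect their posting lists to get candidates, then scan each candidate document to confirm the match. The helper names (`candidate_docs`, `search`) are hypothetical, and the verification step is a plain substring scan rather than anything specific to the talk.

```python
DOCS = {1: "babbbbabab", 2: "aababaaabb", 3: "babababaab", 4: "bbbbaabbbb"}


def build_ngram_index(docs, n):
    """Map every length-n substring to the set of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for i in range(len(text) - n + 1):
            index.setdefault(text[i:i + n], set()).add(doc_id)
    return index


def candidate_docs(index, n, query):
    """Intersect the posting lists of every n-gram contained in the query."""
    if len(query) < n:
        raise ValueError("query is shorter than the indexed n-gram length")
    grams = [query[i:i + n] for i in range(len(query) - n + 1)]
    postings = [index.get(g, set()) for g in grams]
    return set.intersection(*postings)


def search(docs, index, n, query):
    """Return (true hits, false positives) after scanning each candidate."""
    candidates = candidate_docs(index, n, query)
    hits = {d for d in candidates if query in docs[d]}
    return hits, candidates - hits


index3 = build_ngram_index(DOCS, 3)
hits, false_positives = search(DOCS, index3, 3, "aaba")
print(sorted(hits), sorted(false_positives))  # [2] [3]
```

Document 3 contains both aab and aba but never aaba, so it survives the intersection and is only rejected by the scan.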

10
N-gram length = 1
1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb
a: {1, 2, 3, 4}
b: {1, 2, 3, 4}
• Small number of terms
• Slow queries
• Long posting lists
• Too many false positives

11
N-gram length = 6
1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb
aababa: {2}
aabbbb: {4}
abaaab: {2}
ababaa: {2, 3}
ababab: {3}
abbbba: {1}
baaabb: {2}
baabbb: {4}
babaaa: {2}
babaab: {3}
bababa: {3}
babbbb: {1}
bbaabb: {4}
bbabab: {1}
bbbaab: {4}
bbbaba: {1}
bbbbaa: {4}
bbbbab: {1}
• Fast queries
• Too many terms
• Queries must be ≥ 6 characters
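
To make the trade-off between the length-1 and length-6 slides concrete, the following sketch counts the index terms and posting entries for a few fixed n-gram lengths on the four example documents. The helper name is illustrative only.

```python
DOCS = {1: "babbbbabab", 2: "aababaaabb", 3: "babababaab", 4: "bbbbaabbbb"}


def build_ngram_index(docs, n):
    """Map every length-n substring to the set of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for i in range(len(text) - n + 1):
            index.setdefault(text[i:i + n], set()).add(doc_id)
    return index


term_counts = {}
for n in (1, 3, 6):
    index = build_ngram_index(DOCS, n)
    total_postings = sum(len(p) for p in index.values())
    term_counts[n] = len(index)
    print(f"n={n}: {len(index)} terms, {total_postings} posting entries")
```

On this collection the term count grows from 2 (n = 1) to 8 (n = 3) to 18 (n = 6), while posting lists shrink. A query shorter than n contains no n-gram at all, which is why the length-6 index cannot answer queries under 6 characters.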

12
• Review of inverted n-gram indexes
• Example TinyLex index
• TinyLex index construction
• Results
• Disadvantages
• Questions

13
• Goal: fewer terms without sacrificing query performance
• Consider the n-grams “juggl” and “uggle”
  ◦ Almost exactly the same posting list in a typical English-language collection
  ◦ Just put the n-gram “uggl” in the index, and leave out “juggl” and “uggle”
juggl: {2, 7, 33}
uggle: {2, 7, 33}
uggl: {2, 7, 33}

14
• Insight: The more false positives a term produces when it is queried for, the more information it adds when it is added to the index.
• Choose a false positive threshold t and choose the smallest possible set of index terms that satisfies it.
• Allow variable-length n-grams.

15
1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb
TinyLex index:
aa: {2, 3, 4}
bb: {1, 2, 4}
aaa: {2}
aba: {1, 2, 3}
bab: {1, 2, 3}
bba: {1, 4}
bbb: {1, 4}
aaba: {2}
baab: {3, 4}
babb: {1}
In this example t = 1. A query for a term that occurs in the collection is allowed fewer than t false positives, that is, none.
Only 10 terms!

16
1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb
Query: abaab → aba and baab
aba: {1, 2, 3}
baab: {3, 4}
{1, 2, 3} ∩ {3, 4} = {3}
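
A rough sketch of a lookup against the variable-length index from the example follows. One assumption to flag: the slide intersects only aba and baab, whereas this sketch intersects every indexed term that occurs inside the query. On this example the result is the same, because a shorter term contained in a longer indexed term always has a posting list that is a superset of the longer term's. The function name `tinylex_candidates` is hypothetical.

```python
# The 10-term example index from the slides (t = 1).
TINYLEX = {
    "aa": {2, 3, 4}, "bb": {1, 2, 4},
    "aaa": {2}, "aba": {1, 2, 3}, "bab": {1, 2, 3},
    "bba": {1, 4}, "bbb": {1, 4},
    "aaba": {2}, "baab": {3, 4}, "babb": {1},
}
ALL_DOCS = {1, 2, 3, 4}


def tinylex_candidates(index, all_docs, query):
    """Intersect the postings of every indexed term occurring in the query.

    If no indexed term occurs in the query, every document is a candidate.
    """
    result = set(all_docs)
    for term, postings in index.items():
        if term in query:
            result &= postings
    return result


print(sorted(tinylex_candidates(TINYLEX, ALL_DOCS, "abaab")))  # [3]
```

Because terms have variable length, a query of any length can be answered, including one such as ab that contains no indexed term and therefore returns every document as a candidate.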

17
• The construction guarantees that if the query term occurs in the collection, it will have at most t - 1 false positives (zero in this case).
• If we observe t false positives, we can halt immediately.

18
Query: bbbbb → bbb and bbb and bbb
bbb: {1, 4}
{1, 4} ∩ {1, 4} ∩ {1, 4} = {1, 4}
1. babbbbabab is a false positive. With t = 1 this can’t happen unless the query result is empty, so halt.
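
The halting rule can be sketched as a verification loop that stops as soon as t false positives have been seen. This is an illustration under one stated assumption: the index was constructed so that its false-positive guarantee covers the query being verified. Otherwise halting early could discard real matches. The function name and the returned `docs_read` counter are hypothetical.

```python
DOCS = {1: "babbbbabab", 2: "aababaaabb", 3: "babababaab", 4: "bbbbaabbbb"}


def verify_with_halt(docs, candidates, query, t):
    """Scan candidates in id order and stop after t false positives.

    Returns (hits, docs_read). Assumes the index guarantee covers the
    query, so seeing t false positives implies the query occurs nowhere.
    """
    hits = set()
    false_positives = 0
    docs_read = 0
    for doc_id in sorted(candidates):
        docs_read += 1
        if query in docs[doc_id]:
            hits.add(doc_id)
        else:
            false_positives += 1
            if false_positives >= t:
                return set(), docs_read  # halt: the result must be empty
    return hits, docs_read


# Candidates {1, 4} come from intersecting the bbb posting lists.
print(verify_with_halt(DOCS, {1, 4}, "bbbbb", t=1))  # (set(), 1)
```

Only one document is read before the search gives up, which is where the worst-case bound on query cost comes from.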

19
• Achieves query performance similar to that of classical n-gram indexes, which have a much larger number of terms
• Worst-case bound on the number of false positives
• Query can be any length

20
• Review of inverted n-gram indexes
• Example TinyLex index
• TinyLex index construction
• Results
• Disadvantages
• Questions

21
• The problem:
  ◦ Input: a set of documents, a threshold t
  ◦ Output: a list of terms such that any query for a term occurring in the collection will have at most t - 1 false positives

22
• Basic construction:
• For each n-gram length from 1 to max:
  ◦ Make a list of all n-grams in the collection and what documents they occur in.
  ◦ Perform a query on each term using the partially constructed index.
  ◦ If a term has too many false positives, add it to the index.
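
The basic construction above can be sketched as follows. This is a reading of the steps on the slide, not the author's implementation: in particular, the way a query is run against the partial index (intersecting every indexed term contained in the queried n-gram, and returning all documents when none applies) is an assumption, and the names `build_tinylex` and `max_len` are illustrative.

```python
DOCS = {1: "babbbbabab", 2: "aababaaabb", 3: "babababaab", 4: "bbbbaabbbb"}


def ngram_postings(docs, n):
    """Actual posting lists: every length-n substring and where it occurs."""
    postings = {}
    for doc_id, text in docs.items():
        for i in range(len(text) - n + 1):
            postings.setdefault(text[i:i + n], set()).add(doc_id)
    return postings


def query_partial_index(index, all_docs, term):
    """Assumed query rule: intersect all indexed terms contained in `term`."""
    result = set(all_docs)
    for indexed, postings in index.items():
        if indexed in term:
            result &= postings
    return result


def build_tinylex(docs, t, max_len):
    """Add an n-gram whenever querying it would give >= t false positives."""
    index = {}
    all_docs = set(docs)
    for n in range(1, max_len + 1):
        additions = {}
        for term, actual in ngram_postings(docs, n).items():
            result = query_partial_index(index, all_docs, term)
            if len(result) - len(actual) >= t:
                additions[term] = actual
        index.update(additions)
    return index


tinylex = build_tinylex(DOCS, t=1, max_len=4)
print(len(tinylex), sorted(tinylex, key=lambda s: (len(s), s)))
```

With t = 1 and max_len = 4 this produces the 10 terms of the example index. Terms of equal length cannot contain one another, so adding a whole length's terms at once gives the same result as adding them one at a time. Raising max_len can add further terms: on these documents a length-5 pass adds babab.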

23
1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb
(index empty)
1-grams   Query result   Actual
a         {1,2,3,4}      {1,2,3,4}
b         {1,2,3,4}      {1,2,3,4}
t = 1. If the difference between the query result size and the actual posting list size is at least 1, add the term to the index.

24
1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb
(index empty)
2-grams   Query result   Actual
aa        {1,2,3,4}      {2,3,4}
ab        {1,2,3,4}      {1,2,3,4}
ba        {1,2,3,4}      {1,2,3,4}
bb        {1,2,3,4}      {1,2,4}

25
1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb
2-grams   Query result   Actual
aa        {1,2,3,4}      {2,3,4}
ab        {1,2,3,4}      {1,2,3,4}
ba        {1,2,3,4}      {1,2,3,4}
bb        {1,2,3,4}      {1,2,4}
Index:
aa: {2,3,4}
bb: {1,2,4}

26
1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb
Index:
aa: {2,3,4}
bb: {1,2,4}
3-grams   Query result   Actual
aaa       {2,3,4}        {2}
aab       {2,3,4}        {2,3,4}
aba       {1,2,3,4}      {1,2,3}
abb       {1,2,4}        {1,2,4}
baa       {2,3,4}        {2,3,4}
bab       {1,2,3,4}      {1,2,3}
bba       {1,2,4}        {1,4}
bbb       {1,2,4}        {1,4}

27
1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb
3-grams   Query result   Actual
aaa       {2,3,4}        {2}
aab       {2,3,4}        {2,3,4}
aba       {1,2,3,4}      {1,2,3}
abb       {1,2,4}        {1,2,4}
baa       {2,3,4}        {2,3,4}
bab       {1,2,3,4}      {1,2,3}
bba       {1,2,4}        {1,4}
bbb       {1,2,4}        {1,4}
Index:
aa: {2,3,4}
bb: {1,2,4}
aaa: {2}
aba: {1,2,3}
bab: {1,2,3}
bba: {1,4}
bbb: {1,4}

28
1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb
Index:
aa: {2,3,4}
bb: {1,2,4}
aaa: {2}
aba: {1,2,3}
bab: {1,2,3}
bba: {1,4}
bbb: {1,4}
4-grams   Query result   Actual
aaab      {2}            {2}
aaba      {2,3}          {2}
aabb      {2,4}          {2,4}
abaa      {2,3}          {2,3}
abab      {1,2,3}        {1,2,3}
abbb      {1,4}          {1,4}
baaa      {2}            {2}
baab      {2,3,4}        {3,4}
baba      {1,2,3}        {1,2,3}
babb      {1,2}          {1}
bbaa      {4}            {4}
bbab      {1}            {1}
bbba      {1,4}          {1,4}
bbbb      {1,4}          {1,4}

29
1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb
4-grams   Query result   Actual
aaab      {2}            {2}
aaba      {2,3}          {2}
aabb      {2,4}          {2,4}
abaa      {2,3}          {2,3}
abab      {1,2,3}        {1,2,3}
abbb      {1,4}          {1,4}
baaa      {2}            {2}
baab      {2,3,4}        {3,4}
baba      {1,2,3}        {1,2,3}
babb      {1,2}          {1}
bbaa      {4}            {4}
bbab      {1}            {1}
bbba      {1,4}          {1,4}
bbbb      {1,4}          {1,4}
Index:
aa: {2,3,4}
bb: {1,2,4}
aaa: {2}
aba: {1,2,3}
bab: {1,2,3}
bba: {1,4}
bbb: {1,4}
aaba: {2}
baab: {3,4}
babb: {1}

30
• Review of inverted n-gram indexes
• Example TinyLex index
• TinyLex index construction
• Results
• Disadvantages
• Questions

31
• Test set: 100MB TREC WSJ collection
• 37000 documents, English text
• Same query performance with 7-17 times fewer terms

32
• Overall compressed index size is 2-20% smaller
• TinyLex index has more information per term

33
• Dramatic 50x improvement in worst-case query performance for long queries

34
• Applications to phrase searching using variable-length word n-grams
• Making the construction more efficient
• Performance on genome sequences
• Empirical evaluation of scaling

35
• Suffix arrays (Manber and Myers 1991)
  ◦ Faster queries, but indexes are 3-10 times larger
• agrep and GLIMPSE (Wu and Manber 1994)
  ◦ More general queries, but rely on a word concept
• n-Gram/2L (Kim et al. 2005)
  ◦ Orthogonal; examines fewer document offsets
• “Growing an n-gram language model” (Siivola and Pellom 2005)
  ◦ Similar idea applied to language modeling

36
• Faster construction time
  ◦ Currently about 10 times slower to construct than a classical n-gram index.
• Queries for nonoccurring terms are more expensive than with classical n-gram indexes (t documents must be read).
• Generalize to dynamic collections

37
• N-gram indexes enable practical queries for subsequences
• TinyLex indexes achieve query performance similar to classical n-gram indexes with 7-17 times fewer terms
• TinyLex yields good worst-case query performance by placing an upper bound on the number of false positives


