Published by Peyton Loar. Modified over 2 years ago.

1
Engineering a Set Intersection Algorithm for Information Retrieval
Alex Lopez-Ortiz, UNB / InterNAP
Joint work with Ian Munro and Erik Demaine

2
Overview
Web Search Engine Basics
Algorithms for set operations
Theoretical Analysis
Experimental Analysis
Engineering an Improved Algorithm
Conclusions

3
Web Search Engine Basics
Crawl: sequential gathering process
Document ID (DocID) for each web page
[Figure: crawled pages with DocIDs 1 2 3 4, labeled "Cool sites:", SIGIR, SIGACT, SIGCOMM and http://acm.org/home.html]

4
Indexing: List of entries of type
E.g. SIGCOMM
[Figure: pages labeled 1 3 4 2, with "Cool sites:", SIGIR, SIGACT and SIGCOMM]

5
Postings set: Set of docIDs containing a word or pattern.
SIGACT {1,3}
SIGCOMM {1,4}
[Figure: pages labeled 1 3 4 2, with "Cool sites:", SIGIR, SIGACT and SIGCOMM]
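As a rough illustration of the postings sets on this slide, the following Python sketch builds a word-to-docID mapping from a tiny crawl. The page contents and the `build_postings` helper are hypothetical, chosen only so that the output reproduces the SIGACT and SIGCOMM sets shown above; a real engine would keep this inside a string matching data structure.

```python
from collections import defaultdict

def build_postings(pages):
    """Map each word to the sorted list of docIDs whose text contains it."""
    postings = defaultdict(set)
    for doc_id, text in pages.items():
        for word in text.split():
            postings[word].add(doc_id)
    # Store each postings set as a sorted list, ready for intersection.
    return {word: sorted(ids) for word, ids in postings.items()}

# Hypothetical four-page crawl; the page contents are made up for illustration.
pages = {
    1: "SIGACT SIGCOMM",
    2: "SIGIR",
    3: "SIGACT SIGIR",
    4: "SIGCOMM",
}
postings = build_postings(pages)
print(postings["SIGACT"])   # [1, 3]
print(postings["SIGCOMM"])  # [1, 4]
```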

6
Search Engine Basics (cont.)
Postings set stored implicitly/explicitly in a string matching data structure
PAT tree/array
Inverted word index
Suffix trees
KMP (grep)
...

7
String Matching Problem
Different performance characteristics for each solution
Time/Space tradeoff (empirical)
Linear time/linear space lower bound [Demaine/L-O, SODA 2001]

8
Search Engine Basics (cont.)
A user query is of the form:
keyword_1 op keyword_2 op … op keyword_n
where each op is one of {and, or}
E.g. computer and science or internet

9
Evaluating a Boolean Query
The interpretation of a boolean query is the mapping:
keyword → postings set
and → ∩ (set intersection)
or → ∪ (set union)
E.g. {computer} ∩ {science} ∪ {internet}
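The mapping on this slide can be sketched in a few lines of Python. The `evaluate` function and the sample postings are hypothetical, and the slides do not say how `and` and `or` associate, so this sketch simply assumes left-to-right evaluation.

```python
def evaluate(query, postings):
    """Evaluate 'w1 and w2 or w3 ...' left to right (an assumed precedence)."""
    tokens = query.split()
    result = set(postings.get(tokens[0], ()))
    # Tokens alternate: keyword, operator, keyword, operator, ...
    for op, word in zip(tokens[1::2], tokens[2::2]):
        other = set(postings.get(word, ()))
        if op == "and":
            result &= other      # and -> set intersection
        elif op == "or":
            result |= other      # or -> set union
        else:
            raise ValueError(f"unknown operator: {op}")
    return sorted(result)

# Made-up postings sets for the three keywords in the slide's example query.
sample = {"computer": [1, 2, 5], "science": [2, 3, 5], "internet": [4]}
answer = evaluate("computer and science or internet", sample)
print(answer)  # [2, 4, 5]
```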

10
Set Operations for Web Search Engines
Average postings set size > 10 million
Postings sets are sorted

11
Intersection Time Complexity
Worst case linear in size of postings sets: Θ(n)
E.g. {1,3,5,7} ∩ {1,3,5,7}
On size of output?
E.g. {1,3,5,7} ∩ {2,4,6,8}

12
Adaptive Algorithms
Assume the intersection is empty. What is the minimum number of comparisons needed to ascertain this fact?
Examples
{1,2,3,4} {5,6,7,8}

13
Much ado About Nothing
A sequence of comparisons is a proof of non-intersection if every possible instance of sets satisfying said sequence has empty intersection.
E.g. A = {1,3,5,7}, B = {2,4,6,8}
a_1 < b_1 < a_2 < b_2 < a_3 < b_3 < a_4 < b_4
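To make the idea of a proof concrete, the sketch below builds one valid proof of non-intersection for two sorted, duplicate-free lists: it merges them and emits a comparison at every point where the merged order switches from one set to the other. The function name `interleaving_proof` is hypothetical, and the sketch only claims that the result is a valid proof for two sets, not that it matches the construction used in the talk.

```python
def interleaving_proof(a, b):
    """Return comparisons certifying that sorted lists a and b are disjoint.

    One comparison is emitted per boundary between runs of the merged order.
    Raises ValueError if the two lists share an element.
    """
    merged = sorted(
        [(x, "a", i) for i, x in enumerate(a, 1)]
        + [(y, "b", j) for j, y in enumerate(b, 1)]
    )
    proof = []
    for (v1, s1, i1), (v2, s2, i2) in zip(merged, merged[1:]):
        if s1 != s2:  # the merged order switches sets here
            if v1 == v2:
                raise ValueError("sets intersect")
            proof.append(f"{s1}{i1} < {s2}{i2}")
    return proof

easy = interleaving_proof([1, 2, 3, 4], [5, 6, 7, 8])
hard = interleaving_proof([1, 3, 5, 7], [2, 4, 6, 8])
print(easy)       # ['a4 < b1']
print(len(hard))  # 7
```

The two calls contrast the examples from the slides: the first pair of sets is separated by a single comparison, while the fully interleaved pair needs the whole chain of seven.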

14
Adaptive Algorithms
In [SODA 2000] we proposed an adaptive algorithm that intersects k sets in
k · |shortest proof of non-intersection| steps.
Ideal for crawled, “bursty” data sets

15
How does it work?
[Figure: DocID universe set, drawn as 1, _, 3, ..., i, ..., n]

16
Measuring Performance
100MB Web Crawl
5000 queries from Google

17
Baseline Standard Algorithm
Sort sets by size
Candidate answer set is smallest set
For each set S in increasing order by size
  For each element e in candidate set
    Binary search for e in S
    If e is not found, remove it from candidate set
    Remove elements before e in S
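The baseline steps can be sketched in Python as follows. This is a minimal reading of the slide rather than the code used in the experiments: the name `baseline_intersect` is hypothetical, the input is assumed to be a non-empty list of sorted, duplicate-free lists, and "remove elements before e in S" is modelled by moving the lower bound of the binary search forward instead of deleting anything.

```python
from bisect import bisect_left

def baseline_intersect(sets):
    """Intersect sorted lists: smallest set is the candidate answer set."""
    ordered = sorted(sets, key=len)          # sort sets by size
    candidates = list(ordered[0])            # candidate answer set
    for s in ordered[1:]:                    # remaining sets, smallest first
        lo = 0                               # elements before lo are discarded
        kept = []
        for e in candidates:
            lo = bisect_left(s, e, lo)       # binary search for e in s[lo:]
            if lo < len(s) and s[lo] == e:
                kept.append(e)               # e survives this set
        candidates = kept                    # drop candidates not found in s
    return candidates

result = baseline_intersect([[1, 3, 5, 7, 9], [3, 4, 5, 9, 10, 11], [0, 3, 9]])
print(result)  # [3, 9]
```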

18
Upper Bound: Adaptive/Traditional Two-Smallest Algorithm

19
Lower Bound: Adaptive/Shortest Proof

20
Middle Bound: Adaptive/Encoding of Shortest Proof

21
Side by Side
[Figure: two panels, Lower Bound and Middle Bound]

22
Possible Improvements
Adaptive performs best in two-three sets
Traditional algorithm often terminates after first pair of sets
Galloping seems better than binary search
Adaptive keeps a dynamic definition of “smallest set”
Candidate elements aggressively tested
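For readers unfamiliar with galloping, here is a minimal sketch of a doubling (factor 2) search over a sorted list. The `gallop` function is a hypothetical helper; it probes at distances 1, 2, 4, ... from the current position and then finishes with a binary search inside the bracket it found, so the cost grows with the distance to the target rather than with the size of the whole set.

```python
from bisect import bisect_left

def gallop(s, target, start=0):
    """Return the first index i >= start with s[i] >= target (len(s) if none)."""
    n = len(s)
    if start >= n or s[start] >= target:
        return min(start, n)
    # Invariant: s[lo] < target. Double the step until we overshoot.
    lo, step = start, 1
    while lo + step < n and s[lo + step] < target:
        lo += step
        step *= 2
    hi = min(lo + step, n)
    # The answer lies in (lo, hi]; finish with a binary search in that bracket.
    return bisect_left(s, target, lo + 1, hi)

s = [1, 2, 4, 5, 7, 8, 9, 12]
print(gallop(s, 7))      # 4
print(gallop(s, 10, 5))  # 7
print(gallop(s, 13))     # 8
```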

23
Example
{6, 7, 10, 11, 14}
{4, 8, 10, 11, 15}
{1, 2, 4, 5, 7, 8, 9}

24
Experimental Results
Test orthogonally each possible improvement
Cyclic or Two Smallest
Symmetric
Update Smallest
Advance on Common Element
Gallop Factor/Binary Search

25
Binary Search vs. Gallop

26
Advance on Common Element

27
Small Adaptive
Combines best of Adaptive and Two-Smallest
Two-smallest
Symmetric
Advance on common element
Update on smallest
Gallop with factor 2
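The slides list the ingredients of Small Adaptive but give no pseudocode, so the following Python sketch is only one possible reading of that list. The names `small_adaptive_intersect` and `gallop` are hypothetical, the inputs are assumed to be sorted, duplicate-free lists, and details such as tie-breaking between equally sized sets are assumptions.

```python
from bisect import bisect_left

def gallop(s, target, start):
    """First index i >= start with s[i] >= target, found by doubling steps."""
    n = len(s)
    if start >= n or s[start] >= target:
        return min(start, n)
    lo, step = start, 1
    while lo + step < n and s[lo + step] < target:
        lo += step
        step *= 2
    return bisect_left(s, target, lo + 1, min(lo + step, n))

def small_adaptive_intersect(sets):
    """Intersect sorted lists, re-ranking them by remaining size each round."""
    pos = [0] * len(sets)                     # next unexamined index per set
    result = []
    while sets and all(pos[i] < len(s) for i, s in enumerate(sets)):
        # Update on smallest: order sets by number of remaining elements.
        order = sorted(range(len(sets)), key=lambda i: len(sets[i]) - pos[i])
        first = order[0]
        candidate = sets[first][pos[first]]   # candidate from the smallest set
        pos[first] += 1
        in_all = True
        for i in order[1:]:                   # second smallest set comes first
            pos[i] = gallop(sets[i], candidate, pos[i])
            if pos[i] == len(sets[i]):
                return result                 # set exhausted: nothing more can match
            if sets[i][pos[i]] != candidate:
                in_all = False                # candidate eliminated
                break
            pos[i] += 1                       # advance on common element
        if in_all:
            result.append(candidate)
    return result

out = small_adaptive_intersect([[1, 3, 5, 7, 9], [3, 4, 5, 9, 10, 11], [0, 3, 9]])
print(out)  # [3, 9]
```

On the three example sets from the slides, {6, 7, 10, 11, 14}, {4, 8, 10, 11, 15} and {1, 2, 4, 5, 7, 8, 9}, this sketch returns an empty list as soon as the third set is exhausted while searching for 10.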

28
Small Adaptive

29
Small Adaptive is faster than Two-Smallest
Aggregate speed-up of 2.9x in comparisons
Faster than Adaptive

30
Conclusions
Faster intersection algorithm for Web Search Engines
Adaptive measure for set operations
Information theoretic “middle bound”
Standard speed-up techniques for other settings
THE END

31
Query Log
[Figure: number of queries for each total size, plotted against the total # of elements in a query]

32
Example
{6, 7, 10, 11, 14}
{4, 8, 10, 11, 15}
{1, 2, 4, 5, 7, 8, 9, 12}
