Engineering a Set Intersection Algorithm for Information Retrieval Alex Lopez-Ortiz UNB / InterNAP Joint work with Ian Munro and Erik Demaine.


1 Engineering a Set Intersection Algorithm for Information Retrieval Alex Lopez-Ortiz UNB / InterNAP Joint work with Ian Munro and Erik Demaine

2 Overview Web Search Engine Basics Algorithms for set operations Theoretical Analysis Experimental Analysis Engineering an Improved Algorithm Conclusions

3 Web Search Engine Basics Crawl: sequential gathering process Document ID (DocID) for each web page (Figure: a "Cool sites" page listing SIGIR, SIGACT and SIGCOMM, alongside the SIGIR, SIGCOMM and SIGACT pages; DocIDs 1 2 3 4)

4 Indexing: List of entries of type E.g. SIGCOMM 1 3 4 2 Cool sites: SIGIR SIGACT SIGCOMM SIGIRSIGACT

5 Postings set: Set of docIDs containing a word or pattern. E.g. SIGACT {1,3}, SIGCOMM {1,4}
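A minimal sketch of how postings sets like the ones on this slide could be built. The four page texts are assumed for illustration (the transcript only preserves the resulting sets for SIGACT and SIGCOMM), and `build_postings` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical four-page crawl; the page texts are assumptions chosen so that
# the postings sets match the slide (SIGACT {1,3}, SIGCOMM {1,4}).
pages = {
    1: "Cool sites: SIGIR SIGACT SIGCOMM",
    2: "SIGIR",
    3: "SIGACT",
    4: "SIGCOMM",
}

def build_postings(pages):
    """Map each word to the sorted list of DocIDs of the pages containing it."""
    postings = defaultdict(set)
    for doc_id, text in pages.items():
        for word in text.replace(":", " ").split():
            postings[word].add(doc_id)
    return {word: sorted(ids) for word, ids in postings.items()}

postings = build_postings(pages)
print(postings["SIGACT"])   # [1, 3]
print(postings["SIGCOMM"])  # [1, 4]
```

Keeping each postings set sorted by DocID is what makes the search-based intersection strategies later in the talk possible.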

6 Search Engine Basics (cont.) Postings set stored implicitly/explicitly in a string matching data structure: PAT tree/array, inverted word index, suffix trees, KMP (grep), ...

7 String Matching Problem Different performance characteristics for each solution Time/Space tradeoff (empirical) Linear time/linear space lower bound [Demaine/L-O, SODA 2001]

8 Search Engine Basics (cont.) A user query is of the form: keyword_1 op keyword_2 op … op keyword_n, where op is one of {and, or}. E.g. computer and science or internet

9 Evaluating a Boolean Query The interpretation of a boolean query is the mapping: keyword → postings set, and → ∩ (set intersection), or → ∪ (set union). E.g. {computer} ∩ {science} ∪ {internet}
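The mapping from keywords and operators to set operations can be sketched as follows. The DocIDs are made up, `evaluate` is a hypothetical helper, and strict left-to-right evaluation is an assumption, since the slide does not state an operator precedence.

```python
# Hypothetical postings sets; the DocIDs are invented for illustration.
postings = {
    "computer": {1, 2, 5, 8},
    "science": {2, 3, 5},
    "internet": {4, 5, 9},
}

def evaluate(query, postings):
    """Evaluate 'kw1 op kw2 op ...' left to right, with op in {and, or}."""
    tokens = query.split()
    result = set(postings.get(tokens[0], set()))
    for op, keyword in zip(tokens[1::2], tokens[2::2]):
        other = postings.get(keyword, set())
        if op == "and":
            result &= other      # and -> set intersection
        elif op == "or":
            result |= other      # or -> set union
        else:
            raise ValueError(f"unknown operator: {op}")
    return sorted(result)

print(evaluate("computer and science or internet", postings))  # [2, 4, 5, 9]
```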

10 Set Operations for Web Search Engines Average postings set size > 10 million Postings sets are sorted

11 Intersection Time Complexity Worst case linear on size of postings sets: Θ(n) {1,3,5,7} ∩ {1,3,5,7} On size of output? {1,3,5,7} ∩ {2,4,6,8}

12 Adaptive Algorithms Assume the intersection is empty. What is the min number of comparisons needed to ascertain this fact? Examples {1,2,3,4} ∩ {5,6,7,8}

13 Much ado About Nothing A sequence of comparisons is a proof of non-intersection if every possible instance of sets satisfying said sequence has empty intersection. E.g. A={1,3,5,7} B={2,4,6,8} a_1 < b_1 < a_2 < b_2 < a_3 < b_3 < a_4 < b_4
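To make the idea concrete for two sets, the sketch below lists the cross-set comparisons found at the run boundaries of the merged order. The function name `proof_comparisons` is hypothetical, and treating these boundary comparisons as the proof is an assumption that matches the two examples in the slides; the talk's general definition covers k sets.

```python
def proof_comparisons(a, b):
    """List the cross-set comparisons at run boundaries of the merged order.

    Assumes a and b are sorted, duplicate-free, and disjoint.
    """
    merged = sorted(
        [(x, "a", i + 1) for i, x in enumerate(a)]
        + [(x, "b", j + 1) for j, x in enumerate(b)]
    )
    proof = []
    for (_, s1, i1), (_, s2, i2) in zip(merged, merged[1:]):
        if s1 != s2:  # neighbours from different sets: one comparison needed
            proof.append(f"{s1}{i1} < {s2}{i2}")
    return proof

print(len(proof_comparisons([1, 3, 5, 7], [2, 4, 6, 8])))  # 7
print(proof_comparisons([1, 2, 3, 4], [5, 6, 7, 8]))       # ['a4 < b1']
```

The interleaved pair needs seven comparisons while the separated pair needs one, even though both inputs have eight elements, which is the gap an adaptive algorithm tries to exploit.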

14 Adaptive Algorithms In [SODA 2000] we proposed an adaptive algorithm that intersects k sets in: k · | shortest proof of non-intersection | steps. Ideal for crawled, “bursty” data sets

15 How does it work? (Figure labels: 1, _, 3, ...; i; n; DocID universe set)

16 Measuring Performance 100MB Web Crawl 5000 queries from Google

17 Baseline Standard Algorithm
Sort sets by size
Candidate answer set is smallest set
For each set S in increasing order by size
–For each element e in candidate set
  Binary search for e in S
  If e is not found remove from candidate set
  Remove elements before e in S
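The steps of the baseline can be sketched as below. This is one reading of the slide, not the authors' code: `baseline_intersect` is a hypothetical name, the input is assumed to be a non-empty collection of sorted lists, and "remove elements before e in S" is modelled by moving the lower bound of the binary search forward.

```python
from bisect import bisect_left

def baseline_intersect(sets):
    """Intersect sorted DocID lists, starting from the smallest set."""
    ordered = sorted(sets, key=len)          # sort sets by size
    candidates = list(ordered[0])            # candidate answer set
    for s in ordered[1:]:
        lo = 0                               # elements of s before lo are discarded
        survivors = []
        for e in candidates:
            lo = bisect_left(s, e, lo)       # binary search for e in the rest of s
            if lo < len(s) and s[lo] == e:
                survivors.append(e)          # e stays in the candidate set
        candidates = survivors
    return candidates

sets = [[6, 7, 10, 11, 14], [4, 8, 10, 11, 15], [1, 2, 4, 5, 7, 8, 9]]
print(baseline_intersect(sets[:2]))  # [10, 11]
print(baseline_intersect(sets))      # []
```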

18 Upper Bound: Adaptive/Traditional Two-Smallest Algorithm

19 Lower Bound: Adaptive/Shortest Proof

20 Middle Bound: Adaptive/ Encoding of Shortest Proof

21 Side by Side Lower Bound Middle Bound

22 Possible Improvements Adaptive performs best on two or three sets. Traditional algorithm often terminates after first pair of sets. Galloping seems better than binary search. Adaptive keeps a dynamic definition of “smallest set”. Candidate elements aggressively tested.
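For readers unfamiliar with galloping, here is a minimal sketch of a galloping (exponential) search over a sorted list. The name `gallop_search` and its parameters are hypothetical, and `factor` is assumed to be an integer of at least 2.

```python
from bisect import bisect_left

def gallop_search(s, target, start=0, factor=2):
    """Return the first index i >= start with s[i] >= target, or len(s)."""
    n = len(s)
    if start >= n or s[start] >= target:
        return min(start, n)
    lo, step = start, 1                  # invariant: s[lo] < target
    while lo + step < n and s[lo + step] < target:
        lo += step
        step *= factor                   # widen the probe distance each time
    hi = min(lo + step, n)
    # The answer lies in (lo, hi]; finish with a binary search on that window.
    return bisect_left(s, target, lo + 1, hi)

s = [1, 2, 4, 5, 7, 8, 9]
print(gallop_search(s, 7))   # 4
print(gallop_search(s, 3))   # 2
print(gallop_search(s, 10))  # 7
```

The cost depends on how far the target is from the start position rather than on the length of the list, which suits repeated searches that resume where the previous one stopped.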

23 Example {6, 7,10,11,14} {4, 8,10,11,15} {1, 2, 4, 5, 7, 8, 9}

24 Experimental Results Test orthogonally each possible improvement: Cyclic or Two Smallest; Symmetric; Update Smallest; Advance on Common Element; Gallop Factor/Binary Search

25 Binary Search vs. Gallop

26 Advance on Common Element

27 Small Adaptive Combines best of Adaptive and Two-Smallest: Two-smallest; Symmetric; Advance on common element; Update on smallest; Gallop with factor 2
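The sketch below shows one way the listed ingredients could fit together. It is an interpretation of the slide, not the authors' implementation: the names `gallop` and `small_adaptive_sketch` are hypothetical, and the exact order in which sets are probed is an assumption.

```python
from bisect import bisect_left

def gallop(s, target, start):
    """First index i >= start with s[i] >= target, doubling the step (factor 2)."""
    n = len(s)
    if start >= n or s[start] >= target:
        return min(start, n)
    lo, step = start, 1
    while lo + step < n and s[lo + step] < target:
        lo += step
        step *= 2
    return bisect_left(s, target, lo + 1, min(lo + step, n))

def small_adaptive_sketch(sets):
    """Intersect sorted lists; the eliminator comes from the set with fewest remaining elements."""
    lists = [list(s) for s in sets]
    pos = [0] * len(lists)               # pos[i]: first undecided index in lists[i]
    result = []

    def remaining(i):
        return len(lists[i]) - pos[i]

    while all(remaining(i) > 0 for i in range(len(lists))):
        order = sorted(range(len(lists)), key=remaining)   # update on smallest
        smallest = order[0]
        e = lists[smallest][pos[smallest]]                 # eliminator
        pos[smallest] += 1
        in_all = True
        for i in order[1:]:                                # second smallest first
            pos[i] = gallop(lists[i], e, pos[i])
            if pos[i] == len(lists[i]) or lists[i][pos[i]] != e:
                in_all = False
                break
            pos[i] += 1                                    # advance on common element
        if in_all:
            result.append(e)
    return result

print(small_adaptive_sketch([[6, 7, 10, 11, 14], [4, 8, 10, 11, 15]]))   # [10, 11]
print(small_adaptive_sketch([[1, 3, 5, 7, 9], [3, 4, 5, 9], [0, 3, 9, 10]]))  # [3, 9]
```

Re-ranking the sets on every round is the simplest way to express "update on smallest"; a tuned implementation would likely track the ordering incrementally.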

28 Small Adaptive

29 Small Adaptive is faster than Two-Smallest Aggregate speed-up: 2.9x (comparisons) Faster than Adaptive

30 Conclusions Faster intersection algorithm for Web Search Engines Adaptive measure for set operations Information theoretic “middle bound” Standard speed-up techniques for other settings THE END

31 Query Log (Plot: number of queries for each total size, against total # of elements in a query)

32 Example {6, 7,10,11,14} {4, 8,10,11,15} {1, 2, 4, 5, 7, 8, 9, 12}
