Published by Peyton Loar. Modified over 9 years ago.
1
Engineering a Set Intersection Algorithm for Information Retrieval
Alex Lopez-Ortiz, UNB / InterNAP
Joint work with Ian Munro and Erik Demaine
2
Overview
- Web Search Engine Basics
- Algorithms for set operations
- Theoretical Analysis
- Experimental Analysis
- Engineering an Improved Algorithm
- Conclusions
3
Web Search Engine Basics
- Crawl: sequential gathering process
- Document ID (DocID) for each web page
[Figure: pages "Cool sites" (http://acm.org/home.html), SIGIR, SIGACT and SIGCOMM, labeled with DocIDs 1, 2, 3, 4]
4
Indexing: list of entries of type [...]. E.g. SIGCOMM
[Figure: pages "Cool sites", SIGIR, SIGACT and SIGCOMM, with DocIDs 1, 3, 4, 2]
5
Postings set: the set of DocIDs of the pages containing a word or pattern.
- SIGACT: {1,3}
- SIGCOMM: {1,4}
[Figure: pages "Cool sites", SIGIR, SIGACT and SIGCOMM, with DocIDs 1, 3, 4, 2]
6
Search Engine Basics (cont.)
Postings set stored implicitly/explicitly in a string matching data structure:
- PAT tree/array
- Inverted word index
- Suffix trees
- KMP (grep)
- ...
7
String Matching Problem
- Different performance characteristics for each solution
- Time/space tradeoff (empirical)
- Linear time/linear space lower bound [Demaine/L-O, SODA 2001]
8
Search Engine Basics (cont.)
A user query is of the form:
keyword_1 <op> keyword_2 <op> … <op> keyword_n
where each <op> is one of {and, or}.
E.g. computer and science or internet
9
Evaluating a Boolean Query
The interpretation of a boolean query is the mapping:
- keyword → postings set
- and → ∩ (set intersection)
- or → ∪ (set union)
E.g. {computer} ∩ {science} ∪ {internet}
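The mapping above can be sketched in a few lines of Python. This is only an illustration: the toy index `INDEX`, its DocIDs, and the function name `evaluate` are made up, and the strict left-to-right evaluation order is an assumption, since the slide does not state operator precedence.

```python
# Hypothetical toy index: keyword -> postings set of DocIDs (invented data).
INDEX = {
    "computer": {1, 2, 5},
    "science": {2, 5, 7},
    "internet": {3, 7},
}

def evaluate(tokens, index):
    """Evaluate [kw, op, kw, op, ...] left to right (assumed: no precedence)."""
    result = set(index.get(tokens[0], set()))
    for op, word in zip(tokens[1::2], tokens[2::2]):
        postings = index.get(word, set())
        if op == "and":
            result &= postings      # and -> set intersection
        elif op == "or":
            result |= postings      # or -> set union
        else:
            raise ValueError(f"unknown operator: {op}")
    return result

query = "computer and science or internet".split()
print(sorted(evaluate(query, INDEX)))  # prints [2, 3, 5, 7]
```

Python's built-in `set` hides the cost of the intersection itself, which is what the rest of the talk is about.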
10
Set Operations for Web Search Engines
- Average postings set size > 10 million
- Postings sets are sorted
11
Intersection Time Complexity
- Worst case linear in the size of the postings sets: Θ(n). E.g. {1,3,5,7} ∩ {1,3,5,7}
- In the size of the output? E.g. {1,3,5,7} ∩ {2,4,6,8}
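To make the two examples on this slide concrete, here is a minimal sketch of a standard linear merge on sorted lists, instrumented to count loop iterations (each iteration is one three-way comparison). The function name `merge_intersect` is chosen for illustration only.

```python
def merge_intersect(a, b):
    """Intersect two sorted lists by a linear merge.

    Returns (intersection, number of three-way comparisons performed).
    """
    i = j = comparisons = 0
    out = []
    while i < len(a) and j < len(b):
        comparisons += 1
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out, comparisons

print(merge_intersect([1, 3, 5, 7], [1, 3, 5, 7]))  # ([1, 3, 5, 7], 4)
print(merge_intersect([1, 3, 5, 7], [2, 4, 6, 8]))  # ([], 7)
```

The second call returns an empty output yet still does work proportional to the input, so output size alone does not bound the cost.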
12
Adaptive Algorithms
Assume the intersection is empty. What is the minimum number of comparisons needed to ascertain this fact?
Examples: {1,2,3,4} ∩ {5,6,7,8}
13
Much Ado About Nothing
A sequence of comparisons is a proof of non-intersection if every possible instance of sets satisfying said sequence has empty intersection.
E.g. A = {1,3,5,7}, B = {2,4,6,8}:
a_1 < b_1 < a_2 < b_2 < a_3 < b_3 < a_4 < b_4
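For the special case of two sorted, disjoint sets, the size of a shortest proof can be counted directly: one comparison is needed at every point in the merged order where the source set changes. The sketch below illustrates only that two-set case; the name `shortest_proof_size` is hypothetical, and the k-set measure from the talk is not reproduced here.

```python
def shortest_proof_size(a, b):
    """Comparisons in a shortest proof that sorted lists a and b are disjoint.

    Two-set case only: one comparison per boundary between consecutive
    elements of the merged order that come from different sets.
    """
    if set(a) & set(b):
        raise ValueError("sets intersect: no proof of non-intersection exists")
    merged = sorted([(x, "A") for x in a] + [(x, "B") for x in b])
    return sum(1 for p, q in zip(merged, merged[1:]) if p[1] != q[1])

print(shortest_proof_size([1, 3, 5, 7], [2, 4, 6, 8]))  # 7
print(shortest_proof_size([1, 2, 3, 4], [5, 6, 7, 8]))  # 1
```

The interleaved pair needs the full chain of seven comparisons, while the separated pair is settled by the single comparison a_4 < b_1.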
14
Adaptive Algorithms
In [SODA 2000] we proposed an adaptive algorithm that intersects k sets in:
k · |shortest proof of non-intersection| steps.
Ideal for crawled, “bursty” data sets.
15
How does it work?
[Figure: DocID universe set, positions 1, _, 3, ..., i, ..., n]
16
Measuring Performance
- 100MB Web Crawl
- 5000 queries from Google
17
Baseline: Standard Algorithm
- Sort sets by size
- Candidate answer set is the smallest set
- For each set S in increasing order by size
    - For each element e in the candidate set
        - Binary search for e in S
        - If e is not found, remove it from the candidate set
        - Remove elements before e in S
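The baseline steps can be sketched as follows, assuming each postings set is a sorted Python list of distinct DocIDs. The function name `baseline_intersect` is illustrative; "remove elements before e in S" is modeled by advancing a lower bound `lo` instead of physically deleting elements.

```python
from bisect import bisect_left

def baseline_intersect(sets):
    """Intersect sorted lists: smallest set first, binary search in the rest."""
    if not sets:
        return []
    ordered = sorted(sets, key=len)          # sort sets by size
    candidates = list(ordered[0])            # candidate set = smallest set
    for s in ordered[1:]:
        lo = 0                               # s[:lo] is already ruled out
        survivors = []
        for e in candidates:
            lo = bisect_left(s, e, lo)       # binary search for e in s[lo:]
            if lo < len(s) and s[lo] == e:
                survivors.append(e)          # e stays a candidate
        candidates = survivors
    return candidates

print(baseline_intersect([[1, 3, 5, 7], [3, 4, 5, 6, 9], [0, 3, 5, 8, 9, 10]]))
# prints [3, 5]
```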
18
Upper Bound: Adaptive/Traditional Two-Smallest Algorithm
19
Lower Bound: Adaptive/Shortest Proof
20
Middle Bound: Adaptive/ Encoding of Shortest Proof
21
Side by Side
[Plots: Lower Bound, Middle Bound]
22
Possible Improvements
- Adaptive performs best on two or three sets
- Traditional algorithm often terminates after the first pair of sets
- Galloping seems better than binary search
- Adaptive keeps a dynamic definition of “smallest set”
- Candidate elements aggressively tested
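For readers unfamiliar with galloping, here is a minimal sketch of galloping (exponential) search on a sorted list: probe at geometrically growing distances from the current position, then binary search inside the last bracket. The signature and the `factor` parameter are assumptions for illustration, not the authors' code.

```python
from bisect import bisect_left

def gallop(s, target, start=0, factor=2):
    """Return the first index i >= start with s[i] >= target (len(s) if none)."""
    if start >= len(s):
        return len(s)
    if s[start] >= target:
        return start
    lo = start                  # invariant: s[lo] < target
    step = 1
    hi = lo + step
    while hi < len(s) and s[hi] < target:
        lo = hi
        step *= factor          # grow the jump geometrically
        hi = lo + step
    # The answer lies in (lo, min(hi, len(s))]; finish with a binary search.
    return bisect_left(s, target, lo + 1, min(hi, len(s)))

s = [1, 2, 4, 5, 7, 8, 9]
print(gallop(s, 6))   # 4, the index of 7
print(gallop(s, 10))  # 7, i.e. len(s): no element >= 10
```

The cost grows with the logarithm of the distance jumped rather than the size of the whole set, which helps when matches cluster close to the current position.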
23
Example
{6, 7, 10, 11, 14}
{4, 8, 10, 11, 15}
{1, 2, 4, 5, 7, 8, 9}
24
Experimental Results
Test each possible improvement orthogonally:
- Cyclic or Two Smallest
- Symmetric
- Update Smallest
- Advance on Common Element
- Gallop Factor/Binary Search
25
Binary Search vs. Gallop
26
Advance on Common Element
27
Small Adaptive
Combines the best of Adaptive and Two-Smallest:
- Two-smallest
- Symmetric
- Advance on common element
- Update on smallest
- Gallop with factor 2
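The slide lists ingredients but not the exact procedure, so the following is only one possible reading, not the authors' implementation. It assumes sorted lists of distinct DocIDs, takes the next candidate from the set with the fewest remaining elements ("update on smallest"), gallops with factor 2 in the other sets ordered by remaining size, and steps past a matched element immediately ("advance on common element"). The "symmetric" ingredient is not modeled separately. The names `small_adaptive` and `_gallop` are hypothetical.

```python
from bisect import bisect_left

def _gallop(s, target, start):
    """First index i >= start with s[i] >= target, galloping with factor 2."""
    if start >= len(s) or s[start] >= target:
        return min(start, len(s))
    lo, step = start, 1
    while lo + step < len(s) and s[lo + step] < target:
        lo += step
        step *= 2
    return bisect_left(s, target, lo + 1, min(lo + step, len(s)))

def small_adaptive(sets):
    """Sketch of a Small-Adaptive-style intersection (assumed reading)."""
    if not sets:
        return []
    pos = [0] * len(sets)                 # current position in each set
    out = []
    while all(pos[i] < len(s) for i, s in enumerate(sets)):
        # Re-rank sets by number of remaining elements on every round.
        order = sorted(range(len(sets)), key=lambda i: len(sets[i]) - pos[i])
        first = order[0]
        e = sets[first][pos[first]]       # candidate from the smallest set
        pos[first] += 1
        in_all = True
        for i in order[1:]:
            pos[i] = _gallop(sets[i], e, pos[i])
            if pos[i] == len(sets[i]) or sets[i][pos[i]] != e:
                in_all = False            # e is eliminated; pick a new candidate
                break
            pos[i] += 1                   # advance on common element
        if in_all:
            out.append(e)
    return out

print(small_adaptive([[1, 2, 3, 10, 11], [2, 10, 12], [0, 2, 5, 10]]))
# prints [2, 10]
```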
28
Small Adaptive
29
Small Adaptive is faster than Two-Smallest
- Aggregate speed-up: 2.9x (comparisons)
- Faster than Adaptive
30
Conclusions
- Faster intersection algorithm for Web Search Engines
- Adaptive measure for set operations
- Information theoretic “middle bound”
- Standard speed-up techniques for other settings
THE END
31
Query Log
[Plot: number of queries for each total size, against total # of elements in a query]
32
Example
{6, 7, 10, 11, 14}
{4, 8, 10, 11, 15}
{1, 2, 4, 5, 7, 8, 9, 12}