Published by Peyton Loar. Modified over 2 years ago.

1
Engineering a Set Intersection Algorithm for Information Retrieval
Alex Lopez-Ortiz, UNB / InterNAP
Joint work with Ian Munro and Erik Demaine

2
Overview
- Web Search Engine Basics
- Algorithms for set operations
- Theoretical Analysis
- Experimental Analysis
- Engineering an Improved Algorithm
- Conclusions

3
Web Search Engine Basics
- Crawl: sequential gathering process
- Document ID (DocID) for each web page
[Figure: crawl example starting at http://acm.org/home.html ("Cool sites: SIGIR SIGACT SIGCOMM"); the crawled pages are labeled with DocIDs 1, 2, 3, 4]

4
Indexing
- List of entries of type <word, DocID>, e.g. <SIGCOMM, 1>
[Figure: the crawled pages ("Cool sites: SIGIR SIGACT SIGCOMM"), labeled with DocIDs 1 to 4]

5
Postings set: the set of DocIDs containing a word or pattern.
- SIGACT: {1,3}
- SIGCOMM: {1,4}
[Figure: the crawled pages ("Cool sites: SIGIR SIGACT SIGCOMM"), labeled with DocIDs 1 to 4]
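
The postings sets above can be built with a small inverted index. This is a minimal sketch, not the authors' indexer: `build_postings` is a hypothetical helper, and the page contents are made up so that the result matches the two sets on the slide.

```python
from collections import defaultdict


def build_postings(pages):
    """Map each word to the sorted list of DocIDs of the pages containing it."""
    postings = defaultdict(set)
    for doc_id, text in pages.items():
        for word in text.split():
            postings[word].add(doc_id)
    # Sorted lists, since the later slides rely on sorted postings sets.
    return {word: sorted(ids) for word, ids in postings.items()}


# Hypothetical crawl: DocID -> page text (assumed, not taken from the slides).
pages = {
    1: "SIGACT SIGCOMM",
    2: "SIGIR",
    3: "SIGACT",
    4: "SIGCOMM",
}

index = build_postings(pages)
print(index["SIGACT"], index["SIGCOMM"])
```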

6
Search Engine Basics (cont.)
Postings set stored implicitly/explicitly in a string matching data structure:
- PAT tree/array
- Inverted word index
- Suffix trees
- KMP (grep)
- ...

7
String Matching Problem
- Different performance characteristics for each solution
- Time/space tradeoff (empirical)
- Linear time/linear space lower bound [Demaine/L-O, SODA 2001]

8
Search Engine Basics (cont.)
A user query is of the form:
keyword_1 <op> keyword_2 <op> … <op> keyword_n
where each <op> is one of {and, or}
E.g. computer and science or internet

9
Evaluating a Boolean Query
The interpretation of a boolean query is the mapping:
- keyword → postings set
- and → ∩ (set intersection)
- or → ∪ (set union)
E.g. {computer} ∩ {science} ∪ {internet}
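
The mapping can be sketched in a few lines. Assumptions: `evaluate` is a hypothetical helper, the query is evaluated strictly left to right (the slides do not state an operator precedence), and the postings data is invented for illustration.

```python
def evaluate(query, postings):
    """Evaluate 'k1 op k2 op ... kn' left to right, with op in {and, or}."""
    tokens = query.split()
    # keyword -> postings set (an unknown keyword maps to the empty set)
    result = set(postings.get(tokens[0], ()))
    for op, keyword in zip(tokens[1::2], tokens[2::2]):
        operand = set(postings.get(keyword, ()))
        if op == "and":
            result &= operand      # set intersection
        elif op == "or":
            result |= operand      # set union
        else:
            raise ValueError(f"unknown operator: {op!r}")
    return sorted(result)


# Made-up postings sets, for illustration only.
postings = {
    "computer": [1, 2, 5],
    "science": [2, 5, 7],
    "internet": [3, 5],
}

answer = evaluate("computer and science or internet", postings)
print(answer)
```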

10
Set Operations for Web Search Engines
- Average postings set size > 10 million
- Postings sets are sorted

11
Intersection Time Complexity
- Worst case linear in the size of the postings sets: Θ(n), e.g. {1,3,5,7} ∩ {1,3,5,7}
- In the size of the output? E.g. {1,3,5,7} ∩ {2,4,6,8}
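
For reference, the textbook linear merge on two sorted lists looks as follows. This is a standard technique shown as a sketch, not code from the talk. On the second example the output is empty, yet the merge still walks through both lists.

```python
def intersect_merge(a, b):
    """Intersect two sorted lists with one linear pass over both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


same = intersect_merge([1, 3, 5, 7], [1, 3, 5, 7])
disjoint = intersect_merge([1, 3, 5, 7], [2, 4, 6, 8])
print(same, disjoint)
```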

12
Adaptive Algorithms
Assume the intersection is empty. What is the minimum number of comparisons needed to ascertain this fact?
Example: {1,2,3,4} {5,6,7,8}

13
Much Ado About Nothing
A sequence of comparisons is a proof of non-intersection if every possible instance of sets satisfying said sequence has empty intersection.
E.g. A = {1,3,5,7}, B = {2,4,6,8}:
a_1 < b_1 < a_2 < b_2 < a_3 < b_3 < a_4 < b_4
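
For two disjoint sorted sets, a chain like the one above needs one comparison per boundary between alternating runs of A and B. The sketch below counts those boundaries. `chain_length` is a hypothetical helper covering only the two-set case, not the general k-set measure from the talk.

```python
def chain_length(a, b):
    """Count comparisons in the interleaving chain of two disjoint sorted lists."""
    if set(a) & set(b):
        raise ValueError("sets intersect, so no proof of non-intersection exists")
    if not a or not b:
        return 0
    # Tag each element with its source set and merge by value.
    merged = sorted([(x, "a") for x in a] + [(x, "b") for x in b])
    # One comparison is needed wherever the source set changes.
    return sum(1 for (_, s), (_, t) in zip(merged, merged[1:]) if s != t)


interleaved = chain_length([1, 3, 5, 7], [2, 4, 6, 8])
separated = chain_length([1, 2, 3, 4], [5, 6, 7, 8])
print(interleaved, separated)
```

The fully interleaved pair needs the seven comparisons of the chain on the slide, while {1,2,3,4} and {5,6,7,8} are separated by the single comparison a_4 < b_1.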

14
Adaptive Algorithms
- In [SODA 2000] we proposed an adaptive algorithm that intersects k sets in k · |shortest proof of non-intersection| steps.
- Ideal for crawled, “bursty” data sets

15
How does it work?
[Figure: a set drawn against the DocID universe 1, …, n]

16
Measuring Performance
- 100MB Web Crawl
- 5000 queries from Google

17
Baseline: Standard Algorithm
- Sort sets by size
- Candidate answer set is the smallest set
- For each set S, in increasing order by size:
  - For each element e in the candidate set:
    - Binary search for e in S
    - If e is not found, remove it from the candidate set
    - Remove elements before e in S
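
The steps above can be sketched as follows. `intersect_baseline` is a hypothetical name, and "remove elements before e in S" is modeled by moving a lower bound `lo` forward rather than by deleting elements.

```python
from bisect import bisect_left


def intersect_baseline(sets):
    """Intersect sorted lists: binary search each candidate in every other set."""
    if not sets:
        return []
    ordered = sorted(sets, key=len)        # sort sets by size
    candidates = list(ordered[0])          # smallest set is the candidate set
    for s in ordered[1:]:                  # remaining sets, smallest first
        lo = 0                             # s[:lo] counts as already removed
        kept = []
        for e in candidates:
            lo = bisect_left(s, e, lo)     # binary search for e in s[lo:]
            if lo < len(s) and s[lo] == e:
                kept.append(e)             # e survives this set
        candidates = kept                  # elements not found are dropped
    return candidates


result = intersect_baseline([[1, 3, 5, 7, 9], [3, 4, 5, 9], [0, 3, 9, 10, 11, 12]])
print(result)
```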

18
Upper Bound: Adaptive/Traditional Two-Smallest Algorithm

19
Lower Bound: Adaptive/Shortest Proof

20
Middle Bound: Adaptive/Encoding of Shortest Proof

21
Side by Side: Lower Bound, Middle Bound

22
Possible Improvements
- Adaptive performs best on two to three sets
- Traditional algorithm often terminates after the first pair of sets
- Galloping seems better than binary search
- Adaptive keeps a dynamic definition of “smallest set”
- Candidate elements aggressively tested
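
Galloping search probes at exponentially growing distances and then finishes with a binary search inside the last gap, so finding a nearby element costs few comparisons. The sketch below is a generic version of the technique. The function name, the `start` argument and the default `factor=2` are assumptions, not details taken from the talk.

```python
from bisect import bisect_left


def gallop(s, target, start=0, factor=2):
    """Return the first index i >= start with s[i] >= target, or len(s)."""
    n = len(s)
    if start >= n:
        return n
    if s[start] >= target:
        return start
    lo = start                 # invariant: s[lo] < target
    step = 1
    while lo + step < n and s[lo + step] < target:
        lo += step
        step *= factor         # widen the jump each time
    hi = min(lo + step, n)     # answer lies in (lo, hi]
    return bisect_left(s, target, lo + 1, hi)


s = [1, 2, 4, 5, 7, 8, 9]
print(gallop(s, 7), gallop(s, 6), gallop(s, 10))
```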

23
Example
{6, 7, 10, 11, 14}
{4, 8, 10, 11, 15}
{1, 2, 4, 5, 7, 8, 9}

24
Experimental Results
Test each possible improvement orthogonally:
- Cyclic or Two-Smallest
- Symmetric
- Update Smallest
- Advance on Common Element
- Gallop Factor/Binary Search

25
Binary Search vs. Gallop

26
Advance on Common Element

27
Small Adaptive
Combines the best of Adaptive and Two-Smallest:
- Two-smallest
- Symmetric
- Advance on common element
- Update on smallest
- Gallop with factor 2
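
One possible reading of this combination is sketched below. It is an interpretation of the bullet list, not the authors' implementation: `small_adaptive` and `gallop` are hypothetical names, and the details (re-sorting by remaining size on every round, stopping at the first set that lacks the candidate) are assumptions.

```python
from bisect import bisect_left


def gallop(s, target, start):
    """First index i >= start with s[i] >= target, doubling the step (factor 2)."""
    n = len(s)
    if start >= n:
        return n
    if s[start] >= target:
        return start
    lo, step = start, 1        # invariant: s[lo] < target
    while lo + step < n and s[lo + step] < target:
        lo += step
        step *= 2
    return bisect_left(s, target, lo + 1, min(lo + step, n))


def small_adaptive(sets):
    """Intersect sorted, duplicate-free lists (sketch of the Small Adaptive idea)."""
    k = len(sets)
    if k == 0:
        return []
    pos = [0] * k              # pos[i]: first element of sets[i] not yet discarded
    result = []
    while all(pos[i] < len(sets[i]) for i in range(k)):
        # Update on smallest: order sets by how many elements remain.
        order = sorted(range(k), key=lambda i: len(sets[i]) - pos[i])
        smallest = order[0]
        e = sets[smallest][pos[smallest]]   # candidate from the smallest set
        pos[smallest] += 1
        common = True
        for i in order[1:]:                 # second smallest first
            pos[i] = gallop(sets[i], e, pos[i])
            if pos[i] < len(sets[i]) and sets[i][pos[i]] == e:
                pos[i] += 1                 # advance on common element
            else:
                common = False              # e is eliminated; pick a new candidate
                break
        if common:
            result.append(e)
    return result


found = small_adaptive([[1, 3, 5, 7, 9], [3, 4, 5, 9], [0, 3, 9, 10, 11, 12]])
print(found)
```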

28
Small Adaptive

29
Small Adaptive
- Faster than Two-Smallest
- Aggregate speed-up: 2.9x in comparisons
- Faster than Adaptive

30
Conclusions
- Faster intersection algorithm for Web Search Engines
- Adaptive measure for set operations
- Information theoretic “middle bound”
- Standard speed-up techniques for other settings
THE END

31
Query Log
[Figure: number of queries for each total size, where total size is the total # of elements in a query]

32
Example
{6, 7, 10, 11, 14}
{4, 8, 10, 11, 15}
{1, 2, 4, 5, 7, 8, 9, 12}
