1
Keywords Selection Problem in Hidden Web Crawling Ka Cheung Sia, Richard March 15 2004
2
Agenda
What is Hidden Web?
How to crawl the Hidden Web?
Problem formalization
Searching for “best” keyword
  Greedy
  Tree searching
  Pruning
Experiments & results
Conclusion
3
What is Hidden Web?
Hidden
  Unreachable by following hyperlinks
  Dynamically generated
  Accessible only through a search interface
Informative
Examples
  http://citeseer.ist.psu.edu/ - CS research papers
  http://www.pubmed.org - medical research papers
  http://catalog.loc.gov - Library of Congress
4
What is Hidden Web? Search interface http://citeseer.ist.psu.edu/cis?q=heuristic+search&submit=Search+Documents&cs=1
5
What is Hidden Web? Result
6
What is Hidden Web? Document
7
How to crawl the Hidden Web
http://citeseer.ist.psu.edu/cis?q=heuristic+search&submit=Search+Documents&cs=1
[Diagram labels: Figure out a keyword, Hidden Web, Query, Result, Our task]
8
Problem formalization Set-cover Vertex – documents Hyper-edges – query words
9
Goal Maximize the number of unique documents retrieved with the minimum number of query words
10
Problem formalization
$P(q_i)$: portion of unique documents retrieved by issuing query word $q_i$ (the portion of documents containing $q_i$)
$P(q_i \lor q_j)$: portion of unique documents retrieved by issuing query words $q_i$ and $q_j$ (the portion of documents containing $q_i$ or $q_j$)
$P(q_i \mid q_j)$: portion of documents containing $q_i$ within the set of documents retrieved by issuing query word $q_j$
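The three quantities can be illustrated on a toy collection. This is only a sketch: the documents in `DOCS` and the helper names (`retrieved`, `p`, `p_or`, `p_given`) are hypothetical and not part of the presentation.

```python
from fractions import Fraction

# Hypothetical toy collection: document id -> set of words it contains.
DOCS = {
    1: {"linux", "debian"},
    2: {"linux", "redhat"},
    3: {"linux"},
    4: {"privacy"},
}

def retrieved(word, docs=DOCS):
    """Ids of the documents containing `word` (what a one-word query returns)."""
    return {doc_id for doc_id, words in docs.items() if word in words}

def p(word, docs=DOCS):
    """P(q): portion of the whole collection retrieved by `word`."""
    return Fraction(len(retrieved(word, docs)), len(docs))

def p_or(word_a, word_b, docs=DOCS):
    """P(q_a v q_b): portion of the collection containing either word."""
    union = retrieved(word_a, docs) | retrieved(word_b, docs)
    return Fraction(len(union), len(docs))

def p_given(word_a, word_b, docs=DOCS):
    """P(q_a | q_b): portion of the documents retrieved by `word_b` that contain `word_a`."""
    base = retrieved(word_b, docs)
    if not base:
        return Fraction(0)  # nothing retrieved, so nothing to condition on
    return Fraction(len(retrieved(word_a, docs) & base), len(base))
```

Exact fractions are used so that the toy values can be compared without rounding.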
11
Problem formalization
What is the next “best” query word?
$$P((q_1 \lor \dots \lor q_{i-1}) \lor q_i) = P(q_1 \lor \dots \lor q_{i-1}) + P(q_i) - P((q_1 \lor \dots \lor q_{i-1}) \land q_i)$$
$$= P(q_1 \lor \dots \lor q_{i-1}) + P(q_i) - P(q_1 \lor \dots \lor q_{i-1})\, P(q_i \mid q_1 \lor \dots \lor q_{i-1})$$
$P(q_1 \lor \dots \lor q_{i-1})$: known
$P(q_i \mid q_1 \lor \dots \lor q_{i-1})$: known
$P(q_i)$: unknown
Approximate $P(q_i)$ using $P(q_i \mid q_1 \lor \dots \lor q_{i-1})$
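A minimal sketch of this estimate, with the unknown $P(q_i)$ replaced by the conditional portion measured on the documents retrieved so far. The function name and arguments are illustrative assumptions, not taken from the slides.

```python
def estimated_coverage(p_base, p_cond):
    """Estimated P(base v q_i) after issuing q_i.

    p_base: P(q_1 v ... v q_{i-1}), the coverage reached so far (known).
    p_cond: P(q_i | q_1 v ... v q_{i-1}), measured on the retrieved documents (known).
    """
    p_qi_estimate = p_cond  # stand-in for the unknown P(q_i)
    return p_base + p_qi_estimate - p_base * p_cond
```

Under this substitution the estimate simplifies to `p_base + (1 - p_base) * p_cond`, so for a fixed `p_base` the candidate with the largest conditional portion also has the largest estimated coverage.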
12
Search for best query word
Greedy: choose the most frequently occurring word so far as the next query
Choose $q_i$ with maximum $P(q_i \mid q_1 \lor \dots \lor q_{i-1})$
For the set-cover problem, greedy is proven to obtain a solution within a logarithmic factor of the optimum
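The greedy rule can be simulated against a small in-memory collection. This is a sketch under stated assumptions: `greedy_crawl`, the seed word, and the alphabetical tie-break are illustrative choices, not details from the presentation.

```python
from collections import Counter

def greedy_crawl(docs, seed_word, max_queries):
    """Simulate greedy keyword selection.

    docs: dict mapping document id -> set of words (stands in for the hidden site).
    Returns (issued_queries, retrieved_document_ids).
    """
    retrieved = set()
    issued = []
    query = seed_word
    while query is not None and len(issued) < max_queries:
        issued.append(query)
        retrieved |= {d for d, words in docs.items() if query in words}
        # Document frequency of each not-yet-issued word in what we have retrieved.
        counts = Counter(
            w for d in retrieved for w in docs[d] if w not in issued
        )
        # Most frequent word wins; ties are broken alphabetically (an assumption).
        query = min(counts, key=lambda w: (-counts[w], w)) if counts else None
    return issued, retrieved
```

For example, with `docs = {1: {"linux", "debian"}, 2: {"linux", "redhat"}, 3: {"redhat", "privacy"}}` and seed `"linux"`, the second query is `"debian"`, which retrieves nothing new: frequency within the retrieved set is only an estimate of what a word will add.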
13
Search for best query word Can we do better? Intuition Correlation of keywords E.g. - linux - debian, redhat, suse, knoppix, fedora, etc… We might save the query word “linux” !
14
Search for best query word Whole document collection Already retrieved documents Documents retrieved by q i Documents retrieved by q j Documents retrieved by q k
15
Search for best query word linux debian redhat f(x) = number of documents we get by issuing the queries linux, debian, redhat, minus the overlap between “redhat, linux”, “debian, linux” and “redhat, debian”
16
Search for best query word The search tree is huge (branching factor) We look ahead for the 10 most frequent keywords We only search up to depth 6 Pruning
17
Search for best query word
DFBnB (depth-first branch and bound)
Prune any sub-tree in which the sum of documents retrieved, assuming no overlap between keywords, is less than the current best solution
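The pruning rule can be sketched as a depth-first branch and bound over keyword sequences. This is an illustration, not the presenters' implementation: it assumes the document set of every candidate keyword is known in advance (a real crawler could only estimate it), and the name `dfbnb` and its return shape are hypothetical.

```python
def dfbnb(candidates, depth):
    """Depth-first branch and bound over ordered keyword sequences.

    candidates: dict mapping keyword -> set of document ids it retrieves
                (assumed known here for illustration).
    Returns (best_words, best_coverage, nodes_examined).
    """
    best_words, best_covered, nodes = (), 0, 0

    def visit(chosen, covered):
        nonlocal best_words, best_covered, nodes
        if len(covered) > best_covered:
            best_words, best_covered = tuple(chosen), len(covered)
        if len(chosen) == depth:
            return
        remaining = [w for w in candidates if w not in chosen]
        # Expand the most promising keyword first so a good bound is found early.
        remaining.sort(key=lambda w: (-len(candidates[w] - covered), w))
        slots_left = depth - len(chosen) - 1
        for word in remaining:
            nodes += 1
            new_covered = covered | candidates[word]
            # Optimistic bound: later keywords add all their documents (no overlap).
            sizes = sorted(
                (len(candidates[w]) for w in remaining if w != word), reverse=True
            )
            bound = len(new_covered) + sum(sizes[:slots_left])
            if bound <= best_covered:
                continue  # this sub-tree cannot beat the current best solution
            visit(chosen + [word], new_covered)

    visit([], set())
    return best_words, best_covered, nodes
```

The sketch prunes when the bound is less than or equal to the current best, which is slightly more aggressive than "less than" but still cannot discard a strictly better solution, because the no-overlap sum never underestimates what a sub-tree can reach.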
18
Experiment
Document collection: ~100K front pages of randomly selected websites
Query interface: an inverted index (a program that returns documents containing the given query word)
Methods
  Greedy
  DFS search (look ahead for 10 words, up to depth 6)
  DFS search with pruning (DFBnB)
19
Results
Does searching help?
Sequence 1: provide 51, work 159, privacy 144, years 172, world 344, list 205, info 1467, map 184, want 57, order 87, people 85, read 56, main 2270, high 95, designed 240, latest 36, events 132, looking 46, send 80, right 380, enter 1285, local 77, browser 1216, questions 77, real 77
Sequence 2: provide 51, work 159, privacy 144, years 172, read 101, main 2364, designed 291, info 1455, latest 53, looking 60, send 101, right 402, local 99, world 239, list 142, map 150, want 42, order 69, people 67, high 85, events 126, questions 85, enter 1272, browser 1216, real 77
20
Results Does searching help?
21
Results
How much does pruning save?
Without pruning: 187300 nodes are examined
187300 = (10) + (10*9) + (10*9*8) + (10*9*8*7) + (10*9*8*7*6) + (10*9*8*7*6*5)
With pruning: 5558 nodes are examined on average (when we choose the most frequent keyword to expand)
DFBnB examines ~30 times fewer nodes
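The node count can be checked as a sum of partial permutations of the 10 look-ahead keywords over depths 1 to 6. The helper name below is illustrative.

```python
from math import perm

def nodes_without_pruning(branching, depth):
    """Nodes in the full search tree when no keyword is repeated on a path."""
    return sum(perm(branching, level) for level in range(1, depth + 1))

total = nodes_without_pruning(10, 6)   # 10 + 10*9 + ... + 10*9*8*7*6*5
ratio = total / 5558                   # 5558 is the average reported with pruning
```

Here `total` is 187300 and `ratio` is about 33.7, consistent with the "~30 times" figure.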
22
Conclusion Searching helps little “in this problem” DFBnB is “really effective” in pruning search tree
23
End
24
More results A priori information helps
25
Results
27
Search & Greedy
28
Search with prune & Greedy
29
Search for best query word
$\text{base} = q_1 \lor \dots \lor q_i$
$$P(\text{base} \lor q_{i+1} \lor q_{i+2}) = P(\text{base} \lor q_{i+1}) + P(q_{i+2}) - P((\text{base} \lor q_{i+1}) \land q_{i+2})$$
$$P((\text{base} \lor q_{i+1}) \land q_{i+2}) = P((\text{base} \land q_{i+2}) \lor (q_{i+1} \land q_{i+2}))$$
$$= P(\text{base} \land q_{i+2}) + P(q_{i+1} \land q_{i+2}) - P(\text{base} \land q_{i+1} \land q_{i+2})$$
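Both identities can be checked on small sets, with set sizes standing in for probabilities. The sets and function names below are hypothetical and chosen only to illustrate the expansion.

```python
def overlap_direct(base, q1, q2):
    """|(base v q1) ^ q2| computed directly."""
    return len((base | q1) & q2)

def overlap_expanded(base, q1, q2):
    """The same quantity via the inclusion-exclusion expansion."""
    return len(base & q2) + len(q1 & q2) - len(base & q1 & q2)

def union_size(base, q1, q2):
    """|base v q1 v q2| built from the two-step formula."""
    return len(base | q1) + len(q2) - overlap_expanded(base, q1, q2)

# Hypothetical document-id sets for base, q_{i+1} and q_{i+2}.
BASE, Q1, Q2 = {1, 2, 3}, {3, 4, 5}, {2, 5, 6}
```

With these sets both overlap functions return 2 and `union_size` returns 6, which matches `len(BASE | Q1 | Q2)`.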
30
2 words overlapping
31
3 words overlapping