Keywords Selection Problem in Hidden Web Crawling Ka Cheung Sia, Richard March 15 2004.

Agenda
- What is Hidden Web?
- How to crawl the Hidden Web?
- Problem formalization
- Searching for “best” keyword
  - Greedy
  - Tree searching
  - Pruning
- Experiments & results
- Conclusion

What is Hidden Web?
- Hidden
  - Unreachable by following hyperlinks
  - Dynamically generated
  - Accessible only through a search interface
- Informative
- Examples
  - http://citeseer.ist.psu.edu/ (CS research papers)
  - http://www.pubmed.org (medical research papers)
  - http://catalog.loc.gov (Library of Congress)

What is Hidden Web?
- Search interface
- http://citeseer.ist.psu.edu/cis?q=heuristic+search&submit=Search+Documents&cs=1

What is Hidden Web?  Result

What is Hidden Web?  Document

How to crawl the Hidden Web
- http://citeseer.ist.psu.edu/cis?q=heuristic+search&submit=Search+Documents&cs=1
- [Diagram labels: Figure out a keyword; Hidden Web; Query; Result; Our task]

Problem formalization
- Set-cover
  - Vertices: documents
  - Hyper-edges: query words

Goal
- Maximize the number of unique documents retrieved with the minimum number of query words

Problem formalization
- $P(q_i)$: portion of unique documents retrieved by issuing query word $q_i$ (portion of documents containing “$q_i$”)
- $P(q_i \lor q_j)$: portion of unique documents retrieved by issuing query words $q_i$ and $q_j$ (portion of documents containing $q_i$ or $q_j$)
- $P(q_i \mid q_j)$: portion of documents containing $q_i$ in the set of documents retrieved by issuing query word $q_j$
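The three quantities above can be illustrated on a tiny in-memory collection. This is only a sketch: the collection `DOCS` and the helper names `matching`, `p`, `p_or` and `p_given` are hypothetical and do not come from the slides.

```python
from fractions import Fraction

# Hypothetical toy collection: document id -> set of words it contains.
DOCS = {
    1: {"linux", "debian"},
    2: {"linux", "redhat"},
    3: {"linux"},
    4: {"windows"},
}

def matching(word):
    """Ids of the documents containing `word` (what a query for it returns)."""
    return {doc_id for doc_id, words in DOCS.items() if word in words}

def p(word):
    """P(q): portion of the collection retrieved by the query word."""
    return Fraction(len(matching(word)), len(DOCS))

def p_or(word_a, word_b):
    """P(q_a v q_b): portion retrieved by issuing both query words."""
    return Fraction(len(matching(word_a) | matching(word_b)), len(DOCS))

def p_given(word_a, word_b):
    """P(q_a | q_b): portion of the documents retrieved by q_b that contain q_a."""
    retrieved = matching(word_b)
    if not retrieved:
        return Fraction(0)
    return Fraction(len(matching(word_a) & retrieved), len(retrieved))
```

With this collection, `p("linux")` is 3/4, `p_or("linux", "windows")` is 1, and `p_given("debian", "linux")` is 1/3.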

Problem formalization
- What is the next “best” query word?
- $P((q_1 \lor \dots \lor q_{i-1}) \lor q_i) = P(q_1 \lor \dots \lor q_{i-1}) + P(q_i) - P((q_1 \lor \dots \lor q_{i-1}) \land q_i)$
  $= P(q_1 \lor \dots \lor q_{i-1}) + P(q_i) - P(q_1 \lor \dots \lor q_{i-1})\,P(q_i \mid q_1 \lor \dots \lor q_{i-1})$
- $P(q_1 \lor \dots \lor q_{i-1})$: known
- $P(q_i \mid q_1 \lor \dots \lor q_{i-1})$: known
- $P(q_i)$: unknown
- Approximate $P(q_i)$ using $P(q_i \mid q_1 \lor \dots \lor q_{i-1})$
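As a worked instance of the estimate, the sketch below plugs the approximation into the expansion: the unknown $P(q_i)$ is replaced by the conditional portion observed in the documents retrieved so far. The function name `estimated_coverage` and the sample numbers are illustrative assumptions.

```python
def estimated_coverage(p_base, p_word_given_base):
    """Estimate P(base v q_i), where base = q_1 v ... v q_{i-1}.

    p_base            -- P(base), known from the documents retrieved so far
    p_word_given_base -- P(q_i | base), measured on the retrieved documents
    """
    # Approximation from the slide: use P(q_i | base) in place of P(q_i).
    p_word = p_word_given_base
    return p_base + p_word - p_base * p_word_given_base

# Half of the collection retrieved so far; the candidate word occurs in
# half of the retrieved documents.
print(estimated_coverage(0.5, 0.5))  # 0.75
```

Under this approximation the estimate simplifies to $P(base) + (1 - P(base))\,P(q_i \mid base)$, so for a fixed base it grows with the conditional frequency of the candidate word.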

Search for best query word
- Greedy: choose the most frequently occurring word so far to be the next query
  - Choose $q_i$ with maximum $P(q_i \mid q_1 \lor \dots \lor q_{i-1})$
- For the set-cover problem, greedy is proven to obtain a solution within a logarithmic factor of the optimum
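The greedy policy can be sketched against an in-memory stand-in for the search interface. Everything here (the `docs` mapping, the `search` helper, the seed word and the alphabetical tie-break) is an assumption made for illustration, not the authors' implementation.

```python
from collections import Counter

def greedy_crawl(docs, seed, max_queries):
    """Greedy keyword selection over `docs` (doc id -> set of words).

    Returns (queries issued, ids of the documents retrieved).
    """
    def search(word):
        # Stand-in for the site's search interface.
        return {doc_id for doc_id, words in docs.items() if word in words}

    issued, retrieved = [], set()
    query = seed
    while query is not None and len(issued) < max_queries:
        issued.append(query)
        retrieved |= search(query)
        # In how many retrieved documents does each unused word occur?
        counts = Counter(
            word for doc_id in retrieved for word in docs[doc_id]
            if word not in issued
        )
        # Most frequent word so far; ties are broken alphabetically.
        query = min(counts, key=lambda w: (-counts[w], w)) if counts else None
    return issued, retrieved
```

For example, with `docs = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d"}, 4: {"e"}}` and seed `"a"`, the crawl issues `a, b, c, d` and retrieves documents 1 to 3. Document 4 is never reached because it shares no word with the retrieved documents, and the last query (`d`) adds nothing new, which is the kind of wasted query that look-ahead search tries to avoid.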

Search for best query word
- Can we do better?
- Intuition
  - Correlation of keywords
  - E.g. linux: debian, redhat, suse, knoppix, fedora, etc.
  - We might save the query word “linux”!

Search for best query word
[Figure labels: Whole document collection; Already retrieved documents; Documents retrieved by $q_i$; Documents retrieved by $q_j$; Documents retrieved by $q_k$]

Search for best query word
[Search-tree labels: linux; debian; redhat]
f(x) = number of documents we get by issuing the queries linux, debian, redhat, minus the overlap between “redhat, linux”, “debian, linux” and “redhat, debian”

Search for best query word
- The search tree is huge (branching factor)
- We look ahead for the 10 most frequent keywords
- We only search up to depth 6
- Pruning

Search for best query word
- DFBnB (depth-first branch and bound)
  - Prune any sub-tree in which the sum of documents retrieved, assuming no overlap between keywords, is less than the current best solution
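A minimal sketch of the pruning rule follows, under several assumptions that are not in the slides: the posting lists are fully known in advance (`postings` maps a word to the ids of its documents), keyword sets are searched as unordered combinations rather than ordered sequences, and the function and variable names are hypothetical.

```python
def dfbnb(postings, depth):
    """Depth-first branch and bound over keyword sets of at most `depth` words.

    postings -- word -> set of document ids containing the word
    Returns (best words, number of documents covered, nodes examined).
    """
    # Largest posting lists first, so the optimistic bound is easy to compute.
    words = sorted(postings, key=lambda w: -len(postings[w]))
    best = {"words": (), "size": 0, "nodes": 0}

    def visit(start, chosen, covered):
        best["nodes"] += 1
        if len(covered) > best["size"]:
            best["words"], best["size"] = tuple(chosen), len(covered)
        remaining = depth - len(chosen)
        if remaining == 0:
            return
        # Optimistic bound: the largest remaining lists, assuming no overlap.
        bound = len(covered) + sum(
            len(postings[w]) for w in words[start:start + remaining]
        )
        if bound <= best["size"]:
            return  # this sub-tree cannot beat the current best solution
        for i in range(start, len(words)):
            visit(i + 1, chosen + [words[i]], covered | postings[words[i]])

    visit(0, [], set())
    return best["words"], best["size"], best["nodes"]
```

With `postings = {"linux": {1, 2, 3, 4}, "debian": {1, 2}, "redhat": {3, 4}, "suse": {5}}` and `depth=2`, the search settles on `linux` plus `suse` (5 documents) and prunes the sub-trees rooted at `debian`, `redhat` and `suse`, whose optimistic bounds do not exceed 5.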

Experiment
- Document collection: ~100K front pages of randomly selected websites
- Query interface: an inverted index (a program that returns documents containing the given query word)
- Methods
  - Greedy
  - DFS search (look ahead for 10 words, up to depth 6)
  - DFS search with pruning (DFBnB)

Results
- Does searching help?
- Keyword sequence 1 (keyword, count): provide 51, work 159, privacy 144, years 172, world 344, list 205, info 1467, map 184, want 57, order 87, people 85, read 56, main 2270, high 95, designed 240, latest 36, events 132, looking 46, send 80, right 380, enter 1285, local 77, browser 1216, questions 77, real 77
- Keyword sequence 2 (keyword, count): provide 51, work 159, privacy 144, years 172, read 101, main 2364, designed 291, info 1455, latest 53, looking 60, send 101, right 402, local 99, world 239, list 142, map 150, want 42, order 69, people 67, high 85, events 126, questions 85, enter 1272, browser 1216, real 77

Results
- Does searching help?

Results
- How much does pruning save?
  - Without pruning: 187300 nodes are examined
    187300 = (10) + (10*9) + (10*9*8) + (10*9*8*7) + (10*9*8*7*6) + (10*9*8*7*6*5)
  - With pruning: 5558 nodes are examined on average (when we choose the most frequent keyword to expand)
  - DFBnB examines ~30 times fewer nodes
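The node count without pruning is the number of ordered keyword sequences of length 1 to 6 drawn from 10 candidates. A short check of that arithmetic, with a hypothetical function name:

```python
from math import perm

def nodes_without_pruning(branching=10, depth=6):
    """Count the nodes of a tree whose level d holds every ordered choice
    of d distinct keywords out of `branching` candidates."""
    return sum(perm(branching, d) for d in range(1, depth + 1))

total = nodes_without_pruning()
print(total)         # 187300
print(total / 5558)  # ratio to the pruned average reported on the slide
```

The ratio comes out a little above 33, consistent with the "~30 times" figure.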

Conclusion
- Searching helps little “in this problem”
- DFBnB is “really effective” in pruning the search tree

More results
- Prior information helps

Results

Search & Greedy

Search with prune & Greedy

Search for best query word
- $base = q_1 \lor \dots \lor q_i$
- $P(base \lor q_{i+1} \lor q_{i+2}) = P(base \lor q_{i+1}) + P(q_{i+2}) - P((base \lor q_{i+1}) \land q_{i+2})$
- $P((base \lor q_{i+1}) \land q_{i+2}) = P((base \land q_{i+2}) \lor (q_{i+1} \land q_{i+2}))$
  $= P(base \land q_{i+2}) + P(q_{i+1} \land q_{i+2}) - P(base \land q_{i+1} \land q_{i+2})$
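The expansion can be checked on concrete document sets by counting documents instead of portions. In this sketch the sets `base`, `d1` and `d2` stand for the documents matched by $base$, $q_{i+1}$ and $q_{i+2}$; the names and sample ids are illustrative only.

```python
def union_size_by_expansion(base, d1, d2):
    """|base ∪ d1 ∪ d2| computed with the two-step expansion."""
    # |(base ∪ d1) ∩ d2| = |base ∩ d2| + |d1 ∩ d2| - |base ∩ d1 ∩ d2|
    overlap = len(base & d2) + len(d1 & d2) - len(base & d1 & d2)
    # |base ∪ d1 ∪ d2| = |base ∪ d1| + |d2| - |(base ∪ d1) ∩ d2|
    return len(base | d1) + len(d2) - overlap

base, d1, d2 = {1, 2, 3}, {3, 4}, {2, 4, 5}
print(union_size_by_expansion(base, d1, d2))  # 5
print(len(base | d1 | d2))                    # 5
```

Dividing each count by the collection size gives the portions used in the formulas.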

2 words overlapping

3 words overlapping
