Automatic Discovery and Classification of Search Interfaces to the Hidden Web
Dean Lee and Richard Sia
Dec 2nd, 2003
[Figure: a search interface accepts search terms and returns a page of search results]
Goals and Motivation
- Hidden Webs are informative
- No current search engine can index them (not even Google)
- Next-generation search engine:
  - Automatic discovery of search interfaces
  - Classification/categorization of hidden websites
  - Generating queries to search interfaces
  - Crawling and indexing of these web pages
Tasks
- Crawling
- Search interface detection
- Domain classification
Crawling
- 2.2M URLs from dmoz; 1.7M eventually crawled
- Crawled in November
- G/4G before/after compression
- Root-level web pages only
Why root-level only?
- 80% of search interfaces are contained in root-level pages (per UIUC)
- Efficient and cost-effective: ~8M web sites vs. ~3B web pages
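The restriction to root-level pages can be sketched as a simple URL normalization step; the function and example URLs below are illustrative assumptions, not the project's actual crawler code.

```python
from urllib.parse import urlparse

def to_root_url(url: str) -> str:
    """Reduce a URL to its root-level page, i.e. scheme://host/."""
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc}/"

# Deduplicating a crawl list down to one root page per site:
urls = [
    "http://example.com/books/search.html",
    "http://example.com/about",
    "https://another.example.org/index.php?q=1",
]
roots = sorted(set(to_root_url(u) for u in urls))
# roots == ["http://example.com/", "https://another.example.org/"]
```

Crawling one root page per site keeps the workload proportional to the ~8M web sites rather than the ~3B web pages.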
Search Interface Classification
- Most search interfaces appear inside <form> tags
- Identify specific features (e.g. keywords, special tags) that are common to all search interfaces
Search Interface Classification
Potential attributes we have considered:
- Action count
- Select count
- Password field
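Extracting the attributes listed above from a page's forms can be sketched with the standard-library HTML parser; the class name and this exact feature set are illustrative assumptions, not the project's implementation.

```python
from html.parser import HTMLParser

class FormFeatureExtractor(HTMLParser):
    """Collect simple form-related features from an HTML page,
    mirroring the attributes named on the slide."""

    def __init__(self):
        super().__init__()
        self.action_count = 0      # number of <form> tags with an action attribute
        self.select_count = 0      # number of <select> elements
        self.has_password = False  # any <input type="password">?

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and "action" in attrs:
            self.action_count += 1
        elif tag == "select":
            self.select_count += 1
        elif tag == "input" and attrs.get("type") == "password":
            self.has_password = True

page = '<form action="/search"><input type="text"><select></select></form>'
fx = FormFeatureExtractor()
fx.feed(page)
# fx.action_count == 1, fx.select_count == 1, fx.has_password == False
```

A password field is a useful negative signal: login forms look structurally like search forms but are not search interfaces.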
Training Sets for C4.5
- Initially, only a positive training set
- Several classification iterations using real web data
- After each iteration, add correct classifications to the positive and negative training sets
- Do the same for misclassified web pages
Training Set
- Three iterations seem sufficient
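The iterative growth of the training sets can be sketched as a bootstrapping loop. This is a minimal sketch under strong assumptions: a trivial one-feature threshold rule stands in for C4.5, and `oracle` stands in for the manual verification of classifications described above.

```python
def train(positives, negatives):
    """Stand-in for C4.5: learn a threshold on one numeric feature
    (e.g. form count). A page is positive if the feature >= threshold."""
    threshold = min(positives) if positives else 1
    return lambda x: x >= threshold

def bootstrap(batches, positives, negatives, oracle):
    """batches: one list of unlabeled feature values per iteration.

    After each iteration, both correct classifications and corrected
    misclassifications are added back into the training sets, so every
    checked page ends up labeled by the oracle."""
    clf = train(positives, negatives)
    for batch in batches:
        for x in batch:
            actual = oracle(x)  # manual check of the classifier's output
            (positives if actual else negatives).append(x)
        clf = train(positives, negatives)  # retrain on the grown sets
    return clf

# Toy run with two iterations of "real web data":
oracle = lambda x: x >= 2
clf = bootstrap([[1, 2], [4, 0]], positives=[3], negatives=[0], oracle=oracle)
# clf(5) is True, clf(1) is False
```

The structure of the loop, not the toy classifier, is the point: each pass feeds verified labels back in, and the slides report three passes were enough.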
Results
- Checked via random sampling: select 100 random web pages and manually verify each classification
- 91.5% accuracy in identifying search interfaces (precision)
- 87.5% accuracy in identifying non-search interfaces
Results
- Random sampling gives an estimate of the number of search interfaces in our data set
- OCLC estimated about 8.7M unique websites in 2003
- Extrapolating gives an upper bound on the total number of search interfaces on the web
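The extrapolation is simple proportional arithmetic. The sample positive rate below is an assumed placeholder, since the slide's actual count did not survive extraction; only the crawl size and the OCLC figure come from the slides.

```python
sites_crawled = 1_700_000       # root pages crawled (from the slides)
sample_size = 100               # random sample, manually checked
positives_in_sample = 4         # ASSUMED value, for illustration only
total_sites_web = 8_700_000     # OCLC's 2003 estimate of unique websites

# Fraction of sites exposing a search interface, estimated from the sample:
rate = positives_in_sample / sample_size

in_dataset = rate * sites_crawled     # estimate for the crawled data set
upper_bound = rate * total_sites_web  # extrapolated web-wide upper bound
# With the assumed rate of 4%: 68,000 in the data set, 348,000 web-wide
```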
Domain Classification
- Manually extract domain-specific keywords
  - Cars: odometer, mileage, airbag, acura, …
  - Books: ISBN, author, title, publication, …
- 240 keywords used
- 4 target categories {Books, Cars, Entertainment, Travel} + "Others"
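Turning the keyword lists into per-domain features can be sketched as below; the keyword sets are the abbreviated examples from the slide, not the full 240, and the function name is an illustrative assumption.

```python
import re

# Abbreviated keyword lists from the slide (illustrative, not all 240):
DOMAIN_KEYWORDS = {
    "Books": {"isbn", "author", "title", "publication"},
    "Cars": {"odometer", "mileage", "airbag", "acura"},
}

def keyword_counts(page_text: str) -> dict:
    """Count domain keyword hits in a page; counts like these would
    feed the classifiers discussed on the following slides."""
    words = set(re.findall(r"[a-z]+", page_text.lower()))
    return {domain: len(words & kws) for domain, kws in DOMAIN_KEYWORDS.items()}

counts = keyword_counts("Search by ISBN, author or title")
# counts == {"Books": 3, "Cars": 0}
```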
Domain Classification
- Naive Bayes classifier
- Bad results:
  - Keywords used were not specific enough to distinguish between domains
  - Websites span several different topics
- Probabilistic
- A pitfall of analysis based on content alone
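A minimal multinomial Naive Bayes over keyword occurrences can be sketched as follows; this is a generic textbook formulation with Laplace smoothing, standing in for whatever implementation the project actually used.

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (word_list, domain) pairs.
    Returns a predict(words) function."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for words, domain in examples:
        class_counts[domain] += 1
        word_counts[domain].update(words)
        vocab.update(words)

    def predict(words):
        best, best_lp = None, float("-inf")
        total_docs = sum(class_counts.values())
        for domain in class_counts:
            lp = math.log(class_counts[domain] / total_docs)  # class prior
            denom = sum(word_counts[domain].values()) + len(vocab)
            for w in words:
                # Laplace (add-one) smoothing for unseen words
                lp += math.log((word_counts[domain][w] + 1) / denom)
            if lp > best_lp:
                best, best_lp = domain, lp
        return best

    return predict

predict = train_nb([
    (["isbn", "author"], "Books"),
    (["mileage", "airbag"], "Cars"),
])
# predict(["isbn", "title"]) -> "Books"
```

With keywords this generic, many real pages score similarly across classes, which matches the slide's observation that the keywords were not specific enough to separate the domains.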
Domain Classification
- C4.5 classification tree
- "Better" results, but more pages are classified as "Others"
- Deterministic
- Improvements needed:
  - More keywords
  - Link structure
  - Analysis of search results
Conclusion
- A tool for automatic search interface detection
- Rough estimate of the total number of search interfaces, i.e. the size of the Hidden Web
- Domain classification still needs improvement
Some Statistics
Precision:
- Books: 34%
- Cars: 41%
- Entertainment: 48%
- Travel: 58%
Some examples:
- Books: …
- Entertainment: …
- Travel: …
- Others: …
- Cars: …