Automatic Discovery and Classification of Search Interfaces to the Hidden Web
Dean Lee and Richard Sia, Dec 2nd, 2003
Goals and Motivation
The Hidden Web is informative, yet no current search engine can index it (not even Google).
Search Interface
(Screenshot: an example search form with search terms.)
Search Results
(Screenshot: the corresponding results page.)
Goals and Motivation
The Hidden Web is informative, yet no current search engine can index it (not even Google). Toward a next-generation search engine:
- Automatic discovery of search interfaces
- Classification/categorization of hidden websites
- Generating queries to search interfaces
- Crawling and indexing of these web pages
Tasks
- Crawling
- Search interface detection
- Domain classification
Crawling
- 2.2M URLs from dmoz; 1.7M eventually crawled
- Crawled in November 2003
- 20 GB before / 4 GB after compression
- Root-level web pages only, e.g. http://www.ucla.edu
Why root-level only?
- 80% of search interfaces are contained in root-level pages (per a UIUC study)
- Efficient and cost-effective: roughly 3B web pages versus 8M web sites
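Restricting the crawl to root-level pages amounts to reducing every URL to its scheme and host. A minimal sketch of that reduction (the function name `root_url` is my own, not from the slides):

```python
from urllib.parse import urlparse

def root_url(url: str) -> str:
    """Reduce a URL to its root-level page (scheme + host only)."""
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc}/"

print(root_url("http://www.ucla.edu/some/deep/page.html"))
# -> http://www.ucla.edu/
```

Deduplicating crawled URLs through such a function is what shrinks the problem from billions of pages to millions of sites.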
Search Interface Classification
Most search interfaces live inside <form> tags. We identify specific features (e.g. keywords, special tags) that are common across search interfaces.
Search Interface Classification
Potential attributes we have considered:
- Action count
- Select count
- Password field
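The attributes above can be pulled from a page with a simple HTML parse. A sketch using Python's standard-library parser (the class name and exact feature definitions are illustrative, not the authors' implementation):

```python
from html.parser import HTMLParser

class FormFeatureExtractor(HTMLParser):
    """Count form-related features used to decide whether a page
    holds a search interface."""
    def __init__(self):
        super().__init__()
        self.action_count = 0      # <form> tags with an action attribute
        self.select_count = 0      # <select> drop-downs
        self.has_password = False  # password inputs suggest a login, not search

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and "action" in attrs:
            self.action_count += 1
        elif tag == "select":
            self.select_count += 1
        elif tag == "input" and attrs.get("type") == "password":
            self.has_password = True

page = '<form action="/search"><input type="text" name="q"></form>'
fx = FormFeatureExtractor()
fx.feed(page)
print(fx.action_count, fx.select_count, fx.has_password)  # 1 0 False
```

Feature vectors like (action count, select count, password field) are what the C4.5 classifier below is trained on.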
Training sets for C4.5
Start with only a positive training set, then run several classification iterations over real web data. After each iteration, add the correctly classified web pages to the positive and negative training sets; misclassified pages are added the same way, under their true labels.
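The iterative bootstrapping described above can be sketched as follows; `classify` stands in for a trained C4.5 model and `label_manually` for the human verification step (both are hypothetical placeholders, not the authors' code):

```python
def build_training_sets(pages, classify, label_manually, iterations=3):
    """Grow positive and negative training sets over several iterations,
    feeding every verified page back in under its true label."""
    positives, negatives = [], []
    for it in range(iterations):
        mistakes = 0
        for page in pages:
            truth = label_manually(page)  # manual check on real web data
            if classify(page, positives, negatives) != truth:
                mistakes += 1
            # Correct and misclassified pages are both added back with
            # their true label, sharpening the next iteration's model.
            (positives if truth else negatives).append(page)
        print(f"iteration {it + 1}: {mistakes} misclassified")
    return positives, negatives

# Toy usage with trivial stand-ins:
pages = ["page with a search form", "plain article page"]
has_form = lambda p, pos=None, neg=None: "form" in p
pos, neg = build_training_sets(pages, has_form, lambda p: "form" in p,
                               iterations=1)
print(pos, neg)
```

The slides report that the misclassification rate stabilizes quickly, which is why three iterations suffice.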
Training set
Three iterations appear to be sufficient.
Results
Checked via random sampling: select 100 random web pages and manually verify the correctness of the classification.
- 91.5% accuracy correctly identifying search interfaces (precision)
- 87.5% accuracy correctly identifying non-search interfaces
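The reported figures are simple sample proportions. A sketch of the computation (the counts below are hypothetical, chosen only to reproduce the 91.5% figure; the slides do not give the raw counts):

```python
def precision(correct: int, sampled: int) -> float:
    """Fraction of manually checked pages whose predicted label was correct."""
    return correct / sampled

# Hypothetical example: 183 of 200 pages predicted as search
# interfaces turn out to really be search interfaces.
print(round(precision(183, 200) * 100, 1))  # -> 91.5
```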
Results
Random sampling estimate: 124,311 search interfaces exist in our data set. OCLC estimated about 8.7M unique websites in 2003, which yields an upper bound on the total number of search interfaces on the web.
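The upper bound follows from scaling the per-site interface rate in the crawled sample up to OCLC's site count. A back-of-the-envelope sketch using the numbers the slides do give (the resulting figure is my extrapolation, not one stated on the slides):

```python
interfaces_found = 124_311    # estimated search interfaces in the data set
sites_crawled = 1_700_000     # sites actually crawled (from the crawl slide)
total_sites = 8_700_000       # OCLC's 2003 estimate of unique websites

rate = interfaces_found / sites_crawled        # interfaces per crawled site
upper_bound = rate * total_sites               # extrapolate to the whole web
print(f"{upper_bound:,.0f}")  # -> 636,180
```

This is an upper bound in the sense that dmoz-listed sites are likely richer in search interfaces than the web at large.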
Domain Classification
Manually extract domain-specific keywords:
- Cars: odometer, mileage, airbag, acura, …
- Books: ISBN, author, title, publication, …
240 keywords used; 4 target categories {Books, Cars, Entertainment, Travel} plus "Others".
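A minimal sketch of keyword-based domain assignment, using only the example keywords from the slide (the full 240-keyword list is not given, and the most-hits rule here is an assumption, not necessarily the authors' scheme):

```python
DOMAIN_KEYWORDS = {
    "Books": {"isbn", "author", "title", "publication"},
    "Cars": {"odometer", "mileage", "airbag", "acura"},
}

def classify_domain(page_text: str) -> str:
    """Assign the domain with the most keyword hits; 'Others' if none match."""
    words = set(page_text.lower().split())
    best, hits = "Others", 0
    for domain, keywords in DOMAIN_KEYWORDS.items():
        n = len(words & keywords)
        if n > hits:
            best, hits = domain, n
    return best

print(classify_domain("Search by author, title or ISBN"))
```

Pages matching no keyword at all fall into the catch-all "Others" category.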
Domain Classification
Naive Bayes classifier gave poor results:
- The keywords used were not specific enough to distinguish between domains
- Websites span multiple topics
- Probabilistic
- A trap of analysis based on content only
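For reference, a generic multinomial Naive Bayes over word counts looks roughly like this (a self-contained sketch of the standard algorithm, not the authors' setup or training data):

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal multinomial Naive Bayes text classifier with Laplace smoothing."""
    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        self.vocab = set()
        for doc, label in zip(docs, labels):
            for w in doc.lower().split():
                self.word_counts[label][w] += 1
                self.vocab.add(w)

    def predict(self, doc):
        best, best_lp = None, float("-inf")
        n_docs = sum(self.label_counts.values())
        for label, n in self.label_counts.items():
            lp = math.log(n / n_docs)  # class prior
            total = sum(self.word_counts[label].values())
            for w in doc.lower().split():
                # Laplace smoothing keeps unseen words from zeroing the score
                lp += math.log((self.word_counts[label][w] + 1)
                               / (total + len(self.vocab)))
            if lp > best_lp:
                best, best_lp = label, lp
        return best

nb = NaiveBayes()
nb.fit(["isbn author title", "mileage odometer airbag"], ["Books", "Cars"])
print(nb.predict("author and isbn search"))  # -> Books
```

With keywords this generic, many real pages score similarly under every class, which matches the poor results the slide reports.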
Domain Classification
C4.5 classification tree gave "better" results, though more pages are classified as "Others"; it is deterministic. Improvements needed:
- More keywords
- Link structure
- Analysis of search results
Conclusion
- A tool for automatic search interface detection
- A rough estimate of the total number of search interfaces, i.e. the size of the Hidden Web
- Domain classification still needs improvement
Some statistics
Precision by category:
- Books: 34%
- Cars: 41%
- Entertainment: 48%
- Travel: 58%
Some examples:
- http://www.barnesandnoble.com – Books
- http://www.amazon.com – Entertainment
- http://www.travelocity.com – Travel
- http://www.cnn.com – Others
- http://www.latimes.com – Cars
- http://www.nih.gov – Travel
- http://www.healthfinder.gov – Others