Automatic Discovery and Classification of Search Interfaces to the Hidden Web. Dean Lee and Richard Sia, Dec 2nd 2003.


1 Automatic Discovery and Classification of Search Interfaces to the Hidden Web. Dean Lee and Richard Sia, Dec 2nd 2003

2 Goals and Motivation
- The Hidden Web is rich in information
- No current search engine can index it (not even Google)

3 Search Interface [screenshot: a sample form with search-term fields]

4 Search Results [screenshot: the results returned for the query]

5 Goals and Motivation
- The Hidden Web is rich in information
- No current search engine can index it (not even Google)
Toward a next-generation search engine:
- Automatic discovery of search interfaces
- Classification/categorization of hidden websites
- Generating queries to search interfaces
- Crawling and indexing of these web pages

6 Tasks
- Crawling
- Search interface detection
- Domain classification

7 Crawling
- 2.2M URLs from dmoz; 1.7M eventually crawled, in November 2003
- 20 GB / 4 GB before/after compression
- Root-level web pages only, e.g. http://www.ucla.edu
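The root-level restriction can be implemented by normalizing every discovered URL to its site root before fetching. A minimal sketch using Python's urllib.parse (the helper name is ours, not from the slides):

```python
from urllib.parse import urlparse

def root_url(url):
    """Reduce any URL to its root-level page, e.g.
    http://www.ucla.edu/dept/cs/index.html -> http://www.ucla.edu/"""
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc}/"

print(root_url("http://www.ucla.edu/dept/cs/index.html"))  # http://www.ucla.edu/
```

Deduplicating the normalized roots before crawling is what turns a URL list (2.2M) into a much smaller set of sites to fetch.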

8 Why root-level only?
- 80% of search interfaces are contained in root-level pages (per UIUC)
- Efficient and cost-effective: ~8M web sites versus ~3B web pages

9 Search Interface Classification
- Most search interfaces are contained inside <form> tags
- Identify specific features (e.g. keywords, special tags) that are common across search interfaces

10 Search Interface Classification Potential attributes we’ve considered

11 Action count

12 Select count

13 Password field
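The attributes on slides 11-13 can be computed with a small HTML parser. A sketch using Python's stdlib html.parser (class and function names are ours; the actual feature set fed to C4.5 was presumably larger):

```python
from html.parser import HTMLParser

class FormFeatureExtractor(HTMLParser):
    """Counts form-related features used to decide whether a page
    contains a search interface (an illustrative sketch, not the
    authors' code)."""
    def __init__(self):
        super().__init__()
        self.action_count = 0     # <form action=...> occurrences
        self.select_count = 0     # <select> drop-downs
        self.password_fields = 0  # <input type="password"> suggests login, not search

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and "action" in attrs:
            self.action_count += 1
        elif tag == "select":
            self.select_count += 1
        elif tag == "input" and attrs.get("type") == "password":
            self.password_fields += 1

def extract_features(html):
    """Return the (action, select, password) feature triple for a page."""
    parser = FormFeatureExtractor()
    parser.feed(html)
    return (parser.action_count, parser.select_count, parser.password_fields)
```

A page with a form but a password field is likely a login box rather than a search interface, which is why the password count is a useful negative signal.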

14 Training Sets for C4.5
- Initially, only a positive training set
- Several classification iterations using real web data
- After each iteration, add the correctly classified web pages to the positive and negative training sets
- Do the same for misclassified web pages (with corrected labels)

15 Training Set: 3 iterations seem sufficient
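The iterative construction of the training sets can be sketched as a bootstrap loop. Here `classifier` and `verify` (a manual-inspection oracle) are stand-ins for C4.5 and human checking; all names and the interface are our assumptions, not the authors' code:

```python
def bootstrap_training(classifier, positives, batches, verify):
    """Grow the C4.5 training sets iteratively (slide 14): train,
    classify a batch of real web pages, manually verify the results,
    and fold the verified pages back into the training sets."""
    pos, neg = list(positives), []   # initially only positive examples
    for batch in batches:            # one batch of real pages per iteration
        classifier.train(pos, neg)
        for page in batch:
            predicted = classifier.predict(page)  # current model's guess
            actual = verify(page)    # manual check of the classification
            # correctly classified pages and (relabeled) misclassified
            # pages both enter the training sets under verified labels
            (pos if actual else neg).append(page)
    return pos, neg
```

Per slide 15, three such iterations were enough for the classifier to stabilize.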

16 Results
- Checked via random sampling: select 100 random web pages and manually verify the classification
- 91.5% of pages classified as search interfaces are correct (precision)
- 87.5% of pages classified as non-search interfaces are correct

17 Results
- Random-sampling estimate: 124,311 search interfaces exist in our data set
- OCLC estimated about 8.7M unique websites in 2003
- Extrapolating gives an upper bound on the total number of search interfaces on the web
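The extrapolation is simple proportional scaling; a quick check of the arithmetic (assuming the 1.7M crawled sites are the denominator, which the slide does not state explicitly):

```python
found = 124_311            # search interfaces detected in the crawl
crawled_sites = 1_700_000  # sites actually crawled (slide 7)
total_sites = 8_700_000    # OCLC's 2003 estimate of unique websites

upper_bound = found / crawled_sites * total_sites
print(f"upper bound: {upper_bound:,.0f} search interfaces")  # roughly 636,000
```

Since the crawl covered root-level pages only, the true total could differ; this is only the scaling implied by the numbers on the slides.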


19 Domain Classification
- Manually extract domain-specific keywords
  - Cars: odometer, mileage, airbag, acura, ...
  - Books: ISBN, author, title, publication, ...
- 240 keywords used
- 4 target categories {Books, Cars, Entertainment, Travel}, plus "Others"
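Matching a page against the hand-picked keyword lists can be sketched as follows (only a hypothetical subset of the 240 keywords is shown):

```python
# Hypothetical subset of the hand-picked per-domain keyword lists
DOMAIN_KEYWORDS = {
    "Books": {"isbn", "author", "title", "publication"},
    "Cars": {"odometer", "mileage", "airbag", "acura"},
}

def keyword_counts(page_text, domains=DOMAIN_KEYWORDS):
    """Count per-domain keyword hits in a page's text; these counts
    are the features fed to the downstream domain classifier."""
    words = page_text.lower().split()
    return {domain: sum(w in kws for w in words)
            for domain, kws in domains.items()}
```

For example, `keyword_counts("Search by ISBN or author")` scores 2 for Books and 0 for Cars.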

20 Domain Classification
- Naive Bayes classifier: poor results
- The keywords used are not specific enough to distinguish between domains
- Websites span multiple topics
- Probabilistic; illustrates the trap of analysis based on content only
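A minimal multinomial Naive Bayes over keyword tokens might look like the following. This is our own stand-in implementation, not the authors'; the slides report that this approach performed poorly precisely because the keywords overlapped across domains:

```python
import math
from collections import Counter

def train_nb(docs_by_class):
    """Train multinomial Naive Bayes with Laplace smoothing.
    docs_by_class maps a domain name to a list of token lists."""
    vocab = {t for docs in docs_by_class.values() for doc in docs for t in doc}
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    model = {}
    for cls, docs in docs_by_class.items():
        counts = Counter(t for doc in docs for t in doc)
        total = sum(counts.values())
        model[cls] = (
            math.log(len(docs) / total_docs),                     # class prior
            {t: math.log((counts[t] + 1) / (total + len(vocab)))  # smoothed
             for t in vocab},                                     # likelihoods
            math.log(1 / (total + len(vocab))),                   # unseen token
        )
    return model

def classify_nb(model, tokens):
    """Pick the class with the highest log-posterior for the tokens."""
    def score(cls):
        prior, likelihood, unseen = model[cls]
        return prior + sum(likelihood.get(t, unseen) for t in tokens)
    return max(model, key=score)
```

With shared, ambiguous tokens in both classes' training data, the posteriors come out close together, which is the failure mode the slide describes.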

21 Domain Classification
- C4.5 classification tree: "better" results, though more pages are classified as "Others"
- Deterministic
- Improvements needed: more keywords, link-structure analysis, analysis of search results

22 Conclusion
- A tool for automatic search-interface detection
- A rough estimate of the total number of search interfaces, and hence of the size of the Hidden Web
- Domain classification still needs improvement

23 Some statistics
Precision:
- Books: 34%
- Cars: 41%
- Entertainment: 48%
- Travel: 58%
Some examples:
- http://www.barnesandnoble.com – Books
- http://www.amazon.com – Entertainment
- http://www.travelocity.com – Travel
- http://www.cnn.com – Others
- http://www.latimes.com – Cars
- http://www.nih.gov – Travel
- http://www.healthfinder.gov – Others


