Automatic Discovery and Classification of Search Interfaces to the Hidden Web
Dean Lee and Richard Sia, Dec 2nd, 2003
Goals and Motivation
The Hidden Web is informative, yet no current search engine can index it (not even Google).
Search Interface
(Screenshot: an example search form with search terms.)
Search Results
(Screenshot: the corresponding results page.)
Goals and Motivation
The Hidden Web is informative, yet no current search engine can index it (not even Google). Toward a next-generation search engine:
- Automatic discovery of search interfaces
- Classification/categorization of hidden websites
- Generating queries to search interfaces
- Crawling and indexing of these web pages
Tasks
- Crawling
- Search interface detection
- Domain classification
Crawling
- 2.2M URLs from dmoz; 1.7M eventually crawled
- Crawled in November 2003
- 20 GB before / 4 GB after compression
- Root-level web pages only, e.g. http://www.ucla.edu
Why root-level only?
- 80% of search interfaces are contained in root-level pages (per a UIUC study)
- Efficient and cost-effective: roughly 3B web pages versus 8M web sites
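Restricting the crawl to root-level pages amounts to reducing every URL to its scheme and host. A minimal sketch of that reduction (the function name `root_url` is my own, not from the slides):

```python
from urllib.parse import urlparse

def root_url(url: str) -> str:
    """Reduce a URL to its root-level page (scheme + host only)."""
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc}/"

print(root_url("http://www.ucla.edu/some/deep/page.html"))
# -> http://www.ucla.edu/
```

Deduplicating crawled URLs through such a function is what shrinks the problem from billions of pages to millions of sites.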
Search Interface Classification
Most search interfaces live inside <form> tags. We identify specific features (e.g. keywords, special tags) that are common across search interfaces.
Search Interface Classification
Potential attributes we have considered:
- Action count
- Select count
- Password field
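The attributes above can be pulled from a page with a simple HTML parse. A sketch using Python's standard-library parser (the class name and exact feature definitions are illustrative, not the authors' implementation):

```python
from html.parser import HTMLParser

class FormFeatureExtractor(HTMLParser):
    """Count form-related features used to decide whether a page
    holds a search interface."""
    def __init__(self):
        super().__init__()
        self.action_count = 0      # <form> tags with an action attribute
        self.select_count = 0      # <select> drop-downs
        self.has_password = False  # password inputs suggest a login, not search

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and "action" in attrs:
            self.action_count += 1
        elif tag == "select":
            self.select_count += 1
        elif tag == "input" and attrs.get("type") == "password":
            self.has_password = True

page = '<form action="/search"><input type="text" name="q"></form>'
fx = FormFeatureExtractor()
fx.feed(page)
print(fx.action_count, fx.select_count, fx.has_password)  # 1 0 False
```

Feature vectors like (action count, select count, password field) are what the C4.5 classifier below is trained on.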
Training sets for C4.5
Start with only a positive training set, then run several classification iterations over real web data. After each iteration, add the correctly classified web pages to the positive and negative training sets; misclassified pages are added the same way, under their true labels.
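The iterative bootstrapping described above can be sketched as follows; `classify` stands in for a trained C4.5 model and `label_manually` for the human verification step (both are hypothetical placeholders, not the authors' code):

```python
def build_training_sets(pages, classify, label_manually, iterations=3):
    """Grow positive and negative training sets over several iterations,
    feeding every verified page back in under its true label."""
    positives, negatives = [], []
    for it in range(iterations):
        mistakes = 0
        for page in pages:
            truth = label_manually(page)  # manual check on real web data
            if classify(page, positives, negatives) != truth:
                mistakes += 1
            # Correct and misclassified pages are both added back with
            # their true label, sharpening the next iteration's model.
            (positives if truth else negatives).append(page)
        print(f"iteration {it + 1}: {mistakes} misclassified")
    return positives, negatives

# Toy usage with trivial stand-ins:
pages = ["page with a search form", "plain article page"]
has_form = lambda p, pos=None, neg=None: "form" in p
pos, neg = build_training_sets(pages, has_form, lambda p: "form" in p,
                               iterations=1)
print(pos, neg)
```

The slides report that the misclassification rate stabilizes quickly, which is why three iterations suffice.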
Training set
Three iterations appear to be sufficient.
Results
Checked via random sampling: select 100 random web pages and manually verify the correctness of the classification.
- 91.5% accuracy correctly identifying search interfaces (precision)
- 87.5% accuracy correctly identifying non-search interfaces
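The reported figures are simple sample proportions. A sketch of the computation (the counts below are hypothetical, chosen only to reproduce the 91.5% figure; the slides do not give the raw counts):

```python
def precision(correct: int, sampled: int) -> float:
    """Fraction of manually checked pages whose predicted label was correct."""
    return correct / sampled

# Hypothetical example: 183 of 200 pages predicted as search
# interfaces turn out to really be search interfaces.
print(round(precision(183, 200) * 100, 1))  # -> 91.5
```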
Results
Random sampling estimate: 124,311 search interfaces exist in our data set. OCLC estimated about 8.7M unique websites in 2003, which yields an upper bound on the total number of search interfaces on the web.
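The upper bound follows from scaling the per-site interface rate in the crawled sample up to OCLC's site count. A back-of-the-envelope sketch using the numbers the slides do give (the resulting figure is my extrapolation, not one stated on the slides):

```python
interfaces_found = 124_311    # estimated search interfaces in the data set
sites_crawled = 1_700_000     # sites actually crawled (from the crawl slide)
total_sites = 8_700_000       # OCLC's 2003 estimate of unique websites

rate = interfaces_found / sites_crawled        # interfaces per crawled site
upper_bound = rate * total_sites               # extrapolate to the whole web
print(f"{upper_bound:,.0f}")  # -> 636,180
```

This is an upper bound in the sense that dmoz-listed sites are likely richer in search interfaces than the web at large.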
Domain Classification
Manually extract domain-specific keywords:
- Cars: odometer, mileage, airbag, acura, …
- Books: ISBN, author, title, publication, …
240 keywords used; 4 target categories {Books, Cars, Entertainment, Travel} plus "Others".
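A minimal sketch of keyword-based domain assignment, using only the example keywords from the slide (the full 240-keyword list is not given, and the most-hits rule here is an assumption, not necessarily the authors' scheme):

```python
DOMAIN_KEYWORDS = {
    "Books": {"isbn", "author", "title", "publication"},
    "Cars": {"odometer", "mileage", "airbag", "acura"},
}

def classify_domain(page_text: str) -> str:
    """Assign the domain with the most keyword hits; 'Others' if none match."""
    words = set(page_text.lower().split())
    best, hits = "Others", 0
    for domain, keywords in DOMAIN_KEYWORDS.items():
        n = len(words & keywords)
        if n > hits:
            best, hits = domain, n
    return best

print(classify_domain("Search by author, title or ISBN"))
```

Pages matching no keyword at all fall into the catch-all "Others" category.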
Domain Classification
Naive Bayes classifier gave poor results:
- The keywords used were not specific enough to distinguish between domains
- Websites span multiple topics
- Probabilistic
- A trap of analysis based on content only
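For reference, a generic multinomial Naive Bayes over word counts looks roughly like this (a self-contained sketch of the standard algorithm, not the authors' setup or training data):

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal multinomial Naive Bayes text classifier with Laplace smoothing."""
    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        self.vocab = set()
        for doc, label in zip(docs, labels):
            for w in doc.lower().split():
                self.word_counts[label][w] += 1
                self.vocab.add(w)

    def predict(self, doc):
        best, best_lp = None, float("-inf")
        n_docs = sum(self.label_counts.values())
        for label, n in self.label_counts.items():
            lp = math.log(n / n_docs)  # class prior
            total = sum(self.word_counts[label].values())
            for w in doc.lower().split():
                # Laplace smoothing keeps unseen words from zeroing the score
                lp += math.log((self.word_counts[label][w] + 1)
                               / (total + len(self.vocab)))
            if lp > best_lp:
                best, best_lp = label, lp
        return best

nb = NaiveBayes()
nb.fit(["isbn author title", "mileage odometer airbag"], ["Books", "Cars"])
print(nb.predict("author and isbn search"))  # -> Books
```

With keywords this generic, many real pages score similarly under every class, which matches the poor results the slide reports.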
Domain Classification
C4.5 classification tree gave "better" results, though more pages are classified as "Others"; it is deterministic. Improvements needed:
- More keywords
- Link structure
- Analysis of search results
Conclusion
- A tool for automatic search interface detection
- A rough estimate of the total number of search interfaces, i.e. the size of the Hidden Web
- Domain classification still needs improvement
Some statistics
Precision by category:
- Books: 34%
- Cars: 41%
- Entertainment: 48%
- Travel: 58%
Some examples:
- http://www.barnesandnoble.com – Books
- http://www.amazon.com – Entertainment
- http://www.travelocity.com – Travel
- http://www.cnn.com – Others
- http://www.latimes.com – Cars
- http://www.nih.gov – Travel
- http://www.healthfinder.gov – Others