Probe, Count, and Classify: Categorizing Hidden Web Databases Panagiotis G. Ipeirotis Luis Gravano Columbia University Mehran Sahami E.piphany Inc.
Surface Web vs. Hidden Web Surface Web Link structure Crawlable Hidden Web No link structure Documents hidden behind search forms
Do We Need the Hidden Web? Example: PubMed/MEDLINE. PubMed (www.ncbi.nlm.nih.gov/PubMed) search for "cancer": 1,341,586 matches. AltaVista search for "cancer site:www.ncbi.nlm.nih.gov": 21,830 matches. Surface Web: ~2 billion pages. Hidden Web: ~500 billion pages (?)
Interacting With Searchable Text Databases Searching: metasearchers. Browsing: Yahoo!-like web directories: InvisibleWeb.com, SearchEngineGuide.com. Example from InvisibleWeb.com: Health > Publications > PubMed. Created manually!
Classifying Text Databases Automatically: Outline Definition of classification Classification through query probing Experiments
Database Classification: Two Definitions Coverage-based classification: the database contains many documents about a category. Coverage: number of documents about this category. Specificity-based classification: the database contains mainly documents about a category. Specificity: number of documents about this category / |DB|
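The two criteria can be sketched in a few lines of Python. This is a hypothetical illustration only: in reality the per-category document counts of a hidden-web database are not directly observable, which is the problem the rest of the talk addresses.

```python
# Hypothetical sketch of the two criteria, assuming we could observe
# the per-category document counts of a database directly.
def coverage(category_counts, category):
    """Coverage: absolute number of documents about a category."""
    return category_counts.get(category, 0)

def specificity(category_counts, category):
    """Specificity: fraction of the database about a category."""
    total = sum(category_counts.values())
    return category_counts.get(category, 0) / total if total else 0.0

counts = {"Basketball": 400, "Baseball": 50, "Health": 50}
print(coverage(counts, "Basketball"))     # 400
print(specificity(counts, "Basketball"))  # 0.8
```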
Database Classification: An Example Category: Basketball Coverage-based classification ESPN.com, NBA.com, not KnicksTerritory.com Specificity-based classification NBA.com, KnicksTerritory.com, not ESPN.com
Database Classification: More Details Thresholds for coverage and specificity: Tc: coverage threshold (e.g., 100); Ts: specificity threshold (e.g., 0.5). Tc and Ts are editorial choices. Example hierarchy: Root has SPORTS (C=800, S=0.8), with children BASEBALL (S=0.5) and BASKETBALL (S=0.5), and HEALTH (C=200, S=0.2). Ideal(D): the set of classes for database D. Class C is in Ideal(D) if: D has enough coverage and specificity (Tc, Ts) for C and all of C's ancestors, and D fails to have both enough coverage and specificity for each child of C
From Document to Database Classification If we know the categories of all documents inside the database, we are done! We do not have direct access to the documents. Databases do not export such data! How can we extract this information?
Our Approach: Query Probing 1. Train a rule-based document classifier. 2. Transform classifier rules into queries. 3. Adaptively send queries to databases. 4. Categorize the databases based on adjusted number of query matches.
Training a Rule-based Document Classifier Feature selection: Zipf's law pruning, followed by information-theoretic feature selection [Koller & Sahami 1996] Classifier learning: AT&T's RIPPER [Cohen 1995] Input: a set of pre-classified, labeled documents Output: a set of classification rules IF linux THEN Computers IF jordan AND bulls THEN Sports IF lung AND cancer THEN Health
Constructing Query Probes Transform each rule into a query: IF lung AND cancer THEN health becomes +lung +cancer; IF linux THEN computers becomes +linux. Send the queries to the database. Get the number of matches for each query, NOT the documents (i.e., the number of documents that match each rule). These documents would have been classified by the rule under its associated category!
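The rule-to-probe transformation can be sketched as follows; the (antecedent, category) tuple format is an assumption for illustration, with the rules taken from the examples on the slide.

```python
# Sketch of turning RIPPER-style rules into boolean AND queries.
# The (antecedent, category) tuple format is an assumption for
# illustration; the rules follow the examples on the slide.
rules = [
    ({"lung", "cancer"}, "Health"),
    ({"linux"}, "Computers"),
    ({"jordan", "bulls"}, "Sports"),
]

def rule_to_probe(antecedent):
    """A conjunction of terms becomes an AND query: +term1 +term2 ..."""
    return " ".join("+" + term for term in sorted(antecedent))

# Each probe's match count estimates how many documents the rule
# would have classified under its associated category.
probes = [(rule_to_probe(terms), category) for terms, category in rules]
print(probes[0])  # ('+cancer +lung', 'Health')
```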
Adjusting Query Results Classifiers are not perfect! Queries do not retrieve all the documents in a category, and queries for one category match documents not in that category. From the classifier's training phase we know its confusion matrix
Confusion Matrix The confusion matrix M (rows: class documents are classified into; columns: correct class) relates the real contents of the database to the probing results: M . Coverage(D) ~ ECoverage(D). For example, if 10% of Sports documents are classified as Computers, then 10% of the 5000 Sports docs are counted under Computers
Confusion Matrix Adjustment: Compensating for the Classifier's Errors Coverage(D) ~ M^-1 . ECoverage(D). M is diagonally dominant, hence invertible. Multiplication by M^-1 better approximates the correct result
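A minimal numerical sketch of the adjustment, assuming NumPy; the 3x3 confusion matrix and match counts below are made up for illustration, not taken from the paper.

```python
import numpy as np

# Illustrative 3x3 confusion matrix M (entry M[i][j]: fraction of
# class-j documents that the classifier assigns to class i).
# The numbers are made up for this sketch, not taken from the paper.
M = np.array([
    [0.80, 0.10, 0.05],   # assigned to Computers
    [0.10, 0.85, 0.05],   # assigned to Sports
    [0.10, 0.05, 0.90],   # assigned to Health
])
ecoverage = np.array([600.0, 4400.0, 5000.0])  # raw probe match counts

# ECoverage(D) ~ M . Coverage(D); since M is diagonally dominant
# (hence invertible), solve the linear system to recover the
# adjusted coverage estimate for each category.
adjusted = np.linalg.solve(M, ecoverage)
print(adjusted)
```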
Classifying a Database 1. Send the query probes for the top-level categories 2. Get the number of matches for each probe 3. Calculate Specificity and Coverage for each category 4. Push the database to the qualifying categories (with Specificity>Ts and Coverage>Tc) 5. Repeat for each of the qualifying categories 6. Return the classes that satisfy the coverage/specificity conditions The result is the Approximation of the Ideal classification
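The top-down loop above can be sketched as a recursive function. Here `probe_counts(db, c)` is a hypothetical stand-in for sending category c's query probes to database db and returning its adjusted coverage and specificity.

```python
# A minimal sketch of the top-down classification loop.
# `probe_counts(db, c)` is a hypothetical stand-in for sending
# category c's probes and returning (adjusted coverage, specificity).
def classify(db, category, hierarchy, probe_counts, Tc=100, Ts=0.5):
    """Return the approximate Ideal(D) classes under `category`."""
    result = []
    for child in hierarchy.get(category, []):
        cov, spec = probe_counts(db, child)
        if cov >= Tc and spec >= Ts:
            deeper = classify(db, child, hierarchy, probe_counts, Tc, Ts)
            # Keep the child itself only when no descendant qualifies,
            # matching the Ideal(D) definition.
            result.extend(deeper if deeper else [child])
    return result
```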
Real Example: ACM Digital Library (Tc=100, Ts=0.5)
Experiments: Data 72-node 4-level topic hierarchy from InvisibleWeb/Yahoo! (54 leaf nodes) 500,000 Usenet articles (April-May 2000): Newsgroups assigned by hand to hierarchy nodes RIPPER trained with 54,000 articles (1,000 articles per leaf) 27,000 articles used to construct estimations of the confusion matrices Remaining 419,000 articles used to build 500 Controlled Databases of varying category mixes, size
Comparison With Alternatives DS: Random sampling of documents via query probes Callan et al., SIGMOD99 Different task: Gather vocabulary statistics We adapted it for database classification TQ: Title-based Probing Yu et al., WISE 2000 Query probes are simply the category names
Experiments: Metrics Accuracy of classification results: Expanded(N) = N and all its descendants. Correct = Expanded(Ideal(D)); Classified = Expanded(Approximate(D)). Precision = |Correct ∩ Classified| / |Classified|. Recall = |Correct ∩ Classified| / |Correct|. F-measure = 2 . Precision . Recall / (Precision + Recall). Cost of classification: number of queries sent to the database
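The metrics can be computed directly over the expanded category sets; a sketch, with a made-up example pair of category sets:

```python
# Sketch of the evaluation metrics over the expanded category sets.
def metrics(correct, classified):
    """Precision, recall, and F-measure for two sets of categories."""
    overlap = len(correct & classified)
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(classified)
    recall = overlap / len(correct)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# Hypothetical example: expanded ideal vs. approximate classification.
correct = {"Root", "Sports", "Basketball"}
classified = {"Root", "Sports", "Baseball"}
p, r, f = metrics(correct, classified)  # each equals 2/3
```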
Experimental Results: Controlled Databases Feature selection helps. Confusion-matrix adjustment helps. F-measure above 0.8 for most combinations. Results degrade gracefully with hierarchy depth. Relatively small number of probes needed for most combinations tried. Also, probes are short: 1.5 words on average; 4 words maximum. Both better performance and lower cost than DS [Callan et al. adaptation] and TQ [Yu et al.]
Web Databases 130 real databases classified from InvisibleWeb. Used InvisibleWeb's categorization as correct. Simple wrappers for querying (only the number of matches is needed). The Ts, Tc thresholds are not known (unlike with the Controlled databases) but are implicit in the InvisibleWeb categorization. We can learn/validate the thresholds (tricky but easy!). More details in the paper!
Web Databases: Learning Thresholds
Experimental Results: Web Databases 130 Real Web Databases. F-measure above 0.7 for best combination learned. 185 query probes per database on average needed for classification. Also, probes are short: 1.5 words on average; 4 words maximum.
Conclusions Accurate classification using only a small number of short queries No need for document retrieval Only need a result like: X matches found No need for any cooperation or special metadata from databases
Current and Future Work Build wrappers automatically Extend to non-topical categories Evaluate impact of varying search interfaces (e.g., Boolean vs. ranked) Extend to other classifiers (e.g., SVMs or Bayesian models) Integrate with searching (connection with database selection?)
Contributions Easy, inexpensive method for database classification Uses results from document classification Indirect classification of the documents in a database Does not inspect documents, only the number of matches Adjustment of results according to the classifier's performance Easy wrapper construction No need for any metadata from the database
Related Work Callan et al., SIGMOD 1999 Gauch et al., ProFusion Dolin et al., Pharos Yu et al., WISE 2000 Raghavan and Garcia-Molina, VLDB 2001
Controlled Databases 500 databases built using 419,000 newsgroup articles One label per document 350 databases with a single (not necessarily leaf) category 150 databases with varying category mixes Database size ranges from 25 to 25,000 articles Indexed and queried using SMART
F-measure for Different Hierarchy Depths PnC = Probe & Count, DS = Document Sampling, TQ = Title-based Probing (Tc=8, Ts=0.3)
Query Probes Per Controlled Database
Web Databases: Number of Query Probes
Real Confusion Matrix for Top Node of Hierarchy (categories: Health, Sports, Science, Computers, Arts)