Probe, Count, and Classify: Categorizing Hidden Web Databases


1 Probe, Count, and Classify: Categorizing Hidden Web Databases
Panagiotis G. Ipeirotis and Luis Gravano (Columbia University), Mehran Sahami (E.piphany Inc.)

2 Surface Web vs. Hidden Web
Surface Web: link structure; crawlable.
Hidden Web: no link structure; documents "hidden" behind search forms.

3 Do We Need the Hidden Web?
Surface Web: about 2 billion pages. Hidden Web: an estimated 500 billion pages (?).
Example: PubMed/MEDLINE. A PubMed search for "cancer" returns 1,341,586 matches, while an AltaVista search for "cancer" restricted to the PubMed site returns only 21,830 matches.

4 Interacting With Searchable Text Databases
Searching: metasearchers.
Browsing: Yahoo!-like web directories, e.g., InvisibleWeb.com and SearchEngineGuide.com.
Example from InvisibleWeb.com: Health > Publications > PubMed. These directories are created manually!

5 Classifying Text Databases Automatically: Outline
Definition of classification
Classification through query probing
Experiments

6 Database Classification: Two Definitions
Coverage-based classification: the database contains many documents about a category. Coverage = number of documents about the category.
Specificity-based classification: the database contains mainly documents about a category. Specificity = Coverage / |DB|.
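As a quick illustration (my own, not from the slides), here is a minimal Python sketch of the two quantities, assuming the per-category document counts of a database were somehow known:

```python
# Hypothetical example with made-up counts: coverage and specificity of a database,
# assuming we already knew how many of its documents belong to each category.
doc_counts = {"Basketball": 800, "Baseball": 150, "Health": 50}
db_size = sum(doc_counts.values())  # |DB| = 1000

coverage = dict(doc_counts)                                         # absolute counts per category
specificity = {cat: n / db_size for cat, n in doc_counts.items()}   # fraction of the database

print(coverage["Basketball"], specificity["Basketball"])  # 800 0.8
```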

7 Database Classification: An Example
Category: Basketball.
Coverage-based classification: ESPN.com and NBA.com qualify, but not KnicksTerritory.com.
Specificity-based classification: NBA.com and KnicksTerritory.com qualify, but not ESPN.com.

8 Database Classification: More Details
Thresholds for coverage and specificity: Tc is the coverage threshold (e.g., 100) and Ts the specificity threshold (e.g., 0.5); Tc and Ts are "editorial" choices.
Ideal(D): the set of classes for database D. A class C is in Ideal(D) if D has "enough" coverage and specificity (Tc, Ts) for C and all of C's ancestors, and D fails to have both "enough" coverage and specificity for each child of C.
(Figure: a hierarchy rooted at Root, with SPORTS (C=800, S=0.8) and HEALTH (C=200, S=0.2); under SPORTS, BASKETBALL (S=0.5) and BASEBALL (S=0.5).)
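Spelled out in symbols (my notation, not the paper's), the definition reads:

```latex
\mathrm{Ideal}(D) = \Bigl\{\, C \;\Bigm|\;
   \forall C' \in \mathrm{Ancestors}(C) \cup \{C\}:\;
      \mathrm{Coverage}_D(C') \ge T_c \;\wedge\; \mathrm{Specificity}_D(C') \ge T_s,\;
   \text{and}\;
   \forall C'' \in \mathrm{Children}(C):\;
      \neg\bigl(\mathrm{Coverage}_D(C'') \ge T_c \wedge \mathrm{Specificity}_D(C'') \ge T_s\bigr)
\,\Bigr\}
```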

9 From Document to Database Classification
If we know the categories of all documents inside the database, we are done!
But we do not have direct access to the documents; databases do not export such data.
How can we extract this information?

10 Our Approach: Query Probing
1. Train a rule-based document classifier.
2. Transform the classifier rules into queries.
3. Adaptively send the queries to databases.
4. Categorize the databases based on the adjusted number of query matches.

11 Training a Rule-based Document Classifier
Feature selection: Zipf's-law pruning, followed by information-theoretic feature selection [Koller & Sahami 1996].
Classifier learning: AT&T's RIPPER [Cohen 1995].
Input: a set of pre-classified, labeled documents. Output: a set of classification rules, e.g.:
IF linux THEN Computers
IF jordan AND bulls THEN Sports
IF lung AND cancer THEN Health

12 Constructing Query Probes
Transform each rule into a query:
IF lung AND cancer THEN Health → +lung +cancer
IF linux THEN Computers → +linux
Send the queries to the database and get the number of matches for each query, NOT the documents (i.e., the number of documents that match each rule). These documents would have been classified by the rule under its associated category.
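A small Python sketch of this transformation (the rule representation below is my own illustration, not the paper's data format):

```python
# Hypothetical representation of RIPPER-style rules as (terms, category) pairs.
rules = [
    (["lung", "cancer"], "Health"),
    (["linux"], "Computers"),
    (["jordan", "bulls"], "Sports"),
]

def rule_to_probe(terms):
    """Conjunction of the rule's terms in the '+term' syntax many search forms accept."""
    return " ".join("+" + term for term in terms)

probes = {rule_to_probe(terms): category for terms, category in rules}
# {'+lung +cancer': 'Health', '+linux': 'Computers', '+jordan +bulls': 'Sports'}
```

Each probe is then submitted to the database's search interface, and only the reported number of matches is recorded.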

13 Adjusting Query Results
Classifiers are not perfect!
Queries do not "retrieve" all the documents in a category, and queries for one category may "match" documents that are not in this category.
From the classifier's training phase we know its "confusion matrix".

14 Confusion Matrix M: M . Coverage(D) ~ ECoverage(D)
The confusion matrix M is estimated in the classifier's training phase; rows are the class assigned by the classifier ("classified into"), columns are the correct class. For example, 10% of "Sports" documents are classified as "Computers".

          comp   sports  health
comp      0.70   0.10    0.00
sports    0.18   0.65    0.04
health    0.02   0.05    0.86

Multiplying M by the database's real category coverage approximates the expected probing results: with DB-real = (1000 comp, 5000 sports, 50 health), M . DB-real = (1200, 3432, 313). For instance, 10% of the 5000 "Sports" documents contribute 500 matches to the "Computers" probes.

15 Confusion Matrix Adjustment: Compensating for Classifier’s Errors
Invert the relationship: Coverage(D) ~ M^-1 . ECoverage(D).
M is diagonally dominant, hence invertible.
With the probing results (1200, 3432, 313), M^-1 . (1200, 3432, 313) = (1000, 5000, 50) = DB-real.
Multiplying the probing results by M^-1 better approximates the correct result.
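A short numpy sketch of the adjustment, reproducing the numbers from the slides (variable names are mine):

```python
import numpy as np

# Confusion matrix M from the slides: rows = class assigned by the classifier,
# columns = correct class (e.g., 10% of "Sports" documents are counted as "Computers").
M = np.array([[0.70, 0.10, 0.00],   # comp
              [0.18, 0.65, 0.04],   # sports
              [0.02, 0.05, 0.86]])  # health

real_coverage = np.array([1000, 5000, 50])   # DB-real (unknown in practice)
probe_results = M @ real_coverage            # what probing reports: [1200, 3432, 313]

# M is diagonally dominant, hence invertible: undo the classifier's errors.
adjusted_coverage = np.linalg.inv(M) @ probe_results   # recovers ~[1000, 5000, 50]
print(probe_results, adjusted_coverage)
```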

16 Classifying a Database
Send the query probes for the top-level categories.
Get the number of matches for each probe.
Calculate Specificity and Coverage for each category.
"Push" the database to the qualifying categories (those with Specificity > Ts and Coverage > Tc).
Repeat for each of the qualifying categories.
Return the classes that satisfy the coverage/specificity conditions.
The result is the approximation of the Ideal classification (a small sketch of this recursion follows).
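A hedged Python sketch of this top-down loop (class and function names are mine; probe_counts stands in for sending a category's probes to the database and returning the adjusted match counts):

```python
from dataclasses import dataclass, field

@dataclass
class Category:
    name: str
    children: list = field(default_factory=list)

def classify(db, node, tc, ts, probe_counts):
    """Approximate the Ideal classification by pushing `db` down the category tree."""
    counts = probe_counts(db, node)          # adjusted match counts, one per child of `node`
    total = sum(counts.values()) or 1
    qualifying = [child for child in node.children
                  if counts[child.name] >= tc and counts[child.name] / total >= ts]
    if not qualifying:
        return {node.name}                   # no child qualifies: stop at the current class
    result = set()
    for child in qualifying:
        result |= classify(db, child, tc, ts, probe_counts)
    return result
```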

17 Real Example: ACM Digital Library (Tc=100, Ts=0.5)

18 Experiments: Data
72-node, 4-level topic hierarchy from InvisibleWeb/Yahoo! (54 leaf nodes).
500,000 Usenet articles (April-May 2000); newsgroups assigned by hand to hierarchy nodes.
RIPPER trained with 54,000 articles (1,000 articles per leaf).
27,000 articles used to construct estimates of the confusion matrices.
Remaining 419,000 articles used to build 500 Controlled Databases of varying category mixes and sizes.

19 Comparison With Alternatives
DS: random sampling of documents via query probes [Callan et al., SIGMOD 1999]. Originally designed for a different task (gathering vocabulary statistics); we adapted it for database classification.
TQ: title-based probing [Yu et al., WISE 2000]. Query probes are simply the category names.

20 Experiments: Metrics
Accuracy of classification results:
Expanded(N) = N and all of N's descendants
Correct = Expanded(Ideal(D)); Classified = Expanded(Approximate(D))
Precision = |Correct ∩ Classified| / |Classified|
Recall = |Correct ∩ Classified| / |Correct|
F-measure = 2 · Precision · Recall / (Precision + Recall)
Cost of classification: number of queries sent to the database.
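For concreteness, a small Python sketch of these metrics over sets of categories (names are mine; `descendants` maps each category to the set of its descendants in the hierarchy):

```python
def expanded(categories, descendants):
    """Expanded(N): the categories themselves plus all of their descendants."""
    result = set(categories)
    for c in categories:
        result |= descendants.get(c, set())
    return result

def precision_recall_f(ideal, approximate, descendants):
    correct = expanded(ideal, descendants)          # Expanded(Ideal(D))
    classified = expanded(approximate, descendants) # Expanded(Approximate(D))
    overlap = correct & classified
    precision = len(overlap) / len(classified) if classified else 0.0
    recall = len(overlap) / len(correct) if correct else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```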

21 Average F-measure, Controlled Databases
PnC =Probe & Count, DS=Document Sampling, TQ=Title-based probing

22 Experimental Results: Controlled Databases
Feature selection helps.
Confusion-matrix adjustment helps.
F-measure above 0.8 for most <Tc, Ts> combinations.
Results degrade gracefully with hierarchy depth.
Relatively small number of probes needed for most <Tc, Ts> combinations tried; probes are also short: 1.5 words on average, 4 words maximum.
Both better performance and lower cost than DS [the Callan et al. adaptation] and TQ [Yu et al.].

23 Web Databases
130 real databases classified from InvisibleWeb™.
Used InvisibleWeb's categorization as the correct answer.
Simple "wrappers" for querying (only the number of matches is needed).
The Ts, Tc thresholds are not known (unlike with the Controlled databases) but are implicit in the InvisibleWeb categorization.
We can learn/validate the thresholds (tricky but easy!).
More details in the paper!

24 Web Databases: Learning Thresholds

25 Experimental Results: Web Databases
130 real Web databases.
F-measure above 0.7 for the best <Tc, Ts> combination learned.
185 query probes per database needed on average for classification.
Probes are short: 1.5 words on average, 4 words maximum.

26 Conclusions
Accurate classification using only a small number of short queries.
No need for document retrieval; we only need a result like "X matches found".
No need for any cooperation or special metadata from the databases.

27 Current and Future Work
Build "wrappers" automatically.
Extend to non-topical categories.
Evaluate the impact of varying search interfaces (e.g., Boolean vs. ranked).
Extend to other classifiers (e.g., SVMs or Bayesian models).
Integrate with searching (connection with database selection?).

28 Questions?

29 Contributions
Easy, inexpensive method for database classification.
Uses results from document classification.
"Indirect" classification of the documents in a database: does not inspect documents, only the number of matches.
Adjustment of results according to the classifier's performance.
Easy wrapper construction; no need for any metadata from the database.

30 Related Work
Callan et al., SIGMOD 1999
Gauch et al., ProFusion
Dolin et al., Pharos
Yu et al., WISE 2000
Raghavan and Garcia-Molina, VLDB 2001

31 Controlled Databases
500 databases built using 419,000 newsgroup articles; one label per document.
350 databases with a single (not necessarily leaf) category; 150 databases with varying category mixes.
Database sizes range from 25 to 25,000 articles.
Indexed and queried using SMART.

32 F-measure for Different Hierarchy Depths
PnC =Probe & Count, DS=Document Sampling, TQ=Title-based probing Tc=8, Ts=0.3

33 Query Probes Per Controlled Database

34 Web Databases: Number of Query Probes

35 3-fold Cross-validation
These charts are not included in the paper. They show the F-measure values for the three disjoint subsets of the Web database set. Their behavior is essentially identical across varying thresholds, strongly confirming that we are not overfitting the data.

36 Real Confusion Matrix for Top Node of Hierarchy
(Table: confusion matrix over the top-level categories Health, Sports, Science, Computers, and Arts; the individual entries are not fully recoverable from this transcript.)

