Query Rewriting for Extracting Data Behind HTML Forms Xueqi Chen Department of Computer Science Brigham Young University March 31, 2004 Funded by National.


1 Query Rewriting for Extracting Data Behind HTML Forms. Xueqi Chen, Department of Computer Science, Brigham Young University. March 31, 2004. Funded by the National Science Foundation.

2 Motivation: Web information is stored in databases. Databases are accessed through forms. Forms are designed in various ways.

3 Motivation: Web information is stored in databases. Databases are accessed through forms. Forms are designed in various ways. Automated agents are of great value.

4 Prototype System Flowchart (diagram; labels: User Query, Site Form, Application Extraction Ontology, Input Analyzer, Retrieved Page(s), Output Analyzer, Extracted Information)

5 Input Analyzer – User Query Acquisition: Our system provides a form created from the application-specific ontology.

6 Input Analyzer – User Query Acquisition (cont'd)

7 Input Analyzer – Site Form Analysis: Understand the name, type, and/or values of each field.

8 Input Analyzer – Form Query Generation: Form field name recognition (for all fields); form field values justification (for range fields only); form fields matching, Cases 0 to 5 (for all fields).

9 Form Field Name Recognition. Match by value: application extraction ontology. Match by name: WordNet-based C4.5 decision tree learning algorithm; Levenshtein edit distance, soundex, and longest common subsequence (LCS).
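
As a rough illustration of the string measures named on this slide, the sketch below computes Levenshtein edit distance and longest-common-subsequence length for an ontology label and a form field name. The function `name_features` and the `lcs_ratio` feature are hypothetical names chosen for this example; soundex, the WordNet lookup, and the C4.5 classifier that would consume such features are left out, and the slides do not say how the features were actually combined.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def name_features(label: str, field_name: str) -> dict:
    """Hypothetical feature vector for one (ontology label, form field) pair."""
    a, b = label.lower(), field_name.lower()
    return {
        "edit_distance": levenshtein(a, b),
        # Share of the label's characters found, in order, in the field name.
        "lcs_ratio": lcs_length(a, b) / max(len(a), 1),
    }


if __name__ == "__main__":
    for field in ["myprice", "max_price", "lo_p", "hi_p"]:
        print(field, name_features("price", field))
```

With the label "price", field names that contain the whole word score an `lcs_ratio` of 1.0, while heavily abbreviated names such as lo_p share only a single character with it, which suggests why purely string-based matching struggles with abbreviations.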

10 Form Field Values Justification: For range fields only.

11 Form Field Values Justification: Type 1. Lower value list: [0, 1, 5000, 10000, 15000, 20000, 30000]; upper value list: [2500, 5000, 10000, 15000, 20000, 30000, 50000, 999999]; Paired = false.

12 Form Field Values Justification: Type 2. Lower value list: [0, 0, 5001, 10001, 15001, 20001]; upper value list: [999999, 5000, 10000, 15000, 20000, 999999]; Paired = true.

13 Form Field Values Justification: Type 3. Lower value list: [25, 25, 25, 25, 25, 25, 25]; upper value list: [25, 50, 100, 300, 500, 500, 500]; Paired = true.
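
The three value-list types above can be exercised with a small sketch that maps a user's range onto the values a form offers. The selection rule here is an assumption, not something the slides state: for paired lists it picks the narrowest single pair that covers the user's range, and for unpaired lists it picks the closest lower bound at or below the user's minimum and the closest upper bound at or above the user's maximum. The function name `adjust_range` is hypothetical.

```python
from typing import List, Optional, Tuple


def adjust_range(user_low: int, user_high: int,
                 lowers: List[int], uppers: List[int],
                 paired: bool) -> Optional[Tuple[int, int]]:
    """Pick form values that cover [user_low, user_high], or None if impossible."""
    if paired:
        # Each (lower, upper) pair is one selectable option on the form.
        covering = [(lo, hi) for lo, hi in zip(lowers, uppers)
                    if lo <= user_low and hi >= user_high]
        if not covering:
            return None
        # Assumed rule: prefer the narrowest option that still covers the range.
        return min(covering, key=lambda pair: pair[1] - pair[0])
    # Unpaired: the lower and upper bounds are chosen independently.
    low_candidates = [lo for lo in lowers if lo <= user_low]
    high_candidates = [hi for hi in uppers if hi >= user_high]
    if not low_candidates or not high_candidates:
        return None
    return max(low_candidates), min(high_candidates)


if __name__ == "__main__":
    # Value lists copied from the Type 1 and Type 2 slides.
    type1 = ([0, 1, 5000, 10000, 15000, 20000, 30000],
             [2500, 5000, 10000, 15000, 20000, 30000, 50000, 999999])
    type2 = ([0, 0, 5001, 10001, 15001, 20001],
             [999999, 5000, 10000, 15000, 20000, 999999])
    print(adjust_range(6000, 9000, *type1, paired=False))
    print(adjust_range(6000, 9000, *type2, paired=True))
```

Because the chosen form range can be wider than what the user asked for, the retrieved records would still need filtering against the original query.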

14 Form Fields Matching: Case 0. Fields specified in the user query are the same as in the site form.

15 Form Fields Matching: Case 1. Fields specified in the user query are not contained in the site form, but are in the returned information.

16 Form Fields Matching: Case 2. Fields specified in the user query are not contained in the site form, and are not in the returned information. Example field on the slide: Color.

17 Form Fields Matching: Case 3. Fields required by the site form are not provided in the user query, but the site form provides a general default value such as “All” or “Any”.

18 Form Fields Matching: Case 4. Fields that appear in the site form are not provided in the user query, and the default value provided by the site form is specific, not “All” or “Any”.

19 Form Fields Matching: Case 5. Values specified in the user query do not match the values provided in the site form.
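
The six cases can be read as a per-field decision procedure. The sketch below is one possible reading under stated assumptions: the data shapes (a dict for the user query, a dict of form fields with optional `options` and `default` keys, and a set of fields seen in returned pages), the order of the checks, and the name `classify_field` are all hypothetical; the slides only define the cases themselves.

```python
from typing import Dict, Optional, Set

GENERAL_DEFAULTS = {"all", "any"}


def classify_field(name: str,
                   user_query: Dict[str, str],
                   site_form: Dict[str, Dict[str, Optional[object]]],
                   returned_fields: Set[str]) -> int:
    """Return the matching case (0 to 5) for one field."""
    in_query = name in user_query
    in_form = name in site_form
    if in_query and in_form:
        options = site_form[name].get("options")
        # A free-text field (options is None) accepts any value.
        if options is None or user_query[name] in options:
            return 0
        return 5  # the form does not offer the user's value
    if in_query:
        # Not on the form: can it at least be checked in the results?
        return 1 if name in returned_fields else 2
    if in_form:
        default = str(site_form[name].get("default", "")).strip().lower()
        return 3 if default in GENERAL_DEFAULTS else 4
    raise KeyError(f"{name!r} is in neither the query nor the form")


if __name__ == "__main__":
    # Illustrative data only; not taken from the slides.
    query = {"make": "Ford", "model": "Focus", "year": "2002", "color": "red"}
    form = {
        "make": {"options": ["Ford", "Honda"], "default": "Any"},
        "model": {"options": ["Civic"], "default": "Any"},
        "body": {"options": ["All", "Sedan"], "default": "All"},
        "zip": {"options": None, "default": "84602"},
    }
    returned = {"year"}
    for field in ["make", "year", "color", "body", "zip", "model"]:
        print(field, classify_field(field, query, form, returned))
```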

20 Output Analyzer. Form Results Processor: record separator; BYU Ontos. Final Results Generator: database manipulation (single table, multiple tables).
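
To make the single-table database step concrete, here is a minimal sketch that loads extracted records into an in-memory SQLite table and applies the user's constraints as a final filter. The table name, columns, sample records, and the function `filter_records` are all hypothetical; the slides do not describe the actual schema or database used.

```python
import sqlite3
from typing import List, Tuple

Record = Tuple[str, int, int]  # (make, price, year), hypothetical schema


def filter_records(records: List[Record], max_price: int, year: int) -> List[Record]:
    """Keep only extracted records that satisfy the user's original query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE car (make TEXT, price INTEGER, year INTEGER)")
        conn.executemany("INSERT INTO car VALUES (?, ?, ?)", records)
        rows = conn.execute(
            "SELECT make, price, year FROM car "
            "WHERE price <= ? AND year = ? ORDER BY price",
            (max_price, year),
        ).fetchall()
    finally:
        conn.close()
    return rows


if __name__ == "__main__":
    # Made-up records standing in for output of the extraction step.
    extracted = [("Ford", 9500, 2002), ("Honda", 12000, 2002), ("Ford", 8000, 1999)]
    print(filter_records(extracted, max_price=10000, year=2002))
```

A filter of this kind is one way to handle fields that the site form cannot constrain but that do appear in the returned records.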

21 A Car-ads Search Example

22 A Car-ads Search Example (cont'd)

23 Measurements: Field-matching efficiency.

24 Measurements (cont'd): Field-matching efficiency; query-submission efficiency.

25 Measurements (cont'd): Field-matching efficiency; query-submission efficiency; overall efficiency.

26 Experimental Results: Car-ads search
Number of forms: 7
Number of fields in forms: 31
Number of fields applicable to ontology: 21 (67.7%)

Measure | Field Matching | Query Submission | Overall
Recall | 100% (21/21) | 100% (249/249) | 100%
Precision | 100% (21/21) | 82.7% (249/301) [97.1% (249+1847)/(301+1858)]* | 82.7% [97.1%]*

* Numbers in square brackets are calculated including queries submitted for retrieving next links.

27 Experimental Results (cont'd): Digital-camera search
Number of forms: 7
Number of fields in forms: 41
Number of fields applicable to ontology: 23 (56.1%)

Measure | Field Matching | Query Submission | Overall
Recall | 91.3% (21/23) | 100% (31/31) | 91.3%
Precision | 100% (21/21) | 100% (31/31) [100% (31+85)/(31+85)]* | 100% [100%]*

* Numbers in square brackets are calculated including queries submitted for retrieving next links.
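
The reported percentages can be rechecked from the raw counts on the two results slides. The sketch below does that arithmetic. The `overall` function encodes an assumption: the slides do not define the overall measure, but multiplying the field-matching and query-submission figures reproduces every overall value shown.

```python
def pct(numerator: int, denominator: int) -> float:
    """Percentage rounded to one decimal place, as on the slides."""
    return round(100 * numerator / denominator, 1)


def overall(field_matching_pct: float, query_submission_pct: float) -> float:
    # Assumption: overall = field matching x query submission.
    return round(field_matching_pct * query_submission_pct / 100, 1)


if __name__ == "__main__":
    # Car-ads search
    print(pct(21, 31))                        # fields applicable to ontology
    print(pct(249, 301))                      # query-submission precision
    print(pct(249 + 1847, 301 + 1858))        # same, counting next-link queries
    print(overall(pct(21, 21), pct(249, 301)))
    # Digital-camera search
    print(pct(23, 41))                        # fields applicable to ontology
    print(pct(21, 23))                        # field-matching recall
    print(overall(pct(21, 23), pct(31, 31)))
```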

28 Results Discussion: Field Matching. By value: successful, 100%. By name: successful examples are price vs. myprice, pricelow, pricehigh, _extern_price, min_price, max_price; failed examples are price vs. lo_p, hi_p.

29 Results Discussion (cont'd): Query Submission

30 Conclusion. Our system’s performance: fields applicable to extraction ontologies, 61.9%; fields the system matched, 95.7%; submitted queries that were necessary, 91.4%. To improve performance: field labels; the quality of the extraction ontologies. Forms our system does not handle: multiple forms; forms whose actions are coded inside scripts.
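
The three summary percentages are not derived on the slide. As a check, and only as an inference from the numbers, the sketch below shows that an unweighted mean of the car-ads and digital-camera results reproduces all three figures, whereas pooling the raw counts across both domains gives a slightly different value. The function names are hypothetical.

```python
from typing import List


def mean_pct(ratios: List[float]) -> float:
    """Unweighted mean of per-domain ratios, as a percentage with one decimal."""
    return round(100 * sum(ratios) / len(ratios), 1)


def pooled_pct(numerators: List[int], denominators: List[int]) -> float:
    """Percentage computed from counts summed over all domains."""
    return round(100 * sum(numerators) / sum(denominators), 1)


if __name__ == "__main__":
    # Per-domain counts taken from the two experimental-results slides.
    print(mean_pct([21 / 31, 23 / 41]))      # fields applicable to ontologies
    print(mean_pct([21 / 21, 21 / 23]))      # fields the system matched
    print(mean_pct([249 / 301, 31 / 31]))    # queries that were necessary
    print(pooled_pct([21, 23], [31, 41]))    # pooled alternative, for contrast
```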

31 Contributions. Enables directed hidden-Web crawling: accurate field matching; efficient form filling and submission; post-processing for precise results. Ontology based: extensible to multiple domains; resilient to page changes.

