
1 Query Rewriting for Extracting Data Behind HTML Forms
Xueqi Chen (1), David W. Embley (1), Stephen W. Liddle (2)
(1) Department of Computer Science, (2) Rollins Center for eBusiness, Brigham Young University
November 9, 2004
Funded by the National Science Foundation under grant IIS-0083127

2 Motivation
Web information is stored in databases
Databases are accessed through forms
Forms are designed in various ways

3 Motivation
Web information is stored in databases
Databases are accessed through forms
Forms are designed in various ways
Automated agents are of great value

4 Prototype System Flowchart
[Flowchart; box labels: User Query, Site Form, Application Extraction Ontology, Input Analyzer, Retrieved Page(s), Output Analyzer, Extracted Information]

5 Input Analyzer – User Query Acquisition
The system creates a query form based on the application-specific ontology

6 Input Analyzer – User Query Acquisition (cont.)

7 Input Analyzer – Site Form Analysis
Determine the name, type, and/or values of each form field

8 Input Analyzer – Form Query Generation
Form field name recognition
  – For all fields
Form field value recognition
  – For range fields only
Form field matching (Cases 0–5)
  – For all fields

9 Form Field Name Recognition
Match by value
  – Application extraction ontology
Match by name
  – WordNet-based C4.5 decision tree learning algorithm
  – Levenshtein edit distance, Soundex, and longest common subsequence (LCS)
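The string measures named on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: it covers only Levenshtein edit distance and LCS (Soundex and the WordNet-based C4.5 classifier are omitted), and `lcs_ratio` is a hypothetical scoring helper introduced here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_ratio(label: str, field_name: str) -> float:
    """Hypothetical score: fraction of the ontology label that appears,
    in order, inside the form field name (1.0 means fully contained)."""
    label, field_name = label.lower(), field_name.lower()
    return lcs_length(label, field_name) / len(label)


if __name__ == "__main__":
    for name in ["myprice", "min_price", "_extern_price", "lo_p", "hi_p"]:
        print(name, levenshtein("price", name), lcs_ratio("price", name))
```

Under this hypothetical score, field names that embed the label (such as `myprice` or `min_price`) reach 1.0, while heavily abbreviated names such as `lo_p` score 0.2, which shows why purely string-based name matching struggles with abbreviations.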

10 Form Field Value Recognition
For range fields only

11 Form Field Value Recognition: Type 1
Lower value list: [0, 1, 5000, 10000, 15000, 20000, 30000]
Upper value list: [2500, 5000, 10000, 15000, 20000, 30000, 50000, 999999]
Paired = false

12 Form Field Value Recognition: Type 2
Lower value list: [0, 0, 5001, 10001, 15001, 20001]
Upper value list: [999999, 5000, 10000, 15000, 20000, 999999]
Paired = true

13 Form Field Value Recognition: Type 3
Lower value list: [25, 25, 25, 25, 25, 25, 25]
Upper value list: [25, 50, 100, 300, 500, 500, 500]
Paired = true
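One way to use the recognized lower and upper value lists is sketched below. The slides do not state how the system picks form values for a requested range, so both selection rules here are assumptions made for illustration: for unpaired lists, take the tightest enclosing bounds; for paired lists, keep every option that overlaps the request. The function names are hypothetical.

```python
def tightest_unpaired(lowers, uppers, q_lo, q_hi):
    """Unpaired lists (Paired = false): the lower and upper bounds are chosen
    independently. Pick the largest lower bound not above q_lo and the
    smallest upper bound not below q_hi (assumed rule, not from the slides)."""
    lo = max((v for v in lowers if v <= q_lo), default=min(lowers))
    hi = min((v for v in uppers if v >= q_hi), default=max(uppers))
    return lo, hi


def overlapping_paired(lowers, uppers, q_lo, q_hi):
    """Paired lists (Paired = true): position i in both lists describes one
    form option. Keep each option whose interval intersects [q_lo, q_hi]."""
    return [(lo, hi) for lo, hi in zip(lowers, uppers)
            if lo <= q_hi and hi >= q_lo]


# Value lists copied from the Type 1 and Type 2 slides.
TYPE1_LOWER = [0, 1, 5000, 10000, 15000, 20000, 30000]
TYPE1_UPPER = [2500, 5000, 10000, 15000, 20000, 30000, 50000, 999999]
TYPE2_LOWER = [0, 0, 5001, 10001, 15001, 20001]
TYPE2_UPPER = [999999, 5000, 10000, 15000, 20000, 999999]

if __name__ == "__main__":
    # Hypothetical user request: values between 4000 and 12000.
    print(tightest_unpaired(TYPE1_LOWER, TYPE1_UPPER, 4000, 12000))
    print(overlapping_paired(TYPE2_LOWER, TYPE2_UPPER, 4000, 12000))
```

Both rules can retrieve more than the user asked for (for example, bounds of 1 and 15000 for a 4000 to 12000 request), so the retrieved records would still need filtering afterwards.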

14 Form Field Matching: Case 0
Field specified in user query (Q) is the same as in a site form (F)

15 Form Field Matching: Case 1
Field in Q is not contained in F, but is in the returned information

16 Form Field Matching: Case 2
Field in Q is not contained in F, and is not in the returned information (slide annotation: "Color?")

17 Form Field Matching: Case 3
Field required by F is not provided in Q, but a general default value, such as “All” or “Any”, is provided by F

18 Form Field Matching: Case 4
Field required by F is not provided in Q, and the default value provided by the site form is specific, not “All” or “Any”

19 Form Field Matching: Case 5
Values specified in Q do not match values provided in F
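The six cases above can be summarized as a small classification routine. This is only a sketch under assumptions: the data model (a dict from form field name to its allowed values, a set of field names present in returned pages) and both function names are hypothetical, and the slides do not say how the system actually detects each case.

```python
GENERAL_DEFAULTS = {"all", "any"}


def classify_query_field(name, value, form, returned_fields):
    """Classify a field that appears in the user query Q.

    form: dict mapping form field name -> set of allowed values,
          or None when the field accepts free text (assumed model).
    returned_fields: set of field names present in the returned pages.
    """
    if name not in form:
        # Cases 1 and 2: the form has no such field.
        return 1 if name in returned_fields else 2
    allowed = form[name]
    if allowed is not None and value not in allowed:
        return 5  # Q's value does not match any value offered by F
    return 0      # the field in Q corresponds directly to a field in F


def classify_unqueried_form_field(default):
    """Classify a field required by F that Q does not mention (Cases 3 and 4)."""
    return 3 if default.strip().lower() in GENERAL_DEFAULTS else 4


if __name__ == "__main__":
    # Hypothetical car-ads form and result fields.
    form = {"make": {"Ford", "Honda"}, "price": None}
    returned = {"make", "price", "year"}
    print(classify_query_field("make", "Ford", form, returned))
    print(classify_query_field("year", "2001", form, returned))
    print(classify_query_field("color", "red", form, returned))
    print(classify_query_field("make", "Saturn", form, returned))
    print(classify_unqueried_form_field("Any"))
    print(classify_unqueried_form_field("Sedan"))
```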

20 Output Analyzer
Form results processor
  – Record separator
  – BYU Ontos
Final results generator
  – Database manipulation: single table, multiple tables

21 A Car-ads Search Example

22 A Car-ads Search Example (cont.)

23 Measurements
Field-matching efficiency

24 Measurements (cont.)
Field-matching efficiency
Query-submission efficiency

25 Measurements (cont.)
Field-matching efficiency
Query-submission efficiency
Overall efficiency

26 Experimental Results
Car-ads search
Number of Forms: 7
Number of Fields in Forms: 31
Number of Fields Applicable to Ontology: 21 (67.7%)
            Field Matching   Query Submission                                 Overall
Recall      100% (21/21)     100% (249/249)                                   100%
Precision   100% (21/21)     82.7% (249/301) [97.1% (249+1847)/(301+1858)]*   82.7% [97.1%]*
* Numbers in square brackets are calculated including queries submitted for retrieving next links.

27 Experimental Results (cont.)
Digital-camera search
Number of Forms: 7
Number of Fields in Forms: 41
Number of Fields Applicable to Ontology: 23 (56.1%)
            Field Matching   Query Submission                       Overall
Recall      91.3% (21/23)    100% (31/31)                           91.3%
Precision   100% (21/21)     100% (31/31) [100% (31+85)/(31+85)]*   100% [100%]*
* Numbers in square brackets are calculated including queries submitted for retrieving next links.
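The percentages in the two result tables can be reproduced from the raw counts. The sketch below does that arithmetic; the formulas behind the "Overall" column did not survive in this transcript, so treating overall recall and precision as the product of the field-matching and query-submission values is an assumption that happens to agree with every reported figure. The helper names are hypothetical.

```python
def pct(numerator: int, denominator: int) -> float:
    """Ratio expressed as a percentage, rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


def overall(field_counts, query_counts) -> float:
    """Assumed combination rule: product of the two ratios, as a percentage."""
    f_num, f_den = field_counts
    q_num, q_den = query_counts
    return round(100 * (f_num / f_den) * (q_num / q_den), 1)


if __name__ == "__main__":
    # Car-ads search (counts from the table on slide 26).
    print(pct(249, 301))                    # query-submission precision
    print(pct(249 + 1847, 301 + 1858))      # same, including next-link queries
    print(overall((21, 21), (249, 301)))    # overall precision (assumed rule)

    # Digital-camera search (counts from the table on slide 27).
    print(pct(21, 23))                      # field-matching recall
    print(overall((21, 23), (31, 31)))      # overall recall (assumed rule)
```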

28 Results Discussion
Field matching
  – By value
    Successful: 100%
  – By name
    Successful example: price vs. myprice, pricelow, pricehigh, _extern_price, min_price, max_price
    Failed: price vs. lo_p, hi_p

29 Results Discussion (cont.)
Query submission

30 Conclusion
Our system’s performance
  – Fields applicable to extraction ontologies: 61.9%
  – Fields system matched: 95.7%
  – Queries submitted that are necessary: 91.4%
To improve the performance
  – Field labels
  – The quality of the extraction ontologies
Forms our system does not handle
  – Multiple forms
  – Forms whose actions are coded inside scripts
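The three summary percentages are not derived on the slide. A quick check suggests they are the unweighted means of the car-ads and digital-camera figures reported in the experimental results; that reading is an inference from the arithmetic, not something the slides state. The sketch below compares the per-domain mean with a pooled ratio; `mean_pct` and `pooled_pct` are hypothetical helper names.

```python
def mean_pct(*ratios) -> float:
    """Unweighted mean of (numerator, denominator) ratios, as a percentage."""
    values = [num / den for num, den in ratios]
    return round(100 * sum(values) / len(values), 1)


def pooled_pct(*ratios) -> float:
    """Pooled alternative: sum numerators and denominators before dividing."""
    num = sum(n for n, _ in ratios)
    den = sum(d for _, d in ratios)
    return round(100 * num / den, 1)


if __name__ == "__main__":
    # (car-ads, digital-camera) counts taken from the experimental results.
    applicable = ((21, 31), (23, 41))   # fields applicable to the ontology
    matched = ((21, 21), (21, 23))      # field-matching recall
    necessary = ((249, 301), (31, 31))  # query-submission precision

    for label, counts in [("applicable", applicable),
                          ("matched", matched),
                          ("necessary", necessary)]:
        print(label, mean_pct(*counts), pooled_pct(*counts))
```

The per-domain means come out to 61.9, 95.7, and 91.4, matching the slide, whereas pooling the counts gives different values (for example, 61.1 for applicable fields).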

31 Contributions
Enables directed hidden Web crawling
  – Accurate field matching
  – Efficient form filling and submission
  – Post-processing for precise results
Ontology based
  – Extensible to multiple domains
  – Resilient to page changes

