Extracting Data Behind Web Forms Stephen W. Liddle David W. Embley Del T. Scott, Sai Ho Yau Brigham Young University Presented by: Helen Chen.

1 Extracting Data Behind Web Forms. Stephen W. Liddle, David W. Embley, Del T. Scott, Sai Ho Yau (Brigham Young University). Presented by: Helen Chen

2 Introduction: According to BrightPlanet.com, the “deep/hidden web” is about 500 times larger than the “shallow web”. Web forms are designed in various ways, using radio buttons, checkboxes, selection lists, text boxes, hidden controls, and even author-defined objects. Automated form filling is desirable but challenging.

3 Example: from www.autointerface.com

4 Objective: Automatically fill out web forms, retrieve all the data behind the forms, and eliminate duplicates. The approach is domain-independent. Unbounded domains are excluded from this study.

5 Issues Considered: Data can sometimes be obtained only piecemeal, using multiple queries. Result pages may contain error messages: – HTTP 404 error pages – messages embedded within a series of tables, frames, or other types of HTML divisions. Duplicate data are retrieved.

6 Procedure: (1) Automate form filling. (2) Process response pages. (3) Recognize duplicate data. Loop!

7 Automate form filling: Parse the HTML page into a parse tree and store the portion of interest – only the portion between the <form> and </form> tags is of interest. Store particular information – the source URL of the page, the action URL to which the form will be submitted, the number of fields, and details for each field (name, type, default value). Assign proper values to each field. Submit the form for CGI processing.
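The form-parsing step can be illustrated with a minimal sketch using Python's standard `html.parser` module. This is not the authors' implementation: the class name `FormExtractor`, the field dictionary layout, and the sample HTML are all assumptions made for illustration, and only `<input>` and `<select>` controls in the first form are handled.

```python
from html.parser import HTMLParser


class FormExtractor(HTMLParser):
    """Collects the action URL, method, and fields of the first <form> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.done = False
        self.action = None
        self.method = "get"
        self.fields = []       # one dict per field: name, type, default, choices
        self._select = None    # the <select> field currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form" and not self.done:
            self.in_form = True
            self.action = a.get("action", "")
            self.method = (a.get("method") or "get").lower()
        elif not self.in_form:
            return  # ignore everything outside the form
        elif tag == "input":
            self.fields.append({"name": a.get("name"),
                                "type": (a.get("type") or "text").lower(),
                                "default": a.get("value", ""),
                                "choices": []})
        elif tag == "select":
            self._select = {"name": a.get("name"), "type": "select",
                            "default": None, "choices": []}
            self.fields.append(self._select)
        elif tag == "option" and self._select is not None:
            value = a.get("value", "")
            self._select["choices"].append(value)
            # The first option is the default unless a later one is marked selected.
            if "selected" in a or self._select["default"] is None:
                self._select["default"] = value

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None
        elif tag == "form" and self.in_form:
            self.in_form = False
            self.done = True


# Made-up sample page, used only to exercise the sketch.
SAMPLE = """
<html><body><p>Search cars</p>
<form action="/search" method="POST">
  <select name="make"><option value="any" selected>Any</option><option value="ford">Ford</option></select>
  <input type="text" name="model" value="">
  <input type="hidden" name="sid" value="42">
</form></body></html>
"""
parser = FormExtractor()
parser.feed(SAMPLE)
```

After `feed`, `parser.action`, `parser.method`, and `parser.fields` hold the details the slide lists; the source URL of the page would be recorded separately by whatever fetched it.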

8 Form Submission: Method – HTTP GET or HTTP POST. Plan – (1) issue the default query, (2) sampling phase, (3) exhaustive phase.
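The difference between the two submission methods is where the encoded field values go: in the URL for GET, in the request body for POST. The sketch below shows this with the standard `urllib` modules. The function name `build_request` and the example URL and values are assumptions, and the request is only constructed, not sent.

```python
from urllib.parse import urlencode, urljoin
from urllib.request import Request


def build_request(source_url, action, method, values):
    """Build (but do not send) a form-submission request (hypothetical helper)."""
    target = urljoin(source_url, action)   # resolve a relative action URL
    query = urlencode(values)              # e.g. "make=ford&sort=price"
    if method.lower() == "post":
        # POST: field values travel in the request body.
        return Request(target, data=query.encode("ascii"), method="POST")
    # GET: field values are appended to the URL as a query string.
    separator = "&" if "?" in target else "?"
    return Request(target + separator + query, method="GET")


get_req = build_request("http://example.com/cars/index.html", "/search",
                        "get", {"make": "ford", "sort": "price"})
post_req = build_request("http://example.com/cars/index.html", "/search",
                         "post", {"make": "ford", "sort": "price"})
```

Passing either object to `urllib.request.urlopen` would perform the actual submission.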

9 Form Submission (cont’): The user can specify several thresholds: – percentage of data retrieved – number of queries issued – number of bytes retrieved – amount of time spent – number of consecutive queries with no new data returned. The form-filling process exits when one of these thresholds is reached or the data are exhausted.
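The stopping rule amounts to checking five counters against optional limits after each query. A minimal sketch follows; the `Thresholds` and `Progress` classes and their field names are hypothetical, chosen only to mirror the five thresholds listed on the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Thresholds:
    """User-specified limits; None means the limit is not set."""
    max_fraction_retrieved: Optional[float] = None
    max_queries: Optional[int] = None
    max_bytes: Optional[int] = None
    max_seconds: Optional[float] = None
    max_empty_streak: Optional[int] = None


@dataclass
class Progress:
    """Running totals kept by the crawler."""
    fraction_retrieved: float = 0.0   # estimated share of the data retrieved so far
    queries: int = 0
    bytes_retrieved: int = 0
    seconds: float = 0.0
    empty_streak: int = 0             # consecutive queries that returned no new data


def should_stop(limits: Thresholds, progress: Progress) -> bool:
    """Return True as soon as any configured threshold has been reached."""
    checks = [
        (limits.max_fraction_retrieved, progress.fraction_retrieved),
        (limits.max_queries, progress.queries),
        (limits.max_bytes, progress.bytes_retrieved),
        (limits.max_seconds, progress.seconds),
        (limits.max_empty_streak, progress.empty_streak),
    ]
    return any(limit is not None and value >= limit for limit, value in checks)
```

Exhaustion of the data (no query combinations left) would be a separate exit condition in the surrounding loop.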

10 Form Submission (cont’): Estimate the database size [equation not preserved in this transcript], where $D_i$ is the estimated data size, $O_i$ is the number of unique bytes observed after the $i$-th query, $N$ is the total number of queries, and $p_i$ is the estimated probability of finding new data in query $i+1$, with $p_i = \frac{\text{number of queries that returned new data}}{i}$.
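Only the definition of p_i survives in the transcript, so the sketch below computes just that cumulative estimate and does not attempt the data-size formula. The function name and the boolean-list input are assumptions.

```python
def new_data_probabilities(returned_new_data):
    """Return [p_1, p_2, ...], where p_i is the fraction of the first i queries
    that returned new data (the cumulative estimate defined on the slide)."""
    probabilities = []
    hits = 0
    for i, had_new in enumerate(returned_new_data, start=1):
        if had_new:
            hits += 1
        probabilities.append(hits / i)
    return probabilities


# Example: the third query returned nothing new.
history = new_data_probabilities([True, True, False, True])
```

Here `history[-1]` is 0.75, since three of the four queries returned new data.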

11 Form Submission (cont’): Estimate data size with a windowed probability [equation not preserved in this transcript], where $\sigma_i$ is a measure of the standard deviation of $p_i$ over the previous 2 query cycles. Comment: in practice, the windowed probability estimate is NOT as good as the cumulative estimate.

12 Form Submission (cont’): Estimate the maximum possible space needed [equation not preserved in this transcript], where $b_i$ is the size in bytes of the $i$-th sample query, $N$ is the total number of queries, and $n$ is the number of sample queries ($n = C$). Estimate the remaining time required [equation not preserved in this transcript], where $t_i$ is the total duration of the $i$-th sample query.

13 Sampling Phase of the Submission: Determine the size of a sampling batch (the number of queries to issue at one time). $N = \prod_i |f_i|$, where $N$ is the total number of possible combinations and $|f_i|$ is the number of choices for the $i$-th factor. $C = \max(c, \lceil \log_2 N \rceil)$, where $C$ is the size of a sampling batch and $c$ is the cardinality of the largest factor.

14 Sampling Phase of the Submission (example): [Figure: grid of sampled query combinations for two factors, one of them labeled sort-by.] $N = 4 \times 7 = 28$, $\log_2 N \approx 4.8$, $c = 7$, $C = \max(7, 5) = 7$.
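The batch-size arithmetic in this example can be checked with a few lines of Python. The sketch assumes that log_2 N is rounded up before the comparison, which is consistent with the slide turning 4.8 into 5; the function name is hypothetical.

```python
import math


def sampling_batch_size(factor_sizes):
    """Return (N, C) for a form whose factors have the given numbers of choices.

    N is the number of possible query combinations; C is the larger of the
    largest factor's cardinality c and log2(N) rounded up (assumed rounding).
    """
    n_combinations = math.prod(factor_sizes)
    c = max(factor_sizes)
    batch = max(c, math.ceil(math.log2(n_combinations)))
    return n_combinations, batch


# The slide's example: one factor with 4 choices, one with 7.
n, batch = sampling_batch_size([4, 7])
```

With many small factors the logarithm term dominates instead: six two-choice factors give N = 64 and C = 6.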

15 Exhaustive Phase of the Submission: Estimate the maximum possible space needed, the maximum remaining time needed, and the data size. Let the user specify various thresholds for the completeness of the retrieved data. Process additional batches of C query samples until one of the thresholds is reached or all possible combinations are exhausted.
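Processing "additional batches of C queries until the combinations are exhausted" can be sketched as a generator over the Cartesian product of the factor values. This is a simplification under stated assumptions: combinations are taken in plain enumeration order rather than by any sampling strategy from the paper, and the function name, the `factors` mapping, and the `already_issued` parameter are hypothetical.

```python
import itertools


def query_batches(factors, batch_size, already_issued=()):
    """Yield lists of up to batch_size query dicts until every combination is used.

    factors maps a field name to its list of possible values; already_issued
    holds value tuples (in field order) that earlier phases have submitted.
    """
    seen = set(already_issued)
    names = list(factors)
    remaining = (dict(zip(names, combo))
                 for combo in itertools.product(*(factors[name] for name in names))
                 if combo not in seen)
    while True:
        batch = list(itertools.islice(remaining, batch_size))
        if not batch:
            return  # all possible combinations are exhausted
        yield batch


factors = {"sort": ["price", "year"], "make": ["ford", "honda", "kia"]}
batches = list(query_batches(factors, batch_size=4))
```

A caller would check the user's thresholds after each batch and break out of the loop early when one is reached.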

16 Exhaustive Phase (Improvement): [Figure not preserved in this transcript.] $|F_B| = 12$, $|F_A| = 60$.

17 Process response pages: No-record notification: continue. Required field missing: requires user intervention. Unexpected failure: timeout. Default query retrieves all data on one page. Default query retrieves all data, spread over more than one page: concatenate all records. Default query does not retrieve all data: run the sampling and exhaustive phases.

18 Recognize duplicated data: Modify the copy detection system (CDS) to use a record-separator tag inserted after a list of HTML tags (,,,,, …; the tag names are missing from this transcript). Strip all HTML tags. CDS computes hash values for every record delimited by the separator tag. CDS compares the new hash values with the hash values of all records retrieved previously. Remove duplicates and store new data in the repository.
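The hash-and-compare idea can be sketched as follows. This is not the CDS itself: SHA-1 from `hashlib` stands in for whatever hash the CDS uses, records are assumed to be already split into separate strings, and the whitespace normalization and function names are assumptions.

```python
import hashlib
import re

TAG_RE = re.compile(r"<[^>]+>")  # crude tag matcher, adequate for a sketch


def record_hash(record_html):
    """Strip HTML tags, normalize whitespace, and hash the remaining text."""
    text = TAG_RE.sub(" ", record_html)
    text = " ".join(text.split())
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def add_new_records(records, seen_hashes, repository):
    """Append records whose hash has not been seen before; return how many were new."""
    added = 0
    for record in records:
        digest = record_hash(record)
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            repository.append(record)
            added += 1
    return added


seen, repo = set(), []
first = add_new_records(["<td>Ford</td><td>1999</td>", "<td>Honda</td>"], seen, repo)
second = add_new_records(["<b>Ford</b> <i>1999</i>", "<td>Kia</td>"], seen, repo)
```

In the second call the Ford record differs only in markup, so it hashes to the same value and is dropped; the number of new records per query is also what feeds the "no new data returned" threshold.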

19 Experimental Results: Among 13 different Web sites visited, 5 returned all the data with a single query. The sampling phase takes from a few dozen seconds to several hours. Storage requirements are modest. To retrieve 80% of the data, sites with a relatively sparse data pattern needed about 40% of the queries, and sites with a fairly dense data pattern needed about 75% of the queries.

20 Conclusions: A domain-independent approach for automatically retrieving the data behind a given web form. A two-phase approach is used to gather data. Analyzing the productivity of the various factors, in order to emphasize those that yield more data earlier in the search process, improves the performance of the system.

21 Future Work: Automatically retrieve, extract, and integrate just the relevant data from different web sites, using domain-specific ontologies with respect to user queries. Unbounded domains, such as text boxes, are within the scope of this future work.

22 Related Works: Microsoft Passport and Wallet system for e-commerce transactions. ShopBot for domain-specific comparison shopping. Commercial ventures that index the hidden web: BrightPlanet.com, InvisibleWeb.com. HiWE: a domain-specific, human-assisted web crawler.

