RoadRunner: Towards Automatic Data Extraction from Large Web Sites. Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo. VLDB 2001.


1 RoadRunner: Towards Automatic Data Extraction from Large Web Sites. Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo. VLDB 2001

2 Overview
Automatically generates a wrapper from large, structured Web pages
Supports nested structures
Efficient approach to large, complex pages with regular structures

3 Approach
Given a set of example pages
Generate a union-free regular expression (UFRE)
Find the least upper bound on the RE lattice to generate a wrapper
This reduces to finding the least upper bound of two UFREs

4 Matching/Mismatching
Start with the first page and create an RE that defines the wrapper
Match each successive sample against the wrapper
Mismatches result in generalizations of the regular expression
Types of mismatches
– String mismatches
– Tag mismatches
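The matching step above can be sketched as a scan that reports the first position where wrapper and sample disagree, and whether the disagreement is a string mismatch or a tag mismatch. This is a minimal illustration, not the authors' implementation: the token-list representation, the `is_tag` helper, and the `first_mismatch` name are all assumptions made for the example.

```python
def is_tag(token):
    """Assumption: a token is a tag if it starts with '<'; anything else is text."""
    return token.startswith("<")


def first_mismatch(wrapper, sample):
    """Return (position, kind) for the first disagreement, or None if the sample matches.

    kind is "string" when two text tokens differ and "tag" when at least
    one of the two tokens is a tag (or one list ends before the other).
    """
    for pos, (w, s) in enumerate(zip(wrapper, sample)):
        if w == s:
            continue
        if w == "#PCDATA" and not is_tag(s):
            continue  # a field already generalized to #PCDATA accepts any text
        kind = "tag" if is_tag(w) or is_tag(s) else "string"
        return pos, kind
    if len(wrapper) != len(sample):
        # One side has leftover tokens; treated here as a tag mismatch.
        return min(len(wrapper), len(sample)), "tag"
    return None


# Hypothetical token lists, not taken from the slides' figures.
wrapper = ["<b>", "John Smith", "</b>", "<ul>"]
print(first_mismatch(wrapper, ["<b>", "Paul Jones", "</b>", "<ul>"]))  # (1, 'string')
print(first_mismatch(wrapper, ["<b>", "John Smith", "</b>", "<img>", "<ul>"]))  # (3, 'tag')
```

A string mismatch would then be handled by generalizing the field, and a tag mismatch by looking for an iterator or an optional.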

5 Example Pages

6 Example
String mismatches are used to discover fields of the documents
The wrapper is generated by replacing “John Smith” with #PCDATA
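A small sketch of this field-discovery step, assuming both pages share the same tag skeleton and differ only in their text. The regex-based `tokenize` and the `generalize_strings` function are illustrative names chosen here, and the second page's author name is a made-up sample value.

```python
import re

# Assumption: pages are simple enough that a regex can split tags from text.
TOKEN = re.compile(r"<[^>]+>|[^<]+")


def tokenize(html):
    """Split HTML into tag tokens and text tokens, dropping whitespace-only text."""
    return [t.strip() for t in TOKEN.findall(html) if t.strip()]


def is_tag(token):
    return token.startswith("<")


def generalize_strings(wrapper, sample):
    """Replace every text token that differs between wrapper and sample with #PCDATA.

    Raises ValueError on a tag mismatch, which this sketch does not handle.
    """
    if len(wrapper) != len(sample):
        raise ValueError("token counts differ: tag mismatch, not a string mismatch")
    result = []
    for w, s in zip(wrapper, sample):
        if w == s:
            result.append(w)
        elif is_tag(w) or is_tag(s):
            raise ValueError(f"tag mismatch: {w!r} vs {s!r}")
        else:
            result.append("#PCDATA")  # differing strings mark a data field
    return result


page1 = tokenize("<html>Books of: <b>John Smith</b></html>")
page2 = tokenize("<html>Books of: <b>Paul Jones</b></html>")
print(generalize_strings(page1, page2))
# ['<html>', 'Books of:', '<b>', '#PCDATA', '</b>', '</html>']
```

Text that is identical on both pages ("Books of:") stays in the wrapper as a constant, while text that varies becomes a field.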

7 Example (Cont.)
Tag Mismatches: Discovering Optionals
First check to see if the mismatch is caused by an iterator
If not, it could be an optional field in the wrapper or the sample
Cross search is used to determine possible optionals
Image field determined to be optional
– ( )?

8 Example (Cont.)
Tag Mismatches: Discovering Optionals
First check to see if the mismatch is caused by an iterator
If not, it could be an optional field in the wrapper or the sample
Cross search is used to determine possible optionals
Image field determined to be optional
– ( )? ( )?
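The cross search for optionals might look roughly like the following. It is a simplified sketch under the assumption that wrapper and sample are flat token lists; `cross_search` is a name chosen for illustration, and the `<img>` token stands in for the image field mentioned on the slide.

```python
def cross_search(wrapper, i, sample, j):
    """Look for an optional at a tag mismatch between wrapper[i] and sample[j].

    Returns ("sample", tokens) if the wrapper's token reappears later in the
    sample, so the skipped sample tokens are a candidate optional;
    ("wrapper", tokens) in the symmetric case; or None if neither search succeeds.
    """
    try:
        k = sample.index(wrapper[i], j + 1)  # search forward in the sample
        return "sample", sample[j:k]
    except ValueError:
        pass
    try:
        k = wrapper.index(sample[j], i + 1)  # search forward in the wrapper
        return "wrapper", wrapper[i:k]
    except ValueError:
        return None


# Hypothetical token lists: the sample carries an image the wrapper lacks.
wrapper = ["<b>", "#PCDATA", "</b>", "<ul>"]
sample = ["<b>", "#PCDATA", "</b>", "<img>", "<ul>"]
print(cross_search(wrapper, 3, sample, 3))  # ('sample', ['<img>'])
```

The returned tokens are what would be wrapped in `( ... )?` in the generalized wrapper; after that, matching resumes past the optional.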

9 Example (Cont.)
Tag Mismatches: Discovering Iterators
Assume the mismatch is caused by repeated elements in a list
Match possible squares against earlier squares
Generalize the wrapper by finding all contiguous repeated occurrences
– ( Title: #PCDATA )+
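As a rough sketch of the iterator step: take the tokens that end just before the mismatch, find a block that is immediately preceded by a matching block (a "square"), and collapse all contiguous repetitions into one `( ... )+` group. This simplification tries block sizes from shortest to longest rather than locating squares by their terminal tag, and the names `merge_block` and `discover_iterator`, the `("+", block)` tuple, and the `<ul>`/`<li>`/`<i>` markup are assumptions for illustration.

```python
def is_tag(token):
    return token.startswith("<")


def merge_block(a, b):
    """Match two candidate squares; differing text becomes #PCDATA, differing tags fail."""
    if len(a) != len(b):
        return None
    merged = []
    for x, y in zip(a, b):
        if x == y:
            merged.append(x)
        elif is_tag(x) or is_tag(y):
            return None
        else:
            merged.append("#PCDATA")
    return merged


def discover_iterator(tokens, end):
    """Collapse contiguous repeated blocks ending at tokens[end - 1] into ("+", block).

    Returns the generalized token list, or None if no square is found.
    """
    for size in range(1, end // 2 + 1):
        block = merge_block(tokens[end - 2 * size:end - size], tokens[end - size:end])
        if block is None:
            continue
        start = end - 2 * size
        # Extend backwards over every further contiguous occurrence.
        while start - size >= 0:
            wider = merge_block(tokens[start - size:start], block)
            if wider is None:
                break
            block, start = wider, start - size
        return tokens[:start] + [("+", block)] + tokens[end:]
    return None


# Hypothetical list of two books; the mismatch is assumed to occur at index 13.
tokens = ["<ul>",
          "<li>", "<i>", "Title:", "</i>", "DB Primer", "</li>",
          "<li>", "<i>", "Title:", "</i>", "Comp. Sys.", "</li>",
          "</ul>"]
print(discover_iterator(tokens, 13))
# ['<ul>', ('+', ['<li>', '<i>', 'Title:', '</i>', '#PCDATA', '</li>']), '</ul>']
```

The constant "Title:" survives inside the repeated group because it is identical in every occurrence, while the book titles generalize to #PCDATA.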

10 Extracted Result

11 Recursive Example

12 Complexity

13 Discussion
Assumptions
– Pages are well-structured
– Want to extract at the level of entire fields
– Structure can be modeled without disjunctions
Search space for explaining mismatches is huge
– Uses a number of heuristics to prune the space
  Limited backtracking
  Limit on the number of choices to explore
  Patterns cannot be delimited by optionals
– Will result in pruning possible wrappers

14 Experimental Result

15 Comparison with Other Works

16 Columns: Name, Structured, Semi, Free, Single-slot, Multi-slot, Missing items, Permutations, Nested data, Resilient
WIEN XXX
SoftMealy XXXXXX*
STALKER XXX*XXX
RAPIER XX?XXX?
SRV XX?XXX?
WHISK XXXXXXX*?
AutoSlog XXXX
RoadRunner XXXXX
BYU Onto XX?XXXXXX X
Legend: X means the information extraction system has the capability; X* means the system has the capability as long as the training corpus can accommodate the required training data; ? means the system has the capability to some degree; * means that the extraction pattern itself does not show the capability, but the overall system has it.

