Presentation is loading. Please wait.

Presentation is loading. Please wait.

1 A Hierarchical Approach to Wrapper Induction Presentation by Tim Chartrand of A paper bypaper Ion Muslea, Steve Minton and Craig Knoblock.

Similar presentations


Presentation on theme: "1 A Hierarchical Approach to Wrapper Induction Presentation by Tim Chartrand of A paper bypaper Ion Muslea, Steve Minton and Craig Knoblock."— Presentation transcript:

1 1 A Hierarchical Approach to Wrapper Induction Presentation by Tim Chartrand of A paper bypaper Ion Muslea, Steve Minton and Craig Knoblock

2 2 Introduction IE from web pages is important because of the amount of semistructured information on the web IE depends on the construction of wrappers Manual wrapper construction is tedious and hard Previous wrapper learning systems require a lot of hand- marked training data STALKER is a supervised learning algorithm for inducing wrappers STALKER requires fewer triaining examples than other approaches and is able to wrap more pages

3 3 Structure of a document Semistructured documents follow a formal grammar An Embedded Catalog (EC) represents the structure of a document as a tree Leaves are items of interest Internal nodes are lists of k-tuples Each item of a k-tuple can be a leaf or another list

4

5 5 Extraction Rules Each tree node represents a sequence of tokens The root node represents the entire document Each node’s sequence is a subsequence of its parent’s An extraction rule is associated with each edge of the tree Specifies how to extract the child content x from the parent content p In other words describes how to match the prefix of x w.r.t p -- Prefix x (p) Each child’s extraction rule is independent of its siblings Use Landmarks – either tokens or wildcards (token classes) Can be disjunctive – apply rule R1 or rule R2

6 6 Extraction Rules – examples SkipTo( ) SkipTo(Name)SkipTo( ) SkipTo(Name Symbol HtmlTag) SkipTo( ) SkipTo(,)SkipUntil(Num) SkipTo(AllCaps)NextLMark(Num)

7 7 Extraction Rules as Finite Automata An extraction rule is equivalent to an FSA Transition conditions correspond to the landmarks used in the extraction rules Empty looping transitions are taken when a landmark has not been reached R1 = SkipTo( ( ) R2 = SkipTo(Phone)SkipTo( )

8 8 Learning Algorithm Sequential Covering Algorithm Choose the rule that covers most examples and remove the examples it covers Return the disjunction of all rules found

9

10

11 11 Related Work Manual wrapper construction TSIMMIS, procedural languages, etc Hard and error prone Automatic wrapper construction WEIN Less expressive – only uses the equivalent of SkipTo() without wildcards Not able to express arbitrarily deep tuples SoftMealy Generates rules as finite transducers More expressive than WEIN but strictly less expressive than STALKER Must see all possible item orderings WHISK, RAPIER, and SRV Use NLP Techniques Use landmarks similar to STALKER Ontology approach – DEG Can handle lists with multiplicity constraints Character based rather than token based landmarks

12 12 Conclusions STALKER uses EC formalism to turns a hard problem into several smaller ones Unseen permutations of data items can be recognized Arbitrarily long lists can be recognized The entire document can be interpreted as a list of tuples STALKER rules use an expressive landmark based format High accuracy wrappers can be induced automatically based on very few training examples compared with other systems

13 13 Related Work – BYU DEG RAPIER rules correspond closely to DEG data frames. Data frames are finer-grained, based on character patterns, whereas rules are based on word patterns Pre-filler and Post-filler patterns correspond closely to data frame contexts and key words Semantic categories correspond closely with lexicons Not mentioned how RAPIER handles multiple record documents Rapier data structure is given by the template (slots) defined in the input data RAPIER is very similar in purpose to what Joe is trying to do – learn extraction rules based on a filled in form

14 14 Conclusions Extracting desired pieces of information from NL text is important Manually constructing IE systems too hard RAPIER uses relational learning to build a set of pattern- match rules given a database of texts and filled templates Learned patterns employ syntactic and semantic information to match slot fillers and context Fairly accurate results can be obtained for a real-world problem with relatively small datasets RAPIER compares favorably with other IE learning systems


Download ppt "1 A Hierarchical Approach to Wrapper Induction Presentation by Tim Chartrand of A paper bypaper Ion Muslea, Steve Minton and Craig Knoblock."

Similar presentations


Ads by Google