Machine-learning based Semi-structured IE Chia-Hui Chang Department of Computer Science & Information Engineering National Central University

1 Machine-learning based Semi-structured IE Chia-Hui Chang Department of Computer Science & Information Engineering National Central University chia@csie.ncu.edu.tw 9/24/2002

2 Wrapper Induction. Wrapper: an extraction program that extracts desired information from Web pages. Semi-structured document → wrapper → structured information. Web wrappers wrap "query-able" or "search-able" Web sites and Web pages with large itemized lists. The primary issue is: how to build the extractor quickly?

3 Semi-structured IE. Developed independently of traditional IE. Driven by the necessity of extracting and integrating data from multiple Web-based sources.

4 Machine-Learning Based Approach A key component of IE systems is a set of extraction patterns that can be generated by machine learning algorithms. Extractor Driver Architecture Rule Format

5 Related Work. Shopbot: Doorenbos, Etzioni, Weld, AA-97. Ariadne: Ashish, Knoblock, CoopIS-97. WIEN: Kushmerick, Weld, Doorenbos, IJCAI-97. SoftMealy (wrapper representation): Hsu, IJCAI-99. STALKER (a hierarchical FST): Muslea, Minton, Knoblock, AA-99.

6 WIEN N. Kushmerick, D. S. Weld, R. Doorenbos, University of Washington, 1997 http://www.cs.ucd.ie/staff/nick/

7 Example 1

8 Extractor for Example 1

9 HLRT

10 Wrapper Induction. Induction: the task of generalizing from labeled examples to a hypothesis. Instances: pages. Labels: {(Congo, 242), (Egypt, 20), (Belize, 501), (Spain, 34)}. Hypotheses: e.g. a 6-tuple of delimiters (h, t, l1, r1, l2, r2).
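
To make the hypothesis concrete, here is a minimal sketch of how an HLRT wrapper could be executed on a page holding the four labeled tuples above. The page markup, the delimiter values, and the function name `exec_hlrt` are assumptions chosen for illustration (the transcript does not preserve the original tags); this is not WIEN's actual code.

```python
# Hypothetical page containing the labeled tuples from the slide.
PAGE = (
    "<HTML><TITLE>Some Country Codes</TITLE><BODY>\n"
    "<B>Some Country Codes</B><P>\n"
    "<B>Congo</B> <I>242</I><BR>\n"
    "<B>Egypt</B> <I>20</I><BR>\n"
    "<B>Belize</B> <I>501</I><BR>\n"
    "<B>Spain</B> <I>34</I><BR>\n"
    "<HR><B>End</B></BODY></HTML>\n"
)

# Assumed HLRT wrapper: head, tail, and one (left, right) pair per attribute.
HEAD, TAIL = "<P>", "<HR>"
DELIMITERS = [("<B>", "</B>"), ("<I>", "</I>")]


def exec_hlrt(page, head, tail, delimiters):
    """Return the tuples found between `head` and `tail`."""
    tuples = []
    pos = page.find(head)
    if pos < 0:
        return tuples
    pos += len(head)  # skip the page header
    first_left = delimiters[0][0]
    while True:
        next_left = page.find(first_left, pos)
        next_tail = page.find(tail, pos)
        # Stop when no tuple starts before the tail delimiter.
        if next_left < 0 or 0 <= next_tail < next_left:
            break
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start < 0:
                return tuples
            start += len(left)
            end = page.find(right, start)
            if end < 0:
                return tuples
            row.append(page[start:end])
            pos = end + len(right)
        tuples.append(tuple(row))
    return tuples


RESULT = exec_hlrt(PAGE, HEAD, TAIL, DELIMITERS)
```

The head delimiter keeps the bold page title from being mistaken for a country, and the tail delimiter does the same for the bold footer text; a plain LR wrapper would extract both as spurious tuples.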

11 BuildHLRT succeeds

12 Other Families. OCLR (Open-Close-Left-Right): uses Open and Close as delimiters for each tuple. HOCLRT: combines OCLR with Head and Tail. N-LR and N-HLRT: nested LR and nested HLRT.

13 Terminology. Oracles: Page Oracle, Label Oracle. PAC analysis determines how many examples are necessary to build a wrapper, given two parameters, accuracy $\epsilon$ and confidence $\delta$: $\Pr[E(w) < \epsilon] > 1 - \delta$, or equivalently $\Pr[E(w) > \epsilon] < \delta$.

14 Probably Approximately Correct (PAC) Analysis. With $\epsilon = 0.1$, $\delta = 0.1$, K = 4, and an average of 5 tuples per page, BuildHLRT must examine at least 72 examples.

15 Empirical Evaluation. Extracts 48% of Web pages successfully. Weaknesses: missing attributes, attributes not in order, tabular data, etc.

16 SoftMealy Chun-Nan Hsu, Ming-Tzung Dung, 1998 Arizona State University http://kaukoai.iis.sinica.edu.tw/~chunnan/mypublications.html

17 SoftMealy Architecture. Finite-State Transducers for Semi-Structured Text Mining. Labeling: use an interface to label examples manually. Learner: FST (Finite-State Transducer). Extractor. Demonstration: http://kaukoai.iis.sinica.edu.tw/video.html

18 SoftMealy Wrapper. The SoftMealy wrapper representation uses a finite-state transducer in which each distinct attribute permutation can be encoded as a successful path, and replaces delimiters with contextual rules that describe the context delimiting two adjacent attributes.

19 Example

20 Four cases. Label the Answer Key

21 Finite State Transducer [figure: states b, M, -A, A, -N, N, -U, U, e; transitions labeled extract and skip] Also handles two additional cases: (N, M) and (N, A, M)

22 Find the starting position -- Single Pass (newly added definition)

23 Contextual based Rule Learning. Tokens. Separators: S^L ::= … Punc(,) Spc(1) Html( ); S^R ::= C1Alph(Professor) Spc(1) OAlph(of) … Rule generalization. Taxonomy Tree.

24 Tokens. All-uppercase string: CAlph. An uppercase letter followed by at least one lowercase letter: C1Alph. A lowercase letter followed by zero or more characters: OAlph. HTML tag: Html. Punctuation symbol: Punc. Control characters: NL(1), Tab(4), Spc(3).
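
The token classes on this slide can be sketched as a small regular-expression tokenizer. This is only an illustration: the exact patterns, the `Other` fallback class (used here for digits and anything else the slide does not name), and the function name `tokenize` are assumptions, not SoftMealy's actual definitions.

```python
import re

# One named group per token class; alternatives are tried left to right.
TOKEN_RE = re.compile(
    r"""
    (?P<Html><[^>]+>)               # HTML tag
  | (?P<C1Alph>[A-Z][a-z][A-Za-z]*) # uppercase letter, then a lowercase one
  | (?P<CAlph>[A-Z]+)               # all-uppercase string
  | (?P<OAlph>[a-z][A-Za-z]*)       # starts with a lowercase letter
  | (?P<NL>\n+)                     # run of newlines
  | (?P<Tab>\t+)                    # run of tabs
  | (?P<Spc>[ ]+)                   # run of spaces
  | (?P<Punc>[^\w\s])               # single punctuation symbol
  | (?P<Other>.)                    # assumed fallback, not on the slide
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Turn raw text into strings such as 'Punc(,)' or 'Spc(1)'."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        kind = match.lastgroup
        value = match.group()
        if kind in ("NL", "Tab", "Spc"):
            # Control-character tokens carry a repeat count, e.g. Spc(3).
            tokens.append(f"{kind}({len(value)})")
        else:
            tokens.append(f"{kind}({value})")
    return tokens


SAMPLE = tokenize(", <I>Professor of CS")
```

For the sample string this yields `Punc(,) Spc(1) Html(<I>) C1Alph(Professor) Spc(1) OAlph(of) Spc(1) CAlph(CS)`, which mirrors the shape of the separator rules on the previous slide.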

25 Rule Generalization

26 Learning Algorithm. Generalize each column by replacing each token with the least common ancestor of the tokens in that column.
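
The column-wise generalization can be sketched as a least-common-ancestor lookup. The taxonomy below (`Any`, `Word`, `NonWord`, `Ctrl`) is a hypothetical stand-in, since the tree on the taxonomy slide is not preserved in this transcript, and the example separators are assumed to be already aligned and of equal length.

```python
from functools import reduce

# Hypothetical taxonomy: child class -> parent class.
PARENT = {
    "CAlph": "Word", "C1Alph": "Word", "OAlph": "Word",
    "NL": "Ctrl", "Tab": "Ctrl", "Spc": "Ctrl",
    "Html": "NonWord", "Punc": "NonWord", "Ctrl": "NonWord",
    "Word": "Any", "NonWord": "Any",
}


def parent(node):
    """Parent of a concrete token such as 'Spc(1)' is its class 'Spc'."""
    if "(" in node:
        return node.split("(", 1)[0]
    return PARENT.get(node)


def ancestors(node):
    """The node itself followed by all of its ancestors up to the root."""
    path = [node]
    while parent(node) is not None:
        node = parent(node)
        path.append(node)
    return path


def lca(a, b):
    """Least common ancestor of two tokens or classes in the taxonomy."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    raise ValueError(f"no common ancestor for {a!r} and {b!r}")


def generalize(examples):
    """Replace each aligned column of tokens by its least common ancestor."""
    return [reduce(lca, column) for column in zip(*examples)]


RULE = generalize([
    ["Punc(,)", "Spc(1)", "C1Alph(Professor)"],
    ["Punc(,)", "Spc(2)", "CAlph(CEO)"],
])
```

Identical tokens stay as they are, tokens of the same class collapse to that class, and tokens of different classes climb to their shared ancestor, so `RULE` is `["Punc(,)", "Spc", "Word"]`.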

27 Taxonomy Tree

28 Generating to Extract the Body. The contextual rules for the head and tail separators are: h^L ::= C1Alph(Staff) Html( ) NL(1) Html( ) NL(1) Html( ); t^R ::= Html( ) NL(1) Html( ) NL(1) Html( ) NL(1) Html( ) C1Alph(Please)

29 More Expressive Power. SoftMealy allows: disjunction; multiple attribute orders within tuples; missing attributes; features of candidate strings.

30 Stalker I. Muslea, S. Minton, C. Knoblock, University of Southern California http://www.isi.edu/~muslea/

31 STALKER Embedded Catalog Tree. Leaves (primitive items): the data to be extracted. Internal nodes (items): a homogeneous list or a heterogeneous tuple.

32 EC Tree of a page

33 Extracting Data from a Document. For each node in the EC tree, the wrapper needs a rule that extracts that particular node from its parent. Additionally, for each list node, the wrapper requires a list iteration rule that decomposes the list into individual tuples. Advantages: First, the hierarchical extraction based on the EC tree allows us to wrap information sources that have arbitrarily many levels of embedded data. Second, as each node is extracted independently of its siblings, the approach does not rely on there being a fixed ordering of the items, and it can easily handle extraction tasks from documents that may have missing items or items that appear in various orders.
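
The hierarchical scheme can be sketched as a recursive walk over an EC tree in which every node is cut out of its parent's text. Note the simplification: plain start/end strings stand in for STALKER's landmark-based rules, and the `Node` class, the sample document, and all field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    start: str = ""        # text that precedes the item inside its parent
    end: str = ""          # text that follows the item inside its parent
    is_list: bool = False  # list nodes have one child: the item template
    children: list = field(default_factory=list)


def cut(text, start, end):
    """Return the text between `start` and `end`, or None if not found."""
    i = text.find(start) if start else 0
    if i < 0:
        return None
    i += len(start)
    j = text.find(end, i) if end else len(text)
    return None if j < 0 else text[i:j]


def extract_children(node, text):
    """Leaves yield their text; tuples yield one entry per child."""
    if not node.children:
        return text.strip()
    return {child.name: extract(child, text) for child in node.children}


def extract(node, parent_text):
    """Extract `node` from its parent's text, independently of siblings."""
    text = cut(parent_text, node.start, node.end)
    if text is None:
        return None  # a missing item does not affect its siblings
    if not node.is_list:
        return extract_children(node, text)
    item = node.children[0]
    assert item.start and item.end, "list items need non-empty delimiters"
    items, pos = [], 0
    while True:  # list iteration rule: split the list into its items
        i = text.find(item.start, pos)
        if i < 0:
            break
        i += len(item.start)
        j = text.find(item.end, i)
        if j < 0:
            break
        items.append(extract_children(item, text[i:j]))
        pos = j + len(item.end)
    return items


TREE = Node("restaurant", children=[
    Node("name", "Name: ", "<p>"),
    Node("cuisine", "Cuisine: ", "<p>"),
    Node("addresses", "<ul>", "</ul>", is_list=True, children=[
        Node("address", "<li>", "</li>", children=[
            Node("street", "", ","),
            Node("city", ", ", ""),
        ]),
    ]),
])

DOC = ("Name: Yala<p>Cuisine: Thai<p>"
       "<ul><li>4000 Colfax, Phoenix</li><li>523 Vernon, Las Vegas</li></ul>")
RESULT = extract(TREE, DOC)
```

Because each child is located within the full parent text rather than relative to the previous sibling, a document that lacks the cuisine field still yields the name and the address list, with `None` in place of the missing item.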

34 Extraction Rules as Finite Automata. Landmark: a sequence of tokens and wildcards. Landmark automaton: a non-deterministic finite automaton.

35 Landmark Automata (LA). A linear LA has one accepting state; from each non-accepting state, there are exactly two possible transitions: a loop to itself, and a transition to the next state; each non-looping transition is labeled by a landmark; all looping transitions have the meaning "consume all tokens until you encounter the landmark that leads to the next state".
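
A linear landmark automaton can be sketched as a loop over its landmarks: each state keeps consuming tokens until its landmark matches, then advances. The wildcard names follow the ones that appear on the rule-generation slide, but their predicates here, the function names, and the sample tokens are assumptions for illustration.

```python
# Assumed meaning of the wildcards; any other pattern is a literal token.
WILDCARDS = {
    "_HtmlTag_": lambda t: t.startswith("<") and t.endswith(">"),
    "_Word_": lambda t: t.isalpha(),
    "_Symbol_": lambda t: len(t) == 1 and not t.isalnum() and not t.isspace(),
}


def matches(pattern, token):
    """A pattern is either a wildcard name or a literal token."""
    test = WILDCARDS.get(pattern)
    return test(token) if test else pattern == token


def run_linear_la(tokens, landmarks, start=0):
    """Return the index just past the last landmark, or None on failure.

    `landmarks` is a list of landmarks; each landmark is a list of
    literal tokens and wildcards that must match consecutively.
    """
    pos = start
    for landmark in landmarks:
        n = len(landmark)
        while pos + n <= len(tokens):
            window = tokens[pos:pos + n]
            if all(matches(p, t) for p, t in zip(landmark, window)):
                pos += n  # non-looping transition: move to the next state
                break
            pos += 1      # looping transition: consume one token
        else:
            return None   # ran out of tokens before reaching the landmark
    return pos


TOKENS = ["Name", ":", "<b>", "Yala", "</b>", "<p>", "Cuisine", ":", "Thai"]
NAME_START = run_linear_la(TOKENS, [[":"], ["_HtmlTag_"]])
CUISINE_START = run_linear_la(TOKENS, [["Cuisine", "_Symbol_"]])
```

Here `TOKENS[NAME_START]` is `"Yala"` and `TOKENS[CUISINE_START]` is `"Thai"`; a rule whose landmark never occurs returns `None`, which a disjunctive rule could use as the signal to try its next alternative.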

36 Rule Generation. 1st: terminals: {; reservation _Symbol_ _Word_} Candidate: {; _Symbol_ _HtmlTag_} perfect Disj: { _HtmlTag_} positive examples: D3, D4. 2nd: uncovered {D1, D2} Candidate: {; _Symbol_} Extract Credit info.

37 Possible Rules

40 The STALKER Algorithm

43 Features. The process is performed in a hierarchical manner. Attributes that are not in order pose no problem. Disjunctive rules can handle missing attributes.

44 Multi-pass SoftMealy Chun-Nan Hsu and Chien-Chi Chang Institute of Information Science Academia Sinica Taipei, Taiwan

45 Multi-pass

46 Tabular style document (Quote Server)

47 Tagged-list style document (Internet Address Finder)

48 Layout styles and learnability. Tabular style: missing attributes, ordering as hints. Tagged-list style: variant ordering, tags as hints. Prediction: single-pass for tabular style, multi-pass for tagged-list style.

49 Tabular result (Quote Server)

50 Tagged-list result (Internet Address Finder)

51 Alternative for Tagged-List Docs

52 Comparison. Both: can handle irregular missing attributes; previously unseen attributes require training. Single-pass: only a limited set of attribute permutations is allowed; good for tabular pages; faster. Multi-pass: attribute permutations have no effect; good for tagged-list pages; slower.

53 Experiments. Okra (tabular pages): Stalker: 97%, 1 example tuple; WIEN: 100%, 13 example tuples, 30 trials; SoftMealy: single-pass 100%, 1 example tuple, 30 trials. Big-book (tagged-list pages): Stalker: 97%, 8 example tuples; WIEN: perfect, 18 example tuples, 30 trials; SoftMealy: single-pass 97%, 4 examples, 30 trials; multi-pass 100%, 6 examples, 30 trials.

54 Experiments (Cont.). Quote Server: Stalker: 10 example tuples, 79%, 500 trials; WIEN: the collection is beyond the learner's capability; SoftMealy: multi-pass 85%, single-pass 97%. Internet Address Finder: Stalker: 80% ~ 100%, 300 trials (omitting multiple occurrences of Organization); WIEN: the collection is beyond the learner's capability; SoftMealy: multi-pass 68%, single-pass 41%, 99% with the alternative approach.

55 References
Kushmerick, N. Wrapper induction: Efficiency and expressiveness. Artificial Intelligence J. 118(1-2):15-68, 2000.
Chun-Nan Hsu and Ming-Tzung Dung. Generating finite-state transducers for semistructured data extraction from the web. Information Systems, 23(8):521-538, 1998.
Chun-Nan Hsu and Chien-Chi Chang. Finite-State Transducers for Semi-Structured Text Mining. In Proceedings of IJCAI-99 Workshop on Text Mining, Stockholm, Sweden, 1999, pp. 38-49.
Ion Muslea, Steve Minton, Craig Knoblock. Hierarchical Wrapper Induction for Semistructured Information Sources. Journal of Autonomous Agents and Multi-Agent Systems, 4:93-114, 2001.

