IEPAD: Information Extraction based on Pattern Discovery Chia-Hui Chang National Central University, Taiwan


1 IEPAD: Information Extraction based on Pattern Discovery. Chia-Hui Chang, National Central University, Taiwan. http://www.csie.ncu.edu.tw/~chia

2 2001/5/4 Outline: Introduction; Problem definition; Related work; System architecture; Extraction rule generation; Experiments; Summary and future work

3 Introduction. Web information integration: multi-search engines (e.g. Metacrawler), shopping agents, etc. Common tasks: data collection; information extraction.

4 Information Extraction (IE). Input: HTML pages. Output: a set of records.

5 Related Work: Extractor Generation. Hand-coded wrappers, written by observation. Machine-learning-based approaches: WIEN (Kushmerick), 1997; SoftMealy (Hsu), 1998; STALKER (Muslea), 1999. Fully automatic approaches: Embley et al., 1999; Chang et al., 2000.

6 System Architecture (diagram). Components and labels: Rule Generator, Extractor, Pattern Viewer, Users; HTML Pages, Patterns, Extraction Rule, HTML Page, Extraction Results.

7 Pattern Discovery based IE. Motivation: the display of multiple records often forms a repeated pattern; the occurrences of the pattern are spaced regularly and adjacently. The problem therefore becomes: find regular and adjacent repeats in a string.

8 The Rule Generator. Components: token translator, PAT tree constructor, pattern validator, rule composer. Pipeline (diagram): HTML Page → Token Translator → a token string → PAT Tree Constructor → PAT trees and maximal repeats → Validator → advanced patterns → Rule Composer → extraction rules.

9 1. Web Page Translation. Encoding of the HTML source. Rule 1: each tag is encoded as a token. Rule 2: any text between two tags is translated to a special token called TEXT (denoted by an underscore). HTML example: Congo 242 Egypt 20. Encoded token string: T( )T(_)T( )T( )T(_)T( )T( )
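
The two encoding rules above can be sketched in a few lines of Python. This is only an illustration, not the authors' translator: the function name `encode`, the regular expressions, and the sample HTML in the usage note are assumptions introduced here, and real pages would need a proper HTML parser.

```python
import re

# Split the source into tags ("<...>") and runs of text between tags.
TAG_OR_TEXT = re.compile(r"<[^>]+>|[^<]+")
# Capture an optional closing slash and the tag name, ignoring attributes.
TAG_NAME = re.compile(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)")


def encode(html: str) -> list[str]:
    """Translate HTML into a token list (hypothetical helper).

    Rule 1: each tag becomes one token (attributes are dropped here).
    Rule 2: any non-blank text between two tags becomes the TEXT token "_".
    """
    tokens = []
    for piece in TAG_OR_TEXT.findall(html):
        if piece.startswith("<"):
            match = TAG_NAME.match(piece)
            if match:  # comments and doctype declarations are skipped
                slash, name = match.groups()
                tokens.append(f"<{slash}{name.upper()}>")
        elif piece.strip():
            tokens.append("_")
    return tokens
```

For instance, under these assumptions `encode("<B>Congo</B><I>242</I><BR>")` yields seven tokens, with the two text runs both mapped to `_`.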

10 Various Encoding Schemes

11 Example of BL Encoding. Encoding scheme = block-level tags. Rule 1': only block-level tags are considered; each such tag is encoded as a token. Rule 2: any text between two tags is translated to a special token called TEXT (denoted by an underscore). Example: 1. MGI 2.4 - Mouse Genome … The Mouse Genome Informatics (MGI).. URL:www.informatics.jax.org/ … … … Facts about: … _ _ _ _ 1 5 9 64 68

12 2. PAT Tree Construction. PAT tree: a binary suffix tree, i.e. a Patricia tree constructed over all possible suffix strings of a text. Example token codes: T( ) 000, T( ) 001, T( ) 010, T( ) 011, T( ) 100, T(_) 110. Binary string: 000110001010110011100, which encodes T( )T(_)T( )T( )T(_)T( )T( )
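
As a rough sketch of the step that precedes tree construction, the snippet below assigns each distinct token a fixed-width binary code, concatenates the codes, and lists the binary suffixes that start at token boundaries. The helper names and the code assignment (sorted token order) are assumptions made for illustration, so the codes will not match the ones on the slide, and building the Patricia tree itself is not shown.

```python
from math import ceil, log2


def binary_codes(tokens):
    """Map each distinct token to a fixed-width binary code (hypothetical scheme)."""
    distinct = sorted(set(tokens))
    width = max(1, ceil(log2(len(distinct))))
    return {tok: format(i, f"0{width}b") for i, tok in enumerate(distinct)}


def encode_bits(tokens, codes):
    """Concatenate the binary codes of a token sequence."""
    return "".join(codes[tok] for tok in tokens)


def token_suffixes(tokens, codes):
    """Binary suffixes that begin at a token boundary, one per token position."""
    width = len(next(iter(codes.values())))
    bits = encode_bits(tokens, codes)
    return [bits[i * width:] for i in range(len(tokens))]
```

A PAT tree would index exactly these suffixes; restricting suffixes to token boundaries is an assumption that keeps every discovered repeat aligned with whole tokens.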

13 The Constructed PAT Tree

14 Definition of Maximal Repeats. Let $\alpha$ occur in $S$ at positions $p_1, p_2, p_3, \ldots, p_k$. $\alpha$ is left maximal if there exists at least one pair $(i, j)$ such that $S[p_i - 1] \neq S[p_j - 1]$. $\alpha$ is right maximal if there exists at least one pair $(i, j)$ such that $S[p_i + |\alpha|] \neq S[p_j + |\alpha|]$. $\alpha$ is a maximal repeat if it is both left maximal and right maximal.

15 Finding Maximal Repeats. Definition: call the character $S[p_i - 1]$ the left character of suffix $p_i$. A node is left diverse if at least two leaves in its subtree have different left characters. Lemma: the path label of an internal node in a PAT tree is a maximal repeat if and only if the node is left diverse.
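
To make the definition concrete, here is a brute-force check written directly from the left-maximal and right-maximal conditions. It is a stand-in for the PAT tree traversal, not the method used in the talk: it enumerates every substring, so it is only suitable for short strings. The function name and the choice to treat the string boundaries as a distinct "character" are assumptions.

```python
from collections import defaultdict


def maximal_repeats(s, min_count=2):
    """Return {substring: start positions} for every maximal repeat in s."""
    n = len(s)
    positions = defaultdict(list)
    # Record the start positions of every proper substring.
    for length in range(1, n):
        for start in range(n - length + 1):
            positions[s[start:start + length]].append(start)

    result = {}
    for sub, occ in positions.items():
        if len(occ) < min_count:
            continue
        # Characters just before and just after each occurrence;
        # None marks the start or end of the string.
        left = {s[p - 1] if p > 0 else None for p in occ}
        right = {s[p + len(sub)] if p + len(sub) < n else None for p in occ}
        # Maximal repeat: some pair differs on the left and some pair on the right.
        if len(left) > 1 and len(right) > 1:
            result[sub] = occ
    return result
```

On the token string used later in the talk, `maximal_repeats("adcwbdadcxbadcxbdadcb")` includes `"adc"` with occurrences at positions 0, 6, 11 and 17 (0-based).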

16 3. Pattern Validator. Suppose the occurrences of a maximal repeat $\alpha$ are ordered by position such that $p_1 < p_2 < p_3 < \ldots < p_k$, where $p_i$ denotes the position of each suffix in the encoded token sequence. Characteristics of a pattern: regularity, measured by the variance coefficient; adjacency, measured by the density.

17 Pattern Validator (Cont.). Basic screening: for each maximal repeat $\alpha$, compute $V(\alpha)$ and $D(\alpha)$, then (a) check the pattern's variance: $V(\alpha) < 0.5$; (b) check the pattern's density: $0.25 < D(\alpha) < 1.5$. Flowchart: a pattern that fails either check is discarded; a pattern that passes both is kept.
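
The slides give the thresholds but not the formulas for $V$ and $D$. The sketch below assumes that the variance coefficient is the standard deviation of the gaps between adjacent occurrences divided by their mean, and that the density is $(k-1)\,|\alpha| / (p_k - p_1)$. Both formulas, the use of the population standard deviation, and the function names are assumptions; only the thresholds 0.5, 0.25 and 1.5 come from the slide.

```python
from statistics import mean, pstdev


def variance_coefficient(positions):
    """Assumed V: std. deviation of adjacent gaps divided by their mean."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return pstdev(gaps) / mean(gaps)


def density(positions, pattern_len):
    """Assumed D: share of the covered span occupied by the pattern."""
    return (len(positions) - 1) * pattern_len / (positions[-1] - positions[0])


def passes_basic_screening(positions, pattern_len):
    """Apply the slide's thresholds: V < 0.5 and 0.25 < D < 1.5."""
    if len(positions) < 2:
        return False  # a single occurrence is not a repeat
    v = variance_coefficient(positions)
    d = density(positions, pattern_len)
    return v < 0.5 and 0.25 < d < 1.5
```

Under these assumptions, a pattern of length 3 at sorted positions `[0, 6, 11, 17]` has evenly sized gaps and passes, while occurrences at `[0, 3, 100, 103]` fail the variance check.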

18 4. Rule Composer. Occurrence partition: flexible variance threshold control. Multiple string alignment: increase the density of a pattern. Flowchart (labels): $V(\alpha) < 0.5$; $0.25 < D(\alpha) < 1.5$; Occurrence Partition; $V(\alpha) < 0.1$; $D(\alpha) < 1$; Multiple String Alignment; Discard.

19 Occurrence Partition. Problem: some patterns are divided into several blocks, e.g. Lycos and Excite, with large regularity. Solution: cluster the occurrences of such a pattern. Flowchart (labels): Clustering; $V(\alpha) < 0.1$; No: discard; Yes: check density.

20 Multiple String Alignment. Problem: patterns with density less than 1 can extract only part of the information. Solution: align the k-1 substrings among the k occurrences. This is a natural generalization of alignment for two strings, which can be solved in O(n*m) by dynamic programming, where n and m are the string lengths.
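
The two-string case mentioned above can be sketched as a standard dynamic program that fills an (n+1) by (m+1) cost table and then traces back. Unit costs for substitution and gaps, the gap symbol `-`, and the function name `align` are assumptions made here; the slide does not specify a scoring scheme.

```python
def align(a, b, gap="-"):
    """Globally align two strings with unit edit costs; O(n*m) time and space."""
    n, m = len(a), len(b)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitute = cost[i - 1][j - 1] + (a[i - 1] != b[j - 1])
            cost[i][j] = min(substitute, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            out_a.append(a[i - 1])
            out_b.append(gap)
            i -= 1
        else:
            out_a.append(gap)
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))
```

Aligning more than two strings exactly is much more expensive, so practical systems usually combine pairwise alignments heuristically; which strategy IEPAD uses is not stated on the slide.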

21 Multiple String Alignment (Cont.). Suppose "adc" is the discovered pattern for the token string "adcwbdadcxbadcxbdadcb". Suppose we have the following multiple alignment for the strings "adcwbd", "adcxb" and "adcxbd": a d c w b d / a d c x b - / a d c x b d. The extraction pattern can then be generalized as "adc[w|x]b[d|-]".
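
Given rows that are already aligned, the generalization step in this example can be sketched column by column: a column with a single symbol is copied, and a column with several symbols becomes a bracketed list of alternatives. The function name `generalize` and the ordering of alternatives (first appearance) are assumptions chosen to reproduce the slide's notation.

```python
def generalize(aligned_rows):
    """Merge equal-length aligned rows into a pattern such as "adc[w|x]b[d|-]"."""
    if len({len(row) for row in aligned_rows}) != 1:
        raise ValueError("aligned rows must be non-empty and of equal length")
    parts = []
    for column in zip(*aligned_rows):
        # Keep each distinct symbol once, in order of first appearance.
        distinct = list(dict.fromkeys(column))
        if len(distinct) == 1:
            parts.append(distinct[0])
        else:
            parts.append("[" + "|".join(distinct) + "]")
    return "".join(parts)
```

With the three aligned rows from the slide, `generalize(["adcwbd", "adcxb-", "adcxbd"])` gives the generalized pattern shown above.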

22 Pattern Viewer. Java-application-based GUI. Web-based GUI: http://140.115.155.102/WebIEPAD/

23 The Extractor. Matching the pattern against the encoded token string: the Knuth-Morris-Pratt algorithm or the Boyer-Moore algorithm. Alternatives in a rule: match the longest pattern. What is extracted? The whole record.
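
As an illustration of the matching step, here is a compact Knuth-Morris-Pratt search that works on any sequence of tokens (a string of one-character tokens or a list of tag tokens). It handles a fixed pattern only; rules with alternatives such as `[w|x]` would need an extension that is not sketched here. The function names are assumptions.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1] that is also its suffix."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_find_all(text, pattern):
    """Return the start index of every (possibly overlapping) match of pattern in text."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    matches = []
    k = 0  # number of pattern tokens currently matched
    for i, token in enumerate(text):
        while k > 0 and token != pattern[k]:
            k = fail[k - 1]
        if token == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]
    return matches
```

Each match position marks the start of one candidate record in the encoded token string, so the text of the whole record can be recovered by mapping token positions back to the page source.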

24 Experiment Setup. Fourteen sources: search engines. Performance measures: number of patterns; retrieval rate and accuracy rate. Parameters: encoding scheme; threshold control.

25 Number of Patterns Discovered Using Block-Level Encoding. An average of 117 maximal repeats in our test Web pages.

26 Translation. Average page length is 22.7 KB.

27 Accuracy and Retrieval Rate

28 Accuracy and Retrieval Rate

29 Summary. IEPAD: Information Extraction based on Pattern Discovery. Components: rule generator, extractor, pattern viewer. Performance: 97% retrieval rate and 94% accuracy rate.

30 Problems. The approach guarantees a high retrieval rate rather than a high accuracy rate. A generalized rule can extract more than the desired data. Currently applicable only when a Web page contains several records.

31 Final Acknowledgement. We would like to thank Lee-Feng Chien, Ming-Jer Lee and Jung-Liang Chen for providing their PAT tree code. Reference: Chang, C.H. and Lui, S.C. IEPAD: Information Extraction based on Pattern Discovery, WWW10, May 2001, Hong Kong.

32 Future Work. Interface for choosing a pattern (http://www.csie.ncu.edu.tw/~chia/webiepad/). Multi-level extraction: from record boundary extraction to attribute value extraction. Extractors in Java and C++.

33 Rule Format (k blocks, t attributes). level 1 encoding scheme: rule; level 2 encoding scheme: rule for block 1; level 2 encoding scheme: rule for block 2; ...; level 2 encoding scheme: rule for block k; level 1 block 1, level 2 block no for attribute 1; level 1 block 1, level 2 block no for attribute 2; ...; level 1 block 1, level 2 block no for attribute t

34 Example (cont.). Line 0: Blocklevel.h, String String String String String; Line 1: Alltag.h, rule for block 1; Line 2: Alltag.h, rule for block 2; ...; Line k: Alltag.h, rule for block k; Line k+1: level 1 block no, level 2 block no for attribute 1; Line k+2: level 1 block no, level 2 block no for attribute 2; ...; Line k+t: level 1 block no, level 2 block no for attribute t. Demo: ex: 3, 2; ex: 5, all; ex: 5, 1 3

35 Congo Example

36 Performance Evaluation. Definition: a pattern is said to enumerate a record if the overlapping percentage between the record and the pattern is greater than a given threshold. Three measures: retrieval rate, accuracy rate, matching percentage.

37 Illustration. Let $G_{i,j}$ denote the ordered occurrences $p_i, p_{i+1}, \ldots, p_j$. Set $S = \emptyset$ and $i = 1$. For $j = 1$ to $k-1$: if $R(G_{i,j+1})$ exceeds a threshold, then, if $R(G_{i,j})$ is below a second threshold, set $S = S \cup \{G_{i,j}\}$; in either case set $i = j + 1$. End for.

