Published by Anis Logan. Modified over 9 years ago.
1
IEPAD: Information Extraction based on Pattern Discovery
Chia-Hui Chang
National Central University, Taiwan
http://www.csie.ncu.edu.tw/~chia
2
2001/5/4
Outline
- Introduction
- Problem definition
- Related Work
- System architecture
- Extraction rule generation
- Experiments
- Summary and future work
3
Introduction
- Web information integration
  - multi-search engines, e.g. Metacrawler
  - shopping agents
  - etc.
- Common tasks
  - Data collection
  - Information extraction
4
Information Extraction
- Information Extraction (IE)
  - Input: HTML pages
  - Output: A set of records
5
Related Work
- Extractor Generation
  - Hand-coded wrappers by observation
  - Machine learning based approach
    - WIEN (Kushmerick), 1997
    - SoftMealy (Hsu), 1998
    - STALKER (Muslea), 1999
  - Fully automatic approach
    - Embley et al, 1999
    - Chang et al, 2000
6
System Architecture
[Diagram labels: Html Pages, Rule Generator, Patterns, Pattern Viewer, Users, Extraction Rule, Html Page, Extractor, Extraction Results]
7
Pattern Discovery based IE
- Motivation
  - Display of multiple records often forms a repeated pattern
  - The occurrences of the pattern are spaced regularly and adjacently
- Now the problem becomes...
  - Find regular and adjacent repeats in a string
8
The Rule Generator
- Translator
- PAT tree construction
- Pattern validator
- Rule Composer
[Diagram: HTML Page -> Token Translator -> A Token String -> PAT Tree Constructor -> PAT trees and Maximal Repeats -> Validator -> Advanced Patterns -> Rule Composer -> Extraction Rules]
9
1. Web Page Translation
- Encoding of HTML source
  - Rule 1: Each tag is encoded as a token
  - Rule 2: Any text between two tags is translated to a special token called TEXT (denoted by an underscore)
- HTML Example: Congo 242 Egypt 20
- Encoded token string: T( )T(_)T( )T( )T(_)T( )T( )
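The two encoding rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tag names do not survive in the slide text, so the markup `<B>Congo</B><I>242</I><BR>` used below is an assumed stand-in for one record of the Congo example, and the `encode` function and its regular expression are hypothetical, not the translator used in IEPAD.

```python
import re

# Matches one HTML tag and captures the optional slash and the tag name.
# Attributes are ignored (an assumption; the slides do not say how they are handled).
TAG = re.compile(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)[^>]*>")


def encode(html):
    """Translate HTML source into a list of tokens.

    Rule 1: each tag becomes one token.
    Rule 2: any non-blank text between two tags becomes the TEXT token "_".
    """
    tokens = []
    pos = 0
    for match in TAG.finditer(html):
        if html[pos:match.start()].strip():
            tokens.append("_")  # text run before this tag
        tokens.append("<" + match.group(1) + match.group(2).upper() + ">")
        pos = match.end()
    if html[pos:].strip():
        tokens.append("_")  # trailing text after the last tag
    return tokens


# Assumed markup for one record; yields seven tokens, two of them TEXT.
record_tokens = encode("<B>Congo</B><I>242</I><BR>")
```

With this assumed markup the result has the same shape as the encoded string on the slide: tag, TEXT, tag, tag, TEXT, tag, tag.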
10
Various Encoding Schemes
11
Example of BL Encoding
- Encoding scheme = Block-Level Tags
  1'. Only block-level tags are considered; each tag is encoded as a token
  2. Any text between two tags is translated to a special token called TEXT (denoted by an underscore)
- Example page text: 1. MGI 2.4 - Mouse Genome … The Mouse Genome Informatics (MGI).. URL:www.informatics.jax.org/ … Facts about: …
[Figure: encoded token string with TEXT tokens (_) and token positions 1, 5, 9, 64, 68]
12
2. PAT Tree Construction
- PAT tree: binary suffix tree
  - A Patricia tree constructed over all possible suffix strings of a text
- Example
  - Token codes: T( ) 000, T( ) 001, T( ) 010, T( ) 011, T( ) 100, T(_) 110
  - Token string: T( )T(_)T( )T( )T(_)T( )T( )
  - Binary string: 000110001010110011100
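A small sketch of the binary encoding step follows. The 3-bit codes and the TEXT code 110 are taken from the slide; which tag each code belongs to is not recoverable from the slide text, so the tag names in `CODES` are assumptions. The sorted suffix list is only a stand-in for the PAT tree: it enumerates the same token-aligned suffixes that the tree indexes, without building the Patricia structure itself.

```python
# 3-bit codes from the slide; the tag names attached to them are assumed.
CODES = {
    "<B>": "000",
    "</B>": "001",
    "<I>": "010",
    "</I>": "011",
    "<BR>": "100",
    "_": "110",  # TEXT token
}


def to_binary(tokens):
    """Concatenate the fixed-width binary code of every token."""
    return "".join(CODES[token] for token in tokens)


def token_suffixes(bits, width=3):
    """Return (suffix, token_index) pairs in lexicographic order.

    Only suffixes that start on a token boundary are kept, since a
    repeat must begin at the start of a token.
    """
    return sorted((bits[i:], i // width) for i in range(0, len(bits), width))


tokens = ["<B>", "_", "</B>", "<I>", "_", "</I>", "<BR>"]
bits = to_binary(tokens)
order = [index for _, index in token_suffixes(bits)]
```

Here `bits` reproduces the binary string shown on the slide, and `order` lists the token positions in the order their suffixes would be reached in a left-to-right traversal of a binary suffix tree.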
13
The Constructed PAT Tree
14
Definition of Maximal Repeats
- Let α occur in S at positions p_1, p_2, p_3, …, p_k
- α is left maximal if there exists at least one (i, j) pair such that S[p_i - 1] ≠ S[p_j - 1]
- α is right maximal if there exists at least one (i, j) pair such that S[p_i + |α|] ≠ S[p_j + |α|]
- α is a maximal repeat if it is both left maximal and right maximal
15
Finding Maximal Repeats
- Definition:
  - Call the character S[p_i - 1] the left character of suffix p_i
  - A node v is left diverse if at least two leaves in v's subtree have different left characters
- Lemma:
  - The path label α of an internal node v in a PAT tree is a maximal repeat if and only if v is left diverse
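The definition of a maximal repeat can be checked directly with a brute-force sketch. This is not the PAT-tree method of the lemma: it enumerates every substring, which is far slower, and is meant only to make the left-maximal and right-maximal conditions concrete. The function name and the use of `None` as the "no character" marker at the string boundaries are assumptions.

```python
from collections import defaultdict


def maximal_repeats(s):
    """Return {repeat: [positions]} for every maximal repeat in sequence s.

    Brute force: collect the occurrences of every substring, then keep
    those with at least two occurrences whose left characters differ
    (left maximal) and whose right characters differ (right maximal).
    """
    n = len(s)
    occurrences = defaultdict(list)
    for length in range(1, n):
        for p in range(n - length + 1):
            occurrences[tuple(s[p:p + length])].append(p)

    result = {}
    for alpha, positions in occurrences.items():
        if len(positions) < 2:
            continue
        m = len(alpha)
        # None marks "before the start" or "past the end" of s.
        left = {s[p - 1] if p > 0 else None for p in positions}
        right = {s[p + m] if p + m < n else None for p in positions}
        if len(left) > 1 and len(right) > 1:
            result[alpha] = positions
    return result


small = maximal_repeats("abcxabcy")
slide = maximal_repeats("adcwbdadcxbadcxbdadcb")
```

In "abcxabcy" the only maximal repeat is "abc": "ab" always continues with "c" and "bc" is always preceded by "a", so neither is maximal.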
16
3. Pattern Validator
- Suppose the occurrences of a maximal repeat α are ordered by position such that p_1 < p_2 < p_3 … < p_k, where p_i denotes the position of each suffix in the encoded token sequence.
- Characteristics of a Pattern
  - Regularity: Variance coefficient
  - Adjacency: Density
17
Pattern Validator (Cont.)
- Basic Screening: for each maximal repeat α, compute V(α) and D(α)
  a) check the pattern's variance: V(α) < 0.5
  b) check the pattern's density: 0.25 < D(α) < 1.5
- Flow: V(α) < 0.5? No: discard. Yes: 0.25 < D(α) < 1.5? No: discard. Yes: pattern.
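The screening step can be sketched as follows. The slides name the two measures but do not define them, so the formulas here are assumptions: V(α) is taken as the standard deviation of the gaps between adjacent occurrences divided by their mean, and D(α) as (k - 1) * |α| / (p_k - p_1). The thresholds 0.5, 0.25 and 1.5 are the ones on the slide.

```python
from statistics import mean, pstdev


def variance_coefficient(positions):
    """Assumed V: std. deviation of adjacent gaps divided by their mean."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return pstdev(gaps) / mean(gaps)


def density(positions, length):
    """Assumed D: (k - 1) * |alpha| / (p_k - p_1)."""
    k = len(positions)
    return (k - 1) * length / (positions[-1] - positions[0])


def passes_screening(positions, length, v_max=0.5, d_min=0.25, d_max=1.5):
    """Basic screening: keep a repeat only if it is regular and dense enough."""
    if len(positions) < 2:
        return False
    if variance_coefficient(positions) >= v_max:
        return False
    return d_min < density(positions, length) < d_max


# "adc" occurs at token positions 0, 6, 11, 17 in "adcwbdadcxbadcxbdadcb".
regular = passes_screening([0, 6, 11, 17], 3)
# Two tight pairs far apart: the gaps vary too much to count as regular.
irregular = passes_screening([0, 2, 50, 52], 2)
```

Under these assumed definitions a density below 1 means the occurrences leave uncovered tokens between them, which is the case the rule composer later addresses with alignment.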
18
4. Rule Composer
- Occurrence partition
  - Flexible variance threshold control
- Multiple string alignment
  - Increase density of a pattern
[Flowchart labels: V(α) < 0.5, 0.25 < D(α) < 1.5, Yes, No, Discard, occurrences, Occurrence Partition, V(α) < 0.1, D(α) < 1, Multiple String Alignment]
19
Occurrence Partition
- Problem
  - Some patterns are divided into several blocks
  - Ex: Lycos, Excite with large regularity
- Solution
  - Clustering of the occurrences of such a pattern
- Flow: Clustering, then V(α) < 0.1? No: discard. Yes: check density.
20
Multiple String Alignment
- Problem
  - Patterns with density less than 1 can extract only part of the information
- Solution
  - Align k-1 substrings among the k occurrences
  - A natural generalization of alignment for two strings, which can be solved in O(n*m) by dynamic programming, where n and m are the string lengths
21
Multiple String Alignment (Cont.)
- Suppose "adc" is the discovered pattern for token string "adcwbdadcxbadcxbdadcb"
- If we have the following multiple alignment for strings "adcwbd", "adcxb" and "adcxbd":
    a d c w b d
    a d c x b -
    a d c x b d
- The extraction pattern can be generalized as "adc[w|x]b[d|-]"
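The two pieces of this example can be sketched separately. `align` is the O(n*m) dynamic program for two strings that the previous slide mentions, here with unit costs for substitution, insertion and deletion (the scoring is an assumption). `generalize` turns rows that are already aligned into a pattern with alternatives. The procedure that combines pairwise alignments into a full multiple alignment is not shown, and both function names are hypothetical.

```python
def align(a, b, gap="-"):
    """Globally align two sequences with unit edit costs in O(n*m)."""
    n, m = len(a), len(b)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]),  # match or substitute
                cost[i - 1][j] + 1,                            # gap in b
                cost[i][j - 1] + 1,                            # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            row_a.append(a[i - 1]); row_b.append(gap); i -= 1
        else:
            row_a.append(gap); row_b.append(b[j - 1]); j -= 1
    return row_a[::-1], row_b[::-1]


def generalize(rows):
    """Merge equally long aligned rows into a pattern with alternatives."""
    parts = []
    for column in zip(*rows):
        symbols = list(dict.fromkeys(column))  # unique, in order of appearance
        parts.append(symbols[0] if len(symbols) == 1 else "[" + "|".join(symbols) + "]")
    return "".join(parts)


pair = align("adcwbd", "adcxb")
pattern = generalize(["adcwbd", "adcxb-", "adcxbd"])
```

Applied to the three aligned rows from the slide, `generalize` yields the same pattern as the slide, with "-" marking an optional token.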
22
Pattern Viewer
- Java-application based GUI
- Web based GUI
  - http://140.115.155.102/WebIEPAD/
23
The Extractor
- Matching the pattern against the encoded token string
  - Knuth-Morris-Pratt algorithm
  - Boyer-Moore algorithm
- Alternatives in a rule
  - matching the longest pattern
- What is extracted?
  - The whole record
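As an illustration of the matching step, here is a minimal Knuth-Morris-Pratt sketch that works on any sequence of tokens. It handles only a fixed pattern; the alternatives and longest-match behaviour listed on the slide are not covered, and the function names are assumptions.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_find_all(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    hits = []
    k = 0  # number of pattern tokens matched so far
    for i, token in enumerate(text):
        while k and token != pattern[k]:
            k = fail[k - 1]
        if token == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]  # keep going so overlapping matches are found
    return hits


hits = kmp_find_all("adcwbdadcxbadcxbdadcb", "adc")
```

The same function accepts a list of tag tokens in place of a string, since it only compares elements for equality.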
24
Experiment Setup
- Fourteen sources: search engines
- Performance measures
  - Number of patterns
  - Retrieval rate and Accuracy rate
- Parameters
  - Encoding scheme
  - Thresholds control
25
# of Patterns Discovered Using BlockLevel Encoding
- Average 117 maximal repeats in our test Web pages
26
Translation
- Average page length is 22.7KB
27
Accuracy and Retrieval Rate
28
Accuracy and Retrieval Rate
29
Summary
- IEPAD: Information Extraction based on Pattern Discovery
  - Rule generator
  - The extractor
  - Pattern viewer
- Performance
  - 97% retrieval rate and 94% accuracy rate
30
Problems
- Guarantees a high retrieval rate rather than a high accuracy rate
  - A generalized rule can extract more than the desired data
- Currently applicable only when a Web page contains several records
31
Final
- Acknowledgement
  - We would like to thank Lee-Feng Chien, Ming-Jer Lee and Jung-Liang Chen for providing their PAT tree code.
- Reference
  - Chang, C.H. and Lui, S.C. IEPAD: Information Extraction based on Pattern Discovery, WWW10, May 2001, Hong Kong.
32
Future Work
- Interface for choosing a pattern
  - http://www.csie.ncu.edu.tw/~chia/webiepad/
- Multi-level extraction
  - From record boundary extraction to attribute value extraction
- Extractors in Java and C++
33
Rule Format
    level 1 encoding scheme: rule
    level 2 encoding scheme: rule for block 1
    level 2 encoding scheme: rule for block 2
    ...
    level 2 encoding scheme: rule for block k
    level 1 block 1, level 2 block no for attribute 1
    level 1 block 1, level 2 block no for attribute 2
    ...
    level 1 block 1, level 2 block no for attribute t
(k blocks, t attributes)
34
Example (cont.)
    Line 0: Blocklevel.h, String String String String String
    Line 1: Alltag.h, rule for block 1
    Line 2: Alltag.h, rule for block 2
    ...
    Line k: Alltag.h, rule for block k
    Line k+1: level 1 block no, level 2 block no for attribute 1
    Line k+2: level 1 block no, level 2 block no for attribute 2
    ...
    Line k+t: level 1 block no, level 2 block no for attribute t
Demo
    ex: 3, 2
    ex: 5, all
    ex: 5, 1 3
35
Congo Example
36
Performance Evaluation
- Definition: A pattern is said to enumerate a record if the overlapping percentage between the record and the pattern is greater than a threshold
- Three Measures
  - Retrieval Rate
  - Accuracy Rate
  - Matching Percentage
37
Illustration
Let G_{i,j} denote the ordered occurrences p_i, p_{i+1}, ..., p_j
    S = ∅; i = 1;
    for j = 1 to k-1 do
        if R(G_{i,j+1}) > (threshold) then
            if R(G_{i,j}) < m then
                S = S ∪ { G_{i,j} };
            endif
            i = j+1;
        endif
    endfor
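A runnable reading of this loop is sketched below, with several assumptions stated plainly. R is taken to be the variance coefficient of the gaps between adjacent occurrences; the two thresholds `split_above` and `keep_below` are hypothetical parameter names with illustrative default values; and the final step that keeps the last group is an addition, since the loop as written never emits the group that is still open when it ends.

```python
from statistics import mean, pstdev


def regularity(positions):
    """Assumed R: std. deviation of adjacent gaps divided by their mean.

    A group with fewer than two occurrences has no gaps and counts as
    perfectly regular.
    """
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return pstdev(gaps) / mean(gaps) if gaps else 0.0


def partition(positions, split_above=0.5, keep_below=0.1):
    """Split ordered occurrences into regularly spaced groups."""
    groups = []
    i = 0  # index of the first occurrence in the current group
    for j in range(len(positions) - 1):
        # Would adding the next occurrence make the group too irregular?
        if regularity(positions[i:j + 2]) > split_above:
            if regularity(positions[i:j + 1]) < keep_below:
                groups.append(positions[i:j + 1])
            i = j + 1  # the next occurrence starts a new group
    # Addition to the slide's loop: keep the trailing group if it qualifies.
    tail = positions[i:]
    if len(tail) > 1 and regularity(tail) < keep_below:
        groups.append(tail)
    return groups


blocks = partition([0, 5, 10, 15, 100, 105, 110])
```

In the example the large jump from 15 to 100 closes the first block, so the occurrences fall into two evenly spaced groups.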