
1 Chia-Hui Chang, Shao-Chen Lui Dept. of Computer Science and Information Engineering National Central University IEPAD: Information Extraction Based on Pattern Discovery WWW10 ’01

2 Introduction (1/4) September 12, 2015 2

3 Introduction (2/4) There is a great need for value-added services that integrate information from multiple sources: customizable Web information gathering robots/crawlers, comparison-shopping agents, meta-search engines, and newsbots. Suppose the data has been collected from different Web sites. One must then write an extractor program to extract the contents of the Web pages: observe the extraction rules in person, then write a program for each Web site. Since the format of Web pages is often subject to change, maintaining the wrappers can be expensive and impractical → labor-intensive!

4 Introduction (3/4) Related work. (1) Tools that generate wrappers automatically, using machine learning techniques to summarize extraction rules. Ex: WIEN, Softmealy, Stalker. The designer must manually label the beginning and the end of the training examples used to generate the rules, and manual labeling is time-consuming. (2) Fully automated wrapper construction, without users' training examples. Ex: the one-tag separator approach (Embley et al.), which discovers record boundaries in Web documents by identifying candidate separator tags using five independent heuristics. A problem arises when the separator tag is also used inside a record, not only at its boundary.

5 Introduction (4/4) Eliminate human intervention by pattern mining. The motivation comes from the observation that useful information in a Web page is often placed in a structure with a particular alignment and order. Ex: Web pages produced by search engines generally present search results in regular, repetitive patterns. Mining repetitive patterns may therefore discover the extraction rules for wrappers.

6 System Overview (1/3) The IEPAD system includes three components: (1) an extraction rule generator, which accepts an input Web page; (2) a graphical user interface, called the pattern viewer, which shows the repetitive patterns discovered; (3) an extractor module, which extracts the desired information from similar Web pages according to the extraction rule chosen by the user.

7 System Overview (2/3) The extraction rule generator includes: translator, PAT tree constructor, pattern discoverer, pattern validator, extraction rule composer. The output of the rule generator is the set of extraction rules discovered in a Web page.

8 System Overview (3/3) 1. The user submits an HTML page. 2. The translator receives the page and translates it into a string of abstract representations. 3. The PAT tree constructor receives the binary file and constructs a PAT tree. 4. The pattern discoverer uses the PAT tree to discover repetitive patterns, called maximal repeats. 5. The pattern validator filters out undesired patterns and produces candidate patterns. 6. The rule composer revises each candidate pattern to form an extraction rule expressed as a regular expression.

9 Extraction Rule Generator (1/2) Desired information in a Web page is often placed in a structure with a particular alignment and forms repetitive patterns, which may constitute the extraction rules for wrappers. Repetitive pattern: any substring that occurs at least twice in the encoded token string. Too many patterns satisfy this requirement, so maximal repeats are defined to uniquely identify the longest pattern.

10 Extraction Rule Generator (2/2) Maximal repeats are necessary for identifying the well-used and popular repeats. They have to be further verified by the validator to select the interesting ones.
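The notion of a maximal repeat can be illustrated with a brute-force sketch. It assumes the common definition (a substring that occurs at least twice and whose occurrences are preceded by at least two different tokens and followed by at least two different tokens); the function name `maximal_repeats` and its parameters are hypothetical, and the quadratic enumeration stands in for the PAT tree the slides use.

```python
from collections import defaultdict


def maximal_repeats(tokens, min_len=1, min_count=2):
    """Return {pattern: [start positions]} for every maximal repeat.

    Brute force over all substrings; fine for short token strings only.
    """
    n = len(tokens)
    occurrences = defaultdict(list)
    for length in range(min_len, n):
        for start in range(n - length + 1):
            occurrences[tuple(tokens[start:start + length])].append(start)

    result = {}
    for pattern, starts in occurrences.items():
        if len(starts) < min_count:
            continue
        length = len(pattern)
        # Neighbouring tokens; None marks the start or end of the string.
        left = {tokens[s - 1] if s > 0 else None for s in starts}
        right = {tokens[s + length] if s + length < n else None for s in starts}
        # Keep the pattern only if it cannot be extended on either side
        # without losing an occurrence.
        if len(left) > 1 and len(right) > 1:
            result[pattern] = starts
    return result


repeats = maximal_repeats(list("adcwbdadcxbadcxbdadcb"), min_len=3)
print(repeats[tuple("adc")])  # start positions of "adc"
```

Each result carries its start positions, so the occurrence count and pattern length can be read off directly.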

11 Translator (1/2) HTML page → token string. There are two kinds of tokens. Tag token: Html( ). Text token: the non-tag text content between two tags is treated as a single token, Text(_).
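A minimal sketch of such a translator, assuming a regular-expression split is good enough for illustration. The function name `translate`, the choice to keep only the upper-cased tag name inside `Html(...)`, and the sample input are assumptions, since the slide text shows only `Html( )`.

```python
import re

# A tag is "<...>"; anything between two tags is text.
TAG_OR_TEXT = re.compile(r"<[^>]+>|[^<]+")
TAG_NAME = re.compile(r"<\s*(/?\s*\w+)")


def translate(html: str) -> list[str]:
    """Turn an HTML string into a list of abstract tokens."""
    tokens = []
    for piece in TAG_OR_TEXT.findall(html):
        if piece.startswith("<"):
            name = TAG_NAME.match(piece)
            if name:  # comments and other odd markup are skipped
                tokens.append(f"Html(<{name.group(1).upper()}>)")
        elif piece.strip():
            # All text between two tags collapses into one token.
            tokens.append("Text(_)")
    return tokens


print(translate("<B>Congo</B><I>242</I><BR>"))
```

Attributes are dropped here; other encodings could keep fewer or more tag types.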

12 Translator (2/2) Example: Congo code [figure not preserved: token positions 1 to 14]

13 PAT Tree Construction Sistring: 000110001010110011100$. Each internal node holds a bit position in the encoded bit string, used when locating a given sistring in the PAT tree. The PAT tree stores all its data in external nodes.
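The sistrings (semi-infinite strings) that a PAT tree indexes can be listed with a short sketch. The 3-bit code table and the seven-token sequence below are hypothetical, chosen so that the encoding reproduces the bit string on the slide; starting sistrings only at token boundaries is also an assumption.

```python
# Hypothetical fixed-width code table (3 bits per token).
CODE_TABLE = {
    "Html(<B>)": "000",
    "Html(</B>)": "001",
    "Html(<I>)": "010",
    "Html(</I>)": "011",
    "Html(<BR>)": "100",
    "Text(_)": "110",
}


def encode(tokens, code_table):
    """Concatenate the fixed-width binary code of every token."""
    return "".join(code_table[t] for t in tokens)


def sistrings(bits, width):
    """List (bit position, suffix) for each suffix starting at a token boundary."""
    return [(pos, bits[pos:] + "$") for pos in range(0, len(bits), width)]


tokens = ["Html(<B>)", "Text(_)", "Html(</B>)",
          "Html(<I>)", "Text(_)", "Html(</I>)", "Html(<BR>)"]
bits = encode(tokens, CODE_TABLE)
for pos, suffix in sistrings(bits, 3):
    print(pos, suffix)
```

A PAT tree would store one external node per sistring and branch on the first bit where two sistrings differ.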

14 Pattern Discoverer (1/2)

15 Pattern Discoverer (2/2) Record not only the maximal repeats but also their occurrence counts, reference positions, and pattern lengths. Ex: to find all patterns longer than 3 tokens, note that each token is encoded with 3 bits, so only the internal nodes with index bit > 3*3 = 9 need to be checked: d, e, g, l, m. Of these, only d is left diverse; the maximal repeat is [not preserved]

16 Pattern Validator (1/2) A typical Web page usually contains a large number of maximal repeats, and not all of them are useful! The validator uses 3 criteria to decide which maximal repeats are useful. Regularity: measured as the standard deviation of the intervals between adjacent occurrences, divided by the mean of those intervals.
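The regularity measure can be sketched in a few lines. The function name `regularity` is hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the slide does not say which variant is meant.

```python
from statistics import mean, pstdev


def regularity(positions):
    """Standard deviation of the gaps between adjacent occurrences,
    divided by the mean gap. Lower means more evenly spaced."""
    intervals = [b - a for a, b in zip(positions, positions[1:])]
    if not intervals:
        raise ValueError("need at least two occurrences")
    return pstdev(intervals) / mean(intervals)


print(regularity([0, 5, 10, 15]))   # evenly spaced occurrences
print(regularity([0, 6, 11, 17]))   # slightly uneven spacing
```

Evenly spaced occurrences give a value of zero, so a pattern passes when its regularity stays below a chosen threshold.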

17 Pattern Validator (2/2) 3 thresholds are used to filter out the maximal repeats that do not qualify. Maximal repeats that contain no Text token are filtered out as well.

18 Occurrence Partition Special case: the pattern of the target information forms three information blocks in the Web page. Because regularity is measured over all instances, Regularity → large! Partition the occurrences into segments; the threshold in the partition condition (left-hand side not preserved) is set to a small value close to zero.
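One possible reading of occurrence partition is a greedy split: keep extending the current segment while its regularity stays below a small threshold, and start a new segment otherwise. The exact condition on the slide is not preserved, so the function `partition`, the parameter `eps`, and the greedy strategy are all assumptions for illustration.

```python
from statistics import mean, pstdev


def regularity(positions):
    """Standard deviation of adjacent gaps divided by the mean gap."""
    intervals = [b - a for a, b in zip(positions, positions[1:])]
    return pstdev(intervals) / mean(intervals)


def partition(positions, eps=0.1):
    """Greedily split sorted occurrence positions into regular segments."""
    if not positions:
        return []
    segments, current = [], [positions[0]]
    for p in positions[1:]:
        candidate = current + [p]
        # Two occurrences have a single gap, so they are always regular.
        if len(candidate) < 3 or regularity(candidate) < eps:
            current = candidate
        else:
            segments.append(current)
            current = [p]
    segments.append(current)
    return segments


blocks = partition([0, 5, 10, 15, 100, 105, 110, 300, 305, 310])
print(blocks)
```

With three well-separated blocks of evenly spaced occurrences, each block becomes its own segment and each segment has low regularity on its own.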

19 Rule Composer Find a good representation of the critical common features of multiple strings. Suppose "adc" is the discovered pattern for the token string "adcwbdadcxbadcxbdadcb". With multiple alignment of the strings, the extraction pattern can be generalized as "adc[w|x]b[d|-]". Records are assumed to be contiguous; if there are more than 10 alternatives, the maximal repeat is still used. Center string algorithm: an approximation that reduces time complexity. Another problem: the generated pattern is "c1c2c3...cn", while the actual record pattern is "cjcj+1cj+2...cnc1c2...cj–1". Consider the records that begin with cj and check whether "cjcj+1cj+2...cnc1c2...cj–1" is the correct pattern.
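The generalization step can be sketched once the strings are aligned. The sketch below does not implement the center string algorithm; it assumes the rows are already aligned by hand, with "-" marking a gap, and uses the first three occurrences from the token string above. The function names `generalize` and `rotations` are hypothetical.

```python
def generalize(aligned):
    """Collapse equal-length aligned rows into one pattern, column by column."""
    if len({len(row) for row in aligned}) != 1:
        raise ValueError("aligned rows must have equal length")
    parts = []
    for column in zip(*aligned):
        # Distinct symbols in this column, with the gap symbol listed last.
        symbols = sorted(set(column), key=lambda s: (s == "-", s))
        if len(symbols) == 1:
            parts.append(symbols[0])
        else:
            parts.append("[" + "|".join(symbols) + "]")
    return "".join(parts)


def rotations(pattern):
    """All cyclic shifts of a pattern: candidate record starting points."""
    return [pattern[j:] + pattern[:j] for j in range(len(pattern))]


print(generalize(["adcwbd", "adcxb-", "adcxbd"]))
print(rotations("abc"))
```

The second helper lists the cyclic shifts that would have to be checked when the discovered pattern starts in the middle of a record.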

20 The Extractor (1/2) 1. 2 patterns discovered 2. Shows the detailed measures of the selected pattern

21 The Extractor (2/2) 3. The selected pattern is then forwarded to the extractor for pattern recognition and extraction. If the PAT tree has already been constructed: searching in a PAT tree is fast, since all sistrings in any subtree of a PAT tree share a common prefix → efficient, linear-time. Else: use a pattern-matching algorithm or a finite-state machine for the extraction rule (regular expression).
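The pattern-matching branch can be sketched with Python's `re` module. The converter `to_regex` is hypothetical: it assumes each token is a single character and rewrites the slide notation `[w|x]` and `[d|-]` into a regular-expression group, treating "-" as "may be absent".

```python
import re


def to_regex(rule):
    """Rewrite "[a|b|-]" groups into "(?:a|b)?" style regex groups."""
    def convert(match):
        options = match.group(1).split("|")
        optional = "-" in options
        body = "|".join(re.escape(o) for o in options if o != "-")
        return f"(?:{body})" + ("?" if optional else "")
    return re.sub(r"\[([^\]]*)\]", convert, rule)


pattern = to_regex("adc[w|x]b[d|-]")
records = re.findall(pattern, "adcwbdadcxbadcxbdadcb")
print(pattern)
print(records)
```

On the example token string this yields one match per record that fits the generalized rule; the trailing "adcb" does not fit it and is left out.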

22 Experiments (1/3) 14 search engines, each with 10 Web pages. All-tag encoding scheme. Fixed min. length = 3. Min. frequency = 5.

23 Experiments (2/3) [table not preserved: recall and precision by encoding scheme; only the value 0.4% survives] A pattern may contain only a portion of the data record.

24 Experiments (3/3) Occurrence partition. Multiple string alignment. Lycos → 92%

25 Summary Presented an unsupervised approach for pattern discovery in the encoded token string of Web pages. Discovered maximal repeats are filtered by the measures regularity and compactness. Regularity higher than the threshold → occurrence partition. Multiple string alignment is applied to patterns to generalize multiple records. Extraction rules are expressed as regular expressions. High retrieval rate and accuracy rate. No human intervention or training examples. Takes only 3 minutes to extract 140 pages → quick and efficient!

