Web Information Extraction

1 Web Information Extraction
邵蓥侠

2 Outline
Background
Approaches for generating wrappers: manually constructed; machine learning
Examples
Conclusion

3 Terminology
IE = Information Extraction
WIE = Web Information Extraction
TIE = Traditional Information Extraction

4 Background
Abundant information on the Web: structured [tables]; semi-structured [HTML / XML]; free text [blogs].
TIE vs. WIE: scalability; cost; flexibility.
General approach: wrappers.

5 Wrapper
Wrapper: a set of highly accurate rules that extract a particular page's content; equivalently, a function from a page to the set of tuples it contains.
Flow of WIE based on wrappers:
collecting training pages;
labeling training examples [optional];
generalizing extraction rules (wrappers);
extracting the relevant data;
outputting the result in an appropriate format.
Wrappers typically handle highly structured collections of web pages, such as product catalogues and telephone directories. They fail, however, when the text type is less structured, which is also common on the Web. Recent effort on adaptive information extraction motivates the development of IE systems that can handle different types of text, from well-structured to almost free text (where common wrappers fail), including mixed types. Such systems can exploit shallow natural language knowledge and thus can also be applied to less structured text.
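
The definition of a wrapper as "a function from a page to the set of tuples it contains" can be illustrated with a minimal sketch. The page layout, the regular expression `ROW_RULE`, and the function name `wrapper` are all made up for this illustration; a real wrapper would be written or learned for a specific site.

```python
import re

# Hypothetical hand-written extraction rule for a toy phone-directory layout.
ROW_RULE = re.compile(
    r"<li><b>(?P<name>[^<]+)</b>\s*<i>(?P<phone>[^<]+)</i></li>"
)

def wrapper(page):
    """Map a page (string) to the set of (name, phone) tuples it contains."""
    return {
        (m.group("name").strip(), m.group("phone").strip())
        for m in ROW_RULE.finditer(page)
    }

PAGE = """<ul>
<li><b>Alice</b> <i>555-0100</i></li>
<li><b>Bob</b> <i>555-0101</i></li>
</ul>"""

tuples = wrapper(PAGE)
```

The rule only works while the site keeps exactly this markup, which is the brittleness the slide attributes to wrappers on less structured text.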

7 Approaches for generating wrappers
Approaches by degree of automation:
Manually constructed
Machine learning: supervised; semi-supervised; unsupervised
A Survey of Web Information Extraction Systems @ TKDE 06

8 Manually Constructed Wrapper
Definition: manually develop rules/commands/patterns for extracting data.
Examples: TSIMMIS [Hammer et al., 1997]; Minerva [Crescenzi, 1998]; WebOQL [Arocena and Mendelzon, 1998]; W4F [Sahuguet and Azavant, 2001]; XWrap [Liu et al., 2000]

9 Manually Constructed Wrapper
Disadvantages:
Time-consuming to write rules
Non-general
Need to understand the structure of the document
Requires users with special expertise [programmers]

10 Wrapper with Supervised Learning
Definition: a machine learning task of inferring a function from supervised (labeled) training data.
Examples: SRV [Freitag, 1998]; Rapier [Califf and Mooney, 1998]; WIEN [Kushmerick, 1997]; WHISK [Soderland, 1999]; NoDoSE [Adelberg, 1998]; Softmealy [Hsu and Dung, 1998]; Stalker [Muslea, 1999]; DEByE [Laender, 2002b]

11 Wrapper with Supervised Learning
Disadvantage: manually labeling training data is time-consuming.
Compared with manually constructed wrappers: general users, rather than programmers, can label the training data, which reduces the cost of wrapper generation.

12 Wrapper with Semi-Supervised Learning
Definition: a class of machine learning techniques that make use of both labeled and unlabeled data for training, typically a small amount of labeled data with a large amount of unlabeled data.
Examples: SEAL [Richard C. Wang, 2009]; Automatic Wrapper [Nilesh Dalvi, 2011]; IEPAD [Chang and Lui, 2001]; OLERA [Chang and Kuo, 2003]; Thresher [Hogue, 2005]

13 Wrapper with Unsupervised Learning
Definition: the problem of trying to find hidden structure in unlabeled data.
Examples: RoadRunner [Crescenzi, 2001]; DeLa [Wang, 2002; 2003]; EXALG [Arasu and Garcia-Molina, 2003]; DEPTA [Zhai et al., 2005]

14 A Survey of Web Information Extraction Systems @ TKDE 06

19 Manually-Constructed Example
TSIMMIS: one of the first approaches to give a framework for the manual building of Web wrappers.
Wrapper: manually constructed as commands.
Input: a specification file that declaratively states where the data of interest is located on the page.
Output: Object Exchange Model (OEM).
Semi-structured Data: The TSIMMIS ADBIS 97

20 Manually-Constructed Example
Each command is of the form [variables, source, pattern], where:
source specifies the input text to be considered;
pattern specifies how to find the text of interest within the source;
variables is a list of variables that hold the extracted results.
Note: # means “save in the variable”; * means “discard”.
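
The [variables, source, pattern] command form can be sketched as a tiny interpreter. This is a simplified reading of the slide, not the TSIMMIS implementation: `compile_pattern`, `run_commands`, the `env` dictionary, and the sample page are all hypothetical, and `#` and `*` are simply translated into capturing and non-capturing wildcards.

```python
import re

def compile_pattern(pattern):
    """Translate '#' (save) and '*' (discard) into a regular expression."""
    parts = []
    for ch in pattern:
        if ch == "#":
            parts.append("(.*?)")   # save the matched text in a variable
        elif ch == "*":
            parts.append(".*?")     # match and discard
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)

def run_commands(commands, env):
    """Run each [variables, source, pattern] command against env[source]."""
    for variables, source, pattern in commands:
        match = compile_pattern(pattern).fullmatch(env[source])
        if match is None:
            raise ValueError(f"pattern {pattern!r} does not match {source!r}")
        for name, value in zip(variables, match.groups()):
            env[name] = value
    return env

# Hypothetical page and specification, for illustration only.
page = "<html><title>Weather</title><b>Paris</b>: 21C</html>"
commands = [
    (["title"], "root", "*<title>#</title>*"),
    (["city", "temp"], "root", "*<b>#</b>: #C*"),
]
env = run_commands(commands, {"root": page})
```

Because every extracted variable is stored back into `env`, a later command can name an earlier variable as its source, which is how a specification can drill down into a page step by step.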

21 Manually-Constructed Example
[Figure: specification file, Web page, and the resulting OEM]

22 Supervised Learning Example
SRV: a top-down relational algorithm that generates single-slot extraction rules.
The learning algorithm works like FOIL.
Rules are token-oriented logic rules.
SRV adds predicates greedily, attempting thereby to “cover” as many positive, and as few negative, examples as possible.
Note: top-down and bottom-up are strategies of information processing and knowledge ordering; in many cases top-down is used as a synonym of analysis or decomposition, and bottom-up of synthesis.
Information Extraction from HTML: Application of a General Machine Learning AAAI 98
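
The greedy, top-down covering idea can be sketched as follows. This is not SRV itself: the scoring here is a plain "positives covered minus negatives covered" count rather than FOIL's information gain, and the token representation, the predicate names, and `learn_rule` are assumptions made for the example.

```python
def learn_rule(positives, negatives, predicates):
    """Greedily add predicates until no negative example is still covered."""
    rule = []
    pos, neg = list(positives), list(negatives)
    remaining = dict(predicates)
    while neg and remaining:
        # Simplified score: positives kept minus negatives kept.
        def score(name):
            test = remaining[name]
            return sum(map(test, pos)) - sum(map(test, neg))

        best = max(remaining, key=score)
        test = remaining.pop(best)
        new_pos = [x for x in pos if test(x)]
        new_neg = [x for x in neg if test(x)]
        if not new_pos or len(new_neg) == len(neg):
            break  # the best predicate no longer helps
        rule.append(best)
        pos, neg = new_pos, new_neg
    return rule

# Hypothetical token-level examples: each token has text and a context flag.
positives = [{"text": "4", "in_li": True}, {"text": "5", "in_li": True}]
negatives = [
    {"text": "1998", "in_li": False},
    {"text": "drama", "in_li": True},
    {"text": "Rating", "in_li": True},
]
predicates = {
    "numeric": lambda t: t["text"].isdigit(),
    "in_li": lambda t: t["in_li"],
    "capitalized": lambda t: t["text"][:1].isupper(),
}
rule = learn_rule(positives, negatives, predicates)
```

On this toy data the learned rule is a conjunction of two predicates, each of which removes some of the negatives while keeping both positives.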

23 SRV Learning process
SRV algorithm
Input: annotated documents and features.
Induce rules on 2/3 of the training data.
Validate the rules on the remaining 1/3 of the training data.
Iterate the learning 3 times.
Output: rules that predict a single slot.
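
The 2/3 versus 1/3 partitioning, repeated three times, might look like the sketch below. Whether each round uses a fresh random partition (as assumed here) or a fixed rotation is not stated on the slide; `induction_validation_splits` and the document names are hypothetical.

```python
import random

def induction_validation_splits(examples, rounds=3, seed=0):
    """Yield (induction, validation) pairs: 2/3 vs. 1/3, `rounds` times."""
    rng = random.Random(seed)
    for _ in range(rounds):
        shuffled = list(examples)
        rng.shuffle(shuffled)            # assumed: a new random partition per round
        cut = (2 * len(shuffled)) // 3
        yield shuffled[:cut], shuffled[cut:]

documents = [f"doc{i}" for i in range(9)]
splits = list(induction_validation_splits(documents))
```

Rules would be induced on the first part of each pair and scored on the second, so that each rule carries a validation estimate of its accuracy.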

24 Supervised Learning Example
Rule for extracting a rating: the rating is a single numeric word that occurs within an HTML list tag.
[Figure: Web page]
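
Applying such a rule by hand can be sketched with the standard-library HTML parser. The sketch assumes that "HTML list tag" means `<li>`, and the class name `RatingExtractor` and the sample page are invented for illustration; SRV's learned rules are logic rules over tokens, not Python classes.

```python
from html.parser import HTMLParser

class RatingExtractor(HTMLParser):
    """Collect tokens that are numeric and occur inside an <li> element."""

    def __init__(self):
        super().__init__()
        self.li_depth = 0
        self.ratings = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.li_depth += 1

    def handle_endtag(self, tag):
        if tag == "li" and self.li_depth > 0:
            self.li_depth -= 1

    def handle_data(self, data):
        if self.li_depth == 0:
            return                      # predicate "inside a list tag" fails
        for token in data.split():
            if token.isdigit():         # predicate "single numeric word"
                self.ratings.append(token)

page = "<p>Released 1998</p><ul><li>Rating: 4</li><li>Genre: drama</li></ul>"
extractor = RatingExtractor()
extractor.feed(page)
ratings = extractor.ratings
```

The year in the paragraph is numeric but fails the list-tag predicate, and the words inside the list items fail the numeric predicate, so only the token satisfying both is extracted.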

25 Semi-Supervised Learning Example
SEAL (Set Expander for Any Language): expands entities automatically by utilizing resources from the Web in a language-independent fashion.
Flow of SEAL: extracting wrappers; ranking wrappers / candidates.
Character-level Analysis of Semi-Structured Documents for Set Expansion @ EMNLP 09

26 Semi-Supervised Learning Example
Extracting wrappers
Input: seed instances and a document.
Find the seed instances in the document.
Generate the left/right contexts of each occurrence.
Mine the left/right contexts: find all the longest possible strings s from the left context set, given some constraints; for each such s, find the longest possible string s0 from the right context set such that s and s0 bracket at least one occurrence of every given seed in the document.
NOTE: the left/right contexts are maintained in a Patricia trie.
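
The wrapper-extraction steps can be approximated with a brute-force sketch. It replaces the Patricia tries with direct string comparison over every combination of seed occurrences, so it only suits tiny inputs; the function names, the `width` limit, and the sample document are assumptions, not part of SEAL.

```python
import re
from itertools import product

def common_prefix(strings):
    """Longest string that is a prefix of every input string."""
    prefix = strings[0]
    for s in strings[1:]:
        i = 0
        while i < min(len(prefix), len(s)) and prefix[i] == s[i]:
            i += 1
        prefix = prefix[:i]
    return prefix

def common_suffix(strings):
    return common_prefix([s[::-1] for s in strings])[::-1]

def contexts(document, seed, width):
    """(left, right) character contexts of every occurrence of a seed."""
    return [
        (document[max(0, m.start() - width):m.start()],
         document[m.end():m.end() + width])
        for m in re.finditer(re.escape(seed), document)
    ]

def learn_wrappers(document, seeds, width=30):
    """Find (left, right) pairs bracketing an occurrence of every seed."""
    per_seed = [contexts(document, seed, width) for seed in seeds]
    wrappers = set()
    for combo in product(*per_seed):
        left = common_suffix([l for l, _ in combo])
        right = common_prefix([r for _, r in combo])
        if left and right:
            wrappers.add((left, right))
    return wrappers

def apply_wrapper(document, wrapper):
    left, right = wrapper
    pattern = re.escape(left) + r"(.+?)(?=" + re.escape(right) + ")"
    return re.findall(pattern, document)

doc = ("<td><b>Ford</b> cars</td><td><b>Nissan</b> cars</td>"
       "<td><b>Honda</b> cars</td><td><b>Toyota</b> cars</td>")
wrappers = learn_wrappers(doc, ["Ford", "Nissan", "Toyota"])
candidates = apply_wrapper(doc, next(iter(wrappers)))
```

Applying the learned wrapper back to the document yields the seeds plus any other string bracketed by the same contexts, which is how the seed set gets expanded before the ranking step.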

27 Semi-Supervised Learning Example
[Figure: example document with seeds {Ford, Nissan, Toyota}]

30 Unsupervised Learning Example
RoadRunner: a novel approach to wrapper inference for HTML pages.
Idea: generating HTML pages using scripts => encoding data; extracting data from HTML pages => decoding.
Problem formulation: find the nested type of the source dataset; extract the source dataset from the HTML pages.
RoadRunner: Towards Automatic Data Extraction from Large Web VLDB 2001

31 Unsupervised Learning Example
Find Nested Type
Theoretical background: based on the close correspondence between nested types and union-free regular expressions (UFRE) => find the least upper bound (LUB) UFRE.
Solution for the LUB UFRE: ACME (Align, Collapse under Mismatch, and Extract).
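
The align-and-generalize idea behind ACME can be sketched for the simplest case only: two pages with identical tag structure that differ in their text. Differing strings are collapsed into a `#PCDATA` field. Tag mismatches, which lead to optionals and iterators in the union-free regular expression, are deliberately out of scope, and `tokenize`, `generalize`, and `extract` are hypothetical names.

```python
import re

TOKEN = re.compile(r"<[^>]+>|[^<]+")

def tokenize(page):
    """Split a page into tag tokens and text tokens."""
    return [t.strip() for t in TOKEN.findall(page) if t.strip()]

def generalize(wrapper, sample):
    """Align token by token; collapse differing strings into #PCDATA."""
    if len(wrapper) != len(sample):
        raise ValueError("tag mismatches are not handled in this sketch")
    result = []
    for w, s in zip(wrapper, sample):
        if w == s:
            result.append(w)
        elif not w.startswith("<") and not s.startswith("<"):
            result.append("#PCDATA")    # string mismatch: a data field
        else:
            raise ValueError("tag mismatches are not handled in this sketch")
    return result

def extract(wrapper, page):
    """Read the values found at the #PCDATA positions of the wrapper."""
    return [s for w, s in zip(wrapper, tokenize(page)) if w == "#PCDATA"]

page1 = "<html><b>John Smith</b><i>DB Primer</i></html>"
page2 = "<html><b>Paul Jones</b><i>DB Primer</i></html>"
wrapper = generalize(tokenize(page1), tokenize(page2))
fields = extract(wrapper, page2)
```

Text that is identical on both pages stays in the wrapper as template content, while text that differs is treated as data, which mirrors the encoding/decoding view of page generation.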

33 Conclusion
WIE will remain important because of the “data flood” on the Internet.
Current WIE systems are almost all based on machine learning, but they are still not perfect.
New techniques such as MapReduce, Hadoop, and Spark promote the development of ML and may also benefit WIE.

34 Q&A

35 Reference
Information Extraction @ Wikipedia
Wrapper (data Wikipedia
Supervised Wikipedia
Semi-Supervised Wikipedia
Unsupervised Wikipedia
A Survey of Web Information Extraction Systems @ TKDE 06
Semi-structured Data: The TSIMMIS ADBIS 97
Information Extraction from HTML: Application of a General Machine Learning AAAI 98
Character-level Analysis of Semi-Structured Documents for Set Expansion @ EMNLP 09
RoadRunner: Towards Automatic Data Extraction from Large Web VLDB 2001

