Web Information Extraction 邵蓥侠
Outline Background Approaches for generating wrappers Examples Manually constructed Machine learning Examples Conclusion
Terminology IE = Information Extraction WIE = Web Information Extraction TIE = Traditional Information Extraction
Background Abundant information on the Web, in three forms: structured [tables], semi-structured [HTML / XML], free text [blogs] TIE vs. WIE: scalability, cost, flexibility General approach: wrappers
Wrapper A wrapper is a set of highly accurate rules that extract a particular page's content; equivalently, a function from a page to the set of tuples it contains. Flow of WIE based on wrappers: collecting training pages; labeling training examples [optional]; generalizing extraction rules (wrappers); extracting the relevant data; outputting the result in an appropriate format. Wrappers typically handle highly structured collections of web pages, such as product catalogues and telephone directories. They fail, however, when the text type is less structured, which is also common on the Web. Recent effort on adaptive information extraction motivates the development of IE systems that can handle different types of text, from well-structured to almost free text - where common wrappers fail - including mixed types. Such systems can exploit shallow natural language knowledge and thus can also be applied to less structured text.
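In its simplest form, a wrapper really is just "a function from a page to the set of tuples it contains". A minimal sketch in Python; the product-catalogue layout and field names here are hypothetical, not taken from any real system:

```python
import re

# A regex describing one row of a hypothetical product-catalogue page.
PRODUCT_ROW = re.compile(
    r"<tr><td>(?P<name>[^<]+)</td><td>\$(?P<price>\d+\.\d{2})</td></tr>"
)

def wrapper(page: str) -> set[tuple[str, str]]:
    """A wrapper: maps a page to the set of (name, price) tuples it contains."""
    return {(m.group("name"), m.group("price"))
            for m in PRODUCT_ROW.finditer(page)}

page = ("<table>"
        "<tr><td>Widget</td><td>$9.99</td></tr>"
        "<tr><td>Gadget</td><td>$19.50</td></tr>"
        "</table>")
print(wrapper(page))  # {('Widget', '9.99'), ('Gadget', '19.50')}
```

This also illustrates why wrappers fail on less structured text: the regex encodes the page layout, so any layout change breaks extraction.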
Outline Background Approaches for generating wrappers Examples Manually constructed Machine learning Examples Conclusion
Approaches for generating wrappers By degree of automation: manually constructed; machine learning (supervised, semi-supervised, unsupervised) A Survey of Web Information Extraction Systems @ TKDE 06
Manually Constructed Wrapper Definition: manually develop rules/commands/patterns for extracting data Examples TSIMMIS [Hammer, et al., 1997] Minerva [Crescenzi, 1998] WebOQL [Arocena and Mendelzon, 1998] W4F [Sahuguet and Azavant, 2001] XWrap [Liu, et al., 2000]
Manually Constructed Wrapper Disadvantages Time-consuming to write rules Non-general: need to understand the structure of each document Requires special expertise of users [programmers]
Wrapper with Supervised Learning A machine learning task of inferring a function from supervised (labeled) training data Examples SRV [Freitag, 1998] Rapier [Califf and Mooney, 1998] WIEN [Kushmerick, 1997] WHISK [Soderland, 1999] NoDoSE [Adelberg, 1998] SoftMealy [Hsu and Dung, 1998] Stalker [Muslea, 1999] DEByE [Laender, 2002b]
Wrapper with Supervised Learning Disadvantage: manually labeling training data is time-consuming Advantage vs. manually constructed: general users, instead of programmers, can label training data, thus reducing the cost of wrapper generation
Wrapper with Semi-Supervised Learning A class of machine learning techniques that make use of both labeled and unlabeled data for training - typically a small amount of labeled data with a large amount of unlabeled data. Examples SEAL [Richard C. Wang, 2009] Automatic Wrapper [Nilesh Dalvi, 2011] IEPAD [Chang and Lui, 2001] OLERA [Chang and Kuo, 2003] Thresher [Hogue, 2005]
Wrapper with Unsupervised Learning refers to the problem of trying to find hidden structure in unlabeled data Examples Roadrunner [Crescenzi, 2001] DeLa [Wang, 2002; 2003] EXALG [Arasu and Garcia-Molina, 2003] DEPTA [Zhai, et al., 2005]
[Comparison tables of the surveyed WIE systems - A Survey of Web Information Extraction Systems @ TKDE 06]
Outline Background Approaches for generating wrappers Examples Manually constructed Machine learning Examples Conclusion
Manually-Constructed Example TSIMMIS One of the first approaches to give a framework for manually building Web wrappers Wrapper: manually constructed as commands Input: a specification file that declaratively states where the data of interest is located on the page Output: Object Exchange Model (OEM) objects Semi-structured Data: The TSIMMIS Experience @ ADBIS 97
Manually-Constructed Example Each command is of the form [variables, source, pattern], where source specifies the input text to be considered, pattern specifies how to find the text of interest within the source, and variables are a list of variables that hold the extracted results. Note: # means "save in the variable"; * means "discard"
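A toy interpreter for such [variables, source, pattern] commands can make the idea concrete. The "#"/"*" semantics follow the slide, but the pattern grammar, function name, and sample HTML below are illustrative assumptions, not TSIMMIS's actual specification language:

```python
import re

def run_command(variables, source, pattern):
    """Toy TSIMMIS-style command: '#' saves text into a variable, '*' discards it."""
    regex = ""
    # Split the pattern on '#' and '*', keeping the delimiters.
    for part in re.split(r"([#*])", pattern):
        if part == "#":
            regex += "(.*?)"      # capture: save in the next variable
        elif part == "*":
            regex += "(?:.*?)"    # match but discard
        else:
            regex += re.escape(part)  # literal text locates the data
    m = re.search(regex, source, re.DOTALL)
    return dict(zip(variables, m.groups())) if m else {}

html = "<b>Name:</b> ACME Corp <b>Tel:</b> 555-0100 <i>ad</i>"
print(run_command(["name", "phone"], html,
                  "<b>Name:</b> # <b>Tel:</b> # <i>*</i>"))
# {'name': 'ACME Corp', 'phone': '555-0100'}
```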
Manually-Constructed Example [Example: specification file, source Web page, and resulting OEM objects]
Supervised Learning Example SRV A top-down relational algorithm that generates single-slot extraction rules Learning algorithm works like FOIL: token-oriented logic rules SRV adds predicates greedily (top-down: starting from the most general rule and specializing it), attempting thereby to "cover" as many positive, and as few negative, examples as possible Information Extraction from HTML: Application of a General Machine Learning Approach @ AAAI 98
SRV Learning process Input: annotated documents & features Induce rules from 2/3 of the training data Validate the rules on the remaining 1/3 Iterate the learning 3 times Output: validated single-slot extraction rules
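The outer loop above (induce on 2/3, validate on the remaining 1/3, repeat three times) can be sketched as follows. `induce` and `score` are hypothetical stand-ins for SRV's FOIL-style rule induction and rule evaluation, which are far more involved:

```python
import random

def learn(examples, induce, score, rounds=3, seed=0):
    """Sketch of SRV's outer loop: 3 rounds of 2/3-train / 1/3-validate."""
    rng = random.Random(seed)
    validated = []
    for _ in range(rounds):
        shuffled = examples[:]
        rng.shuffle(shuffled)
        cut = (2 * len(shuffled)) // 3
        train, held_out = shuffled[:cut], shuffled[cut:]
        for rule in induce(train):           # FOIL-style induction (stand-in)
            # Attach each rule's accuracy on held-out data, so downstream
            # extraction can weigh predictions by validated confidence.
            validated.append((rule, score(rule, held_out)))
    return validated
```

Used with real induction, the held-out scores let SRV attach confidence to each learned single-slot rule rather than trusting the training fit.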
Supervised Learning Example Rules for extracting the rating field, which say that a rating is a single numeric word and occurs within an HTML list tag
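That learned rule - a single numeric word inside an HTML list tag - can be approximated by a hand-written regular expression (an illustrative stand-in, not SRV's actual logic-rule representation):

```python
import re

# "A single numeric word that occurs within an HTML list tag."
RATING = re.compile(r"<li>\s*(\d+)\s*</li>")

page = "<ul><li>Comedy</li><li>8</li><li>PG-13</li></ul>"
print(RATING.search(page).group(1))  # prints "8"
```

Note how the two predicates combine: `\d+` enforces "single numeric word" (so "PG-13" is rejected), and the surrounding `<li>...</li>` enforces the positional constraint.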
Semi-Supervised Learning Example SEAL (Set Expander for Any Language) Expands entity sets automatically by utilizing resources from the Web in a language-independent fashion Flow of SEAL: extracting wrappers; ranking wrappers / candidates Character-level Analysis of Semi-Structured Documents for Set Expansion @ EMNLP 09
Semi-Supervised Learning Example Extracting wrappers Input: seed instances & a document Find the seed instances in the document and generate their left/right contexts Mine the contexts: find all the longest possible strings s from the left-context set, given some constraints; for each such s, find the longest possible string s0 from the right contexts such that s and s0 bracket at least one occurrence of every given seed in the document NOTE: the left/right contexts are maintained in a Patricia trie
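The context-mining step can be sketched as follows. This brute-force version ignores the Patricia trie and only considers contexts around the first seed's first occurrence, so it is far simpler than SEAL proper; the document and function names are illustrative:

```python
import re

def mine_wrapper(doc, seeds, max_len=40):
    """Find left/right contexts bracketing at least one occurrence of every seed."""
    i = doc.index(seeds[0])
    end = i + len(seeds[0])
    left_full = doc[max(0, i - max_len):i]
    # Shorten the left context until it precedes every seed somewhere.
    left = next(left_full[c:] for c in range(len(left_full) + 1)
                if all(left_full[c:] + s in doc for s in seeds))
    # Grow the right context while it still follows every seed.
    right = ""
    for k in range(1, max_len + 1):
        cand = doc[end:end + k]
        if cand and all(left + s + cand in doc for s in seeds):
            right = cand
        else:
            break
    return left, right

def expand(doc, left, right):
    """Apply the mined wrapper: extract everything bracketed by left...right."""
    return re.findall(re.escape(left) + "(.*?)" + re.escape(right), doc)

doc = ("<li><a>Ford</a></li><li><a>Nissan</a></li>"
       "<li><a>Honda</a></li><li><a>Toyota</a></li>"
       " article text also mentioning Ford")
left, right = mine_wrapper(doc, ["Ford", "Nissan", "Toyota"])
print(expand(doc, left, right))  # ['Ford', 'Nissan', 'Honda', 'Toyota']
```

The point of the character-level contexts shows here: the mined wrapper picks up Honda as a new candidate while ignoring the unstructured mention of Ford in running text.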
Semi-Supervised Learning Example [Example document containing the seeds {Ford, Nissan, Toyota}]
Unsupervised Learning Example RoadRunner A novel approach to wrapper inference for HTML pages Idea: generating HTML pages from data via scripts is encoding; extracting the data back from the HTML pages is decoding Problem formulation: find the nested type of the source dataset, then extract the source dataset from the HTML pages RoadRunner: Towards Automatic Data Extraction from Large Web Sites @ VLDB 2001
Unsupervised Learning Example Finding the nested type Theoretical background: based on the close correspondence between nested types and union-free regular expressions (UFRE), the problem reduces to finding the least upper bound (LUB) of the pages' UFREs Solution for the LUB of UFREs: ACME (Align, Collapse under Mismatch, and Extract)
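A much-simplified sketch of the alignment idea: align two pages generated by the same template and collapse string mismatches into #PCDATA fields. Real ACME also discovers optionals and repetitions (the full union-free regular expression); this toy version assumes the two pages tokenize to the same length, so only string mismatches occur:

```python
import re

def tokenize(page):
    """Split a page into HTML tags and text strings."""
    return [t for t in re.split(r"(<[^>]+>)", page) if t.strip()]

def align(page_a, page_b):
    """Collapse mismatching strings into #PCDATA; keep matching tokens."""
    template = []
    for a, b in zip(tokenize(page_a), tokenize(page_b)):
        template.append(a if a == b else "#PCDATA")
    return template

a = "<html><b>John</b><i>Databases</i></html>"
b = "<html><b>Anna</b><i>Compilers</i></html>"
print(align(a, b))
# ['<html>', '<b>', '#PCDATA', '</b>', '<i>', '#PCDATA', '</i>', '</html>']
```

The resulting template doubles as a wrapper: the #PCDATA positions are exactly the fields of the source dataset, so matching a new page against it performs the "decoding" step.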
Outline Background Approaches for generating wrappers Examples Manually constructed Machine learning Examples Conclusion
Conclusion WIE will remain important due to the "data flood" on the Internet Current WIE systems are mostly based on machine learning, but are still not perfect New techniques, such as MapReduce, Hadoop, and Spark, promote the development of ML, and may also benefit WIE
Q&A
Reference Information Extraction @ Wikipedia Wrapper (data mining) @ Wikipedia Supervised learning @ Wikipedia Semi-supervised learning @ Wikipedia Unsupervised learning @ Wikipedia A Survey of Web Information Extraction Systems @ TKDE 06 Semi-structured Data: The TSIMMIS Experience @ ADBIS 97 Information Extraction from HTML: Application of a General Machine Learning Approach @ AAAI 98 Character-level Analysis of Semi-Structured Documents for Set Expansion @ EMNLP 09 RoadRunner: Towards Automatic Data Extraction from Large Web Sites @ VLDB 2001