Chapter 9: Structured Data Extraction
Supervised and unsupervised wrapper generation

CS511, Bing Liu, UIC

Introduction
A large amount of information on the Web is contained in regularly structured data objects, often data records retrieved from underlying databases. Such Web data records are important: they include lists of products and services.
Applications: e.g., comparative shopping, meta-search, meta-query.
Two approaches:
- Wrapper induction (supervised learning)
- Automatic extraction (unsupervised learning) - not covered here

Two types of data-rich pages
List pages:
- Each such page contains one or more lists of data records.
- Each list occupies a specific region of the page.
- Two types of data records: flat and nested.
Detail pages:
- Each such page focuses on a single object.
- But it can contain a lot of related and unrelated information.

Example list and detail pages and their extraction results (figures omitted).

Wrapper induction
Using machine learning to generate extraction rules:
- The user marks the target items in a few training pages.
- The system learns extraction rules from these pages.
- The rules are applied to extract items from other pages.
There are many wrapper induction systems, e.g.:
- WIEN (Kushmerick et al., IJCAI-97)
- Softmealy (Hsu and Dung, 1998)
- Stalker (Muslea et al., Agents-99)
- BWI (Freitag and Kushmerick, AAAI-00)
- WL2 (Cohen et al., WWW-02)
We will focus only on Stalker, which also has a commercial version, Fetch.

Stalker: a hierarchical wrapper induction system
Hierarchical wrapper learning:
- Extraction is isolated at different levels of the hierarchy.
- This is suitable for nested data records (embedded lists).
- Each item is extracted independently of the others.
Each target item is extracted using two rules:
- A start rule for detecting the beginning of the target item.
- An end rule for detecting the end of the target item.

Hierarchical representation: type tree (figure omitted)

Data extraction based on the EC tree
- The extraction is done using a tree structure called the EC tree (embedded catalog tree).
- The EC tree is based on the type tree above.
- To extract each target item (a node), the wrapper needs a rule that extracts the item from its parent.
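
The parent-child discipline above can be sketched as a nested structure whose root-to-leaf paths give the sequence of rule applications. This is only an illustrative sketch: the node names below are hypothetical, not taken from the slides' figure.

```python
# Minimal sketch of an EC (embedded catalog) tree as nested dicts.
# Node names are hypothetical; the slides' actual figure is not shown.
ec_tree = {"restaurant_list": {"record": {"name": {}, "area_code": {}}}}

def paths(tree, prefix=()):
    """Enumerate root-to-leaf extraction paths. The wrapper learns one
    start/end rule pair per edge: each node is extracted from the region
    of its parent, never from the whole page directly."""
    out = []
    for name, children in tree.items():
        node = prefix + (name,)
        out.append(node)
        out.extend(paths(children, node))
    return out

print(paths(ec_tree))
# -> [('restaurant_list',), ('restaurant_list', 'record'),
#     ('restaurant_list', 'record', 'name'),
#     ('restaurant_list', 'record', 'area_code')]
```

To extract an area code, the wrapper would thus apply three rules in turn: list from page, record from list, and area code from record.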

Extraction using two rules
- Each extraction is done using two rules: a start rule and an end rule.
- The start rule identifies the beginning of the node and the end rule identifies the end of the node.
- This strategy is applicable to both leaf nodes (which represent data items) and list nodes.
- For a list node, list iteration rules are additionally needed to break the list into individual data records (tuple instances).
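
As a hedged sketch (not Stalker's actual mechanism, which applies start/end rule pairs repeatedly), a list iteration rule can be pictured as locating a record-start landmark over and over to split the list region into tuple instances:

```python
# Illustrative sketch of a list iteration rule: every occurrence of a
# record-start landmark opens a new record, which runs until the next
# occurrence (or the end of the list region). Tokens are plain strings.

def iterate_list(tokens, landmark):
    n = len(landmark)
    starts = [i for i in range(len(tokens) - n + 1)
              if tokens[i:i + n] == landmark]
    bounds = starts + [len(tokens)]
    return [tokens[a + n:b] for a, b in zip(bounds, bounds[1:])]

# Hypothetical flat list of two records marked by <li> tags.
rows = ["<li>", "A.", "Smith", "<li>", "B.", "Jones"]
print(iterate_list(rows, ["<li>"]))  # -> [['A.', 'Smith'], ['B.', 'Jones']]
```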

Rules use landmarks
- The extraction rules are based on the idea of landmarks. Each landmark is a sequence of consecutive tokens.
- Landmarks are used to locate the beginning and the end of a target item.

An example
Let us try to extract the restaurant name "Good Noodles". Rule R1 can identify the beginning:
R1: SkipTo(<b>)  // start rule
This rule means that the system should start from the beginning of the page and skip all the tokens until it sees the first <b> tag; <b> is a landmark. Similarly, to identify the end of the restaurant name, we use:
R2: SkipTo(</b>)  // end rule
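
A minimal sketch of how R1 and R2 delimit the item over a token stream. Two assumptions: the tokenization below is invented for the restaurant example, and the end rule scans forward here for brevity (Stalker actually applies end rules from the end of the parent region).

```python
# SkipTo over a pre-tokenized page (a sketch, not Stalker's implementation).

def skip_to(tokens, landmark, start=0):
    """Index just past the first occurrence of `landmark` (a list of
    consecutive tokens) at or after `start`; None if absent."""
    n = len(landmark)
    for i in range(start, len(tokens) - n + 1):
        if tokens[i:i + n] == landmark:
            return i + n
    return None

# Assumed token stream for the restaurant example.
page = ["<p>", "Restaurant", "Name", ":", "<b>", "Good", "Noodles",
        "</b>", "<br>", "205", "Willow", ",", "Glen"]

begin = skip_to(page, ["<b>"])            # R1: start rule
end = skip_to(page, ["</b>"], begin) - 1  # position of "</b>" itself
name = " ".join(page[begin:end])
print(name)  # -> Good Noodles
```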

Rules are not unique
Note that a rule may not be unique. For example, we can also use the following rules to identify the beginning of the name:
R3: SkipTo(Name _Punctuation_ _HtmlTag_)
or
R4: SkipTo(Name) SkipTo(<b>)
R3 means that we skip everything until the word "Name" followed by a punctuation symbol and then an HTML tag. In this case, "Name _Punctuation_ _HtmlTag_" together is a landmark.
- _Punctuation_ and _HtmlTag_ are wildcards.
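
Wildcard terminals can be sketched as token-class predicates. The classes below (and the tokenization) are illustrative assumptions; Stalker defines a richer wildcard hierarchy.

```python
import string

# Landmark matching with wildcard terminals, sketching R3:
# SkipTo(Name _Punctuation_ _HtmlTag_).

def matches(token, term):
    if term == "_Punctuation_":
        return token in string.punctuation
    if term == "_HtmlTag_":
        return token.startswith("<") and token.endswith(">")
    return token == term  # literal terminal

def skip_to(tokens, landmark, start=0):
    n = len(landmark)
    for i in range(start, len(tokens) - n + 1):
        if all(matches(t, l) for t, l in zip(tokens[i:i + n], landmark)):
            return i + n
    return None

page = ["<p>", "Restaurant", "Name", ":", "<b>", "Good", "Noodles", "</b>"]
begin = skip_to(page, ["Name", "_Punctuation_", "_HtmlTag_"])  # R3
print(page[begin:begin + 2])  # -> ['Good', 'Noodles']
```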

Extract area codes (figure omitted)

Learning extraction rules
Stalker uses sequential covering to learn extraction rules for each target item:
- In each iteration, it learns a perfect rule that covers as many positive examples as possible without covering any negative example.
- Once a positive example is covered by a rule, it is removed.
- The algorithm ends when all the positive examples are covered.
The result is an ordered list of all the learned rules.
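
The covering loop on this slide can be sketched as follows; `learn_disjunct` is a hypothetical stand-in for Stalker's LearnDisjunct(), which would actually grow and refine candidate SkipTo rules.

```python
# Sequential covering skeleton: learn a perfect disjunct, drop the
# positives it covers, repeat; return the ordered list of disjuncts.

def learn_rule(examples, learn_disjunct):
    rules, uncovered = [], list(examples)
    while uncovered:
        rule, covered = learn_disjunct(uncovered)
        rules.append(rule)
        uncovered = [e for e in uncovered if e not in covered]
    return rules  # ordered disjunctive rule

# Toy stand-in: a "rule" that covers every example equal to the seed.
def toy_disjunct(uncovered):
    seed = uncovered[0]
    return f"match:{seed}", [e for e in uncovered if e == seed]

print(learn_rule(["(", "<b>", "(", "<b>"], toy_disjunct))
# -> ['match:(', 'match:<b>']
```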

The top-level algorithm (figure omitted)

Example: extract area codes (figure omitted)

Learn disjuncts (figure omitted)

Example
For example E2 of Fig. 9, the following candidate disjuncts are generated:
D1: SkipTo( ( )
D2: SkipTo(_Punctuation_)
D1 is selected by BestDisjunct. D1 is a perfect disjunct, so the first iteration of LearnRule() ends. E2 and E4 are removed.
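
Applying D1 to the two covered examples can be sketched as below; the token streams for E2 and E4 are reconstructions in the spirit of the slides' Fig. 9, not the exact figure.

```python
# D1: SkipTo( ( ) -- skip to just past the "(" and read the area code.

def skip_to(tokens, landmark, start=0):
    n = len(landmark)
    for i in range(start, len(tokens) - n + 1):
        if tokens[i:i + n] == landmark:
            return i + n
    return None

E2 = ["90", "Colfax", ",", "<b>", "Palms", "</b>", ",",
      "Phone", "(", "800", ")", "508-1570"]
E4 = ["403", "Vernon", ",", "<b>", "Watts", "</b>", ",",
      "Phone", ":", "(", "310", ")", "798-0008"]

codes = [ex[skip_to(ex, ["("])] for ex in (E2, E4)]
print(codes)  # -> ['800', '310']
```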

The next iteration of LearnRule
The next iteration of LearnRule() is left with E1 and E3. LearnDisjunct() selects E1 as the seed. Two candidates are then generated:
D3: SkipTo(<b>)
D4: SkipTo(_HtmlTag_)
Both candidates match too early in the uncovered examples E1 and E3, so they cannot uniquely locate the positive items. Refinement is needed.

Refinement
Refinement specializes a disjunct by adding more terminals to it. A terminal is a token or one of its matching wildcards. We hope that the refined version will uniquely identify the positive items in some examples without matching any negative item in any example in E. There are two types of refinement:
- Landmark refinement
- Topology refinement

Landmark refinement
Landmark refinement increases the size of a landmark by concatenating a terminal, e.g.:
D5: SkipTo(- <b>)
D6: SkipTo(_Punctuation_ <b>)
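
Candidate generation for landmark refinement can be sketched as prepending a terminal: the token seen just before the landmark in the seed example, or one of its matching wildcards. The wildcard classes here are simplifying assumptions.

```python
import string

def wildcards(token):
    kinds = []
    if token in string.punctuation:
        kinds.append("_Punctuation_")
    if token.startswith("<") and token.endswith(">"):
        kinds.append("_HtmlTag_")
    return kinds

def landmark_refinements(landmark, prev_token):
    """Grow `landmark` by concatenating `prev_token` or a wildcard of it."""
    return [[t] + landmark for t in [prev_token] + wildcards(prev_token)]

# Refining D3: SkipTo(<b>) with the "-" that precedes <b> in E1 yields
# D5: SkipTo(- <b>) and D6: SkipTo(_Punctuation_ <b>), as on the slide.
print(landmark_refinements(["<b>"], "-"))
# -> [['-', '<b>'], ['_Punctuation_', '<b>']]
```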

Topology refinement
Topology refinement increases the number of landmarks by adding 1-terminal landmarks, i.e., a token t and its matching wildcards.
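
Topology refinement can be sketched in the same style: instead of growing an existing landmark, it prepends a new one-terminal landmark to the disjunct's landmark sequence. The wildcard classes are again simplifying assumptions.

```python
import string

def wildcards(token):
    kinds = []
    if token in string.punctuation:
        kinds.append("_Punctuation_")
    if token.startswith("<") and token.endswith(">"):
        kinds.append("_HtmlTag_")
    return kinds

def topology_refinements(landmarks, t):
    """`landmarks` is a disjunct's landmark sequence, e.g. [['<b>']] for
    SkipTo(<b>). Each candidate adds the 1-terminal landmark [t] (or a
    wildcard of t) in front of it."""
    return [[[term]] + landmarks for term in [t] + wildcards(t)]

# Refining SkipTo(<b>) with t = "Phone" gives SkipTo(Phone) SkipTo(<b>).
print(topology_refinements([["<b>"]], "Phone"))
# -> [[['Phone'], ['<b>']]]
```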

Refining, specializing (figure omitted)

The final solution
We can see that D5, D10, D12, D13, D14, D15, D18 and D21 match correctly on E1 and E3 and fail to match on E2 and E4. Using BestDisjunct in Fig. 13, D5 is selected as the final solution because it has the longest last landmark (- <b>). D5 is then returned by LearnDisjunct(). Since all the examples are now covered, LearnRule() returns the disjunctive (start) rule R7, i.e., either D1 or D5:
R7: either SkipTo( ( ) or SkipTo(- <b>)
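
The disjunctive start rule R7 tries its disjuncts in order. A sketch on reconstructed versions of the four examples (the token streams are assumptions, not the slides' exact figure):

```python
# R7: either SkipTo( ( ) or SkipTo(- <b>): try D1 first, then D5.

def skip_to(tokens, landmark, start=0):
    n = len(landmark)
    for i in range(start, len(tokens) - n + 1):
        if tokens[i:i + n] == landmark:
            return i + n
    return None

def r7(tokens):
    i = skip_to(tokens, ["("])             # D1
    if i is None:
        i = skip_to(tokens, ["-", "<b>"])  # D5
    return tokens[i]

E1 = ["513", "Pico", ",", "<b>", "Venice", "</b>", ",",
      "Phone", "1", "-", "<b>", "800", "</b>", "-", "555-1515"]
E2 = ["90", "Colfax", ",", "<b>", "Palms", "</b>", ",",
      "Phone", "(", "800", ")", "508-1570"]
E3 = ["523", "1st", "St.", ",", "<b>", "LA", "</b>", ",",
      "Phone", "1", "-", "<b>", "800", "</b>", "-", "578-2293"]
E4 = ["403", "Vernon", ",", "<b>", "Watts", "</b>", ",",
      "Phone", ":", "(", "310", ")", "798-0008"]

codes = [r7(e) for e in (E1, E2, E3, E4)]
print(codes)  # -> ['800', '800', '800', '310']
```

Note the ordering matters: D1 is tried first, and D5 only handles the examples D1 fails on.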

Summary
- The algorithm learns by sequential covering.
- It is based on landmarks.
- The algorithm is by no means the only possible one; there are many variations.
- We only discussed a supervised method. There are many unsupervised methods that use pattern mining to find templates for extraction; see Chapter 9 of the textbook.