Web Data Extraction Based on Partial Tree Alignment


Paper written by Yanhong Zhai and Bing Liu (Department of Computer Science, University of Illinois at Chicago).
Presented by Chuku, Ndubueze and Michael England (University of North Carolina at Charlotte). 9/20/2018

Contents
- Motivation
- Ideas: Web data extraction
- Simple Tree Matching
- MDR-2
- Partial Tree Alignment
- Concepts

Motivation
- Mining data records in web pages is important because these records typically present the essential information of their host pages.
- There is a need to extract the structured data objects found in web pages.
- These objects make it possible to integrate data from multiple web pages and provide value-added services such as comparative shopping, meta-querying and search.
- Existing methods have serious limitations.

Existing data extraction methods I
Machine learning: requires human labeling of examples from each website from which data is to be extracted.
Drawback: labeling is time-consuming because of the large number of sites and pages involved.

Existing data extraction methods II
Automatic Web Discovery: 2 main types
- Wrapper induction: uses a set of extraction rules, learned from manually labeled pages or data records, to extract data from similar pages.
- Automatic discovery methods: recognize patterns or grammars in multiple pages containing similar records.

Existing data extraction methods III
Drawbacks
- They require an initial set of pages with similar data records for training.
- The assumption that detailed pages exist is unrealistic.
- Identifying the many links that point to detailed information pages is not an easy task.

Which of the following is a pre-existing data extraction method?
A. Machine learning
B. Wrapper induction
C. Partial tree alignment
D. A and B
E. All of the above

Answer: D

Ideas: Novel method proposed
The authors propose a two-step strategy to solve these problems.
Step 1: Given a page, the method first segments the page to identify each data record without extracting its data items. Visual information is used to find the data records.

Visual information helps the system in 2 ways:
- It allows the system to identify the gaps that separate data records, so that the records can be segmented correctly.
- The system identifies data records by analyzing HTML tag trees (DOM trees), and visual information is used to build these trees.
Step 2: The data items in the identified records are aligned and extracted using partial tree alignment.

MDR-2
Given a web page, the MDR-2 algorithm works in 3 steps.
Step 1: Visual information is used to build an HTML tag tree of the page.
Step 2: Data regions in the page are mined using the tag tree.

MDR-2 contd.
A data region is an area of the page that contains a list of similar data records. Instead of mining data records directly, Step 2 mines data regions first; data records are then found within the data regions.
Step 3: Data records are identified from each data region.

Concepts: Building an HTML Tag Tree
In a web browser, each HTML element is rendered as a rectangle. A tag tree can be built from the nesting of these rectangles. The details are:
- Find the 4 boundaries of the rectangle of each HTML element by calling the embedded parsing and rendering engine of a browser.

Building an HTML Tag Tree contd.
- Detect the containment relationships among the rectangles.
- Build the tree based on the containment check.
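
The containment check can be illustrated with a minimal Python sketch. It assumes the 4 boundaries of every element have already been obtained from a rendering engine; the `Box` class and the `build_tag_tree` function are hypothetical names used only for illustration, and the code is not the authors' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Box:
    """One rendered HTML element: its tag and the 4 boundaries of its rectangle."""
    tag: str
    left: int
    top: int
    right: int
    bottom: int
    children: list = field(default_factory=list)


def area(box):
    return (box.right - box.left) * (box.bottom - box.top)


def contains(outer, inner):
    """True if the rectangle of `inner` lies inside the rectangle of `outer`."""
    return (outer.left <= inner.left and outer.top <= inner.top
            and outer.right >= inner.right and outer.bottom >= inner.bottom)


def build_tag_tree(boxes):
    """Nest each box under the smallest already-placed box that contains it."""
    # Largest first, so every possible parent is placed before its children.
    ordered = sorted(boxes, key=area, reverse=True)
    root = ordered[0]
    placed = [root]
    for box in ordered[1:]:
        # `placed` is in descending area order, so the last container found
        # when scanning backwards is the tightest one. Boxes that nothing
        # contains fall back to the root (an assumption of this sketch).
        parent = next((p for p in reversed(placed) if contains(p, box)), root)
        parent.children.append(box)
        placed.append(box)
    # Restore visual order among siblings: top to bottom, then left to right.
    for box in placed:
        box.children.sort(key=lambda b: (b.top, b.left))
    return root
```

For example, a table rectangle that contains two row rectangles, one of which contains two cell rectangles, yields a tree with the table at the root, the rows as its children and the cells under the first row.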

Figure: An HTML code segment with its boundary coordinates, and the tag tree for that code.

How many boundaries does the rectangle of each element in an HTML tag tree have?
A. 1
B. 2
C. 3
D. 4

Answer: D

Mining Data Regions
In this step, data regions are mined first; the tag strings of individual nodes, and of combinations of multiple adjacent nodes, in each region are then compared.
Figure: An illustration of generalized nodes and data regions.

Mining Data Regions contd.
To eliminate false node combinations, a visual observation about data records is used: the gap between 2 data records in a data region should be larger than any gap within a data record.
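
The gap observation can be turned into a simple check. The sketch below is an assumption-laden illustration, not the test used by MDR-2: it supposes the nodes of a region are stacked vertically, that each node is given as a `(top, bottom)` pair in page order, and that a candidate segmentation groups every `record_size` consecutive nodes into one record. The function names are hypothetical.

```python
def vertical_gaps(nodes):
    """Vertical gap between each pair of consecutive (top, bottom) nodes."""
    return [nxt[0] - cur[1] for cur, nxt in zip(nodes, nodes[1:])]


def is_plausible_segmentation(nodes, record_size):
    """Accept a grouping only if every gap between records exceeds every gap within a record."""
    gaps = vertical_gaps(nodes)
    # Gap number i sits between node i and node i + 1, so it separates
    # two records exactly when (i + 1) is a multiple of record_size.
    between = [g for i, g in enumerate(gaps) if (i + 1) % record_size == 0]
    within = [g for i, g in enumerate(gaps) if (i + 1) % record_size != 0]
    if not between or not within:
        return True  # nothing to compare, so the grouping cannot be rejected
    return min(between) > max(within)
```

With six nodes whose gaps alternate between 2 and 18 pixels, grouping the nodes in pairs is accepted, while grouping them in threes is rejected as a false combination.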

Identifying Data Records
After the data regions are identified, data records are identified from the generalized nodes.
Note: A generalized node does not necessarily represent a single data record.

Non-contiguous Data Records: 2 cases
Case 1 (figure): A multiple-record data region in which each generalized node contains more than one non-contiguous data record.

Non-contiguous Data Records, case 2 (figure): Adjacent data regions form more than one non-contiguous data record.

Data Extraction
The key task is to match corresponding data items or fields across all data records. There are 2 sub-steps:
- Build one rooted tag tree for each data record: after all data records are identified, the sub-trees of each data record are rearranged into a single tree.

Data Extraction contd.
- Partial tree alignment: the tag trees of all data records in each data region are aligned using the partial tree alignment method, which is based on tree matching.
Note: In the matching process, only tags are used (not data items).

Data Extraction contd.
Tree edit distance: the edit distance between 2 trees, A and B, is the cost of the minimum-cost set of operations needed to transform A into B. There are 3 operations: node removal, node insertion and node replacement. A cost is assigned to each operation.

Data Extraction contd.
Solving the tree edit distance problem is often assisted by finding a minimum-cost mapping between the two trees. A mapping is defined as follows. Let X be a tree and X[i] be the i-th node of X. A mapping M between tree A of size n1 and tree B of size n2 is a set of ordered pairs (i, j), one node from each tree, that satisfies the following conditions for all (i1, j1), (i2, j2) ∈ M:

1. i1 = i2 iff j1 = j2.
2. A[i1] is on the left of A[i2] iff B[j1] is on the left of B[j2].
3. A[i1] is an ancestor of A[i2] iff B[j1] is an ancestor of B[j2].
Figure: A general tree mapping.
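
The three mapping conditions can be checked mechanically. The sketch below rests on assumptions that are not in the slides: a tree is a nested `(tag, [children])` tuple, nodes are indexed in preorder starting at 0, and "on the left of" means "comes earlier and is not an ancestor". All function names are hypothetical.

```python
def number_nodes(tree):
    """Map each preorder index to its (preorder, postorder) numbers."""
    numbers, counters = {}, {"pre": 0, "post": 0}

    def visit(node):
        pre = counters["pre"]
        counters["pre"] += 1
        for child in node[1]:
            visit(child)
        numbers[pre] = (pre, counters["post"])
        counters["post"] += 1

    visit(tree)
    return numbers


def is_ancestor(numbers, x, y):
    # An ancestor starts earlier in preorder and finishes later in postorder.
    return numbers[x][0] < numbers[y][0] and numbers[x][1] > numbers[y][1]


def is_left_of(numbers, x, y):
    # A node to the left comes earlier in both preorder and postorder.
    return numbers[x][0] < numbers[y][0] and numbers[x][1] < numbers[y][1]


def is_valid_mapping(tree_a, tree_b, pairs):
    """Check the one-to-one, left-to-right order and ancestor conditions."""
    na, nb = number_nodes(tree_a), number_nodes(tree_b)
    for i1, j1 in pairs:
        for i2, j2 in pairs:
            if (i1 == i2) != (j1 == j2):
                return False  # condition 1: one-to-one
            if is_left_of(na, i1, i2) != is_left_of(nb, j1, j2):
                return False  # condition 2: sibling order preserved
            if is_ancestor(na, i1, i2) != is_ancestor(nb, j1, j2):
                return False  # condition 3: ancestor relation preserved
    return True
```

For two copies of a root with two leaf children, the identity pairs form a valid mapping, while swapping the two leaves violates condition 2 and mapping the root to a leaf violates condition 3.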

Which among these is used in the tree matching process discussed in the paper?
A. Data items
B. Data regions
C. Tags
D. Blocks
E. None of the above

Answer: C

Simple Tree Matching (STM)
STM evaluates the similarity of two trees by computing a maximum matching between them through dynamic programming, with complexity O(n1·n2), where n1 and n2 are the sizes of the two trees. No node replacement and no level crossing are allowed.

Which three of these are operations used in calculating the tree edit distance between 2 trees?
A. Node removal
B. Node sorting
C. Node modification
D. Node insertion
E. Node replacement

Answer: A, D, E

STM contd.
Let A and B be 2 trees, and let i ∈ A and j ∈ B be nodes of A and B respectively. A matching between A and B is a mapping M such that, for every pair (i, j) ∈ M where i and j are non-root nodes, (parent(i), parent(j)) ∈ M. A maximum matching is a matching with the maximum number of pairs.

Figure: An example of the Simple Tree Matching algorithm.
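
The dynamic programme behind Simple Tree Matching can be sketched in a few lines of Python. The sketch assumes a tree is a nested `(tag, [children])` tuple and returns only the size of a maximum matching; recovering the matched pairs themselves would need extra bookkeeping. The function name is a hypothetical choice.

```python
def simple_tree_matching(a, b):
    """Number of pairs in a maximum matching between trees a and b."""
    # Roots with different tags cannot be matched, and no level crossing
    # is allowed, so nothing below them can be matched either.
    if a[0] != b[0]:
        return 0
    kids_a, kids_b = a[1], b[1]
    # m[i][j] = best matching between the first i subtrees of a
    #           and the first j subtrees of b, keeping sibling order.
    m = [[0] * (len(kids_b) + 1) for _ in range(len(kids_a) + 1)]
    for i in range(1, len(kids_a) + 1):
        for j in range(1, len(kids_b) + 1):
            m[i][j] = max(
                m[i][j - 1],
                m[i - 1][j],
                m[i - 1][j - 1] + simple_tree_matching(kids_a[i - 1], kids_b[j - 1]),
            )
    # Add 1 for the matched pair of roots.
    return m[len(kids_a)][len(kids_b)] + 1
```

Matching a tree against itself returns its number of nodes, and two trees whose roots carry different tags score 0. A common way to turn the score into a similarity is to normalise it by the sizes of the two trees; that normalisation is not shown in these slides.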

Multiple Alignment
Since each data region in a page contains multiple data records, multiple tag trees must be aligned in order to produce a single database table in which all corresponding data fields are in the same column. In this table, each row represents a data record and each column represents a data field of the records.

Partial Tree Alignment
Partial tree alignment aligns multiple tag trees by progressively growing a seed (tag) tree. The seed tree, Ts, is initially the tree with the maximum number of data fields. Then, for each Ti (i ≠ s), the algorithm tries to find, for each node in Ti, a matching node in Ts. When a match is found for a node ni, a link is created from ni to its matching node ns in the seed tree.

Partial Tree Alignment contd.
If no match is found for node ni, the algorithm attempts to expand the seed tree by inserting ni into Ts. The expanded seed tree is then used in subsequent matching.

Figure: Partial alignment of two trees.

Algorithm PartialTreeAlignment(S)
1.  Sort trees in S in descending order according to the number of data items that are not aligned;
2.  Ts = the first tree (which is the largest) and delete it from S;
3.  flag = false; R = ∅; I = false;
4.  while (S ≠ ∅)
5.      Ti = select and delete next tree from S;
6.      Simple_Tree_Matching(Ts, Ti);
7.      L = alignTrees(Ts, Ti); // based on the result from line 6
8.      if Ti is not completely aligned with Ts then
9.          I = InsertIntoSeed(Ts, Ti);
10.         if not all unaligned items in Ti are inserted into Ts then
11.             Insert Ti into R;
12.         endif;
13.     endif;
14.     if (L has new alignment) or (I is true) then
15.         flag = true
16.     endif;
17.     if S = ∅ and flag = true then
18.         S = R; R = ∅;
19.         flag = false; I = false
20.     endif;
21. endwhile;
22. Output data fields from each Ti to the data table based on the alignment results.
Figure: The partial tree alignment algorithm.
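
The seed-growing loop can be illustrated with a deliberately simplified Python sketch. It is not the authors' algorithm: each record is flattened to a list of tags (a tree of depth one), a longest common subsequence stands in for Simple Tree Matching, and unmatched tags are inserted into the seed only when their two matched neighbours are adjacent in the seed, so that the position is unambiguous. Records that cannot be fully inserted are deferred and retried once the seed has grown. All names are hypothetical.

```python
def lcs_pairs(seed, other):
    """Index pairs (seed_pos, other_pos) of a longest common subsequence."""
    n, m = len(seed), len(other)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if seed[i] == other[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if seed[i] == other[j]:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def insert_into_seed(seed, other):
    """Grow `seed` in place; return (inserted_anything, fully_aligned)."""
    # Virtual matches before the first and after the last position.
    bounds = [(-1, -1)] + lcs_pairs(seed, other) + [(len(seed), len(other))]
    inserted, complete = False, True
    # Work right to left so earlier seed positions stay valid after inserts.
    for (si, oi), (sj, oj) in reversed(list(zip(bounds, bounds[1:]))):
        run = other[oi + 1:oj]  # unmatched tags between two matches
        if not run:
            continue
        if sj == si + 1:  # neighbours adjacent in the seed: unique position
            seed[sj:sj] = run
            inserted = True
        else:  # ambiguous position: leave for a later pass
            complete = False
    return inserted, complete


def partial_alignment(records):
    """Return the grown seed for a list of tag lists."""
    pending = sorted(records, key=len, reverse=True)
    seed = list(pending.pop(0))  # largest record is the initial seed
    deferred, progress = [], False
    while pending:
        record = pending.pop(0)
        inserted, complete = insert_into_seed(seed, record)
        if not complete:
            deferred.append(record)
        progress = progress or inserted
        if not pending and progress:  # retry deferred records on the grown seed
            pending, deferred, progress = deferred, [], False
    return seed


def to_table(seed, records):
    """One row per record, one column per seed position (None where absent)."""
    table = []
    for record in records:
        row = [None] * len(seed)
        for si, oi in lcs_pairs(seed, record):
            row[si] = record[oi]
        table.append(row)
    return table
```

With the records `a b e f`, `a x e` and `b x e`, the second record is deferred at first, because `x` could go before or after `b`. The third record then places `x` between `b` and `e`, and the retry aligns the second record completely.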

The complexity of partial tree alignment is O(k^2) without considering tree matching, where k is the number of trees.
Note: The resulting alignment, T, can also be used as an extraction pattern for extracting data items from other pages generated using the same template.

What are the 3 steps in the MDR-2 algorithm?
A. Build an HTML tag tree using visual information.
B. Mine the data regions in the page using the tag tree.
C. Mine the data records using the tag tree.
D. Identify data records from each data region.
E. Identify data regions based on the data records.

Answer: A, B, D

Conclusion
This paper presented:
- a new approach to extracting structured data from Web pages;
- an enhanced method, based on visual information, for identifying data records without extracting each data field in the records;
- a partial tree alignment technique for aligning the corresponding data fields of multiple data records.
Empirical results on a large number of Web pages show that the new 2-step technique can segment data records and extract data very accurately.

Thank You