Web Data Extraction Based on Partial Tree Alignment

Yanhong Zhai and Bing Liu
Paper written by Yanhong Zhai and Bing Liu (Department of Computer Science, University of Illinois at Chicago); presented by Chuku, Ndubueze and Michael England (University of North Carolina at Charlotte), 9/20/2018

Contents
- Motivation
- Ideas: Web data extraction
- Simple Tree Matching
- MDR-2
- Partial Tree Alignment
- Concepts

Motivation
Mining data records in web pages is important because they typically present the essential information of their host pages. There is a need to extract these structured data objects from web pages. Extracted objects make it possible to integrate data from multiple web pages and provide value-added services such as comparative shopping, meta-querying and search. Existing methods have serious limitations.

Existing data extraction methods I
Machine Learning: requires human labeling of examples from each website from which data is to be extracted.
Drawback: This is time-consuming because of the large number of sites and pages involved.

Existing data extraction methods II
Automatic Web Discovery has 2 main types:
- Wrapper Induction: the use of a set of extraction rules, learned from a set of manually labeled pages or data records, to extract data from similar pages.
- Automatic Discovery Methods: recognizing patterns or grammars from multiple pages containing similar records.

Existing data extraction methods III
Drawbacks
- They require an initial set of pages with similar data records for training.
- The assumption that detailed pages exist is unrealistic.
- Identifying the many links that point to detailed information pages is not an easy task.

Which of the following is a pre-existing data extraction method?
A. Machine learning
B. Wrapper Induction
C. Partial Tree Alignment
D. A and B
E. All of the above

Answer: D

Ideas: Novel method proposed
The authors propose a two-step strategy to solve these problems.
Step 1: Given a page, the method first segments the page to identify each data record without extracting its data items. Visual information is used to find the data records.

Ideas: Novel method proposed contd.
Visual information helps the system in 2 ways:
- It allows the system to identify the gaps that separate data records, so that the records can be segmented correctly.
- It supports identifying data records by analyzing HTML tag trees (DOM trees), which are built from the visual information.
Step 2: The partial tree alignment technique is used to align and extract the corresponding data fields of the identified data records.

MDR-2
Given a web page, the MDR-2 algorithm works in 3 steps.
Step 1: Visual information is used to build an HTML tag tree of the page.
Step 2: Data regions in the page are mined using the tag tree.

MDR-2 contd.
A data region is an area of the page that contains a list of similar data records. In Step 2, instead of mining data records directly, data regions are mined first and the data records are then found within them.
Step 3: Data records are identified from each data region.

Concepts Building an HTML Tag Tree
In a web browser, each HTML element is rendered as a rectangle. A tag tree can be built from the nesting of these rectangles. The details are as follows. Find the 4 boundaries of the rectangle of each HTML element by calling the embedded parsing and rendering engine of a browser.

Building an HTML Tag Tree contd
Detect the containment relationship among the rectangles. A tree can then be built based on the containment check.
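
The containment check can be illustrated with a minimal sketch. It assumes the rectangles are already available in document order (in practice they would come from a browser's rendering engine, which is not modeled here). The `Box` class, the `build_tag_tree` function and the coordinates are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Box:
    """Rendered rectangle of one HTML element (hypothetical structure)."""
    tag: str
    left: int
    top: int
    right: int
    bottom: int
    children: list = field(default_factory=list)

    def contains(self, other):
        # True when `other` lies fully inside this rectangle.
        return (self.left <= other.left and self.top <= other.top
                and self.right >= other.right and self.bottom >= other.bottom)


def build_tag_tree(boxes):
    """Build a tree from boxes given in document order.

    Assumption: the first box is the root and encloses all the others.
    """
    root = boxes[0]
    stack = [root]  # chain of ancestors that may still enclose later boxes
    for box in boxes[1:]:
        # Drop ancestors that do not enclose the new box.
        while len(stack) > 1 and not stack[-1].contains(box):
            stack.pop()
        stack[-1].children.append(box)
        stack.append(box)
    return root


# Made-up coordinates for a 2x2 table.
boxes = [
    Box("table", 0, 0, 400, 200),
    Box("tr", 0, 0, 400, 100),
    Box("td", 0, 0, 200, 100),
    Box("td", 200, 0, 400, 100),
    Box("tr", 0, 100, 400, 200),
    Box("td", 0, 100, 200, 200),
    Box("td", 200, 100, 400, 200),
]
root = build_tag_tree(boxes)
print([child.tag for child in root.children])
```

Because the tree is derived from rendered positions rather than from the nesting of tags in the source, it does not depend on the HTML being well formed.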

Figure: HTML code segment and boundary coordinates
Figure: Tag tree for the HTML code

How many boundaries does each element in an HTML tag tree have?
A. 1
B. 2
C. 3
D. 4

Answer: D (the 4 boundaries of the element's rectangle)

Mining Data Regions
In this step, data regions are mined first, rather than data records directly. To find them, the tag strings of individual nodes and of combinations of multiple adjacent nodes are compared.
Figure: An illustration of generalized nodes & data regions

Mining Data Regions contd.
To eliminate false node combinations, a visual observation about data records is used: the gap between 2 data records in a data region should be larger than any gap within a data record.
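
The gap observation can be expressed as a simple check on a candidate segmentation. The sketch below is an assumption-laden illustration: each node is reduced to a hypothetical `(top, bottom)` pair of vertical coordinates, records are stacked vertically, and the function name `gaps_are_consistent` is invented here.

```python
def gaps_are_consistent(records):
    """Check a candidate segmentation against the gap observation.

    `records` is a list of records; each record is a list of (top, bottom)
    vertical extents of its nodes, ordered from top to bottom.
    """
    # Gaps between consecutive nodes inside the same record.
    within = [nxt[0] - cur[1]
              for rec in records
              for cur, nxt in zip(rec, rec[1:])]
    # Gaps between the last node of one record and the first of the next.
    between = [nxt[0][0] - cur[-1][1]
               for cur, nxt in zip(records, records[1:])]
    if not within or not between:
        return True  # nothing to compare, so nothing contradicts the rule
    return min(between) > max(within)


# Four nodes with made-up coordinates: small gaps inside a record,
# a large gap between the second and third node.
a, b, c, d = (0, 40), (42, 80), (100, 140), (142, 180)

print(gaps_are_consistent([[a, b], [c, d]]))    # plausible segmentation
print(gaps_are_consistent([[a], [b, c], [d]]))  # false node combination
```

A segmentation that groups `b` and `c` together is rejected because the gap inside that record is larger than the gaps separating records.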

Identifying Data Records
After the data regions are identified, data records are identified from the generalized nodes. Note: A generalized node does not necessarily represent a single data record.

2 Cases
Non-contiguous Data Records case 1:
Figure: A multiple-record data region: each generalized node contains more than one non-contiguous data record

Non-contiguous Data Records case 2:
Adjacent data regions form more than one non-contiguous data record

Data Extraction
The key task is to match corresponding data items or fields from all data records. There are 2 sub-steps. The first is to get one rooted tag tree for each data record: after all data records are identified, the sub-trees of each data record are rearranged into a single tree.

Data Extraction contd.
Partial tree alignment technique (the second sub-step) - the tag trees of all data records in each data region are aligned using the partial tree alignment method, which is based on tree matching. Note: In the matching process, only tags are used (not data items).

Data Extraction contd.
Tree Edit Distance - the tree edit distance between 2 trees, A and B, is the cost associated with the minimum set of operations needed to transform A into B. It involves 3 operations: node removal, node insertion and node replacement. A cost is assigned to each operation.

Data Extraction contd.
Solving the tree edit distance problem is often assisted by finding a minimum-cost mapping between two trees. A mapping is defined as follows. Let X be a tree and X[i] be the i-th node of X. A mapping M between tree A of size n1 and tree B of size n2 is a set of ordered pairs (i, j), one node from each tree, that satisfies the following conditions for all (i1, j1), (i2, j2) ∈ M:
1. i1 = i2 iff j1 = j2.
2. A[i1] is on the left of A[i2] iff B[j1] is on the left of B[j2].
3. A[i1] is an ancestor of A[i2] iff B[j1] is an ancestor of B[j2].
General tree mapping algorithm

Which among these is used in the tree matching process discussed in the paper?
A. Data items
B. Data regions
C. Tags
D. Blocks
E. None of the above

Answer: C

Simple Tree Matching (STM)
STM evaluates the similarity of two trees by producing the maximum matching through dynamic programming, with complexity O(n1n2), where n1 and n2 are the sizes of the two trees. Neither node replacement nor level crossing is allowed.

Which three of these are operations used in calculating the tree edit distance between 2 trees?
A. Node removal
B. Node sorting
C. Node modification
D. Node insertion
E. Node replacement

Answer: A, D, E

STM contd.
Let A and B be 2 trees, and let i ∈ A and j ∈ B be nodes in A and B respectively. A matching between A and B is defined to be a mapping M such that, for every pair (i, j) ∈ M where i and j are non-root nodes, (parent(i), parent(j)) ∈ M. A maximum matching is a matching with the maximum number of pairs.
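
The definition above leads to the usual recursive dynamic program for Simple Tree Matching. The sketch below follows the commonly published formulation; the `Node` class and the example trees are hypothetical and only tags are compared, in line with the note that data items are not used in matching.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)


def simple_tree_matching(a, b):
    """Return the number of pairs in a maximum matching of trees a and b."""
    if a.tag != b.tag:
        return 0  # roots differ, so no pair below them can be matched
    m, n = len(a.children), len(b.children)
    # M[i][j]: best matching between the first i subtrees of a
    # and the first j subtrees of b (sibling order is preserved).
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            M[i][j] = max(M[i][j - 1], M[i - 1][j], M[i - 1][j - 1] + w)
    return M[m][n] + 1  # +1 for the matched roots


# Two made-up record trees that differ by one extra cell.
A = Node("tr", [Node("td", [Node("b")]), Node("td", [Node("a")])])
B = Node("tr", [Node("td", [Node("b")]),
                Node("td", [Node("a")]),
                Node("td", [Node("a")])])

print(simple_tree_matching(A, B))  # 5: tr, two td pairs, b and a
```

Children are only compared after their parents have matched, which is what rules out level crossing and node replacement.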

Figure: Example for the Simple Tree Matching algorithm

Multiple Alignment
Since each data region in a page contains multiple data records, multiple tag trees must be aligned in order to produce a single database table in which all corresponding data fields are in the same column. In this table, each row represents a data record and each column represents a data field.

Partial Tree Alignment
Partial tree alignment aligns multiple tag trees by progressively growing a seed tag tree. The seed tree, Ts, is initially picked to be the tree with the maximum number of data fields. Then, for each Ti (i ≠ s), the algorithm tries to find, for each node ni in Ti, a matching node ns in Ts. When a match is found, a link is created from ni to ns to record the match in the seed tree.

Partial Tree Alignment contd.
If no match is found for node ni, the algorithm attempts to expand the seed tree by inserting ni into Ts. The expanded seed tree is then used in subsequent matching.

Figure: Partial alignment of two trees

Algorithm PartialTreeAlignment(S)
1.  Sort trees in S in descending order according to the number of data items that are not aligned;
2.  Ts = the first tree (which is the largest) and delete it from S;
3.  flag = false; R = ∅; I = false;
4.  while (S ≠ ∅)
5.      Ti = select and delete next tree from S;
6.      Simple_Tree_Matching(Ts, Ti);
7.      L = alignTrees(Ts, Ti); // based on the result from line 6
8.      if Ti is not completely aligned with Ts then
9.          I = InsertIntoSeed(Ts, Ti);
10.         if not all unaligned items in Ti are inserted into Ts then
11.             Insert Ti into R;
12.         endif;
13.     endif;
14.     if (L has new alignment) or (I is true) then
15.         flag = true
16.     endif;
17.     if S = ∅ and flag = true then
18.         S = R; R = ∅;
19.         flag = false; I = false
20.     endif;
21. endwhile;
22. Output data fields from each Ti to the data table based on the alignment results.
Figure: The partial tree alignment algorithm.
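
The control flow of the algorithm above can be sketched on a deliberately simplified model. In this sketch each tree is flattened to a list of child tags under a common root, a longest-common-subsequence alignment stands in for Simple_Tree_Matching, and an unmatched run of nodes is inserted into the seed only when its matched neighbours pin down a single position. These simplifications and all function names are assumptions made for illustration, not the authors' implementation.

```python
def lcs_pairs(seed, rec):
    """Matched index pairs (i in seed, j in rec) of a longest common subsequence."""
    m, n = len(seed), len(rec)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if seed[i] == rec[j]:
                L[i][j] = 1 + L[i + 1][j + 1]
            else:
                L[i][j] = max(L[i + 1][j], L[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < m and j < n:
        if seed[i] == rec[j]:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif L[i + 1][j] >= L[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def insert_into_seed(seed, rec):
    """Return (new_seed, grew, complete) after trying to place unmatched runs."""
    rec_to_seed = {j: i for i, j in lcs_pairs(seed, rec)}
    inserts, complete = [], True
    j, n = 0, len(rec)
    while j < n:
        if j in rec_to_seed:
            j += 1
            continue
        start = j
        while j < n and j not in rec_to_seed:
            j += 1
        left = rec_to_seed.get(start - 1)  # None if the run starts the record
        right = rec_to_seed.get(j)         # None if the run ends the record
        if left is None and right == 0:
            pos = 0                        # run goes before the seed's first node
        elif right is None and left == len(seed) - 1:
            pos = len(seed)                # run goes after the seed's last node
        elif left is not None and right == left + 1:
            pos = right                    # run fits between two adjacent seed nodes
        else:
            complete = False               # position is ambiguous, defer the record
            continue
        inserts.append((pos, rec[start:j]))
    new_seed = list(seed)
    for pos, items in sorted(inserts, key=lambda t: t[0], reverse=True):
        new_seed[pos:pos] = items
    return new_seed, bool(inserts), complete


def partial_alignment(records):
    S = sorted(records, key=len, reverse=True)
    seed, R, progress = list(S.pop(0)), [], False
    while S:
        rec = S.pop(0)
        seed, grew, complete = insert_into_seed(seed, rec)
        if not complete:
            R.append(rec)                  # retry later against a larger seed
        progress = progress or grew
        if not S and progress:
            S, R, progress = R, [], False
    return seed


records = [["a", "x", "e"], ["a", "b", "c", "e"], ["a", "x", "b"]]
print(partial_alignment(records))  # ['a', 'x', 'b', 'c', 'e']
```

In this run the record `a x e` is deferred at first because `x` could sit anywhere between `a` and `e` in the seed; once `a x b` has fixed the position of `x`, the deferred record aligns completely on the second pass.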

The complexity of partial tree alignment is O(k^2) without considering tree matching, where k is the number of trees. Note: The resulting alignment, Ts, can also be used as an extraction pattern for extracting data items from other pages generated using the same template.

What are the 3 steps in the MDR-2 algorithm?
A. Build an HTML tag tree using visual information.
B. Mine the data regions in the page using the tag tree.
C. Mine the data records using the tag tree.
D. Identify the data records from each data region.
E. Identify the data regions based on the data records.

Answer: A, B, D

Conclusion
In this paper, the following were presented:
- a new approach to extract structured data from Web pages.
- an enhanced method based on visual information for identifying data records without extracting each data field in the data records.
- a partial tree alignment technique to align corresponding data fields of multiple data records.
Empirical results using a large number of Web pages show that the new 2-step technique can segment data records and extract data very accurately.

Thank You
