
Analysis of DOM Structures for Site-Level Template Extraction (PSI 2015). Joint work done in collaboration with Julián Alarte, Josep Silva, Salvador Tamarit.


1 Analysis of DOM Structures for Site-Level Template Extraction (PSI 2015). Joint work done in collaboration with Julián Alarte, Josep Silva, Salvador Tamarit

2 Contents: Motivation; Content Extraction and Block Detection; Template Extraction; A Technique for Template Extraction; State of the art; The DOM tree; Template extraction based on DOM; Experiments; Firefox plugin online DEMO; Conclusions and Future Work

3 Motivation: Information Retrieval; Web Mining; Template Detection; Content Extraction; Block Detection

4 Motivation
What is content extraction? Content extraction is the process of determining which parts of a webpage contain the main textual content, ignoring additional context such as menus, status bars, advertisements, sponsored information, etc.
What is block detection? Block detection is the discipline that tries to isolate every information block in a webpage.

7 The date is different. The title is different.

8 Motivation
Why is template extraction useful?
Component reuse: web developers can automatically extract components from a webpage.
Enhancing indexers and text analyzers, which perform better when they process only relevant information. It has been measured that roughly 40-50% of the components of a webpage represent the template.
Extraction of the main content of a webpage so that it can be suitably displayed on a small device such as a PDA or a mobile phone.
Extraction of the relevant content to make the webpage more accessible for visually impaired or blind users.

10 The Technique: What is a webpage?

11 The Technique: State of the Art
Three main ways to solve the problem:
Using the textual information of the webpage (i.e., the HTML code). Densitometric features: counting characters and tags. Statistics on terms: some terms are common in templates.
Using the rendered image of the webpage in the browser.
Using the DOM tree of the webpage.
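As a rough illustration of what a densitometric feature can look like, the sketch below counts tags and visible text characters in an HTML fragment and reports characters per tag. The class `DensityCounter`, the function `text_density` and the two sample fragments are hypothetical names and data chosen for this example; they are not taken from any of the surveyed approaches.

```python
from html.parser import HTMLParser


class DensityCounter(HTMLParser):
    """Counts opening tags and visible text characters in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.tags = 0
        self.chars = 0

    def handle_starttag(self, tag, attrs):
        self.tags += 1

    def handle_data(self, data):
        # Ignore whitespace that only serves as indentation.
        self.chars += len(data.strip())


def text_density(html):
    """Return visible characters per tag (assumed feature, for illustration)."""
    counter = DensityCounter()
    counter.feed(html)
    counter.close()
    return counter.chars / max(counter.tags, 1)


# Hypothetical fragments: a navigation menu and a paragraph of main content.
menu = '<ul><li><a href="/">Home</a></li><li><a href="/news">News</a></li></ul>'
article = "<p>Template extraction removes the parts shared across a website.</p>"

print(text_density(menu))
print(text_density(article))
```

Menu-like fragments tend to have many tags and little text, so their density is low, while running text scores much higher. Real systems combine several such features; a single ratio is only a starting point.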

14 The Technique: State of the Art
Three main ways to solve the problem:
Using the textual information of the webpage (i.e., the HTML code).
Using the rendered image of the webpage in the browser. Position of elements: lateral menus, main content centered and visible. Less studied: rendering webpages is computationally expensive.
Using the DOM tree of the webpage.

15 The Technique: State of the Art
Three main ways to solve the problem:
Using the textual information of the webpage (i.e., the HTML code).
Using the rendered image of the webpage in the browser.
Using the DOM tree of the webpage. Analysis of the DOM structure: difficulty in analysing DIV-based structures. Comparing several webpages: search for common structures.

16 The Technique: Limitations of Current Approaches
Some are based on the assumption that the webpage has a particular structure (e.g., based on table markup tags).
Some assume that the main content text is continuous.
Some assume that the system knows a priori the format of the webpage.
Some need to (randomly) load many webpages (several dozens) to compare them.

17 The Technique: Limitations of Current Approaches
Some are based on the assumption that the webpage has a particular structure (e.g., based on table markup tags) [10].
Some assume that the main content text is continuous [11].
Some assume that the system knows a priori the format of the webpage [10].
Some assume that the whole website to which the webpage belongs is based on the use of some template that is repeated.
[Figure: example webpage showing a company directory entry]

18 The Technique: Limitations of Current Approaches
The main problem of these approaches is a big loss of generality. They require knowing or parsing the webpages in advance, or they require the webpage to have a particular structure. This is very inconvenient because modern webpages are mainly based on div tags, which need not be hierarchically organized (as in the table-based design). Moreover, many webpages are now generated automatically and dynamically, so it is often impossible to analyze them a priori.

19 The Technique
Other approaches are able to work:
+ Online (i.e., with any webpage)
+ In real-time (i.e., without the need to preprocess the webpages or know their structure)

21 The Technique: The Document Object Model (DOM)
The DOM is an API that provides programmers with a standard set of objects for the representation of HTML and XML documents.
Given a webpage, its associated DOM structure can be produced automatically, and vice versa.
The DOM structure of a given webpage is a tree where all the elements of the webpage (including scripts and CSS styles) are represented hierarchically.
[Figure: example DOM tree with Body, Div, Table, H1, Image and Text nodes]

22 The Technique: The Document Object Model (DOM)
Nodes in the DOM tree can be of two types: tag nodes and text nodes.
Tag nodes represent the HTML tags of an HTML document and contain all the information associated with the tags (e.g., their attributes).
Text nodes are always leaves in the DOM tree because they cannot contain other nodes.
I want to know more! http://www.w3.org/DOM/
[Figure: example DOM tree with Body, Div, Table, H1, Image and Text nodes]
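The two node types can be made concrete with a small sketch that builds a simplified tree from HTML using Python's standard `html.parser` module. The names `Node`, `TreeBuilder`, `parse` and `dump` are hypothetical helpers for this example, and the tree is a simplification, not a full W3C DOM implementation.

```python
from html.parser import HTMLParser


class Node:
    """A simplified DOM node: kind is "tag" or "text"."""

    def __init__(self, kind, label, attrs=None):
        self.kind = kind
        self.label = label              # tag name, or the text itself
        self.attrs = dict(attrs or [])  # only meaningful for tag nodes
        self.children = []              # text nodes never receive children


class TreeBuilder(HTMLParser):
    # Elements that never have a closing tag (partial list, an assumption).
    VOID = {"img", "br", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.root = Node("tag", "#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node("tag", tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in self.VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].label == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].children.append(Node("text", data.strip()))


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def dump(node, depth=0):
    """Return the tree as indented lines, one node per line."""
    lines = ["  " * depth + f"{node.kind}:{node.label}"]
    for child in node.children:
        lines.extend(dump(child, depth + 1))
    return lines


page = ('<body><div><h1>Title</h1><img src="logo.png">'
        "<table><tr><td>Cell</td></tr></table></div></body>")
tree = parse(page)
print("\n".join(dump(tree)))
```

In the printed tree every `text:` line is a leaf, while `tag:` lines may have children and carry their attributes in `attrs`.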

24 The Technique
Our method for template extraction in a nutshell:
1. Identify a set of webpages in the website topology. Select those nodes that belong to the menu. Use a complete subdigraph.
2. Solve conflicts between those webpages that implement different templates. Establish a voting system between the webpages.
3. The template is the intersection between the initial webpage and the DOM trees in the subdigraph. The intersection is computed with an Equal Top-Down Mapping between the DOM trees.
The three steps can be done with a linear cost with respect to the size of the DOM trees.

25 The Technique
1. Identify a set of webpages in the website topology. Select those nodes that belong to the menu. Use a complete subdigraph.
[Figure labels: Menu, Submenu, Domain A, Domain B, Domain C]

26 The Technique
1. Identify a set of webpages in the website topology. Select those nodes that belong to the menu. Use a complete subdigraph.
[Figure label: Hyperlink distance]

27 The Technique
1. Identify a set of webpages in the website topology. Select those nodes that belong to the menu. Use a complete subdigraph.
[Figure labels: Hyperlink distance, DOM distance]
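One possible reading of step 1 is sketched below: starting from the initial page, greedily collect pages that all link to each other, since menu entries usually appear on every page they point to. This greedy procedure, the function name `complete_subdigraph`, and the toy `links` map are assumptions made for illustration; the slides do not give the authors' actual selection algorithm, and a real implementation would also restrict candidates to the same domain.

```python
def complete_subdigraph(links, key_page, max_size=None):
    """Greedily grow a set of pages in which every pair links both ways.

    links maps each URL to the set of URLs it links to (hypothetical input,
    for example produced by a crawler). The result starts with key_page.
    The greedy choice does not guarantee the largest possible set.
    """

    def mutual(a, b):
        return b in links.get(a, set()) and a in links.get(b, set())

    members = [key_page]
    # Sorted order keeps the result deterministic.
    for candidate in sorted(links.get(key_page, set())):
        if candidate == key_page:
            continue
        if max_size is not None and len(members) >= max_size:
            break
        if all(mutual(candidate, member) for member in members):
            members.append(candidate)
    return members


# Toy website: three menu pages plus one news item reachable from some pages.
links = {
    "/": {"/news", "/about", "/contact", "/news/1"},
    "/news": {"/", "/about", "/contact", "/news/1"},
    "/about": {"/", "/news", "/contact"},
    "/contact": {"/", "/news", "/about"},
    "/news/1": {"/", "/news", "/about", "/contact"},
}

print(complete_subdigraph(links, "/"))
```

Here `/news/1` is left out because `/about` and `/contact` do not link back to it, which is the behaviour one would expect from a page that is not part of the menu.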

28 The Technique
2. Solve conflicts between those webpages that implement different templates. Establish a voting system between the webpages.

29 The Technique
Our method for template extraction in a nutshell:
3. The template is the intersection between the initial webpage and the DOM trees in the subdigraph. The intersection is computed with an Equal Top-Down Mapping between the DOM trees.
[Figure: DOM trees of five webpages, P1 to P5, each with Body, Div, Table, H1, Image and Text nodes]

30 The Technique: Mapping
[Figure: two DOM trees with HTML, Body, Div, Table and P nodes]

31 The Technique: Top-Down Mapping
[Figure: two DOM trees with HTML, Body, Div, Table and P nodes]

32 The Technique: Equal Top-Down Mapping
[Figure: two DOM trees with HTML, Body, Div, Table and P nodes]
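The intersection in step 3 can be sketched as follows, under two assumptions that the slides do not spell out: two nodes are treated as equal when their labels match, and children are paired by position. Trees are plain `(label, children)` tuples, and `etdm_intersection` and `extract_template` are hypothetical names for this example. A node is kept only if its parent was kept, which is the top-down condition.

```python
def etdm_intersection(a, b):
    """Return the tree of nodes shared by a and b, or None if the roots differ.

    A node is mapped only when it equals its counterpart and its parent is
    mapped too. Children are paired by position (an assumption).
    """
    label_a, children_a = a
    label_b, children_b = b
    if label_a != label_b:
        return None
    shared = []
    for child_a, child_b in zip(children_a, children_b):
        mapped = etdm_intersection(child_a, child_b)
        if mapped is not None:
            shared.append(mapped)
    return (label_a, shared)


def extract_template(key_page, other_pages):
    """Intersect the key page with every other page in turn."""
    template = key_page
    for page in other_pages:
        if template is None:
            break
        template = etdm_intersection(template, page)
    return template


# Two toy pages that share a menu (ul) and a table, but differ in content.
p1 = ("html", [("body", [
    ("div", [("ul", []), ("h1", [])]),
    ("table", [("p", [])]),
])])
p2 = ("html", [("body", [
    ("div", [("ul", []), ("p", [])]),
    ("table", [("p", []), ("p", [])]),
])])

print(extract_template(p1, [p2]))
```

Each pair of nodes is visited at most once, so the cost is linear in the size of the smaller tree, in line with the linear cost stated for the technique. A fuller equality test would likely compare attributes as well as tag names.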

34 Experiments
Benchmarks: online heterogeneous webpages.
Domains with different layouts and page structures.
Company websites, news articles, forums, etc.
Final evaluation set randomly selected.
We determined the actual template of each webpage by downloading it and manually selecting the template. The DOM tree of the selected elements was then produced and used for comparison in the later evaluation.
The F1 metric is computed as (2*P*R)/(P+R), where P is the precision and R is the recall.
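The F1 formula can be checked with a short sketch. It assumes that the extracted template and the manually selected template are both given as sets of node identifiers; the function name `precision_recall_f1` and the sample sets are hypothetical.

```python
def precision_recall_f1(retrieved, gold):
    """Compute precision P, recall R and F1 = (2*P*R)/(P+R) over node sets."""
    retrieved, gold = set(retrieved), set(gold)
    correct = len(retrieved & gold)
    precision = correct / len(retrieved) if retrieved else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = (2 * precision * recall) / (precision + recall)
    return precision, recall, f1


# Hypothetical node identifiers: 3 of the 4 retrieved nodes are correct,
# and 3 of the 4 gold nodes were found.
gold_nodes = {1, 2, 3, 4}
retrieved_nodes = {2, 3, 4, 5}

print(precision_recall_f1(retrieved_nodes, gold_nodes))  # (0.75, 0.75, 0.75)
```

Because F1 is the harmonic mean of P and R, it stays low whenever either of them is low, so a tool cannot score well by returning the whole page or almost nothing.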

36 Experiments: Gold Standard
Downloading the complete website of each benchmark (company websites, news articles, forums, etc.).
Four different engineers did the following independently: manually exploring the original page and the webpages accessible from it to decide what part of the webpage is the template, and printing the key page on paper and marking the template.
The four engineers met and together decided what the template was.
Each element marked in the printed page was mapped to the DOM tree of the initial page.
All elements in the DOM tree that did not belong to the template were included in an HTML class non-template (i.e., we enriched the HTML code of the key page with a new class). This class was later used by an algorithm that we programmed to evaluate the results obtained by our tool.
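A minimal sketch of how the non-template class could be read back from an annotated key page is shown below. Only the class name non-template comes from the slides; the collector class, the document-order numbering of elements and the sample page are assumptions for illustration, and this is not the evaluation algorithm the authors programmed.

```python
from html.parser import HTMLParser


class GoldTemplateCollector(HTMLParser):
    """Collects the elements that are NOT marked with class "non-template"."""

    def __init__(self):
        super().__init__()
        self.position = 0    # document-order index of the current element
        self.template = []   # (position, tag) pairs that belong to the template

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "non-template" not in classes:
            self.template.append((self.position, tag))
        self.position += 1


# Hypothetical annotated key page: the menu is template, the story is not.
annotated = (
    '<body><div id="menu"><a href="/">Home</a></div>'
    '<div class="main non-template"><p class="non-template">Story</p></div>'
    "</body>"
)

collector = GoldTemplateCollector()
collector.feed(annotated)
collector.close()
print(collector.template)
```

The resulting set of positions can then serve as the gold set when computing precision and recall for the template returned by a tool, provided the tool's output is numbered in the same way.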

38 Conclusions and Future Work
Conclusions: a new technique is proposed for template extraction.
1. It does not make assumptions about the particular structure of webpages.
2. It only needs to process a single webpage (no templates, no other webpages of the same website are needed).
3. No preprocessing stages are needed. The technique can work online.
4. It is fully language independent (it can work with pages written in English, German, etc.).
5. The particular text formatting of the webpage does not influence the performance of the technique.

39 Conclusions and Future Work
Future Work:
1. Consider that a website can implement several templates across its webpages: extend the benchmark suite by labelling all templates, and develop a new technique to detect all templates of a website.
2. Combine template extraction with content extraction: first, apply template extraction to remove the template; second, look for the main content in the remaining webpage.

40 Thank You

