Presentation on theme: "Based on Menu Information"— Presentation transcript:

1 Based on Menu Information
Template Extraction Based on Menu Information Josep Silva Technical University of Valencia Joint work in collaboration with Julián Alarte, David Insa, and Salvador Tamarit (WWV 2013)

2 Contents
- Motivation
  - Content Extraction and Block Detection
  - Template Extraction
- A Technique for Template Extraction
  - State of the art
  - The DOM tree
  - Template extraction based on DOM
  - Experiments
  - Firefox plugin online DEMO
- Conclusions and Future Work

3 Motivation
What is content extraction? Content extraction is the process of determining which parts of a webpage contain the main textual content, ignoring additional context such as menus, status bars, advertisements, sponsored information, etc.
What is block detection? The discipline that tries to isolate every information block in a webpage.

4 Motivation

5 Motivation

6 Motivation The date is different The title is different

7 Motivation
Why is template extraction useful?
- Component reuse: web developers can automatically extract components from a webpage.
- Enhancing indexers and text analyzers: their performance increases by processing only relevant information; it has been measured that 40-50% of the components of a webpage belong to the template.
- Extraction of the main content of a webpage so that it can be suitably displayed on a small device such as a PDA or a mobile phone.
- Extraction of the relevant content to make the webpage more accessible for visually impaired or blind users.

8 Contents
- Motivation
  - Content Extraction and Block Detection
  - Template Extraction
- A Technique for Template Extraction
  - State of the art
  - The DOM tree
  - Template extraction based on DOM
  - Experiments
  - Firefox plugin online DEMO
- Conclusions and Future Work

9 The Technique: State of the Art
Three main ways to solve the problem:
1. Using the textual information of the webpage (i.e., the HTML code)
2. Using the rendered image of the webpage in the browser
3. Using the DOM tree of the webpage
(1) Densitometric features: counting characters and tags. Statistics on terms: some terms are common in templates.
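As an illustration of densitometric features (a minimal sketch, not the exact algorithm of any cited work), one can compute a text-to-tag ratio per line of HTML: content-rich lines score high, while template lines full of markup score low.

```python
import re

def tag_ratios(html):
    """Compute, for each line of HTML, the ratio of text characters
    to tags (a simple densitometric feature: content-rich lines tend
    to have many characters and few tags)."""
    ratios = []
    for line in html.splitlines():
        tags = len(re.findall(r"<[^>]+>", line))
        text = len(re.sub(r"<[^>]+>", "", line).strip())
        # Avoid division by zero: a line with no tags is pure text.
        ratios.append(text / max(tags, 1))
    return ratios

html = ("<div><a href='#'>Home</a><a href='#'>News</a></div>\n"
        "This is a long paragraph of main content with many characters.")
print(tag_ratios(html))  # the menu line scores far lower than the text line
```

A threshold on this ratio separates candidate content lines from candidate template lines.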

10 The Technique

11 The Technique

12 The Technique: State of the Art
Three main ways to solve the problem:
1. Using the textual information of the webpage (i.e., the HTML code)
2. Using the rendered image of the webpage in the browser
3. Using the DOM tree of the webpage
(2) Position of elements: lateral menus, main content centered and visible. Less studied, because rendering webpages is computationally expensive.

13 The Technique: State of the Art
Three main ways to solve the problem:
1. Using the textual information of the webpage (i.e., the HTML code)
2. Using the rendered image of the webpage in the browser
3. Using the DOM tree of the webpage
(3) Analysis of the DOM structure: difficulty in analysing DIV-based structures. Comparing several webpages: search for common structures.

14 The Technique: Limitations of Current Approaches
- Some are based on the assumption that the webpage has a particular structure (e.g., based on table markup tags) [10].
- Some assume that the main content text is continuous [11].
- Some assume that the system knows a priori the format of the webpage [10].
- Some need to (randomly) load many webpages (several dozens) to compare them [15].

15 The Technique: Limitations of Current Approaches
<h2>Directory</h2>
<div class="vcard">
  <span class="fn">Vicente Ramos</span>
  <div class="org">Software Development</div>
  <div class="adr">
    <div class="street-address">Atmosphere 118</div>
    <span class="locality">La Piedad, México</span>
    <span class="postal-code">59300</span>
  </div>
  <div class="tel"> </div>
  <h4>His Company</h4>
  <a class="url" href="page2.html"> Company Page </a>
- Some are based on the assumption that the webpage has a particular structure (e.g., based on table markup tags) [10].
- Some assume that the main content text is continuous [11].
- Some assume that the system knows a priori the format of the webpage [10].
- Some assume that the whole website to which the webpage belongs is based on the use of some template that is repeated [12].

16 The Technique: Limitations of Current Approaches
The main problem of these approaches is a significant loss of generality. They require prior knowledge of the webpages, parsing them in advance, or a particular page structure. This is very inconvenient because modern webpages are mainly based on <div> tags, which do not need to be hierarchically organized (as in table-based design). Moreover, many webpages are nowadays automatically and dynamically generated, so it is often impossible to analyze them a priori.

17 The Technique
Other approaches are able to work:
+ Online (i.e., with any webpage)
+ In real-time (i.e., without the need to preprocess the webpages or know their structure)

18 The Technique: Site Style Tree
[Diagram: a Site Style Tree built over a DOM tree (Body, Table, Div, H1, Image, Text), with nodes annotated with occurrence counts 3, 2, 1.]

19 Contents
- Motivation
  - Content Extraction and Block Detection
  - Template Extraction
- A Technique for Template Extraction
  - State of the art
  - The DOM tree
  - Template extraction based on DOM
  - Experiments
  - Firefox plugin online DEMO
- Conclusions and Future Work

20 The Technique: The Document Object Model (DOM) [17]
An API that provides programmers with a standard set of objects for the representation of HTML and XML documents. Given a webpage, producing its associated DOM structure is completely automatic, and vice versa. The DOM structure of a given webpage is a tree where all the elements of the webpage (including scripts and CSS styles) are represented hierarchically.
[Diagram: example DOM tree with nodes Body, Table, Div, H1, Image, Text.]

21 The Technique: The Document Object Model (DOM) [17]
Nodes in the DOM tree can be of two types: tag nodes and text nodes.
- Tag nodes represent the HTML tags of an HTML document and contain all the information associated with the tags (e.g., their attributes).
- Text nodes are always leaves in the DOM tree because they cannot contain other nodes.
I want to know more!
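A DOM-like tree with tag nodes (carrying attributes) and leaf text nodes can be sketched with Python's standard html.parser; this is an illustration of the data structure, not the plugin's actual implementation.

```python
from html.parser import HTMLParser

class DomBuilder(HTMLParser):
    """Build a simple DOM-like tree: tag nodes carry a name, attributes,
    and children; text nodes are always leaves, as described above."""
    def __init__(self):
        super().__init__()
        self.root = {"tag": "#document", "attrs": {}, "children": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "children": []}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():  # non-blank text becomes a leaf text node
            self.stack[-1]["children"].append({"text": data.strip()})

builder = DomBuilder()
builder.feed("<body><div><h1>Title</h1>Some text</div></body>")
print(builder.root)
```

Real DOM implementations also model comments, attributes as nodes, etc.; this sketch keeps only what the technique needs: the tree shape, tag names, and text leaves.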

22 Contents
- Motivation
  - Content Extraction and Block Detection
  - Template Extraction
- A Technique for Template Extraction
  - State of the art
  - The DOM tree
  - Template extraction based on DOM
  - Experiments
  - Firefox plugin online DEMO
- Conclusions and Future Work

23 The Technique
Our method for template extraction in a nutshell:
1. Identify a set of webpages in the website topology: select those nodes that belong to the menu, using a complete subdigraph.
2. The template is the intersection between the initial webpage and all DOM trees in the subdigraph. The intersection is computed with a Top-Down Exact Mapping between the DOM trees.
Both steps can be done with a cost linear in the size of the DOM trees.

24 The Technique
Step 1: Identify a set of webpages in the website topology. Select those nodes that belong to the menu, using a complete subdigraph.
[Diagram: website topology with pages grouped into Domain A, Domain B, and Domain C, a menu linking the main pages, and a submenu.]
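The complete-subdigraph idea can be illustrated by brute force over a toy link graph: pages that share a menu all link to each other, so the menu members form a complete subdigraph. All page names here are hypothetical, and this exhaustive search is only a sketch (the technique itself is claimed to run in linear time).

```python
from itertools import combinations

# Toy link graph: page -> set of pages it links to (hypothetical names).
links = {
    "index":    {"news", "about", "contact"},
    "news":     {"index", "about", "contact", "article1"},
    "about":    {"index", "news", "contact"},
    "contact":  {"index", "news", "about"},
    "article1": {"news"},
}

def complete_subdigraphs(links, size):
    """Return every set of `size` pages in which all pages link to
    each other (a complete subdigraph): candidate menu members."""
    found = []
    for group in combinations(links, size):
        if all(b in links[a] for a in group for b in group if a != b):
            found.append(set(group))
    return found

print(complete_subdigraphs(links, 4))
# -> [{'index', 'news', 'about', 'contact'}]: the menu pages;
#    'article1' is excluded because it does not link back to all of them.
```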

25 The Technique
[Diagram: website topology with Domains A, B, and C, a menu, and a submenu.]
Size  F1     Loads
1     62,03  2
2     76,16  3,4
3     78,35  5,75
4     78,65  7,45
5     78,85  9,3
6     78,11  14
7     78,13  16,15
8            21

26 The Technique
Our method for template extraction in a nutshell: the template is the intersection between the initial webpage and all DOM trees in the subdigraph. The intersection is computed with a Top-Down Exact Mapping between the DOM trees.
[Diagram: five DOM trees P1-P5, each with nodes Body, Table, Div, H1, Image, Text.]

27 The Technique: Mapping
[Diagram: a mapping between two DOM trees (HTML, Body, Div, Table, P nodes), pairing corresponding nodes.]

28 The Technique: Top-Down Mapping
[Diagram: a top-down mapping between two DOM trees (HTML, Body, Div, Table nodes): a node can only be mapped if its parent is mapped.]

29 The Technique: Top-Down Exact Mapping
[Diagram: a top-down exact mapping between two DOM trees (HTML, Body, Div, Table nodes): mapped nodes must additionally be equal.]
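A top-down exact mapping can be sketched as a recursive intersection of two trees: a node is kept only if it matches its counterpart, and descendants of unmapped nodes are never mapped. This is a simplified illustration (trees as (tag, children) tuples, children compared left-to-right), not the authors' implementation.

```python
def intersect(t1, t2):
    """Top-down exact intersection of two DOM-like trees: roots must be
    equal to be mapped, and a child is only compared if all children
    before it were mapped (top-down: no node is mapped unless its
    parent is mapped)."""
    tag1, children1 = t1
    tag2, children2 = t2
    if tag1 != tag2:
        return None  # roots differ: empty mapping
    common = []
    for c1, c2 in zip(children1, children2):
        sub = intersect(c1, c2)
        if sub is None:
            break  # stop: descendants of unmapped nodes stay unmapped
        common.append(sub)
    return (tag1, common)

# Two pages sharing a menu (the div) but with different article text.
p1 = ("body", [("div", [("h1", []), ("text:menu", [])]),
               ("table", [("text:news", [])])])
p2 = ("body", [("div", [("h1", []), ("text:menu", [])]),
               ("table", [("text:sports", [])])])
print(intersect(p1, p2))
# -> ('body', [('div', [('h1', []), ('text:menu', [])]), ('table', [])])
```

Folding this intersection over the initial page and every page of the subdigraph yields the template, which matches the slide's description of step 2.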

30 The Technique: Experiments
Benchmarks: online heterogeneous webpages
- Domains with different layouts and page structures
- Companies' websites, news articles, forums, etc.
- Final evaluation set randomly selected
We determined the actual template of each webpage by downloading it and manually selecting the template. The DOM tree of the selected elements was then produced and used later for comparison in the evaluation.
The F1 metric is computed as (2*P*R)/(P+R), where P is the precision and R is the recall.
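The F1 computation above can be expressed directly over sets of retrieved versus actual template nodes; the node identifiers here are hypothetical.

```python
def f1_score(retrieved, relevant):
    """F1 metric as defined above: precision P is the fraction of
    retrieved (extracted) nodes that belong to the real template,
    recall R is the fraction of the real template that was retrieved,
    and F1 = 2*P*R / (P+R)."""
    true_pos = len(retrieved & relevant)
    p = true_pos / len(retrieved) if retrieved else 0.0
    r = true_pos / len(relevant) if relevant else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

extracted = {"body", "div1", "menu", "ad"}      # what the tool returned
actual    = {"body", "div1", "menu", "footer"}  # hand-labelled template
print(f1_score(extracted, actual))  # P = 3/4, R = 3/4 -> F1 = 0.75
```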

31 The Technique Experiments

32 The Technique: Experiments
Using CNR: the average recall is … and the average precision is …
Using tag ratios: the average recall is … and the average precision is …
Interesting phenomenon (property): either the recall, the precision, or both are 100%.
[Diagram: example DOM tree annotated per subtree: Body with recall 0%, precision 0%; a subtree with recall 100%, precision <100%; H1, Div, Table, Image with recall 100%, precision 100%; a Text subtree with recall <100%, precision 100%.]

33 Motivation A demo is better than a hundred words

34 Contents
- Motivation
  - Content Extraction and Block Detection
  - Template Extraction
- A Technique for Template Extraction
  - State of the art
  - The DOM tree
  - Template extraction based on DOM
  - Experiments
  - Firefox plugin online DEMO
- Conclusions and Future Work

35 The Technique: Template extraction from Wikipedia
Recall 100%, precision 100% (50% of the time).

36 The Technique: Template extraction from FilmAffinity
Precision <100% (sometimes forced by the designer).

37 The Technique: Template extraction from FilmAffinity
Recall <100% (6% of the time).

38 Contents
- Motivation
  - Content Extraction and Block Detection
  - Template Extraction
- A Technique for Template Extraction
  - State of the art
  - The DOM tree
  - Template extraction based on DOM
  - Experiments
  - Firefox plugin online DEMO
- Conclusions and Future Work

39 Conclusions and future work
New technique proposed for template extraction:
- It does not make assumptions about the particular structure of webpages.
- It only needs to process a single webpage (no templates and no other webpages of the same website are needed beforehand).
- No preprocessing stages are needed: the technique can work online.
- It is fully language independent (it can work with pages written in English, German, etc.).
- The particular text formatting of the webpage does not influence the performance of the technique.

40 Conclusions and future work
Update the implementation to the new Firefox API:
2004 -> Firefox 1
2006 -> Firefox 2
2008 -> Firefox 3
2011 -> Firefox 4
2012 -> Firefox 13!
New algorithm that selects the top-rated nodes based on the variance.
I want to know more!

