Web Content Extraction Based on Maximum Continuous Sum of Text Density

Kai Sun, Miao Li, Jinhua Du, Lei Chen, Zhengxin Yang, Yi Gao, Sha Fu
Institute of Intelligent Machines, Chinese Academy of Sciences

1 Introduction

Different websites generally have different page structures, which heavily affects extraction quality when web content is collected automatically. The maximum continuous sum of text density (MCSTD) method can extract web content from different web pages efficiently and effectively.

2 MCSTD System

The maximum continuous sum of text density (MCSTD) is the maximum of $\sum_{k=i}^{j} a_k$ over all $1 \le i \le j \le n$ for a sequence of positive and negative numbers $a_1, a_2, \ldots, a_n$. If all the numbers are negative, the maximum continuous sum of text density is 0.

Figure 1. Framework of MCSTD

3 Critical Modules

Web Page Preprocessing

Web page standardization: We use the BeautifulSoup library of Python to standardize web pages.
Code conversion: We convert the encoding of every page to UTF-8 during preprocessing.
Removing irrelevant tags: We remove invalid tags that do not affect content extraction.

Calculating Text Density

$$TD(L) = \mathit{TextLen}(L) - \mathit{LinkLen}(L) - \mathit{AverLen}(L)$$

$$\mathit{AverLen}(L) = \mathit{AllText} / \mathit{LineNums}$$

Gauss Smooth

$$STD_i = \sum_{j=-2\sigma}^{2\sigma} \omega_j \cdot TD_{i+j}$$

$$\omega_j = \frac{\exp\left(-\frac{j^2}{2\sigma^2}\right)}{\sum_{m=-2\sigma}^{2\sigma} \exp\left(-\frac{m^2}{2\sigma^2}\right)}$$

Calculating MCSTD

The MCSTD problem can be solved with a dynamic programming algorithm. Its time complexity is O(n), linear in the length of the sequence, so it is relatively efficient.

4 Experiments

4.1 Experimental Environment

Table 1. Experimental Environment

CPU                   Intel(R) Core(TM) i
Memory                4.00 GB
Operating System      Windows 7
Development Language  Python

4.2 Experimental Result

We use a crawler to collect 1,000 content web pages from 10 websites as the experimental data set.

Table 2. Data Sets

Document Set  Number of Pages  Size (MB)
Set1          100              33
Set2          200              56
Set3          500              121
Set4          800              182
Set5          1000             240

We compare the MCSTD method with the statistical algorithm. The average accuracy of the two algorithms is shown in Table 3.

Table 3. Comparison Results of Two Methods

Columns: Site, Statistical, MCSTD, with a final Overall row.
Values as extracted (site names and cell positions not recoverable): 93%, 94%, 95%, 96%, 100%, 90%

We also compare the efficiency of the MCSTD method with the node traversal method based on the DOM tree.

Figure 2. Comparison Results

5 Conclusion

The MCSTD method can extract web content from news pages more precisely and efficiently.
In the future, we will investigate MCSTD further and improve its performance.
We will construct a high-quality Mongolian and Chinese comparable corpus on the basis of MCSTD.
