Web Content Extraction Based on Maximum Continuous Sum of Text Density

Kai Sun, Miao Li, Jinhua Du, Lei Chen, Zhengxin Yang, Yi Gao, Sha Fu
Email: sasunkai@mail.ustc.edu.cn
Institute of Intelligent Machines, Chinese Academy of Sciences

1 Introduction

Different websites generally have different page structures, which heavily affects extraction quality when web content is collected automatically. The maximum continuous sum of text density (MCSTD) method can extract web content from different web pages efficiently and effectively.

2 MCSTD System

The maximum continuous sum of text density (MCSTD) refers to the maximum value of $\sum_{k=i}^{j} a_k$ over all $1 \le i \le j \le n$, for a numeric sequence $a_1, a_2, \ldots, a_n$ of positive and negative numbers. If all the numbers are negative, the maximum continuous sum of text density is 0.

Figure 1. Framework of MCSTD

3 Critical Modules

Web Page Preprocessing

Web page standardization: We use the Beautiful Soup library for Python to standardize web pages.
Code conversion: We convert every page to UTF-8 encoding during preprocessing.
Removing irrelevant tags: This step mainly removes invalid tags that do not affect content extraction.

Calculating Text Density

$$TD(L) = \mathrm{TextLen}(L) - \mathrm{LinkLen}(L) - \mathrm{AverLen}(L)$$

$$\mathrm{AverLen}(L) = \mathrm{AllText} / \mathrm{LineNums}$$

Gauss Smooth

$$STD_i = \sum_{j=-2\sigma}^{2\sigma} \omega_j \cdot TD_{i+j}$$

$$\omega_j = \frac{\exp\left(-\frac{j^2}{2\sigma^2}\right)}{\sum_{m=-2\sigma}^{2\sigma} \exp\left(-\frac{m^2}{2\sigma^2}\right)}$$

Calculating MCSTD

The MCSTD problem can be solved with a dynamic programming algorithm. Its time complexity is O(n), that is, linear in the length of the sequence, so it is relatively efficient.

4 Experiments

4.1 Experimental Environment

Table 1. Experimental Environment

CPU                    Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz
Memory                 4.00 GB
Operating System       Windows 7
Development Language   Python

4.2 Experimental Results

We use the crawler to crawl 1,000 content web pages from 10 websites as the experimental data set.

Table 2. Data Sets

Document Set   Number of Pages   Size (MB)
Set1           100               33
Set2           200               56
Set3           500               121
Set4           800               182
Set5           1000              240

We compare the MCSTD method with the statistical algorithm. The average accuracy of the two algorithms is shown in Table 3.

Table 3. Comparison Results of Two Methods

Site   Statistical   MCSTD
www.sina.com.cn   93%   94%
www.sohu.com   95%   96%
www.cctv.com   100%
www.163.com   90%
Overall

We also compare the efficiency of the MCSTD method with that of the node traversal method based on the DOM tree.

Figure 2. Comparable Results

5 Conclusion

The MCSTD method can extract web content from news pages more precisely and efficiently. In the future, we will investigate MCSTD further and improve its performance. We will construct a high-quality Mongolian-Chinese comparable corpus on the basis of MCSTD.
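
The three computational steps of the critical modules (text density, Gauss smoothing, and the maximum continuous sum) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. It assumes that the per-line text length and link-text length have already been measured, for example from the preprocessed page. The function names, the default sigma of 1, and the treatment of out-of-range neighbours as 0 during smoothing are all assumptions introduced for the example. The linear-time dynamic programming step is written here as Kadane's algorithm, the standard O(n) method for the maximum contiguous sum.

```python
import math
from typing import List, Tuple


def text_density(text_lens: List[int], link_lens: List[int]) -> List[float]:
    """TD(L) = TextLen(L) - LinkLen(L) - AverLen, with AverLen = AllText / LineNums."""
    if not text_lens:
        return []
    aver_len = sum(text_lens) / len(text_lens)
    return [t - l - aver_len for t, l in zip(text_lens, link_lens)]


def gauss_smooth(td: List[float], sigma: int = 1) -> List[float]:
    """STD_i = sum over j in [-2*sigma, 2*sigma] of w_j * TD_{i+j}.

    Assumption: neighbours that fall outside the sequence contribute 0.
    """
    radius = 2 * sigma
    offsets = range(-radius, radius + 1)
    raw = [math.exp(-(j * j) / (2.0 * sigma * sigma)) for j in offsets]
    total = sum(raw)
    weights = [w / total for w in raw]  # normalised so the weights sum to 1
    smoothed = []
    for i in range(len(td)):
        acc = 0.0
        for j, w in zip(offsets, weights):
            if 0 <= i + j < len(td):
                acc += w * td[i + j]
        smoothed.append(acc)
    return smoothed


def max_continuous_sum(values: List[float]) -> Tuple[float, int, int]:
    """Return (best_sum, start, end) with end exclusive, in O(n) time.

    If no run has a positive sum, the result is (0.0, 0, 0), which matches
    the rule that the maximum continuous sum is 0 when all numbers are negative.
    """
    best_sum, best_start, best_end = 0.0, 0, 0
    cur_sum, cur_start = 0.0, 0
    for i, v in enumerate(values):
        if cur_sum <= 0:
            # A non-positive running sum cannot help; restart the run here.
            cur_sum, cur_start = v, i
        else:
            cur_sum += v
        if cur_sum > best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, i + 1
    return best_sum, best_start, best_end
```

Under these assumptions, the lines with indices from `start` up to (but not including) `end`, as returned by `max_continuous_sum(gauss_smooth(text_density(text_lens, link_lens)))`, would be taken as the main content block of the page.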