Data Extraction from the Web and Security Issues By Siddu P. Algur Head, Dept. of Information Science & Engineering S D M College of Engg. & Tech., Dharwad.

Contents
- Motivation
- Solution
- Existing Approaches
- New Approach (VSAP Algorithm)
- Empirical Evaluation
- Experimental Results
- Conclusion

Motivation
- A huge amount of information is available on the Internet.
- The data is distributed across the Internet.
- Undesired data appears alongside the relevant information.
- Data from various sources must be gathered in a local repository for further analysis.

The Solution: Web Mining
- Definition: web mining is data mining applied to web pages.
- Data mining is the extraction of useful information by observing patterns in the data.
- Web mining can be used to extract the useful information from web pages.
- We use web-page structure mining.

Web Mining: data mining on the Web. The term was coined by Etzioni in 1996.

- Web Structure Mining: a typical web graph has web pages as nodes and hyperlinks as edges connecting related pages. Web structure mining is the process of discovering structure information from the web.
- Web Usage Mining: focuses on techniques that can predict user behavior while the user interacts with the web.
- Web Content Mining: concentrates on the content of the web page. It is an automatic process that extracts patterns from web pages and goes beyond keyword extraction.

Web-Page Structure Mining
Definition: identifying relevant information by observing the visual structure of a web page. From the visual structure of a web page, we can determine the position of the relevant data on that page.

But… Retrieving relevant information from the web seems to be like – Finding the Needle in the Haystack...

The Web is highly volatile, distributed and heterogeneous. The Web is a huge chaotic information space without central authority. The Web is noisy.

Existing Approaches and Their Limitations
- MDR algorithm (Mining Data Records from Web Pages)
- DEPTA algorithm (Data Extraction using Partial Tree Alignment)
- VIPS algorithm (VIsion-based Page Segmentation)

MDR Algorithm: Observations
- Data regions: a group of data records is presented in a particular region of a page and formatted using similar HTML tags.
- Data records: similar data records placed in a specific region sit under the same parent in the tag tree.

MDR Algorithm: Steps
- Build an HTML tag tree of the page.
- Mine the data regions in the page using the tag tree and string comparison.
- Identify the data records in each data region.
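The first MDR step, building a tag tree, can be sketched with Python's standard `html.parser` module. This is only an illustration of what a tag tree is, not the MDR implementation: the names `TagNode`, `TagTreeBuilder`, `build_tag_tree` and `child_tag_string` are hypothetical, the list of void tags is deliberately short, and the tag string is just one simple form that a string comparison could work on.

```python
from html.parser import HTMLParser

# Tags without a closing tag; they are added to the tree but never opened.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TagNode:
    """One node of the tag tree (hypothetical helper, not part of MDR)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TagTreeBuilder(HTMLParser):
    """Builds a tree of TagNode objects while the parser reads the page."""

    def __init__(self):
        super().__init__()
        self.root = TagNode("#document")
        self._current = self.root

    def handle_starttag(self, tag, attrs):
        node = TagNode(tag, self._current)
        self._current.children.append(node)
        if tag not in VOID_TAGS:
            self._current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this name.
        # A stray closing tag with no matching opener is ignored.
        node = self._current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self._current = node.parent


def build_tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def child_tag_string(node):
    """Space-separated tag names of a node's children."""
    return " ".join(child.tag for child in node.children)
```

Children of one parent that share the same tag string, such as repeated table rows, are the kind of regularity the observations above rely on.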

[Figure: tag tree (TAG_TREE) with a data region, 4 generalized nodes and the data records labelled]

Limitations of the MDR Algorithm
- Tag dependent: it relies on particular HTML tags.
- Extracts irrelevant data regions along with the relevant one, so content mining is needed to identify the relevant data region.
- Highly prone to irregularities in the HTML tag structure, so it fails when tags are misused.
- Spends considerable time building the tag tree, traversing the whole tree and comparing strings.

DEPTA Algorithm: Steps
- Given a page, DEPTA first segments it using visual information to identify each data record.
- A novel partial tree alignment method then aligns and extracts the corresponding data items from the discovered data records and puts them in a database table.

Limitations of the DEPTA Algorithm
- It constructs the tag tree from visual information, so the tree is built correctly only as long as the browser renders the page correctly.
- It is tag dependent and hence prone to tag-structure irregularities.
- Constructing the tag tree and matching trees adds computation time.
- It fails to identify data records when the page contains only a single record.

VIPS Algorithm
- VIPS parses the HTML page and detects visual separators in the parse tree.
- The separators receive weights, which are adjusted depending on constraints based on the separator.
- Finally, the content structure of the page is created by merging the "visual" blocks that are not divided by separators.

Limitations of the VIPS Algorithm
- VIPS does not correctly identify the data regions either.
- VIPS depends on a number of heuristic rules that do not hold for most pages.

[Figure: layout of a typical web page, showing a tool bar, content links, a search and filtering panel, advertisement links, a copyright statement, and a data region holding data objects (data records) 1 to 4]

A Data Region containing 4 Data Records

Visual Structure based Analysis of web Pages (VSAP): The Proposed Approach
[Figure: an HTML page from the Internet passes through the parsing and rendering engine (MSHTML.dll), which yields the co-ordinates of the bounding rectangles of all tags. VSAP then identifies the data region through the largest rectangle identifier, the container identifier and the data region identifier (filter), producing the relevant data region.]

VSAP Algorithm: Steps
- Determine the co-ordinates of all the bounding rectangles.
- Identify the data region:
  - Identify the largest rectangle.
  - Identify the container within the largest rectangle.
  - Identify the data region containing the data records within that container.

Sample Web Page Of A Product related Web-site

HTML Parsing & Rendering Engine
- A component of every browser, e.g. MSHTML for Internet Explorer.
- Function: parse and render HTML pages.
- Used to obtain the bounding rectangle of each tag.
[Figure: HTML page -> parsing and rendering engine -> co-ordinates of the bounding rectangles of all tags]

Web Page Bounding Rectangles

Data Region Extractor
Made up of two components:
- Container Identifier: identifies the innermost tag that contains the data region.
- Filter: filters the identified container to obtain the data region.
[Figure: co-ordinates of the bounding rectangles of all tags -> data region extractor -> data region]

Container Identifier
- Obtain the largest bounding rectangle among the children of the BODY tag.
- Within it, get the smallest rectangle whose area is greater than half the area of the largest bounding rectangle.
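The two container-identifier steps can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Tag` structure and the function names are hypothetical, the bounding rectangles are assumed to have been obtained already from a rendering engine, and falling back to the largest rectangle itself when no descendant qualifies is an added assumption.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Tag:
    """A rendered tag with its bounding rectangle (hypothetical structure)."""
    name: str
    left: float
    top: float
    width: float
    height: float
    children: List["Tag"] = field(default_factory=list)

    @property
    def area(self) -> float:
        return self.width * self.height


def descendants(tag: Tag) -> Iterator[Tag]:
    """Depth-first listing of every tag below `tag`."""
    for child in tag.children:
        yield child
        yield from descendants(child)


def find_max_rect(body: Tag) -> Optional[Tag]:
    """Child of BODY with the largest bounding rectangle, or None if BODY is empty."""
    return max(body.children, key=lambda t: t.area, default=None)


def find_container(max_rect: Tag) -> Tag:
    """Smallest descendant whose area exceeds half the area of max_rect.

    Assumption: if no descendant qualifies, max_rect itself is returned.
    """
    container = max_rect
    for tag in descendants(max_rect):
        if tag.area > max_rect.area / 2 and tag.area < container.area:
            container = tag
    return container
```

The half-area threshold keeps the search inside the dominant block of the page, so a small sidebar or an individual record is never picked as the container.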

Web Page Container Identified

Filter
- Find the average height of the children of the container.
- Eliminate the children whose height is less than the average height.
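The filter rule is simple enough to sketch directly. In this sketch the function name `filter_by_average_height` and the `(label, height)` pair format are assumptions made for illustration, and the heights in the usage note are invented.

```python
def filter_by_average_height(children):
    """Keep the children whose height is not below the average height.

    `children` is a list of (label, height) pairs, one per child of the
    container. The pair format is an assumption of this sketch.
    """
    if not children:
        return []
    avg_height = sum(height for _, height in children) / len(children)
    # Children shorter than the average (separators, spacers) are dropped.
    return [(label, height) for label, height in children if height >= avg_height]
```

For example, with children of heights 5, 120, 130, 10 and 125, the average is 78, so the two short children are eliminated and the three tall ones remain as the data region.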

Container Data Region

EMPIRICAL EVALUATION
Data region identification
- MDR: depends on specific tags to identify data regions.
- VSAP: identifies data regions independently of specific tags.
Data record extraction
- MDR: identifies data records by keyword search (e.g. "$").
- VSAP: identifies data records from the visual structure of the web page.
Overall time complexity
- MDR: O(NK), where N is the total number of nodes in the tag tree and K is the maximum number of tag nodes in a generalized node.
- DEPTA: O(k^2), where k is the number of trees.
- VSAP: O(n), where n is the number of tag comparisons made.

Performance Measures
Recall = $E_c / N_t$
Precision = $E_c / E_t$
- Recall: the percentage of relevant data records identified from the web page.
- Precision: the correctness of the data records identified.
- $E_c$ is the total number of correctly extracted records.
- $N_t$ is the total number of records on the page.
- $E_t$ is the total number of records extracted.
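The two measures can be computed as in the sketch below. The function name `recall_precision` and its parameter names are hypothetical, and rejecting zero counts is a choice made for this sketch because the ratios are undefined in that case.

```python
def recall_precision(correct, on_page, extracted):
    """Return (recall, precision) as fractions between 0 and 1.

    correct   -- correctly extracted records (Ec)
    on_page   -- records present on the page (Nt)
    extracted -- records extracted in total (Et)
    """
    if on_page <= 0 or extracted <= 0:
        raise ValueError("on_page and extracted must be positive")
    recall = correct / on_page        # Ec / Nt
    precision = correct / extracted   # Ec / Et
    return recall, precision
```

As an invented example, a page with 10 records from which 8 records are extracted, all of them correct, gives a recall of 0.8 and a precision of 1.0.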

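EXPERIMENTAL RESULTS
Table columns: URL, MDR (Cor., Wr.), VSAP (Cor., Wr.), where Cor. and Wr. are the numbers of correctly and wrongly extracted data records.
Per-URL values as they survive in this transcript: / 080/ /25251/ /3200/ /0250/ /1105/ /0100/ /25250/ /84960/ /10100/ /1253/ /0100/ /15150/ /0150/ /0100/ /12121/ / / 0
Total row, Recall and Precision values in the order listed: 44.3%, 33.5%, 96.93%, 100%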

Performance Comparison of MDR and VSAP

Data Record Extractor
- Extraction of data records is based on visual clues.
- The height of each record is obtained.
- The average height is calculated.
- Data records whose height is greater than the average height are extracted.
[Figure: data region -> data record extractor -> data records]

DATA REGION DATA RECORDS

Data Record Identifier
- A flat data record gives a description of a single entity, whereas a nested data record gives multiple descriptions of a single entity.
[Figure: data record -> data record identifier -> flat data record or nested data record]

- Identifying the data records simplifies the task of extracting the data items, which many applications need.
- The Data Identifier determines the number of data fields in each data record within the data region.
- Flat records have fewer data fields than nested records.
- The number of fields in nested data records is approximately 40% more than in flat records.

In Fig. 1 the number of fields is 12 and in Fig. 2 it is 7, so Fig. 2 has 58.3% as many fields as Fig. 1 (7/12). Put the other way, Fig. 1 has about 71% more fields than Fig. 2 (5/7). Fig. 1 is a nested record and Fig. 2 is a flat record.
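The two field counts quoted above (12 and 7) can be related in more than one way, and the resulting percentage depends on which count is taken as the base. The short sketch below makes the three possible ratios explicit. The function name and the dictionary keys are hypothetical.

```python
def compare_field_counts(nested_fields, flat_fields):
    """Relate the field count of a nested record to that of a flat record."""
    difference = nested_fields - flat_fields
    return {
        # Flat count as a share of the nested count.
        "flat_share_of_nested": flat_fields / nested_fields,
        # How many more fields the nested record has, relative to the flat one.
        "nested_excess_over_flat": difference / flat_fields,
        # How many fewer fields the flat record has, relative to the nested one.
        "flat_shortfall_vs_nested": difference / nested_fields,
    }
```

For 12 and 7 fields these come to roughly 58.3%, 71.4% and 41.7% respectively.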

Extraction of Data Fields
- Extraction of data fields is based on bounding rectangles.
- Each field is associated with a bounding rectangle.
- Data fields are extracted row by row.
- The extracted data fields are stored in a file.
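One way to read "row by row" is to group fields whose bounding rectangles start at nearly the same vertical position and then order each group from left to right. The sketch below follows that reading, which is an assumption and not a detail given in the slides. The `(text, left, top)` field format, the pixel `tolerance`, the function names and the choice of CSV as the file format are all hypothetical.

```python
import csv


def group_into_rows(fields, tolerance=5):
    """Group fields into rows by the top edge of their bounding rectangles.

    `fields` is a list of (text, left, top) tuples. Fields whose top edge
    lies within `tolerance` pixels of the first field in a row join that row.
    """
    rows = []
    for text, left, top in sorted(fields, key=lambda f: (f[2], f[1])):
        if rows and abs(top - rows[-1]["top"]) <= tolerance:
            rows[-1]["cells"].append((left, text))
        else:
            rows.append({"top": top, "cells": [(left, text)]})
    # Within each row, order the cells from left to right.
    return [[text for _, text in sorted(row["cells"])] for row in rows]


def write_rows(rows, handle):
    """Store the extracted rows in an open text file, one row per line."""
    csv.writer(handle).writerows(rows)
```

A caller would open the target file with `open(path, "w", newline="")` and pass the handle to `write_rows`.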

Transferring the Data Items/Fields into the Database
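The slides do not say which database or schema is used for this transfer. As a hedged illustration only, the sketch below stores extracted rows with Python's standard `sqlite3` module. The table name `data_records`, its two columns and the function name `store_records` are invented for the example.

```python
import sqlite3


def store_records(conn, records):
    """Insert (name, price) pairs into a table of extracted data records.

    The table and column names are hypothetical; a real schema would
    follow the fields found in the data records.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS data_records (name TEXT, price TEXT)"
    )
    # Parameter placeholders keep extracted text from being read as SQL.
    conn.executemany(
        "INSERT INTO data_records (name, price) VALUES (?, ?)", records
    )
    conn.commit()
```

Calling `store_records(sqlite3.connect(":memory:"), rows)` is enough to try it without creating a file.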

Application
- VSAP can be used by any application that requires the most relevant information on a web page.
- VSAP can provide a platform for applications that need to analyze related data from different sources on the web.
- VSAP can serve as an efficient replacement for MDR, which has already found its place in the industry.

Conclusion
- The results show that VSAP performs better than the existing algorithms it was compared with.
- VSAP is a novel and efficient method of web mining.

Snapshot Amazon.com

Result By MDR By VSAP

Cooking.com

Result By MDR By VSAP

Tigerdirect.com

Result By MDR By VSAP

Overall VSAP Algorithm
Algorithm VSAP ( HTML document )
Begin
    Set maxRect = NULL
    Set dataRegion = NULL
    FindMaxRect ( BODY );
    FindDataRegion ( maxRect );
    FilterDataRegion ( dataRegion );
End

Algorithm to identify Largest Rectangle
Procedure FindMaxRect ( BODY )
Begin
    for each child of BODY tag
    begin
        find the co-ordinates of the bounding rectangle for the child
        if the area of the bounding rectangle > area of maxRect then
            maxRect = child
        endif
    end
End

Algorithm to identify Container of Data Region
Procedure FindDataRegion ( maxRect )
Begin
    ListChildren = depth first listing of the children of the tag associated with maxRect
    for each tag in ListChildren
    begin
        If area of bounding rectangle of tag > half the area of maxRect then
            If dataRegion is NULL or area of bounding rectangle of dataRegion > area of bounding rectangle of tag then
                dataRegion = tag
            endif
        endif
    end
End

Algorithm to Filter Data Region from the Container
Algorithm FilterDataRegion ( dataRegion )
Begin
    Set totalHeight = 0
    for each child of dataRegion
    begin
        totalHeight += height of the bounding rectangle of child
    end
    avgHeight = totalHeight / no of children of dataRegion
    for each child of dataRegion
    begin
        If height of child's bounding rectangle < avgHeight then
            Remove child from dataRegion
        endif
    end
End

Thank You