Deep-Web Crawling and Related Work
Matt Honeycutt, CSC 6400

Outline
- Basic background information
- Google's Deep-Web Crawl
- Web Data Extraction Based on Partial Tree Alignment
- Bootstrapping Information Extraction from Semi-structured Web Pages
- Crawling Web Pages with Support for Client-Side Dynamism
- DeepBot: A Focused Crawler for Accessing Hidden Web Content

Background
- Publicly-Indexable Web (PIW)
  - Web pages exposed by standard search engines
  - Pages link to one another
- Deep web
  - Content behind HTML forms
  - Database records
  - Estimated to be much larger than the PIW
  - Estimated to be of higher quality than the PIW

Google’s Deep-Web Crawl
J. Madhavan, D. Ko, L. Kot, V. Ganapathy, A. Rasmussen, A. Halevy

Summary
- Describes the process implemented by Google
- Goal is to 'surface' deep-web content for indexing
- Contributions:
  - An informativeness test
  - Query selection techniques and an algorithm for generating appropriate text inputs

About the Google Crawler
- Estimates that there are ~10 million high-quality HTML forms
- Goal: index representative deep-web content across many forms, driving search traffic to the deep web
- Two problems:
  - Which inputs to fill in?
  - What values to use?

Example Form

Query Templates
- Correspond to SQL-like queries: select * from D where P
- First problem: select the best templates
- Second problem: select the best values for those templates
- Presentation-related fields should be ignored

Incremental Search for Informative Query Templates
- Classify templates as either informative or uninformative
- A template is informative if it generates pages sufficiently distinct from those of other templates
- Build more complex templates from simpler informative ones
- Signatures are computed for each result page

Informativeness Test
- T is informative if the fraction of distinct page signatures among its generated pages is sufficiently high
- Heuristically limit to templates with 10,000 or fewer possible submissions and no more than 3 dimensions
- Informativeness can be estimated from a sample of possible queries (e.g., 200)
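
The distinctness criterion above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the threshold value and the notion of a page signature are simplifying assumptions.

```python
def informative(signatures, tau=0.25):
    """A template is deemed informative when the fraction of distinct
    page signatures among its sampled submissions exceeds tau.
    tau=0.25 is an illustrative threshold, not the paper's exact value."""
    if not signatures:
        return False
    return len(set(signatures)) / len(signatures) > tau

# A template whose submissions mostly return the same page is uninformative:
assert informative(["sigA", "sigB", "sigC", "sigD"]) is True
assert informative(["sigA"] * 9 + ["sigB"]) is False
```

In practice the signatures would be content hashes computed over the result pages, so near-duplicate error pages collapse to a single signature.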

Results

Observations
- URLs generated from larger templates are not as useful
- ISIT generates far fewer URLs than CP but still achieves high coverage
- Most common reason for failing to find an informative template: JavaScript
  - Ignoring JavaScript errors, informative templates were found for 80% of the forms tested

Generating Input Values
- Text boxes may be typed or untyped
- Special rules handle a small number of common typed inputs
- Generic keyword lists don't work; the best keywords are site-specific
- Select seed keywords from the form page, then iterate: submit queries and select candidate keywords from the results using TF-IDF
- Results are clustered; representative keywords are chosen for each cluster and ranked by page length
- Once candidate keywords have been selected, text inputs are treated like select inputs
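
One iteration of the candidate-selection step might look like the sketch below. The tokenization and exact scoring formula are assumptions for illustration; the paper's pipeline adds clustering and page-length ranking on top of this.

```python
import math
from collections import Counter

def tfidf_candidates(result_pages, top_k=5):
    """Rank candidate keywords from a sample of result pages by TF-IDF.

    result_pages: list of token lists, one per page retrieved by
    submitting the current seed keywords to the form."""
    # Document frequency: how many pages each term appears in.
    df = Counter()
    for page in result_pages:
        df.update(set(page))
    n = len(result_pages)
    # Accumulate TF-IDF over all pages; terms on every page score zero.
    scores = Counter()
    for page in result_pages:
        tf = Counter(page)
        for term, count in tf.items():
            scores[term] += (count / len(page)) * math.log(n / df[term])
    return [t for t, _ in scores.most_common(top_k)]

pages = [["apple", "pie", "recipe"],
         ["apple", "tart", "recipe"],
         ["banana", "bread", "recipe"]]
# "recipe" occurs on every page, so its IDF (and score) is zero:
assert "recipe" not in tfidf_candidates(pages, top_k=4)
```

The selected candidates would then be fed back into the form as the next round of probe queries.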

Identifying Typed Inputs

Conclusions
- Describes the innovations of "the first large-scale deep-web surfacing system"
- Results are already integrated into Google
- The informativeness test is a useful building block
- No need to cover individual sites completely
- Heuristics for common input types are useful
- Future work: support for JavaScript and handling dependencies between inputs
- Limitation: only GET requests are supported

Web Data Extraction Based on Partial Tree Alignment
Yanhong Zhai, Bing Liu

Summary
- DEPTA (Data Extraction based on Partial Tree Alignment): a novel technique for extracting data from record lists
- Automatically identifies records and aligns their fields
- Overcomes limitations of existing techniques

Example

Approach
- Step 1: Build a tag tree
- Step 2: Segment the page to identify data regions
- Step 3: Identify data records within the regions
- Step 4: Align records to identify fields
- Step 5: Extract fields into a common table

Building the Tag Tree and Finding Data Regions
- Compute a bounding region for each element
- Associate items with parents based on containment to build the tag tree
- Compare tag strings using edit distance to find data regions
- Finally, identify records within the regions
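
The tag-string comparison is a standard edit-distance computation over sequences of tag names; a minimal sketch (the tag strings shown are hypothetical examples, not from the paper):

```python
def edit_distance(a, b):
    """Levenshtein distance between two tag strings, i.e. sequences of
    tag names. Low distance between sibling subtrees signals a data region."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # delete from a
                         cur[j - 1] + 1,     # insert into a
                         prev[j - 1] + cost) # match or substitute
        prev = cur
    return prev[n]

row1 = ["tr", "td", "img", "td", "a"]
row2 = ["tr", "td", "img", "td", "a"]
row3 = ["tr", "td", "span"]
assert edit_distance(row1, row2) == 0  # identical rows: likely one region
assert edit_distance(row1, row3) == 3  # structurally different rows
```

DEPTA thresholds a normalized form of this distance to decide whether adjacent generalized nodes belong to the same data region.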

Identifying Regions

Partial Tree Alignment
- Full tree matching is expensive
- Simple Tree Matching: faster, but not as accurate
- The longest record tree becomes the seed
- Fields that don't match are added to the seed
- Finally, field values are extracted and inserted into a table
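
Simple Tree Matching (the faster alternative mentioned above) computes the size of the maximum ordered matching between two trees with a dynamic program over child forests. A sketch, with trees represented as (label, children) tuples:

```python
def stm(a, b):
    """Simple Tree Matching: number of matched node pairs between two
    ordered, labeled trees. Only nodes with equal labels can match, and
    matched children must preserve sibling order."""
    if a[0] != b[0]:
        return 0
    ca, cb = a[1], b[1]
    m, n = len(ca), len(cb)
    # w[i][j]: best matching between the first i children of a
    # and the first j children of b.
    w = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w[i][j] = max(w[i - 1][j], w[i][j - 1],
                          w[i - 1][j - 1] + stm(ca[i - 1], cb[j - 1]))
    return w[m][n] + 1  # +1 for the matched roots

t1 = ("tr", [("td", []), ("td", [])])
t2 = ("tr", [("td", []), ("span", [])])
assert stm(t1, t1) == 3  # root plus both td children
assert stm(t1, t2) == 2  # root plus one td child
```

In DEPTA the record tree with the most matches becomes the seed, and unmatched fields from other records are grown into it.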

Seed Expansion

Conclusions
- Surpasses previous work (MDR)
- Capable of extracting data very accurately
  - Recall: 98.18%
  - Precision: 99.68%

Bootstrapping Information Extraction from Semi-structured Web Pages
A. Carlson, C. Schafer

Summary
- A method for extracting structured records from web pages
- Requires very little training and achieves good results in two domains

Introduction
- Extracting structured fields enables advanced information-retrieval scenarios
- Much previous work has been site-specific or has required substantial manual labeling
- Heuristic-based approaches have not had great success
- Uses semi-supervised learning to extract fields from web pages
- The user only has to label 2-5 pages for each of 4-6 sites

Technical Approach
- A human specifies the domain schema and labels training records from representative sites
- Partial tree alignment is used to acquire additional records for each site
- New records are automatically labeled
- A regression model is learned that predicts mappings from fields to schema columns

Mapping Fields to Columns
- Calculate a score between each field and column
- Scores are based on field contexts and the contexts observed in training
- The most probable mapping above a threshold is accepted

Example Context Extraction

Feature Types
- Precontext 3-grams
- Lowercase value tokens
- Lowercase value 3-grams
- Value token-type categories

Example Features

Scoring
- Field mappings are scored by comparing feature distributions:
  - the distribution computed from training contexts
  - the distribution computed from observed contexts
- Completely dissimilar field/column pairs are fully divergent; exact matches have no divergence
- Feature similarities are combined using a "stacked" linear regression model
- Weights for the model are learned in training
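
One divergence with exactly the properties described (zero for identical distributions, maximal for disjoint ones) is the Jensen-Shannon divergence; the paper's exact measure may differ, so treat this as an illustrative stand-in:

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two feature
    distributions given as {feature: probability} dicts.
    Returns 0.0 for identical distributions and 1.0 for fully
    disjoint ones, matching the slide's intuition."""
    keys = set(p) | set(q)
    mix = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):  # KL divergence from a to the mixture
        return sum(a.get(k, 0.0) * math.log2(a.get(k, 0.0) / mix[k])
                   for k in keys if a.get(k, 0.0) > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)

# Identical training/observed contexts: no divergence.
assert js_divergence({"usd": 1.0}, {"usd": 1.0}) == 0.0
# Completely dissimilar contexts: full divergence.
assert abs(js_divergence({"usd": 1.0}, {"mi": 1.0}) - 1.0) < 1e-9
```

Per-feature-type divergences like this would then be combined by the learned stacked regression model to produce the final field-to-column score.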

Results

Crawling Web Pages with Support for Client-Side Dynamism
Manuel Alvarez, Alberto Pan, Juan Raposo, Justo Hidalgo

Summary
- An advanced crawler based on browser automation
- NSEQL: a language for specifying browser actions
- Stores each URL along with the navigation path needed to reach it

Limitations of Typical Crawlers
- Built on low-level HTTP APIs
- Limited or no support for client-side scripts
- Limited support for sessions
- Can only see what's in the HTML

Their Crawler’s Features
- Built on "mini web browsers" (the MSIE Browser Control)
- Handles client-side JavaScript
- Routes fully support sessions
- Limited form-handling capabilities

NSEQL

Identifying New Routes
- Routes can come from links, forms, and JavaScript
- 'href' attributes are extracted from normal anchor tags
- Tags with JavaScript click events are identified and "clicked"; the crawler captures the resulting actions and inspects them
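
The first half of this process (finding plain links and click-sensitive elements) can be illustrated with the standard-library HTML parser. This is only a sketch of the idea: the actual crawler drives a real browser control, so it also sees handlers attached dynamically by scripts, which a static parse like this cannot.

```python
from html.parser import HTMLParser

class RouteExtractor(HTMLParser):
    """Collects candidate navigation routes: plain hrefs from anchors,
    plus elements with inline JavaScript click handlers that a
    browser-based crawler would need to 'click' to discover the
    resulting request."""
    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.clickables = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "a" and a.get("href", "").startswith(("http", "/")):
            self.hrefs.append(a["href"])
        if "onclick" in a:
            self.clickables.append((tag, a["onclick"]))

ex = RouteExtractor()
ex.feed('<a href="/page2">next</a><div onclick="load(3)">more</div>')
assert ex.hrefs == ["/page2"]
assert ex.clickables == [("div", "load(3)")]
```

Each clickable element would then be executed inside the automated browser, and the navigation it triggers recorded as an NSEQL route.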

Results and Conclusions
- Large-scale websites are crawler-friendly; many medium-scale, deep-web sites aren't
- Crawlers should handle client-side script
- The presented crawler has been applied to real-world applications

DeepBot: A Focused Crawler for Accessing Hidden Web Content
Manuel Alvarez, Juan Raposo, Alberto Pan

Summary
- Presents a focused deep-web crawler
- An extension of the authors' previous work
- Crawls links and handles search forms

Architecture

Domain Definitions
- Attributes a1...aN; each attribute has a name, aliases, and a specificity index
- Queries q1...qN; each query contains one or more (attribute, value) pairs
- A relevance threshold
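
The three components above map naturally onto a small data structure. The field names and the example values are hypothetical, chosen only to mirror the slide's description:

```python
from dataclasses import dataclass

@dataclass
class Attribute:
    name: str
    aliases: list       # alternative labels that may appear on forms
    specificity: float  # specificity index: how strongly this attribute
                        # signals the target domain

@dataclass
class DomainDefinition:
    attributes: list          # a1..aN
    queries: list             # each query: list of (attribute, value) pairs
    relevance_threshold: float

# Hypothetical book-shopping domain definition:
books = DomainDefinition(
    attributes=[Attribute("title", ["book title"], 0.9),
                Attribute("author", ["written by", "by"], 0.8)],
    queries=[[("title", "dune")], [("author", "herbert")]],
    relevance_threshold=0.7,
)
```

DeepBot scores each discovered form against the attributes; only forms whose score exceeds `relevance_threshold` have the queries executed against them.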

Example Definition

Evaluating Forms
- Obtain the bounding coordinates of all form fields and potential labels
- Compute distances and angles between fields and labels

Evaluating Forms
- If label l is within the minimum distance of field f, l is added to f's list
  - Ties are broken using the angle
- Lists are pruned so that each label appears in only one list and every field has at least one possible label
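
The distance/angle association step (before pruning) might be sketched as below. The distance threshold, the tie-breaking rule, and the coordinates are illustrative assumptions, not DeepBot's exact parameters:

```python
import math

def associate_labels(fields, labels, min_distance=150):
    """Assign each form field the nearest label within min_distance
    (pixels), breaking distance ties by preferring the smaller angle.
    Coordinates are the centers of the rendered bounding boxes."""
    assignment = {}
    for fname, (fx, fy) in fields.items():
        best = None  # (distance, angle, label_name)
        for lname, (lx, ly) in labels.items():
            dist = math.hypot(fx - lx, fy - ly)
            if dist > min_distance:
                continue
            angle = abs(math.atan2(fy - ly, fx - lx))
            if best is None or (dist, angle) < best[:2]:
                best = (dist, angle, lname)
        if best is not None:
            assignment[fname] = best[2]
    return assignment

fields = {"title_box": (200, 100), "author_box": (200, 140)}
labels = {"Title": (100, 100), "Author": (100, 140)}
assert associate_labels(fields, labels) == {"title_box": "Title",
                                            "author_box": "Author"}
```

The pruning pass described on the slide would then resolve any label claimed by more than one field.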

Evaluating Forms
- Text-similarity measures are used to link domain attributes to fields
- The relevance of the form is computed
- If the form's score exceeds the relevance threshold, DeepBot executes its queries

Results and Conclusions
- Evaluated on three domain tasks: book, music, and movie shopping
- Achieves very high precision and recall
- Errors were due to:
  - Missing aliases
  - Forms with too few fields to achieve minimum support
  - Sources that did not label their fields

Summary of Deep-Web Crawling
- Several challenges must be addressed:
  - Understanding forms
  - Handling JavaScript
  - Determining optimal queries
  - Identifying result links
  - Extracting metadata
- Most of the pieces exist

Questions?

References
- Madhavan, J., Ko, D., Kot, Ł., Ganapathy, V., Rasmussen, A., and Halevy, A. 2008. Google's Deep-Web crawl. Proc. VLDB Endow. 1, 2 (Aug. 2008).
- Zhai, Y. and Liu, B. 2005. Web data extraction based on partial tree alignment. In Proceedings of the 14th International Conference on World Wide Web (Chiba, Japan, May 2005). WWW '05. ACM, New York, NY.
- Carlson, A. and Schafer, C. 2008. Bootstrapping information extraction from semi-structured web pages. In Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases, Part I (Antwerp, Belgium, September 2008).
- Álvarez, M., Pan, A., Raposo, J., and Hidalgo, J. 2006. Crawling web pages with support for client-side dynamism. In Proceedings of the 7th International Conference on Web-Age Information Management (WAIM 2006), J. X. Yu, M. Kitsuregawa, and H. V. Leong, Eds. Lecture Notes in Computer Science, vol. 4016. Springer-Verlag, Berlin. Hong Kong, China, June 2006.
- Álvarez, M., Raposo, J., Pan, A., Cacheda, F., Bellas, F., and Carneiro, V. 2007. DeepBot: a focused crawler for accessing hidden web content. In Proceedings of the 3rd International Workshop on Data Engineering Issues in E-Commerce and Services (DEECS '07), in conjunction with the ACM Conference on Electronic Commerce (EC '07) (San Diego, California, June 2007). ACM, New York, NY.