Flint: exploiting redundant information to wring out value from Web data Lorenzo Blanco, Mirko Bronzi, Valter Crescenzi, Paolo Merialdo, Paolo Papotti.


Flint: exploiting redundant information to wring out value from Web data Lorenzo Blanco, Mirko Bronzi, Valter Crescenzi, Paolo Merialdo, Paolo Papotti Roma Tre University - Rome, Italy

Motivations
The opportunity: an increasing number of web sites publish structured information.
The problem:
– current technologies are limited in exploiting the data offered by these sources
– Semantic Web technologies are too complex and costly
Challenges:
– development of unsupervised, scalable techniques to extract and integrate data from the fairly structured large corpora available on the Web [DB Claremont Report 2008]

Structured Data at Work: Search Engines


Introduction
Notable approaches for massive extraction of Web data concentrate on information organized according to specific patterns that occur on the Web:
– WebTables [Cafarella et al VLDB2008] and ListExtract [Elmeleegy et al VLDB2009] focus on data published in HTML tables and lists
– information extraction systems (e.g. TextRunner [Banko & Etzioni ACL2008]) exploit lexical-syntactic patterns to extract collections of facts (e.g., x is the capital of y)
Even a small fraction of the Web implies an impressive amount of data: given a Web fragment of 14 billion pages, 1.1% of them contain good tables, i.e. 154 million tables [Cafarella et al VLDB2008]

Observation
Many sources publish data about one object of a real-world entity per page. Collections of such pages can be thought of as HTML encodings of a relation (e.g., one detail page each for NASDAQ:AAPL, NASDAQ:GOOG, NASDAQ:MSFT, …).

Observation
Learned while looking for pages to evaluate RoadRunner: I was frustrated … RoadRunner was built to infer arbitrary nested structures (lists of lists of lists …), but real pages were much simpler. And pages with complex structures were usually designed to support the navigation to detail pages.

Information redundancy on the Web
For many disparate entities (e.g. stock quotes, people, products, movies, books) many web sites follow this publishing strategy. These sites can be considered as sources that provide redundant information. The redundancy occurs:
– at the schema level: the same attributes are published by more than one source (e.g. volume, min/max/avg price, market cap for stock quotes)
– at the extensional level: several objects are published by more than one source (e.g. many web sites publish data about the same stock quotes)

Abstract Generative Process
A "hidden relation" R_0 (with attributes ticker, price, day-max, day-min, volume, beta, … and tuples for nasdaq:aapl, nasdaq:goog, nyse:cat, …) generates the sources:
S_G = λ_G(e_G(σ_G(π_G(R_0))))
S_Y = λ_Y(e_Y(σ_Y(π_Y(R_0))))
S_R = λ_R(e_R(σ_R(π_R(R_0))))

Abstract Generative Process
Each source is generated from the hidden relation by:
– π projection
– σ selection
– e error (e.g. approximations, mistakes, formatting)
– λ template encoding
For example, σ_{ticker like "nasdaq:%"}(π_{ticker, price, volume}(R_0)) yields a view with attributes ticker, price, volume restricted to the NASDAQ tuples (nasdaq:aapl with volume 29,129,032; nasdaq:goog with volume 2,894,755; …).

Abstract Generative Process
The error function then perturbs the selected view:
e_Y(σ_{ticker like "nasdaq:%"}(π_{ticker, price, volume}(R_0)))
with e_Y(): round(volume, 1000), price(N_σ), i.e. volumes are rounded to thousands (29,129 and 2,894) and prices are slightly perturbed.

"Hidden Relation" Abstract Generative Process Each source generated by: – π projection – σ selection – e error (e.g. approx, mistakes, formattings) – λ template encoding tickerpriceday-maxday-minvolumebeta… nasdaq:aapl ,129, … nasdaq:goog ,894, … nyse:cat ,709, … ……………… S Y = λ Y (e Y (σ ticker like "nasdaq:%" (π ticker, price, volume (R 0 )))) e Y (): round(volume, 1000), price(N σ )

Problem: Invert the Process
Given only the published sources, reconstruct the hidden relation R_0 (ticker, price, day-max, day-min, volume, beta, …).

The Flint approach
Exploit the redundancy of information:
– to discover the sources [Blanco et al WIDM08, WWW11]
– to generate the wrappers, to match data from different sources, and to infer labels for the extracted data [Blanco et al EDBT08, WebDB10, VLDS11]
– to evaluate the quality of the data and the accuracy of the sources [Blanco et al Caise2010, Wicow11]

The Flint approach
We first address the second point: generating the wrappers, matching data from different sources, and inferring labels for the extracted data [Blanco et al EDBT08, WebDB10, VLDS11].

The Flint approach (intuition)
Three sources S_G, S_Y, S_R publish views of the hidden relation with anonymous attributes: S_G has columns A1, A2, A3 (with rows such as nasdaq:aapl and nasdaq:msft), S_Y has columns A1, A2, A3 (with rows such as nasdaq:aapl and nasdaq:goog), and S_R has columns A1, A2, A3, A4.

The Flint approach (intuition)
After extraction and integration the same views carry semantics: S_G exposes (ticker, price, volume), S_Y exposes (ticker, price, beta), and S_R exposes (ticker, price, volume, beta).

Integration and Extraction
1. the integration problem
2. the extraction problem
3. how they can be tackled contextually
We start by considering the web sources as relational views over the hidden relation.

Integration Problem
Given a set of sources S1, S2, …, Sk, where each Si publishes a view of the hidden relation.
Problem: create a set of mappings, where each mapping is a set of attributes with the same semantics.

Integration Problem
(The three sources S_G, S_Y, S_R again, each exposing only anonymous attributes A1, A2, A3, and A4 for S_R.)

Integration Algorithm
Intuition: we match attributes from different sources to build aggregations of attributes with the same semantics.
Assumption: alignment (record linkage) over a bunch of tuples.
To identify attributes with the same semantics we rely on instance-based matching: an attribute a1 of source S_A is compared value by value with an attribute b1 of source S_B, producing a distance such as d(a1, b1) = 0.08; noise implies possible discrepancies in the values.
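The slides do not give the exact distance function, so the Python sketch below is only an illustration of instance-based matching (the sample values are made up): columns are keyed by the object identifier produced by record linkage, numeric values contribute their relative difference, and other values contribute a 0/1 mismatch.

def value_distance(u, v):
    """Relative difference for numeric values, 0/1 mismatch otherwise."""
    try:
        x = float(str(u).replace(",", ""))
        y = float(str(v).replace(",", ""))
        return min(abs(x - y) / max(abs(x), abs(y), 1e-9), 1.0)
    except ValueError:
        return 0.0 if str(u).strip() == str(v).strip() else 1.0

def attribute_distance(col_a, col_b):
    """Average value distance over the objects shared by two columns (dicts: object id -> value)."""
    shared = col_a.keys() & col_b.keys()
    if not shared:
        return 1.0
    return sum(value_distance(col_a[k], col_b[k]) for k in shared) / len(shared)

# e.g. two price columns from sources S_A and S_B, aligned by ticker:
a1 = {"nasdaq:aapl": "125.41", "nasdaq:goog": "389.50"}
b1 = {"nasdaq:aapl": "125.38", "nasdaq:goog": "389.55"}
print(attribute_distance(a1, b1))   # a small distance suggests the same semantics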

Integration Algorithm
Illustrated on three sources S_A, S_B, S_C with attributes a1–a3, b1–b3, c1–c3 (tickers, prices, volumes):
– every attribute is a node of a matching graph
– every attribute is matched against all the other attributes
– edges are ranked w.r.t. the distance (which accounts for the discrepancies); we start with the best match
– we drop the useless edges
– we take the next edge in the rank and drop the useless edges, and so on until all edges have been processed

Integration Algorithm
A clustering algorithm solves the problem: AbstractIntegration is O(n²) in the total number of attributes of the sources. But so far we have been dealing with clean relational views… are these the relations we actually get from wrappers?
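As a sketch of the greedy grouping described above (not the published AbstractIntegration algorithm), the following Python code reuses attribute_distance from the previous sketch; the max_distance threshold is an assumption.

from itertools import combinations

def integrate(sources, max_distance=0.2):
    """sources: dict source name -> dict attribute name -> column (object id -> value)."""
    attrs = [(s, a) for s, cols in sources.items() for a in cols]
    # candidate edges between attributes of *different* sources, ranked by distance
    edges = sorted(
        ((attribute_distance(sources[s1][a1], sources[s2][a2]), (s1, a1), (s2, a2))
         for (s1, a1), (s2, a2) in combinations(attrs, 2) if s1 != s2),
        key=lambda e: e[0])

    cluster_of = {node: {node} for node in attrs}   # start from singleton clusters
    for dist, u, v in edges:                        # best match first
        if dist > max_distance:
            break                                   # remaining edges are too weak
        cu, cv = cluster_of[u], cluster_of[v]
        if cu is cv:
            continue
        # "useless" edges: never merge two attributes coming from the same source
        if {s for s, _ in cu} & {s for s, _ in cv}:
            continue
        merged = cu | cv
        for node in merged:
            cluster_of[node] = merged
    # each distinct cluster is a mapping: attributes assumed to share the same semantics
    return {frozenset(c) for c in cluster_of.values()}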

Extraction Problem
A source S_i is a collection of pages S_i = p1, p2, …, pn, where each page publishes data about one object of a real-world entity.
Two different types of values can appear in a page:
– target values: data from the hidden relation
– noise values: irrelevant data (e.g., advertising, template, layout, etc.)

Extraction Problem
A wrapper w_i is a set of extraction rules w_i = er_A1, …, er_An. Applied to the pages of a source (e.g. page 1 and page 2), the rules er_1, …, er_4 extract the columns A1–A4 of a table, e.g. (GOOG, 24.5, Coke, 2.1%) and (AAPL, 9.2, Pepsi, 42ML).

Extraction Problem
Limits of unsupervised wrapper inference:
– extraction of noise data (e.g. er_3)
– some extraction rules may be imprecise (e.g. er_4)
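To make wrappers and rules concrete, here is a toy Python sketch in which a wrapper is a set of XPath extraction rules applied to two hypothetical pages; the markup, the class names, and the use of the third-party lxml library are assumptions, not Flint's actual rule language.

from lxml import html

pages = [
    "<html><body><span class='ticker'>GOOG</span><span class='price'>24.5</span>"
    "<div class='ad'>Coke</div><em>2.1%</em></body></html>",
    "<html><body><span class='ticker'>AAPL</span><span class='price'>9.2</span>"
    "<div class='ad'>Pepsi</div><em>42ML</em></body></html>",
]

wrapper = {                                    # rule name -> XPath extraction rule
    "er_1": "//span[@class='ticker']/text()",
    "er_2": "//span[@class='price']/text()",
    "er_3": "//div[@class='ad']/text()",       # extracts noise (advertising text)
    "er_4": "//em/text()",                     # imprecise: mixes values such as '2.1%' and '42ML'
}

def apply_wrapper(wrapper, pages):
    """Apply each extraction rule to every page, producing one column per rule."""
    trees = [html.fromstring(p) for p in pages]
    return {name: [(t.xpath(rule) or [None])[0] for t in trees]
            for name, rule in wrapper.items()}

print(apply_wrapper(wrapper, pages))
# {'er_1': ['GOOG', 'AAPL'], 'er_2': ['24.5', '9.2'],
#  'er_3': ['Coke', 'Pepsi'], 'er_4': ['2.1%', '42ML']}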

Extraction Problem
An extraction rule is:
– correct if for every page it extracts a target value of the same conceptual attribute
– weak if it mixes either target values with different semantics or target values with noise values
(Example columns: A1 = GOOG, AAPL; A2 = 24.5, 9.2; A3 = Coke, Pepsi; A4 = 2.1%, 42ML.)

Extraction Problem
Problem: given a set of sources S = S1, S2, …, Sn, produce a set of wrappers W* = {w1, w2, …, wn} such that wi contains all and only the correct rules for Si.
We leverage the redundant information among different sources to identify and filter out the weak rules: in a redundant environment, extracted data do not match by chance!

Overlapping Rules
To increase the probability of getting the correct rules, we need a wrapper with more (candidate) extraction rules. For example, over pages P1, P2, P3, … the candidate rules extract:
er_1 = 2.1%, 42ML, 3.0%, …
er_2 = 2.1%, 1.3%, 3.0%, …
er_3 = 33ML, 42ML, 1ML, …
er_4 = 5, 5, 6, …

Overlapping Rules
Two extraction rules from the same wrapper overlap if they extract the same occurrence of the same string from one page (e.g. er_1 and er_2 above both extract the occurrence of "2.1%" from P1).

Overlapping Rules
Given a set of overlapping rules, one is correct and the others are weak.
Idea: match all of them against the rules of the other sources: (i) the correct rule is the one with the best matching score; (ii) drop the others (e.g. the overlapping rules r5, r6 of S1 are matched against rules r8, r9 of S2).

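A minimal Python sketch of this disambiguation step follows (reusing attribute_distance from the integration sketch); the scoring is illustrative, not Flint's exact matching score.

def best_of_overlapping(overlapping_rules, other_sources):
    """overlapping_rules: rule name -> extracted column (object id -> value), all from one source.
    other_sources: list of {attribute name -> column} dicts, one per other source."""
    def best_score(column):
        # distance to the closest column published by any other source
        return min((attribute_distance(column, other_col)
                    for cols in other_sources for other_col in cols.values()),
                   default=1.0)
    ranked = sorted(overlapping_rules.items(), key=lambda kv: best_score(kv[1]))
    correct_name, _column = ranked[0]   # (i) keep the rule with the best matching score
    return correct_name                 # (ii) the other overlapping rules are dropped as weak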

Integration Algorithm (recap)
– it is correct
– it is O(n²) in the total number of attributes of the sources

Extraction and Integration Algorithm
Lemma: AbstractExtraction is correct. AbstractExtraction is O(n²) in the total number of extraction rules.

Extraction and Integration Algorithm
In practice, a greedy best-effort algorithm performs integration and extraction together [Blanco et al. WebDB2010, WWW2011], with promising experimental results.

Some Results
R = number of correct extraction rules over the number of sources containing the actual attribute.

Adding Labels
Last step: assign a label to each mapping. Candidate labels are the textual template nodes that occur closest to the extracted values:
– poor performance on a single source
– but effective on a large number of sources, because it exploits the redundancy of the labels (observed also in [Cafarella et al SIGMOD Record 2008])
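A small Python sketch of the label-aggregation idea; the normalization and the simple majority choice are assumptions, the slides only state that the redundancy of labels across many sources is exploited.

from collections import Counter

def choose_label(candidate_labels):
    """candidate_labels: one proposed label per source, e.g. the template text closest to the values."""
    normalized = [lbl.strip().lower().rstrip(":") for lbl in candidate_labels if lbl]
    if not normalized:
        return None
    label, _count = Counter(normalized).most_common(1)[0]
    return label

print(choose_label(["Volume:", "volume", "Vol.", "Volume:", "volume"]))   # -> "volume"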

The Flint approach
The remaining components exploit the same redundancy: discovering the sources [Blanco et al WIDM08, WWW11] and evaluating the quality of the data and the accuracy of the sources [Blanco et al Caise2010, Wicow11].

Source Discovery
We developed crawling techniques to discover and collect the page collections that form our input sources [Blanco et al WIDM08, WWW11]:
– input: a few sample pages
The crawler also associates an identifier with the objects described in the collected pages.

Data Quality and Source Accuracy
Redundancy implies inconsistencies and conflicts, since sources can provide different values for the same attribute of a given object (this is modeled by the error function in the abstract generative process).
A concrete example: on April 21st 2009, the open trade for the Sun Microsystems Inc. stock quote published by three distinct finance web sites was 9.17, 9.15 and 9.15.
– Which one is correct? (probability distribution)
– What is the accuracy of the sources?
– … is there any source that is copying values?

Data Quality and Source Accuracy
Probabilistic models to evaluate the accuracy of web data:
– NAIVE (voting)
– ACCU [Yin et al, TKDE08; Wu & Marian, WebDB07; Galland et al, WSDM10] (voting + source accuracy)
– DEP [Dong et al, PVLDB09] (voting + source accuracy + copiers)
– M-DEP [Blanco et al, Caise10; Dong et al, PVLDB10] (voting + source accuracy + copiers over more attributes)
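To ground the first two models, here is a hedged Python sketch of naive voting versus accuracy-weighted voting on the open-trade example above; the source names and accuracy values are made up, and the real ACCU and DEP models estimate source accuracy iteratively and also reason about copiers.

from collections import defaultdict

def naive_vote(observations):
    """observations: list of (source, value) pairs for one attribute of one object."""
    counts = defaultdict(float)
    for _source, value in observations:
        counts[value] += 1.0
    return max(counts, key=counts.get)

def accuracy_weighted_vote(observations, accuracy):
    """accuracy: source -> estimated probability that the source reports the true value."""
    scores = defaultdict(float)
    for source, value in observations:
        scores[value] += accuracy.get(source, 0.5)
    return max(scores, key=scores.get)

obs = [("site_A", 9.17), ("site_B", 9.15), ("site_C", 9.15)]
print(naive_vote(obs))                                          # 9.15 (simple majority)
print(accuracy_weighted_vote(obs, {"site_A": 0.9, "site_B": 0.7, "site_C": 0.7}))
# still 9.15 here (0.7 + 0.7 > 0.9); a much more accurate site_A could overturn the vote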

Conclusion
Data do not match by chance:
– unexpected attributes discovered
– tolerant to noise (financial data are challenging)
Other projects are exploiting data redundancy (e.g. Nguyen et al VLDB11, Rastogi et al VLDB10, Gupta-Sarawagi WIDM11).
We plan to also leverage schema knowledge.
The approach applies to domains where instances are replicated across several sites (e.g. it is not suitable for real estate).