Flint: exploiting redundant information to wring out value from Web data Lorenzo Blanco, Mirko Bronzi, Valter Crescenzi, Paolo Merialdo, Paolo Papotti.


1 Flint: exploiting redundant information to wring out value from Web data Lorenzo Blanco, Mirko Bronzi, Valter Crescenzi, Paolo Merialdo, Paolo Papotti Roma Tre University - Rome, Italy

2 Motivations
The opportunity: an increasing number of web sites with structured information
The problem:
– Current technologies are limited in exploiting the data offered by these sources
– Web semantics technologies are too complex and costly
Challenges:
– development of unsupervised, scalable techniques to extract and integrate data from fairly structured large corpora available on the Web [DB Claremont Report 2008]

3 Structured Data at Work: Search Engines

6 Introduction
Notable approaches for massive extraction of Web data concentrate on information organized according to specific patterns that occur on the Web
– WebTables [Cafarella et al VLDB2008] and ListExtract [Elmeleegy et al VLDB2009] focus on data published in HTML tables and lists
– Information extraction systems (e.g. TextRunner [Banko-Etzioni ACL2008]) exploit lexical-syntactic patterns to extract collections of facts (e.g., x is the capital of y)
Even a small fraction of the Web implies an impressive amount of data
– given a Web fragment of 14 billion pages, 1.1% of them are good tables -> 154 million [Cafarella et al VLDB2008]

7 Observation
Many sources publish data about one object of a real-world entity for each page
Collections of pages can be thought of as HTML encodings of a relation
  NASDAQ:AAPL | 257.16 | +3.81 | 249.11 | 259.40 | …
  NASDAQ:GOOG | 485.18 | -5.28 | 483.00 | 493.45 | …
  NASDAQ:MSFT | 25.80 | -0.20 | 25.66 | 26.12 | …

8 Observation
Learned while looking for pages to evaluate RoadRunner
I was frustrated … RoadRunner was built to infer arbitrary nested structures (list of list of list …) but pages were much simpler
And pages with complex structures were usually designed to support navigation to detail pages

9 Information redundancy on the Web
For many disparate entities (e.g. stock quotes, people, products, movies, books, etc.) many web sites follow this publishing strategy
These sites can be considered as sources that provide redundant information. The redundancy occurs:
– at the schema level: the same attributes are published by more than one source (e.g. volume, min/max/avg price, market cap for stock quotes)
– at the extensional level: several objects are published by more than one source (e.g. many web sites publish data about the same stock quotes)

10 Abstract Generative Process
Sources: S_G, S_Y, S_R
"Hidden Relation" $R_0$:
  ticker | price | day-max | day-min | volume | beta | …
  nasdaq:aapl | 256.88 | 259.40 | 253.35 | 29,129,032 | 1.50 | …
  nasdaq:goog | 485.63 | 493.45 | 483.00 | 2,894,755 | 1.10 | …
  nyse:cat | 60.76 | 62.42 | 60.05 | 7,709,405 | 1.78 | …
  … | … | … | … | … | … | …
$S_G = \lambda_G(e_G(\sigma_G(\pi_G(R_0))))$
$S_Y = \lambda_Y(e_Y(\sigma_Y(\pi_Y(R_0))))$
$S_R = \lambda_R(e_R(\sigma_R(\pi_R(R_0))))$

11 Abstract Generative Process
Each source generated by:
– π projection
– σ selection
– e error (e.g. approx, mistakes, formattings)
– λ template encoding
"Hidden Relation" $R_0$:
  ticker | price | day-max | day-min | volume | beta | …
  nasdaq:aapl | 256.88 | 259.40 | 253.35 | 29,129,032 | 1.50 | …
  nasdaq:goog | 485.63 | 493.45 | 483.00 | 2,894,755 | 1.10 | …
  nyse:cat | 60.76 | 62.42 | 60.05 | 7,709,405 | 1.78 | …
  … | … | … | … | … | … | …
$\sigma_{\text{ticker like "nasdaq:\%"}}(\pi_{\text{ticker, price, volume}}(R_0))$:
  ticker | price | volume
  nasdaq:aapl | 256.88 | 29,129,032
  nasdaq:goog | 485.63 | 2,894,755
  … | … | …

12 Abstract Generative Process
Each source generated by:
– π projection
– σ selection
– e error (e.g. approx, mistakes, formattings)
– λ template encoding
"Hidden Relation" $R_0$:
  ticker | price | day-max | day-min | volume | beta | …
  nasdaq:aapl | 256.88 | 259.40 | 253.35 | 29,129,032 | 1.50 | …
  nasdaq:goog | 485.63 | 493.45 | 483.00 | 2,894,755 | 1.10 | …
  nyse:cat | 60.76 | 62.42 | 60.05 | 7,709,405 | 1.78 | …
  … | … | … | … | … | … | …
$e_Y(\sigma_{\text{ticker like "nasdaq:\%"}}(\pi_{\text{ticker, price, volume}}(R_0)))$, with $e_Y$: round(volume, 1000), price($N_\sigma$):
  ticker | price | volume
  nasdaq:aapl | 256.88 | 29,129
  nasdaq:goog | 485.45 | 2,894
  … | … | …

13 Abstract Generative Process
Each source generated by:
– π projection
– σ selection
– e error (e.g. approx, mistakes, formattings)
– λ template encoding
"Hidden Relation" $R_0$:
  ticker | price | day-max | day-min | volume | beta | …
  nasdaq:aapl | 256.88 | 259.40 | 253.35 | 29,129,032 | 1.50 | …
  nasdaq:goog | 485.63 | 493.45 | 483.00 | 2,894,755 | 1.10 | …
  nyse:cat | 60.76 | 62.42 | 60.05 | 7,709,405 | 1.78 | …
  … | … | … | … | … | … | …
$S_Y = \lambda_Y(e_Y(\sigma_{\text{ticker like "nasdaq:\%"}}(\pi_{\text{ticker, price, volume}}(R_0))))$
$e_Y$: round(volume, 1000), price($N_\sigma$)
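The four steps of the generative process can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`project`, `select`, `add_error`, `encode`, `generate_source`) and the HTML template are invented here, the error step reads `round(volume, 1000)` as "report the volume in thousands" (which matches the 29,129 and 2,894 shown on the slides), and `price(N σ)` is read as additive Gaussian noise.

```python
import random

# Hidden relation R0: the three sample tuples shown on the slides.
R0 = [
    {"ticker": "nasdaq:aapl", "price": 256.88, "day-max": 259.40,
     "day-min": 253.35, "volume": 29_129_032, "beta": 1.50},
    {"ticker": "nasdaq:goog", "price": 485.63, "day-max": 493.45,
     "day-min": 483.00, "volume": 2_894_755, "beta": 1.10},
    {"ticker": "nyse:cat", "price": 60.76, "day-max": 62.42,
     "day-min": 60.05, "volume": 7_709_405, "beta": 1.78},
]


def project(rows, attrs):
    """pi: keep only the listed attributes."""
    return [{a: r[a] for a in attrs} for r in rows]


def select(rows, predicate):
    """sigma: keep only the tuples satisfying the predicate."""
    return [r for r in rows if predicate(r)]


def add_error(rows, sigma, rng):
    """e: assumed error model (volume in thousands, Gaussian noise on price)."""
    return [
        {**r,
         "volume": r["volume"] // 1000,
         "price": round(r["price"] + rng.gauss(0, sigma), 2)}
        for r in rows
    ]


def encode(rows):
    """lambda: one HTML page per tuple, using a made-up template."""
    template = ("<html><body><h1>{ticker}</h1>"
                "<span class='price'>{price}</span>"
                "<span class='vol'>{volume:,}</span></body></html>")
    return [template.format(**r) for r in rows]


def generate_source(r0, sigma=0.1, seed=0):
    """S_Y = lambda(e(sigma(pi(R0)))) for the NASDAQ-only source."""
    rng = random.Random(seed)
    view = project(r0, ["ticker", "price", "volume"])
    view = select(view, lambda r: r["ticker"].startswith("nasdaq:"))
    view = add_error(view, sigma, rng)
    return encode(view)


pages = generate_source(R0, sigma=0.0)  # sigma=0.0 switches the price noise off
```

Inverting the process means recovering the projected, selected tuples of `R0` from `pages` alone, without knowing the template or the error model.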

14 Problem: Invert the Process
"Hidden Relation":
  ticker | price | day-max | day-min | volume | beta | …
  nasdaq:aapl | 256.88 | 259.40 | 253.35 | 29,129,032 | 1.50 | …
  nasdaq:goog | 485.63 | 493.45 | 483.00 | 2,894,755 | 1.10 | …
  nyse:cat | 60.76 | 62.42 | 60.05 | 7,709,405 | 1.78 | …
  … | … | … | … | … | … | …

15 The Flint approach
Exploit the redundancy of information
– to discover the sources [Blanco et al WIDM08, WWW11]
– to generate the wrappers, to match data from different sources, to infer labels for the extracted data [Blanco et al EDBT08, WebDB10, VLDS11]
– to evaluate the quality of the data and the accuracy of the sources [Blanco et al Caise2010, Wicow11]

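17 The Flint approach (intuition)
Sources: S_G, S_Y, S_R. The extracted relations carry anonymous attributes:
  A1 | A2 | A3
  nasdaq:aapl | 256.88 | 29,129
  nasdaq:msft | 485.45 | 2,894
  … | … | …
  A1 | A2 | A3
  nasdaq:aapl | 256.88 | 1.55
  nasdaq:goog | 485.45 | 1.4
  … | … | …
  A1 | A2 | A3 | A4
  nasdaq:aapl | 256.88 | 29,129 | 1.55
  nasdaq:goog | 485.45 | 2,894 | 1.4
  … | … | … | …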

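18 The Flint approach (intuition)
Sources: S_G, S_Y, S_R. The same relations with labeled attributes:
  ticker | price | volume
  nasdaq:aapl | 256.88 | 29,129
  nasdaq:msft | 485.45 | 2,894
  … | … | …
  ticker | price | beta
  nasdaq:aapl | 256.88 | 1.55
  nasdaq:goog | 485.45 | 1.4
  … | … | …
  ticker | price | volume | beta
  nasdaq:aapl | 256.88 | 29,129 | 1.55
  nasdaq:goog | 485.45 | 2,894 | 1.4
  … | … | … | …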

19 Integration and Extraction
1. integration problem
2. extraction problem
3. how they can be tackled contextually
We start by considering the web sources as relational views over the hidden relation

20 Integration Problem Given a set of sources S1, S2, … Sk, each Si publishes a view of the hidden relation Problem: create a set of mappings, where each mapping is a set of attributes with the same semantics

21 Integration Problem
Sources: S_G, S_Y, S_R
  A1 | A2 | A3
  nasdaq:aapl | 256.88 | 29,129
  nasdaq:msft | 485.45 | 2,894
  … | … | …
  A1 | A2 | A3
  nasdaq:aapl | 256.88 | 1.55
  nasdaq:goog | 485.45 | 1.4
  … | … | …
  A1 | A2 | A3 | A4
  nasdaq:aapl | 256.88 | 29,129 | 1.55
  nasdaq:goog | 485.45 | 2,894 | 1.4
  … | … | … | …

22 Integration Algorithm
Intuition: we match attributes from different sources to build aggregations of attributes with the same semantics
Assumption: alignment (record linkage) over a bunch of tuples
To identify attributes with the same semantics, we rely on instance-based matching
– noise implies possible discrepancies in the values!
Example (sources S_A and S_B):
  tuple | A1 | B1
  #1 | 24.01 | 24.03
  #2 | 12.46 | 12.47
  d(a1, b1) = 0.08
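A minimal sketch of an instance-based distance between two attributes follows. The slides do not define d, so the formula below (mean relative discrepancy over the aligned tuples) is an assumption chosen for illustration; it does not reproduce the value 0.08 reported on the slide. The function name `attribute_distance` is hypothetical.

```python
def attribute_distance(col_a, col_b):
    """Mean relative discrepancy over the tuples both attributes cover.

    col_a, col_b: {tuple_id: numeric value}. Tuple ids are assumed to be
    already aligned by record linkage. Returns a value in [0, 1].
    """
    shared = col_a.keys() & col_b.keys()
    if not shared:
        return 1.0  # nothing to compare: treat as maximally distant
    total = 0.0
    for key in shared:
        x, y = col_a[key], col_b[key]
        scale = max(abs(x), abs(y))
        total += 0.0 if scale == 0 else min(1.0, abs(x - y) / scale)
    return total / len(shared)


# The two attributes from the slide example.
a1 = {"#1": 24.01, "#2": 12.46}
b1 = {"#1": 24.03, "#2": 12.47}
d = attribute_distance(a1, b1)
```

Small discrepancies caused by noise give a small but non-zero distance, which is what lets the algorithm rank candidate matches instead of requiring exact equality.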

23 Integration Algorithm
Every attribute is a node
  S_A: a1 = (nasdaq:aapl, nasdaq:goog, …); a2 = (256.88, 485.45, …); a3 = (29,129, 2,894, …); a4 = (1.55, 1.4)
  S_B: b1 = (nasdaq:aapl, nasdaq:goog, …); b2 = (256.88, 485.45, …); b3 = (1.55, 1.4, …)
  S_C: c1 = (nasdaq:aapl, nasdaq:msft, …); c2 = (256.88, 485.45, …); c3 = (29,129, 2,894, …)

24 Integration Algorithm (same attribute graph)
Every attribute is matched against all other attributes

25 Integration Algorithm (same attribute graph)
Edges are ranked w.r.t. the distance (due to the discrepancies). We start with the best match

26 Integration Algorithm (same attribute graph)
We drop useless edges

27 Integration Algorithm (same attribute graph)
We take the next edge in the rank and drop useless edges

28 Integration Algorithm (figure only: same attribute graph, no caption)

29 Integration Algorithm (figure only: same attribute graph, no caption)

30 Integration Algorithm
Clustering algorithm to solve the problem
AbstractIntegration is $O(n^2)$ over the total number of attributes in the sources
But we are dealing with clean relational views... are these the relations we get from wrappers?
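The walk-through on the preceding slides (one node per attribute, ranked edges, best match first, useless edges dropped) can be sketched as a greedy clustering. This is an illustration, not the authors' AbstractIntegration: the slides do not say what makes an edge "useless", so the sketch assumes an edge is dropped when it would place two attributes of the same source in one mapping. The `distance` function, the `threshold` parameter and the use of tickers as tuple ids are also assumptions.

```python
from itertools import combinations


def distance(col_a, col_b):
    """Assumed instance-based distance over tuples shared by both attributes."""
    shared = col_a.keys() & col_b.keys()
    if not shared:
        return 1.0
    total = 0.0
    for key in shared:
        x, y = col_a[key], col_b[key]
        if isinstance(x, str) or isinstance(y, str):
            total += 0.0 if x == y else 1.0
        else:
            scale = max(abs(x), abs(y))
            total += 0.0 if scale == 0 else min(1.0, abs(x - y) / scale)
    return total / len(shared)


def integrate(sources, threshold=0.1):
    """sources: {source: {attribute: {tuple_id: value}}} -> set of mappings."""
    nodes = [(s, a) for s, attrs in sources.items() for a in attrs]
    cluster_of = {n: frozenset([n]) for n in nodes}
    # Every attribute is matched against attributes of the other sources.
    edges = []
    for u, v in combinations(nodes, 2):
        if u[0] != v[0]:
            d = distance(sources[u[0]][u[1]], sources[v[0]][v[1]])
            if d <= threshold:
                edges.append((d, u, v))
    edges.sort(key=lambda e: e[0])  # best match first
    for _, u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        if cu == cv:
            continue
        # Assumed "useless edge": the merge would mix two attributes of one source.
        if {s for s, _ in cu} & {s for s, _ in cv}:
            continue
        merged = cu | cv
        for n in merged:
            cluster_of[n] = merged
    return set(cluster_of.values())


tickers = {"aapl": "nasdaq:aapl", "goog": "nasdaq:goog"}
prices = {"aapl": 256.88, "goog": 485.45}
volumes = {"aapl": 29129, "goog": 2894}
betas = {"aapl": 1.55, "goog": 1.4}
sources = {
    "S_A": {"a1": tickers, "a2": prices, "a3": volumes, "a4": betas},
    "S_B": {"b1": tickers, "b2": prices, "b3": betas},
    "S_C": {"c1": {"aapl": "nasdaq:aapl", "msft": "nasdaq:msft"},
            "c2": {"aapl": 256.88, "msft": 485.45},
            "c3": {"aapl": 29129, "msft": 2894}},
}
mappings = integrate(sources)
```

Building the edge list compares every pair of attributes once, which is where the quadratic cost in the number of attributes comes from.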

31 Extraction Problem
A source S_i is a collection of pages S_i = {p1, p2, …, pn}
– each page publishes data about one object of a real-world entity
Two different types of values can appear in a page:
– target values: data from the hidden relation
– noise values: irrelevant data (e.g., advertising, template, layout, etc.)

32 Extraction Problem
– A wrapper w_i is a set of extraction rules w_i = {er_A1, …, er_An}
  page | A1 (er 1) | A2 (er 2) | A3 (er 3) | A4 (er 4)
  page 1 | GOOG | 24.5 | Coke | 2.1%
  page 2 | AAPL | 9.2 | Pepsi | 42ML

33 Extraction Problem
– A wrapper w_i is a set of extraction rules w_i = {er_A1, …, er_An}
Limits of unsupervised wrapper inference:
– Extraction of noise data (e.g. er 3)
– Some extraction rules may be imprecise (e.g. er 4)
  page | A1 (er 1) | A2 (er 2) | A3 (er 3) | A4 (er 4)
  page 1 | GOOG | 24.5 | Coke | 2.1%
  page 2 | AAPL | 9.2 | Pepsi | 42ML

34 Extraction Problem
An extraction rule is:
– correct if for every page it extracts a target value of the same conceptual attribute
– weak if it mixes either target values with different semantics or target values with noise values
  A1 | A2
  GOOG | 24.5
  AAPL | 9.2
  A3 | A4
  Coke | 2.1%
  Pepsi | 42ML

35 Extraction Problem
Problem: given a set of sources S = {S1, S2, …, Sn}, produce a set of wrappers W* = {w1, w2, …, wn}, such that wi contains all and only the correct rules for Si
We leverage the redundant information among different sources to identify and filter out the weak rules
– In a redundant environment, extracted data do not match by chance!

36 Overlapping Rules
– To increase the probability of getting the correct rules, we need a wrapper with more extraction rules
(figure: HTML tags of a page (body, div, div, ...) containing the values 2.1% and 33ML; pages P1, P2, P3, …)
  er 1 = 2.1%, 42ML, 3.0%, ...
  er 2 = 2.1%, 1.3%, 3.0%, ...
  er 3 = 33ML, 42ML, 1ML, ...
  er 4 = 5, 5, 6, ...

37 Overlapping Rules
– Two extraction rules from the same wrapper overlap if they extract the same occurrence of the same string from one page
(figure: HTML tags of a page (body, div, div, ...) containing the values 2.1% and 33ML; pages P1, P2, P3, …)
  er 1 = 2.1%, 42ML, 3.0%, ...
  er 2 = 2.1%, 1.3%, 3.0%, ...

38 Overlapping Rules
– Given a set of overlapping rules, one is correct, the others are weak
– Idea: match all of them against rules from other sources: (i) the correct rule is the one with the best matching score, (ii) drop the others
Rules (sources S1 and S2):
  r5 = (2.1%, 42ML)
  r6 = (2.1%, 1.3%)
  r8 = (2.1%, 1.3%)
  r9 = (Index, 1.3%)

39 Overlapping Rules
– Given a set of overlapping rules, one is correct, the others are weak
– Idea: match all of them against rules from other sources: (i) the correct rule is the one with the best matching score, (ii) drop the others
Rules (sources S1 and S2):
  r5 = (2.1%, 42ML)
  r6 = (2.1%, 1.3%)
  r8 = (2.1%, 1.3%)
  r9 = (Index, 1.3%)
Matching scores shown on the edges: 0.5, 1, 0
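The selection among overlapping rules can be sketched as follows. Several things are assumed for illustration: that r5 and r6 are the overlapping rules of S1 while r8 and r9 belong to S2, that pages are already aligned by position, and that the matching score is the fraction of aligned pages with identical values. The slides do not define the score, and the names `match_score` and `pick_correct_rule` are hypothetical.

```python
def match_score(rule_a, rule_b):
    """Fraction of aligned pages on which the two rules extract the same value."""
    pairs = list(zip(rule_a, rule_b))
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def pick_correct_rule(overlapping, other_rules):
    """Keep the overlapping rule that best matches some rule of another source.

    overlapping: {rule_name: extracted values} from one wrapper.
    other_rules: {rule_name: extracted values} from the other sources.
    Returns (winner, best score per overlapping rule).
    """
    best = {
        name: max((match_score(values, other) for other in other_rules.values()),
                  default=0.0)
        for name, values in overlapping.items()
    }
    winner = max(best, key=best.get)
    return winner, best


# Values from the slide; the assignment of rules to sources is an assumption.
s1_overlapping = {"r5": ["2.1%", "42ML"], "r6": ["2.1%", "1.3%"]}
s2_rules = {"r8": ["2.1%", "1.3%"], "r9": ["Index", "1.3%"]}
winner, scores = pick_correct_rule(s1_overlapping, s2_rules)
```

Under these assumptions r6 is kept and r5 is dropped as weak, because r5 agrees with r8 on only one of the two pages.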

40 Integration Algorithm
(figure: the attribute graph with nodes a1–a4, b1–b3, c1–c3)
– It is correct
– It is $O(n^2)$ over the total number of attributes in the sources

41 Extraction and Integration Alg.
(figure: the attribute graph with nodes a1–a4, b1–b3, c1–c3, each shown in several copies)
Lemma: AbstractExtraction is correct
AbstractExtraction is $O(n^2)$ over the total number of extraction rules

42 Extraction and Integration Alg.
Greedy best-effort algorithm for integration and extraction [Blanco et al. WebDb2010, WWW2011]
Promising experimental results

43 Some Results
R = number of correct extraction rules over the number of sources containing the actual attribute

44 Adding Labels
Last step: assign a label to each mapping
Candidate labels: the textual template nodes that occur closest to the extracted values
– poor performance on a single source
– but effective on a large number of sources because it exploits the redundancy of labels (observed also in [Cafarella et al SIGMOD Record2008])
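One simple way to exploit label redundancy is a majority vote over the candidate labels collected from every source in a mapping. The sketch below is an assumption about how such a vote could look, not the method of the cited papers; the source names, candidate labels and the normalization step are all invented for illustration.

```python
from collections import Counter


def normalize(label):
    """Lower-case a template string and strip a trailing colon and whitespace."""
    return label.strip().lower().rstrip(":").strip()


def infer_label(candidates_by_source):
    """Pick the label proposed by the largest number of sources.

    candidates_by_source: {source: [template strings found near the values]}.
    Each source votes at most once per distinct normalized label.
    """
    votes = Counter()
    for labels in candidates_by_source.values():
        votes.update({normalize(label) for label in labels})
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical candidates for one mapping (e.g. the "volume" attribute).
candidates = {
    "S_A": ["Volume:"],
    "S_B": ["Vol"],
    "S_C": ["volume"],
    "S_D": ["Volume", "Last trade"],
}
label = infer_label(candidates)
```

A single source may sit next to a misleading template string, but the same wrong string is unlikely to recur across many independent sources.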

45 The Flint approach
Exploit the redundancy of information
– to discover the sources [Blanco et al WIDM08, WWW11]
– to generate the wrappers, to match data from different sources, to infer labels for the extracted data [Blanco et al EDBT08, WebDB10, VLDS11]
– to evaluate the quality of the data and the accuracy of the sources [Blanco et al Caise2010, Wicow11]

46 Source Discovery
We developed crawling techniques to discover and collect the page collections that form our input sources [Blanco et al WIDM08, WWW11]
– Input: a few sample pages
The crawler also associates an identifier with the objects described in the collected pages

47 Data Quality and Source Accuracy
Redundancy implies inconsistencies and conflicts, since sources can provide different values for the same attribute of a given object
– This is modeled by the error function in the abstract generative process
A concrete example:
– on April 21st 2009, the open trade for the Sun Microsystems Inc. stock quote, published by three distinct finance web sites, was 9.17, 9.15 and 9.15
– Which one is correct? (probability distribution)
– What is the accuracy of the sources?
– … is there any source that's copying values?

48 Data Quality and Source Accuracy
Probabilistic models to evaluate the accuracy of web data
– NAIVE (voting)
– ACCU [Yin et al, TKDE08; Wu&Marian, WebDb07; Galland et al, WSDM10] (voting + source accuracy)
– DEP [Dong et al, PVLDB09] (voting + source accuracy + copiers)
– M-DEP [Blanco et al, Caise10; Dong et al, PVLDB10] (voting + source accuracy + copiers over more attributes)
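The first two rungs of this ladder can be illustrated with a short sketch: plain voting, and voting weighted by an iteratively re-estimated source accuracy. The second function is a deliberately simplified stand-in for the "voting + source accuracy" idea, not the ACCU model of the cited papers, and copier detection (DEP, M-DEP) is not covered. The site names, the `smoothing` parameter and the number of rounds are assumptions.

```python
from collections import defaultdict


def vote(claims, weights=None):
    """claims: {source: value} -> {value: probability}; unweighted = NAIVE."""
    scores = defaultdict(float)
    for source, value in claims.items():
        scores[value] += 1.0 if weights is None else weights[source]
    total = sum(scores.values())
    return {value: score / total for value, score in scores.items()}


def estimate_accuracy(observations, rounds=5, smoothing=0.1):
    """Alternate between picking truths and re-scoring sources.

    observations: {object_id: {source: value}}.
    Returns (truth per object, accuracy per source).
    """
    sources = {s for claims in observations.values() for s in claims}
    accuracy = {s: 1.0 for s in sources}
    truth = {}
    for _ in range(rounds):
        for obj, claims in observations.items():
            dist = vote(claims, accuracy)
            truth[obj] = max(dist, key=dist.get)
        for s in sources:
            seen = [obj for obj, claims in observations.items() if s in claims]
            hits = sum(observations[obj][s] == truth[obj] for obj in seen)
            # Smoothing keeps every accuracy strictly between 0 and 1.
            accuracy[s] = (hits + smoothing) / (len(seen) + 2 * smoothing)
    return truth, accuracy


# Opening price example from the previous slide; site names are hypothetical.
claims = {"site1": 9.17, "site2": 9.15, "site3": 9.15}
distribution = vote(claims)
truth, accuracy = estimate_accuracy({"sun-open": claims})
```

With a single object the weighted variant only confirms the majority; the accuracy estimates become informative once many objects and attributes are observed per source.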

49 Conclusion
Data do not match by chance
– Unexpected attributes discovered
– Tolerant to noise (financial data challenging)
Other projects are exploiting data redundancy (e.g. Nguyen et al VLDB11, Rastogi et al VLDB10, Gupta-Sarawagi WIDM11)
Plans to also leverage schema knowledge
– The approach applies to domains where instances are replicated over several sites (e.g. not suitable for real estate)

