Data Extraction and Integration from Imprecise Web Sources Lorenzo Blanco, Mirko Bronzi, Valter Crescenzi, Paolo Merialdo, Paolo Papotti Università degli.


1 Data Extraction and Integration from Imprecise Web Sources Lorenzo Blanco, Mirko Bronzi, Valter Crescenzi, Paolo Merialdo, Paolo Papotti Università degli Studi Roma Tre (Creative Commons License, see last slide)

2 Data-intensive websites

3 Website Data-intensive websites Database Template1 Template2 Template3 target

4 Flint goal … StockQuote: Last, Min, Max, Volume, 52high, Open

5 Flint System architecture Web Search [WIDM08] Data Extraction Data Integration The Web

6 Novel contribution
Data Extraction: Unsupervised, Automatic, Scalable, No knowledge available. RoadRunner [Vldb01], ExAlg [Sigmod03], TurboWrapper [Vldb07]
Data Integration: Unsupervised, Automatic, Scalable, Uncertain Data, No labels available, No corpus available. WebTables [Vldb08], Cimple [Vldb07], MetaQuerier [Cidr05], PayGo [Cidr07]

7 Data Extraction

8

9 Extracted value lists:
AAPL, GOOG, MSFT, INTC, …
128.09, 439.54, 34.89, 112.37, …
127.81, 439.25, 32.13, 111.01, …
132.43, 443.82, 33.67, 114.32, …
0.50%, -0.38%, 1.23%, 3.92%, -1.65%, …
Add AAPL to Your Portfolio, Add GOOG to Your Portfolio, Add MSFT to Your Portfolio, Add INTC to Your Portfolio, …
…

10 Data Extraction. HTML fragments taken from two pages belonging to the same website. Each extraction rule is listed with the values it returns on the two pages:
/html/body/table/tr[1]/td[2]: 1,132,228, 1,735,857
/html/body/table/tr[2]/td[2]: $20.66, $414.58
/html/body/table/tr[3]/td[2]: $11.70, $247.30
/html/body/table/tr[4]/td[2]: $20.72, $414.06
/html/body/table/tr[5]/td[2]: $0.02, 99,494,200 (Extraction error!)
/html/body/table/tr[6]/td[2]: 4,732,600, null (?)
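The extraction error on this slide can be reproduced with a small sketch. It assumes two hypothetical pages from the same site where the second page lacks one table row; the row labels ("Field A" and so on) and the helper `extract` are invented for illustration, and only the second-column values come from the slide. Python's standard `xml.etree.ElementTree` stands in for a real HTML parser and supports just enough XPath for positional rules.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: labels are placeholders, values are the slide's first column.
PAGE_A = """<html><body><table>
<tr><td>Field A</td><td>1,132,228</td></tr>
<tr><td>Field B</td><td>$20.66</td></tr>
<tr><td>Field C</td><td>$11.70</td></tr>
<tr><td>Field D</td><td>$20.72</td></tr>
<tr><td>Field E</td><td>$0.02</td></tr>
<tr><td>Field F</td><td>4,732,600</td></tr>
</table></body></html>"""

# Assumption: the second page has no "Field E" row, so "Field F" shifts up.
PAGE_B = """<html><body><table>
<tr><td>Field A</td><td>1,735,857</td></tr>
<tr><td>Field B</td><td>$414.58</td></tr>
<tr><td>Field C</td><td>$247.30</td></tr>
<tr><td>Field D</td><td>$414.06</td></tr>
<tr><td>Field F</td><td>99,494,200</td></tr>
</table></body></html>"""


def extract(page, rule):
    """Apply a positional rule such as /html/body/table/tr[5]/td[2] to a page."""
    root = ET.fromstring(page)  # root is the <html> element
    # ElementTree only accepts relative paths, so drop the leading /html.
    relative = "." + rule[len("/html"):]
    node = root.find(relative)
    return node.text if node is not None else None


RULES = [f"/html/body/table/tr[{i}]/td[2]" for i in range(1, 7)]

if __name__ == "__main__":
    for rule in RULES:
        print(rule, extract(PAGE_A, rule), extract(PAGE_B, rule))
```

The rule for `tr[5]` mixes a price with a volume, and the rule for `tr[6]` returns nothing on the second page, which is the kind of imprecision the integration step has to tolerate.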

11 Data Integration: table with attributes stock (AA, GO, MS), min (4, 25, 10), max (10, 33, 16)

12 Data Integration: table with attributes stock (AA, GO, MS), min (4, 25, 10), max (10, 33, 16); t=0.5

13 Data Integration: two tables, each with attributes stock (AA, GO, MS), min (4, 25, 10), max (10, 33, 16); t=0.5; score 1.0

14 Data Integration: two tables, each with attributes stock (AA, GO, MS), min (4, 25, 10), max (10, 33, 16); t=0.5

15 Data Integration: two tables, each with attributes stock (AA, GO, MS), min (4, 25, 10), max (10, 33, 16), and a third table with attributes stock (AA, GO, MS), min (4, 25, 10), price (6, 26, 12); t=0.5; scores 0.6 and 1.0

16 Data Integration: two tables, each with attributes stock (AA, GO, MS), min (4, 25, 10), max (10, 33, 16), and a third table with attributes stock (AA, GO, MS), min (4, 25, 10), price (6, 26, 12); t=0.5; scores ? and 1.0

17 Data Integration: two tables, each with attributes stock (AA, GO, MS), min (4, 25, 10), max (10, 33, 16), and a third table with attributes stock (AA, GO, MS), min (4, 25, 10), price (6, 26, 12); t=0.5; score 1.0

18 Data Integration: two tables, each with attributes stock (AA, GO, MS), min (4, 25, 10), max (10, 33, 16), and a third table with attributes stock (AA, GO, MS), min (4, 25, 10), price (6, 26, 12); t=0.5 and t=0.7; score 1.0

19 Data Integration: two tables, each with attributes stock (AA, GO, MS), min (4, 25, 10), max (10, 33, 16), and a third table with attributes stock (AA, GO, MS), min (4, 25, 10), price (6, 26, 12); t=0.5 and t=0.7
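The Data Integration slides show attribute columns from different sources being compared and accepted as matches only above a threshold t. The transcript does not say how Flint scores a pair of columns, so the sketch below swaps in a simple assumed measure: the mean relative agreement of values for the same stock. The functions `similarity` and `match` and the dictionaries `SOURCE_1` and `SOURCE_2` are hypothetical names, and the scores this measure produces are not expected to reproduce the ones on the slides.

```python
# Columns from the slides, keyed by stock symbol.
SOURCE_1 = {
    "min": {"AA": 4, "GO": 25, "MS": 10},
    "max": {"AA": 10, "GO": 33, "MS": 16},
}
SOURCE_2 = {
    "min": {"AA": 4, "GO": 25, "MS": 10},
    "price": {"AA": 6, "GO": 26, "MS": 12},
}


def similarity(a, b):
    """Assumed measure: mean of 1 - |x - y| / max(|x|, |y|) over shared stocks.

    Stocks with a missing (None) value in either column are skipped.
    """
    shared = [k for k in a if k in b and a[k] is not None and b[k] is not None]
    if not shared:
        return 0.0
    total = 0.0
    for k in shared:
        x, y = a[k], b[k]
        denom = max(abs(x), abs(y))
        total += 1.0 if denom == 0 else 1.0 - abs(x - y) / denom
    return total / len(shared)


def match(source, target, t):
    """For each source attribute, keep its best target attribute if score >= t."""
    pairs = []
    for s_name, s_col in source.items():
        scored = [(similarity(s_col, t_col), t_name) for t_name, t_col in target.items()]
        if not scored:
            continue
        score, t_name = max(scored)
        if score >= t:
            pairs.append((s_name, t_name, score))
    return pairs


if __name__ == "__main__":
    for t in (0.5, 0.7):  # the two thresholds shown on the slides
        print(t, match(SOURCE_2, SOURCE_1, t))
```

Raising t trades recall for precision: a column such as price, which is numerically close to min and max without being either, is matched at a permissive threshold and dropped at a strict one.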

20 Wrapper Refinement: two tables, each with attributes stock (AA, GO, MS), min (4, 25, 10), max (10, 33, 16); a third table with attributes stock (AA, GO, MS), min (4, 25, 10), price (6, 26, 12); and a table with attribute min/max (10, null, 10) ??; t=0.5 and t=0.7; scores 0.3 (weak) and 0.0

21 Wrapper Refinement: matching value nearby template tokens
//td[contains(text(),'Open')]/../td[2]
//td[contains(text(),'Open')]/../../tr[5]/td[1]
//td[contains(text(),'Open')]/../../tr[5]/td[2]
//td[contains(text(),'High')]/../td[2]
…
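A rule such as //td[contains(text(),'Open')]/../td[2] anchors on a template token (the label) instead of a row position, so it keeps working when rows are added or removed. The sketch below imitates that rule in plain Python, because the standard `xml.etree.ElementTree` module does not implement the XPath `contains()` function. The pages, their values, and the helper `extract_by_label` are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical pages: same template, but the second one has an extra first row.
PAGE_A = """<html><body><table>
<tr><td>Open</td><td>20.10</td></tr>
<tr><td>High</td><td>20.72</td></tr>
</table></body></html>"""

PAGE_B = """<html><body><table>
<tr><td>Volume</td><td>1,735,857</td></tr>
<tr><td>Open</td><td>410.00</td></tr>
<tr><td>High</td><td>414.06</td></tr>
</table></body></html>"""


def extract_by_label(page, label, column=2):
    """Imitate //td[contains(text(),label)]/../td[column].

    Finds the first row with a cell whose text contains `label` and returns
    the text of that row's `column`-th cell (1-based), or None if absent.
    """
    root = ET.fromstring(page)
    for row in root.iter("tr"):
        cells = row.findall("td")
        if any(label in (cell.text or "") for cell in cells):
            return cells[column - 1].text if len(cells) >= column else None
    return None


if __name__ == "__main__":
    for label in ("Open", "High"):
        print(label, extract_by_label(PAGE_A, label), extract_by_label(PAGE_B, label))
```

With a parser that supports full XPath 1.0, the slide's expressions could be evaluated directly rather than imitated.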

22 Wrapper Refinement: two tables, each with attributes stock (AA, GO, MS), min (4, 25, 10), max (10, 33, 16); a third table with attributes stock (AA, GO, MS), min (4, 25, 10), price (6, 26, 12); and a table with attribute min/max (10, null, 10); t=0.5 and t=0.7; score 1.0. Refined rules //td[contains(text(),'Max')]/../td[2] and //td[contains(text(),'Min')]/../td[2], with max (10, 33, 16) and min (4, 25, 10)

23 Wrapper Refinement: two tables, each with attributes stock (AA, GO, MS), min (4, 25, 10), max (10, 33, 16); a third table with attributes stock (AA, GO, MS), min (4, 25, 10), price (6, 26, 12); and a table with attribute min/max (10, null, 10) shown with max (10, 33, 16) and min (4, 25, 10); t=0.5 and t=0.7

24 Experimental Results (100 websites for each domain). Each attribute is listed with its |m| value.
Soccer domain (45,714 pages): Name 90, Birth Date 61, Height 54, Nationality 48, Club 43, Position 43, Weight 34, League 14
Videogame domain (49,262 pages): Title 86, Publisher 59, Developer 45, Genre 28, ESRB rating 40, Release Date 9, Platform 9, # Players 6
Finance domain (57,623 pages): Stock Symbol 84, Price Change 73, % Change 73, Volume 52, Day Low 43, Day High 41, Last Price 29, Open Price 24

25 Demo Found Websites Integrated Data

26 the end! http://flint.dia.uniroma3.it

27 License This work is licensed under the Creative Commons Attribution-ShareAlike License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/1.0/ or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.

28 Flint System architecture Web search Extraction Integration Probability The Web

29 Flint goal … Apple price? [Plot: probability P(v) over candidate values v; labelled values 20.64, 20.49, 20.88; axis ticks from 20.52 to 20.62]

