Data Extraction and Integration from Imprecise Web Sources Lorenzo Blanco, Mirko Bronzi, Valter Crescenzi, Paolo Merialdo, Paolo Papotti Università degli Studi Roma Tre (Creative Commons License, see last slide)
Data-intensive websites
Website Data-intensive websites Database Template1 Template2 Template3 target
Flint goal … StockQuote Last Min Max Volume 52high Open
Flint System architecture Web Search [WIDM08] Data Extraction Data Integration The Web
Novel contribution
Unsupervised, Automatic, Scalable, No knowledge available
Data Extraction: RoadRunner [Vldb01], ExAlg [Sigmod03], TurboWrapper [Vldb07]
Unsupervised, Automatic, Scalable, Uncertain Data, No labels available, No corpus available
Data Integration: WebTables [Vldb08], Cimple [Vldb07], MetaQuerier [Cidr05], PayGo [Cidr07]
Data Extraction
AAPL, GOOG, MSFT, INTC, …
128.09, , 34.89, , …
, , 32.13, , …
132.43, , 33.67, , …
0.50%, -0.38%, 1.23%, 3.92%, -1.65%, …
Add AAPL to Your Portfolio, Add GOOG to Your Portfolio, Add MSFT to Your Portfolio, Add INTC to Your Portfolio, …
…
Data Extraction
HTML fragments taken from two pages belonging to the same website:
1,132,228, 1,735,857    /html/body/table/tr[1]/td[2]
$20.66, $    /html/body/table/tr[2]/td[2]
$11.70, $    /html/body/table/tr[3]/td[2]
$20.72, $    /html/body/table/tr[4]/td[2]
Extraction error! $0.02, 99,494,200    /html/body/table/tr[5]/td[2]
? 4,732,600, null    /html/body/table/tr[6]/td[2]
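The extraction error on this slide can be reproduced with a small sketch. It assumes two tiny, well-formed pages from the same template in which one page lacks a row, so the same positional XPath returns a different attribute or nothing at all. The page contents, labels, and the `extract` helper are hypothetical and are not taken from Flint.

```python
import xml.etree.ElementTree as ET

# Two hypothetical pages generated by the same template.
# PAGE_B is missing the "Open" row, so every later row shifts up by one.
PAGE_A = (
    "<html><body><table>"
    "<tr><td>Volume</td><td>1,132,228</td></tr>"
    "<tr><td>Open</td><td>$20.66</td></tr>"
    "<tr><td>Change</td><td>$0.02</td></tr>"
    "</table></body></html>"
)
PAGE_B = (
    "<html><body><table>"
    "<tr><td>Volume</td><td>1,735,857</td></tr>"
    "<tr><td>Change</td><td>$0.10</td></tr>"
    "</table></body></html>"
)


def extract(page, xpath):
    """Apply an absolute, position-based XPath rule to one page.

    ElementTree only evaluates paths relative to an element, so the
    leading "/html" step is rewritten to "." (the parsed root element).
    """
    root = ET.fromstring(page)
    if not xpath.startswith("/html/"):
        raise ValueError("expected an absolute path rooted at /html")
    node = root.find("./" + xpath[len("/html/"):])
    return node.text if node is not None else None


rule = "/html/body/table/tr[2]/td[2]"
print(extract(PAGE_A, rule))  # the "Open" value on page A
print(extract(PAGE_B, rule))  # the "Change" value on page B: a wrong match
print(extract(PAGE_B, "/html/body/table/tr[3]/td[2]"))  # no such row: None
```

The rule is syntactically valid on both pages, which is why an error of this kind goes unnoticed unless the extracted values are checked against other sources.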
Data Integration (max) (min) AA GO MS (stock)
Data Integration (max) (min) AA GO MS (stock) t=0.5
Data Integration (max) (min) AA GO MS (stock) (max) (min) AA GO MS (stock) t=
Data Integration (max) (min) AA GO MS (stock) (max) (min) AA GO MS (stock) t=0.5
Data Integration (max) (min) AA GO MS (stock) (max) (min) AA GO MS (stock) t= (price) (min) AA GO MS (stock)
Data Integration (max) (min) AA GO MS (stock) (max) (min) AA GO MS (stock) t= (price) (min) AA GO MS (stock) ? 1.0
Data Integration (max) (min) AA GO MS (stock) (max) (min) AA GO MS (stock) t= (price) (min) AA GO MS (stock) 1.0
t=0.7 Data Integration (max) (min) AA GO MS (stock) (max) (min) AA GO MS (stock) t= (price) (min) AA GO MS (stock) 1.0
t=0.7 Data Integration (max) (min) AA GO MS (stock) (max) (min) AA GO MS (stock) t= (price) (min) AA GO MS (stock)
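The matching step in these slides can be approximated as follows. This sketch assumes that attributes from two sources are compared on the instances they share (here the stock symbols AA, GO, MS) and are merged when a similarity score reaches the threshold t. The similarity function, the tolerance, and all numeric values are illustrative assumptions; the slides do not specify how Flint computes the score.

```python
def similarity(a, b, tol=0.01):
    """Fraction of shared instances whose values agree within a relative tolerance.

    Hypothetical measure: `a` and `b` map an instance key (stock symbol)
    to a numeric value, with None standing for a missing value.
    """
    shared = [k for k in a if k in b and a[k] is not None and b[k] is not None]
    if not shared:
        return 0.0
    agree = 0
    for k in shared:
        scale = max(abs(a[k]), abs(b[k]), 1e-9)
        if abs(a[k] - b[k]) <= tol * scale:
            agree += 1
    return agree / len(shared)


def match(source_a, source_b, t):
    """Return (attr_a, attr_b, score) for every attribute pair scoring at least t."""
    pairs = []
    for name_a, col_a in source_a.items():
        for name_b, col_b in source_b.items():
            score = similarity(col_a, col_b)
            if score >= t:
                pairs.append((name_a, name_b, score))
    return pairs


# Illustrative values only; the numbers on the original slides were lost.
source_1 = {
    "max": {"AA": 32.0, "GO": 540.0, "MS": 29.5},
    "min": {"AA": 30.0, "GO": 520.0, "MS": 28.0},
}
source_2 = {
    "max": {"AA": 32.0, "GO": 540.0, "MS": 29.5},
    "min": {"AA": 30.0, "GO": 520.0, "MS": 28.1},
}

print(match(source_1, source_2, t=0.5))
```

Raising t makes the matcher more conservative: pairs with a weak score are left unmatched rather than merged into the same mapping.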
t=0.7 Wrapper Refinement (max) (min) AA GO MS (stock) (max) (min) AA GO MS (stock) t= (price) (min) AA GO MS (stock) 10 null 10 (min/max) ?? 0.3 (weak) 0.0
Wrapper Refinement matching value nearby template tokens
//td[contains(text(),'Open')]/../td[2]
//td[contains(text(),'Open')]/../../tr[5]/td[1]
//td[contains(text(),'Open')]/../../tr[5]/td[2]
//td[contains(text(),'High')]/../td[2]
…
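A rule anchored on a template token, rather than on an absolute position, can be sketched like this. Because the standard-library ElementTree does not support the XPath `contains()` function, the sketch walks the rows by hand to mimic `//td[contains(text(),'Open')]/../td[2]`. The pages and the `value_next_to` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical pages from one template; the "Open" row sits at a
# different position on each page.
PAGE_A = (
    "<html><body><table>"
    "<tr><td>Volume</td><td>1,132,228</td></tr>"
    "<tr><td>Open</td><td>$20.66</td></tr>"
    "</table></body></html>"
)
PAGE_B = (
    "<html><body><table>"
    "<tr><td>Open</td><td>$11.70</td></tr>"
    "<tr><td>Volume</td><td>1,735,857</td></tr>"
    "</table></body></html>"
)


def value_next_to(page, label):
    """Return the second cell of the first row that has a cell containing `label`.

    Mirrors //td[contains(text(), label)]/../td[2]: find the template
    token, step up to its row, then take the row's second cell.
    """
    root = ET.fromstring(page)
    for row in root.iter("tr"):
        cells = row.findall("td")
        if len(cells) >= 2 and any(label in (c.text or "") for c in cells):
            return cells[1].text
    return None


print(value_next_to(PAGE_A, "Open"))
print(value_next_to(PAGE_B, "Open"))
```

Both calls return the opening price even though the row position differs, which is the property that makes label-relative rules more robust than positional ones.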
t=0.7 Wrapper Refinement (max) (min) AA GO MS (stock) (max) (min) AA GO MS (stock) t= (price) (min) AA GO MS (stock) 10 null 10 (min/max) (max) (min)
//td[contains(text(),'Max')]/../td[2]
//td[contains(text(),'Min')]/../td[2]
t=0.7 Wrapper Refinement (max) (min) AA GO MS (stock) (max) (min) AA GO MS (stock) t= (price) (min) AA GO MS (stock) 10 null 10 (min/max) (max) (min)
Experimental Results (100 websites for each domain)

Soccer domain (45,714 pages)
Attribute      |m|
Name           90
Birth Date     61
Height         54
Nationality    48
Club           43
Position       43
Weight         34
League         14

Videogame domain (49,262 pages)
Attribute      |m|
Title          86
Publisher      59
Developer      45
Genre          28
ESRB rating    40
Release Date   9
Platform       9
# Players      6

Finance domain (57,623 pages)
Attribute      |m|
Stock Symbol   84
Price Change   73
% Change       73
Volume         52
Day Low        43
Day High       41
Last Price     29
Open Price     24
Demo Found Websites Integrated Data
the end!
License This work is licensed under the Creative Commons Attribution-ShareAlike License. To view a copy of this license, visit sa/1.0/ or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.
Flint System architecture Web search Extraction Integration Probability The Web
Flint goal … v P(v) Apple price?
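One simple way to turn conflicting observations into a distribution P(v) is to normalize vote counts across sources. This is only an illustrative baseline that treats every source as equally reliable; it is not presented as the probabilistic model used by Flint, and the function name and sample values are assumptions.

```python
from collections import Counter


def value_distribution(observations):
    """Estimate P(v) as the share of sources reporting each value v.

    Missing values (None) are ignored. All sources count equally,
    which is a simplifying assumption.
    """
    counts = Counter(v for v in observations if v is not None)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: count / total for value, count in counts.items()}


# Hypothetical answers to "Apple price?" collected from four sources.
reported = [128.09, 128.09, 128.10, None]
for value, prob in value_distribution(reported).items():
    print(value, round(prob, 3))
```

A model that also estimates the accuracy of each source would weight the votes instead of counting them equally.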