Presentation is loading. Please wait.

Presentation is loading. Please wait.

A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data Eric Chu, Akanksha Baid, Ting Chen, AnHai Doan, Jeffrey Naughton.

Similar presentations


Presentation on theme: "A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data Eric Chu, Akanksha Baid, Ting Chen, AnHai Doan, Jeffrey Naughton."— Presentation transcript:

1 A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data Eric Chu, Akanksha Baid, Ting Chen, AnHai Doan, Jeffrey Naughton Computer Sciences Department, University of Wisconsin-Madison 33rd International Conference on Very Large Data Bases, September 23-27 2007, University of Vienna, Austria 2008. 01. 18. Summarized by Chulki Lee, IDS Lab., Seoul National University Presented by Chulki Lee, IDS Lab., Seoul National University

2 I want to know how cold Madison gets during winter 2

3 After searching with keyword and browsing… I found a wiki page, “Madison, Wisconsin”

4 There is a temperate table in the page… but how can I query over this?

5 In fact… it’s the structure implicit in unstructured documents !

6 Copyright  2007 by CEBT Goal: Exploit the Structure!  Improve keyword search and browsing  Structured queries “Which universities are in places that are very cold in winter?” Queries that keyword search and browsing are not good at answering 6

7 Copyright  2007 by CEBT Challenges  Manage this evolving structure incrementally  How to enable users to query using as much structure as is currently known? 7

8 Copyright  2007 by CEBT Proposal A Relational Workbench  A way to store an expanding set of documents and attributes  Tools to incrementally process the data  A way to exploit structure in queries 8

9 Copyright  2007 by CEBT Advantages  Data always available for querying  Supports incremental data processing  Can pose increasingly sophisticated queries over time  Exploit strength of a RDBMS 9

10 Copyright  2007 by CEBT Relational Workbench  Data Storage  Data Processing  Case Study: Wikipedia 10

11 Copyright  2007 by CEBT Data Storage – Wide Table  Problems: keeps evolving, heterogeneous… No good schema!  So.. use single table! One document per row New attribute? create new column in the table! – No overhead for storing nulls with interpreted storage or column- oriented database 11 DocTitleDocContentofficial flower Madison, Wisconsin “Madison. is the captial of …” null Seattle, Washington “Seattle is the largest city in …” Dahlia

12 Copyright  2007 by CEBT Data Storage – Wide Table (more)  Allow attributes to have internal structure 12 DocTitleDocContentofficial flowerheadquarter(city, company) Madison, Wisconsin “Madison. is the captial of …” null[(Madison, Raven Software), (Madison, Human Head Studios), …] Seattle, Washington “Seattle is the largest city in …” Dahlia[(Seattle, Starbucks), (Seattle, Amazon.com), …]

13 Copyright  2007 by CEBT Data Storage – Mapping Table  Store mappings for different attributes that correspond to the same real-world concept  Query evaluation: Look up mapping table Rewrite query to include matching attributes 13 host idhost namemappings a6 temp ( ℉ ) { a6 = a7 * 9 /5 + 32 } a7 temperature ( ℃ ) { a7 = 5 / 9 * (a6 – 32) }

14 Copyright  2007 by CEBT Data Processing  Workbench does not decide how to process data  Provides three basic operators: Extract – “address” => “city”, “state”, “zip code” Integrate – “address” = “sent-to” Cluster  DBAs decide what operators to use and set the parameters 14

15 Copyright  2007 by CEBT Data Processing (more) Operator Interaction Example Improving clustering via iteration 1.Extract section names from Wikipedia pages the page without a section name cannot be clustered properly! 2.Cluster pages based on section names 3.Extract and cluster pages based on the other attributes 15

16 Copyright  2007 by CEBT Case Study: Wikipedia (1/2) Dataset: American cities, universities, male tennis players  Stage 1: Initial Loading 5 columns: PageId, PageText, RevisionId, ContributorName, LastModificationDate Can do keyword search over PageText immediately!  Stage 2: Extracting Sections 1,253 new columns - Explosion of new columns! Integrate found many aliases – 350 of all attributes belonged to 1 of 14 attribute groups 16

17 Copyright  2007 by CEBT Case Study: Wikipedia (2/2)  Stage 3: Attribute Clustering Grouped together attributes that were either both null or both non- null in a row Views on clusters – ~25 ms for each cluster – 44s sec for wide table  Stage 4: Extracting Wiki Tables we can extract temperature_wiki! 17 CityMonthLow_F… Madison, Wisconsin 16 Seattle, Washington 136

18 Now we can query! SELECT AVG(Low_F) FROM temperature_wiki as T WHERE T.city = ‘Madison Wisconsin’ AND Month = 1 OR Month = 2 OR Month = 12; 18

19 Copyright  2007 by CEBT Conclusion  Relational workbench to incrementally extract and query structure from unstructured data Wide table Mapping table Operators Many problems ahead! 19


Download ppt "A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data Eric Chu, Akanksha Baid, Ting Chen, AnHai Doan, Jeffrey Naughton."

Similar presentations


Ads by Google