A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data Eric Chu, Akanksha Baid, Ting Chen, AnHai Doan, Jeffrey Naughton.

Computer Sciences Department, University of Wisconsin-Madison
33rd International Conference on Very Large Data Bases, September 23-27, 2007, University of Vienna, Austria
2008. 01. 18. Summarized and presented by Chulki Lee, IDS Lab., Seoul National University

I want to know how cold Madison gets during winter.

After a keyword search and some browsing… I found a wiki page, “Madison, Wisconsin”.

There is a temperature table in the page… but how can I query over this?

In fact… it’s the structure implicit in unstructured documents!

Copyright 2007 by CEBT

Goal: Exploit the Structure!
- Improve keyword search and browsing
- Structured queries
  - “Which universities are in places that are very cold in winter?”
  - Queries that keyword search and browsing are not good at answering

Challenges
- Manage this evolving structure incrementally
- How to enable users to query using as much structure as is currently known?

Proposal: A Relational Workbench
- A way to store an expanding set of documents and attributes
- Tools to incrementally process the data
- A way to exploit structure in queries

Advantages
- Data always available for querying
- Supports incremental data processing
- Can pose increasingly sophisticated queries over time
- Exploits the strengths of an RDBMS

Relational Workbench
- Data Storage
- Data Processing
- Case Study: Wikipedia

Data Storage – Wide Table
- Problems: the data keeps evolving and is heterogeneous… there is no good schema!
- So… use a single table!
  - One document per row
  - New attribute? Create a new column in the table!
    - No overhead for storing nulls with interpreted storage or a column-oriented database

| DocTitle | DocContent | official flower |
| Madison, Wisconsin | “Madison is the capital of …” | null |
| Seattle, Washington | “Seattle is the largest city in …” | Dahlia |

Data Storage – Wide Table (more)
- Allow attributes to have internal structure

| DocTitle | DocContent | official flower | headquarter(city, company) |
| Madison, Wisconsin | “Madison is the capital of …” | null | [(Madison, Raven Software), (Madison, Human Head Studios), …] |
| Seattle, Washington | “Seattle is the largest city in …” | Dahlia | [(Seattle, Starbucks), (Seattle, Amazon.com), …] |
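The wide-table idea can be sketched with an ordinary relational engine. The snippet below is only an illustration, assuming SQLite as a stand-in (it is neither the interpreted storage nor the column-oriented store the slide refers to); the table name `wide_table` and the helper `add_attribute` are hypothetical.

```python
import sqlite3


def add_attribute(conn, column):
    """Add a newly discovered attribute as a nullable TEXT column, if missing.

    Assumption: `column` comes from a trusted extractor and contains no
    double quotes, since it is interpolated as a quoted identifier.
    """
    existing = {row[1] for row in conn.execute("PRAGMA table_info(wide_table)")}
    if column not in existing:
        conn.execute(f'ALTER TABLE wide_table ADD COLUMN "{column}" TEXT')


conn = sqlite3.connect(":memory:")
# One document per row; the schema starts with just the raw document.
conn.execute("CREATE TABLE wide_table (DocTitle TEXT PRIMARY KEY, DocContent TEXT)")
conn.executemany(
    "INSERT INTO wide_table VALUES (?, ?)",
    [
        ("Madison, Wisconsin", "Madison is the capital of ..."),
        ("Seattle, Washington", "Seattle is the largest city in ..."),
    ],
)

# A new attribute shows up later: widen the table, then fill in known values.
add_attribute(conn, "official flower")
conn.execute(
    'UPDATE wide_table SET "official flower" = ? WHERE DocTitle = ?',
    ("Dahlia", "Seattle, Washington"),
)

rows = conn.execute(
    'SELECT DocTitle, "official flower" FROM wide_table ORDER BY DocTitle'
).fetchall()
```

Documents that lack the attribute simply hold NULL in the new column, so the table stays queryable at every step.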

Data Storage – Mapping Table
- Store mappings for different attributes that correspond to the same real-world concept
- Query evaluation:
  - Look up the mapping table
  - Rewrite the query to include matching attributes

| host id | host name | mappings |
| a6 | temp (℉) | { a6 = a7 * 9 / 5 + 32 } |
| a7 | temperature (℃) | { a7 = 5 / 9 * (a6 – 32) } |
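A minimal sketch of how such a mapping might be applied at query time, assuming the mapping table is held as a Python dictionary. The names `MAPPINGS` and `lookup` are hypothetical; only the attribute ids `a6`/`a7` and the two conversion formulas come from the slide.

```python
# Hypothetical in-memory mapping table:
# attribute id -> (attribute name, id of the matching attribute, conversion).
MAPPINGS = {
    "a6": ("temp (F)", "a7", lambda c: c * 9 / 5 + 32),
    "a7": ("temperature (C)", "a6", lambda f: 5 / 9 * (f - 32)),
}


def lookup(row, attr_id):
    """Return the value of attr_id, deriving it from the mapped attribute if null."""
    value = row.get(attr_id)
    if value is not None:
        return value
    _, source_id, convert = MAPPINGS[attr_id]
    source_value = row.get(source_id)
    return None if source_value is None else convert(source_value)
```

A query over `a6` can then also use rows where only `a7` was extracted, which is the effect the query rewriting step is after.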

Data Processing
- The workbench does not decide how to process data
- It provides three basic operators:
  - Extract – “address” => “city”, “state”, “zip code”
  - Integrate – “address” = “sent-to”
  - Cluster
- DBAs decide which operators to use and set the parameters
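As an illustration of what an Extract step could look like, here is a small sketch that splits an “address” value into “city”, “state” and “zip code”. It assumes US-style addresses ending in `City, ST 12345`; the regular expression, the function name `extract_address` and the sample address are all hypothetical and not taken from the paper.

```python
import re

# Assumed layout of the address tail: "<city>, <2-letter state> <5-digit zip>".
ADDRESS_TAIL = re.compile(
    r"(?P<city>[^,]+),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5})\s*$"
)


def extract_address(address):
    """Return the derived attributes as a dict, or None if the pattern is absent."""
    if address is None:
        return None
    match = ADDRESS_TAIL.search(address)
    if match is None:
        return None
    return {
        "city": match.group("city").strip(),
        "state": match.group("state"),
        "zip code": match.group("zip"),
    }
```

For example, `extract_address("123 Example St., Madison, WI 53706")` yields the three new attributes, each of which would become its own column in the wide table.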

Data Processing (more)
Operator Interaction Example: improving clustering via iteration
1. Extract section names from Wikipedia pages
   - A page without a section name cannot be clustered properly!
2. Cluster pages based on section names
3. Extract and cluster pages based on the other attributes

Case Study: Wikipedia (1/2)
Dataset: American cities, universities, male tennis players
- Stage 1: Initial Loading
  - 5 columns: PageId, PageText, RevisionId, ContributorName, LastModificationDate
  - Can do keyword search over PageText immediately!
- Stage 2: Extracting Sections
  - 1,253 new columns: an explosion of new columns!
  - Integrate found many aliases
    - 350 of all attributes belonged to 1 of 14 attribute groups

Case Study: Wikipedia (2/2)
- Stage 3: Attribute Clustering
  - Grouped together attributes that were either both null or both non-null in a row
  - Views on clusters
    - ~25 ms for each cluster
    - 44 s for the wide table
- Stage 4: Extracting Wiki Tables
  - We can extract temperature_wiki!

| City | Month | Low_F | … |
| Madison, Wisconsin | 1 | 6 | |
| Seattle, Washington | 1 | 36 | |
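The null-pattern idea behind Stage 3 can be sketched as follows. This is a deliberately strict simplification, assuming two attributes belong together only when they are null or non-null in exactly the same rows; the function name and the sample attributes are hypothetical, and the actual clustering in the paper may tolerate mismatches.

```python
from collections import defaultdict


def cluster_by_null_pattern(rows, attributes):
    """Group attributes that share the same null/non-null signature over all rows."""
    clusters = defaultdict(list)
    for attr in attributes:
        # One boolean per row: True where the attribute has a value.
        signature = tuple(row.get(attr) is not None for row in rows)
        clusters[signature].append(attr)
    return list(clusters.values())


# Hypothetical rows: two city pages and one tennis player page.
sample_rows = [
    {"official flower": "Dahlia", "headquarter": "Starbucks"},
    {"official flower": "Rose", "headquarter": "Example Corp"},
    {"grand slam titles": "3"},
]
sample_clusters = cluster_by_null_pattern(
    sample_rows, ["official flower", "headquarter", "grand slam titles"]
)
```

Each resulting cluster could then back a narrower view, which is what makes the per-cluster queries so much cheaper than scanning the whole wide table.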

Now we can query!

SELECT AVG(Low_F)
FROM temperature_wiki AS T
WHERE T.City = 'Madison, Wisconsin'
  AND T.Month IN (1, 2, 12);
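A sketch for trying this kind of query, assuming SQLite and a handful of made-up rows (the `Low_F` values below are placeholders, not data from the paper). It also shows why the month filter has to be grouped: in SQL, AND binds more tightly than OR, so an ungrouped chain of OR conditions lets rows from other cities into the average.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE temperature_wiki (City TEXT, Month INTEGER, Low_F REAL)")
conn.executemany(
    "INSERT INTO temperature_wiki VALUES (?, ?, ?)",
    [
        # Placeholder values for illustration only.
        ("Madison, Wisconsin", 1, 10.0),
        ("Madison, Wisconsin", 2, 14.0),
        ("Madison, Wisconsin", 7, 60.0),
        ("Madison, Wisconsin", 12, 18.0),
        ("Seattle, Washington", 2, 38.0),
        ("Seattle, Washington", 12, 36.0),
    ],
)

# Month filter grouped with IN: only Madison's winter rows are averaged.
grouped = conn.execute(
    "SELECT AVG(Low_F) FROM temperature_wiki AS T "
    "WHERE T.City = ? AND T.Month IN (1, 2, 12)",
    ("Madison, Wisconsin",),
).fetchone()[0]

# Ungrouped OR chain: parsed as (City = ? AND Month = 1) OR Month = 2 OR Month = 12.
ungrouped = conn.execute(
    "SELECT AVG(Low_F) FROM temperature_wiki AS T "
    "WHERE T.City = ? AND T.Month = 1 OR T.Month = 2 OR T.Month = 12",
    ("Madison, Wisconsin",),
).fetchone()[0]
```

With these placeholder rows the two averages differ, because the ungrouped form also picks up Seattle's February and December rows.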

Conclusion
- A relational workbench to incrementally extract and query structure from unstructured data
  - Wide table
  - Mapping table
  - Operators
- Many problems ahead!
