A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data Eric Chu, Akanksha Baid, Ting Chen, AnHai Doan, Jeffrey Naughton.

Presentation transcript:

A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data
Eric Chu, Akanksha Baid, Ting Chen, AnHai Doan, Jeffrey Naughton
Computer Sciences Department, University of Wisconsin-Madison
33rd International Conference on Very Large Data Bases, September , University of Vienna, Austria
Summarized and presented by Chulki Lee, IDS Lab., Seoul National University

I want to know how cold Madison gets during winter

After searching with keyword and browsing… I found a wiki page, “Madison, Wisconsin”

There is a temperature table in the page… but how can I query over this?

In fact… it’s the structure implicit in unstructured documents!

Copyright © 2007 by CEBT

Goal: Exploit the Structure!
• Improve keyword search and browsing
• Structured queries
  – “Which universities are in places that are very cold in winter?”
  – Queries that keyword search and browsing are not good at answering

Challenges
• Manage this evolving structure incrementally
• How to enable users to query using as much structure as is currently known?

Proposal: A Relational Workbench
• A way to store an expanding set of documents and attributes
• Tools to incrementally process the data
• A way to exploit structure in queries

Advantages
• Data is always available for querying
• Supports incremental data processing
• Users can pose increasingly sophisticated queries over time
• Exploits the strengths of an RDBMS

Relational Workbench
• Data Storage
• Data Processing
• Case Study: Wikipedia

Data Storage – Wide Table
• Problem: the structure keeps evolving and is heterogeneous, so there is no good fixed schema
• Solution: use a single table
  – One document per row
  – New attribute? Create a new column in the table
  – Nulls add no storage overhead with interpreted storage or a column-oriented database

DocTitle            | DocContent                         | official flower
Madison, Wisconsin  | “Madison is the capital of …”      | null
Seattle, Washington | “Seattle is the largest city in …” | Dahlia

Data Storage – Wide Table (more)
• Attributes are allowed to have internal structure

DocTitle            | DocContent                         | official flower | headquarter(city, company)
Madison, Wisconsin  | “Madison is the capital of …”      | null            | [(Madison, Raven Software), (Madison, Human Head Studios), …]
Seattle, Washington | “Seattle is the largest city in …” | Dahlia          | [(Seattle, Starbucks), (Seattle, Amazon.com), …]
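The wide-table idea can be tried out with a minimal sketch. It assumes SQLite as a stand-in RDBMS (SQLite does not implement the interpreted storage the slide mentions), and the helper names `add_attribute` and `set_attribute` as well as the JSON encoding of the structured `headquarter` attribute are illustrative choices, not part of the original work.

```python
import json
import sqlite3


def add_attribute(conn, column):
    """Add a new attribute column to the wide table if it does not exist yet."""
    existing = {row[1] for row in conn.execute("PRAGMA table_info(wide_table)")}
    if column not in existing:
        # The identifier is quoted because attribute names may contain spaces.
        # Only use trusted names here: identifiers cannot be bound as parameters.
        conn.execute(f'ALTER TABLE wide_table ADD COLUMN "{column}" TEXT')


def set_attribute(conn, title, column, value):
    """Store one extracted attribute value for one document (one row)."""
    add_attribute(conn, column)
    conn.execute(
        f'UPDATE wide_table SET "{column}" = ? WHERE DocTitle = ?', (value, title)
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wide_table (DocTitle TEXT PRIMARY KEY, DocContent TEXT)")
conn.executemany(
    "INSERT INTO wide_table VALUES (?, ?)",
    [
        ("Madison, Wisconsin", "Madison is the capital of ..."),
        ("Seattle, Washington", "Seattle is the largest city in ..."),
    ],
)

# A newly discovered attribute becomes a new column; other rows stay null.
set_attribute(conn, "Seattle, Washington", "official flower", "Dahlia")

# An attribute with internal structure, encoded as JSON (an assumption of this sketch).
headquarters = [["Seattle", "Starbucks"], ["Seattle", "Amazon.com"]]
set_attribute(conn, "Seattle, Washington", "headquarter", json.dumps(headquarters))

rows = conn.execute(
    'SELECT DocTitle, "official flower", headquarter FROM wide_table ORDER BY DocTitle'
).fetchall()
```

Every document stays queryable while columns are added, which is the property the slides emphasize; rows that lack an attribute simply hold null in that column.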

Data Storage – Mapping Table
• Stores mappings between different attributes that correspond to the same real-world concept
• Query evaluation:
  – Look up the mapping table
  – Rewrite the query to include matching attributes

host id | host name       | mappings
a6      | temp (℉)        | { a6 = a7 * 9 / 5 + 32 }
a7      | temperature (℃) | { a7 = 5 / 9 * (a6 - 32) }
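One way to picture the rewrite step is the sketch below. It assumes the mapping table is held as a Python dictionary and that a rewritten predicate falls back to a mapped attribute through `COALESCE`; the `MAPPINGS` layout, the `rewrite` helper, and the `docs` table with its sample rows are hypothetical and only illustrate the lookup-then-rewrite idea.

```python
import sqlite3

# Hypothetical in-memory mapping table: attribute -> SQL expressions that
# compute the same quantity from another attribute.
MAPPINGS = {
    "a6": ["a7 * 9.0 / 5 + 32"],    # temp (F) derived from temperature (C)
    "a7": ["5.0 / 9 * (a6 - 32)"],  # temperature (C) derived from temp (F)
}


def rewrite(attribute):
    """Return an expression that also covers attributes mapped to `attribute`."""
    alternatives = [attribute] + MAPPINGS.get(attribute, [])
    if len(alternatives) == 1:
        return attribute  # no mapping known: leave the attribute unchanged
    return "COALESCE(" + ", ".join(alternatives) + ")"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (title TEXT, a6 REAL, a7 REAL)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?, ?)",
    [("P", 14.0, None), ("Q", None, -10.0), ("R", 50.0, None)],  # made-up rows
)

# The user asks about a6 only; the rewrite also finds rows that store a7.
query = f"SELECT title FROM docs WHERE {rewrite('a6')} < 32 ORDER BY title"
below_freezing = [title for (title,) in conn.execute(query)]
```

The expressions use `9.0` and `5.0` on purpose: in SQLite, `5 / 9` between integer literals is integer division and evaluates to 0.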

Data Processing
• The workbench does not decide how to process data
• It provides three basic operators:
  – Extract: “address” => “city”, “state”, “zip code”
  – Integrate: “address” = “sent-to”
  – Cluster
• DBAs decide which operators to use and set their parameters
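The Extract and Integrate operators can be illustrated with a small sketch over rows represented as dictionaries. The regular expression, the US-style address format it expects, the function names, and the sample address are all assumptions made for illustration; the slides do not specify how the operators are implemented.

```python
import re

# Assumed address shape: "<street>, <city>, <two-letter state> <5-digit zip>".
ADDRESS_PATTERN = re.compile(
    r"(?P<city>[^,]+),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5})$"
)


def integrate(row, canonical, alias):
    """Integrate: treat `alias` as the same attribute as `canonical`."""
    new_row = dict(row)
    if new_row.get(canonical) is None:
        new_row[canonical] = new_row.get(alias)
    return new_row


def extract_address(row):
    """Extract: derive "city", "state" and "zip code" from "address"."""
    new_row = dict(row)
    match = ADDRESS_PATTERN.search(row.get("address") or "")
    if match:
        new_row["city"] = match["city"].strip()
        new_row["state"] = match["state"]
        new_row["zip code"] = match["zip"]
    return new_row


# Illustrative input: the document only has a "sent-to" attribute.
document = {"sent-to": "1210 W Dayton St, Madison, WI 53706"}
processed = extract_address(integrate(document, "address", "sent-to"))
```

Chaining the two calls mirrors the operator interaction the workbench is meant to support: Integrate first makes the alias visible as `address`, and Extract then adds the finer-grained attributes as new columns.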

Data Processing (more)
Operator interaction example: improving clustering via iteration
1. Extract section names from Wikipedia pages (a page without section names cannot be clustered properly)
2. Cluster pages based on section names
3. Extract and cluster pages based on the other attributes

Case Study: Wikipedia (1/2)
Dataset: American cities, universities, male tennis players
• Stage 1: Initial Loading
  – 5 columns: PageId, PageText, RevisionId, ContributorName, LastModificationDate
  – Keyword search over PageText works immediately
• Stage 2: Extracting Sections
  – 1,253 new columns: an explosion of new columns
  – Integrate found many aliases: 350 of all attributes belonged to 1 of 14 attribute groups
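Stage 2 can be pictured with the sketch below, which splits a page's wiki markup on level-2 headings (`== Name ==`) so that each section name can become a column. The regular expression and the `extract_sections` helper are assumptions for illustration, not the extractor used in the case study, and the sample page text is made up.

```python
import re

# Matches level-2 wiki headings such as "== Climate ==" (not "=== Sub ===").
HEADING = re.compile(r"^==\s*([^=].*?)\s*==\s*$", re.MULTILINE)


def extract_sections(page_text):
    """Map each level-2 section name to the text that follows its heading."""
    sections = {}
    matches = list(HEADING.finditer(page_text))
    for i, match in enumerate(matches):
        # A section ends where the next level-2 heading starts.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(page_text)
        sections[match.group(1)] = page_text[match.end():end].strip()
    return sections


page_text = (
    "Intro text.\n"
    "== History ==\n"
    "Text about history.\n"
    "=== Early years ===\n"
    "More history.\n"
    "== Climate ==\n"
    "Text about climate.\n"
)
sections = extract_sections(page_text)
```

Each key of `sections` would turn into a new column of the wide table, which is how a modest set of pages can produce over a thousand columns.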

Case Study: Wikipedia (2/2)
• Stage 3: Attribute Clustering
  – Grouped together attributes that were either both null or both non-null in a row
  – Views on clusters: ~25 ms for each cluster; 44 sec for the wide table
• Stage 4: Extracting Wiki Tables
  – We can extract temperature_wiki:

City                | Month | Low_F | …
Madison, Wisconsin  | 1     | 6     |
Seattle, Washington | 1     | 36    |
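The clustering criterion of Stage 3 (attributes that are both null or both non-null in the same rows belong together) can be sketched as follows. The agreement score, the greedy assignment, and the 0.9 threshold are assumptions of this sketch; the slides state only the criterion, not the algorithm. The sample rows are made up.

```python
def null_agreement(rows, a, b):
    """Fraction of rows in which attributes a and b are both null or both non-null."""
    same = sum((row.get(a) is None) == (row.get(b) is None) for row in rows)
    return same / len(rows)


def cluster_attributes(rows, attributes, threshold=0.9):
    """Greedily put each attribute into the first cluster it agrees with."""
    clusters = []
    for attr in attributes:
        for cluster in clusters:
            if all(null_agreement(rows, attr, other) >= threshold for other in cluster):
                cluster.append(attr)
                break
        else:
            clusters.append([attr])  # no compatible cluster: start a new one
    return clusters


# Two city-like rows and one tennis-player-like row.
rows = [
    {"population": 1, "mayor": "x", "coach": None, "titles": None},
    {"population": 2, "mayor": "y", "coach": None, "titles": None},
    {"population": None, "mayor": None, "coach": "z", "titles": 3},
]
clusters = cluster_attributes(rows, ["population", "mayor", "coach", "titles"])
```

Each resulting cluster could then back a narrower view over the wide table, in line with the per-cluster views the slide reports timings for.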

Now we can query!

    SELECT AVG(Low_F)
    FROM temperature_wiki AS T
    WHERE T.City = 'Madison, Wisconsin'
      AND (Month = 1 OR Month = 2 OR Month = 12);
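One detail worth checking in a query of this shape is operator precedence: in SQL, `AND` binds more tightly than `OR`, so a month filter written as a chain of `OR` conditions needs parentheses to stay restricted to one city. The sketch below runs both variants in SQLite; the table contents are made-up sample values, not data from the case study.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE temperature_wiki (City TEXT, Month INTEGER, Low_F REAL)")
conn.executemany(
    "INSERT INTO temperature_wiki VALUES (?, ?, ?)",
    [  # made-up values for illustration only
        ("Madison, Wisconsin", 1, 10.0),
        ("Madison, Wisconsin", 2, 14.0),
        ("Madison, Wisconsin", 12, 18.0),
        ("Madison, Wisconsin", 7, 60.0),
        ("Seattle, Washington", 2, 38.0),
        ("Seattle, Washington", 12, 36.0),
    ],
)

BASE = (
    "SELECT AVG(Low_F) FROM temperature_wiki AS T "
    "WHERE T.City = 'Madison, Wisconsin' AND "
)

# Without parentheses the city test applies to Month = 1 only, so the
# February and December rows of every city are averaged in as well.
unparenthesized = conn.execute(
    BASE + "Month = 1 OR Month = 2 OR Month = 12"
).fetchone()[0]

# With parentheses only Madison's winter months are averaged.
parenthesized = conn.execute(
    BASE + "(Month = 1 OR Month = 2 OR Month = 12)"
).fetchone()[0]
```

With these sample rows the two results differ, because the unparenthesized form also picks up Seattle's rows for months 2 and 12.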

Conclusion
• A relational workbench to incrementally extract and query structure from unstructured data
  – Wide table
  – Mapping table
  – Operators
• Many problems ahead!