On Provenance of Non-Answers for Queries over Extracted Data. Jiansheng Huang, Ting Chen, AnHai Doan, Jeffrey F. Naughton.

Presentation transcript:

1 On Provenance of Non-Answers for Queries over Extracted Data
Jiansheng Huang, Ting Chen, AnHai Doan, Jeffrey F. Naughton

2 Imprecise Data in Information Extraction (IE)
[Diagram: Source-1, Source-2, …, Source-n are extracted and integrated into a Database, which is queried to produce Answers. Labels in the figure: Imprecise, Fuzzy; Imprecise, Incorrect, Incomplete.]
Examples: DBLife (WISC), Avatar (IBM), etc.

3 State of the Art
Table columns: Answers, Non-Answers
Probability: Cavallo et al., 1987; Barbara et al., 1992; Lakshmanan et al., 1997; Dalvi et al.,
Provenance: Woodruff et al., 1997; Cui et al., 2000, 2001; Buneman et al., 2001; Bhagwat et al., 2004
Probability + Provenance: Benjelloun et al., 2006

4 Motivating Example
Crawl the web to extract information related to academic job openings.
Store the result of the extraction in an RDBMS.
Ask SQL queries and try to interpret the results.

5 Extracted Jobs
Source: CS Dept. web sites
school_name   job_opening   school_state
harvard       yes           ma
berkeley      yes           ca
ucsd          no            ca
uc merced     no            ca
…             …             …
ucsc          yes           ca

6 Extracted Ranking
Source: CS Ranking Web Site
school_name   rank
harvard       11
berkeley      3
ucsd          23
uc merced     (empty)
…             …

7 Question Answering
What are the CS PhD programs in California (CA) that have job openings and are in the top 25?
  SELECT Jobs.school_name
  FROM Jobs, Ranking
  WHERE Ranking.rank <= 25
    AND Jobs.job_opening = 'yes'
    AND Jobs.school_name = Ranking.school_name
    AND Jobs.school_state = 'ca';

8 Answer
berkeley, …
The answer berkeley is justified by these tuples:
Jobs:      school_name = berkeley, job_opening = yes, school_state = ca
Ranking:   school_name = berkeley, rank = 3

9 Non-Answers
ucsd, uc merced, harvard, ucsc, …
But why? The data exists, but there is no mechanism for explaining why a tuple is not an answer.
Jobs:
school_name   job_opening   school_state
harvard       yes           ma
ucsd          no            ca
uc merced     no            ca
ucsc          yes           ca
Ranking:
school_name   rank
harvard       11
ucsd          23
uc merced     (empty)
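The answer and non-answers on slides 7 to 9 can be reproduced with a small sketch. It loads the rows shown on slides 5 and 6 into an in-memory SQLite database; rows hidden behind "…" are simply left out, string literals are quoted, and the helper names (build_db, answers, non_answers) are hypothetical, not part of the original work.

```python
import sqlite3


def build_db():
    """Create an in-memory database holding the rows visible on slides 5 and 6."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Jobs (school_name TEXT, job_opening TEXT, school_state TEXT);
        CREATE TABLE Ranking (school_name TEXT, rank INTEGER);
        INSERT INTO Jobs VALUES
            ('harvard', 'yes', 'ma'), ('berkeley', 'yes', 'ca'),
            ('ucsd', 'no', 'ca'), ('uc merced', 'no', 'ca'),
            ('ucsc', 'yes', 'ca');
        INSERT INTO Ranking VALUES
            ('harvard', 11), ('berkeley', 3), ('ucsd', 23), ('uc merced', NULL);
    """)
    return conn


# The user query from slide 7, with string literals quoted.
USER_QUERY = """
    SELECT Jobs.school_name
    FROM Jobs, Ranking
    WHERE Ranking.rank <= 25
      AND Jobs.job_opening = 'yes'
      AND Jobs.school_name = Ranking.school_name
      AND Jobs.school_state = 'ca'
"""


def answers(conn):
    """Schools returned by the user query."""
    return [row[0] for row in conn.execute(USER_QUERY)]


def non_answers(conn):
    """Schools present in Jobs that the user query does not return."""
    returned = set(answers(conn))
    names = [row[0] for row in conn.execute("SELECT school_name FROM Jobs")]
    return sorted(name for name in names if name not in returned)
```

Each non-answer fails for a different reason (wrong state, no opening, empty rank, missing ranking tuple), which is what the provenance mechanism in the following slides sets out to explain.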

10 Assumptions
Relational data model
Subset of SQL
– Selection (e.g., R.a = 2)
– Projection (e.g., return R.a)
– Join (e.g., R.a = S.b)
Conjunctive predicates (e.g., a = 2 and b = 3)
Satisfiable (e.g., no “a = 2 and a = 3”)

11 Provenance of Non-Answers
[Diagram: the query over (x, y) does not return z; after the updates x -› x´, the same query over (x´, y) returns z.]
z is a potential answer; (x, y) and (x´, y) are the provenance of z.
(x -› x´, y) explains why z is not an answer and how z can become an answer.

12 Example
Before the update:
Jobs:      school_name = ucsd, job_opening = no, school_state = ca
Ranking:   school_name = ucsd, rank = 23
After the update:
Jobs:      school_name = ucsd, job_opening = yes, school_state = ca
Ranking:   school_name = ucsd, rank = 23
Query result after the update: ucsd
ucsd is a potential answer. The set of base tuples is a provenance of ucsd.

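13 Another Example
Before the update:
Jobs:      school_name = harvard, job_opening = yes, school_state = ma
Ranking:   school_name = ucsd, rank = 23
After the update:
Jobs:      school_name = ucsd, job_opening = yes, school_state = ca
Ranking:   school_name = ucsd, rank = 23
Query result after the update: ucsd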

14 Trust and Constraints
[Diagram: x is untrusted and y is trusted. Valid updates x -› x´ must satisfy the constraints; updates to the trusted data y are not considered. The same query over (x´, y) returns z.]

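15 Example: Using Trust
Before the update:
Jobs:      school_name = harvard, job_opening = yes, school_state = ma
Ranking:   school_name = ucsd, rank = 23
After the update:
Jobs:      school_name = ucsd, job_opening = yes, school_state = ca
Ranking:   school_name = ucsd, rank = 23
Query result after the update: ucsd
[The figure marks part of this data with the label "trust".]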

16 Factors Determining Provenance of Non-Answers
Trusted data
Constraints
Query specification

17 Algorithm
Start from a user query and a specific non-answer.
Add predicates derived from the non-answer.
Add constraint predicates.
Retain only predicates on trusted attributes.
For attributes of a potential tuple:
– Determine an equivalent constant value (e.g., a = 2).
– If none, return a variable.
Evaluate the provenance query.
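The steps above can be sketched for selection predicates only. This is an illustrative reading of the slide, not the authors' implementation: join predicates and the final evaluation step are left out, and the function name provenance_plan and its tuple-based predicate format are assumptions made for this example.

```python
def provenance_plan(predicates, trusted, non_answer):
    """Split predicates into a trusted filter and hypothetical updates.

    predicates: list of (attribute, operator, constant) tuples taken from the
                user query plus any constraint predicates.
    trusted:    set of attribute names whose values must not be updated.
    non_answer: dict mapping attribute -> value that identifies the non-answer.
    """
    # Combine the user query predicates with those derived from the non-answer.
    all_preds = list(predicates) + [(a, "=", v) for a, v in non_answer.items()]

    # Only predicates on trusted attributes remain as filters.
    where = [p for p in all_preds if p[0] in trusted]

    # Each untrusted attribute gets a hypothetical target: the constant when
    # the predicate is an equality, otherwise a constrained variable X.
    updates = {}
    for attr, op, const in all_preds:
        if attr in trusted:
            continue
        updates[attr] = str(const) if op == "=" else f"X {op} {const}"
    return where, updates


# The query from slide 7, expressed in the assumed predicate format.
QUERY = [
    ("Jobs.job_opening", "=", "yes"),
    ("Jobs.school_state", "=", "ca"),
    ("Ranking.rank", "<=", 25),
]
FULL_TRUST = {
    "Jobs.school_name", "Jobs.school_state",
    "Ranking.school_name", "Ranking.rank",
}
```

Calling provenance_plan(QUERY, FULL_TRUST, {"Jobs.school_name": "ucsd"}) leaves job_opening as the only attribute open to a hypothetical update; removing Ranking.rank from the trusted set additionally yields the variable "X <= 25" for the rank.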

18 Example: Why Is UCSD Not an Answer?
Assume that we trust
– Jobs(school_name, school_state)
– Ranking(school_name, rank)
– Completeness of Jobs and Ranking

19 Computing Provenance of UCSD
User query:
  SELECT Jobs.school_name
  FROM Jobs, Ranking
  WHERE Jobs.job_opening = 'yes'
    AND Jobs.school_state = 'ca'
    AND Ranking.rank <= 25
    AND Jobs.school_name = Ranking.school_name;
Provenance query:
  SELECT J.school_name, J.job_opening -› yes, R.school_name, R.rank
  FROM Jobs AS J, Ranking AS R
  WHERE J.school_name = 'ucsd'
    AND J.school_state = 'ca'
    AND R.rank <= 25
    AND J.school_name = R.school_name;
Legend: Trusted; Specifying non-answer; Hypothetical update. The predicate J.school_name = 'ucsd' specifies the non-answer, and "-› yes" is the hypothetical update to J.job_opening.

20 Provenance of UCSD
Jobs.school_name   job_opening   Ranking.school_name   rank
ucsd               no -› yes     ucsd                  23
UCSD is a potential answer.
Why not an answer? Because job_opening = no.
How to become an answer? job_opening: no -› yes.
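As a rough check, the provenance query for UCSD can be run against the sample rows from slides 5 and 6. The sketch below is an assumption-laden illustration: the "-› yes" annotation is not SQL, so the hypothetical update is applied in Python after the query returns, and the explain helper and its report format are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Jobs (school_name TEXT, job_opening TEXT, school_state TEXT);
    CREATE TABLE Ranking (school_name TEXT, rank INTEGER);
    INSERT INTO Jobs VALUES
        ('harvard', 'yes', 'ma'), ('berkeley', 'yes', 'ca'),
        ('ucsd', 'no', 'ca'), ('uc merced', 'no', 'ca'), ('ucsc', 'yes', 'ca');
    INSERT INTO Ranking VALUES
        ('harvard', 11), ('berkeley', 3), ('ucsd', 23), ('uc merced', NULL);
""")

# Provenance query from slide 19: the untrusted predicate on job_opening is
# dropped, and a predicate naming the non-answer is added.
PROVENANCE_QUERY = """
    SELECT J.school_name, J.job_opening, R.school_name, R.rank
    FROM Jobs AS J, Ranking AS R
    WHERE J.school_name = ?
      AND J.school_state = 'ca'
      AND R.rank <= 25
      AND J.school_name = R.school_name
"""


def explain(school):
    """Return one report row per provenance tuple, marking the needed update."""
    report = []
    for name, opening, _, rank in conn.execute(PROVENANCE_QUERY, (school,)):
        # The user query requires job_opening = 'yes'; show the change if needed.
        change = opening if opening == "yes" else f"{opening} -> yes"
        report.append({"school_name": name, "job_opening": change, "rank": rank})
    return report
```

For "ucsd" the report contains a single row whose job_opening reads "no -> yes", matching the table on slide 20.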

21 Provenance-Assisted Debugging
While implementing our job extraction example, we actually used provenance of non-answers to find a bug.
Specifically, we noticed that UCSD is not an answer to “find all dept. in top 25 with job openings”.
Informed by provenance, we checked the UCSD web page and found that it does have a job opening.
What happened?

22 Our Bug
The UCSD web page has a job opening.
We debugged the extraction for the UCSD instance.
Bug: a line in the source was longer than the line buffer used for reading.
Fix: increase the line buffer size.
Re-extracting and re-querying produces UCSD as an answer.

23 Deeper Issues
New records can be inserted.
– Use an all-null tuple as a proxy in our provenance report (not actually inserted).
The join expression for a provenance query depends on the trust and constraints on the joined tables.

24 More on Join Expression
Given a join between R and S on R.c1 = S.c2, and assuming R.c1 is trusted and R is complete, the join expression for the provenance query is:
If S.c2 is trusted:
– If S is complete: R join S
– If S.c2 is unique: R left outer join S
– Otherwise: (R join S) union (R x {null, …})
If S.c2 is not trusted:
– If S is complete: R x S
– Otherwise: R x (S union {null, …})
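The case analysis above can be written down as a small decision function. This is only a sketch of the slide's rules: the function name, the boolean parameters, and the returned strings are assumptions for illustration, and the cases are checked in the order the slide lists them.

```python
def join_expression(s_col_trusted, s_complete, s_col_unique):
    """Pick the join expression for a provenance query joining R and S.

    Assumes, as on the slide, that R.c1 is trusted and R is complete.
    """
    if s_col_trusted:
        if s_complete:
            # No tuple can be missing from S, so a plain join is enough.
            return "R join S"
        if s_col_unique:
            # At most one match per R tuple; a missing match becomes nulls.
            return "R left outer join S"
        # Existing matches plus an all-null proxy for a possible new tuple.
        return "(R join S) union (R x {null, ...})"
    if s_complete:
        # Any existing S tuple could be updated to match.
        return "R x S"
    # Any existing S tuple, or a newly inserted one, could match.
    return "R x (S union {null, ...})"
```

For the UCSC example, where Ranking.school_name is trusted and unique but Ranking is not assumed complete, join_expression(True, False, True) selects the left outer join.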

25 Example: Why Is UC Santa Cruz (UCSC) Not an Answer?
Assume that we trust
– Jobs(school_name, school_state)
– Ranking(school_name, rank)
– Completeness of Jobs
Ranking.school_name is unique

26 Computing Provenance of UCSC
User query:
  SELECT Jobs.school_name
  FROM Jobs, Ranking
  WHERE Jobs.job_opening = 'yes'
    AND Jobs.school_state = 'ca'
    AND Ranking.rank <= 25
    AND Jobs.school_name = Ranking.school_name;
Provenance query:
  SELECT J.school_name, J.job_opening -› yes, R.school_name -› J.school_name, R.rank -› X
  FROM Jobs AS J LEFT OUTER JOIN Ranking AS R
  WHERE J.school_state = 'ca'
    AND R.rank <= 25
    AND J.school_name = R.school_name
    AND J.school_name = 'ucsc';
Legend: Trusted; Specifying non-answer; Hypothetical update.

27 Provenance of UCSC
Jobs.school_name   job_opening   Ranking.school_name   rank
ucsc               yes           null -› ucsc          null -› X <= 25
UCSC is a potential answer.
Why not an answer? Because there is no ranking for ucsc.
How to become an answer? A new ranking tuple is inserted: (null -› ucsc, null -› X <= 25).
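A runnable approximation of the UCSC case is sketched below. The slide's query is annotated pseudo-SQL; to keep the all-null proxy row in an actual engine, this sketch moves the join and rank predicates into the ON clause of the outer join, which is an interpretation rather than something the slides state. The sample rows come from slides 5 and 6, and the function name provenance_ucsc_style is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Jobs (school_name TEXT, job_opening TEXT, school_state TEXT);
    CREATE TABLE Ranking (school_name TEXT, rank INTEGER);
    INSERT INTO Jobs VALUES
        ('harvard', 'yes', 'ma'), ('berkeley', 'yes', 'ca'),
        ('ucsd', 'no', 'ca'), ('uc merced', 'no', 'ca'), ('ucsc', 'yes', 'ca');
    INSERT INTO Ranking VALUES
        ('harvard', 11), ('berkeley', 3), ('ucsd', 23), ('uc merced', NULL);
""")

# Outer-join variant: a Jobs tuple with no acceptable Ranking tuple is kept
# and padded with NULLs, which play the role of the all-null proxy tuple.
OUTER_JOIN_QUERY = """
    SELECT J.school_name, J.job_opening, R.school_name, R.rank
    FROM Jobs AS J LEFT OUTER JOIN Ranking AS R
      ON J.school_name = R.school_name AND R.rank <= 25
    WHERE J.school_state = 'ca'
      AND J.school_name = ?
"""


def provenance_ucsc_style(school):
    """Run the outer-join provenance query for one non-answer."""
    return conn.execute(OUTER_JOIN_QUERY, (school,)).fetchall()
```

For "ucsc" the single result row has NULL in both Ranking columns, corresponding to the hypothetical insertion (null -› ucsc, null -› X <= 25) on slide 27.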

28 Dataset for Experiment
Extracted CS Ph.D. program ranks from the CRA web site (108 schools).
Extracted job openings from department web sites (108 schools).
Assumption: Trust Jobs(school_name, school_state) and the completeness of Jobs.

29 Impact of Trust/Constraints on Provenance of UCSD
No trust/constraints on Ranking: 109 provenance tuples
Trust Ranking.school: 2 provenance tuples
Trust Ranking.school and Ranking.school is unique: 1 provenance tuple
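One way to read these counts is through the join expressions on slide 24: with 108 Ranking tuples and an all-null proxy, an untrusted Ranking contributes 108 + 1 candidates, a trusted name contributes the match plus the proxy, and a trusted unique name contributes the match alone. The sketch below illustrates that reading with synthetic data; the school names other than ucsd, the ranks, and the function name are made up for the example, and the interpretation itself is an assumption rather than something the slides spell out.

```python
def candidate_ranking_tuples(ranking, school, trust_name, name_unique):
    """Ranking tuples (or the null proxy) that may pair with the Jobs tuple."""
    null_proxy = (None, None)
    if not trust_name:
        # Any existing tuple could be rewritten, or a new one inserted.
        return list(ranking) + [null_proxy]
    matches = [row for row in ranking if row[0] == school]
    if name_unique:
        # Left outer join: the match if it exists, otherwise the proxy.
        return matches if matches else [null_proxy]
    # Existing matches, plus a possible additional tuple for the same school.
    return matches + [null_proxy]


# Synthetic Ranking table with 108 schools, one of which is ucsd.
RANKING = [(f"school_{i}", i) for i in range(1, 108)] + [("ucsd", 23)]
```

With exactly one trusted Jobs tuple for ucsd, the number of provenance tuples equals the number of candidates returned here: 109, 2, and 1 for the three settings.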

30 Impact of Trust/Constraints on Provenance Scalability
Scale up the database by a factor of 100 and compare the number of provenance tuples for UCSD.
No trust/constraints on Ranking: x 100.
Trust Ranking.school: no change.

31 Conclusion
Proposed a mechanism for explaining a non-answer by using data, constraints, and the query.
Showed that trust and constraints are critical for getting focused provenance.
Some opportunities for future work
– Formal theory (e.g., in relational algebra)
– Provenance ranking, etc.

32 THANKS!

33 Original Context: Condor Project
Distributed computing research project.
Develops and maintains the Condor system software, and supports a production distributed computing facility at UW-Madison.

34 Data Management in Condor
[Diagram labels: Central Manager (Collector, Negotiator); Execute Machine; Submit Machine; Scheduler; Executor; machine counts O(10) ~ O(1000) and O(1) ~ O(1000); edges labeled job, param, machine, job.]
Job and system state in local log files!

35 CondorDB to the Rescue
[Diagram labels: CondorDB Database; Query; Central Manager (Collector, Negotiator); Execute Machine; Schedule Machine; Scheduler; Executor; machine counts O(10) ~ O(1000) and O(1) ~ O(1000); edges labeled job, param, machine, job.]

36 CondorDB Deployments
many more …

37 Imprecise Data in CondorDB
[Diagram: Condor nodes feed a Database, which is queried to produce Answers. Labels in the figure: Autonomous, uncontrollable, unpredictable; Out of date, Inconsistent, Incorrect; Incorrect, Incomplete.]

38 Does this problem also occur in any other application?

39 Example: Why Is UC Merced Not an Answer?
Assume that we trust
– Jobs(school_name, school_state)
– Ranking(school_name, rank)
– Completeness of Jobs and Ranking

40 Computing Provenance of UC Merced
User query:
  SELECT Jobs.school_name
  FROM Jobs, Ranking
  WHERE Jobs.job_opening = 'yes'
    AND Jobs.school_state = 'ca'
    AND Ranking.rank <= 25
    AND Jobs.school_name = Ranking.school_name;
Provenance query:
  SELECT J.school_name, J.job_opening -› yes, R.school_name, R.rank
  FROM Jobs AS J, Ranking AS R
  WHERE J.school_state = 'ca'
    AND R.rank <= 25
    AND J.school_name = R.school_name
    AND J.school_name = 'uc merced';
Legend: Trusted; Specifying non-answer; Hypothetical update.

41 Provenance of UC Merced
Jobs.school_name   job_opening   Ranking.school_name   rank
(no rows)
UC Merced is not a potential answer.
Why not a potential answer? Because the trusted data (rank) does not satisfy the query.

42 Example: Relaxing Trust for Provenance of UC Merced
Assume that we trust
– Jobs(school_name, school_state)
– Ranking(school_name)
– Completeness of Jobs and Ranking

43 Computing Provenance of UC Merced
User query:
  SELECT Jobs.school_name
  FROM Jobs, Ranking
  WHERE Jobs.job_opening = 'yes'
    AND Jobs.school_state = 'ca'
    AND Ranking.rank <= 25
    AND Jobs.school_name = Ranking.school_name;
Provenance query:
  SELECT J.school_name, J.job_opening -› yes, R.school_name, R.rank -› X <= 25
  FROM Jobs AS J, Ranking AS R
  WHERE J.school_state = 'ca'
    AND J.school_name = R.school_name
    AND J.school_name = 'uc merced';
Legend: Trusted; Specifying non-answer; Hypothetical update.

44 Provenance of UC Merced
Jobs.school_name   job_opening   Ranking.school_name   rank
uc merced          no -› yes     uc merced             null -› X <= 25
UC Merced is a potential answer.
Why not an answer? Because job_opening = no and rank = null.
How to become an answer? job_opening: no -› yes and rank: null -› X <= 25.
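The effect of relaxing trust on Ranking.rank can be checked with the sample rows from slides 5 and 6. In this sketch the trust_rank flag decides whether the predicate R.rank <= 25 stays in the provenance query; the flag and the function name are assumptions for illustration, and the hypothetical updates themselves are not computed here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Jobs (school_name TEXT, job_opening TEXT, school_state TEXT);
    CREATE TABLE Ranking (school_name TEXT, rank INTEGER);
    INSERT INTO Jobs VALUES
        ('harvard', 'yes', 'ma'), ('berkeley', 'yes', 'ca'),
        ('ucsd', 'no', 'ca'), ('uc merced', 'no', 'ca'), ('ucsc', 'yes', 'ca');
    INSERT INTO Ranking VALUES
        ('harvard', 11), ('berkeley', 3), ('ucsd', 23), ('uc merced', NULL);
""")


def provenance(school, trust_rank):
    """Provenance tuples for a non-answer, with or without trusting rank."""
    sql = """
        SELECT J.school_name, J.job_opening, R.school_name, R.rank
        FROM Jobs AS J, Ranking AS R
        WHERE J.school_state = 'ca'
          AND J.school_name = R.school_name
          AND J.school_name = ?
    """
    if trust_rank:
        # A trusted rank cannot be updated, so its predicate must already hold.
        sql += " AND R.rank <= 25"
    return conn.execute(sql, (school,)).fetchall()
```

With trust_rank=True the query returns no rows for "uc merced" (slide 41); with trust_rank=False it returns the tuple with job_opening = no and an empty rank, the two values that slide 44 proposes to update.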