
1 On Provenance of Non-Answers for Queries over Extracted Data Jiansheng Huang Ting Chen AnHai Doan Jeffrey F. Naughton.


1 On Provenance of Non-Answers for Queries over Extracted Data
Jiansheng Huang, Ting Chen, AnHai Doan, Jeffrey F. Naughton

2 Imprecise Data in Information Extraction (IE)
[Diagram: Source-1, Source-2, …, Source-n are extracted and integrated into a Database, which is queried for Answers. Labels on the diagram: Imprecise, Fuzzy, Imprecise, Incorrect, Incomplete.]
Examples: DBLife (WISC), Avatar (IBM), etc.

3 State of the Art
Table columns: Answers, Non-Answers
– Probability: Cavallo et al., 1987; Barbara et al., 1992; Lakshmanan et al., 1997; Dalvi et al., 2004.
– Provenance: Woodruff et al., 1997; Cui et al., 2000, 2001; Buneman et al., 2001; Bhagwat et al., 2004.
– Probability + Provenance: Benjelloun et al., 2006.

4 Motivating Example
– Crawl the web to extract information related to academic job openings.
– Store the result of the extraction in an RDBMS.
– Ask SQL queries and try to interpret the results.

5 Extracted Jobs (source: CS Dept. web sites)
school_name | school_state | job_opening
harvard     | ma           | yes
berkeley    | ca           | yes
ucsd        | ca           | no
uc merced   | ca           | no
…           | …            | …
ucsc        | ca           | yes

6 Extracted Ranking (source: CS Ranking Web Site)
school_name | rank
harvard     | 11
berkeley    | 3
ucsd        | 23
uc merced   | (blank)
…           | …

7 Question Answering
What are the CS PhD programs in California (CA) that have job openings and are in the top 25?
    SELECT Jobs.school_name
    FROM Jobs, Ranking
    WHERE Ranking.rank <= 25
      AND Jobs.job_opening = 'yes'
      AND Jobs.school_name = Ranking.school_name
      AND Jobs.school_state = 'ca';
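The query on this slide can be tried out with a small script. The following is a minimal sketch, assuming Python's built-in sqlite3 module and only the sample rows visible on slides 5 and 6; the helper names (load_example_db, answers) are hypothetical, and storing the blank rank of uc merced as NULL is an assumption.

```python
import sqlite3

def load_example_db():
    """Build an in-memory database with the sample rows shown on slides 5 and 6."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Jobs (school_name TEXT, school_state TEXT, job_opening TEXT);
        CREATE TABLE Ranking (school_name TEXT, rank INTEGER);
    """)
    conn.executemany("INSERT INTO Jobs VALUES (?, ?, ?)", [
        ("harvard", "ma", "yes"),
        ("berkeley", "ca", "yes"),
        ("ucsd", "ca", "no"),
        ("uc merced", "ca", "no"),
        ("ucsc", "ca", "yes"),
    ])
    conn.executemany("INSERT INTO Ranking VALUES (?, ?)", [
        ("harvard", 11),
        ("berkeley", 3),
        ("ucsd", 23),
        ("uc merced", None),  # assumption: the blank rank is stored as NULL
    ])
    return conn

USER_QUERY = """
    SELECT Jobs.school_name
    FROM Jobs, Ranking
    WHERE Ranking.rank <= 25
      AND Jobs.job_opening = 'yes'
      AND Jobs.school_name = Ranking.school_name
      AND Jobs.school_state = 'ca'
"""

def answers(conn):
    """Run the user query and return the matching school names."""
    return [row[0] for row in conn.execute(USER_QUERY)]

print(answers(load_example_db()))
```

On this sample data only berkeley satisfies all four predicates; every other school is a non-answer.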

8 Answer
berkeley
Justified by the Jobs tuple (school_name = berkeley, school_state = ca, job_opening = yes) and the Ranking tuple (school_name = berkeley, rank = 3).

9 Non-Answers
ucsd, uc merced, harvard, ucsc, …
But why? Data exists. No mechanism.
Jobs:
school_name | school_state | job_opening
harvard     | ma           | yes
ucsd        | ca           | no
uc merced   | ca           | no
ucsc        | ca           | yes
Ranking:
school_name | rank
harvard     | 11
ucsd        | 23
uc merced   | (blank)

10 Assumptions
Relational data model
Subset of SQL
– Selection (e.g., R.a = 2)
– Projection (e.g., return R.a)
– Join (e.g., R.a = S.b)
Conjunctive predicates (e.g., a = 2 and b = 3)
Satisfiable (e.g., no “a = 2 and a = 3”)

11 Provenance of Non-Answers
[Diagram: the query over (x, y) does not return z; updates change x to x'; the same query over (x', y) returns z]
z is a potential answer; (x, y) and (x', y) are the provenance of z.
(x -> x', y) explains why z is not an answer and how z can become an answer.

12 Example
Before the update:
  Jobs: (school_name = ucsd, school_state = ca, job_opening = no)
  Ranking: (school_name = ucsd, rank = 23)
After the update:
  Jobs: (school_name = ucsd, school_state = ca, job_opening = yes)
  Ranking: (school_name = ucsd, rank = 23)
  Query result: ucsd
ucsd is a potential answer. The set of base tuples is a provenance of ucsd.

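13 Another Example
Before the update:
  Jobs: (school_name = harvard, school_state = ma, job_opening = yes)
  Ranking: (school_name = ucsd, rank = 23)
After the update:
  Jobs: (school_name = ucsd, school_state = ca, job_opening = yes)
  Ranking: (school_name = ucsd, rank = 23)
  Query result: ucsd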

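14 Trust and Constraints
[Diagram: the query over (x, y) does not return z; valid updates change x to x'; the same query over (x', y) returns z]
– Untrusted (x): valid updates must satisfy constraints.
– Trusted (y): don't consider updates.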

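15 Example: Using Trust
Before the update:
  Jobs: (school_name = harvard, school_state = ma, job_opening = yes)
  Ranking: (school_name = ucsd, rank = 23)
After the update:
  Jobs: (school_name = ucsd, school_state = ca, job_opening = yes)
  Ranking: (school_name = ucsd, rank = 23)
  Query result: ucsd
Label on the slide: trust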

16 Factors Determining Provenance of Non-Answers
– Trusted data
– Constraints
– Query specification

17 Algorithm
– Start from a user query and a specific non-answer.
– Add predicates derived from the non-answer.
– Add constraint predicates.
– Retain only predicates on trusted attributes.
– For attributes of a potential tuple:
  – Determine the equivalent constant value (e.g., a = 2).
  – If none, return a variable.
– Evaluate the provenance query.
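The steps on this slide can be sketched in code. This is a rough illustration, not the authors' implementation: the Predicate class, the build_provenance_query function, and the flat list-of-predicates representation are all hypothetical, and untrusted join predicates are simply dropped here (slide 24 covers how joins are actually handled).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Predicate:
    left: str                         # attribute, e.g. "Jobs.job_opening"
    op: str                           # comparison operator, e.g. "=" or "<="
    right: str                        # SQL literal, or an attribute for a join
    right_is_attribute: bool = False  # True for join predicates

def build_provenance_query(query_preds, non_answer_preds, constraint_preds, trusted):
    """Return (WHERE clause over trusted attributes, hypothetical updates)."""
    kept = []
    updates = {}
    for p in [*query_preds, *non_answer_preds, *constraint_preds]:
        attrs = [p.left, p.right] if p.right_is_attribute else [p.left]
        if all(a in trusted for a in attrs):
            # Retain only predicates on trusted attributes.
            kept.append(f"{p.left} {p.op} {p.right}")
        elif not p.right_is_attribute:
            # Untrusted attribute: use the constant if the predicate fixes one,
            # otherwise report a variable X restricted by the predicate.
            updates[p.left] = p.right if p.op == "=" else f"X {p.op} {p.right}"
    return " AND ".join(kept), updates

# Inputs for the UCSD example on slides 18 and 19.
QUERY = [
    Predicate("Jobs.job_opening", "=", "'yes'"),
    Predicate("Jobs.school_state", "=", "'ca'"),
    Predicate("Ranking.rank", "<=", "25"),
    Predicate("Jobs.school_name", "=", "Ranking.school_name", True),
]
NON_ANSWER = [Predicate("Jobs.school_name", "=", "'ucsd'")]
TRUSTED = {"Jobs.school_name", "Jobs.school_state",
           "Ranking.school_name", "Ranking.rank"}

where_clause, updates = build_provenance_query(QUERY, NON_ANSWER, [], TRUSTED)
print(where_clause)
print(updates)
```

Removing "Ranking.rank" from TRUSTED, as on slide 42, moves the rank predicate out of the WHERE clause and into the updates as the variable constraint X <= 25.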

18 Example: Why is UCSD Not an Answer?
Assume that we trust
– Jobs(school_name, school_state)
– Ranking(school_name, rank)
– Completeness of Jobs and Ranking

19 Computing Provenance of UCSD
User query:
    SELECT Jobs.school_name
    FROM Jobs, Ranking
    WHERE Jobs.job_opening = 'yes'
      AND Jobs.school_state = 'ca'
      AND Ranking.rank <= 25
      AND Jobs.school_name = Ranking.school_name;
Provenance query (legend on the slide: Trusted, Specifying non-answer, Hypothetical update):
    SELECT J.school_name,
           J.job_opening,   -- hypothetical update: -> yes
           R.school_name,
           R.rank
    FROM Jobs AS J, Ranking AS R
    WHERE J.school_name = 'ucsd'   -- specifies the non-answer
      AND J.school_state = 'ca'
      AND R.rank <= 25
      AND J.school_name = R.school_name;

20 Provenance of UCSD
Jobs.school_name | job_opening | Ranking.school_name | rank
ucsd             | no -> yes   | ucsd                | 23
UCSD is a potential answer.
Why not an answer? Because job_opening = no.
How to become an answer? job_opening: no -> yes.
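The UCSD provenance query from slide 19 can be run against a toy database to reproduce the report above. A minimal sketch, assuming Python's sqlite3 module and two sample rows per table; the hypothetical_updates helper is an illustrative name, not part of the original system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Jobs (school_name TEXT, school_state TEXT, job_opening TEXT);
    CREATE TABLE Ranking (school_name TEXT, rank INTEGER);
    INSERT INTO Jobs VALUES ('berkeley', 'ca', 'yes'), ('ucsd', 'ca', 'no');
    INSERT INTO Ranking VALUES ('berkeley', 3), ('ucsd', 23);
""")

# Provenance query: only predicates on trusted attributes are kept,
# plus the predicate that names the non-answer.
rows = conn.execute("""
    SELECT J.school_name, J.job_opening, R.school_name, R.rank
    FROM Jobs AS J, Ranking AS R
    WHERE J.school_name = 'ucsd'
      AND J.school_state = 'ca'
      AND R.rank <= 25
      AND J.school_name = R.school_name
""").fetchall()

def hypothetical_updates(row):
    """List the changes to untrusted attributes that would make the row qualify."""
    _, job_opening, _, _ = row
    if job_opening != "yes":
        return [f"job_opening: {job_opening} -> yes"]
    return []

for row in rows:
    print(row, hypothetical_updates(row))
```

The untrusted predicate on job_opening is left out of the WHERE clause, so the ucsd tuple survives and the difference from the required value is reported as the explanation.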

21 Provenance-Assisted Debugging
– While implementing our job extraction example, we actually used provenance of non-answers to find a bug.
– Specifically, we noticed that UCSD is not an answer to “find all departments in the top 25 with job openings”.
– Informed by provenance, we checked the UCSD web page and found that it does have a job opening.
– What happened?

22 Our Bug
– The UCSD web page has a job opening.
– Debugged the extraction for the UCSD instance.
– Bug: a line in the source was longer than the line buffer used for reading.
– Fix: increase the line buffer size.
– Re-extracting and re-querying produces UCSD as an answer.

23 Deeper Issues
New records can be inserted.
– Use an all-null tuple as a proxy in our provenance report (not actually inserted).
The join expression for a provenance query depends on the trust and constraints on the joined tables.

24 More on Join Expression
Given a join between R and S on R.c1 = S.c2, and assuming R.c1 is trusted and R is complete, the join expression for the provenance query is:
If S.c2 is trusted:
– if S is complete, R join S;
– if S.c2 is unique, R left outer join S;
– otherwise, (R join S) union (R x {null, …}).
If S.c2 is not trusted:
– if S is complete, R x S;
– otherwise, R x (S union {null, …}).
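The case analysis on this slide reads naturally as a small decision function. The sketch below is illustrative only: the function name and the returned strings are hypothetical, and reading the slide's "R =x S" as a left outer join is an assumption based on the LEFT OUTER JOIN used on slide 26.

```python
def join_expression(s_c2_trusted, s_complete, s_c2_unique=False):
    """Pick the join expression for the provenance query.

    Assumes, as the slide does, that R.c1 is trusted and R is complete.
    """
    if s_c2_trusted:
        if s_complete:
            return "R JOIN S"
        if s_c2_unique:
            # Assumption: the slide's "R =x S" denotes a left outer join.
            return "R LEFT OUTER JOIN S"
        return "(R JOIN S) UNION (R x {null, ...})"
    if s_complete:
        return "R x S"
    return "R x (S UNION {null, ...})"

# Slide 18: Ranking.school_name is trusted and Ranking is complete.
print(join_expression(s_c2_trusted=True, s_complete=True))
# Slide 25: Ranking.school_name is trusted and unique, Ranking is not complete.
print(join_expression(s_c2_trusted=True, s_complete=False, s_c2_unique=True))
```

The all-null tuple in the last two cases stands for a record that could still be inserted into an incomplete S, as described on slide 23.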

25 Example: Why is UC Santa Cruz (UCSC) Not an Answer?
Assume that we trust
– Jobs(school_name, school_state)
– Ranking(school_name, rank)
– Completeness of Jobs
Ranking.school_name is unique

26 Computing Provenance of UCSC
User query:
    SELECT Jobs.school_name
    FROM Jobs, Ranking
    WHERE Jobs.job_opening = 'yes'
      AND Jobs.school_state = 'ca'
      AND Ranking.rank <= 25
      AND Jobs.school_name = Ranking.school_name;
Provenance query (legend on the slide: Trusted, Specifying non-answer, Hypothetical update):
    SELECT J.school_name,
           J.job_opening,   -- hypothetical update: -> yes
           R.school_name,   -- hypothetical update: -> J.school_name
           R.rank           -- hypothetical update: -> X
    FROM Jobs AS J LEFT OUTER JOIN Ranking AS R
      ON J.school_name = R.school_name AND R.rank <= 25
    WHERE J.school_state = 'ca'
      AND J.school_name = 'ucsc';   -- specifies the non-answer

27 Provenance of UCSC
Jobs.school_name | job_opening | Ranking.school_name | rank
ucsc             | yes         | null -> ucsc        | null -> X <= 25
UCSC is a potential answer.
Why not an answer? Because there is no ranking for ucsc.
How to become an answer? A new ranking tuple is inserted: (null -> ucsc, null -> X <= 25).
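The UCSC case can be reproduced with an outer join, which keeps the ucsc row even though Ranking has no matching tuple. This is a sketch under stated assumptions: it uses Python's sqlite3 module, toy rows, and a hypothetical missing_ranking_update helper, and it places the Ranking predicates in the ON clause so that the null-extended row is not filtered out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Jobs (school_name TEXT, school_state TEXT, job_opening TEXT);
    CREATE TABLE Ranking (school_name TEXT, rank INTEGER);
    INSERT INTO Jobs VALUES ('berkeley', 'ca', 'yes'), ('ucsc', 'ca', 'yes');
    INSERT INTO Ranking VALUES ('berkeley', 3);
""")

# Ranking is not assumed complete, so a missing Ranking tuple shows up
# as NULLs on the right-hand side of the outer join.
rows = conn.execute("""
    SELECT J.school_name, J.job_opening, R.school_name, R.rank
    FROM Jobs AS J
    LEFT OUTER JOIN Ranking AS R
      ON J.school_name = R.school_name AND R.rank <= 25
    WHERE J.school_state = 'ca' AND J.school_name = 'ucsc'
""").fetchall()

def missing_ranking_update(row):
    """Describe the Ranking tuple that would have to be inserted, if any."""
    j_name, _, r_name, _ = row
    if r_name is None:
        return f"insert Ranking tuple: (null -> {j_name}, null -> X <= 25)"
    return None

for row in rows:
    print(row, missing_ranking_update(row))
```

The NULL columns play the role of the all-null proxy tuple from slide 23: nothing is inserted, the report only states what an insertion would need to contain.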

28 Dataset for Experiment
– Extracted CS Ph.D. program ranks from the CRA web site (108 schools).
– Extracted job openings from department web sites (108 schools).
– Assumption: trust Jobs(school_name, school_state) and the completeness of Jobs.

29 Impact of Trust/Constraints on Provenance of UCSD
– No trust/constraints on Ranking: 109 provenance tuples
– Trust Ranking.school: 2 provenance tuples
– Trust Ranking.school and Ranking.school is unique: 1 provenance tuple

30 Impact of Trust/Constraints on Provenance Scalability
Scale up the database by a factor of 100 and compare the number of provenance tuples for UCSD.
– No trust/constraints on Ranking: x 100.
– Trust Ranking.school: no change.
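One plausible reading of the counts on slides 29 and 30 is that they follow directly from the join expressions on slide 24. The sketch below checks that reading on synthetic data; it is an interpretation, not the authors' experiment. The placeholder school names, the provenance_counts function, and the choice to scale only the Ranking table are assumptions.

```python
import sqlite3

def provenance_counts(num_schools=108):
    """Count provenance tuples for ucsd under three trust settings on Ranking."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Jobs (school_name TEXT, school_state TEXT, job_opening TEXT);
        CREATE TABLE Ranking (school_name TEXT, rank INTEGER);
        INSERT INTO Jobs VALUES ('ucsd', 'ca', 'no');
    """)
    # Synthetic ranking rows (NOT the CRA data): ucsd plus placeholder schools.
    names = ["ucsd"] + [f"school_{i:03d}" for i in range(1, num_schools)]
    conn.executemany(
        "INSERT INTO Ranking VALUES (?, ?)",
        [(name, position) for position, name in enumerate(names, start=1)],
    )
    jobs = "(SELECT school_name FROM Jobs WHERE school_name = 'ucsd') AS J"
    queries = {
        # Untrusted, incomplete Ranking: R x (S union {null}).
        "no_trust": f"""
            SELECT COUNT(*) FROM {jobs},
              (SELECT school_name FROM Ranking UNION ALL SELECT NULL) AS S""",
        # Trusted join attribute, not unique: (R join S) union (R x {null}).
        "trusted": f"""
            SELECT COUNT(*) FROM (
              SELECT J.school_name AS j_name, R.school_name AS r_name
              FROM {jobs} JOIN Ranking AS R ON J.school_name = R.school_name
              UNION ALL
              SELECT J.school_name, NULL FROM {jobs}
            ) AS P""",
        # Trusted and unique join attribute: left outer join.
        "trusted_unique": f"""
            SELECT COUNT(*) FROM {jobs}
            LEFT OUTER JOIN Ranking AS R ON J.school_name = R.school_name""",
    }
    return {label: conn.execute(sql).fetchone()[0] for label, sql in queries.items()}

print(provenance_counts(108))
print(provenance_counts(108 * 100))
```

With 108 ranking rows the untrusted case pairs ucsd with every Ranking tuple plus the all-null proxy, so its count grows with the table, while the trusted cases depend only on the tuples that match ucsd.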

31 Conclusion
– Proposed a mechanism for explaining a non-answer by using data, constraints, and query.
– Showed that trust and constraints are critical for getting focused provenance.
– Some opportunities for future work
  – Formal theory (e.g., in relational algebra)
  – Provenance ranking, etc.

32 THANKS!

33 Original Context: Condor Project
– Distributed computing research project
– Develops and maintains the Condor system software, and supports a production distributed computing facility at UW-Madison.

34 Data Management in Condor
[Diagram components: Execute Machine, Submit Machine, Scheduler, Executor, Central Manager, Collector, Negotiator. Scale labels: O(10) ~ O(1000) and O(1) ~ O(1000). Arrow labels: job, param, machine, job.]
Job and system state in local log files!

35 CondorDB to the Rescue
[Diagram components: CondorDB Database, Execute Machine, Schedule Machine, Scheduler, Executor, Central Manager, Collector, Negotiator, Query. Scale labels: O(10) ~ O(1000) and O(1) ~ O(1000). Arrow labels: job, param, machine, job.]

36 CondorDB Deployments
many more …

37 Imprecise Data in CondorDB
[Diagram: Condor nodes feed a Database, which is queried for Answers. Labels on the diagram: uncontrollable, unpredictable, autonomous; out of date, inconsistent, incorrect; incorrect, incomplete.]

38 Does this problem also occur in any other application?

39 Example: Why is UC Merced Not an Answer?
Assume that we trust
– Jobs(school_name, school_state)
– Ranking(school_name, rank)
– Completeness of Jobs and Ranking

40 Computing Provenance of UC Merced
User query:
    SELECT Jobs.school_name
    FROM Jobs, Ranking
    WHERE Jobs.job_opening = 'yes'
      AND Jobs.school_state = 'ca'
      AND Ranking.rank <= 25
      AND Jobs.school_name = Ranking.school_name;
Provenance query (legend on the slide: Trusted, Specifying non-answer, Hypothetical update):
    SELECT J.school_name,
           J.job_opening,   -- hypothetical update: -> yes
           R.school_name,
           R.rank
    FROM Jobs AS J, Ranking AS R
    WHERE J.school_state = 'ca'
      AND R.rank <= 25
      AND J.school_name = R.school_name
      AND J.school_name = 'uc merced';   -- specifies the non-answer

41 Provenance of UC Merced
Jobs.school_name | job_opening | Ranking.school_name | rank
(no rows)
UC Merced is not a potential answer.
Why not a potential answer? Because the trusted data (rank) does not satisfy the query.

42 Example: Relaxing Trust for Provenance of UC Merced
Assume that we trust
– Jobs(school_name, school_state)
– Ranking(school_name)
– Completeness of Jobs and Ranking

43 Computing Provenance of UC Merced
User query:
    SELECT Jobs.school_name
    FROM Jobs, Ranking
    WHERE Jobs.job_opening = 'yes'
      AND Jobs.school_state = 'ca'
      AND Ranking.rank <= 25
      AND Jobs.school_name = Ranking.school_name;
Provenance query (legend on the slide: Trusted, Specifying non-answer, Hypothetical update):
    SELECT J.school_name,
           J.job_opening,   -- hypothetical update: -> yes
           R.school_name,
           R.rank           -- hypothetical update: -> X <= 25
    FROM Jobs AS J, Ranking AS R
    WHERE J.school_state = 'ca'
      AND J.school_name = R.school_name
      AND J.school_name = 'uc merced';   -- specifies the non-answer

44 Provenance of UC Merced
Jobs.school_name | job_opening | Ranking.school_name | rank
uc merced        | no -> yes   | uc merced           | null -> X <= 25
UC Merced is a potential answer.
Why not an answer? Because job_opening = no and rank = null.
How to become an answer? job_opening: no -> yes and rank: null -> X <= 25.
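The effect of relaxing trust on rank can be seen by running both UC Merced provenance queries side by side. A minimal sketch, assuming Python's sqlite3 module, a single toy row per table with the rank stored as NULL, and a hypothetical hypothetical_updates helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Jobs (school_name TEXT, school_state TEXT, job_opening TEXT);
    CREATE TABLE Ranking (school_name TEXT, rank INTEGER);
    INSERT INTO Jobs VALUES ('uc merced', 'ca', 'no');
    INSERT INTO Ranking VALUES ('uc merced', NULL);
""")

BASE = """
    SELECT J.school_name, J.job_opening, R.school_name, R.rank
    FROM Jobs AS J, Ranking AS R
    WHERE J.school_state = 'ca'
      AND J.school_name = R.school_name
      AND J.school_name = 'uc merced'
"""

# Slide 39: rank is trusted, so the rank predicate stays in the query.
full_trust = conn.execute(BASE + " AND R.rank <= 25").fetchall()
# Slide 42: rank is no longer trusted, so the rank predicate is dropped.
relaxed = conn.execute(BASE).fetchall()

def hypothetical_updates(row):
    """List the changes to untrusted attributes that would make the row qualify."""
    _, job_opening, _, rank = row
    changes = []
    if job_opening != "yes":
        changes.append(f"job_opening: {job_opening} -> yes")
    if rank is None or rank > 25:
        shown = "null" if rank is None else rank
        changes.append(f"rank: {shown} -> X <= 25")
    return changes

print(full_trust)
for row in relaxed:
    print(row, hypothetical_updates(row))
```

Under full trust the NULL rank fails the retained predicate and the result is empty; once rank is untrusted the tuple is returned and both required changes are reported.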

