Discovering Queries based on Example Tuples


1 Discovering Queries based on Example Tuples
Yanyan Shen (1), Kaushik Chakrabarti (2), Surajit Chaudhuri (2), Bolin Ding (2), Lev Novik (2)
(1) National University of Singapore, (2) Microsoft Corporation

2 Complex Database Schema
Source                        #tables  #columns  #text columns  #foreign-key references
IMDB                               21       101             42                       22
Axon (customer-support)           100      1263            614                       63
CRM (customer-relationship)       347      5595           1074                      586

3 Challenge: Querying Complex Databases
Writing the SQL requires knowing the target schema, the relevant tables, and the join path:

    SELECT CustName, DevName, AppName
    FROM Customer, Sales, Device, App
    WHERE Sales.CustId = Customer.CustId
      AND Sales.DevId = Device.DevId
      AND Sales.AppId = App.AppId

Any help to formulate a SQL query?

4 Can Keyword Search Help?
Input: Mike ThinkPad Office (search for sales tuples)

Sales                          Customer
SId  CustId  DevId  AppId      CustId  CustName
s1   c1      d1     a1         c1      Mike Jones
s2   c2      d2     a2         c2      Mary Smith
s3   c3      d3     a3         c3      Bob Evans

Owner                          Employee
OId  EmpId  DevId  AppId       EmpId  EmpName
o1   e1     d1     a1          e1     Mike Stone
o2   e2     d3     a3          e2     Mary Lee
o3   e3     d2     a2          e3     Bob Nash

Device                         App
DevId  DevName                 AppId  AppName
d1     ThinkPad X1             a1     Office 2013
d2     iPad Air                a2     Evernote
d3     Nexus 7                 a3     Dropbox

ESR
ESRId  EmpId  AppId  Desc
sr1    e1     a1     Office crash
sr2    e2     a3     Dropbox can’t sync

Output: matched rows
Mike Jones   s1  ThinkPad X1  Office 2013
Mike Stone   o1  ThinkPad X1  Office 2013

Problems: Where is schema information? Ambiguity.

5 Our Proposal
Input (Example table): who bought which product with which app installed.

A     B         C
Mike  ThinkPad  Office
Mary  iPad
Bob             Dropbox

Output (Project join query): Sales (SId, CustId, DevId, AppId) joined with
  Customer (CustId, CustName), with A mapped to CustName
  Device (DevId, DevName), with B mapped to DevName
  App (AppId, AppName), with C mapped to AppName
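As a rough illustration of the proposal's output, the sketch below loads the Sales, Customer, Device, and App toy tables from the keyword-search slide into an in-memory SQLite database and runs the project join query over them. The helper names `build_toy_db` and `run_query` are hypothetical, and SQLite stands in for whatever database engine the authors used.

```python
import sqlite3

def build_toy_db():
    """Create an in-memory copy of four toy tables from the slides (illustrative only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Customer (CustId TEXT, CustName TEXT);
        CREATE TABLE Device   (DevId TEXT, DevName TEXT);
        CREATE TABLE App      (AppId TEXT, AppName TEXT);
        CREATE TABLE Sales    (SId TEXT, CustId TEXT, DevId TEXT, AppId TEXT);
        INSERT INTO Customer VALUES ('c1','Mike Jones'),('c2','Mary Smith'),('c3','Bob Evans');
        INSERT INTO Device   VALUES ('d1','ThinkPad X1'),('d2','iPad Air'),('d3','Nexus 7');
        INSERT INTO App      VALUES ('a1','Office 2013'),('a2','Evernote'),('a3','Dropbox');
        INSERT INTO Sales    VALUES ('s1','c1','d1','a1'),('s2','c2','d2','a2'),('s3','c3','d3','a3');
    """)
    return conn

# The project join query: Sales joined to Customer, Device, and App,
# projecting the columns that the example-table columns A, B, C map to.
PROJECT_JOIN_QUERY = """
    SELECT CustName, DevName, AppName
    FROM Sales
    JOIN Customer ON Sales.CustId = Customer.CustId
    JOIN Device   ON Sales.DevId  = Device.DevId
    JOIN App      ON Sales.AppId  = App.AppId
    ORDER BY Sales.SId
"""

def run_query(conn):
    """Return all rows produced by the project join query."""
    return conn.execute(PROJECT_JOIN_QUERY).fetchall()

if __name__ == "__main__":
    for row in run_query(build_toy_db()):
        print(row)
```

Each example row (for instance Mike, ThinkPad, Office) appears as keywords inside one of the returned rows, which is what makes this query a plausible answer for the example table.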

6 Roadmap
Motivation & proposal
Problem statement
Solution
  Candidate query generation
  Candidate query verification
    VerifyAll
    SimplePrune
    Filter
Experimental results
Conclusion

7 Problem Statement
Input: an example table T
Output: project join queries such that
  (valid) every row r in T is present in the query result
  (minimal) removing any edge or node from the join tree leads to an invalid query
[Figure: the example table with two join trees, one labeled "minimal" and one labeled "not minimal"; a Developer node appears in the figure]
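The validity condition can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes an empty example cell is represented as `None`, that the third example row is (Bob, empty, Dropbox) as read from the flattened slide text, and that a cell matches when its keyword occurs as a case-insensitive substring of the result value. The names `row_matches` and `is_valid` are hypothetical.

```python
def row_matches(example_row, result_row):
    """True if every non-empty example cell occurs in the corresponding result cell."""
    return all(
        cell is None or cell.lower() in value.lower()
        for cell, value in zip(example_row, result_row)
    )

def is_valid(example_table, query_result):
    """A query is valid if every example row is matched by some result row."""
    return all(
        any(row_matches(r, out) for out in query_result)
        for r in example_table
    )

# Example table; None marks an empty cell (assumed representation).
EXAMPLE_TABLE = [
    ("Mike", "ThinkPad", "Office"),
    ("Mary", "iPad", None),
    ("Bob", None, "Dropbox"),
]

# Results of the Sales-based and Owner-based joins over the toy tables.
SALES_RESULT = [
    ("Mike Jones", "ThinkPad X1", "Office 2013"),
    ("Mary Smith", "iPad Air", "Evernote"),
    ("Bob Evans", "Nexus 7", "Dropbox"),
]
OWNER_RESULT = [
    ("Mike Stone", "ThinkPad X1", "Office 2013"),
    ("Mary Lee", "Nexus 7", "Dropbox"),
    ("Bob Nash", "iPad Air", "Evernote"),
]

if __name__ == "__main__":
    print(is_valid(EXAMPLE_TABLE, SALES_RESULT))
    print(is_valid(EXAMPLE_TABLE, OWNER_RESULT))
```

The Owner-based result fails because no single row contains both Mary and iPad, even though each keyword appears somewhere in the result.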

8 Solution Overview
Pipeline: Example Table -> Candidate Query Generation -> Candidate Query Verification -> Result Queries
Candidate Query Generation consists of:
  Candidate Projection Column Retrieval
  Schema Graph Traversal
Supporting components:
  IR Engine maintaining an inverted index on text columns (CI)
  Database Schema
  Database Instance

9 Roadmap
Motivation & proposal
Problem statement
Solution
  Candidate query generation
  Candidate query verification
    VerifyAll
    SimplePrune
    Filter
Experimental results
Conclusion

10 Candidate Query Generation
Candidate Projection Column Retrieval
For each column in the example table, find the candidate projection columns in the database that satisfy the column constraint: the database column contains all the keywords in the example column.

Input column  Candidate projection columns
A             Customer.CustName, Employee.EmpName
B             Device.DevName
C             App.AppName, ESR.Desc
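The column constraint can be checked with a small inverted index, roughly as sketched below. The text-column contents are copied from the toy tables on the keyword-search slide; the whitespace tokenizer and the function names `build_inverted_index` and `candidate_columns` are assumptions made for illustration and are not the IR engine used by the authors.

```python
# Text columns of the toy database, keyed as Table.Column.
TEXT_COLUMNS = {
    "Customer.CustName": ["Mike Jones", "Mary Smith", "Bob Evans"],
    "Employee.EmpName": ["Mike Stone", "Mary Lee", "Bob Nash"],
    "Device.DevName": ["ThinkPad X1", "iPad Air", "Nexus 7"],
    "App.AppName": ["Office 2013", "Evernote", "Dropbox"],
    "ESR.Desc": ["Office crash", "Dropbox can't sync"],
}

def build_inverted_index(text_columns):
    """Map each lowercase token to the set of columns in which it occurs."""
    index = {}
    for column, values in text_columns.items():
        for value in values:
            for token in value.lower().split():
                index.setdefault(token, set()).add(column)
    return index

def candidate_columns(keywords, index):
    """Return the columns that contain every keyword of one example-table column."""
    sets = [index.get(k.lower(), set()) for k in keywords]
    return sorted(set.intersection(*sets)) if sets else []

if __name__ == "__main__":
    index = build_inverted_index(TEXT_COLUMNS)
    print("A:", candidate_columns(["Mike", "Mary", "Bob"], index))
    print("B:", candidate_columns(["ThinkPad", "iPad"], index))
    print("C:", candidate_columns(["Office", "Dropbox"], index))
```

On the toy data this reproduces the candidate projection columns listed on the slide for A, B, and C.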

11 Candidate Query Generation
Candidate Query Enumeration
Follow the candidate network generation algorithm [1]. No join is required!
  Generate join tree J
  Generate mapping φ
  Check minimal: every leaf node contains a column that is mapped by an input column
[Figure: five candidate queries CQ1 to CQ5. CQ1 is rooted at Sales and CQ2 to CQ5 at Owner, over the tables Customer, Employee, Device, App, and ESR, with input columns A, B, C attached to the mapped tables]

[1] V. Hristidis and Y. Papakonstantinou. Discover: Keyword search in relational databases. VLDB 2002.

12 Roadmap
Motivation & proposal
Problem statement
Solution
  Candidate query generation
  Candidate query verification
    VerifyAll
    SimplePrune
    Filter
Experimental results
Conclusion

13 Algorithm 1: VerifyAll
Iterate over candidate queries in the outer loop and rows in ET in the inner loop (or vice versa), and verify whether a candidate query CQ contains a row r in its output. A candidate is valid iff it contains all the rows in ET.

(CQ2, 1)-verification:

    SELECT TOP 1 *
    FROM Owner, Employee, Device, App
    WHERE Owner.EmpId = Employee.EmpId
      AND Owner.DevId = Device.DevId
      AND Owner.AppId = App.AppId
      AND CONTAINS(EmpName, 'Mike')
      AND CONTAINS(DevName, 'ThinkPad')
      AND CONTAINS(AppName, 'Office')

A non-empty result implies CQ2 satisfies row 1.

(CQ2, 2)-verification:

    SELECT TOP 1 *
    FROM Owner, Employee, Device, App
    WHERE Owner.EmpId = Employee.EmpId
      AND Owner.DevId = Device.DevId
      AND Owner.AppId = App.AppId
      AND CONTAINS(EmpName, 'Mary')
      AND CONTAINS(DevName, 'iPad')

An empty result implies CQ2 fails for row 2.

Performing (CQ, r)-verification is expensive!
VerifyAll is wasteful, as most candidate queries are invalid!
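A (CQ, r)-verification for CQ2 can be reproduced on the toy data roughly as follows. This is a sketch under stated assumptions: SQLite replaces the original engine, so `LIKE '%keyword%'` stands in for `CONTAINS` and `LIMIT 1` for `TOP 1`; empty cells are `None`; the third example row is (Bob, empty, Dropbox) as read from the flattened slide; and the names `build_owner_db`, `verify`, and `verify_all` are hypothetical.

```python
import sqlite3

def build_owner_db():
    """In-memory copy of the Owner, Employee, Device, and App toy tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Employee (EmpId TEXT, EmpName TEXT);
        CREATE TABLE Device   (DevId TEXT, DevName TEXT);
        CREATE TABLE App      (AppId TEXT, AppName TEXT);
        CREATE TABLE Owner    (OId TEXT, EmpId TEXT, DevId TEXT, AppId TEXT);
        INSERT INTO Employee VALUES ('e1','Mike Stone'),('e2','Mary Lee'),('e3','Bob Nash');
        INSERT INTO Device   VALUES ('d1','ThinkPad X1'),('d2','iPad Air'),('d3','Nexus 7');
        INSERT INTO App      VALUES ('a1','Office 2013'),('a2','Evernote'),('a3','Dropbox');
        INSERT INTO Owner    VALUES ('o1','e1','d1','a1'),('o2','e2','d3','a3'),('o3','e3','d2','a2');
    """)
    return conn

# Join tree of CQ2 and its mapping phi(A), phi(B), phi(C).
CQ2_JOIN = """
    FROM Owner, Employee, Device, App
    WHERE Owner.EmpId = Employee.EmpId
      AND Owner.DevId = Device.DevId
      AND Owner.AppId = App.AppId
"""
CQ2_MAPPING = ["EmpName", "DevName", "AppName"]

def verify(conn, row):
    """(CQ2, r)-verification: does the CQ2 output contain a row matching r?"""
    sql = "SELECT 1 " + CQ2_JOIN
    params = []
    for column, keyword in zip(CQ2_MAPPING, row):
        if keyword is not None:          # empty cells add no predicate
            sql += f" AND {column} LIKE ?"
            params.append(f"%{keyword}%")
    sql += " LIMIT 1"                    # one witness row is enough
    return conn.execute(sql, params).fetchone() is not None

def verify_all(conn, example_table):
    """CQ2 is valid only if every example row is verified."""
    return all(verify(conn, r) for r in example_table)

if __name__ == "__main__":
    conn = build_owner_db()
    print(verify(conn, ("Mike", "ThinkPad", "Office")))
    print(verify(conn, ("Mary", "iPad", None)))
```

Row 1 finds a witness (Mike Stone owns a ThinkPad X1 with Office 2013), while row 2 finds none because Mary Lee's device in Owner is a Nexus 7.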

14 Opportunity of Pruning
Failure dependency: (CQ2, 2) fails implies (CQ5, 2) fails.

(CQ2, 2)-verification:

    SELECT TOP 1 *
    FROM Owner, Employee, Device, App
    WHERE Owner.EmpId = Employee.EmpId
      AND Owner.DevId = Device.DevId
      AND Owner.AppId = App.AppId
      AND CONTAINS(EmpName, 'Mary')
      AND CONTAINS(DevName, 'iPad')

(CQ5, 2)-verification:

    SELECT TOP 1 *
    FROM Owner, Employee, Device, App, ESR
    WHERE Owner.EmpId = Employee.EmpId
      AND Owner.DevId = Device.DevId
      AND Owner.AppId = App.AppId
      AND ESR.AppId = App.AppId
      AND CONTAINS(EmpName, 'Mary')
      AND CONTAINS(DevName, 'iPad')

Verifying candidates with smaller join trees is more beneficial!

15 Algorithm 2: SimplePrune
Order candidate queries by increasing join tree size.
Keep a list of the CQ-row verifications performed so far that failed.
Iterate over the ordered candidate queries in the outer loop and over rows in the inner loop.
When verifying a candidate Q, check whether its failure is implied by the verifications in the list. If so, prune Q immediately. Otherwise, verify Q for all the rows.
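The steps above can be sketched as follows. This is an illustrative reading of the slide, not the authors' code: a candidate is modeled as a dict with a set of join edges and a mapping tuple, a failure on a sub-join tree is taken to imply failure on a larger tree when both map the row's non-empty cells to the same columns, and verification stops at a candidate's first failing row. The candidate definitions (in particular the assumption that CQ5 maps C to ESR.Desc), the `OUTCOMES` table, and all function names are hypothetical.

```python
def implied_failure(cq, row_idx, row, failed):
    """True if a recorded failure on a sub-join tree implies (cq, row) fails too."""
    for edges, mapping, failed_idx in failed:
        if failed_idx != row_idx or not edges <= cq["edges"]:
            continue
        # The mappings must agree on every non-empty cell of this row.
        if all(mapping[c] == cq["mapping"][c]
               for c, cell in enumerate(row) if cell is not None):
            return True
    return False

def simple_prune(candidates, example_table, verify):
    """Return (names of valid candidates, number of verifications executed)."""
    failed = []          # (edges, mapping, row index) of failed verifications
    valid = []
    verifications = 0
    for cq in sorted(candidates, key=lambda c: len(c["edges"])):
        if any(implied_failure(cq, i, row, failed)
               for i, row in enumerate(example_table)):
            continue     # pruned without touching the database
        for i, row in enumerate(example_table):
            verifications += 1
            if not verify(cq, row):
                failed.append((cq["edges"], cq["mapping"], i))
                break
        else:
            valid.append(cq["name"])
    return valid, verifications

EXAMPLE_TABLE = [("Mike", "ThinkPad", "Office"), ("Mary", "iPad", None), ("Bob", None, "Dropbox")]
OWNER_TREE = frozenset({("Owner", "Employee"), ("Owner", "Device"), ("Owner", "App")})
CANDIDATES = [
    {"name": "CQ1",
     "edges": frozenset({("Sales", "Customer"), ("Sales", "Device"), ("Sales", "App")}),
     "mapping": ("Customer.CustName", "Device.DevName", "App.AppName")},
    {"name": "CQ2", "edges": OWNER_TREE,
     "mapping": ("Employee.EmpName", "Device.DevName", "App.AppName")},
    {"name": "CQ5", "edges": OWNER_TREE | {("ESR", "App")},
     "mapping": ("Employee.EmpName", "Device.DevName", "ESR.Desc")},
]

# Hard-coded verification outcomes standing in for real SQL verifications.
OUTCOMES = {("CQ1", r): True for r in EXAMPLE_TABLE}
OUTCOMES[("CQ2", EXAMPLE_TABLE[0])] = True
OUTCOMES[("CQ2", EXAMPLE_TABLE[1])] = False

def toy_verify(cq, row):
    return OUTCOMES[(cq["name"], row)]

if __name__ == "__main__":
    print(simple_prune(CANDIDATES, EXAMPLE_TABLE, toy_verify))
```

In this toy run CQ5 is never verified: the recorded failure of CQ2 on row 2 already implies that CQ5 fails on row 2.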

16 Observation
Limited pruning!
[Figure: the example table and candidate queries, annotated "limited pruning!"]

17 Opportunity
Evaluating a common sub-structure on a certain row may prune multiple invalid candidates!

18 Filter
Candidate query mapping: φ(A) = Employee.EmpName, φ(B) = Device.DevName, φ(C) = App.AppName
Filter: sub-join tree Owner, Employee, Device with mapping φ'(A) = Employee.EmpName, φ'(B) = Device.DevName, φ'(C) = *
Filter evaluation query; filter success and failure:
  F1 (row: Mike, Thinkpad, Office) succeeds
  F2 (row: Mary, iPad) fails

19 Dependency Properties of Filters
Here F1 is the sub-join tree Owner, Employee, Device (A, B) on row (Mary, iPad), and F2 is the join tree Owner, Employee, Device, App (A, B, C) on the same row.
Filter-candidate dependency: F1 fails implies CQ2 is invalid.
Inter-filter failure dependency: F1 fails implies F2 fails.
Inter-filter success dependency: F2 succeeds implies F1 succeeds.
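The two inter-filter dependencies amount to subset checks on join trees, as the sketch below suggests. It assumes a filter is identified by a pair (set of join edges, row number) and that all filters compared share a consistent column mapping; the function name `implied_outcome` and the edge encoding are illustrative choices, not taken from the slides.

```python
def implied_outcome(filt, known):
    """Infer a filter's outcome from already-evaluated filters.

    filt  : (edges, row) where edges is a frozenset of join edges
    known : dict mapping (edges, row) -> True (succeeded) or False (failed)
    Returns True, False, or None when nothing can be inferred.
    """
    edges, row = filt
    for (k_edges, k_row), succeeded in known.items():
        if k_row != row:
            continue                      # dependencies hold per row only
        if not succeeded and k_edges <= edges:
            return False                  # a sub-tree failed, so this fails
        if succeeded and edges <= k_edges:
            return True                   # a super-tree succeeded, so this succeeds
    return None

# Sub-join tree Owner-Employee-Device and the full join tree of CQ2.
J1 = frozenset({("Owner", "Employee"), ("Owner", "Device")})
CQ2_TREE = J1 | {("Owner", "App")}

if __name__ == "__main__":
    print(implied_outcome((CQ2_TREE, 2), {(J1, 2): False}))
    print(implied_outcome((J1, 1), {(CQ2_TREE, 1): True}))
    print(implied_outcome((J1, 3), {(J1, 2): False}))
```

The filter-candidate dependency is the special case where the larger tree is a candidate's full join tree: one failed sub-tree filter marks the candidate invalid.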

20 Adaptive Filter Selection
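Sub-join trees:
  J1: Owner, Employee, Device (A, B)
  J2: Owner, App, Device (B, C)
  J3: Owner, Employee, App (A, C)
  J4: ESR, App, Employee (C, A)
Filters: (J1,1) (J1,2) (J1,3) (J2,1) (J2,2) (J2,3) (J3,1) (J3,2) (J3,3) (J4,1) (J4,2) (J4,3)
Candidates: CQ2, CQ3, CQ4
[Figure: one selection of filters resolves all candidates with 5 evaluations!]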

21 Adaptive Filter Selection
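Sub-join trees:
  J1: Owner, Employee, Device (A, B)
  J2: Owner, App, Device (B, C)
  J3: Owner, Employee, App (A, C)
  J4: ESR, App, Employee (C, A)
Filters: (J1,1) (J1,2) (J1,3) (J2,1) (J2,2) (J2,3) (J3,1) (J3,2) (J3,3) (J4,1) (J4,2) (J4,3)
Candidates: CQ2, CQ3, CQ4
[Figure: a better selection of filters resolves all candidates with 2 evaluations!]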

22 Filter Selection Problem
Given the set of filters for all the candidate queries, select a set of filters with minimum cost such that all the candidate queries are verified as valid or invalid after evaluating the selected filters.
Cost of F_i: the number of joins in the join tree of F_i
Problem complexity: NP-hard
Greedy algorithm with an approximation ratio (the ratio itself is missing from the transcript)
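One way to picture an adaptive greedy strategy is the loop below. It is a generic benefit-per-cost heuristic written for illustration, and it should not be read as the authors' algorithm, whose selection rule and approximation ratio are not given in the transcript. It assumes filters are (join edges, row) pairs with consistent mappings, that cost equals the number of join edges, and that `evaluate` runs one filter against the database; all names and the toy data are hypothetical.

```python
def greedy_filter_selection(candidates, sub_trees, num_rows, evaluate):
    """candidates: name -> frozenset of join edges; returns (status, total cost)."""
    trees = set(sub_trees) | set(candidates.values())
    status = {name: None for name in candidates}        # None means undecided
    confirmed = {name: set() for name in candidates}    # rows passed on the full tree
    evaluated = set()
    total_cost = 0
    while any(s is None for s in status.values()):
        undecided = [n for n, s in status.items() if s is None]
        best, best_score = None, -1.0
        for edges in trees:
            # Benefit: undecided candidates that this filter would kill if it fails.
            covered = sum(1 for n in undecided if edges <= candidates[n])
            if covered == 0:
                continue
            score = covered / max(1, len(edges))         # benefit per join
            for row in range(num_rows):
                if (edges, row) not in evaluated and score > best_score:
                    best, best_score = (edges, row), score
        edges, row = best
        evaluated.add(best)
        total_cost += len(edges)
        if evaluate(edges, row):
            # Success only confirms candidates whose full tree was evaluated.
            for n in undecided:
                if candidates[n] == edges:
                    confirmed[n].add(row)
                    if len(confirmed[n]) == num_rows:
                        status[n] = True
        else:
            # Failure invalidates every candidate containing this sub-tree.
            for n in undecided:
                if edges <= candidates[n]:
                    status[n] = False
    return status, total_cost

J1 = frozenset({("Owner", "Employee"), ("Owner", "Device")})
CANDIDATES = {
    "CQ1": frozenset({("Sales", "Customer"), ("Sales", "Device"), ("Sales", "App")}),
    "CQ2": J1 | {("Owner", "App")},
    "CQ5": J1 | {("Owner", "App"), ("ESR", "App")},
}
KNOWN_FAILURES = {(J1, 1)}   # toy ground truth: J1 fails on the second row (index 1)

def toy_evaluate(edges, row):
    """A filter fails if it contains a sub-tree known to fail on that row."""
    return not any(f <= edges and r == row for f, r in KNOWN_FAILURES)

if __name__ == "__main__":
    print(greedy_filter_selection(CANDIDATES, [J1], 3, toy_evaluate))
```

In the toy run a single failed evaluation of the shared sub-tree J1 invalidates both CQ2 and CQ5, which is the effect the adaptive filter selection slides illustrate.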

23 Roadmap
Motivation
Problem statement
Solution
  Candidate query generation
  Candidate query verification
    VerifyAll
    SimplePrune
    Filter
Experimental results
Conclusion

24 Experiment Settings
Dataset: IMDB
Example table generation
  Parameters: #rows, #columns, sparsity, value length for non-empty cells
Implementations
  VerifyAll
  SimplePrune
  Filter
  Weave [2]
Measures
  Number of verifications performed
  Execution time

[2] L. Qian, M. J. Cafarella, and H. V. Jagadish. Sample-driven schema mapping. SIGMOD 2012.

25 Results on Various Example Tables
Vary #rows:
  Filter performs 5X fewer verifications than VerifyAll and 2X fewer than SimplePrune.
  Filter is robust to #rows, i.e., it requires a similar number of verifications as #rows varies.
  Filter runs 4X faster than VerifyAll and 3X faster than SimplePrune.

26 Comparison with Weave
Filter requires 10X fewer verifications than Weave.
Filter runs 4X faster than Weave.

27 Conclusion
Develop a new search interface for discovering queries
Address challenges in query discovery
  Verify candidate queries efficiently
  Filter selection problem
  Greedy strategy

28 Thanks! Q&A

