Discovering Queries based on Example Tuples
Yanyan Shen (1), Kaushik Chakrabarti (2), Surajit Chaudhuri (2), Bolin Ding (2), Lev Novik (2)
(1) National University of Singapore, (2) Microsoft Corporation
Complex Database Schema
Challenge: Querying Complex Databases
SQL: SELECT CustName, DevName, AppName FROM Customer, Sales, Device, App WHERE Sales.CustId=Customer.CustId AND Sales.DevId=Device.DevId AND Sales.AppId=App.AppId
Callouts on the query: target schema, relevant tables, join path
Any help to formulate a SQL query?
Can Keyword Search Help?
Example database:
Customer (CustId, CustName): (c1, Mike Jones), (c2, Mary Smith), (c3, Bob Evans)
Employee (EmpId, EmpName): (e1, Mike Stone), (e2, Mary Lee), (e3, Bob Nash)
Device (DevId, DevName): (d1, ThinkPad X1), (d2, iPad Air), (d3, Nexus 7)
App (AppId, AppName): (a1, Office 2013), (a2, Evernote), (a3, Dropbox)
Sales (SId, CustId, DevId, AppId): (s1, c1, d1, a1), (s2, c2, d2, a2), (s3, c3, d3, a3)
Owner (OId, EmpId, DevId, AppId): (o1, e1, d1, a1), (o2, e2, d3, a3), (o3, e3, d2, a2)
ESR (ESRId, EmpId, AppId, Desc): (sr1, e1, a1, Office crash), (sr2, e2, a3, Dropbox can’t sync)
Input*: Mike ThinkPad Office
Output: matched rows
–Mike Jones, s1, ThinkPad X1, Office 2013
–Mike Stone, o1, ThinkPad X1, Office 2013
Problems: Where is schema information? Ambiguity
*search for sales tuples
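Our Proposal
Input (example table)*:
A | B | C
Mike | ThinkPad | Office
Mary | iPad | (empty)
Bob | (empty) | Dropbox
*Who bought which product with which app installed.
Output (project-join query): Sales (SId, CustId, DevId, AppId) joined with Customer (CustId, CustName), Device (DevId, DevName) and App (AppId, AppName), projecting CustName for column A, DevName for column B and AppName for column C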
Candidate Query Generation
Candidate Projection Column Retrieval
–For each column in the example table, find the candidate projection columns in the database that satisfy the column constraint: the database column contains all the keywords in the example column
Input column | Candidate projection columns
A | Customer.CustName, Employee.EmpName
B | Device.DevName
C | App.AppName, ESR.Desc
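The column constraint can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes a keyword "is contained" in a column when some cell has it as a whole, case-insensitive token, and it hard-codes the text columns of the example database from the earlier slide. The names `DATABASE`, `EXAMPLE_COLUMNS`, `contains_keyword` and `candidate_projection_columns` are hypothetical.

```python
# Text columns of the toy database from the slides (ID columns hold no keywords).
DATABASE = {
    "Customer.CustName": ["Mike Jones", "Mary Smith", "Bob Evans"],
    "Employee.EmpName": ["Mike Stone", "Mary Lee", "Bob Nash"],
    "Device.DevName": ["ThinkPad X1", "iPad Air", "Nexus 7"],
    "App.AppName": ["Office 2013", "Evernote", "Dropbox"],
    "ESR.Desc": ["Office crash", "Dropbox can't sync"],
}

# Keywords of each example-table column, with empty cells dropped.
EXAMPLE_COLUMNS = {
    "A": ["Mike", "Mary", "Bob"],
    "B": ["ThinkPad", "iPad"],
    "C": ["Office", "Dropbox"],
}


def contains_keyword(values, keyword):
    """True if some cell has `keyword` as a whole token (assumed semantics)."""
    needle = keyword.lower()
    return any(needle in value.lower().split() for value in values)


def candidate_projection_columns(database, example_columns):
    """Map each example column to the database columns containing all its keywords."""
    result = {}
    for name, keywords in example_columns.items():
        result[name] = [
            column
            for column, values in database.items()
            if all(contains_keyword(values, kw) for kw in keywords)
        ]
    return result


candidates = candidate_projection_columns(DATABASE, EXAMPLE_COLUMNS)
for name, columns in candidates.items():
    print(name, columns)
```

On this toy data the sketch reproduces the table on the slide: two candidates for A, one for B and two for C. A real system would presumably answer the containment test from a full-text index rather than by scanning every cell.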
Candidate Query Generation
Candidate Query Enumeration
–Follow the candidate network generation algorithm*
[Figure: join trees of the candidate queries CQ1 to CQ5 over the tables Sales, Customer, Device, App, Owner, Employee and ESR, annotated with the example columns A, B and C.]
No join is required!
*V. Hristidis and Y. Papakonstantinou. Discover: Keyword search in relational databases. VLDB
Algorithm 1: VerifyAll
[Figure: the example table with rows (Mike, ThinkPad, Office), (Mary, iPad) and (Bob, Dropbox).]
Performing (CQ, r)-verification is expensive!
VerifyAll is wasteful, as most candidate queries are invalid!
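To make the cost of (CQ, r)-verification concrete, here is a minimal sketch using Python's built-in `sqlite3` module on the toy database from the earlier slide. It rests on several assumptions that the slides do not confirm: a row is verified by one existence query per (candidate, row) pair, keywords are matched with SQL `LIKE` substring matching, empty cells impose no condition, and the loop does not stop at the first failing row. The two candidates are hand-written versions of the Sales-based and Owner-based join trees, and `verify` and `verify_all` are hypothetical names.

```python
import sqlite3

# Toy database copied from the slide's example tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customer(CustId TEXT, CustName TEXT);
CREATE TABLE Employee(EmpId TEXT, EmpName TEXT);
CREATE TABLE Device(DevId TEXT, DevName TEXT);
CREATE TABLE App(AppId TEXT, AppName TEXT);
CREATE TABLE Sales(SId TEXT, CustId TEXT, DevId TEXT, AppId TEXT);
CREATE TABLE Owner(OId TEXT, EmpId TEXT, DevId TEXT, AppId TEXT);
INSERT INTO Customer VALUES ('c1','Mike Jones'),('c2','Mary Smith'),('c3','Bob Evans');
INSERT INTO Employee VALUES ('e1','Mike Stone'),('e2','Mary Lee'),('e3','Bob Nash');
INSERT INTO Device VALUES ('d1','ThinkPad X1'),('d2','iPad Air'),('d3','Nexus 7');
INSERT INTO App VALUES ('a1','Office 2013'),('a2','Evernote'),('a3','Dropbox');
INSERT INTO Sales VALUES ('s1','c1','d1','a1'),('s2','c2','d2','a2'),('s3','c3','d3','a3');
INSERT INTO Owner VALUES ('o1','e1','d1','a1'),('o2','e2','d3','a3'),('o3','e3','d2','a2');
""")

# Two candidate project-join queries: a join expression plus the columns for A, B, C.
CANDIDATES = {
    "CQ1": ("Sales JOIN Customer USING (CustId) JOIN Device USING (DevId) "
            "JOIN App USING (AppId)", ["CustName", "DevName", "AppName"]),
    "CQ2": ("Owner JOIN Employee USING (EmpId) JOIN Device USING (DevId) "
            "JOIN App USING (AppId)", ["EmpName", "DevName", "AppName"]),
}

# Example table; None marks an empty cell.
EXAMPLE_ROWS = [("Mike", "ThinkPad", "Office"), ("Mary", "iPad", None), ("Bob", None, "Dropbox")]


def verify(conn, candidate, row):
    """One (CQ, r)-verification: does some joined tuple contain all keywords of the row?"""
    join_expr, columns = candidate
    conditions, params = [], []
    for column, cell in zip(columns, row):
        if cell is not None:
            conditions.append(f"{column} LIKE ?")
            params.append(f"%{cell}%")
    where = " AND ".join(conditions) or "1"
    sql = f"SELECT 1 FROM {join_expr} WHERE {where} LIMIT 1"
    return conn.execute(sql, params).fetchone() is not None


def verify_all(conn, candidates, rows):
    """Verify every candidate against every row; each call runs a join query."""
    outcomes = {name: [verify(conn, cand, row) for row in rows]
                for name, cand in candidates.items()}
    valid = [name for name, results in outcomes.items() if all(results)]
    return valid, outcomes


valid, outcomes = verify_all(conn, CANDIDATES, EXAMPLE_ROWS)
print(valid, outcomes)
```

Every call to `verify` executes a multi-way join, so the number of such calls is the quantity the later algorithms try to reduce.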
Opportunity of Pruning
Failure dependency: if (CQ2, 2) fails, then (CQ5, 2) fails
Verifying candidates with smaller join trees is more beneficial!
Algorithm 2: SimplePrune
Order the candidate queries by increasing join tree size
Keep a list of the CQ-row verifications performed so far that failed
Iterate over the ordered candidate queries in the outer loop and over the rows in the inner loop
–Before verifying candidate Q, check whether its failure is implied by the verifications in the list. If so, prune Q immediately. Otherwise, verify Q for all the rows.
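The steps above can be sketched as follows. This is a simplified reading, not the authors' code: a join tree is reduced to its set of tables, a recorded failure is assumed to imply the failure of any candidate with the same column mapping whose table set contains the failed one, and verification of a candidate stops at its first failing row. `Candidate`, `implies_failure`, `simple_prune`, the candidate names and the `toy_verify` oracle (which stands in for running real join queries) are all hypothetical.

```python
from typing import Callable, FrozenSet, List, NamedTuple, Tuple


class Candidate(NamedTuple):
    name: str
    tables: FrozenSet[str]      # join tree, simplified to its set of tables
    mapping: Tuple[str, ...]    # database column chosen for A, B, C


def implies_failure(failed: Candidate, other: Candidate) -> bool:
    """Assumed dependency: a failed smaller tree with the same column
    mapping dooms every candidate whose join tree contains it."""
    return failed.mapping == other.mapping and failed.tables <= other.tables


def simple_prune(candidates: List[Candidate], num_rows: int,
                 verify: Callable[[Candidate, int], bool]) -> List[str]:
    ordered = sorted(candidates, key=lambda c: len(c.tables))  # smaller trees first
    failed: List[Tuple[Candidate, int]] = []   # failed (CQ, row) verifications
    valid: List[str] = []
    for query in ordered:
        if any(implies_failure(f, query) for f, _ in failed):
            continue                            # pruned without any verification
        for row in range(1, num_rows + 1):      # rows are 1-based, as on the slides
            if not verify(query, row):
                failed.append((query, row))
                break
        else:
            valid.append(query.name)
    return valid


OWNER_MAP = ("Employee.EmpName", "Device.DevName", "App.AppName")
SALES_MAP = ("Customer.CustName", "Device.DevName", "App.AppName")
candidates = [
    Candidate("Q_big", frozenset({"ESR", "Owner", "Employee", "Device", "App"}), OWNER_MAP),
    Candidate("Q_owner", frozenset({"Owner", "Employee", "Device", "App"}), OWNER_MAP),
    Candidate("Q_sales", frozenset({"Sales", "Customer", "Device", "App"}), SALES_MAP),
]

# Toy oracle with hard-coded per-row outcomes; it records every call made.
OUTCOMES = {"Q_sales": [True, True, True],
            "Q_owner": [True, False, False],
            "Q_big": [True, False, False]}
calls = []


def toy_verify(query: Candidate, row: int) -> bool:
    calls.append((query.name, row))
    return OUTCOMES[query.name][row - 1]


valid = simple_prune(candidates, 3, toy_verify)
print(valid, calls)
```

In this toy run the larger candidate is never verified, because the failure of the smaller Owner-based candidate on row 2 already implies its failure.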
Observation
Limited pruning!
Opportunity
Evaluating a common sub-structure on a certain row may prune multiple invalid candidates!
Filter
[Figure: filters F1 and F2, each pairing the sub-join tree Owner, Employee, Device (columns A, B) with one example row: (Mike, ThinkPad, Office) for F1 and (Mary, iPad) for F2.]
Dependency Properties of Filters
Filter-candidate dependency
Inter-filter failure dependency
Inter-filter success dependency
[Figure: two filters, F1 and F2, on the example row (Mary, iPad): one over Owner, Employee, Device with columns A, B, the other extended with App and column C.]
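One way to read the three dependencies is as inference rules over filter outcomes. The sketch below is an interpretation, not the authors' definitions: a filter is modeled as a set of tables, a set of (example column, database column) pairs and a row; a failure is assumed to propagate to every larger structure on the same row, and a success to every smaller one. A candidate query is treated as a filter that covers all example columns. `Filter`, `is_substructure` and `infer` are hypothetical names.

```python
from typing import Dict, FrozenSet, NamedTuple, Optional, Tuple


class Filter(NamedTuple):
    tables: FrozenSet[str]                # sub-join tree, simplified to its tables
    columns: FrozenSet[Tuple[str, str]]   # (example column, database column) pairs
    row: int                              # 1-based example row


def is_substructure(small: Filter, big: Filter) -> bool:
    """True if `small` uses a subset of `big`'s tables and columns on the same row."""
    return (small.row == big.row
            and small.tables <= big.tables
            and small.columns <= big.columns)


def infer(target: Filter, known: Dict[Filter, bool]) -> Optional[bool]:
    """Derive the outcome of `target` from evaluated filters, or None if unknown."""
    for evaluated, succeeded in known.items():
        if not succeeded and is_substructure(evaluated, target):
            return False   # failure propagates to larger structures
        if succeeded and is_substructure(target, evaluated):
            return True    # success propagates to smaller structures
    return None


A = ("A", "Employee.EmpName")
B = ("B", "Device.DevName")
C = ("C", "App.AppName")

# One evaluated filter: Owner-Employee-Device on row 2 (Mary, iPad) fails,
# because on the toy data Mary Lee owns a Nexus 7, not an iPad.
f_small = Filter(frozenset({"Owner", "Employee", "Device"}), frozenset({A, B}), 2)
known = {f_small: False}

# Two candidates containing that sub-structure, and one unrelated candidate.
cq_owner = Filter(frozenset({"Owner", "Employee", "Device", "App"}), frozenset({A, B, C}), 2)
cq_owner_esr = Filter(frozenset({"Owner", "Employee", "Device", "App", "ESR"}),
                      frozenset({A, B, C}), 2)
cq_sales = Filter(frozenset({"Sales", "Customer", "Device", "App"}),
                  frozenset({("A", "Customer.CustName"), B, C}), 2)

print(infer(cq_owner, known), infer(cq_owner_esr, known), infer(cq_sales, known))
```

Here a single filter evaluation settles two candidates at once, while the unrelated Sales-based candidate still has to be checked by other means.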
Adaptive Filter Selection
[Figure: candidate queries CQ2, CQ3 and CQ4 and their sub-join trees J1 to J4 over the tables Owner, Employee, Device, App and ESR; the filters (J1,1) to (J4,3) pair each sub-join tree with each of the three example rows.]
5 evaluations!
Adaptive Filter Selection (continued)
[Figure: the same candidate queries, sub-join trees and filters as on the previous slide.]
2 evaluations!
Experiment Settings
Dataset: IMDB
Example table generation
–Parameters: #rows, #columns, sparsity, value length for non-empty cells
Implementations
–VerifyAll
–SimplePrune
–Filter
–Weave*
Measures
–Number of verifications performed
–Execution time
*L. Qian, M. J. Cafarella, and H. V. Jagadish. Sample-driven schema mapping. SIGMOD 2012.
Results on Various Example Tables
Vary #rows
–Filter performs 5X fewer verifications than VerifyAll and 2X fewer than SimplePrune
–Filter is robust to #rows, i.e. it requires a similar number of verifications as #rows varies
–Filter runs 4X faster than VerifyAll and 3X faster than SimplePrune
Comparison with Weave
–Filter requires 10X fewer verifications than Weave
–Filter runs 4X faster than Weave
Conclusion
Develop a new search interface for discovering queries
Address challenges in query discovery
–Verify candidate queries efficiently
–Filter selection problem
–Greedy strategy