Efficient IR-Style Keyword Search over Relational Databases
Vagelis Hristidis, University of California, San Diego
Luis Gravano, Columbia University
Yannis Papakonstantinou, University of California, San Diego

Motivation
- Keyword search is the dominant information discovery method in documents.
- An increasing amount of data is stored in databases.
- Plain text coexists with structured data.

Motivation
- Until recently, information discovery in databases required:
  - knowledge of the schema
  - knowledge of a query language (e.g., SQL)
  - knowledge of the role of the keywords
- Goal: enable IR-style keyword search over DBMSs without the above requirements.

IR-Style Search over DBMSs
- IR keyword search is well developed for document search.
- Modern DBMSs offer IR-style keyword search over individual text attributes.
- What is the equivalent of a document in databases?

Example – Complaints Database Schema

Example – Complaints Database Data
Complaints:
tupleId | prodId | custId | date | comments
c1 | p121 | c3232 | 6-30-2002 | “disk crashed after just one week of moderate use on an IBM Netvista X41”
c2 | p131 | c3131 | 7-3-2002 | “lower-end IBM Netvista caught fire, starting apparently with disk”
c3 | p131 | c3143 | 8-3-2002 | “IBM Netvista unstable with Maxtor HD”
Products:
tupleId | prodId | manufacturer | model
p1 | p121 | “Maxtor” | “D540X”
p2 | p131 | “IBM” | “Netvista”
p3 | p141 | “Tripplite” | “Smart 700VA”
Customers:
tupleId | custId | name | occupation
u1 | c3232 | “John Smith” | “Software Engineer”
u2 | c3131 | “Jack Lucas” | “Architect”
u3 | c3143 | “John Mayer” | “Student”

Keyword Query Semantics (definition of “document” in databases)
Keywords are:
- in same tuple
- in same relation
- in tuples connected through primary-foreign key relationships
Score of result:
- distance of keywords within a tuple
- distance between keywords in terms of primary-foreign key connections
- IR-style score of result tree

Example – Keyword Query [Maxtor Netvista]
(Complaints, Products, and Customers data as in the Complaints Database example.)
Results:
(1) c3
(2) p2 → c3
(3) p1 → c1

Result of Keyword Query
Result is a tree T of tuples where:
- each edge corresponds to a primary-foreign key relationship
- no tuple of T is redundant (minimality)
- “AND” query semantics: every query keyword appears in T
- “OR” query semantics: some query keywords might be missing from T

Score of Result T
- Combining function: Score combines the scores of the attribute values of T.
  - One reasonable choice: $\mathrm{Score}(T) = \sum_{a \in T} \mathrm{Score}(a) / \mathrm{size}(T)$
- Attribute value scores: Score(a) is calculated using the DBMS's IR “datablades”.

Shortcomings of Prior Work [DBXplorer, DISCOVER]
- Simplistic ranking methods (e.g., based only on the size of the connecting tree) that ignore well-studied IR ranking strategies
- No straightforward extension to improve efficiency by returning just the top-k results
- Poor handling of free-text attributes

Example – Keyword Query [Maxtor Netvista]
(Complaints, Products, and Customers data as in the Complaints Database example.)
Attribute value scores:
- Complaints: c1 = 1/3, c2 = 1/3, c3 = 4/3
- Products: p1 = 1, p2 = 1, p3 = 0
Results:
(1) c3: Score(c3) = 4/3
(2) p2 → c3: Score(p2 → c3) = (1 + 4/3)/2 = 7/6
(3) p1 → c1: Score(p1 → c1) = (1 + 1/3)/2 = 4/6
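
The three scores on this slide can be reproduced with a few lines of Python. This is only a sketch: it assumes that size(T) counts the tuples in the result tree and that each tuple contributes exactly one scored attribute value, which matches the arithmetic shown but is not stated on the slide. The names `combined_score` and `tuple_score` are illustrative.

```python
from fractions import Fraction

def combined_score(attribute_scores):
    """Sum of attribute-value scores divided by the number of scores (assumed to equal size(T))."""
    return sum(attribute_scores, Fraction(0)) / len(attribute_scores)

# Attribute-value scores as listed on the slide, one per tuple.
tuple_score = {
    "c1": Fraction(1, 3), "c2": Fraction(1, 3), "c3": Fraction(4, 3),
    "p1": Fraction(1), "p2": Fraction(1), "p3": Fraction(0),
}

# The three result trees from the slide, written as tuples of tuple ids.
results = [("p1", "c1"), ("c3",), ("p2", "c3")]

def score_of(tree):
    return combined_score([tuple_score[t] for t in tree])

# Rank the trees by descending combined score.
ranked = sorted(results, key=score_of, reverse=True)
for tree in ranked:
    print(" -> ".join(tree), score_of(tree))
```

Exact fractions are used so the printed values line up with the slide's 4/3, 7/6, and 4/6 (which reduces to 2/3).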

Architecture
Query: [Maxtor Netvista]
Non-free tuple sets:
- Complaints^Q = [(c3,comments,1.33), (c1,comments,0.33), (c2,comments,0.33)]
- Products^Q = [(p1,manufacturer,1), (p2,model,1)]
Candidate networks:
- Complaints^Q
- Products^Q
- Complaints^Q ⋈ Products^Q
- Complaints^Q ⋈ Customers^{} ⋈ Complaints^Q
- Complaints^Q ⋈ Products^{} ⋈ Complaints^Q
- ...
Parameterized SQL queries:
  SELECT * FROM Complaints^Q c, Products^Q p WHERE c.prodId = p.prodId AND c.prodId = ? AND c.custId = ?;
  ...
Results: c3; p2 → c3; p1 → c1
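
The parameterized join on this slide can be tried out against an in-memory SQLite database. The sketch below is an assumption-laden illustration, not the system's implementation: the table names `ComplaintsQ` and `ProductsQ`, the `score` column, and the averaging expression are all hypothetical, and the tuple sets are simply materialized by hand from the values on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical materialization of the two non-free tuple sets.
conn.executescript("""
CREATE TABLE ComplaintsQ (tupleId TEXT, prodId TEXT, custId TEXT, score REAL);
CREATE TABLE ProductsQ (tupleId TEXT, prodId TEXT, score REAL);
""")
conn.executemany("INSERT INTO ComplaintsQ VALUES (?, ?, ?, ?)", [
    ("c3", "p131", "c3143", 4 / 3),
    ("c1", "p121", "c3232", 1 / 3),
    ("c2", "p131", "c3131", 1 / 3),
])
conn.executemany("INSERT INTO ProductsQ VALUES (?, ?, ?)", [
    ("p1", "p121", 1.0),
    ("p2", "p131", 1.0),
])

# Same shape as the slide's query: join on prodId, with two bound parameters.
probe = conn.execute(
    "SELECT c.tupleId, p.tupleId FROM ComplaintsQ c, ProductsQ p "
    "WHERE c.prodId = p.prodId AND c.prodId = ? AND c.custId = ?",
    ("p131", "c3143"),
).fetchall()

# Unparameterized join over the whole candidate network, ranked by average score.
rows = conn.execute(
    "SELECT p.tupleId, c.tupleId, (p.score + c.score) / 2.0 AS tree_score "
    "FROM ComplaintsQ c JOIN ProductsQ p ON c.prodId = p.prodId "
    "ORDER BY tree_score DESC, c.tupleId"
).fetchall()
print(probe)
print(rows)
```

The full join returns every joining pair, including pairs that contain only one of the two keywords; no “AND” filtering step is applied in this sketch.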

Candidate Network Generator
- Finds all trees of tuple sets (free or non-free) that may produce a result, based on DISCOVER's CN generator [VLDB 2002]
- Uses a single non-free tuple set for each relation:
  - allows “OR” semantics
  - fewer CNs are generated
  - an extra filtering step is required for “AND” semantics

Candidate Network Generator Example
For query [Maxtor Netvista], CNs:
- Complaints^Q
- Products^Q
- Complaints^Q ⋈ Products^Q
- Complaints^Q ⋈ Customers^{} ⋈ Complaints^Q
- Complaints^Q ⋈ Products^{} ⋈ Complaints^Q
Non-CNs:
- Complaints^Q ⋈ Customers^{} ⋈ Complaints^{}
- Products^Q ⋈ Complaints^{} ⋈ Products^Q

Execution Algorithms
- Users usually want only the top-k results, so submitting one SQL query to the DBMS for each CN (the Naïve algorithm) is inefficient.
- When queries produce very few results, the Naïve algorithm is efficient, since it fully exploits the DBMS.
- Monotonic combining functions: if results T and T' have the same schema and Score(a_i) ≤ Score(a'_i) for every attribute, then Score(T) ≤ Score(T').
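
A minimal sketch of the Naïve strategy and of the monotonicity property follows. It assumes each CN's SQL query has already been run and its results are available as (score, tree) pairs; the scores are illustrative values borrowed from the example executions in this deck, and `naive_top_k`, `average_score`, and `results_per_cn` are hypothetical names.

```python
import heapq

def average_score(attribute_scores):
    """One monotonic combining function: the mean of the attribute-value scores."""
    return sum(attribute_scores) / len(attribute_scores)

def naive_top_k(results_per_cn, k):
    """Merge the complete result lists of all CNs and keep the k highest-scoring trees."""
    all_results = [r for results in results_per_cn.values() for r in results]
    return heapq.nlargest(k, all_results, key=lambda r: r[0])

# Illustrative per-CN results, as if each list came back from one SQL query.
results_per_cn = {
    "Products^Q": [(9.0, "p1"), (6.0, "p2")],
    "Complaints^Q": [(7.0, "c2"), (5.0, "c1")],
    "Complaints^Q join Products^Q": [
        (average_score([9, 5]), "p1 -> c1"),
        (average_score([6, 7]), "p2 -> c2"),
    ],
}

top3 = naive_top_k(results_per_cn, 3)
print(top3)

# Monotonicity: raising one attribute score never lowers the combined score.
assert average_score([5, 9]) <= average_score([7, 9])
```

Every CN is evaluated in full before any ranking happens, which is why this approach pays off only when the total number of results is small.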

Sparse Algorithm: Example Execution
CN | results | score | MFS
Products^Q | p1 | 9 | 9
Complaints^Q | c2 | 7 | 7
Complaints^Q ⋈ Products^Q | p1 → c1 | (9+5)/2 = 7 | (9+7)/2 = 8
Best when the query produces at most a few results.
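
One way to read this example is that each CN carries an upper bound (the MFS column) on the score of any result it can produce, and CNs whose bound cannot beat the current top-k are never evaluated. The sketch below follows that reading; it is an assumption rather than the authors' exact procedure, the slide does not define MFS, and only the results listed on the slide are included. `sparse_top_k` and `cns` are hypothetical names.

```python
def sparse_top_k(cns, k):
    """cns: list of (bound, evaluate) pairs, where evaluate() returns (score, tree) results.

    CNs are visited in descending order of their bound; evaluation stops once the
    current k-th best score is at least the bound of the next CN.
    """
    top = []
    evaluated = 0
    for bound, evaluate in sorted(cns, key=lambda cn: cn[0], reverse=True):
        if len(top) >= k and top[k - 1][0] >= bound:
            break  # no remaining CN can improve the top-k
        top = sorted(top + evaluate(), key=lambda r: r[0], reverse=True)[:k]
        evaluated += 1
    return top, evaluated

# Bounds and results taken from the slide's table.
cns = [
    (9.0, lambda: [(9.0, "p1")]),        # Products^Q
    (7.0, lambda: [(7.0, "c2")]),        # Complaints^Q
    (8.0, lambda: [(7.0, "p1 -> c1")]),  # Complaints^Q joined with Products^Q
]

top1, evaluated1 = sparse_top_k(cns, 1)
top2, evaluated2 = sparse_top_k(cns, 2)
print(top1, evaluated1)
print(top2, evaluated2)
```

With k = 1 only Products^Q is evaluated, because its result already scores 9 and the next bound is 8. With k = 2 the join CN is evaluated as well, and Complaints^Q is skipped because its bound of 7 cannot exceed the second-best score of 7.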

Single Pipelined Algorithm: Example Execution
CN: Complaints^Q ⋈ Products^Q
At each step, get the next tuple from the most promising non-free tuple set.
step | MPFS | results queue (score) | output
1 | Max[(5+9)/2, (7+6)/2] = 7 | (empty) | (none)
2 | Max[(1+9)/2, (7+6)/2] = 6.5 | p1 → c1 (7) | p1 → c1
3 | Max[(1+9)/2, (7+1)/2] = 5 | p2 → c2 (6.5) | p2 → c2
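
The MPFS values in this example can be reproduced with a small simulation for a two-relation CN. This is a sketch under several assumptions: tuple sets are sorted by descending score, the combining function is the average, MPFS for a tuple set pairs its next unretrieved tuple with the top tuple of the other set, and joins are looked up in a hand-written set instead of being issued as parameterized SQL. The tuple ids `c_other` and `p_other` are hypothetical stand-ins for the tuples with score 1, which the slide does not name.

```python
def single_pipelined(complaints, products, joins, k):
    """complaints, products: non-empty lists of (tuple_id, score), sorted by descending score.
    joins: set of (product_id, complaint_id) pairs that join on the key relationship."""
    sets = [complaints, products]
    pos = [0, 0]                      # tuples retrieved so far from each set
    queue, output, trace = [], [], []

    def mpfs(i):
        # Best score reachable by a result that uses an unretrieved tuple of set i.
        if pos[i] == len(sets[i]):
            return float("-inf")
        return (sets[i][pos[i]][1] + sets[1 - i][0][1]) / 2

    while len(output) < k:
        bounds = [mpfs(0), mpfs(1)]
        bound = max(bounds)
        trace.append(bound)
        # Emit queued results that no future result can beat.
        queue.sort(key=lambda r: -r[0])
        while queue and queue[0][0] >= bound and len(output) < k:
            output.append(queue.pop(0))
        if len(output) >= k or bound == float("-inf"):
            break
        i = bounds.index(bound)       # most promising tuple set
        tid, score = sets[i][pos[i]]
        pos[i] += 1
        # Join the new tuple with the already retrieved tuples of the other set.
        for other_id, other_score in sets[1 - i][:pos[1 - i]]:
            pair = (tid, other_id) if i == 1 else (other_id, tid)
            if pair in joins:
                queue.append(((score + other_score) / 2, pair))
    return output, trace

complaints = [("c2", 7), ("c1", 5), ("c_other", 1)]
products = [("p1", 9), ("p2", 6), ("p_other", 1)]
joins = {("p1", "c1"), ("p2", "c2")}

output, trace = single_pipelined(complaints, products, joins, k=2)
print(output)
print(trace)
```

The first two trace entries correspond to retrieving the top tuple of each set; the last three are the 7, 6.5, and 5 shown in the example, and the two results are emitted in the same order.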

Global Pipelined Algorithm: Example Execution
- Global MPFS = max(MPFS_i) over all CNs C_i
- Best when the query produces many results.

Hybrid Algorithm
- Estimate the number of results.
  - For “OR” semantics, use the DBMS estimator.
  - For “AND” semantics, probabilistically adjust the DBMS estimator.
- If at most a few query results are expected, use the Sparse Algorithm.
- If many query results are expected, use the Global Pipelined Algorithm.
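
The selection rule can be sketched as a single comparison. The slide only distinguishes “at most a few” from “many” results, so the threshold used here (a hypothetical `few_factor` multiplied by k) is an assumption, and the estimate itself is taken as an input rather than computed; the probabilistic adjustment for “AND” semantics is not modeled.

```python
def choose_algorithm(estimated_results: float, k: int, few_factor: float = 1.0) -> str:
    """Pick an execution algorithm from an estimated result count.

    few_factor is a hypothetical tuning knob: estimates up to few_factor * k
    are treated as "at most a few" results.
    """
    if estimated_results <= few_factor * k:
        return "sparse"
    return "global_pipelined"

print(choose_algorithm(estimated_results=5, k=10))
print(choose_algorithm(estimated_results=500, k=10))
```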

Related Work
- DBXplorer [ICDE 2002], DISCOVER [VLDB 2002]
  - Similar three-step architecture
  - Score = 1/size(T)
  - Only AND semantics
  - No straightforward extension for efficient top-k execution
- BANKS [ICDE 2002], Goldman et al. [VLDB 1998]
  - Database viewed as graph
  - No use of schema
- Florescu et al. [WWW 2000], XQuery Full-Text
- Ilyas et al. [VLDB 2003], J* algorithm [VLDB 2001]
  - Top-k algorithms for join queries

Experiments – DBLP Dataset
- DBLP contains few citation edges, so synthetic citation edges were added such that the average number of citations is 20.
- The final dataset is 56 MB.
- Experiments were run over a state-of-the-art commercial RDBMS.
- Schema legend: C = Conference, Y = Year, P = Paper, A = Author

OR Semantics: Effect of Maximum Allowed CN Size
Figure: average execution time of 100 2-keyword top-10 queries

OR Semantics: Effect of Number of Objects Requested k
Figure: average execution time of 100 2-keyword queries with maximum candidate-network size of 6

OR Semantics: Effect of Number of Query Keywords
Figure: average execution time of 100 top-10 queries with maximum candidate-network size of 6

Conclusions
- Extend IR-style ranking to databases.
- Exploit text-search capabilities of modern DBMSs to generate results of higher quality.
- Support both “AND” and “OR” semantics.
- Achieve substantial speedup over prior work via pipelined top-k query processing algorithms.

Questions?

Compare Algorithms with Respect to Result Size
Figure: max CN size = 6, top-10, 2 keywords; panels for OR-semantics and AND-semantics

Ranking Functions
- Proposed algorithms support tuple monotone combining functions.
- That is, if results T and T' have the same schema and Score(a_i) ≤ Score(a'_i) for every attribute, then Score(T) ≤ Score(T').
