Efficient IR-Style Keyword Search over Relational Databases
12 December 2005
Seminar on Databases and the Internet, The Hebrew University of Jerusalem, Winter 2006

Efficient IR-Style Keyword Search over Relational Databases (SDBI 05)

Introduction
This presentation is based mainly on the work of Hristidis, Gravano, and Papakonstantinou. That work presents several efficient algorithms for IR-style (information retrieval) keyword search, built on the DISCOVER architecture.

Contents
- Introduction
- Goal and Motivation
- Framework and examples
- Architecture
- Algorithms
- Experimental Results
- Criticism and Conclusion

Goal and Motivation
We present a detailed framework and methods for IR-style keyword search over relational databases.
What is information-retrieval keyword search in general? Mainly, it's this… (slide image not preserved)

Goal and Motivation
…But not always:

    SELECT * FROM Complaints C
    WHERE CONTAINS (C.comment, 'disk crash', 1) > 0
    ORDER BY score(1) DESC

prodID | custID | date | comment
p121 | c3232 | 6-30-2002 | "Disk crashed after one week of moderate use on an IBM Netvista X41"
p131 | c3131 | 7-3-2002 | "lower-end IBM Netvista caught fire, starting apparently with disk Crash"

Goal and Motivation
Current status:
- RDBMSs (such as Oracle) provide querying capabilities for text attributes, provided that an exact column is specified.
- Only AND semantics are used.
- Ranking functions are limited.
- Known query-processing strategies are inefficient (and sometimes even infeasible).

Goal and Motivation
In particular, we'd like:
- Efficient ways to generate the "top-k" results according to some form of ranking.
- The use of both AND and OR semantics (not just the default AND) when retrieving results.
- Assembling keyword occurrences from multiple attributes, perhaps in "unforeseen" ways, without needing to specify columns.

Goal and Motivation
We would like to apply the same (or similar) methods and rules that apply in this world (slide image not preserved):
- Prioritizing: k best results first
- Efficient searching
- Use of AND and OR semantics

Goal and Motivation
Why should we care?
- Keyword queries require little or no knowledge of the database semantics.
- Ranking results correctly (and returning only relevant tuples) is, of course, highly desirable.
- An efficient implementation should reduce the querying process to a fraction of the time of a naïve implementation.

Framework
Query model: a database with n relations R_1, …, R_n. Relations may be linked by primary key to foreign key constraints. The schema graph G is a directed graph in which, for each primary key to foreign key relationship between R_i and R_j, there is an edge (i, j). The example schema:
- Customers (custId, name, occupation)
- Complaints (prodId, custId, date, comments)
- Products (prodId, model, manufacturer)

Framework
A possible instance of this schema:

Complaints
tupleID | prodID | custID | date | comment
c1 | p121 | c3232 | 6-30-2002 | "Disk crashed after one week of moderate use on an IBM Netvista X41"
c2 | p131 | c3131 | 7-3-2002 | "lower-end IBM Netvista caught fire, starting apparently with disk"
c3 | p131 | c3143 | 8-3-2002 | "IBM Netvista unstable with Maxtor HD"

Products
tupleID | prodID | manufacturer | model
p1 | p121 | "Maxtor" | "D540X"
p2 | p131 | "IBM" | "Netvista"
p3 | p141 | "Tripplite" | "Smart 700VA"

Customers
tupleID | custID | name | occupation
u1 | c3232 | "John Smith" | "Software engineer"
u2 | c3131 | "John L." | "Architect"
u3 | c3143 | "Jack M." | "student"

Framework
Joining trees of tuples: given a schema graph G for a database, a joining tree of tuples T is a tree of tuples in which every edge (t_i, t_j), with t_i ∈ R_i and t_j ∈ R_j, satisfies two properties:
(1) (R_i, R_j) ∈ G (the schema graph we talked about);
(2) t_i ⋈ t_j ∈ R_i ⋈ R_j.
The size(T) of a joining tree is the number of tuples in T.
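The definition can be checked mechanically. Below is a minimal Python sketch, assuming tuples are plain dictionaries and that Complaints references Products via prodID and Customers via custID; the names `SCHEMA_EDGES`, `joins` and `is_joining_tree` are illustrative and do not come from the paper.

```python
# Assumed schema graph: (referencing relation, referenced relation) -> join attribute.
SCHEMA_EDGES = {
    ("Complaints", "Products"): "prodID",
    ("Complaints", "Customers"): "custID",
}

def joins(t_i, t_j):
    """True if the two tuples belong to adjacent relations and agree on the join attribute."""
    for a, b in ((t_i, t_j), (t_j, t_i)):
        attr = SCHEMA_EDGES.get((a["rel"], b["rel"]))
        if attr is not None and a[attr] == b[attr]:
            return True
    return False

def is_joining_tree(tuples, edges):
    """tuples: dict id -> tuple; edges: list of (id, id) pairs."""
    if len(edges) != len(tuples) - 1:
        return False  # a tree on n nodes has exactly n - 1 edges
    if not all(joins(tuples[u], tuples[v]) for u, v in edges):
        return False
    # With n - 1 edges, being connected implies being a tree.
    adjacent = {tid: set() for tid in tuples}
    for u, v in edges:
        adjacent[u].add(v)
        adjacent[v].add(u)
    seen, stack = set(), [next(iter(tuples))]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(adjacent[node] - seen)
    return len(seen) == len(tuples)

c2 = {"rel": "Complaints", "prodID": "p131", "custID": "c3131"}
p1 = {"rel": "Products", "prodID": "p121"}
p2 = {"rel": "Products", "prodID": "p131"}
u2 = {"rel": "Customers", "custID": "c3131"}

tree_ok = is_joining_tree({"p2": p2, "c2": c2, "u2": u2}, [("p2", "c2"), ("c2", "u2")])
tree_bad = is_joining_tree({"p1": p1, "c2": c2}, [("p1", "c2")])
```

Here `tree_ok` describes the tree p2, c2, u2 from the example instance, while `tree_bad` pairs c2 with a product it does not reference. The size of a valid tree is simply `len(tuples)`.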

Framework
A joining tree of tuples for our example: c2 joined with p2 (on prodID) and with u2 (on custID).

Products: p2 | p131 | "IBM" | "Netvista"
Complaints: c2 | p131 | c3131 | 7-3-2002 | "lower-end IBM Netvista caught fire, starting apparently with disk"
Customers: u2 | c3131 | "John L." | "Architect"

Framework
"Top-k" keyword query
A top-k keyword query is a list of keywords Q = {w_1, …, w_m}. The result of such a query is a list of the k joining trees of tuples T with the highest score(T, Q), such that:
(1) each tree T in the result is minimal: it has no zero-scored leaf;
(2) no tuple appears more than once in a joining tree of tuples.

Framework
For example, over the Complaints, Products and Customers instance above, the query Q = {Netvista, Maxtor} should yield the following results:
- c1 (by itself)
- p2 ⋈ c3
- p1 ⋈ c1

Framework
Score(a_i, Q)
To evaluate the relevance of a tree of tuples we start from a single-attribute IR-style relevance scoring function over an attribute value a_i, with the following parameters:
- tf: term frequency of a word w (w ∈ Q) in a_i
- N: number of tuples in a_i's relation
- df: number of tuples in a_i's relation that contain the word w
- dl, avdl: the size of the attribute value and the average attribute-value size
- s: a constant
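The formula itself did not survive on this slide. The sketch below assumes the pivoted-normalization tf-idf weighting that these symbols usually belong to; the exact functional form, the default `s=0.2` and the function name `attribute_score` are assumptions for illustration, not a transcription of the slide.

```python
import math

def attribute_score(tf_by_word, df_by_word, n_tuples, dl, avdl, s=0.2):
    """Assumed single-attribute score: sum over query words found in the attribute.

    tf_by_word: term frequency of each query word in this attribute value
    df_by_word: number of tuples in the relation containing each word
    n_tuples:   N, the number of tuples in the relation
    dl, avdl:   this attribute value's size and the average size
    s:          normalization constant (assumed default)
    """
    score = 0.0
    for word, tf in tf_by_word.items():
        if tf <= 0:
            continue  # word absent from this attribute value
        tf_part = 1.0 + math.log(1.0 + math.log(tf))   # dampened term frequency
        length_norm = (1.0 - s) + s * dl / avdl         # penalize long values
        idf_part = math.log((n_tuples + 1) / df_by_word[word])  # rare words count more
        score += tf_part / length_norm * idf_part
    return score

# One occurrence of a word that appears in 1 of 3 tuples, average-length value.
example = attribute_score({"netvista": 1}, {"netvista": 1}, n_tuples=3, dl=10, avdl=10)
```

Under these assumptions a word that is absent contributes nothing, so a tuple gets a positive score only if it contains at least one query keyword.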

Framework
Combined Score(T, Q)
Another function is used to combine the single-attribute scores into a final score for the tree; the functions shown here are only possible candidates. The framework can handle many combining functions, as long as they satisfy the tuple monotonicity property: for two trees T and T' produced by the same CN, if every tuple of T' scores no higher than the corresponding tuple of T, then the combined score of T' is no higher than that of T.
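A small Python sketch of one candidate combining function, the sum of the tuple scores divided by the size of the tree. This choice is an assumption, picked because it matches the arithmetic of the Sparse example later in the deck, e.g. (1 + 1.33)/2; the helper `dominates` is hypothetical and only illustrates tuple monotonicity.

```python
def combine(attribute_scores, size):
    """Combined score of a joining tree: sum of its tuple scores over its size."""
    return sum(attribute_scores) / size

def dominates(scores_t, scores_t_prime):
    """True if every tuple of T' scores no higher than the matching tuple of T."""
    return all(b <= a for a, b in zip(scores_t, scores_t_prime))

# Two trees from the same two-relation CN (scores are illustrative).
t = [1.33, 1.0]
t_prime = [0.33, 1.0]

# Tuple monotonicity: T dominates T' tuple by tuple, so it also wins after combining.
monotone = dominates(t, t_prime) and combine(t_prime, 2) <= combine(t, 2)
```

Dividing by the size means that free tuple sets, which contribute no score, lower the combined score of larger trees.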

Framework
A candidate network (CN) can be thought of as a join expression that involves tuple sets plus, perhaps, "base" relations that contain no occurrences of the query keywords but help to connect relations that do.
For example, for Q = {IBM, Architect}, the CN Products^Q ⋈ Complaints^{} ⋈ Customers^Q connects p2 (p131, "IBM", "Netvista") to u2 (c3131, "John L.", "Architect") through a Complaints tuple.

Framework
For example, all the candidate networks (with scores) for Q = {Maxtor, Netvista}, where P = Products, C = Complaints and U = Customers (figure not preserved).

Architecture
What follows is a quick overview of the system architecture needed to implement top-k keyword queries efficiently. The description relies heavily on the DISCOVER architecture, but is not specific to any OS or RDBMS.

Architecture
The architecture consists of:
- an IR Engine
- a CN generator
- an Execution Engine
Diagram: User → Keywords → IR Engine (uses the IR index) → Tuple Sets → Candidate Network Generator (uses the Database Schema) → Candidate Networks → Execution Engine → Parameterized SQL queries → Database.

Architecture: IR Engine
Modern RDBMSs include IR-style text-indexing functionality (e.g. Oracle Text). It is useful to think of the IR engine as an indexer that assigns a score > 0 to tuples that contain occurrences of the keywords.

Architecture: IR Engine
The proposed architecture exploits this functionality: upon arrival of a query Q, it generates for each relation R the tuple set R^Q = { t ∈ R | Score(t, Q) > 0 }. The tuple sets are then sorted by decreasing score and passed on to the next module.
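As a rough Python sketch of this step: keep the tuples with a positive score and sort them best first. The scorer `toy_score` is a hypothetical stand-in for the RDBMS text index (one point per query keyword found), not the ranking function an actual IR engine would use.

```python
def tuple_set(relation, score):
    """Return [(score, tuple_id), ...] for tuples scoring above zero, best first."""
    scored = ((score(tuple_id, text), tuple_id) for tuple_id, text in relation)
    positive = [pair for pair in scored if pair[0] > 0]
    return sorted(positive, key=lambda pair: pair[0], reverse=True)

def toy_score(query):
    """Hypothetical scorer: counts how many query keywords occur in the text."""
    def score(_tuple_id, text):
        words = text.lower().split()
        return sum(1 for w in query if w in words)
    return score

products = [
    ("p1", "Maxtor D540X"),
    ("p2", "IBM Netvista"),
    ("p3", "Tripplite Smart 700VA"),
]
products_q = tuple_set(products, toy_score({"maxtor", "netvista"}))
```

For Q = {Maxtor, Netvista} this leaves p1 and p2 in the Products tuple set and drops p3, which contains neither keyword.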

Architecture: CN Generator
- Receives the non-empty tuple sets (such as C^Q, P^Q) and the general schema graph.
- Attempts to join those sets, perhaps using "base" relations (U^{}, remember?), and so generates Candidate Networks (CNs)!

Architecture: CN Generator
The generator also receives a parameter M that bounds the number of tuple sets (free or non-free) participating in a CN. Why is this bound needed? Because the number of CNs might be exponential in the query size!

Architecture: CN Generator
The generated CNs MUST satisfy:
- No leaf of a CN is a "free" tuple set (such as P^{}).
- No construct of the form R, S, R in which the two R tuples would necessarily be the same tuple: a tree of tuples cannot include duplicate tuples!

Architecture: Execution Engine
This is the module that actually contacts the RDBMS's query tools in order to generate the top-k results. This is our focus, as it is the hardest part to implement efficiently.

Sparse algorithm example
Recall the database instance from before (the Complaints, Products and Customers tables), with the query Q = {Maxtor, Netvista}.

Architecture - demonstration
The user submits the keywords {Maxtor, Netvista}. They pass through the IR Engine (tuple sets), the Candidate Network Generator (candidate networks) and the Execution Engine (parameterized SQL queries against the database).
We now turn our attention to how THIS is done.

First of all, what do we have so far?
- An architecture that constructs candidate networks from keyword queries, using "black box" functions of modern RDBMSs and some given score functions.
- A notion of what should be done in order to produce the keyword query results.
So, how would you do it?

Naïve algorithm
The naïve approach: simply issue an SQL query for each CN. The results of all the queries are then combined in a sort-merge manner. The main problem is runtime. What characteristic(s) can we use to make our algorithm more efficient?

Naïve algorithm is too slow
Remember that the IR engine returns tuple sets ranked in DESCENDING order with respect to the Score() function. So, by applying the combining function to the top-ranked tuple of each non-free tuple set in CN_i, we get an ESTIMATE of the maximum possible score of CN_i (MPS_i). We can use this knowledge to disregard "unfruitful" CNs!

Sparse Algorithm
- For every CN_i, compute MPS_i.
- If MPS_i does not exceed the lowest of the best k matches found so far, DISCARD CN_i.
- Otherwise, join the tuples in CN_i as usual.
As a further optimization, CNs are evaluated in ASCENDING SIZE order: smaller CNs are evaluated first, so "heavy" CNs may be discarded after only a short calculation!
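The steps above can be sketched in Python as follows. This is a minimal illustration, not the paper's implementation: each CN is a dictionary whose `evaluate` callable stands in for the parameterized SQL query, MPS is taken to be the sum of the top tuple scores divided by the CN size (an assumption consistent with the worked example), and the data reuses the scores of the Q = {Maxtor, Netvista} example.

```python
import heapq

def sparse_top_k(candidate_networks, k):
    """Evaluate CNs in ascending size order, skipping those whose MPS is too low."""
    top = []          # min-heap of (score, tree); top[0] is the k-th best so far
    evaluated = []    # names of the CNs that were actually executed
    for cn in sorted(candidate_networks, key=lambda c: c["size"]):
        mps = sum(cn["top_scores"]) / cn["size"]
        if len(top) == k and mps <= top[0][0]:
            continue  # MPS does not exceed the k-th best result: discard this CN
        evaluated.append(cn["name"])
        for tree, score in cn["evaluate"]():
            if len(top) < k:
                heapq.heappush(top, (score, tree))
            elif score > top[0][0]:
                heapq.heapreplace(top, (score, tree))
    return sorted(top, reverse=True), evaluated

cns = [
    {"name": "CQ", "size": 1, "top_scores": [1.33],
     "evaluate": lambda: [("C3", 1.33), ("C1", 0.33), ("C2", 0.33)]},
    {"name": "PQ", "size": 1, "top_scores": [1.0],
     "evaluate": lambda: [("P1", 1.0), ("P2", 1.0)]},
    {"name": "CQ-PQ", "size": 2, "top_scores": [1.33, 1.0],
     "evaluate": lambda: [("C3-P2", (1.33 + 1.0) / 2),
                          ("C1-P1", (0.33 + 1.0) / 2),
                          ("C2-P2", (0.33 + 1.0) / 2)]},
    # Size-3 CNs through a free tuple set; never executed in this run.
    {"name": "CQ-P{}-CQ", "size": 3, "top_scores": [1.33, 1.33], "evaluate": lambda: []},
    {"name": "CQ-U{}-CQ", "size": 3, "top_scores": [1.33, 1.33], "evaluate": lambda: []},
]

results, evaluated = sparse_top_k(cns, k=2)
```

With k = 2 the two size-3 networks are pruned by their MPS alone, which is where the saving over the naïve algorithm comes from.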

Sparse algorithm example
Remember this database (the Complaints, Products and Customers instance), with the query Q = {Maxtor, Netvista}?

Sparse algorithm example
Suppose we want to find the top-2 results for the query Q = {Maxtor, Netvista} on our existing database. The CN generator, with M = 3, supplies our execution engine with the candidate networks (list not preserved). We start off with C^Q; let's take a look:

Sparse algorithm example
C^Q consists of all the Complaints tuples (with different scores, of course):
- C3 ("IBM Netvista unstable with Maxtor HD"): score 1.33
- C2 ("lower-end IBM Netvista caught fire, starting apparently with disk"): score 0.33
- C1 ("Disk crashed after one week of moderate use on an IBM Netvista X41"): score 0.33

Sparse algorithm example
We start off with C^Q. There is no need to calculate MPS(C^Q), since we already know everything (we got these exact results from the IR engine), but we do it anyway: MPS(C^Q) = 1.33.
Top-2 results queue: C3 = 1.33, C1 = 0.33.
We now turn to examine the CN P^Q...

Sparse algorithm example
These are the relevant tuples that P^Q consists of:
- P1 ("Maxtor", "D540X"): score 1
- P2 ("IBM", "Netvista"): score 1

Sparse algorithm example
Let's run the algorithm on P^Q. We calculate MPS(P^Q) = 1, which exceeds the lowest score in the queue (0.33), so P^Q might still yield a result that belongs in the top-k queue. Evaluating it replaces C1 with P1.
Top-2 results queue: C3 = 1.33, P1 = 1.
We now turn to examine the CN C^Q ⋈ P^Q...

Sparse algorithm example
These are the joins of C^Q ⋈ P^Q:
- C3 ⋈ P2: score 1.17
- C2 ⋈ P2: score 0.67
- C1 ⋈ P1: score 0.67

Sparse algorithm example
Now we turn to examine C^Q ⋈ P^Q. We calculate MPS(C^Q ⋈ P^Q) = (1 + 1.33)/2 ≈ 1.17, which exceeds the lowest score in the queue (1), so this CN might still yield a result! Evaluating it replaces P1 with C3 ⋈ P2.
Top-2 results queue: C3 = 1.33, C3 ⋈ P2 = 1.17.

Sparse algorithm example
Now we turn to examine C^Q ⋈ P^{} ⋈ C^Q. We calculate MPS(C^Q ⋈ P^{} ⋈ C^Q) = (1.33 + 1.33)/3 ≈ 0.89, which is below the lowest score in the queue (1.17), so we don't need to evaluate this CN! The same goes for C^Q ⋈ U^{} ⋈ C^Q.
We're finished! We return {C3, C3 ⋈ P2} as the results.

Sparse is nice, but…
What if there are many possible answers, some of them requiring multiple joins (keywords are "hiding" in multiple relations)? In that case the Sparse algorithm becomes (almost) as inefficient as the Naïve algorithm, a problem that is especially acute for AND queries. What plan should we devise now? We need to make better use of our architecture!

The Single-pipelined algorithm
The Single-Pipelined algorithm is essentially what we'd like to happen in the case of a SINGLE CN. It DOES NOT solve the whole problem, but it's a great building block for the more sophisticated General-Pipelined algorithm!

The Single-pipelined algorithm
Input: a candidate network and the non-empty tuple sets TS_1 … TS_n that participate in it. Recall that TS_i corresponds to a relation R_i and holds its tuples that match the query keywords, already sorted in descending order of the Score function.
Output: a stream of joining trees of tuples in descending score order.

The Single-pipelined algorithm
We need to keep track of the prefix S(TS_i) we've already retrieved from every tuple set. In each iteration we retrieve another tuple t from some TS_i and try to match it against the retrieved prefixes of all the other tuple sets, to create potential joining trees. All the joining trees of tuples T that we find are added to the queue of results. Anyone see a problem here?

The Single-pipelined algorithm
Yup, we're back to the Naïve algorithm, aren't we? Well, not quite! In order to guarantee that a result we've produced really belongs in the top k, we need a bound similar to the MPS. MPFS_i, the Maximum Possible Future Score, is our estimate of the maximum score of any yet "unseen" result that uses a not-yet-retrieved tuple from TS_i.

The Single-pipelined algorithm
We would have liked to use the state of each prefix S(TS_i) to bound the maximum score obtainable from a tuple of TS_i that has not been retrieved yet:
MPFS_i = max { Score(T, Q) | T ∈ TS_1 ⋈ … ⋈ TS_{i-1} ⋈ (TS_i − S(TS_i)) ⋈ TS_{i+1} ⋈ … ⋈ TS_n }
This is expensive! Instead we compute a cheaper overestimate, MPFS'_i: the score of the next unretrieved tuple from TS_i combined with the top-ranked tuple of every other TS.
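A minimal Python sketch of the overestimate MPFS'_i, assuming the combining function is the sum of tuple scores divided by the CN size, and treating an exhausted tuple set as having MPFS'_i = 0. The function name `mpfs_estimates` and the argument layout are illustrative choices.

```python
def mpfs_estimates(tuple_sets, positions, size):
    """Compute MPFS'_i for every non-free tuple set.

    tuple_sets: lists of tuple scores, each in descending order
    positions:  positions[i] is the number of tuples already retrieved from set i
    size:       number of tuple sets in the CN, free ones included
    """
    tops = [ts[0] for ts in tuple_sets]
    estimates = []
    for i, ts in enumerate(tuple_sets):
        if positions[i] >= len(ts):
            estimates.append(0.0)  # exhausted: no unseen tuple can contribute
            continue
        others = sum(tops) - tops[i]  # top-ranked tuple of every other set
        estimates.append((ts[positions[i]] + others) / size)
    return estimates

# Three tuple sets connected by three free tuple sets, so the CN has size 6.
a, b, c = [3, 2, 1], [9, 3, 1], [7, 3, 2]
initial = mpfs_estimates([a, b, c], [0, 0, 0], size=6)
after_first = mpfs_estimates([a, b, c], [1, 0, 0], size=6)
```

MPFS'_all is then `max(...)` over these estimates. Because only the next tuple of TS_i and the fixed tops of the other sets are involved, each update is cheap.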

The Single-pipelined algorithm: example
Suppose the algorithm receives a CN with three non-free tuple sets and three free tuple sets (R^{}, Q^{}, P^{}) that connect them, so the CN has size 6. We want the algorithm to output the best 6 results.

Tuple sets, with the score of each tuple:
TS_1: A1 = 3, A2 = 2, A3 = 1
TS_2: B1 = 9, B2 = 3, B3 = 1
TS_3: C1 = 7, C2 = 3, C3 = 2

Initially S(TS_1) = S(TS_2) = S(TS_3) = ∅ and the output is empty.

First, we calculate MPFS'_i, which is the same for every TS at the beginning: MPFS'_1 = MPFS'_2 = MPFS'_3 = (3 + 9 + 7)/6 ≈ 3.17.

Then we compute MPFS'_all as the maximum of the MPFS'_i: MPFS'_all ≈ 3.17.

The Single-pipelined algorithm: example
Now we advance one of the S(TS_i), say S(TS_1), retrieving A1, and have to update MPFS'_1 = (2 + 9 + 7)/6 = 3.

We try to join A1 with the tuples in the other prefixes S(TS_i), but there aren't any yet.

We advance S(TS_2), retrieving B1. We also have no luck getting join results. Now MPFS'_2 = (3 + 3 + 7)/6 ≈ 2.17.

We advance S(TS_3), retrieving C1, and this time we manage to join C1 ⇝ B1 ⇜ A1. (We don't forget to update MPFS'_3 = (3 + 3 + 9)/6 = 2.5!)

The score of C1 ⇝ B1 ⇜ A1 is (3 + 9 + 7)/6 ≈ 3.17 = MPFS'_all, so no unseen result can beat it and we output it!

But now MPFS'_all should decrease! Remember, it's equal to max{MPFS'_i} = max{3, 2.17, 2.5} = 3.

State: S(TS_1) = {A1}, S(TS_2) = {B1}, S(TS_3) = {C1}. Output so far: C1 ⇝ B1 ⇜ A1 (3.17).

The Single-pipelined algorithm: example
Now we advance S(TS_1) again, retrieving A2. We have no luck joining A2, but we update MPFS'_1 = (1 + 9 + 7)/6 ≈ 2.83, and therefore MPFS'_all ≈ 2.83.

We advance S(TS_2), retrieving B2, try to join it with the other prefixes S(TS_i), and succeed! We find C1 ⇝ B2 ⇜ A2 with score (7 + 3 + 2)/6 = 2 < MPFS'_all, so we keep it in a queue for later output. MPFS'_2 becomes (3 + 1 + 7)/6 ≈ 1.83.

We advance S(TS_3), retrieving C2, and succeed again! We find C2 ⇝ B1 ⇜ A1 with score (3 + 9 + 3)/6 = 2.5 < MPFS'_all, so it also waits in the queue. MPFS'_3 becomes (3 + 9 + 2)/6 ≈ 2.33.

State: MPFS'_1 ≈ 2.83, MPFS'_2 ≈ 1.83, MPFS'_3 ≈ 2.33, MPFS'_all ≈ 2.83. Output so far: C1 ⇝ B1 ⇜ A1 (3.17). Queued: C2 ⇝ B1 ⇜ A1 (2.5), C1 ⇝ B2 ⇜ A2 (2).

Efficient IR-Style Keyword Search over Relational Databases SDBI 05 ’ 73 The Single-pipelined algorithm Some Free Relations R {},Q {},P {} We go back to S(TS 1 ), TS 3 TupleIdScore A1A1 3 A2A2 2 A3A3 1 TS 1 TupleIdScore B1B1 9 B2B2 3 B3B3 1 TS 2 TupleIdScore C1C1 7 C2C2 3 C3C3 2 TupleScore C1B1A1C1B1A1 3.16 TupleScore MPFS’ 3 = 2.33 MPFS’ 2 = 1.84 S(TS 1 )= S(TS 2 )= S(TS 3 )= Output MPFS’ 1 =2.83 MPFS’ all =2.83

Efficient IR-Style Keyword Search over Relational Databases SDBI 05 ’ 74 The Single-pipelined algorithm Some Free Relations R {},Q {},P {} We advance S(TS 1 ), And manage to find two joins – C 1 ⇝ B 1 ⇜ A 3 =1+9+7/6=2.83, which we output! TS 3 TupleIdScore A1A1 3 A2A2 2 A3A3 1 TS 1 TupleIdScore B1B1 9 B2B2 3 B3B3 1 TS 2 TupleIdScore C1C1 7 C2C2 3 C3C3 2 TupleScore C1B1A1C1B1A1 3.16 C1B1A3C1B1A3 2.83 TupleScore MPFS’ 3 = 2.33 MPFS’ 1 =0 MPFS’ 2 = 1.84 S(TS 1 )= S(TS 2 )= S(TS 3 )= Output MPFS’ all =2.83

Efficient IR-Style Keyword Search over Relational Databases SDBI 05 ’ 75 The Single-pipelined algorithm Some Free Relations R {},Q {},P {} We advance S(TS 1 ), And manage to find two joins – C 2 ⇝ B 1 ⇜ A 3 =1+9+3/6=2.33, which we can’t yet output TS 3 TupleIdScore A1A1 3 A2A2 2 A3A3 1 TS 1 TupleIdScore B1B1 9 B2B2 3 B3B3 1 TS 2 TupleIdScore C1C1 7 C2C2 3 C3C3 2 TupleScore C1B1A1C1B1A1 3.16 C1B1A3C1B1A3 2.83 TupleScore MPFS’ 3 = 2.33 MPFS’ 1 =0 MPFS’ 2 = 1.84 S(TS 1 )= S(TS 2 )= S(TS 3 )= Output MPFS’ all =2.83 We keep C2 ⇝ B1 ⇜ A3 in a queue for later output! We keep C2 ⇝ B1 ⇜ A3 in a queue for later output!

Efficient IR-Style Keyword Search over Relational Databases SDBI 05 ’ 76 The Single-pipelined algorithm Some Free Relations R {},Q {},P {} Now, MPFS’ all updates to 2.33, but we already have a result that can be output from before C 2 ⇝ B 1 ⇜ A 1 ! (SCORE=2.5) TS 3 TupleIdScore A1A1 3 A2A2 2 A3A3 1 TS 1 TupleIdScore B1B1 9 B2B2 3 B3B3 1 TS 2 TupleIdScore C1C1 7 C2C2 3 C3C3 2 TupleScore C1B1A1C1B1A1 3.16 C1B1A3C1B1A3 2.83 C2B1A1C2B1A1 2.5 TupleScore MPFS’ 3 = 2.33 MPFS’ 1 =0 MPFS’ 2 = 1.84 MPFS’ all =2.33 S(TS 1 )= S(TS 2 )= S(TS 3 )= Output Remember C2 ⇝ B1 ⇜ A1 ? It’s now output! Remember C2 ⇝ B1 ⇜ A1 ? It’s now output!

Efficient IR-Style Keyword Search over Relational Databases SDBI 05 ’ 77 The Single-pipelined algorithm Some Free Relations R {},Q {},P {} And what about C 2 ⇝ B 1 ⇜ A 3, with score 2.33? Well, it’s time for it to be output also! TS 3 TupleIdScore A1A1 3 A2A2 2 A3A3 1 TS 1 TupleIdScore B1B1 9 B2B2 3 B3B3 1 TS 2 TupleIdScore C1C1 7 C2C2 3 C3C3 2 TupleScore C1B1A1C1B1A1 3.16 C1B1A3C1B1A3 2.83 C2B1A1C2B1A1 2.5 TupleScore C2B1A1C2B1A1 2.33 MPFS’ 3 = 2.33 MPFS’ 1 =0 MPFS’ 2 = 1.84 S(TS 1 )= S(TS 2 )= S(TS 3 )= MPFS’ all =2.33 Output Now C2 ⇝ B1 ⇜ A3 Is also output! Now C2 ⇝ B1 ⇜ A3 Is also output!

Efficient IR-Style Keyword Search over Relational Databases SDBI 05 ’ 78 The Single-pipelined algorithm Some Free Relations R {},Q {},P {} Now, let’s advance S(TS 3 ). CAN ANYONE GUESS WHY NOT S(TS 2 )? TS 3 TupleIdScore A1A1 3 A2A2 2 A3A3 1 TS 1 TupleIdScore B1B1 9 B2B2 3 B3B3 1 TS 2 TupleIdScore C1C1 7 C2C2 3 C3C3 2 TupleScore C1B1A1C1B1A1 3.16 C1B1A3C1B1A3 2.83 C2B1A1C2B1A1 2.5 TupleScore C2B1A1C2B1A1 2.33 MPFS’ 3 = 2.33 MPFS’ 1 =0 MPFS’ 2 = 1.84 S(TS 1 )= S(TS 2 )= S(TS 3 )= MPFS’ all =2.33 Output

The Single-pipelined algorithm
Some free relations: R{}, Q{}, P{}
Now, let’s advance S(TS3). It has the biggest MPFS’i, so it is the most likely to yield results!
TS1: A1 = 3, A2 = 2, A3 = 1
TS2: B1 = 9, B2 = 3, B3 = 1
TS3: C1 = 7, C2 = 3, C3 = 2
Output: C1B1A1 (3.16), C1B1A3 (2.83), C2B1A1 (2.5), C2B1A3 (2.33)
MPFS’1 = 0, MPFS’2 = 1.84, MPFS’3 = 2.33, MPFS’all = 2.33
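
The choice of which tuple set to advance can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' code: the bound for a tuple set is taken to be the score of its next unprocessed tuple plus the top scores of the other tuple sets, divided by 6, which reproduces the MPFS’ values quoted in the walkthrough (roughly 2.33, 1.83 and 0). The mapping of TS1, TS2 and TS3 to the A, B and C tuples is inferred from the walkthrough, and the names TUPLE_SETS, mpfs and next_to_advance are hypothetical.

```python
# Tuple sets sorted by descending score (TS1/TS2/TS3 mapping is an assumption).
TUPLE_SETS = {
    "TS1": [("A1", 3), ("A2", 2), ("A3", 1)],
    "TS2": [("B1", 9), ("B2", 3), ("B3", 1)],
    "TS3": [("C1", 7), ("C2", 3), ("C3", 2)],
}

# Assumption: scores are normalized by 6, as in the slide arithmetic.
DIVISOR = 6


def mpfs(name, consumed):
    """Upper bound on the score of any result that uses an unseen tuple of `name`."""
    tuples = TUPLE_SETS[name]
    position = consumed[name]
    if position >= len(tuples):
        return 0.0  # tuple set exhausted: no future result can come from it
    next_score = tuples[position][1]
    top_others = sum(ts[0][1] for other, ts in TUPLE_SETS.items() if other != name)
    return (next_score + top_others) / DIVISOR


def next_to_advance(consumed):
    """Pick the tuple set with the largest bound."""
    return max(TUPLE_SETS, key=lambda name: mpfs(name, consumed))


# State at this point: all of TS1 processed, two tuples each from TS2 and TS3.
consumed = {"TS1": 3, "TS2": 2, "TS3": 2}
for name in TUPLE_SETS:
    print(name, round(mpfs(name, consumed), 2))
print("advance:", next_to_advance(consumed))
```

With this state the largest bound belongs to TS3, which is why the walkthrough advances S(TS3) rather than S(TS2).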

The Single-pipelined algorithm
Some free relations: R{}, Q{}, P{}
We advance S(TS3) and manage to find two joins, among them C3 ⇝ B1 ⇜ A2 = (2+9+2)/6 = 2.16, which we can’t yet output.
TS1: A1 = 3, A2 = 2, A3 = 1
TS2: B1 = 9, B2 = 3, B3 = 1
TS3: C1 = 7, C2 = 3, C3 = 2
Output: C1B1A1 (3.16), C1B1A3 (2.83), C2B1A1 (2.5), C2B1A3 (2.33)
MPFS’1 = 0, MPFS’2 = 1.84, MPFS’3 = 0, MPFS’all = 2.33

The Single-pipelined algorithm
Some free relations: R{}, Q{}, P{}
But with MPFS’3 = 0 we have to update MPFS’all, and it turns out we can output C3 ⇝ B1 ⇜ A2 after all…
TS1: A1 = 3, A2 = 2, A3 = 1
TS2: B1 = 9, B2 = 3, B3 = 1
TS3: C1 = 7, C2 = 3, C3 = 2
Output: C1B1A1 (3.16), C1B1A3 (2.83), C2B1A1 (2.5), C2B1A3 (2.33), C3B1A2 (2.16)
MPFS’1 = 0, MPFS’2 = 1.84, MPFS’3 = 0, MPFS’all = 1.84

The Single-pipelined algorithm
Some free relations: R{}, Q{}, P{}
Also, remember C1 ⇝ B2 ⇜ A2 with score 2? Its time has come to be output!
TS1: A1 = 3, A2 = 2, A3 = 1
TS2: B1 = 9, B2 = 3, B3 = 1
TS3: C1 = 7, C2 = 3, C3 = 2
Output: C1B1A1 (3.16), C1B1A3 (2.83), C2B1A1 (2.5), C2B1A3 (2.33), C3B1A2 (2.16), C1B2A2 (2)
MPFS’1 = 0, MPFS’2 = 1.84, MPFS’3 = 0, MPFS’all = 1.84
Now C1 ⇝ B2 ⇜ A2 is also output!

The Single-pipelined algorithm
Some free relations: R{}, Q{}, P{}
That’s it, we’re done! Phew…
TS1: A1 = 3, A2 = 2, A3 = 1
TS2: B1 = 9, B2 = 3, B3 = 1
TS3: C1 = 7, C2 = 3, C3 = 2
Output: C1B1A1 (3.16), C1B1A3 (2.83), C2B1A1 (2.5), C2B1A3 (2.33), C3B1A2 (2.16), C1B2A2 (2)
MPFS’1 = 0, MPFS’2 = 1.84, MPFS’3 = 0, MPFS’all = 1.84
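
The hold-and-release behaviour seen in the last few steps can be sketched as a small priority queue. This is a hedged illustration, not the paper's implementation: it assumes a queued result is released as soon as its score is at least the current MPFS’all, and it reuses the scores as they appear on the slides. The class name ResultQueue and its methods are hypothetical.

```python
import heapq


class ResultQueue:
    """Holds joined results until the threshold drops to their score."""

    def __init__(self):
        self._heap = []  # max-heap emulated with negated scores

    def add(self, score, result):
        heapq.heappush(self._heap, (-score, result))

    def release(self, threshold):
        """Pop every queued result scoring at least `threshold`, best first."""
        released = []
        while self._heap and -self._heap[0][0] >= threshold:
            neg_score, result = heapq.heappop(self._heap)
            released.append((result, -neg_score))
        return released


queue = ResultQueue()
queue.add(2.5, "C2-B1-A1")
queue.add(2.33, "C2-B1-A3")   # score as listed on the slides
queue.add(2.0, "C1-B2-A2")

first = queue.release(2.33)    # MPFS'all after TS1 is exhausted
queue.add(2.16, "C3-B1-A2")    # found when S(TS3) is advanced
second = queue.release(1.84)   # MPFS'all after TS3 is exhausted
print(first)
print(second)
```

The two release calls mirror the two moments in the walkthrough where MPFS’all drops, first to 2.33 and then to 1.84.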

In the common case…
This algorithm would output the best results of the specific CN quickly,
and would save time by not touching non-promising TSs!
In our example that didn’t really happen (only the last tuple from TS2 was untouched)…

The General-pipelined algorithm
As mentioned before, the Single-pipelined algorithm (which operates on a SINGLE CN) does not solve the whole problem.
However, a concurrent approach using the single algorithm might!
This is exactly the idea behind the general-pipelined algorithm:

The General-pipelined algorithm
The General-pipelined algorithm evaluates all the CNs concurrently, using a priority-preemptive, round-robin protocol.
What’s the priority of each CN i? MPFS’i!
Also, a result is only output once its score is at least GMPFS’, the maximal value of the current set of MPFS’s.
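
The scheduling idea can be sketched in Python as follows. This is a simplified illustration, not the algorithm as published: ToyCN is a hypothetical stand-in for a single-pipelined CN executor whose results happen to arrive in descending score order, so its MPFS’ is simply the score of its next result. The scheduler always advances the CN with the highest bound and breaks ties by list order rather than by true round-robin.

```python
import heapq


class ToyCN:
    """Hypothetical stand-in for a single-pipelined executor of one CN."""

    def __init__(self, name, results):
        self.name = name
        # (score, result) pairs, best first, so the next score bounds the rest.
        self._pending = sorted(results, reverse=True)

    def mpfs(self):
        return self._pending[0][0] if self._pending else 0.0

    def step(self):
        """Do one unit of work and return any newly produced results."""
        return [self._pending.pop(0)] if self._pending else []


def general_pipelined(cns, k):
    """Return up to k results, each released once its score reaches GMPFS'."""
    output, queue = [], []  # queue: max-heap of produced but unreleased results
    while len(output) < k:
        gmpfs = max(cn.mpfs() for cn in cns)
        while queue and len(output) < k and -queue[0][0] >= gmpfs:
            neg_score, result = heapq.heappop(queue)
            output.append((result, -neg_score))
        if len(output) == k or gmpfs == 0.0:
            break  # enough results, or every CN is exhausted
        best = max(cns, key=lambda cn: cn.mpfs())
        for score, result in best.step():
            heapq.heappush(queue, (-score, result))
    return output


cns = [
    ToyCN("CN1", [(3.16, "C1-B1-A1"), (2.0, "C1-B2-A2")]),
    ToyCN("CN2", [(2.83, "C1-B1-A3")]),
]
top2 = general_pipelined(cns, 2)
print(top2)
```

In a real executor MPFS’ is only an upper bound, so a step may produce no result at all; the release test against GMPFS’ is what keeps the output order correct in that case.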

The General-pipelined algorithm
[Diagram] CN queue ordered by ascending MPFS (CN5, CN1, CN5, CN3, CN2, …); Execution engine; Output to user
Queue of future(?) results (Tuple: Score): B1 → C3: 4.22; C1 → A2: 7; A1: 3

The Hybrid algorithm
The hybrid algorithm simply combines the power of the two most successful algorithms.
It estimates the number of results that a query would produce.
If it expects “few” results, it runs the Sparse algorithm.
In any other case, it runs the General-pipelined algorithm!
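
The dispatch described above can be sketched in Python. The slide does not say how “few” is decided, so the comparison against few_factor * k is purely an assumption made for illustration. The estimator and the two runner callables are hypothetical placeholders for the real components.

```python
def hybrid(query, k, estimate_results, run_sparse, run_general_pipelined,
           few_factor=1.0):
    """Choose an algorithm from the estimated result count.

    Assumption: "few" means at most few_factor * k expected results.
    """
    expected = estimate_results(query)
    if expected <= few_factor * k:
        return run_sparse(query, k)
    return run_general_pipelined(query, k)


# Toy usage with stand-in callables that only report which branch ran.
chosen = hybrid(
    ["disk", "crash"], 10,
    estimate_results=lambda q: 3,
    run_sparse=lambda q, k: "sparse",
    run_general_pipelined=lambda q, k: "general-pipelined",
)
print(chosen)
```

Passing the estimator and runners as parameters keeps the sketch independent of any particular RDBMS or statistics source.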

Runtime
All the algorithms were run through a series of runtime tests.
The tests used the DBLP data set translated to relations (Conferences, Papers, Citations…)
Each test varied one parameter (e.g. query size) while the others were held constant.
Different tests were run for AND and OR semantics.
Also, some tests use two modified algorithms:
–SASymmetric - Single pipelined with round-robin
–GASymmetric - General pipelined with round-robin

Maximal CN size (OR)
This test evaluates M, the maximal CN size.

Maximal CN size (AND)
Clearly, a bigger M has a greater impact with AND queries (Why?).

Number of keywords (OR)

Number of keywords (AND)

Criticism
Runtime is not clearly stated in the article (for a reason!)
It is heavily affected by query size: for |Q|>4, most queries will take a lot of time! The same goes for M>6…
The system is a bit “platform-dependent”, prone to future RDBMS policy changes…

Conclusion
Today we’ve discussed a method for using IR-style keyword search over relational databases:
–Motivations for such searches
–An architecture that can achieve this goal
–Several algorithms, of varying efficiency, that can produce results
–Experimental results that allow better evaluation of runtime

Thank You! …Questions? Phew!…

References
1. V. Hristidis, L. Gravano, Y. Papakonstantinou. Efficient IR-Style Keyword Search over Relational Databases. VLDB, 2003.
2. V. Hristidis, Y. Papakonstantinou. DISCOVER: Keyword Search in Relational Databases. VLDB, 2002.
3. (A. Balmin, V. Hristidis, Y. Papakonstantinou. ObjectRank: Authority-Based Keyword Search in Databases. VLDB, 2004).

DISCOVER: original Architecture
