Keyword Searching in Relational Databases

Name: Keyword Searching in Relational Databases
Uploaded: 2017-10-12T01:14:55+00:00
Duration: PTM25S14
Channel: Ebony Tarr
Description: Keyword Searching in Relational Databases

Keyword Searching in Relational Databases
Esha Palta ( ) Kumar Gaurav Bijay ( )

Dilbert Strip 

Motivation Keyword search We have SQL, why keyword-querying?
SQL - not appropriate for naive users So many online databases (imdb, citeseer, bseindia …) – user cannot keep track of schema for all of these

Simple Approaches Using Form interfaces How about Google?
Require separate form for each type of query – confusing Not suitable for ad-hoc queries – how many forms will you provide? How about Google? Export data from db to documents and do keyword- querying on these Suffers from duplication overheads Google wants all keywords in one document. DB is often normalized, so need to join tables and store as documents Multiple combinations of tables to join. Not scalable … Probably that’s why there is no Google DB Seach 

Differences from Web Search
Related data split across multiple tuples due to normalization Different keywords may match tuples from different relations What joins are to be computed can only be decided on the fly Need to find result containing all keywords and rank them somehow Paper (PaperId, PaperName) Writes (AuthorId, PaperId) Author (AuthorId, AuthorName) Cites (Citing, Cited) The DBLP Bibliography Schema

Systems for DB search BANKS (Browsing and Keyword Search)
– IITB (ICDE ’02) DBXplorer – Microsoft Research (ICDE ’02) ObjectRank – IBM, UCSD, FIU (VLDB ’04) Bidirectional BANKS – IITB (VLDB ’05)

Systems for DB search BANKS (Browsing and Keyword Search)
– IITB (ICDE ’02) DBXplorer – Microsoft Research (ICDE ’02) ObjectRank – IBM, UCSD, FIU (VLDB ’04) Bidirectional BANKS – IITB (VLDB ’05) will cover in depth

BANKS (ICDE ’02)

The BANKS system BANKS Architecture Available on the web
Connects to database using JDBC JDBC metadata features used to provide schema browsing Preprocesses db User BANKS HTTP JDBC Web-server Database

Basic Model Database: modeled as a graph Nodes = tuples
Edges = references between tuples foreign key (assume for this talk), inclusion dependencies, .. Edges are directed. PaperId:PaperName BANKS01:Keyword Search MO:MultiQuery Optimizn paper AuthorID:PaperId Charuta:BANKS01 writes AuthorId Charuta S. Sudarshan Prasan Roy author DBLP example

The BANKS Answer Model Query: set of search terms {t1, t2, .., tn}
For each search term ti we find set of nodes Si matching ti Eg: Query = Sudarshan Roy (t1= Sudarshan, t2 = Roy) Answer: rooted, directed tree connecting nodes matching keywords Root node has special significance, may be restricted to some relations E.g. relations representing entities, not relationships May include intermediate nodes not in any Si (Steiner Tree) Multiple answers Ranking based on proximity + prestige

Answer Example Query: sudarshan roy Paper MultiQuery Optimization Writes Writes Author Author S. Sudarshan Prasan Roy We would like to find sets of (closely) connected tuples that match all given keywords

Edge Directionality Directed tree will miss desired answers. For eg:
Query = DBXplorer ObjectRank So, for each forward edge, BANKS adds a back edge CitedBy Cited BANKS Cites DBXPlorer Cites ObjectRank Here Banks cites both DBXplorer and ObjectRank. Cited CitedBy BANKS Cites DBXPlorer Cites ObjectRank

Edge Directionality What if we ignore directionality?
Some popular tuples are connected to many other tuples E.g. Students -> departments -> university Problem: A popular tuple would create misleading shortcuts between tuples E.g. every student would be closely linked with every other student via the department/university Solution: define different forward and backward edge weights Forward edges: In the direction of the foreign key reference

Edge Weight Weight of forward edge based on schema
e.g. citation link weights > “writes” link weights Weight of backward edge = indegree of edges pointing to the node 3 1 1 1

Edge Weight Scaling Normalize edge score Escore(e)
Make edge weight scale-free by dividing edge weigth by wmin Problem: Some backward edges have unduly large weights Depress the scale by defining Escore(e) as log(1+w(e)/wmin ) Overall Escore E = 1 / (1 + e Escore(e))

Node Weight Set weight of a Node = Indegree of the node
As per prestige rankings nodes with multiple pointers to them get a higher prestige So, higher node weight corresponds to higher prestige Problem: Nodes with many in-edges result in skewed answers Subdue extreme node weights by using log(1+indegree) Node score Nscore = Average of node scores (root-node-weight +  leaf-node-weights)

Combining Scores Combining two independent metrics: node weight and edge weight Normalize each to 0-1 Combine using weighting factor  Additive: (1- ) Escore +  Nscore Multiplicative: Escore * Nscore Performance study to compare alternatives and to find reasonable values for 

First Step – Symbol Table
The first step is to build a symbol table This table is in the db and is not normalized Example: Keyword List of Matching Nodes Database {NICDE_2, NVLDB_3, …} Search {NBANKS1, NBANKS2, NDBXPLR,…} Rank {NOBJRNK, NXRANK, NSPHSRCH, …} … N… are nodes that match the keyword.

Searching for Best Answers
Backward Expanding Search Algorithm: Assume: graph fits in memory Idea: find vertices from which a forward path exists to at least one node from each Si. Run concurrent single source shortest path algorithm from each node matching a keyword Create an iterator for each node matching a keyword Traverse the graph edges in reverse direction Output a node whenever it is on the intersection of the sets of nodes reached from each keyword Answer trees may not be generated in relevance order

Backward Expanding Search
Query: sudarshan roy MultiQuery Optimization paper writes S. Sudarshan Prasan Roy authors Iterators

BANKS Query Result Example
Result of “Sudarshan Roy”

Answers need not be always in Relevance order
Result Ordering Answers need not be always in Relevance order This tree is output Better Root Missed 2 2 5 2 2 1

Result Ordering (contd…)
Solution: Generate all connection trees and then sort them Increases computation costs and leads to a greatly increased time to generate initial results Create a small heap ordered on the relevance of the trees Output highest ranked tree from heap to user when heap is full What about duplicate results? Maintain a list of generated results for duplicate detection Discard result according to relevance

Experience and Performance
BANKS provides keyword search coupled with extensive browsing facilities Schema browsing + data browsing Graphical display of data Implemented using Java + servlets Keyword search response times typically 1 to 3 seconds on DBLP database with 100,000 tuples/300,000 edges P3 600 MHz, 512 MB RAM

Anecdotes “Mohan” “Transaction” “Sunita Seltzer”
Returns C. Mohan at top based on prestige (number of papers written) “Transaction” Returns Jim Gray’s classic paper and textbook as top answers based on prestige (number of citations) “Sunita Seltzer” No common papers, but both have papers with Stonebraker: system finds this connection

Effect of Parameters Log scaling of edge weights worked well
(1- ) E +  N versus E Nmade little difference Best with  = .2 (subdue node weights but not entirely) EdgeLog

BANKS (VLDB ’05)

Motivation BANKS performs poorly if
Keyword matches lot of nodes (so lot of Dijkstra sources) Search hits a node with large fan – in. … Wastes time Sudarshan Roy

New Ideas – Forward Search
Why only backward, lets search forward too : How about fwd Searching ? … Sudarshan Roy

New Ideas - Activation Activation :- Cannot forward search from each node. Spread activation from keyword nodes to others. Activation is like Page Rank with decay. High Activation  close to many keywords.

Activation Spreading Spreading Activation
Node with highest activation explored first Activation spread to neighbors (μ = 0.3) Gives low activation to neighbors of hubs

Modifications to Model
Graph model stays the same. BANKS is concerned with search more than how to tune parameters or define node – weights / edge – weights. BANKS code : Tree Node – Score, N = Tree Edge – Score, E = Total Score = ENl (l = 0.2)

The New Algorithm Need two priority queues :
Qin - do backward search from these nodes Qout - do forward search from these nodes Each node, n keeps 3 variables per keyword, ti sp [i] : Node to got to from n for shortest-path to ti distance [i] : Length of the shortest-path from n to ti Activation [i] : Activation to n from keyword ‘ti’

The New Algorithm – continued…
Set initial activation keyword nodes and add to Qin for backward-search. At each step, pick node with maximum activation i.e. if (Qin.getMaxActivation > Qout. getMaxActivation)) // use node from Qin else // use node from Qout If node from Qin, do backward search and add itself to Qout. (newly explored nodes into Qin) If node from Qout, do forward search If node has reached from all keyword, generate result- tree. [answer is buffered as results can be out of order]

Explanation with example
Qin Qout N100 N4 Roy Sudarshan N1 … N3 N2 Roy Sudarshan

Qin Qout N100 N4 Roy Sudarshan N1 N2 N1 … N3 N2 Roy Sudarshan

Qin Qout N100 N2 N3 … N100 N1 Roy Sudarshan N4 N1 … N3 N2 Roy Sudarshan Result Found !

Generation of top-k results
If we know the score of next-best answer, all buffered answers with better score can be output. Need upper bounds

Computation of upper bound
For each keyword ti, we have explored nodes upto some length – say li. So, next – best – score (approx.) = This is not a true upper bound, but works quite well and is simple !

Are we losing answers ? BANKS – I used many Dijkstra states, BANKS – II uses 2 only – forward and backward search- states. The result is that we can now lose answers !

This is the generated answer.
Answer Loss Example Ny K1 Nx K2 K1 Ny Nx K1 K2 Ny K1 K2 This is the generated answer. This answer is lost.

But, we will generate this tree rooted at Nx:
So, a rotated tree with same nodes but different root is often generated ! NY K2 NX K1

Metrics of Performance
Manually obtain best relevant answers. Determine 2 times : Time taken to produce last relevant answer. Time taken to output last relevant answer. Search algorithms MI-Bkwd: original backward search Iterator for every node matching a keyword SI-Bkwd: backward search with single backward iterator Bidirec: bidirectional search Datasets DBLP, IMDB ~ 2 million nodes, 9 million edges US Patent DB ~ 4 million nodes, 15 million edges

Graph - I MI-Bkwd versus SI-Bkwd
SI-Bkwd gain increases with origin size, # keywords

Graph - II SI-Bkwd versus Bidirec
Bidirec gain increases with origin size, # keywords

A Critique BANKS needs a lot of memory.
Need to cluster and keep parts of graph on disk. Work is in progress 

DBXplorer (ICDE ’02)

DBXplorer : (Microsoft Research)
Use symbol – table to determine which tables to join. Generate all possible table – join combinations : Figure : T1, T2, T3, T4 and T5 are tables

Cool ideas in DBXplorer
Symbol table need not be at tuple level. If column has an index, column – level symbol table is ok. Table Compression : e.g. : Keywords Columns Keywords Columns K1 K1 C1 C1 K2 K2 X C2 K3 C2 K3 K4 K4 K5 K5 Intermediate Column

ObjectRank (VLDB ‘04)

ObjectRank (IBM, FIU, UCSD)
Creates objects in database. Object definition is manual. e.g. in DBLP, author, conference and paper can be defined as objects. Heavily inspired by PageRank. Each node is given global ObjectRank just like PageRank of Google.

ObjectRank Ideas Keyword-level ObjectRank : for each keyword, precompute and save object ranks of nodes [can optimize by defining cut-off) Score of node, n w.r.t. keyword k : scorek(n) = f (Global-object-rank (n), Objectrankk (n)) At run time, scores are combined : scorek1,k2,…,km(n) = scorek1(n) * scorek2(n) * …* scorekm(n)

ObjectRank Algorithm and answers
If graph is DAG or near DAG, topologically sort and spread ObjectRank in this order. Answers are single objects and not Cluster / group as in BANKS. Demo at :

Conclusion Studied BANKS, both versions.
Covered cool ideas from DBXplorer and ObjectRank. Graph of BANKS must be made disk-resident.

References Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, and S. Sudarshan. Keyword Searching and Browsing in Databases using BANKS. In International Conference on Data Engineering (ICDE), pages 1083–1096, Varun Kacholia, Shashank Pandit, Soumen Chakrabarti, S. Sudarshan et. al. Bidirectional Expansion for Keyword Search on Graph Databases. In VLDB Conference, pages 505–516, 2005. Sanjay Agrawal, Surajit Chaudhari, and Gautam Das. DBXplorer: A System for Keyword-Based Search over Relational Databases. In International Conference on Data Engineering (ICDE), pages 5–22, 2002. Andrey Balmin, Vagelis Hristidis, and Yannis Papakonstantinou. ObjectRank: Authority-Based Keyword Search in Databases. In VLDB Conference, pages 564–575, 2004.

Appendix Extra slides

Browsing - May add?????? Hyperlinks are there for all primary key foreign key attributes Each table is displayed with set of tools for interacting with data Projection (using drop), Selection, Join, Group-by, Sort Template facilities to do a variety of tasks Browsing data by grouping and creating crosstabs e.g., theses grouped by department and year Hierarchical views of data Nested XML style, even on relational data Graphical displays Bar charts, pie charts, etc Templates are generic and can be applied on any data matching assumed schema Can be applied after applying selections New templates can be created by user, interactively

Example of Browsing in BANKS

Related Work DataSpot (DTL)/Mercado Intuifind [VLDB 98]
Based on patent by Palmon (filed 1995, granted 1998) Based on hypergraph model, similar answer model to ours Differences: our model of backward link weights and prestige Proximity Search [VLDB98] Different model of proximity based on adding up support No edge weights, prestige, different evaluation algorithm Information units (linked Web pages) [WWW10] No directionality, only studied in Web context Microsoft DBExplorer (this conference) No ranking, based on SQL generation Addresses efficient construction of text indexes Microsoft English query

Extensions Summarization of output
group the output tuples into sets that have same tree structure define the notion of similarity between two result trees perform restricted search Metadata queries (attribute:keyword queries) For example: author:levy match all the tuples of a relation costly Forward searching approach

Proposed Conclusions and Future Work
BANKS is an integrated browsing and keyword querying system for relational databases Future work: Keyword queries on XML Disambiguating queries by selecting Nodes: G.W.Bush: “Bush Jr” or “Bush Sr” Tree structure: “coauthors” or “cites” Boolean queries Metadata queries Summarization of output

Keyword Searching in Relational Databases

Similar presentations

Presentation on theme: "Keyword Searching in Relational Databases"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Keyword Searching in Relational Databases

Similar presentations

Presentation on theme: "Keyword Searching in Relational Databases"— Presentation transcript:

Similar presentations

About project

Feedback