03/13/06 1 Gaurav Kumar/ Esha Palta Keyword Searching in Relational Databases Esha Palta (05329017) Kumar Gaurav Bijay (02005013)


1 Keyword Searching in Relational Databases
Esha Palta (05329017), Kumar Gaurav Bijay (02005013)

2 Dilbert Strip

3 Motivation
- Keyword search
- We have SQL, so why keyword querying?
  - SQL is not appropriate for naive users
  - There are many online databases (imdb, citeseer, bseindia …), and a user cannot keep track of the schema for all of them

4 Simple Approaches
- Using form interfaces
  - Requires a separate form for each type of query, which is confusing
  - Not suitable for ad-hoc queries: how many forms will you provide?
- How about Google?
  - Export data from the db to documents and do keyword querying on these
  - Suffers from duplication overheads
  - Google wants all keywords in one document. A DB is often normalized, so tables must be joined and the results stored as documents
  - There are multiple combinations of tables to join, so this is not scalable

5 Differences from Web Search
- Related data is split across multiple tuples due to normalization
- Different keywords may match tuples from different relations
- Which joins are to be computed can only be decided on the fly
- Need to find results containing all keywords and rank them somehow
The DBLP Bibliography Schema:
  Cites (Citing, Cited)
  Paper (PaperId, PaperName)
  Writes (AuthorId, PaperId)
  Author (AuthorId, AuthorName)

6 Systems for DB search
- BANKS (Browsing and Keyword Search): IITB (ICDE ’02)
- DBXplorer: Microsoft Research (ICDE ’02)
- ObjectRank: IBM, UCSD, FIU (VLDB ’04)
- Bidirectional BANKS: IITB (VLDB ’05)

7 Systems for DB search
- BANKS (Browsing and Keyword Search): IITB (ICDE ’02)
- DBXplorer: Microsoft Research (ICDE ’02)
- ObjectRank: IBM, UCSD, FIU (VLDB ’04)
- Bidirectional BANKS: IITB (VLDB ’05)
[Slide annotation: "will cover in depth"]

8 BANKS (ICDE ’02)

9 The BANKS system
- BANKS Architecture
- Available on the web
- Connects to database using JDBC
- JDBC metadata features used to provide schema browsing
- Preprocesses db
[Architecture figure, labels: User, HTTP, Web-server, BANKS, JDBC, Database]

10 Basic Model
- Database: modeled as a graph
  - Nodes = tuples
  - Edges = references between tuples
    - foreign key (assumed for this talk), inclusion dependencies, ...
  - Edges are directed.
[Figure: DBLP example. Labels: paper tuples (PaperId:PaperName) "MO: MultiQuery Optimizn" and "BANKS01: Keyword Search"; author tuples (AuthorId) "S. Sudarshan", "Prasan Roy", "Charuta"; writes tuples (AuthorId:PaperId) such as "Charuta:BANKS01"]
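A minimal sketch of the graph model described on this slide, assuming a toy in-memory copy of the Paper, Author and Writes relations. The sample rows, the node naming scheme and the function `build_graph` are illustrative assumptions, not part of BANKS itself.

```python
from collections import defaultdict

# Hypothetical sample tuples in the spirit of the DBLP example.
papers = {"MO": "MultiQuery Optimization", "BANKS01": "Keyword Search"}
authors = {"Sudarshan": "S. Sudarshan", "Roy": "Prasan Roy", "Charuta": "Charuta"}
writes = [("Sudarshan", "MO"), ("Roy", "MO"), ("Charuta", "BANKS01")]

def build_graph(papers, authors, writes):
    """Return adjacency sets: node (tuple) -> set of nodes it references."""
    graph = defaultdict(set)
    for paper_id in papers:
        graph[("paper", paper_id)]           # node with no outgoing reference
    for author_id in authors:
        graph[("author", author_id)]
    for author_id, paper_id in writes:
        w = ("writes", author_id, paper_id)
        graph[w].add(("author", author_id))  # foreign key Writes.AuthorId
        graph[w].add(("paper", paper_id))    # foreign key Writes.PaperId
    return graph

graph = build_graph(papers, authors, writes)
```

Each tuple becomes one node, and each foreign key value becomes one directed edge from the referencing tuple to the referenced tuple.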

11 The BANKS Answer Model
- Query: set of search terms $\{t_1, t_2, \ldots, t_n\}$
- For each search term $t_i$ we find the set of nodes $S_i$ matching $t_i$
  - E.g.: Query = Sudarshan Roy ($t_1$ = Sudarshan, $t_2$ = Roy)
- Answer: rooted, directed tree connecting nodes matching the keywords
  - Root node has special significance and may be restricted to some relations
    - E.g. relations representing entities, not relationships
  - May include intermediate nodes not in any $S_i$ (Steiner tree)
- Multiple answers
  - Ranking based on proximity + prestige

12 Answer Example
- We would like to find sets of (closely) connected tuples that match all given keywords
Query: sudarshan roy
[Figure labels: Paper "MultiQuery Optimization", Writes, Author "S. Sudarshan", Author "Prasan Roy"]

13 Edge Directionality
- A directed tree will miss desired answers. For example: Query = DBXplorer ObjectRank
- So, for each forward edge, BANKS adds a back edge
[Figure labels: papers BANKS, DBXplorer, ObjectRank; Cites tuples; edges Cited and CitedBy]

14 Edge Directionality
- What if we ignore directionality?
  - Some popular tuples are connected to many other tuples
    - E.g. students -> departments -> university
  - Problem: a popular tuple would create misleading shortcuts between tuples
    - E.g. every student would be closely linked with every other student via the department/university
- Solution: define different forward and backward edge weights
  - Forward edges: in the direction of the foreign key reference

15 Edge Weight
- Weight of a forward edge is based on the schema
  - e.g. citation link weights > "writes" link weights
- Weight of a backward edge = indegree of the node, i.e. the number of edges pointing to it
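The backward-edge rule can be sketched as follows. This is an illustration only: the function name `backward_edges`, the edge representation and the student/department sample data are assumptions, and forward weights are simply taken as given.

```python
from collections import defaultdict

def backward_edges(forward_edges):
    """forward_edges: dict mapping (u, v) -> weight, where tuple u references v.
    Returns a dict mapping (v, u) -> weight of the added back edge,
    where the weight is the indegree of v."""
    indegree = defaultdict(int)
    for _, v in forward_edges:
        indegree[v] += 1
    return {(v, u): indegree[v] for (u, v) in forward_edges}

# Hypothetical data: three students reference one department.
forward = {("s1", "dept"): 1, ("s2", "dept"): 1, ("s3", "dept"): 1,
           ("dept", "univ"): 1}
backward = backward_edges(forward)
```

With this weighting, a path from one student to another through the department costs more as the department gets more popular, which counters the shortcut problem from the previous slide.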

16 Edge Weight Scaling
- Normalize edge score $\mathrm{Escore}(e)$
  - Make the edge weight scale-free by dividing the edge weight by $w_{\min}$
- Problem: some backward edges have unduly large weights
  - Depress the scale by defining $\mathrm{Escore}(e) = \log(1 + w(e)/w_{\min})$
- Overall Escore: $E = 1 / (1 + \sum_e \mathrm{Escore}(e))$
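A small sketch of the edge scoring on this slide. The slide does not state the base of the logarithm, so the natural log is assumed here; the function names are illustrative.

```python
import math

def edge_score(w, w_min):
    """Escore(e) = log(1 + w(e) / w_min); natural log is an assumption."""
    return math.log(1 + w / w_min)

def tree_edge_score(edge_weights, w_min):
    """Overall E = 1 / (1 + sum of Escore(e) over the edges of the tree)."""
    return 1.0 / (1.0 + sum(edge_score(w, w_min) for w in edge_weights))

single_node_tree = tree_edge_score([], w_min=1.0)     # no edges at all
one_edge_tree = tree_edge_score([1.0], w_min=1.0)     # 1 / (1 + log 2)
```

A tree with no edges gets the maximum score of 1, and the score shrinks towards 0 as edges are added or get heavier.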

17 Node Weight
- Set weight of a node = indegree of the node
  - As in prestige rankings, nodes with many pointers to them get a higher prestige
  - So, higher node weight corresponds to higher prestige
- Problem: nodes with many in-edges result in skewed answers
  - Subdue extreme node weights by using $\log(1 + \text{indegree})$
- Node score Nscore = average of the node scores of the root node and the leaf nodes
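A sketch of the node scoring described above, again assuming the natural log; the function names and argument layout are illustrative.

```python
import math

def node_weight(indegree):
    """Log-scaled prestige of a single node: log(1 + indegree)."""
    return math.log(1 + indegree)

def tree_node_score(root_indegree, leaf_indegrees):
    """Average of the log-scaled weights of the root and the leaf nodes."""
    weights = [node_weight(root_indegree)]
    weights += [node_weight(d) for d in leaf_indegrees]
    return sum(weights) / len(weights)

nscore = tree_node_score(root_indegree=1, leaf_indegrees=[1])
```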

18 Combining Scores
- Combining two independent metrics: node weight and edge weight
  - Normalize each to 0-1
  - Combine using weighting factor $\lambda$
    - Additive: $(1-\lambda)\,\mathrm{Escore} + \lambda\,\mathrm{Nscore}$
    - Multiplicative: $\mathrm{Escore} \cdot \mathrm{Nscore}^{\lambda}$
- Performance study to compare alternatives and to find reasonable values for $\lambda$
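The two combination rules can be sketched as below. The weighting symbol did not survive in this transcript, so the parameter name `lam` and the exponent form of the multiplicative rule are assumptions made for illustration.

```python
def combine_additive(escore, nscore, lam):
    """Weighted sum of the normalized edge and node scores."""
    return (1 - lam) * escore + lam * nscore

def combine_multiplicative(escore, nscore, lam):
    """Edge score times the node score damped by the weighting factor (assumed form)."""
    return escore * nscore ** lam

additive = combine_additive(1.0, 0.0, lam=0.2)
multiplicative = combine_multiplicative(0.5, 1.0, lam=0.2)
```

In both forms a small weighting factor keeps the edge score dominant while still letting node prestige break ties.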

19 First Step: Symbol Table
- The first step is to build a symbol table
- This table is in the db and is not normalized
- Example:
  Keyword    List of Matching Nodes
  Database   {N_ICDE_2, N_VLDB_3, …}
  Search     {N_BANKS1, N_BANKS2, N_DBXPLR, …}
  Rank       {N_OBJRNK, N_XRANK, N_SPHSRCH, …}
  …
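A symbol table of this kind is essentially an inverted index from keywords to node ids. The sketch below builds one in memory; the tokenization rule, the function name `build_symbol_table` and the sample node texts are assumptions (the slide stores the table in the database).

```python
import re
from collections import defaultdict

def build_symbol_table(nodes):
    """nodes: dict mapping node id -> text of the tuple.
    Returns dict mapping lower-cased keyword -> set of matching node ids."""
    table = defaultdict(set)
    for node_id, text in nodes.items():
        for keyword in re.findall(r"\w+", text.lower()):
            table[keyword].add(node_id)
    return table

# Hypothetical node texts.
nodes = {
    "N_BANKS1": "Keyword Search in Databases",
    "N_OBJRNK": "Authority-Based Keyword Search",
}
symbol_table = build_symbol_table(nodes)
```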

20 Searching for Best Answers
- Backward Expanding Search Algorithm:
  - Assume: graph fits in memory
  - Idea: find vertices from which a forward path exists to at least one node from each $S_i$
  - Run concurrent single-source shortest path algorithms from each node matching a keyword
    - Create an iterator for each node matching a keyword
    - Traverse the graph edges in reverse direction
  - Output a node whenever it is in the intersection of the sets of nodes reached from each keyword
  - Answer trees may not be generated in relevance order
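The idea can be sketched with a single shared priority queue. As a simplification, this sketch runs one Dijkstra expansion per keyword rather than one iterator per matching node, and it reports only candidate roots with their summed distances instead of building answer trees. The graph encoding and all names are assumptions.

```python
import heapq

def backward_expanding_search(in_edges, keyword_nodes):
    """in_edges: dict v -> list of (u, weight), one entry per edge u -> v.
    keyword_nodes: list of node sets, one set S_i per keyword.
    Yields (root, summed distance) once a node has been reached from every keyword."""
    dist = [dict() for _ in keyword_nodes]   # dist[i][n]: distance from n to S_i
    heap = [(0.0, i, n) for i, s in enumerate(keyword_nodes) for n in s]
    heapq.heapify(heap)
    emitted = set()
    while heap:
        d, i, n = heapq.heappop(heap)
        if n in dist[i]:
            continue                          # already settled for keyword i
        dist[i][n] = d
        if n not in emitted and all(n in di for di in dist):
            emitted.add(n)
            yield n, sum(di[n] for di in dist)
        for u, w in in_edges.get(n, ()):      # follow edges in reverse direction
            if u not in dist[i]:
                heapq.heappush(heap, (d + w, i, u))

# Hypothetical graph: writes tuples w1, w2 reference authors and the paper MO;
# back edges MO -> w1 and MO -> w2 have been added, all weights 1.
in_edges = {
    "Sudarshan": [("w1", 1)], "Roy": [("w2", 1)],
    "MO": [("w1", 1), ("w2", 1)],
    "w1": [("MO", 1)], "w2": [("MO", 1)],
}
results = list(backward_expanding_search(in_edges, [{"Sudarshan"}, {"Roy"}]))
```

In this toy graph the paper node MO is the first node reached from both keywords, so it is reported first as a candidate root.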

21 Backward Expanding Search
Query: sudarshan roy
[Figure labels: authors "S. Sudarshan" and "Prasan Roy", writes, paper "MultiQuery Optimization", Iterators]

22 BANKS Query Result Example
- Result of “Sudarshan Roy”

23 Result Ordering
- Answers need not always be in relevance order
[Figure labels: "1 This tree is output", "2 Better Root Missed"]

24 Result Ordering (contd.)
- Solution:
  - Generate all connection trees and then sort them
    - Increases computation cost and greatly increases the time to generate initial results
  - Create a small heap ordered on the relevance of the trees
    - Output the highest-ranked tree from the heap to the user when the heap is full
- What about duplicate results?
  - Maintain a list of generated results for duplicate detection
  - Decide which duplicate to discard according to relevance
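The small-heap heuristic can be sketched as follows. Duplicate detection is left out, and the function name, the `(relevance, tree)` input format and the sample scores are assumptions.

```python
import heapq

def emit_with_small_heap(trees, heap_size):
    """trees: iterable of (relevance, tree) in generation order.
    Buffers up to heap_size trees; once the buffer overflows, the most
    relevant buffered tree is output. Remaining trees are flushed at the end."""
    heap = []                                  # negated relevance gives a max-heap
    for relevance, tree in trees:
        heapq.heappush(heap, (-relevance, tree))
        if len(heap) > heap_size:              # heap is full: output the best tree
            neg, best = heapq.heappop(heap)
            yield best, -neg
    while heap:
        neg, best = heapq.heappop(heap)
        yield best, -neg

generated = [(0.2, "a"), (0.9, "b"), (0.5, "c"), (0.7, "d")]
ordered = [tree for tree, _ in emit_with_small_heap(generated, heap_size=2)]
```

The output is only approximately sorted: a highly relevant tree that is generated late can still appear after less relevant trees that were already output.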

25 Experience and Performance
- BANKS provides keyword search coupled with extensive browsing facilities
  - Schema browsing + data browsing
  - Graphical display of data
- Implemented using Java + servlets
- Keyword search response times are typically 1 to 3 seconds on
  - DBLP database with 100,000 tuples/300,000 edges
  - P3 600 MHz, 512 MB RAM

26 Anecdotes
- “Mohan”
  - Returns C. Mohan at top based on prestige (number of papers written)
- “Transaction”
  - Returns Jim Gray’s classic paper and textbook as top answers based on prestige (number of citations)
- “Sunita Seltzer”
  - No common papers, but both have papers with Stonebraker: the system finds this connection

27 Effect of Parameters
- Log scaling of edge weights worked well
- $(1-\lambda) E + \lambda N$ versus $E N^{\lambda}$ made little difference
- Best with $\lambda = 0.2$ (subdue node weights but not entirely)
[Plot label: EdgeLog]

28 BANKS (VLDB ’05)

29 Motivation
- BANKS performs poorly if
  - A keyword matches a lot of nodes (so there are a lot of Dijkstra sources)
  - The search hits a node with large fan-in
[Figure labels: Sudarshan, Roy; annotation "Wastes time"]

30 New Ideas: Forward Search
- Why only backward? Let's search forward too
[Figure labels: Sudarshan, Roy; annotation "How about fwd searching?"]

31 New Ideas: Activation
- Activation: we cannot forward search from every node
- Spread activation from keyword nodes to others
- Activation is like PageRank with decay. High activation means close to many keywords

32 Activation Spreading
- Spreading activation
  - Node with highest activation explored first
  - Activation spread to neighbors (μ = 0.3)
  - Gives low activation to neighbors of hubs
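A rough sketch of one spreading step, assuming that a node passes a fraction μ of its activation to its neighbors in equal shares and keeps its own value unchanged. The slide does not spell out the exact split, so the equal division, the function name and the sample graphs are assumptions.

```python
def spread_activation(neighbors, activation, node, mu=0.3):
    """Return a new activation dict after `node` spreads a fraction mu of its
    activation equally over its neighbors (equal split is an assumption)."""
    targets = neighbors.get(node, [])
    updated = dict(activation)
    if not targets:
        return updated
    share = mu * activation[node] / len(targets)
    for t in targets:
        updated[t] = updated.get(t, 0.0) + share
    return updated

# A hub with 100 neighbors versus a node with 2 neighbors.
hub = spread_activation({"hub": [f"n{i}" for i in range(100)]}, {"hub": 1.0}, "hub")
small = spread_activation({"x": ["y", "z"]}, {"x": 1.0}, "x")
```

Each neighbor of the hub receives 0.003, while each neighbor of the small node receives 0.15, which is how neighbors of hubs end up with low activation.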

33 Modifications to Model
- Graph model stays the same.
- BANKS is concerned with search more than with how to tune parameters or define node weights / edge weights.
- BANKS code:
  - Tree Node-Score, N = [formula not preserved in this transcript]
  - Tree Edge-Score, E = [formula not preserved in this transcript]
  - Total Score = $E N^{\lambda}$

34 The New Algorithm
- Need two priority queues:
  - Q_in: do backward search from these nodes
  - Q_out: do forward search from these nodes
- Each node n keeps 3 variables per keyword $t_i$:
  - sp[i]: node to go to from n for the shortest path to $t_i$
  - distance[i]: length of the shortest path from n to $t_i$
  - activation[i]: activation reaching n from keyword $t_i$

35 The New Algorithm (continued)
- Set the initial activation of keyword nodes and add them to Q_in for backward search.
- At each step, pick the node with maximum activation, i.e.
    if (Q_in.getMaxActivation() > Q_out.getMaxActivation())
        // use node from Q_in
    else
        // use node from Q_out
- If the node is from Q_in, do backward search and add the node itself to Q_out (newly explored nodes go into Q_in).
- If the node is from Q_out, do forward search.
- If a node has been reached from all keywords, generate a result tree. [The answer is buffered, as results can be out of order.]
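The queue-selection step above can be sketched as follows. Only the choice between the two queues is shown, not the expansion itself; the class `ActivationQueue`, the function `next_step` and the sample activations are hypothetical.

```python
import heapq

class ActivationQueue:
    """Priority queue that returns the node with the highest activation first."""
    def __init__(self):
        self._heap = []
    def push(self, node, activation):
        heapq.heappush(self._heap, (-activation, node))
    def max_activation(self):
        return -self._heap[0][0] if self._heap else float("-inf")
    def pop(self):
        neg, node = heapq.heappop(self._heap)
        return node, -neg
    def __len__(self):
        return len(self._heap)

def next_step(q_in, q_out):
    """Pick the node to expand next: ("backward", node) if it comes from q_in,
    ("forward", node) if it comes from q_out, or None when both are empty."""
    if not q_in and not q_out:
        return None
    if q_in.max_activation() > q_out.max_activation():
        return "backward", q_in.pop()[0]
    return "forward", q_out.pop()[0]

q_in, q_out = ActivationQueue(), ActivationQueue()
q_in.push("Roy", 1.0)
q_in.push("Sudarshan", 0.5)
q_out.push("N1", 0.7)
steps = [next_step(q_in, q_out) for _ in range(4)]
```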

36 Explanation with example
[Figure: graph with nodes Roy, N1, N2, Sudarshan, N3, N4, …, N100 and the queues Q_in and Q_out. Queue entries shown: Roy, Sudarshan]

37 Explanation with example
[Figure: same graph and queues. Queue entries shown: Roy, Sudarshan, N1, N2]

38 Explanation with example
[Figure: same graph and queues. Queue entries shown: N1, Roy, Sudarshan, N2, N3, …, N100. Annotation: "Result Found!"]

39 Generation of top-k results
- If we know the score of the next-best answer, all buffered answers with a better score can be output.
- Need upper bounds

40 Computation of upper bound
- For each keyword $t_i$, we have explored nodes up to some length, say $l_i$.
- So, next-best score (approx.) = [formula not preserved in this transcript]
- This is not a true upper bound, but it works quite well and is simple!
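The buffering rule from the two slides above can be sketched as below. Because the bound formula is not preserved here, the bound is simply passed in as a number; `release_buffered` and the sample scores are hypothetical.

```python
def release_buffered(buffer, next_best_bound):
    """buffer: list of (score, answer) pairs generated so far.
    Returns (ready, waiting): answers whose score beats the estimated score of
    any answer still to come, best first, and the answers that stay buffered."""
    ready = sorted((item for item in buffer if item[0] > next_best_bound),
                   reverse=True)
    waiting = [item for item in buffer if item[0] <= next_best_bound]
    return ready, waiting

buffered = [(0.9, "t1"), (0.4, "t2"), (0.7, "t3")]
ready, waiting = release_buffered(buffered, next_best_bound=0.5)
```

Since the bound is only approximate, an answer released this way can occasionally be overtaken by a later, better answer.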

41 Are we losing answers?
- BANKS-I used many Dijkstra states; BANKS-II uses only 2: the forward and backward search states.
- The result is that we can now lose answers!

42 Answer Loss Example
[Figure: two trees over the nodes K1, K2, Nx, Ny. Captions: "This is the generated answer." and "This answer is lost."]

43
- But we will generate this tree rooted at Nx:
[Figure: tree over the nodes N_X, N_Y, K1, K2]
- So, a rotated tree with the same nodes but a different root is often generated!

44 Metrics of Performance
- Manually obtain best relevant answers.
- Determine 2 times:
  1. Time taken to produce the last relevant answer.
  2. Time taken to output the last relevant answer.
- Search algorithms
  - MI-Bkwd: original backward search
    - Iterator for every node matching a keyword
  - SI-Bkwd: backward search with a single backward iterator
  - Bidirec: bidirectional search
- Datasets
  - DBLP, IMDB ~ 2 million nodes, 9 million edges
  - US Patent DB ~ 4 million nodes, 15 million edges

45 Graph - I
- MI-Bkwd versus SI-Bkwd
- SI-Bkwd gain increases with origin size, # keywords

46 Graph - II
- SI-Bkwd versus Bidirec
- Bidirec gain increases with origin size, # keywords

47 A Critique
- BANKS needs a lot of memory.
- Need to cluster and keep parts of the graph on disk.
- Work is in progress

48 DBXplorer (ICDE ’02)

49 DBXplorer (Microsoft Research)
- Use the symbol table to determine which tables to join.
- Generate all possible table-join combinations:
[Figure: T1, T2, T3, T4 and T5 are tables]

50 Cool ideas in DBXplorer
- Symbol table need not be at tuple level. If a column has an index, a column-level symbol table is ok.
- Table compression, e.g.:
[Figure: two Keywords/Columns tables. Labels: keywords K1, K2, K3, K4, K5; columns C1, C2; intermediate column X]

51 ObjectRank (VLDB ’04)

52 ObjectRank (IBM, FIU, UCSD)
- Creates objects in the database. Object definition is manual, e.g. in DBLP, author, conference and paper can be defined as objects.
- Heavily inspired by PageRank.
- Each node is given a global ObjectRank, just like Google's PageRank.

53 ObjectRank Ideas
- Keyword-level ObjectRank: for each keyword, precompute and save the object ranks of nodes (can optimize by defining a cut-off)
- Score of node n w.r.t. keyword k: $\mathrm{score}_k(n) = f(\text{Global-object-rank}(n), \mathrm{ObjectRank}_k(n))$
- At run time, scores are combined: $\mathrm{score}_{k_1,k_2,\ldots,k_m}(n) = \mathrm{score}_{k_1}(n) \cdot \mathrm{score}_{k_2}(n) \cdots \mathrm{score}_{k_m}(n)$
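The run-time combination can be sketched as a product over precomputed per-keyword scores. The data layout, the function name `combined_score`, the sample values and the choice to treat a node missing from a keyword's list (for example, below the cut-off) as score 0 are assumptions.

```python
import math

def combined_score(per_keyword_scores, node, keywords):
    """Multiply the precomputed per-keyword scores of `node`.
    A node absent from a keyword's list is treated as score 0 (assumption)."""
    return math.prod(per_keyword_scores[k].get(node, 0.0) for k in keywords)

# Hypothetical precomputed scores per keyword.
scores = {
    "olap": {"p1": 0.5, "p2": 0.2},
    "cube": {"p1": 0.4},
}
p1 = combined_score(scores, "p1", ["olap", "cube"])
p2 = combined_score(scores, "p2", ["olap", "cube"])
```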

54 ObjectRank Algorithm and answers
- If the graph is a DAG or near-DAG, topologically sort it and spread ObjectRank in this order.
- Answers are single objects, not clusters/groups as in BANKS.
- Demo at:

55 Conclusion
- Studied BANKS, both versions.
- Covered cool ideas from DBXplorer and ObjectRank.
- Graph of BANKS must be made disk-resident.

56 References
- Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, and S. Sudarshan. Keyword Searching and Browsing in Databases using BANKS. In International Conference on Data Engineering (ICDE), pages 1083–1096, 2002.
- Varun Kacholia, Shashank Pandit, Soumen Chakrabarti, S. Sudarshan, et al. Bidirectional Expansion for Keyword Search on Graph Databases. In VLDB Conference, pages 505–516, 2005.
- Sanjay Agrawal, Surajit Chaudhuri, and Gautam Das. DBXplorer: A System for Keyword-Based Search over Relational Databases. In International Conference on Data Engineering (ICDE), pages 5–22, 2002.
- Andrey Balmin, Vagelis Hristidis, and Yannis Papakonstantinou. ObjectRank: Authority-Based Keyword Search in Databases. In VLDB Conference, pages 564–575, 2004.

57 Appendix: Extra slides

58 Browsing - May add??????
- Hyperlinks are there for all primary key / foreign key attributes
- Each table is displayed with a set of tools for interacting with data
  - Projection (using drop), Selection, Join, Group-by, Sort
- Template facilities to do a variety of tasks
  - Browsing data by grouping and creating crosstabs
    - e.g., theses grouped by department and year
  - Hierarchical views of data
    - Nested XML style, even on relational data
  - Graphical displays
    - Bar charts, pie charts, etc.
- Templates are generic and can be applied on any data matching the assumed schema
  - Can be applied after applying selections
- New templates can be created by the user, interactively

59 Example of Browsing in BANKS

60 Related Work
- DataSpot (DTL)/Mercado Intuifind [VLDB 98]
  - Based on patent by Palmon (filed 1995, granted 1998)
  - Based on hypergraph model, similar answer model to ours
  - Differences: our model of backward link weights and prestige
- Proximity Search [VLDB 98]
  - Different model of proximity based on adding up support
  - No edge weights, no prestige, different evaluation algorithm
- Information units (linked Web pages) [WWW10]
  - No directionality, only studied in Web context
- Microsoft DBXplorer (this conference)
  - No ranking, based on SQL generation
  - Addresses efficient construction of text indexes
- Microsoft English Query

61 Extensions
- Summarization of output
  - group the output tuples into sets that have the same tree structure
  - define the notion of similarity between two result trees
  - perform restricted search
- Metadata queries (attribute:keyword queries)
  - For example: author:levy
  - match all the tuples of a relation: costly
  - Forward searching approach

62 Proposed Conclusions and Future Work
BANKS is an integrated browsing and keyword querying system for relational databases.
Future work:
- Keyword queries on XML
- Disambiguating queries by selecting
  - Nodes: G.W.Bush: “Bush Jr” or “Bush Sr”
  - Tree structure: “coauthors” or “cites”
- Boolean queries
- Metadata queries
- Summarization of output

