Keyword Searching and Browsing in Databases using BANKS

1 Keyword Searching and Browsing in Databases using BANKS
Charuta Nakhe. Joint work with: Arvind Hulgeri, Gaurav Bhalotia, Soumen Chakrabarti, S. Sudarshan. I.I.T. Bombay. 2/22/2019

2 Motivation
Keyword search of documents on the Web has been enormously successful: it is simple and intuitive, with no need to learn any query language. Database querying using keywords is desirable: SQL is not appropriate for casual users. Form interfaces are cumbersome: they require a separate form for each type of query, which is confusing for casual users of Web information systems, and they are not suitable for ad hoc queries.

3 Motivation
Many Web documents are dynamically generated from databases, e.g. catalog data. Keyword querying of the generated Web documents may miss answers that need to combine information on different pages, and suffers from duplication overheads.

4 Examples of Keyword Queries
On a railway reservation database: “mumbai bangalore”. On a university database: “database course”. On an e-store database: “camcorder panasonic”. On a book store database: “sudarshan databases”.

5 Differences from IR/Web Search
Related data is split across multiple tuples due to normalization. E.g. Paper(paper-id, title, journal), Author(author-id, name), Writes(author-id, paper-id, position), Cites(citing-paper-id, cited-paper-id). Different keywords may match tuples from different relations. What joins are to be computed can only be decided on the fly.
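
To make the normalization point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The rows are hypothetical toy data, the column names use underscores in place of the slide's hyphens, and the join is written by hand, which is exactly the step a keyword search system has to work out on the fly.

```python
import sqlite3

# In-memory database with the Paper / Author / Writes schema from the slide.
# All rows are hypothetical toy data for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Paper  (paper_id INTEGER PRIMARY KEY, title TEXT, journal TEXT);
CREATE TABLE Author (author_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Writes (author_id INTEGER REFERENCES Author(author_id),
                     paper_id  INTEGER REFERENCES Paper(paper_id),
                     position  INTEGER);
INSERT INTO Paper  VALUES (1, 'MultiQuery Optimization', NULL);
INSERT INTO Author VALUES (1, 'S. Sudarshan');
INSERT INTO Author VALUES (2, 'Prasan Roy');
INSERT INTO Writes VALUES (1, 1, 2);
INSERT INTO Writes VALUES (2, 1, 1);
""")

# The keyword "sudarshan" matches an Author tuple and "optimization" matches
# a Paper tuple; only the join through Writes connects the two matches.
rows = conn.execute("""
SELECT p.title, a.name
FROM Paper p
JOIN Writes w ON w.paper_id = p.paper_id
JOIN Author a ON a.author_id = w.author_id
WHERE a.name LIKE '%Sudarshan%' AND p.title LIKE '%Optimization%'
""").fetchall()
```

With a different pair of keywords the matching tuples could sit in other relations, and a different join path would be needed.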

6 Connectivity
Tuples may be connected by foreign key and object references, inclusion dependencies and join conditions, implicit links (shared words), etc. We would like to find sets of (closely) connected tuples that match all given keywords.

7 Basic Model
Database: modeled as a graph. Nodes = tuples. Edges = references between tuples (foreign key, inclusion dependencies, ..). Edges are directed.
[Figure: example graph with paper nodes “BANKS: Keyword search…” and “MultiQuery Optimization”, writes nodes, and author nodes Charuta, S. Sudarshan and Prasan Roy]
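
A minimal sketch of this graph model in Python, assuming hypothetical node identifiers of the form relation:key. Each tuple becomes a node, and each foreign key reference becomes a directed edge from the referencing tuple to the referenced one.

```python
from collections import defaultdict

# Hypothetical toy tuples, loosely following the slide's example figure.
nodes = {
    "paper:1": "MultiQuery Optimization",
    "author:1": "S. Sudarshan",
    "author:2": "Prasan Roy",
    "writes:1": "(author:1, paper:1)",
    "writes:2": "(author:2, paper:1)",
}

# Foreign key references as (referencing tuple, referenced tuple) pairs.
foreign_keys = [
    ("writes:1", "author:1"),
    ("writes:1", "paper:1"),
    ("writes:2", "author:2"),
    ("writes:2", "paper:1"),
]

def build_graph(edges):
    """Return adjacency lists for forward (outgoing) and backward (incoming) edges."""
    forward = defaultdict(list)
    backward = defaultdict(list)
    for src, dst in edges:
        forward[src].append(dst)
        backward[dst].append(src)
    return forward, backward

forward, backward = build_graph(foreign_keys)
```

Keeping the incoming adjacency list alongside the outgoing one is a convenience for searches that walk edges in reverse.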

8 Answer Example
Query: sudarshan roy
[Figure: answer connecting the paper “MultiQuery Optimization” through two writes tuples to the authors S. Sudarshan and Prasan Roy]

9 The BANKS Answer Model
Query: set of keywords {k1, k2, .., kn}. Each keyword ki matches a set of nodes Si. Answer: a rooted, directed tree connecting nodes, with one node from each Si. The root node has special significance and may be restricted to some relations, e.g. relations representing entities, not relationships. The tree may include intermediate nodes not in any Si, and is hence a Steiner tree. Multiple answers are possible; ranking is based on proximity + prestige.

10 Edge Directionality
Some popular tuples are connected to many other tuples, e.g. students -> departments -> university. Popular tuples would create misleading shortcuts from every tuple to every other; e.g. every student would be closely linked with every other student via the department/university. Solution: define different forward and backward edge weights. Forward edges are in the direction of the foreign key reference.

11 Edge Weight
Weight of a forward edge is based on the schema, e.g. citation link weights > “writes” link weights. Weight of a backward edge = indegree of the node, i.e. the number of edges pointing to it.
[Figure: example graph with edge weights 3, 1, 1, 1]
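
The weighting rule can be sketched as follows. The schema weights in FORWARD_WEIGHT are illustrative assumptions, and the sketch reads "the node" as the referenced tuple, so the backward edge leaving a tuple costs as much as the number of references pointing at that tuple.

```python
from collections import Counter

# Hypothetical schema-level weights for forward edges, keyed by the
# referencing relation; the numbers are illustrative assumptions.
FORWARD_WEIGHT = {"cites": 2.0, "writes": 1.0}

def edge_weights(fk_edges):
    """fk_edges: list of (src, dst, relation) foreign key references.

    Returns {(u, v): weight} containing one forward edge src -> dst and
    one backward edge dst -> src per reference.
    """
    indegree = Counter(dst for _, dst, _ in fk_edges)
    weights = {}
    for src, dst, relation in fk_edges:
        weights[(src, dst)] = FORWARD_WEIGHT[relation]  # forward: from schema
        weights[(dst, src)] = float(indegree[dst])      # backward: indegree of dst
    return weights

# Three writes tuples reference paper p1; one of them also references author a1.
edges = [
    ("w1", "p1", "writes"),
    ("w2", "p1", "writes"),
    ("w3", "p1", "writes"),
    ("w1", "a1", "writes"),
]
weights = edge_weights(edges)
```

A heavily referenced tuple such as p1 gets expensive backward edges, which is what discourages shortcuts through popular tuples.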

12 Edge Weight Scaling
Problem: some backward edges have unduly large weights. Scale edge weights by using log(1 + raw-edge-weight). total-edge-weight = Σ edge-weights. Edge score E = 1 / total-edge-weight.
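
A small sketch of the scaling and the edge score. The slide does not state the base of the logarithm, so the natural log used here is an assumption, as are the function names.

```python
import math

def scaled_weight(raw_weight):
    """Dampen a raw edge weight with log(1 + w); natural log is an assumption."""
    return math.log(1 + raw_weight)

def edge_score(raw_edge_weights):
    """Edge score E = 1 / (sum of scaled edge weights in the answer tree)."""
    total = sum(scaled_weight(w) for w in raw_edge_weights)
    if total == 0:
        raise ValueError("edge score is undefined for a tree with no weighted edges")
    return 1.0 / total

# A tree with two cheap edges scores higher than one containing a very
# heavy backward edge, but the log keeps the heavy edge from dominating.
small = edge_score([1.0, 1.0])
large = edge_score([1.0, 1000.0])
```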

13 Node Weight
Nodes have prestige weights too. Observation: nodes with intuitively greater prestige tend to have greater indegree. Set node weight = indegree. Problem: nodes with many in-edges result in skewed answers. Subdue extreme node weights by using log(1 + indegree). Node score N = root-node-weight + Σ leaf-node-weights.
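
The node score can be sketched in the same way. The indegree table and the function names are hypothetical, and the natural log is again an assumption.

```python
import math

def node_weight(indegree):
    """Prestige weight of a node, dampened with log(1 + indegree)."""
    return math.log(1 + indegree)

def node_score(root, leaves, indegree):
    """N = weight of the root + sum of the weights of the leaf nodes.

    indegree: mapping from node to its number of incoming edges.
    """
    return node_weight(indegree[root]) + sum(node_weight(indegree[leaf]) for leaf in leaves)

# Hypothetical indegrees for an answer tree rooted at p1 with leaves a1 and a2.
indegree = {"p1": 3, "a1": 1, "a2": 0}
score = node_score("p1", ["a1", "a2"], indegree)
```

Intermediate nodes of the tree do not contribute here, since the slide's formula sums only the root and the leaves.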

14 Combining Scores
Problem: how to combine two independent metrics, node weight and edge weight. Normalize each to 0-1. Combine using a weighting factor λ. Additive: (1 - λ) E + λ N. Multiplicative: E × N^λ. A performance study compares the alternatives and finds reasonable values for λ.
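
The two combination rules can be tried out with a few lines of Python. In this sketch lam stands for the weighting factor and the function names are hypothetical. The max-based normalization is an assumption (the slide only says 0-1), and so is placing the weighting factor as an exponent on N in the multiplicative form.

```python
def normalize(values):
    """Scale non-negative scores into 0-1 by dividing by the maximum (an assumption)."""
    top = max(values)
    return [v / top for v in values] if top > 0 else [0.0 for _ in values]

def combine_additive(e, n, lam=0.2):
    """Additive combination: (1 - lam) * E + lam * N."""
    return (1 - lam) * e + lam * n

def combine_multiplicative(e, n, lam=0.2):
    """Multiplicative combination: E * N ** lam (exponent placement assumed)."""
    return e * n ** lam

# Hypothetical raw node scores for three answers, normalized before combining.
node_scores = normalize([1.0, 2.0, 4.0])
additive = combine_additive(0.5, node_scores[2])
multiplicative = combine_multiplicative(0.5, node_scores[2])
```

With a small lam the edge score dominates in both forms, so node prestige mostly acts as a tie-breaker.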

15 Finding Answer Trees
Backward Expanding Search Algorithm. Intuition: find vertices from which a forward path exists to at least one node from each Si. Run concurrent single-source shortest-path algorithms, one from each node matching a keyword: create an iterator for each node matching a keyword and traverse the graph edges in the reverse direction. Output a node whenever it is in the intersection of the sets of nodes reached from each keyword.
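
The following is a simplified sketch of the idea, not the BANKS implementation. It keeps one shortest-path frontier per keyword rather than one iterator per matching node, interleaves the frontiers through a single priority queue, and reports a candidate root together with the sum of its path lengths as a rough cost. The graph, weights and names are hypothetical.

```python
import heapq
from collections import defaultdict

def backward_expanding_search(edges, keyword_nodes):
    """edges: {(u, v): weight} for directed edges u -> v.
    keyword_nodes: list of sets, one set S_i of matching nodes per keyword.

    Yields (root, cost) for every node that has a forward path to some node
    in each S_i; cost is the sum of the shortest path lengths to each S_i.
    """
    incoming = defaultdict(list)
    for (u, v), w in edges.items():
        incoming[v].append((u, w))

    n = len(keyword_nodes)
    dist = [dict() for _ in range(n)]  # dist[i][x]: settled distance from x to S_i
    heap = [(0.0, i, s) for i, matches in enumerate(keyword_nodes) for s in matches]
    heapq.heapify(heap)
    emitted = set()

    while heap:
        d, i, x = heapq.heappop(heap)
        if x in dist[i]:
            continue  # already settled for keyword i
        dist[i][x] = d
        if x not in emitted and all(x in dist[j] for j in range(n)):
            emitted.add(x)
            yield x, sum(dist[j][x] for j in range(n))
        for u, w in incoming[x]:  # traverse edges in the reverse direction
            if u not in dist[i]:
                heapq.heappush(heap, (d + w, i, u))

# Toy graph: writes tuples w1, w2 reference paper p1 and authors a1, a2.
# Forward edges weigh 1; backward edges weigh the indegree of their source.
edges = {
    ("w1", "a1"): 1.0, ("w1", "p1"): 1.0, ("w2", "a2"): 1.0, ("w2", "p1"): 1.0,
    ("p1", "w1"): 2.0, ("p1", "w2"): 2.0, ("a1", "w1"): 1.0, ("a2", "w2"): 1.0,
}
results = list(backward_expanding_search(edges, [{"a1"}, {"a2"}]))
```

On this toy graph the first root reported is not the cheapest one, which illustrates why generated answers need reordering before they are shown.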

16 Backward Expanding Search
Query: sudarshan roy
[Figure: search expanding backward from the author nodes S. Sudarshan and Prasan Roy, through writes tuples, to the paper “MultiQuery Optimization”]

17 Result Ordering
Answer trees may not be generated in relevance order. Solution: best-first search across all iterators, based on path length. Output answers to a buffer. Output the highest-ranked answer from the buffer to the user when the buffer is full.
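
A minimal sketch of the buffering heuristic, using Python's heapq module. The function name, the (score, answer) pair format and the buffer size are assumptions made for illustration.

```python
import heapq

def buffered_output(answers, buffer_size):
    """Re-order a stream of (score, answer) pairs through a fixed-size buffer.

    Higher scores are better. Once the buffer would overflow, the best
    answer currently held is released; leftovers are drained best-first.
    """
    heap = []  # min-heap on negated score, so the best answer pops first
    for count, (score, answer) in enumerate(answers):
        heapq.heappush(heap, (-score, count, answer))
        if len(heap) > buffer_size:
            neg_score, _, best = heapq.heappop(heap)
            yield -neg_score, best
    while heap:
        neg_score, _, best = heapq.heappop(heap)
        yield -neg_score, best

# Hypothetical answers arriving out of relevance order.
stream = [(0.2, "a"), (0.9, "b"), (0.5, "c"), (0.7, "d")]
ordered = list(buffered_output(stream, buffer_size=2))
tight = list(buffered_output(stream, buffer_size=1))
```

The ordering is only approximate: a larger buffer gives output closer to true relevance order at the price of a longer wait for the first answer.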

18 The BANKS System
BANKS provides keyword search coupled with extensive browsing facilities: schema browsing + data browsing, and graphical display of data. Implemented using Java + servlets. Keyword search response times are typically 1 to 3 seconds on a DBLP database with 100,000 tuples and 300,000 edges (P3 600 MHz, 512 MB RAM). Try it out at

19 Example of Browsing in BANKS

20 Anecdotes
“Mohan”: returns C. Mohan at top, based on prestige (number of papers written). “Transaction”: returns Jim Gray’s classic paper and textbook as top answers, based on prestige (number of citations). “Sunita Seltzer”: no common papers, but both have papers with Stonebraker; the system finds this connection.

21 Effect of Parameters
Log scaling of edge weights worked well. (1 - λ) E + λ N versus E × N^λ made little difference. Best with λ = .2 (subdue node weights, but not entirely).
[Figure: results chart; surviving label: EdgeLog]

22 Related Work
DataSpot (DTL)/Mercado Intuifind [VLDB 98]: based on a patent by Palmon (filed 1995, granted 1998); based on a hypergraph model, with an answer model similar to ours. Differences: our model of backward link weights and prestige. Proximity Search [VLDB 98]: a different model of proximity, based on adding up support; no edge weights or prestige; a different evaluation algorithm. Information units (linked Web pages) [WWW10]: no directionality; only studied in the Web context. Microsoft DBExplorer (this conference): no ranking; based on SQL generation; addresses efficient construction of text indexes. Microsoft English Query.

23 Conclusions and Future Work
The next big wave: keyword searching and browsing of databases? Future work: keyword queries on XML. Disambiguating queries by selecting nodes (G.W.Bush: “Bush Jr” or “Bush Sr”) or tree structure (“coauthors” or “cites”). Boolean queries, stemming, thesaurus. Metadata: column/relation names.

24 Thank You

25 BANKS Query Result Example
Result of “Soumen Sunita”

27 Browsing Features
Hyperlinks are automatically added to all displayed results. Template facilities support a variety of tasks: browsing data by grouping and creating crosstabs (e.g., theses grouped by department and year); hierarchical views of data (nested XML style, even on relational data); graphical displays (bar charts, pie charts, etc.). Templates are generic and can be applied on any data matching the assumed schema, and can be applied after applying selections. New templates can be created by the user, interactively.

28 Combining Keyword Search and Browsing
Catalog searching applications: keywords may restrict answers to a small set, after which the user needs to browse the answers. If there are multiple answers, hierarchical browsing is required on the answers.

29 The BANKS System
Available on the Web, with (part of) the DBLP data. Connects to any database using JDBC; JDBC metadata features are used to provide schema browsing. No programming is needed for customization: minimal preprocessing of the database to create indices and give weights to links. Extensive set of browsing features.
[Figure: architecture diagram with labels User, HTTP, BANKS, Web Server + Servlets, JDBC, Database]

