Keyword Search in Relational Databases Jaehui Park Intelligent Database Systems Lab. Seoul National University 2009. 02. 12.

1 Keyword Search in Relational Databases Jaehui Park Intelligent Database Systems Lab. Seoul National University 2009. 02. 12.

2 Outline
• Introduction
• Bibliography
• Fundamental Characteristics
• Research Dimensions
• Summary
• Future Directions
Copyright © 2009 by CEBT

3 Introduction
• Querying structured data
  - Relational databases
    - A repository for a significant amount of data (e.g. enterprise data)
    - The RDBMS manages an abstract view of the underlying data
  - Structured Query Language (SQL)
    - Precise and complete
    - Difficult for casual users
• Querying unstructured data
  - (Web) documents
    - Collection of unstructured (natural language) documents available online
    - Search engine: the most popular application for information discovery
  - Keyword search
    - Simple and user-friendly
    - Approximates the precise results in statistical and semantic ways
• Deep Web
  - Information over the Web comes out of relational databases
[Figure: data (structured vs. unstructured) set against querying (precise vs. easy), pointing to an easy way of querying structured data]

4 Introduction
• Enabling casual users to query relational databases with keywords
  - "Casual users"
    - Without any knowledge of the schema
    - Without any knowledge of the query language (SQL)
• The search system should have this knowledge on behalf of users
• Challenges
  - Inherent discrepancy of data between IR and DB
    - Information is often split across tables (or tuples) in relational databases
    - Ex) A single retrieval unit of information
[Figure: relational databases return results for either SQL or keywords]

5 Bibliography
• Proximity
  - [Goldman et al., VLDB, 1998] Proximity Search in Databases
• DataSpot
  - [Palmon et al., VLDB, 1998] DTL's DataSpot - Database Exploration Using Plain Language
  - [Palmon et al., SIGMOD, 1998] DTL's DataSpot - database exploration as easy as browsing the Web
• DBXplorer
  - [Agrawal et al., ICDE, 2002] DBXplorer: a system for keyword-based search over relational databases
• BANKS
  - [Hulgeri et al., DEBU, 2001] Keyword Search in Databases
  - [Hulgeri et al., ICDE, 2002] Keyword Searching and Browsing in Databases using BANKS
  - [Kacholia et al., VLDB, 2005] Bidirectional Expansion For Keyword Search
• DISCOVER
  - [Hristidis et al., VLDB, 2002] DISCOVER: Keyword search in relational databases
  - [Hristidis et al., VLDB, 2003] Efficient IR-Style Keyword Search over Relational Databases
  - [Liu et al., SIGMOD, 2006] Effective Keyword Search in Relational Databases
• ObjectRank
  - [Balmin and Hristidis et al., VLDB, 2004] ObjectRank: Authority-Based Keyword Search in Databases
  - [Balmin and Hristidis et al., TODS, 2008] Authority-based search on databases

6 Proximity
• Proximity
  - Measure of how related objects are
  - Objects are related by a distance function
    - Shortest path computation
  - K-neighborhood distance look-up table
[Figure: a document set beside a relational database]
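The K-neighborhood distance look-up table can be illustrated with a small sketch. It assumes an unweighted, undirected data graph held in memory and measures distance in hops; the node names and the function names `k_neighborhood` and `build_lookup` are hypothetical, not taken from the Proximity paper.

```python
from collections import deque


def k_neighborhood(graph, source, k):
    """Return {node: hop distance} for every node within k hops of source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == k:
            continue  # do not expand beyond the K-neighborhood
        for nbr in graph.get(node, ()):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def build_lookup(graph, k):
    """Precompute the distance table for every node (assumed to fit in memory)."""
    return {node: k_neighborhood(graph, node, k) for node in graph}


# Hypothetical data graph: tuples are nodes, key references are edges.
graph = {
    "author:JHPark": ["writes:1"],
    "writes:1": ["author:JHPark", "paper:P1"],
    "paper:P1": ["writes:1", "writes:2"],
    "writes:2": ["paper:P1", "author:SGLee"],
    "author:SGLee": ["writes:2"],
}
lookup = build_lookup(graph, k=2)
```

With `k=2`, the two authors are four hops apart and therefore absent from each other's table entries, which is the trade-off of bounding the neighborhood: smaller tables, but distant pairs need a fallback computation.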

7 DataSpot
• Hyperbase
  - Modeling a data graph
  - Sub-hyperbase as an answer
• Best-first searching
[Figure: the relational tables Customers and Orders (sharing Customer ID 123456) are converted into a hyperbase whose nodes include Record, Field, Field Name, Field Value, String, Key 123456, Text "Customer", Text "ID", Stem "customer" and Thesaurus Stem "client"; a keyword query runs against the hyperbase, an SQL query against the relational databases]

8 DBXplorer
• Symbol table
  - Index for schema entities
  - Locating objects efficiently
    - Granularity
    - Compaction
• Schema graph
  - Join tree enumeration
    - Joining several tables on the fly
[Figure: a keyword query is looked up in a term-to-location symbol table built over the relational databases]
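Join tree enumeration over a schema graph can be sketched as follows. This is a simplified illustration that only enumerates join paths between two tables, not DBXplorer's actual algorithm; the schema is the Author/Writes/Paper/Cites example used later in the deck, and `join_paths` and `max_joins` are hypothetical names.

```python
def join_paths(schema, start, goal, max_joins):
    """Enumerate simple join paths from start to goal using at most max_joins joins."""
    results = []

    def extend(path):
        last = path[-1]
        if last == goal:
            results.append(path)
            return
        if len(path) - 1 == max_joins:
            return  # join budget exhausted
        for nxt in schema[last]:
            if nxt not in path:  # keep the path simple (no repeated table)
                extend(path + [nxt])

    extend([start])
    return sorted(results, key=len)


# Schema graph: an edge means the two tables share a key/foreign-key column.
schema = {
    "Author": ["Writes"],
    "Writes": ["Author", "Paper"],
    "Paper": ["Writes", "Cites"],
    "Cites": ["Paper"],
}
paths = join_paths(schema, "Author", "Paper", max_joins=3)
```

Each returned path corresponds to one candidate join that can be turned into an SQL query on the fly; bounding `max_joins` keeps the enumeration small on larger schemas.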

9 BANKS
• Directed (data) graph
  - Backward edge
  - Graph traversing algorithm
    - NP-hard problem
    - Heuristics: Backward Expanding search, Bi-directional expanding search
• Rich interface
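The idea behind backward expanding search can be shown with a deliberately simplified sketch. It assumes unit edge weights and expands each keyword's node set exhaustively along reversed edges, whereas BANKS interleaves the expansions incrementally and uses weighted edges; the function name `backward_expand` and the sample graph are hypothetical.

```python
from collections import deque


def backward_expand(edges, keyword_nodes):
    """Find a root that reaches one node of every keyword with minimal total hops.

    edges: list of directed (u, v) pairs.
    keyword_nodes: list of sets, one set of matching nodes per keyword.
    Returns (root, total_distance), or None if no common root exists.
    """
    reverse = {}
    for u, v in edges:
        reverse.setdefault(v, []).append(u)

    def distances_to(targets):
        # Multi-source BFS that follows edges backwards from the keyword nodes.
        dist = {t: 0 for t in targets}
        queue = deque(targets)
        while queue:
            node = queue.popleft()
            for pred in reverse.get(node, ()):
                if pred not in dist:
                    dist[pred] = dist[node] + 1
                    queue.append(pred)
        return dist

    per_keyword = [distances_to(nodes) for nodes in keyword_nodes]
    common = set(per_keyword[0]).intersection(*per_keyword[1:])
    if not common:
        return None
    cost, root = min((sum(d[n] for d in per_keyword), n) for n in common)
    return root, cost


# Edges point from the referencing tuple to the referenced tuple.
edges = [("writes:1", "author:JHPark"), ("writes:1", "paper:P1")]
result = backward_expand(edges, [{"author:JHPark"}, {"paper:P1"}])
```

Here the `Writes` tuple is the only node that reaches both keyword nodes, so it becomes the root of the answer tree connecting the author and the paper.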

10 DISCOVER
• High-level representation of the architecture for keyword search in relational databases
• Top-k join query processing
  - Pipeline algorithm
    - Threshold [Fagin et al. 2001]
• IR-style ranking function
  - TF-IDF based tuple ranking

11 ObjectRank
• Authority
  - Measure of how important objects are
    - Authority flow graph
  - Modified PageRank algorithm
    - (Global) ObjectRank algorithm
    - Inverse ObjectRank algorithm
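A keyword-specific, PageRank-style authority computation can be sketched as below. This is a minimal power iteration under stated assumptions: random jumps land only on the base set (the nodes containing the keyword), and each data-graph edge already carries its authority transfer rate. The rate 0.7, the node names and the function name `object_rank` are illustrative assumptions, not values from the ObjectRank papers.

```python
def object_rank(edges, base_set, damping=0.85, iterations=50):
    """Power iteration with random jumps restricted to the keyword base set.

    edges: {(src, dst): authority transfer rate along that edge}.
    base_set: nodes that contain the query keyword.
    """
    nodes = {n for edge in edges for n in edge} | set(base_set)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    jump = {n: (1.0 / len(base_set) if n in base_set else 0.0) for n in nodes}
    for _ in range(iterations):
        nxt = {n: (1 - damping) * jump[n] for n in nodes}
        for (src, dst), rate in edges.items():
            nxt[dst] += damping * rate * rank[src]  # authority flows src -> dst
        rank = nxt
    return rank


# P1 and P3 contain the keyword and both cite P2 (hypothetical citation edges).
edges = {("P1", "P2"): 0.7, ("P3", "P2"): 0.7}
rank = object_rank(edges, base_set={"P1", "P3"})
```

P2 does not contain the keyword, yet it ends up with the highest score because authority flows into it from both keyword-bearing papers, which is the behaviour that separates authority-based ranking from pure term matching.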

12 Fundamental Characteristics
• Identifying schema elements
  - To avoid linearly scanning all the tables
  - Indexing structure
    - Inverted index
• Processing queries
  - Keyword query processing
    - Making the best of the lack of syntax in query keywords
  - Formalizing internal queries
    - e.g. SQL
• Modeling answers
  - Logical unit of retrieval is not a document
    - e.g. Directed Acyclic Graph (DAG)
• Ranking answers
  - Assign each answer a single score that can reflect the semantics of the underlying schema
  - Order the returned answers
[Figure: keywords k1, k2, k3, k4 enter a search system (Indexing, Processing, Model, Ranking) that sits on top of the RDBMS and the RDB]
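An inverted index over relational data can be sketched as a map from each term to the cells that contain it, so a keyword look-up avoids scanning every table. The sketch assumes tables are given as in-memory lists of dictionaries and uses a simple word tokenizer; `build_inverted_index` and the sample rows are hypothetical.

```python
import re
from collections import defaultdict


def build_inverted_index(tables):
    """Map each lower-cased term to the set of (table, column, row index) cells containing it."""
    index = defaultdict(set)
    for table, rows in tables.items():
        for row_id, row in enumerate(rows):
            for column, value in row.items():
                for term in re.findall(r"\w+", str(value).lower()):
                    index[term].add((table, column, row_id))
    return index


tables = {
    "Author": [
        {"AuthorID": "JHPark", "AuthorName": "Jaehui Park"},
        {"AuthorID": "SGLee", "AuthorName": "Sang-goo Lee"},
    ],
    "Paper": [
        {"PaperID": "P1", "PaperName": "Web Content Summarization"},
    ],
}
index = build_inverted_index(tables)
```

Recording only `(table, column)` instead of the row index would give a coarser, more compact index at column granularity, at the cost of an extra query to locate the matching rows.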

13 Research Dimensions
• Fundamental characteristics
  - Model
  - Processing
  - Indexing
  - Ranking
• Research dimensions
  - Data Representation
  - Query Representation
  - Efficient Processing
  - Top-k query processing
  - Indexing structure
  - Ranking
  - Presentation

14 Data representation (1/4)
• Graph model
  - Data graph
  - Schema graph
[Figure: a schema graph over the tables Author (AuthorID, AuthorName, …), Paper (PaperID, PaperName, …), Writes (AuthorID, PaperID, …) and Cites (Citing, Cited, …), and a data graph whose sample tuples include Paper (J.H.Park08, "Web Content Summarization Using …"), Writes (JHPark, J.H.Park08) and (SGLee, S.G.Lee08), and Author (JHPark, Jaehui Park) and (SGLee, Sang-goo Lee)]

15 Data representation (2/4)
• Data graph
  - Efficient graph traversal
    - Reduces search time
  - Finding an optimal answer
    - NP-hard: Steiner tree problem
    - Heuristics
  - Size problem
    - Too large to fit into main memory
  - Maintenance problem
    - Not appropriate for update-intensive databases
[Figure: keywords traverse the RDB]

16 Data representation (3/4)
• Schema graph
  - Smaller size
    - Scales well for huge databases
  - Utilizes underlying RDBMS facilities
    - e.g. database indexes on columns
  - Exploits the schema of the underlying database
    - Generating optimal internal queries: SQL
    - Evaluation of queries
Query keywords: Jaehui Relational Database
Candidate join queries:
  Tmp1: select * from Paper, Writes where Paper.PaperName = 'Relational Database' AND …
  Tmp2: select * from Tmp1, Author where … Author.AuthorName = 'Jaehui' AND …
[Figure: query keywords traverse the RDB]
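A candidate join query of this kind can be run end to end with Python's built-in `sqlite3` module. The sketch below is an assumption-laden illustration: the sample rows are invented, the two temporary steps are collapsed into one three-way join, and `LIKE` substring matching replaces the equality tests of the slide so that a keyword can match part of a name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Author (AuthorID TEXT PRIMARY KEY, AuthorName TEXT);
CREATE TABLE Paper  (PaperID  TEXT PRIMARY KEY, PaperName  TEXT);
CREATE TABLE Writes (AuthorID TEXT, PaperID TEXT);
INSERT INTO Author VALUES ('JHPark', 'Jaehui Park'), ('SGLee', 'Sang-goo Lee');
INSERT INTO Paper  VALUES ('P1', 'Keyword Search in Relational Database'),
                          ('P2', 'Web Content Summarization');
INSERT INTO Writes VALUES ('JHPark', 'P1'), ('SGLee', 'P1'), ('JHPark', 'P2');
""")


def candidate_join(conn, author_kw, paper_kw):
    """Join Paper, Writes and Author, keeping rows that match both keywords."""
    sql = """
        SELECT a.AuthorName, p.PaperName
        FROM Paper p
        JOIN Writes w ON w.PaperID = p.PaperID
        JOIN Author a ON a.AuthorID = w.AuthorID
        WHERE p.PaperName LIKE ? AND a.AuthorName LIKE ?
    """
    return conn.execute(sql, (f"%{paper_kw}%", f"%{author_kw}%")).fetchall()


rows = candidate_join(conn, "Jaehui", "Relational Database")
```

Because the query is ordinary SQL, the RDBMS can use its own column indexes and optimizer, which is the main attraction of the schema-graph approach.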

17 Data representation (4/4)
• Graph model
  - A logical unit of information
    - Subgraph: a set of multiple nodes joined together
    - May include some tuples that do not contain any query keywords
  - Weighting scheme
    - Edges: distance (or proximity); join operations
    - Nodes: importance (or authority)
[Figure: a data graph of tuples T1 to T6 annotated with the keywords K1, K2 and K3, followed by three candidate answer subgraphs over the tuples {T1, T2, T3, T4}, {T1, T2, T3, T5, T6} and {T1, T3, T6}]

18 Ranking
• Relevance
  - Answer size
    - Minimal subgraph including all the query keywords
    - Distance as the semantic closeness between objects
      - The distance between an entity and its attributes
      - The distance between tuples in the same table
      - The distance between tuples related through primary and foreign keys
  - Term frequency
    - Standard IR weighting method: TF-IDF
    - Text databases (e.g. user complaints, product descriptions, book reviews, etc.)
• Importance
  - Authority
    - Authority transfer graph: nodes with incoming links from high-authority nodes are assumed to have higher importance
    - Specificity problem: specific results should be ranked higher than general ones, e.g. the InverseObjectRank algorithm
[Figure: an authority transfer schema graph over Author, Writes, Paper and Cites with edge rates 0.7, 0 and 0.2, and a data graph (authors Jane and Tom, papers "Tree Traverse algorithm …" and "Query Evaluation …") with node scores 0.2, 0.4 and 0.8]
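TF-IDF weighting applied to tuples can be sketched by treating the text of each tuple as a small document. This is the plain textbook form (raw term frequency times log inverse document frequency), not the normalized ranking function of any particular system; `tfidf_scores` and the sample complaint texts are hypothetical.

```python
import math
from collections import Counter


def tfidf_scores(docs, query_terms):
    """Score each tuple's text by the sum of tf * idf over the query terms."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n = len(tokenized)
    # Document frequency: number of tuples in which each term occurs.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        scores[doc_id] = sum(
            tf[t] * math.log(n / df[t]) for t in query_terms if df[t]
        )
    return scores


# Hypothetical text attribute of a user-complaints table, keyed by row id.
docs = {
    "r1": "battery drains fast battery hot",
    "r2": "screen cracked",
    "r3": "battery fine",
}
scores = tfidf_scores(docs, ["battery"])
```

For a joined answer spanning several tuples, the per-tuple scores would still have to be combined (for example summed and normalized by answer size), which is where the relevance and answer-size criteria meet.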

19 Efficient processing
• Indexing structure
  - Reducing scan time
    - Granularity levels of schema elements: column level vs. record (or cell) level
  - Reducing computation time
    - Precomputation: edge weights, node weights, relevance scores, etc.
• Query execution technique
  - Top-k query processing
    - Avoid creating all query results
    - Decide which candidate answers will produce the top-k results, e.g. Sparse algorithm, Pipeline algorithm
[Figure: two score-sorted lists, one with ROWID b1, b2, b3 and scores 90, 50, 12, the other with ROWID a1, a2, a3 and scores 76, 60, 15]
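Early termination in top-k processing can be illustrated with the classical threshold algorithm over score-sorted lists. The sketch assumes that every object appears in every list, that all lists have equal length, and that the aggregate score is a sum; it is the textbook form rather than the pipelined join variant, and the object names are invented (the scores reuse the numbers from the figure).

```python
import heapq


def threshold_topk(lists, k):
    """Return (top-k [(obj, total)], depth read) using the threshold algorithm.

    lists: score-sorted lists of (obj, score), highest score first.
    """
    # Stand-in for random access; a real system would use an index look-up.
    random_access = [dict(lst) for lst in lists]
    totals = {}
    for depth in range(len(lists[0])):
        for obj, _ in (lst[depth] for lst in lists):
            if obj not in totals:
                totals[obj] = sum(ra[obj] for ra in random_access)
        # No unseen object can score higher than the sum of the current row.
        threshold = sum(lst[depth][1] for lst in lists)
        top = heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top, depth + 1
    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1]), len(lists[0])


list_a = [("t1", 90), ("t2", 50), ("t3", 12)]
list_b = [("t2", 76), ("t1", 60), ("t3", 15)]
top, depth = threshold_topk([list_a, list_b], k=1)
```

The search stops after reading two of the three rows: at that point the best total seen (150) already exceeds the threshold (110), so the remaining entries cannot change the top-1 answer.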

20 Query representation
• Logical operators
  - Conjunction, disjunction
• Type and condition
  - Type
    - Find type, Near type
  - Conditional keywords
    - e.g. Year > 300

21 Presentation
• Visualizing search results
  - e.g. tree view
    - Structural level vs. tuple level
• Limiting the maximum size of an answer
• Limiting the maximum number of answers
• …

22 Summary
• Comparison in a common framework

System     | Data model               | Ranking                  | Efficiency                      | Query representation     | Presentation
Proximity  | Data-graph               | Distance                 | K-neighborhood distance look-up | Type, Conjunction        | -
DataSpot   | Data-graph               | Number of edges          | -                               | Conjunction              | Table
DBXplorer  | Schema-graph             | Number of joins          | Symbol table                    | Conjunction              | Enumerated rows
BANKS      | Data-graph (directed)    | Edge weight, Node weight | Disk resident index on keyword  | Conjunction              | Dynamic Joined Tree
DISCOVER   | Schema-graph             | Number of joins          | Master Index                    | Conjunction, Disjunction | -
ObjectRank | Schema-graph, Data-graph | Authority                | Master Index                    | Conjunction, Disjunction | -

23 Future Directions
• Probabilistic model
  - Naïve approaches
    - Rank measures based on the answer size
    - Cannot directly estimate the (probability of) relevance between the query and the retrieved tuples
    - Heuristics perform well
  - Probabilistic model
    - e.g. Bayesian belief network
    - Term-based approach to approximate the optimal answer
    - Modification for dealing with relational databases: dependencies between schema elements
• Efficient query processing
  - Top-k query processing has shown a great impact on performance
    - Ranking function involves aggregation or grouping operators
    - Symbol table design
• Conclusion
  - Various approaches are described according to our understanding
  - We envision the above research directions to be important to pursue.

24 Thank you

