Issues in Bridging DB & IR (presentation transcript)

1 Issues in Bridging DB & IR
Announcements:
- Next class: Interactive Review (come prepared)
- Homework III solutions online
- Demos tomorrow (instructions will be mailed by the end of class)
4/29

2 First, some discussion of BibFinder: how queries are mapped, etc.

3 CEAS Online Evaluations
You can do them at https://intraweb.eas.asu.edu/eval
Will be available until the end of day May 5th (so the exam is unfettered by what you might think about it)
Instructors get to see them only after the grades have been given (so you don't need to feel compelled to be particularly nice)
Your feedback would be appreciated (especially the written comments)
Last semester I got 2,196 words of comments; let us see if we can break the record ;-)

4 The popularity of the Web brings two broad challenges to databases:
Integration of autonomous data sources (data/information integration)
- Technically this has to handle heterogeneous data too, but we will sort of assume that the sources are "quasi-relational"
Supporting heterogeneous data (combining DB/IR)
- This can be tackled in the presence of a single database. The issues are:
-- How to do effective querying in the presence of structured and text data (e.g. the Stuff I Have Seen project)
-- How to support IR-style querying on DBs, because users seem to know IR/keyword-style querying better (notice the irony here: we said structure is good because it supports structured querying)
-- How to support imprecise queries

5 DB vs. IR
DBs allow structured querying
- Queries and results (tuples) are different objects
- Soundness & completeness expected
- User is expected to know what she is doing
IR only supports unstructured querying
- Queries and results are both documents!
- High precision & recall is hoped for
- User is expected to be a dunderhead.

6 Some specific problems
1. How to handle textual attributes?
2. How to support keyword-based querying?
3. How to handle imprecise queries? (Ullas Nambiar's work)

7 1. Handling text fields in data tuples
Often you have database relations some of whose fields are textual
- E.g. a movie database which has, in addition to year, director, etc., a column called "Review" containing unstructured text
Normal DB operations ignore this unstructured stuff (you can't join over it)
SQL sometimes supports a "Contains" constraint (e.g. give me the movies whose review contains "Rotten")
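As a rough illustration of such a "Contains"-style constraint, here is a minimal sketch using Python's sqlite3 module with a LIKE predicate as a stand-in; the movies table, its columns, and its rows are all hypothetical.

```python
import sqlite3

# Hypothetical movie relation with an unstructured "review" text column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, year INTEGER, director TEXT, review TEXT)")
conn.execute("INSERT INTO movies VALUES ('Movie A', 1999, 'Director X', 'A rotten plot but great acting')")
conn.execute("INSERT INTO movies VALUES ('Movie B', 2001, 'Director Y', 'A delightful comedy')")

# A "contains"-style constraint approximated with LIKE (case-insensitive for ASCII in SQLite).
rows = conn.execute(
    "SELECT title, year FROM movies WHERE review LIKE ?", ("%rotten%",)
).fetchall()
print(rows)  # [('Movie A', 1999)]
```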

8 Soft Joins: WHIRL [Cohen]
We can extend the notion of joins to "similarity joins", where similarity is measured in terms of vector similarity over the text attributes
The join tuples are output in ranked form, with the rank proportional to the similarity
Neat idea, but it does have some implementation difficulties
- Most tuples in the cross-product will have non-zero similarities
- So we need query processing that will somehow produce just the highly ranked tuples
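A minimal sketch of the similarity-join idea, not the actual WHIRL system: pair tuples from two relations by cosine similarity of word bags over a chosen text attribute and emit the pairs in ranked order. The relations, attribute names, and data below are made up, and the naive cross-product loop is exactly the implementation difficulty noted above.

```python
import math
import re
from collections import Counter

def bag(text):
    """Word-frequency bag for a text value."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(b1, b2):
    """Cosine similarity between two word bags."""
    dot = sum(b1[w] * b2[w] for w in b1 if w in b2)
    norm = math.sqrt(sum(c * c for c in b1.values())) * math.sqrt(sum(c * c for c in b2.values()))
    return dot / norm if norm else 0.0

def similarity_join(rel1, rel2, attr1, attr2, k=5):
    """Score every cross-product pair by text similarity and keep the top k."""
    scored = []
    for t1 in rel1:
        for t2 in rel2:
            sim = cosine(bag(t1[attr1]), bag(t2[attr2]))
            if sim > 0:
                scored.append((sim, t1, t2))
    return sorted(scored, key=lambda x: -x[0])[:k]

movies = [{"title": "The Matrix", "year": 1999}]
listings = [{"name": "Matrix, The (sci-fi action)", "price": 10},
            {"name": "Casablanca (romance)", "price": 8}]
for sim, m, l in similarity_join(movies, listings, "title", "name"):
    print(round(sim, 3), m["title"], "<->", l["name"])
```

A real system would use TF-IDF weights and inverted indexes so that only promising pairs are ever scored, rather than the full cross-product as in this toy loop.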

9

10 2. Supporting keyword search on databases
How do we answer a query like "Soumen Sunita"?
Issues:
-- The schema is normalized (not everything is in one table)
-- How to rank multiple tuples which contain the keywords?

11 What BANKS Does
The whole DB is seen as a directed graph (edges correspond to foreign keys)
Answers are subgraphs
Ranked by edge weights
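A toy approximation of this graph view, not the actual BANKS algorithm (which searches for Steiner-tree-like connecting subgraphs): tuples become nodes, foreign-key links become edges, and a keyword query is answered by a subgraph connecting the matching tuples, with shorter (lighter) connections ranking higher. The node ids and edges are invented for illustration.

```python
from collections import deque

# Hypothetical tuple graph: nodes are tuple ids, edges follow foreign-key links
# (two author rows, a "writes" relationship table, and one paper row).
edges = {
    "author:soumen": ["writes:1"],
    "author:sunita": ["writes:2"],
    "writes:1": ["author:soumen", "paper:42"],
    "writes:2": ["author:sunita", "paper:42"],
    "paper:42": ["writes:1", "writes:2"],
}

def connect(start, goal):
    """BFS over the tuple graph; returns one shortest connecting path."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in edges.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Keyword query "Soumen Sunita": connect the two matching tuples; with unit
# edge weights, a shorter connecting subgraph gets a higher score.
answer = connect("author:soumen", "author:sunita")
print(answer, "score:", round(1.0 / (len(answer) - 1), 3))
```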

12 BANKS: Keyword Search in DB

13 3. Supporting Imprecise Queries
Increasing number of Web-accessible databases
- E.g. bibliographies, reservation systems, department catalogs, etc.
- Support for precise queries only: exactly matching tuples
Difficulty in extracting desired information
- Limited query capabilities provided by form-based query interfaces
- Lack of schema/domain information
- Increasing complexity of types of data, e.g. hypertext, images, etc.
Often the user wants 'about the same' instead of 'exact'
- Bibliography search: find similar publications
- Want cars priced 'around' $7000
Solution: provide answers closely matching the query constraints

14 Relaxing queries…
It is obvious how to relax certain types of attribute values
- E.g. price=7000 is approximately the same as price=7020
But how do we relax categorical attributes? How should we relax Make=Honda?
Two possible approaches:
- Assume that domain-specific information about the similarity of values is available (difficult to satisfy in practice)
- Attempt to derive the similarity between attribute values directly from the data
Question: How do we compute the similarity between Make=Honda and Make=Chevrolet?
Idea: Compare the set of all tuples where Make=Honda to the set of all tuples where Make=Chevrolet (see the sketch below)
- Consider each set of tuples as a vector of bags (where the bags correspond to the individual attributes)
- Use IR similarity techniques to compare the vectors
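A minimal sketch of this bag-comparison idea: for each value of Make, build one bag per remaining attribute from the tuples carrying that value, then compare the two vectors of bags with an IR measure (cosine here). The toy car relation, its attributes, and its values are all invented.

```python
import math
from collections import Counter

# Toy car relation; the attribute names and values are made up for illustration.
cars = [
    {"Make": "Honda", "Price": "7k-8k", "BodyStyle": "sedan"},
    {"Make": "Honda", "Price": "9k-10k", "BodyStyle": "sedan"},
    {"Make": "Chevrolet", "Price": "7k-8k", "BodyStyle": "sedan"},
    {"Make": "Chevrolet", "Price": "30k+", "BodyStyle": "coupe"},
]

def value_bags(relation, attr, value, other_attrs):
    """One bag per remaining attribute, built from tuples where attr = value."""
    bags = {a: Counter() for a in other_attrs}
    for t in relation:
        if t[attr] == value:
            for a in other_attrs:
                bags[a][t[a]] += 1
    return bags

def cosine(b1, b2):
    """Cosine similarity between two value bags."""
    dot = sum(b1[k] * b2[k] for k in b1 if k in b2)
    norm = math.sqrt(sum(v * v for v in b1.values())) * math.sqrt(sum(v * v for v in b2.values()))
    return dot / norm if norm else 0.0

others = ["Price", "BodyStyle"]
honda = value_bags(cars, "Make", "Honda", others)
chevy = value_bags(cars, "Make", "Chevrolet", others)
print(round(sum(cosine(honda[a], chevy[a]) for a in others) / len(others), 3))
```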

15 Finding similarities between attribute values

16 5/4

17 Challenges in answering Imprecise Queries
We introduce IQE (Imprecise Query Engine):
- Uses the query workload to identify other precise queries
- Extracts additional tuples satisfying a query by issuing similar precise queries
- Measures distance between queries using Answerset Similarity
Challenges:
- Extracting additional tuples with minimal domain knowledge
- Estimating similarity with minimal user input

18 Answerset Similarity
Answerset A(Q): the set of all answer tuples of query Q given by relation R
Query similarity: Sim(Q1,Q2) := Sim(A(Q1), A(Q2))
Measuring answerset similarity:
- Relational model: exact match between tuples; captures only complete overlap
- Vector space model: match keywords; also detects partial overlaps
Problem: a vector space model representation for answersets. Answer: the SuperTuple
Example answerset for Q(Author=Widom): (Widom, Stream…, VLDB, 2002), (Widom, Optimize…, ICDE, 1998)
Example answerset for Q(Author=Ullman): (Ullman, Optimize…, PODS, 1998), (Ullman, Mining…, VLDB, 2000)
ST(Q_Author=Widom):
- Co-author: R. Motwani:3, Molina:6, …
- Title: warehouse:5, optimizing:2, streams:6, …
- Conference: SIGMOD:3, VLDB:4, …
- Year: 2000:6, 1999:5, …
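A minimal sketch of supertuple construction as just described: collapse an answerset into one keyword bag per attribute, with occurrence counts. The abbreviated answer tuples below only loosely mimic the Widom example and are not the real BibFinder data.

```python
import re
from collections import Counter

def supertuple(answerset, attributes):
    """Collapse an answerset into one keyword bag per attribute."""
    st = {a: Counter() for a in attributes}
    for tuple_ in answerset:
        for a in attributes:
            st[a].update(re.findall(r"\w+", str(tuple_[a]).lower()))
    return st

answers_widom = [
    {"CoAuthor": "R. Motwani", "Title": "stream data warehouse", "Conference": "VLDB", "Year": 2002},
    {"CoAuthor": "Molina",     "Title": "optimizing warehouse views", "Conference": "ICDE", "Year": 1998},
]
st = supertuple(answers_widom, ["CoAuthor", "Title", "Conference", "Year"])
print(st["Title"])  # e.g. Counter({'warehouse': 2, 'stream': 1, ...})
```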

19 Similarity Measures
Jaccard similarity metric with bag semantics: SimJ(Q1,Q2) = |Q1 ∩ Q2| / |Q1 ∪ Q2|
Doc-Doc similarity
- Equal importance to all attributes
- The supertuple is considered as a single bag of keywords
- Sim_doc-doc(Q1,Q2) = SimJ(ST_Q1, ST_Q2)
Weighted-Attribute similarity
- Weights assigned to attributes signify their importance to the user
- Sim_watr(Q1,Q2) = Σ_i w_i × SimJ(ST_Q1(A_i), ST_Q2(A_i))
ST(Q_Author=Ullman):
- Co-author: C. Li:5, R. Motwani:7, …
- Title: data-mining:3, optimizing:5, …
- Conference: SIGMOD:5, VLDB:5, …
- Year: 2000:6, 1999:5, …
ST(Q_Author=Widom):
- Co-author: R. Motwani:3, Molina:6, …
- Title: warehouse:5, optimizing:2, streams:6, …
- Conference: SIGMOD:3, VLDB:4, …
- Year: 2000:6, 1999:5, …
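A minimal sketch of the two measures, assuming supertuples are represented as dicts of per-attribute Counter bags (as in the construction sketch above); the supertuple contents and the weights are illustrative only, loosely echoing the Conference/Year rows shown.

```python
from collections import Counter

def bag_jaccard(b1, b2):
    """SimJ with bag semantics: |intersection| / |union| over counted keywords."""
    inter = sum((b1 & b2).values())
    union = sum((b1 | b2).values())
    return inter / union if union else 0.0

def sim_doc_doc(st1, st2):
    """Doc-Doc similarity: treat each supertuple as a single bag of keywords."""
    return bag_jaccard(sum(st1.values(), Counter()), sum(st2.values(), Counter()))

def sim_watr(st1, st2, weights):
    """Weighted-Attribute similarity: per-attribute Jaccard combined by weights."""
    return sum(w * bag_jaccard(st1[a], st2[a]) for a, w in weights.items())

st_widom = {"Conference": Counter({"sigmod": 3, "vldb": 4}), "Year": Counter({"2000": 6, "1999": 5})}
st_ullman = {"Conference": Counter({"sigmod": 5, "vldb": 5}), "Year": Counter({"2000": 6, "1999": 5})}
print(round(sim_doc_doc(st_widom, st_ullman), 3))                               # 0.857
print(round(sim_watr(st_widom, st_ullman, {"Conference": 0.5, "Year": 0.5}), 3))  # 0.85
```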

20 Empirical Evaluation
Goal: evaluate the efficiency and effectiveness of our approach
Setup:
- A database system extending the bibliography mediator BibFinder, projecting the relation Publications(Author, Title, Conference, Journal, Year)
- Query log consists of 10K precise queries
- User study: 3 graduate students; 90 test queries, 30 chosen by each student
- Platform: Java 2 on a Linux server (Intel Celeron 2.2 GHz, 512 MB)
Costs:
- Supertuple generation: 126 sec, 21 MB
- Similarity estimation: 10 hrs, 6 MB

21 Answering an Imprecise Query
Estimating query similarity:
- For each q ∈ Q_log, compute Sim(q,q') for all q' ∈ Q_log
- Sim_doc-doc(q,q') = SimJ(ST_q, ST_q')
- Sim_watr(q,q') = Σ_i w_i × SimJ(ST_q(A_i), ST_q'(A_i))
Extracting similar answers, given a query Q:
- Map Q to a query q ∈ Q_log
- Identify the 'k' queries most similar to q
- Execute the 'k' new queries
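A minimal sketch of this retrieval step: an incoming imprecise query is mapped onto a workload query, the k most similar workload queries are looked up in a precomputed similarity table, and their answers are gathered and ranked by query similarity. The similarity values, the queries, and the fake_db stand-in for executing a precise query are all invented.

```python
# Precomputed pairwise similarities between workload queries (illustrative values).
workload_sim = {
    'Title="web-based learning"': {
        'Title="e learning"': 0.8,
        'Title="web technology"': 0.6,
        'Author="Abiteboul"': 0.0,
    },
}

def answer_imprecise(query, k, execute):
    """Map the query onto the workload, pick k similar queries, gather their answers."""
    neighbors = sorted(workload_sim.get(query, {}).items(), key=lambda kv: -kv[1])[:k]
    results = []
    for q, sim in neighbors:
        if sim > 0:
            for tup in execute(q):
                results.append((sim, tup))  # rank borrowed answers by query similarity
    return sorted(results, key=lambda r: -r[0])

fake_db = {'Title="e learning"': ["paper-1", "paper-2"], 'Title="web technology"': ["paper-3"]}
print(answer_imprecise('Title="web-based learning"', k=2, execute=lambda q: fake_db.get(q, [])))
```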

22 Some Results
Imprecise query → top relevant queries:
1. Title="web-based learning" → Title="e learning", Title="web technology", Conference="WISE"
2. Title="Information Extraction" → Title="Information filtering", Title="text mining", Title="relevance feedback"
3. Author="Abiteboul" → Author="vianu", Author="Dan Suciu"

23 Relevance of Suggested Answers Are the results precise? Average error in relevance estimation is around 25%

24 User Study – Summary
- Precision for the top-10 related queries is above 75%
- The Doc-Doc similarity measure dominates the Weighted-Attribute similarity measure
Lessons:
- Queries with popular keywords are difficult
- Efficiently and effectively capturing user interest is difficult
- A solution requiring less user input is more acceptable

25 What's Next?
Open issues:
- The most similar query may not be present in the workload
- Answers to a similar query will have varying similarity depending on the affected attributes
Solution:
- Given an imprecise query, generate the most similar query
- Use attribute importance and value-value similarity to order tuples
Challenges:
- Estimating attribute importance
- Estimating value-value similarity

26 Learning the Semantics of the Data
Estimate for value-value similarity:
- Similarity between two values v1, v2 of a categorical attribute A:
  Sim(v1,v2) = Σ_i w_i × Sim(Co-related_values(A_i,v1), Co-related_values(A_i,v2)), where A_i ∈ Attributes(R), A_i ≠ A
- Euclidean distance for numerical attributes
Use the model of the database (AFDs, keys, value correlations) to identify an implicit structure for the tuple; show the other tuples that least break that structure
Example: CarDb(Make, Model, Year, Price, Mileage, Location, Color)
Approximate keys:
- Model, Mileage, Location uniquely decides 90% of the cars in the DB
- Model, Mileage, Color uniquely decides 84% of the cars in the DB
Approximate Functional Dependencies (AFDs), e.g. (a confidence-scoring sketch follows below):
- Model -> Make
- Year -> Price
- Mileage -> Year
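The AFDs above hold only approximately; here is a minimal sketch of one common way to score such a dependency (Model -> Make) from the data: the confidence is the fraction of tuples explained when each determining value is mapped to its most frequent determined value. The toy relation is made up.

```python
from collections import Counter, defaultdict

cars = [
    {"Model": "Civic", "Make": "Honda"},
    {"Model": "Civic", "Make": "Honda"},
    {"Model": "Accord", "Make": "Honda"},
    {"Model": "Civic", "Make": "Honda (rebadged)"},  # the exception that makes the AFD approximate
]

def afd_confidence(relation, lhs, rhs):
    """Fraction of tuples explained when each lhs value maps to its most common rhs value."""
    groups = defaultdict(Counter)
    for t in relation:
        groups[t[lhs]][t[rhs]] += 1
    explained = sum(counts.most_common(1)[0][1] for counts in groups.values())
    return explained / len(relation)

print(afd_confidence(cars, "Model", "Make"))  # 0.75 -> Model -> Make holds approximately
```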

27 Query relaxation

28 Finding similarities between attribute values

29 Summary
An approach for answering imprecise queries over Web databases
- Answerset similarity using supertuples
- Workload queries
- Database unaffected
Empirical evaluation showing high relevance of the identified similar queries
Applicable to any existing database

30 Conclusion
Havasu Integration System
- Introduced and described its salient features
- Elaborated on StatMiner, the coverage/overlap learning component
Imprecise Query Answering
- An IR-based solution described
- Results showing effectiveness presented
- Open questions and a new solution described

31 Publications
Imprecise Queries & Data Integration
- Answering Imprecise Database Queries, ACM WIDM, 2003
- Mining Coverage Statistics for Websource Selection in a Mediator, CIKM 2002
- Mining Source Coverage Statistics for Data Integration, ACM WIDM, 2001
- Optimizing Recursive Information Gathering Plans in EMERAC, JIIS, February 2004 (to appear)
Other
- The XOO7 Benchmark, EEXTT 2002
- Efficient XML Data Management: An Analysis, ECWEB 2002
- XOO7: Applying the OO7 Benchmark to XML Query Processing Tools, CIKM 2001

