Automated Ranking of Database Query Results. Sanjay Agarwal, Surajit Chaudhuri, Gautam Das, Aristides Gionis. Presented by Mahadevkirthi Mahadevraj and Sameer Gupta.


Contents Introduction Different ranking functions Breaking ties Implementation Experiments Conclusion

Introduction Automated ranking of query results is a popular aspect of IR, whereas database systems support only a Boolean query model. This leads to two problems: Empty answers - too selective queries. Many answers - too broad queries.

Automated Ranking Functions for the Empty-Answers Problem IDF Similarity - mimics the TF-IDF concept of IR for heterogeneous data. QF Similarity - utilizes workload information. QFIDF Similarity - a combination of QF and IDF.

Inverse Document Frequency (IDF) An IR technique. The cosine similarity between a query Q and a document D is the normalized dot product of their term vectors, where Q is the set of query keywords and D is a document. IDF (inverse document frequency) is defined as IDF(w) = log(N / F(w)), where N is the number of documents and F(w) is the number of documents in which w appears. IDF captures the intuition that the most commonly occurring words convey the least information. The cosine similarity may be further refined by scaling each vector component with the IDF of the corresponding word.
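As a minimal sketch of the formula above (the three-document collection is hypothetical), IDF can be computed directly:

```python
import math

def idf(word, docs):
    """IDF(w) = log(N / F(w)): N = number of documents,
    F(w) = number of documents in which w appears."""
    f = sum(1 for d in docs if word in d)
    return math.log(len(docs) / f)

# Hypothetical collection: "car" appears in 2 of 3 documents, "cheap" in 1,
# so the rarer word gets the higher IDF weight.
docs = [{"cheap", "car"}, {"fast", "car"}, {"fast", "bike"}]
```

With this collection, idf("cheap", docs) = log(3) exceeds idf("car", docs) = log(3/2).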

IDF Similarity Consider a database with only categorical attributes and a query Q with a condition of the form "WHERE A1 = q1 AND ... AND Am = qm". Define IDFk(t) = log(n / Fk(t)), where n is the number of tuples in the database and Fk(t) is the frequency of tuples with Ak = t. For a pair of values u and v in Ak, the similarity coefficient Sk(u, v) is defined as IDFk(u) if u = v, and 0 otherwise. The similarity between a tuple T and the query Q is the sum of the corresponding similarity coefficients over all attributes. TF is irrelevant here, since a value occurs at most once per attribute within a tuple. This similarity function is known as IDF similarity.
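A runnable sketch of this definition (the cars table is hypothetical) ranks tuples by summing IDFk(qk) over the attributes where the tuple matches the query:

```python
import math
from collections import Counter

def idf_rank(tuples, query):
    """IDF similarity: SIM(T, Q) = sum over attributes k of
    IDF_k(q_k) = log(n / F_k(q_k)) wherever the tuple matches the query."""
    n = len(tuples)
    # Per-attribute value frequencies F_k(t).
    freq = {a: Counter(t[a] for t in tuples) for a in query}
    def sim(t):
        return sum(math.log(n / freq[a][q])
                   for a, q in query.items() if t[a] == q)
    return sorted(tuples, key=sim, reverse=True)

# Hypothetical data: the rare value "convertible" dominates the ranking.
cars = [{"type": "sedan", "mfr": "Nissan"},
        {"type": "convertible", "mfr": "Nissan"},
        {"type": "sedan", "mfr": "Toyota"},
        {"type": "sedan", "mfr": "Honda"}]
ranked = idf_rank(cars, {"type": "convertible", "mfr": "Nissan"})
```

Here the convertible Nissan scores log(4/1) + log(4/2), the sedan Nissan only log(4/2), and the rest 0.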

IDF Similarity Example Consider a query: Select car from automobile_database where type = "convertible" and manufacturer = "Nissan"; "Convertible" is a rare value, and hence its IDF is high.

Numerical vs. Categorical Data Consider a query: Select house from house_database where price = $300,000 and no_bedrooms = 10;

Numeric IDF Similarity Example Consider a query: Select house from house_database where no_of_rooms = 4. Here v = 4, and each tuple is scored by Diff = |u - v| + 1 and Sim = 1/Diff. For example, tuple B1 with u = 5 gets Diff = 2 and Sim = 0.5, while B4 with u = 4 gets Diff = 1 and Sim = 1, so B4 is ranked first in the output.

Generalizations of IDF Similarity For numeric data it is inappropriate to use the previous categorical similarity coefficients: the frequency of a numeric value depends on nearby values, and discretizing a numeric attribute into a categorical one is problematic. Solution: let {t1, t2, ..., tn} be the values of attribute A. For every value t, the frequency is the sum of "contributions" of t from every other point ti, with contributions modeled as a Gaussian distribution: F(t) = Σi exp(-(1/2)((t - ti)/h)²), where h is a bandwidth parameter, and the similarity function uses IDF(t) = log(n / F(t)).
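A minimal sketch of this Gaussian-kernel frequency estimate (the build-year data is hypothetical, echoing the realtor example):

```python
import math

def numeric_idf(t, values, h):
    """Numeric IDF: F(t) is the sum of Gaussian contributions
    exp(-0.5 * ((t - t_i) / h)**2) from every value t_i, so values in
    dense regions get a low IDF; h is the bandwidth parameter."""
    f = sum(math.exp(-0.5 * ((t - ti) / h) ** 2) for ti in values)
    return math.log(len(values) / f)

# Hypothetical build years: 2008 sits in a dense cluster, 1981 does not,
# so 1981 gets the higher (rarer) IDF without any discretization.
years = [2006, 2007, 2008, 2008, 2008, 1981]
```

Nearby values now raise each other's effective frequency, which is exactly what per-value counting misses.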

Other Generalizations Let a query q have a condition C generalized as "A IN Q", where Q is a set of values for a categorical attribute or a range [lb, ub] for a numeric attribute. The similarity function S(t, q) generalizes accordingly to these set and range conditions.

Problems with IDF Similarity Problem: in a realtor database, more homes were built in recent years such as 2007 and 2008 than in 1980 and 1981. Thus recent years have a small IDF, yet newer homes are in higher demand. Solution: QF similarity.

QF Similarity: Leveraging Workloads The importance of attribute values is determined by the frequency of their occurrence in the workload. In the example above, it is reasonable to assume that more queries request newer homes than older homes, so the year 2008 appears more frequently in the workload than the year 1981.

QF Similarity For categorical data, the query frequency is QF(q) = RQF(q) / RQFMax, where RQF(q) is the raw frequency of occurrence of value q of attribute A in the query strings of the workload, and RQFMax is the raw frequency of the most frequently occurring value in the workload. Then S(t, q) = QF(q) if q = t, and 0 otherwise.

QF Similarity Example Consider a workload where attribute A takes the values {1, 1, 2, 3, 4, 5, 5, 5, 5, 2}. If a query now requests A = 1, then QF(1) = RQF(1) / RQFMax = 2/4. If a query requests an attribute value not in the workload, then QF = 0.
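The example can be checked with a small sketch of the QF formula:

```python
from collections import Counter

def make_qf(workload_values):
    """QF(q) = RQF(q) / RQFMax; values absent from the workload get 0."""
    rqf = Counter(workload_values)
    rqf_max = max(rqf.values())
    return lambda q: rqf.get(q, 0) / rqf_max

qf = make_qf([1, 1, 2, 3, 4, 5, 5, 5, 5, 2])
# RQF(1) = 2 and RQFMax = RQF(5) = 4, so qf(1) = 0.5;
# qf(7) = 0 since 7 never occurs in the workload.
```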

QF Similarity: Different Attribute Values Similarity between pairs of different categorical attribute values can also be derived from the workload, e.g. to find S(TOYOTA CAMRY, HONDA ACCORD). The similarity coefficient between a tuple value and a query value in this case is the Jaccard coefficient of their workload query sets, scaled by the QF factor: S(t, q) = J(W(t), W(q)) QF(q), where W(t) is the set of workload queries that reference t.
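A sketch of this coefficient, with hypothetical query-ID sets W(t) and a hypothetical QF value:

```python
def jaccard(a, b):
    """J(A, B) = |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical W(t): IDs of the workload queries referencing each value.
W = {"TOYOTA CAMRY": {1, 2, 3, 5}, "HONDA ACCORD": {2, 3, 5, 8}}
QF = {"HONDA ACCORD": 0.8}  # hypothetical query frequency

# S(t, q) = J(W(t), W(q)) * QF(q)
s = jaccard(W["TOYOTA CAMRY"], W["HONDA ACCORD"]) * QF["HONDA ACCORD"]
# J = |{2,3,5}| / |{1,2,3,5,8}| = 3/5 = 0.6, so s = 0.48
```

Two values that are requested by largely the same workload queries thus come out similar even though they never match exactly.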

Analyzing Workloads Analyzing IN clauses of queries: if a certain pair of values often occurs together in the workload, they are similar, e.g. queries with condition C as "MFR IN {TOYOTA, HONDA, NISSAN}", or several recent queries in the workload by a specific user repeatedly requesting TOYOTA and HONDA. Numeric values that occur in the workload can also benefit from query-frequency analysis.

QFIDF Similarity QF is purely workload-based, a big disadvantage for insufficient or unreliable workloads. QFIDF similarity sets S(t, q) = QF(q) * IDF(q) when t = q, where QF(q) = (RQF(q) + 1) / (RQFMax + 1). Thus we get a small non-zero value even if a value is never referenced in the workload.
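A sketch of the smoothed combination (workload and column data hypothetical):

```python
import math
from collections import Counter

def qfidf(t, q, workload_values, column_values):
    """QFIDF: S(t, q) = QF(q) * IDF(q) when t = q, else 0, with the
    smoothed QF(q) = (RQF(q) + 1) / (RQFMax + 1) so that values never
    seen in the workload still get a small non-zero score."""
    if t != q:
        return 0.0
    rqf = Counter(workload_values)
    qf = (rqf.get(q, 0) + 1) / (max(rqf.values()) + 1)
    idf = math.log(len(column_values) / Counter(column_values)[q])
    return qf * idf

# Hypothetical workload [1, 1, 2] never mentions the value 3, yet a
# matching tuple with value 3 still scores (0+1)/(2+1) * log(4/1) > 0.
score = qfidf(3, 3, [1, 1, 2], [1, 1, 2, 3])
```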

Breaking Ties Using QF Problem: many tuples may tie for the same similarity score and get ordered arbitrarily; this arises in both the empty-answers and many-answers problems. Solution: determine weights of missing attribute values that reflect their "global importance" for ranking purposes by using workload information, i.e. extend QF similarity and use this quantity to break ties. Consider a query requesting 4-bedroom houses: Arlington is less important than Dallas.

Problems with Breaking Ties Using IDF Large-IDF scenario for missing attributes: Arlington homes would be given more preference than Dallas homes since Arlington has the higher IDF, but this does not hold in practice. Small-IDF scenario for missing attributes: consider homes with decks; since we prefer the smaller IDF, preference would be given to homes without decks, which again does not hold in practice.

Implementation Pre-processing component Query-processing component

Pre-processing Component Compute and store a representation of the similarity function in auxiliary database tables. For categorical data, compute IDF(t) (resp. QF(t)) by counting the frequency of occurrences of values in the database (resp. workload) and store the results in auxiliary database tables. For numeric data, an approximate representation of the smooth function IDF() (resp. QF()) is stored, so that function values can be retrieved at runtime.

Query-processing Component Main task: given a query Q and an integer K, retrieve the top-K tuples from the database using one of the ranking functions extracted in the pre-processing phase; SQL-DBMS functionality is used for solving the top-K problem. Handling the simpler query-processing problem: Input: a table R with m categorical columns, a key column TID, a condition C that is a conjunction of the form Ak = qk, and an integer K. Output: the top-K tuples of R most similar to Q. Similarity function: overlap similarity.

Implementation of the Top-K Operator Traditional approach? An index-based approach is used instead: the overlap similarity function satisfies the following monotonicity property: if T and U are two tuples such that Sk(tk, qk) ≤ Sk(uk, qk) for all k, then SIM(T, Q) ≤ SIM(U, Q). This lets us adapt Fagin's Threshold Algorithm (TA): implement sorted and random access methods, perform sorted access for each attribute, retrieve complete tuples with the corresponding TID by random access, and maintain a buffer of the top-K tuples seen so far.

Threshold Algorithm (TA) Read all grades of an object once it is seen under sorted access; there is no need to wait until the lists give k common objects. Do sorted access (and corresponding random accesses) until you have seen the top-k answers. How do we know that the grades of seen objects are higher than the grades of unseen objects? Predict the maximum possible grade of unseen objects with a threshold value. With L1 = (a, 0.9), (b, 0.8), (c, 0.72), (d, 0.6) and L2 = (d, 0.9), (a, 0.85), (b, 0.7), (c, 0.2), the threshold at the stopping point is T = min(0.72, 0.7) = 0.7.

Example - Threshold Algorithm Step 1: parallel sorted access to each list, L1 = (a, 0.9), (b, 0.8), (c, 0.72), (d, 0.6) and L2 = (d, 0.9), (a, 0.85), (b, 0.7), (c, 0.2). For each object seen: get all its grades by random access, determine Min(A1, A2), and keep it in the buffer if it is among the 2 highest seen. After the first round the buffer holds a with Min(0.9, 0.85) = 0.85 and d with Min(0.6, 0.9) = 0.6.

Example - Threshold Algorithm Step 2: determine the threshold value based on the objects currently seen under sorted access: T = min(L1, L2) = min(0.9, 0.9) = 0.9. Are there 2 objects with overall grade ≥ the threshold value? If so stop, else go to the next entry position in the sorted lists and repeat step 1. Here a (0.85) and d (0.6) are both below 0.9, so we continue.

Example - Threshold Algorithm Step 1 (again): parallel sorted access continues on L1 = (a, 0.9), (b, 0.8), (c, 0.72), (d, 0.6) and L2 = (d, 0.9), (a, 0.85), (b, 0.7), (c, 0.2). Object b is now seen; random access gives Min(0.8, 0.7) = 0.7, so the buffer of the 2 highest becomes a (0.85) and b (0.7).

Example - Threshold Algorithm Step 2 (again): determine the threshold value based on the objects currently seen: T = min(L1, L2) = min(0.8, 0.85) = 0.8. Only a (0.85) has overall grade ≥ the threshold value, so we go to the next entry position in the sorted lists and repeat step 1.

Example - Threshold Algorithm Situation at the stopping condition: T = min(0.72, 0.7) = 0.7, and both buffered objects a (0.85) and b (0.7) have overall grade ≥ 0.7, so the algorithm stops with {a, b} as the top-2.
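The worked example above can be reproduced with a compact sketch of TA, using min() as the aggregation function and dict lookups to simulate random access:

```python
def threshold_algorithm(sorted_lists, k):
    """Fagin's Threshold Algorithm with min() as the aggregate.
    sorted_lists: one [(object, grade), ...] list per attribute,
    each sorted by grade descending."""
    random_access = [dict(lst) for lst in sorted_lists]
    seen = {}
    for depth in range(max(len(lst) for lst in sorted_lists)):
        for lst in sorted_lists:
            if depth < len(lst):
                obj, _ = lst[depth]          # sorted access
                if obj not in seen:          # random access for all grades
                    seen[obj] = min(ra.get(obj, 0.0) for ra in random_access)
        # Threshold: best possible overall grade of any unseen object.
        t = min(lst[min(depth, len(lst) - 1)][1] for lst in sorted_lists)
        top = sorted(seen.items(), key=lambda kv: kv[1], reverse=True)[:k]
        if len(top) == k and top[-1][1] >= t:
            return top
    return sorted(seen.items(), key=lambda kv: kv[1], reverse=True)[:k]

L1 = [("a", 0.9), ("b", 0.8), ("c", 0.72), ("d", 0.6)]
L2 = [("d", 0.9), ("a", 0.85), ("b", 0.7), ("c", 0.2)]
# Stops at T = min(0.72, 0.7) = 0.7 with top-2 = [("a", 0.85), ("b", 0.7)]
top2 = threshold_algorithm([L1, L2], 2)
```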

Index-based TA (ITA) Sorted access Random access

Index-based TA (ITA): Stopping Condition Hypothetical tuple - take the current values a1, ..., ap for attributes A1, ..., Ap, corresponding to the index seeks on L1, ..., Lp, and qp+1, ..., qm for the remaining columns directly from the query. Termination - stop when the similarity of the hypothetical tuple to the query is less than that of the tuple in the top-k buffer with the least similarity.

ITA for Numeric Columns Consider a query with a condition Ak = qk for a numeric column Ak. Two index scans are performed on Ak: the first retrieves TIDs with values > qk in increasing order, the second retrieves TIDs with values < qk in decreasing order. We then pick TIDs from the merged stream. ITA can be extended to IN and range queries; additional challenges arise for ranking functions over sets of tables.
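The two-scan merge can be sketched as follows (the rooms column is hypothetical; `bisect` finds the split point an index seek would return):

```python
import bisect

def merged_scan(values, q):
    """Merge two index scans around q: values >= q in increasing order and
    values < q in decreasing order, yielding values in order of increasing
    distance from q. `values` must be sorted ascending, as an index scan
    over A_k would deliver them."""
    hi = bisect.bisect_left(values, q)  # first value >= q
    lo = hi - 1                         # last value < q
    while lo >= 0 or hi < len(values):
        up = values[hi] - q if hi < len(values) else float("inf")
        down = q - values[lo] if lo >= 0 else float("inf")
        if up <= down:
            yield values[hi]
            hi += 1
        else:
            yield values[lo]
            lo -= 1

# Hypothetical no_of_rooms column: values closest to q = 5 come first.
order = list(merged_scan([1, 3, 4, 7, 9], 5))  # [4, 7, 3, 9, 1]
```

This ordering is exactly what sorted access needs for a numeric similarity that decreases with |u - q|.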

Experiments Ranking-function quality vs. workload.

Experiments Varying the number of attributes; varying K in top-K.

Conclusion An automated ranking infrastructure for SQL databases. Extended TF-IDF-based techniques from information retrieval to numeric and mixed data. Implemented ranking functions that exploit indexed access (Fagin's TA).