Answering Approximate Queries over Autonomous Web Databases Xiangfu Meng, Z. M. Ma, and Li Yan College of Information Science and Engineering, Northeastern University.

Answering Approximate Queries over Autonomous Web Databases Xiangfu Meng, Z. M. Ma, and Li Yan College of Information Science and Engineering, Northeastern University, Shenyang, China

Organization 1. Motivation 2. Approximate Query 3. Approximate Query Results Ranking 4. Algorithmic Solutions 5. Conclusion and Further Research

1 Motivation Empty answers: When a query over a Web database is too selective, the answer may be empty or too small. In that case, it is desirable to have the option of relaxing the original query so as to present more relevant items that closely meet the user's needs and preferences. Approximate query results ranking: Users' preferences are usually used as a hint for ranking. In real applications, however, preferences are often associated with specific contexts, and ranking accuracy is greatly affected by the interest degrees of pair-wise preferences.

2 Approximate Query 2.1 Definition of Approximate Query Definition 1 (approximate query). Consider an autonomous Web database R with categorical and numerical attributes A = {A1, A2, …, Am} and a query Q over R with a conjunctive selection condition of the form Q = ⋀i∈{1,…,k} (Ai θ ai), where k ≤ m and θ ∈ {>, <, =, ≥, ≤, ≠, between}. Note that if θ is the operator between and ai is an interval represented by [ai1, ai2], then Ai θ ai has the form Ai BETWEEN ai1 AND ai2. Each Ai in the query condition is an attribute from A, and ai is a value (or interval) in its domain. By relaxing Q, an approximate query Q′ is obtained, which finds all tuples of R whose similarity to Q is above a threshold Tsim ∈ (0, 1). Specifically, Q′(R) = {t | t ∈ R, Similarity(Q, t) > Tsim}.

2.2 Attribute Weight Assignment Importance of Specified Categorical Attributes For a point query Ai = v, we define IDFi(v) = log(n / Fi(v)), which represents the importance of the attribute value v in the database, where n is the number of tuples in the database and Fi(v) is the frequency of tuples with Ai = v. In this paper, the importance of a specified categorical attribute value is treated as the importance of its corresponding attribute.
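The categorical importance measure above can be sketched in a few lines of Python (function and variable names here are ours, not from the paper):

```python
import math

def categorical_idf(column, v):
    """IDF-style importance of a categorical value v: log(n / F(v)),
    where n is the number of tuples and F(v) is the frequency of v."""
    n = len(column)
    freq = sum(1 for x in column if x == v)
    if freq == 0:
        raise ValueError("value not present in the database")
    return math.log(n / freq)
```

As intended, rarer values (hence more selective conditions) receive higher importance: in a column with eight "sedan" and two "coupe" tuples, "coupe" scores log(10/2) while "sedan" scores only log(10/8).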

2.2 Attribute Weight Assignment Importance of Specified Numerical Attributes Let {v1, v2, …, vn} be the values of attribute A that occur in the database. For a specified attribute value v in the query, IDF(v) is defined as shown in Equation (1), where h is the bandwidth parameter: IDF(v) = log( n / Σi exp(−(v − vi)² / (2h²)) ). (1) A popular estimate for the bandwidth is h = 1.06 σ n^(−1/5), where σ is the standard deviation of {v1, v2, …, vn}. Intuitively, the denominator in Equation (1) represents the sum of contributions to v from every other point vi in the database. These contributions are modeled as (scaled) Gaussian distributions, so that the further v is from vi, the smaller the contribution from vi. The importance of a specified numerical attribute value is likewise treated as the importance of its corresponding attribute.
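A direct transcription of Equation (1) with the rule-of-thumb bandwidth (names are ours):

```python
import math
import statistics

def numerical_idf(values, v):
    """Kernel-density-based IDF for a numerical value v (Equation (1)).
    The denominator sums Gaussian contributions from every value in the
    column; h follows the h = 1.06 * sigma * n^(-1/5) rule of thumb."""
    n = len(values)
    sigma = statistics.pstdev(values)
    h = 1.06 * sigma * n ** (-1 / 5)
    density = sum(math.exp(-0.5 * ((v - vi) / h) ** 2) for vi in values)
    return math.log(n / density)
```

A value lying far from the bulk of the data collects almost no kernel contributions, so its density is small and its IDF (importance) is large, mirroring the categorical case.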

2.2 Attribute Weight Assignment Moreover, if the query condition is generalized as Ai IN Qi, where Qi is a set of values for a categorical attribute or a range [lb, ub] for a numerical attribute, we define the importance as the maximum of log(n / Fi(v)) over the distinct values v in Qi. The generalized importance measuring function is shown in Equation (2). (2) After normalization, the weight wi of an attribute Ai specified by the query can be calculated by wi = IDFi(ai) / Σj=1..k IDFj(aj), (3) in which k is the number of attributes specified by the query.
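The normalization step of Equation (3) is a one-liner; a small helper (ours, not the paper's) makes the weights sum to one across the k specified attributes:

```python
def attribute_weights(idfs):
    """Normalize per-attribute IDF importances into weights (Equation (3)):
    w_i = IDF_i / sum_j IDF_j over the attributes specified by the query."""
    total = sum(idfs.values())
    return {attr: idf / total for attr, idf in idfs.items()}
```

For example, IDF values {make: 2.0, price: 1.0, year: 1.0} yield weights {0.5, 0.25, 0.25}, so the rarest (most selective) condition dominates the similarity computation.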

2.3 Attribute Values Similarity Assessment Categorical Attribute Values The similarity between two AV-pairs can be measured as the similarity shown by their supertuples. The supertuples contain sets of keywords for each attribute in the relation, and the Jaccard coefficient is used to determine the similarity between two supertuples. Thus, in this paper the similarity between two categorical values is calculated as the sum of the set similarities over each attribute, (4) where C1 and C2 are supertuples with m attributes, Ai is the set corresponding to the i-th attribute, and J(·,·) is the Jaccard coefficient, computed as J(A, B) = |A ∩ B| / |A ∪ B|.
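Equation (4) combines per-attribute Jaccard coefficients; a minimal sketch (the supertuple-as-dict representation is our assumption):

```python
def jaccard(a, b):
    """Jaccard coefficient J(A, B) = |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def vsim(st1, st2):
    """Supertuple similarity (Equation (4)): sum the Jaccard coefficients
    of the keyword sets of each attribute shared by both supertuples."""
    return sum(jaccard(st1[attr], st2[attr]) for attr in st1 if attr in st2)
```

Here a supertuple is modeled as a dict mapping each attribute name to its keyword set; two supertuples overlapping on half of each attribute's keywords score 0.5 per attribute.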

2.3 Attribute Values Similarity Assessment Numerical Attribute Values Because of the continuity of numerical data, we propose an approach to estimating the similarity between two numerical values. Let {v1, v2, …, vn} be the values of numerical attribute A occurring in the database. The similarity NSim(v, q) between v and q is then defined by Equation (5), where h is the same bandwidth as in Equation (1). (5) For a numerical condition Ai = q of Q, let τi be a sub-threshold for Ai. According to Equation (5), we can then derive the relaxation range of the numerical attribute Ai as follows: (6)
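Equations (5) and (6) are not reproduced in this transcript; a sketch consistent with the Gaussian kernel of Equation (1) is to take NSim as a Gaussian of the distance and invert it to get the relaxed range. This is an assumption on our part, not the paper's exact formula:

```python
import math

def nsim(v, q, h):
    """Assumed Gaussian similarity between numerical values v and q,
    standing in for Equation (5); h is the kernel bandwidth."""
    return math.exp(-0.5 * ((v - q) / h) ** 2)

def relaxation_range(q, h, tau):
    """Invert NSim(v, q) > tau to obtain the relaxed query range for a
    numerical attribute (standing in for Equation (6)):
    |v - q| < h * sqrt(-2 * ln(tau))."""
    radius = h * math.sqrt(-2.0 * math.log(tau))
    return (q - radius, q + radius)
```

By construction, values at the boundary of the returned range have similarity exactly tau, and everything strictly inside exceeds the sub-threshold.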

2.4 Query Relaxation Algorithm For each condition Ci in the original query Q, by extracting the values of its corresponding attribute Ai having similarity above the sub-threshold τi and adding them into its query range, we get the relaxed condition C′i. By joining all the relaxed conditions, the approximate query Q′ is formed. Algorithm 1. The query rewriting algorithm Input: original query Q = {C1, …, Ck}, sub-thresholds {τ1, …, τk}. Output: an approximate query Q′ = {C′1, …, C′k}. 1. For i = 1, …, k 2. C′i ← Ci 3. If Ai is a categorical attribute 4. For each v ∈ Dom(Ai) 5. If VSim(ai, v) > τi 6. Add v into the query range of C′i 7. End If 8. End For 9. End If 10. If Ai is a numerical attribute 11. Replace the query range of C′i with [q − h, q + h] 12. End If 13. Q′ = Q′ ∧ C′i 14. End For 15. Return Q′
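Algorithm 1 can be sketched directly in Python. The data representation (query as a dict of attribute conditions, pluggable vsim function, per-attribute bandwidths) is our assumption for illustration:

```python
def rewrite_query(query, domains, vsim, sub_thresholds, bandwidths):
    """Sketch of Algorithm 1 (query rewriting).
    query maps attribute -> ("cat", value) or ("num", value);
    domains maps each categorical attribute to its active domain;
    vsim(a, v) returns the similarity of two categorical values;
    sub_thresholds maps attribute -> tau_i; bandwidths maps attribute -> h.
    Returns the relaxed range of each condition of Q'."""
    relaxed = {}
    for attr, (kind, value) in query.items():
        if kind == "cat":
            rng = {value}  # C'_i starts as the original condition
            for v in domains[attr]:
                if vsim(value, v) > sub_thresholds[attr]:
                    rng.add(v)  # admit similar values into the range
            relaxed[attr] = rng
        else:
            # numerical: widen the point condition q to [q - h, q + h]
            h = bandwidths[attr]
            relaxed[attr] = (value - h, value + h)
    return relaxed
```

Conjoining the relaxed ranges (an IN list per categorical attribute, a BETWEEN range per numerical attribute) yields the approximate query Q′ to send to the Web database.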

3 Approximate Query Results Ranking 3.1 Definition of Contextual Preferences Consider a database relation r with n tuples r = {t1, …, tn} over schema R(A1, …, Ak). Let Dom(Ai) be the active domain of attribute Ai. Definition (contextual preferences). Contextual preferences are of the form {Ai = ai1 ≻ Ai = ai2, d | X}, where X is ⋀j∈L (Aj θ aj), with ai1, ai2 ∈ Dom(Ai), L ⊆ {1, …, k}, θ ∈ {>, <, =, ≥, ≤, ≠, between, in}, and aj ∈ Dom(Aj); d is the interest degree of the preference (i.e., compared to ai2, the interest degree of ai1 is d, where 0.5 ≤ d ≤ 1, and it can be learned from past data or from user feedback). The left-hand side of a preference specifies the choice and the interest degree, while the right-hand side is the context.

To collect contextual preferences with interest degrees, we apply association-rule mining to the workload of CarDB. We say that {A1 = a ≻ A1 = b, d | X} if conf(X → a) > conf(X → b), where conf(X → a) is the confidence of the association rule X → a in the workload. The interest degree d is then obtained from these two confidences.
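The transcript omits the formula for d. A natural choice that keeps d in [0.5, 1] whenever conf(X → a) ≥ conf(X → b) is to normalize the two confidences; we present it here explicitly as an assumption, not the paper's stated formula:

```python
def confidence(workload, context, choice):
    """conf(X -> a): fraction of workload queries matching the context
    predicate that also satisfy the choice predicate."""
    matching = [q for q in workload if context(q)]
    if not matching:
        return 0.0
    return sum(1 for q in matching if choice(q)) / len(matching)

def interest_degree(conf_a, conf_b):
    """Assumed interest degree: d = conf_a / (conf_a + conf_b). With
    conf_a >= conf_b this lies in [0.5, 1], matching Definition's range."""
    return conf_a / (conf_a + conf_b)
```

For instance, if queries with context "make = bmw" mention "body = coupe" with confidence 0.6 and "body = sedan" with confidence 0.2, the assumed degree is 0.6 / 0.8 = 0.75.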

3.2 Contextual Preferences Processing For any single preference p and any pair of tuples (ti, tj), p either prefers ti to tj (denoted ti ≻p tj), or prefers tj to ti (denoted tj ≻p ti), or is inapplicable to ti and tj (denoted ti ~p tj). Thus, every preference p defines an interest degree of preference (dpref) over any pair of tuples t, t′, evaluated as follows:

3.3 Problem of Query Results Ranking Problem 1 (approximate query results ranking problem). Assume a set of preferences P and an approximate query Q′. The approximate query results ranking problem asks for an order of the tuples of Q′(R) that agrees as much as possible with the input preferences. Additionally, the degree of agreement with a class of preferences is weighted by the similarity between the contexts of those preferences and the approximate query Q′.

The similarity between a context X and the approximate query Q′ is computed over their vector representations, where the vector representation of an approximate query Q′ is a vector VQ′ of size N. The i-th element of the vector corresponds to the attribute-value pair D[i]; if D[i] satisfies one of the conditions of Q′, then VQ′[i] is computed as follows:
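The exact similarity formula is an image in the original slides and does not survive in this transcript; a standard choice for comparing two such vectors, which we assume here for illustration, is cosine similarity:

```python
import math

def cosine(u, v):
    """Assumed vector similarity between a context X and the approximate
    query Q' (the slide's exact formula is not reproduced here)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Identical vectors score 1, and a context sharing no attribute-value pairs with Q′ scores 0, which is what the weighting of preference classes in Problem 1 requires.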

3.4 Approach of Query Results Ranking Proc 1 (creating orders): For each preference class PXi, create an order τi of the tuples in R such that as many of the preferences in PXi as possible are satisfied. The output of this procedure is a set of m ⟨context, order⟩ pairs of the form ⟨Xi, τi⟩, where Xi is the context for the preferences in PXi and τi is the order of the tuples in R that (approximately) satisfies them. The score of tuple t in the order τi that corresponds to Xi is derived from τi(t), the position of t in the order.

Proc 2 (clustering orders): In order to reduce the number of ⟨context, order⟩ pairs, we find l representative orders for the m initial pairs ⟨Xi, τi⟩, where l < m. These orders partition the space of the m initial ⟨context, order⟩ pairs into l groups. Each group is characterized by a representative order and a disjunction of contexts, such that the representative order stands in for each initial order in the group. The score of tuple t in a representative order is again derived from the position of t in that order.

Proc 3 (ranking the top-k answers): For an approximate query Q′ over relation R, using the output of Proc 2, compute the set Qk(R) ⊆ Q′(R) ⊆ R with |Qk(R)| = k, such that for every t ∈ Qk(R) and t′ ∈ R − Qk(R) it holds that score(t, Q′) > score(t′, Q′).

4 Algorithmic Solutions 4.1 Creating Orders Input: r = {t1, …, tn}, a set of preferences from a class PX. Output: a pair ⟨X, τ⟩, where τ is an order of the tuples in r such that as many of the preferences as possible are satisfied. 1. S = {t1, …, tn}, rank = 0 2. For all i ∈ {1, …, n} Do 3. compute the initial potential p(ti) 4. End For 5. While S ≠ ∅ Do 6. rank = rank + 1, pick the tuple tv in S with maximum p(tv) 7. τ(tv) = rank, S = S − {tv} 8. For all t ∈ S Do 9. p(t) = p(t) − wX(t ≻ tv) 10. End For 11. End While
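The greedy loop above can be sketched as follows. The prefers(t, u) callback stands in for the weighted preference degree wX(t ≻ u); the representation is our assumption:

```python
def create_order(tuples, prefers):
    """Greedy sketch of the order-creation algorithm (Section 4.1).
    prefers(t, u) returns the weighted degree to which t is preferred
    over u (0 if the preference class is inapplicable to the pair).
    Each tuple starts with potential p(t) = sum over u of prefers(t, u);
    the highest-potential tuple is ranked next, and the remaining
    potentials are reduced by the wins scored against it."""
    remaining = list(tuples)
    potential = {t: sum(prefers(t, u) for u in remaining if u != t)
                 for t in remaining}
    order = []
    while remaining:
        best = max(remaining, key=lambda t: potential[t])
        order.append(best)
        remaining.remove(best)
        for t in remaining:
            potential[t] -= prefers(t, best)
    return order
```

With transitive preferences a ≻ b ≻ c, the returned order is [a, b, c] regardless of the input arrangement, which is the agreement-maximizing order for that class.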

4.2 Finding Representative Orders Input: Tm = {τ1, …, τm}, U = {⟨τi, τj⟩ | τi, τj ∈ Tm}, l. Output: a set of l representative orders Tl. 1. Let B = {} be a buffer that can hold m candidates 2. While Tm ≠ ∅ and l > 0 Do 3. B ← ∅ 4. For each τi ∈ Tm Do 5. Pick the candidate si with minimum cost rsi from Ui = {⟨τi, τj⟩ | τj ∈ Tm, group size in [2, |Tm| − l + 1]} 6. B ← B + {si} 7. End For 8. Pick the candidate s with minimum rs from B 9. Tm ← Tm − {orders covered by s}, Tl ← Tl + {s}, l ← l − 1 10. End While 11. Return Tl

4.3 Ranking Top-k Answers Input: Tl = {τ′1, …, τ′l}, contexts {X1, …, Xm}, an approximate query Q′. Output: top-k answer tuples. 1. Let B = {} be a buffer that can hold k tuples ordered by score 2. Let L be an array of size l storing the last score seen from each order 3. Repeat 4. For all i ∈ {1, …, l} Do 5. Retrieve the next tuple t from τ′i 6. Compute the score of t in τ′i 7. Update L with the score of t in τ′i 8. If t ∈ Q′(R) 9. Get the score of t from the other orders {τ′j ∈ Tl | j ≠ i} via random access 10. score(t, Q′) ← sum of all the retrieved scores 11. Insert ⟨t, score(t, Q′)⟩ at the correct position in B 12. End If 13. End For 14. Until the stopping condition holds 15. Return B

Conclusions and Outlook We proposed an approximate query method for finding relevant answer items; gave a formal definition of contextual preferences with interest degrees; and proposed an approximate query results ranking approach based on contextual preferences. As further research, it would be interesting to study how to minimize the updating cost when the database and the preferences change.