Answering Approximate Queries over Autonomous Web Databases
Xiangfu Meng, Z. M. Ma, and Li Yan
College of Information Science and Engineering, Northeastern University, Shenyang, 110004, China

Organization
1. Motivation
2. Approximate Query
3. Approximate Query Results Ranking
4. Algorithmic Solutions
5. Conclusion and Further Research

1 Motivation
Empty answers: When a query over a Web database is too selective, the answer may be empty or contain too few tuples. In that case it is desirable to relax the original query so that more relevant items, which closely meet the user's needs and preferences, can be presented.
Approximate query results ranking: User preferences are commonly used as a hint for ranking. In real applications, however, preferences are often tied to specific contexts, and ranking accuracy is strongly affected by the interest degrees of pairwise preferences.

2 Approximate Query
2.1 Definition of Approximate Query
Definition 1 (approximate query). Consider an autonomous Web database $R$ with categorical and numerical attributes $A = \{A_1, A_2, \dots, A_m\}$ and a query $Q$ over $R$ with a conjunctive selection condition of the form $Q = \bigwedge_{i \in \{1,\dots,k\}} (A_i \,\theta\, a_i)$, where $k \le m$ and $\theta \in \{>, <, =, \ne, \ge, \le, \text{between}\}$. Note that if $\theta$ is the operator between and $a_i$ is an interval $[a_{i1}, a_{i2}]$, then $A_i \,\theta\, a_i$ has the form $A_i$ between $a_{i1}$ AND $a_{i2}$. Each $A_i$ in the query condition is an attribute from $A$, and $a_i$ is a value (or interval) in its domain. Relaxing $Q$ yields an approximate query $\tilde{Q}$, which finds all tuples of $R$ whose similarity to $Q$ is above a threshold $T_{sim} \in (0, 1)$. Specifically, $\tilde{Q}(R) = \{t \mid t \in R,\ Similarity(Q, t) > T_{sim}\}$.

2.2 Attribute Weight Assignment
Importance of Specified Categorical Attributes
For a point query $A_i = v$, we define $IDF_i(v)$ as $\log(n / F_i(v))$, which represents the importance of attribute value $v$ in the database, where $n$ is the number of tuples in the database and $F_i(v)$ is the number of tuples with $A_i = v$. In this paper, the importance of a specified categorical attribute value is treated as the importance of its corresponding attribute.
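The definition above can be illustrated with a minimal sketch. The column contents and the function name `categorical_idf` are hypothetical, and the natural logarithm is assumed because the slides do not state the base.

```python
import math
from collections import Counter


def categorical_idf(column, value):
    """Return log(n / F(value)) for one categorical column.

    n is the number of tuples and F(value) is how many of them carry the value.
    The natural logarithm is an assumption; the slides only write "log".
    """
    n = len(column)
    freq = Counter(column)[value]
    if freq == 0:
        raise ValueError(f"{value!r} does not occur in the column")
    return math.log(n / freq)


# Hypothetical Make column of a small car relation.
makes = ["Ford", "Ford", "Toyota", "Honda", "Ford", "Toyota", "Honda", "BMW"]
idf_ford = categorical_idf(makes, "Ford")  # frequent value, lower importance
idf_bmw = categorical_idf(makes, "BMW")    # rare value, higher importance
```

A rare value such as BMW receives a larger score than a frequent one such as Ford, which is the intended effect of the IDF-style measure.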

2.2 Attribute Weight Assignment
Importance of Specified Numerical Attributes
Let $\{v_1, v_2, \dots, v_n\}$ be the values of attribute $A$ that occur in the database. For a specified attribute value $v$ in the query, $IDF(v)$ is defined as shown in Equation (1), where $h$ is the bandwidth parameter. A popular estimate for the bandwidth is $h = 1.06\,\sigma\,n^{-1/5}$, where $\sigma$ is the standard deviation of $\{v_1, v_2, \dots, v_n\}$. Intuitively, the denominator in Equation (1) represents the sum of contributions to $v$ from every other point $v_i$ in the database. These contributions are modeled as (scaled) Gaussian distributions, so that the further $v$ is from $v_i$, the smaller the contribution from $v_i$. The importance of a specified numerical attribute value is also treated as the importance of its corresponding attribute.
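Equation (1) itself is not reproduced in this transcript. The sketch below assumes the form that the description suggests, namely log(n / sum of Gaussian contributions), with the rule-of-thumb bandwidth. The exact kernel scaling, the use of the population standard deviation, and all names are assumptions.

```python
import math
import statistics


def bandwidth(values):
    """Rule-of-thumb bandwidth h = 1.06 * sigma * n^(-1/5)."""
    n = len(values)
    sigma = statistics.pstdev(values)  # population std dev (an assumption)
    return 1.06 * sigma * n ** (-1 / 5)


def numerical_idf(values, v, h=None):
    """Assumed form: log(n / sum_i exp(-0.5 * ((v_i - v) / h)^2))."""
    if h is None:
        h = bandwidth(values)
    if h <= 0:
        raise ValueError("bandwidth must be positive")
    # Each database value contributes a Gaussian bump centred on itself.
    density = sum(math.exp(-0.5 * ((vi - v) / h) ** 2) for vi in values)
    return math.log(len(values) / density)


# Hypothetical Price column: six similar prices and one outlier.
prices = [9000, 9500, 10000, 10200, 10500, 11000, 30000]
idf_common = numerical_idf(prices, 10000)  # many close neighbours
idf_rare = numerical_idf(prices, 30000)    # isolated value
```

Under these assumptions a value that lies in a dense region gets a low importance, while an isolated value gets a high one, mirroring the categorical case.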

2.2 Attribute Weight Assignment
Moreover, if the query condition is generalized to $A_i$ IN $Q_i$, where $Q_i$ is a set of values for a categorical attribute or a range $[lb, ub]$ for a numeric attribute, we take the maximum of $\log(n / F_i(v))$ over the distinct values $v$ in $Q_i$. The generalized importance measuring function is shown in Equation (2). After normalization, the weight $w_i$ of attribute $A_i$ specified by the query is calculated by Equation (3), in which $k$ is the number of attributes specified by the query.
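A small sketch of the generalized importance and the weight normalization follows. Equations (2) and (3) are not reproduced in this transcript, so dividing each importance by the sum over the k specified attributes is an assumption, as are the data and function names.

```python
import math
from collections import Counter


def generalized_idf(column, query_values):
    """Maximum of log(n / F(v)) over the query values that occur in the column."""
    n = len(column)
    freq = Counter(column)
    scores = [math.log(n / freq[v]) for v in query_values if freq[v] > 0]
    return max(scores) if scores else 0.0


def attribute_weights(importances):
    """Assumed normalization: importance divided by the sum over all k attributes."""
    total = sum(importances.values())
    if total == 0:
        # Degenerate case: fall back to uniform weights.
        return {name: 1 / len(importances) for name in importances}
    return {name: imp / total for name, imp in importances.items()}


# Hypothetical columns and a query "Make IN {Ford, BMW} AND Color IN {Red}".
makes = ["Ford", "Ford", "Toyota", "Honda", "Ford", "Toyota", "Honda", "BMW"]
colors = ["Red", "Blue", "Red", "Red", "Black", "Red", "Blue", "Red"]
importances = {
    "Make": generalized_idf(makes, {"Ford", "BMW"}),
    "Color": generalized_idf(colors, {"Red"}),
}
weights = attribute_weights(importances)
```

With this data the Make condition, which mentions a rare value, receives the larger weight.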

2.3 Attribute Values Similarity Assessment
Categorical Attribute Values
The similarity between two AV-pairs can be measured as the similarity shown by their supertuples. A supertuple contains a set of keywords for each attribute in the relation, and the Jaccard coefficient is used to determine the similarity between two supertuples. Thus, in this paper the similarity coefficient between two categorical values is calculated as a sum of the set similarities on each attribute, as given in Equation (4), where $C_1$, $C_2$ are supertuples with $m$ attributes, $A_i$ is the set corresponding to the $i$-th attribute, and $J(\cdot,\cdot)$ is the Jaccard coefficient, computed as $J(A, B) = |A \cap B| / |A \cup B|$.
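The supertuple comparison can be sketched as below. Equation (4) is not reproduced in this transcript, so the unweighted sum over attributes is an assumption (the original may weight or normalize the terms), and the supertuples are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient |A intersect B| / |A union B| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def vsim(supertuple_1, supertuple_2):
    """Unweighted sum of per-attribute Jaccard coefficients (assumed form)."""
    shared = supertuple_1.keys() & supertuple_2.keys()
    return sum(jaccard(supertuple_1[a], supertuple_2[a]) for a in shared)


# Hypothetical supertuples for Make=Ford and Make=Toyota: for each other
# attribute, the set of values that co-occur with that make.
ford = {
    "Model": {"Focus", "Fiesta"},
    "Color": {"Red", "Blue", "Black"},
    "Body": {"Sedan", "Hatchback"},
}
toyota = {
    "Model": {"Camry", "Corolla"},
    "Color": {"Red", "Blue", "White"},
    "Body": {"Sedan"},
}
similarity = vsim(ford, toyota)
```

Note that an unweighted sum ranges from 0 to the number of attributes, so comparing it against a threshold in (0, 1) would need a normalization step, for instance dividing by the number of attributes.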

2.3 Attribute Values Similarity Assessment
Numerical Attribute Values
Because numerical data are continuous, we propose an approach to estimate the similarity coefficient between two numerical values. Let $\{v_1, v_2, \dots, v_n\}$ be the values of numerical attribute $A$ occurring in the database. The similarity coefficient $NSim(v, q)$ between $v$ and $q$ is defined by Equation (5), where $h$ is the same bandwidth as in Equation (1). For a numerical condition $A_i = q$ of $Q$, let $\delta_i$ be a sub-threshold for $A_i$. According to Equation (5), we then get the relaxation range of numerical attribute $A_i$ given in Equation (6).
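Equations (5) and (6) are not reproduced in this transcript. Purely as an illustration, the sketch below assumes a Gaussian-kernel similarity exp(-(v - q)^2 / (2 h^2)) that reuses the bandwidth h, and derives the range of values whose similarity reaches the sub-threshold (called `threshold` here). Both the kernel form and the function names are assumptions, not the authors' stated formulas.

```python
import math


def nsim(v, q, h):
    """Assumed Gaussian-kernel similarity between numeric values v and q."""
    return math.exp(-0.5 * ((v - q) / h) ** 2)


def relaxation_range(q, h, threshold):
    """Interval of values whose assumed nsim to q is at least the threshold.

    Solving exp(-0.5 * ((v - q) / h)^2) >= threshold for v gives
    |v - q| <= h * sqrt(2 * ln(1 / threshold)).
    """
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie in (0, 1)")
    half_width = h * math.sqrt(2 * math.log(1 / threshold))
    return q - half_width, q + half_width


# Hypothetical condition Price = 10000 with bandwidth 2000 and sub-threshold 0.8.
low, high = relaxation_range(10000, 2000, 0.8)
```

Under this assumption a lower sub-threshold widens the range, so the relaxed query admits more tuples.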

2.4 Query Relaxation Algorithm
For each condition $C_i$ in the original query $Q$, we extract the values of its corresponding attribute $A_i$ whose similarity is above the sub-threshold $\delta_i$ and add them to its query range, which gives the relaxed condition $\tilde{C}_i$. Joining all the relaxed conditions forms the approximate query $\tilde{Q}$.

Algorithm 1 The query rewriting algorithm
Input: Original query $Q = \{C_1, \dots, C_k\}$, sub-thresholds $\{\delta_1, \dots, \delta_k\}$.
Output: An approximate query $\tilde{Q} = \{\tilde{C}_1, \dots, \tilde{C}_k\}$.
For $i = 1, \dots, k$
    $\tilde{C}_i \leftarrow C_i$
    If $A_i$ is a categorical attribute
        $\forall v \in Dom(A_i)$
            If $VSim(a_i, v) > \delta_i$
                Add $v$ into the query range of $\tilde{C}_i$
            End If
    End If
    If $A_i$ is a numerical attribute
        Replace query range of $\tilde{C}_i$ with $[q - h, q + h]$
    End If
    $\tilde{Q} = \tilde{Q} \cup \tilde{C}_i$
End For
Return $\tilde{Q}$
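The rewriting loop can be sketched in Python as follows. The data layout (dictionaries keyed by attribute name), the similarity callback, and the toy similarity table are all hypothetical; the numerical branch uses the range [q - h, q + h] exactly as it appears in the transcript.

```python
def relax_query(conditions, domains, vsim, thresholds, bandwidths):
    """Relax each condition of a conjunctive point query.

    conditions: {attribute: queried value}
    domains:    {categorical attribute: iterable of domain values}
    vsim:       callable (attribute, a, b) -> similarity of two values
    thresholds: {categorical attribute: sub-threshold}
    bandwidths: {numerical attribute: half-width h of the relaxed range}
    Returns {attribute: set of accepted values, or (low, high) range}.
    """
    relaxed = {}
    for attr, value in conditions.items():
        if attr in bandwidths:
            # Numerical attribute: replace the point by a range around it.
            h = bandwidths[attr]
            relaxed[attr] = (value - h, value + h)
        else:
            # Categorical attribute: add every sufficiently similar value.
            accepted = {value}
            for candidate in domains[attr]:
                if vsim(attr, value, candidate) > thresholds[attr]:
                    accepted.add(candidate)
            relaxed[attr] = accepted
    return relaxed


# Hypothetical pairwise similarities between makes.
SIM = {("Ford", "Toyota"): 0.6, ("Ford", "BMW"): 0.2}


def toy_vsim(attr, a, b):
    if a == b:
        return 1.0
    return SIM.get((a, b), SIM.get((b, a), 0.0))


relaxed = relax_query(
    conditions={"Make": "Ford", "Price": 10000},
    domains={"Make": ["Ford", "Toyota", "BMW", "Honda"]},
    vsim=toy_vsim,
    thresholds={"Make": 0.5},
    bandwidths={"Price": 1500},
)
```

Here the Make condition grows to include Toyota, and the Price point condition becomes the range from 8500 to 11500.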

3 Approximate Query Results Ranking
3.1 Definition of Contextual Preferences
Consider a database relation $r$ with $n$ tuples $r = \{t_1, \dots, t_n\}$ and schema $R(A_1, \dots, A_k)$. Let $Dom(A_i)$ be the active domain of attribute $A_i$.
Definition (contextual preferences). Contextual preferences are of the form $\{A_i = a_{i1} \succ A_i = a_{i2},\ d \mid X\}$, where $X$ is $\bigwedge_{j \in l} (A_j \,\theta\, a_j)$, with $a_{i1}, a_{i2} \in Dom(A_i)$, $l \subseteq \{1, \dots, k\}$, $\theta \in \{>, <, =, \ne, \ge, \le, \text{between}, \text{in}\}$ and $a_j \in Dom(A_j)$. $d$ is the interest degree of the preference (i.e., compared to $a_{i2}$, the interest degree of $a_{i1}$ is $d$, where $0.5 \le d \le 1$; it can be learned from past data or from user feedback). The left-hand side of a preference specifies the choice and the interest degree, while the right-hand side is the context.

To collect contextual preferences with interest degrees, we ran association-rule mining on the workload of CarDB. We say that $\{A_1 = a \succ A_1 = b,\ d \mid X\}$ holds if $conf(X \rightarrow a) > conf(X \rightarrow b)$, where $conf(X \rightarrow a)$ is the confidence of the association rule $X \rightarrow a$ in the workload, i.e., [equation not preserved in this transcript]. The interest degree $d$ can then be obtained by: [equation not preserved in this transcript].
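As an illustration of mining such preferences, the sketch below computes rule confidence in the standard way, support(X and a) / support(X), over a hypothetical workload of past queries. The formula for the interest degree is not preserved in this transcript; the normalization conf_a / (conf_a + conf_b) is only an assumption chosen because it yields values between 0.5 and 1 when conf_a is at least conf_b.

```python
def confidence(workload, context, attr, value):
    """conf(X -> attr=value) = support(X and attr=value) / support(X)."""
    matching = [
        q for q in workload
        if all(q.get(a) == v for a, v in context.items())
    ]
    if not matching:
        return 0.0
    hits = sum(1 for q in matching if q.get(attr) == value)
    return hits / len(matching)


def interest_degree(conf_a, conf_b):
    """Assumed normalization; lies in [0.5, 1] whenever conf_a >= conf_b."""
    total = conf_a + conf_b
    if total == 0:
        raise ValueError("neither rule occurs in the workload")
    return conf_a / total


# Hypothetical workload: each entry is the condition set of one past query.
workload = [
    {"Body": "SUV", "Make": "Toyota"},
    {"Body": "SUV", "Make": "Toyota"},
    {"Body": "SUV", "Make": "Toyota"},
    {"Body": "SUV", "Make": "Ford"},
    {"Body": "Sedan", "Make": "Ford"},
]
context = {"Body": "SUV"}
conf_toyota = confidence(workload, context, "Make", "Toyota")
conf_ford = confidence(workload, context, "Make", "Ford")
d = interest_degree(conf_toyota, conf_ford)
```

In this toy workload Toyota is preferred to Ford in the context Body = SUV, with an assumed interest degree of 0.75.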

3.2 Contextual Preferences Processing
For any single preference $p$ and any pair of tuples $(t_i, t_j)$, $p$ either prefers $t_i$ to $t_j$ (denoted by $t_i \succ_p t_j$), or $t_j$ to $t_i$ (denoted by $t_j \succ_p t_i$), or it is inapplicable with respect to $t_i$ and $t_j$ (denoted by $t_i \sim_p t_j$). Thus, every preference $p$ defines an interest degree of preference ($d_{pref}$) over any pair of tuples $t$, $t'$ that evaluates as follows: [equation not preserved in this transcript].

3.3 Problem of Query Results Ranking
Problem 1 (approximate query results ranking problem). Assume a set of preferences $P$ and an approximate query $\tilde{Q}$. The approximate query results ranking problem asks for an order $\tau$ of $\tilde{Q}(R)$ that satisfies an optimality condition [equation not preserved in this transcript]. The objective is to find the order over the set of tuples in $\tilde{Q}(R)$ that agrees as much as possible with the input preferences. Additionally, the degree of agreement with a class of preferences is weighted by the similarity between the contexts of those preferences and the approximate query $\tilde{Q}$.

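The similarity between a context $X$ and an approximate query $\tilde{Q}$ is computed from their vector representations as follows: [equation not preserved in this transcript], where the vector representation of an approximate query $\tilde{Q}$ is a vector $V_{\tilde{Q}}$ of size $N$. The $i$-th element of the vector corresponds to the pair $D[i]$. If $D[i]$ satisfies one of the conditions of $\tilde{Q}$, then $V_{\tilde{Q}}[i]$ is computed as follows: [equation not preserved in this transcript].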

3.4 Approach of Query Results Ranking
Proc 1 (creating orders): For each preference class $P_{X_i}$, create an order $\tau_i$ of the tuples in $R$ such that [equation not preserved in this transcript]. The output of this procedure is a set of $m$ (context, order) pairs of the form $\langle X_i, \tau_i \rangle$, where $X_i$ is the context for the preferences in $P_{X_i}$ and $\tau_i$ is the order of the tuples in $R$ that (approximately) satisfies that equation. The score of tuple $t$ in $\tau_i$ that corresponds to $X_i$ is: [equation not preserved in this transcript], where $\tau_i(t)$ represents the position of $t$ in order $\tau_i$.

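Proc 2 (clustering orders): In order to reduce the number of (context, order) pairs, we need to find $l$ representative orders $\bar{\tau}_1, \dots, \bar{\tau}_l$ for the $m$ initial pairs $\langle X_i, \tau_i \rangle$, where $l < m$. These orders partition the space of the $m$ initial (context, order) pairs into $l$ groups. Each group $i$ is characterized by an order $\bar{\tau}_i$ and a disjunction of contexts, such that for each context in the disjunction, $\bar{\tau}_i$ is a representative order for the corresponding initial order. The score of tuple $t$ in $\bar{\tau}_i$ is given by: [equation not preserved in this transcript], where $\bar{\tau}_i(t)$ represents the position of $t$ in order $\bar{\tau}_i$.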

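Proc 3 (ranking the top-k answers): For an approximate query $\tilde{Q}$ over relation $R$, using the output of Proc 2, compute the set $\tilde{Q}_k(R) \subseteq \tilde{Q}(R) \subseteq R$ with $|\tilde{Q}_k(R)| = k$, such that $\forall t \in \tilde{Q}_k(R)$ and $t' \in \{R - \tilde{Q}_k(R)\}$ it holds that $score(t, \tilde{Q}) > score(t', \tilde{Q})$, with $score(t, \tilde{Q}) =$ [equation not preserved in this transcript].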

4 Algorithmic Solutions
4.1 Creating Orders
Input: $r = \{t_1, \dots, t_n\}$, a set of preferences $P_X$ from a class.
Output: A pair $\langle X, \tau \rangle$, where $\tau$ is an order of the tuples in $r$ such that as many of the preferences in $P_X$ as possible are satisfied.
$S = \{t_1, \dots, t_n\}$, rank = 0
For all $i \in \{1, \dots, n\}$ Do
    [initialization of $p(t_i)$ not preserved in this transcript]
End For
While $S \ne \emptyset$ Do
    rank += 1, [selection of $t_v$ not preserved in this transcript]
    $\tau(t_v)$ = rank, $S = S - \{t_v\}$
    For all $t \in S$ Do
        $p(t) = p(t) - w_X(t \succ t_v)$
    End For
End While
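Two steps of the listing above did not survive in the transcript, so the following is only a sketch in the spirit of the classic greedy ordering heuristic, not the authors' exact procedure. It assumes that the potential of a tuple is the total weight of preferences in its favour minus the total weight against it, and that the tuple with the highest potential is ranked next. The `weight` callback and the sample weights are hypothetical.

```python
def greedy_order(tuples, weight):
    """Greedily rank tuples so that heavily preferred tuples come first.

    weight(a, b) is the accumulated interest degree of the preferences that
    favour a over b (0 if none apply). Tuples must be hashable and distinct.
    Returns {tuple: position}, with position 1 being the best.
    """
    remaining = list(tuples)
    # Assumed potential: outgoing preference weight minus incoming weight.
    potential = {
        t: sum(weight(t, u) - weight(u, t) for u in remaining if u != t)
        for t in remaining
    }
    position = {}
    rank = 0
    while remaining:
        rank += 1
        best = max(remaining, key=potential.__getitem__)
        position[best] = rank
        remaining.remove(best)
        # Remove the ranked tuple's contribution from the other potentials.
        for t in remaining:
            potential[t] += weight(best, t) - weight(t, best)
    return position


# Hypothetical pairwise weights: a is preferred to b and c, b to c.
W = {("a", "b"): 0.8, ("b", "c"): 0.7, ("a", "c"): 0.9}
order = greedy_order(["a", "b", "c"], lambda x, y: W.get((x, y), 0.0))
```

On this consistent set of preferences the greedy pass recovers the order a, b, c. With conflicting preferences it returns an approximation, which matches the "as many as possible" wording of the output specification.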

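4.2 Finding Representative Orders
Input: $T_m = \{\tau_1, \dots, \tau_m\}$, a set $U$ of candidate entries built from subsets of $T_m$ [definition not fully preserved in this transcript], $l$
Output: A set of $l$ representative orders $T_l$
Let B = {} be a buffer that can hold $m$ entries
While $T_m \ne \emptyset$ and $l > 0$ Do
    $B \leftarrow \emptyset$
    For each $\tau_i \in T_m$ Do
        Pick $s_i$ with minimum $rs_i$ from $U_i$, the candidates over subsets of $T_m$ whose size lies in $[2, |T_m| - l + 1]$
        $B \leftarrow B + \{s_i\}$
    End For
    Pick $s$ with minimum $rs$ from B
    Remove the orders covered by $s$ from $T_m$, $T_l \leftarrow T_l + s$, $l \leftarrow l - 1$
End While
Return $T_l$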

4.3 Ranking top-k Answers
Input: $T_l = \{\bar{\tau}_1, \dots, \bar{\tau}_l\}$, $\{X_1, \dots, X_m\}$, an approximate query $\tilde{Q}$.
Output: top-k answer tuples.
B = {} is a buffer that can hold $k$ tuples ordered by score
L is an array of size $l$ storing the last score from each order
Repeat
    For all $i \in \{1, \dots, l\}$ Do
        Retrieve next tuple $t$ from $\bar{\tau}_i$
        Compute the score of $t$ in $\bar{\tau}_i$
        Update L with the score of $t$ in $\bar{\tau}_i$
        If $t \in \tilde{Q}(R)$
            Get the score of $t$ from the other orders $\{\bar{\tau}_j \mid \bar{\tau}_j \in T_l \text{ and } j \ne i\}$ via random access
            $score(t, \tilde{Q}) \leftarrow$ sum of all the retrieved scores
            Insert $\langle t, score(t, \tilde{Q}) \rangle$ in the correct position in B
        End If
    End For
Until [stopping condition not preserved in this transcript]
Return B
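The listing resembles a threshold-style top-k scan over several sorted lists. The sketch below is written under that assumption: the aggregated score is taken to be a weighted sum of per-order scores (the weights standing in for context-to-query similarity), scores are assumed non-negative, and the scan stops once the k-th best aggregated score is at least the weighted sum of the last scores seen. The scoring function and the stopping rule are not preserved in the transcript, so both are assumptions, as are all names and data.

```python
import heapq


def top_k(orders, weights, answers, k):
    """Threshold-style top-k over several score-sorted orders.

    orders:  list of lists [(tuple_id, score), ...], each sorted by
             descending non-negative score
    weights: one non-negative weight per order
    answers: set of tuple ids returned by the approximate query
    Returns up to k (tuple_id, aggregated score) pairs, best first.
    """
    lookup = [dict(order) for order in orders]  # random access per order
    last = [0.0] * len(orders)                  # last score seen per order
    best = {}                                   # tuple_id -> aggregated score
    longest = max((len(order) for order in orders), default=0)
    for depth in range(longest):
        for i, order in enumerate(orders):
            if depth >= len(order):
                last[i] = 0.0  # exhausted order contributes nothing further
                continue
            t, s = order[depth]
            last[i] = s
            if t in answers and t not in best:
                best[t] = sum(
                    w * table.get(t, 0.0) for w, table in zip(weights, lookup)
                )
        # No unseen tuple can score above this bound.
        threshold = sum(w * s for w, s in zip(weights, last))
        top = heapq.nlargest(k, best.items(), key=lambda item: item[1])
        if len(top) == k and top[-1][1] >= threshold:
            break
    return heapq.nlargest(k, best.items(), key=lambda item: item[1])


# Two hypothetical representative orders with position-based scores.
orders = [
    [("t1", 3), ("t2", 2), ("t3", 1)],
    [("t2", 3), ("t3", 2), ("t1", 1)],
]
result = top_k(orders, weights=[0.7, 0.3], answers={"t1", "t2", "t3"}, k=2)
top_ids = [t for t, _ in result]
```

In this example the scan stops after reading two entries from each order, because the two best aggregated scores already exceed the bound on any unseen tuple.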

Conclusions and Outlook
Proposed an approximate query method for finding relevant answer items.
Gave a formal definition of contextual preferences with interest degrees.
Proposed an approximate query results ranking approach based on contextual preferences.
As future work, it would be interesting to study how to minimize the update cost when the database and the preferences change.
