Query Specific Ranking CSE 6392 02/27/2006 Database Exploration
Content
- Comparison of the FA and TA algorithms
- Representing the ranking problem as a geometric problem
- Query Specific Ranking
Comparison between FA and TA algorithm
- TA is faster than FA, in the sense that it never needs to read deeper into the sorted lists than FA does.
- TA stops as soon as the score of the hypothetical tuple is less than the score of the k-th tuple in the top-k buffer.
- TA is a bounded-buffer algorithm: it maintains only a top-k buffer.
- FA maintains, for each sorted list, a set of candidates holding every tuple read so far, and keeps reading until ‘k’ objects are common to all of these sets.
Comparison between FA and TA
- TA is eager: as soon as it reads a tuple, it immediately looks up that tuple's remaining attributes in order to compute its full score.
- FA has 2 phases for calculating scores:
  - sort phase
  - scan phase
- Both TA and FA require the scoring function to be monotonic.
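The two FA phases can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes each attribute is an in-memory list of (tid, value) pairs sorted by value in descending order, that higher scores are better, and that the scoring function is a plain sum. The function name `fagin_algorithm` and the sample lists `A` and `B` are hypothetical.

```python
def fagin_algorithm(lists, k, score=sum):
    """Return (top-k as [(score, tid)], depth reached under sorted access)."""
    lookup = [dict(lst) for lst in lists]   # stand-in for looking up a tuple by tid
    seen_in = [set() for _ in lists]        # one candidate set per sorted list
    depth = 0
    # Phase 1: read all lists in parallel until k tids are common to every set.
    while depth < len(lists[0]):
        for i, lst in enumerate(lists):
            seen_in[i].add(lst[depth][0])
        depth += 1
        if len(set.intersection(*seen_in)) >= k:
            break
    # Phase 2: compute the full score of every candidate read in phase 1.
    candidates = set.union(*seen_in)
    scored = [(score([idx[t] for idx in lookup]), t) for t in candidates]
    return sorted(scored, reverse=True)[:k], depth


# Hypothetical data: two attributes, five tuples, larger values are better.
A = [("t1", 10), ("t2", 8), ("t3", 5), ("t4", 3), ("t5", 1)]
B = [("t5", 9), ("t4", 7), ("t1", 6), ("t3", 2), ("t2", 1)]
top, depth = fagin_algorithm([A, B], k=1)
print(top, depth)   # [(16, 't1')] 3
```

On this data FA has to read three rows of each list before one tid (t1) appears in both candidate sets.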
Why does TA work?
- Stopping condition for TA is: Score(hypothetical tuple) < Score(k-th tuple in top-k buffer)
- The hypothetical tuple is built from the last value read from each sorted list.
- The idea is that, by the monotonic property, the score of any unseen tuple cannot exceed the score of the hypothetical tuple, so no unseen tuple can enter the top-k buffer.
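The stopping condition can be illustrated with a small sketch. It makes the same assumptions as the slides' setting (monotonic score, here a plain sum, with higher scores better) and further assumes in-memory lists of (tid, value) pairs sorted in descending order. The function name `threshold_algorithm` and the sample lists are hypothetical.

```python
import heapq


def threshold_algorithm(lists, k, score=sum):
    """Return (top-k as [(score, tid)], depth reached under sorted access)."""
    lookup = [dict(lst) for lst in lists]   # stand-in for looking up a tuple by tid
    top_k = []                              # min-heap: top_k[0] is the k-th best score
    seen = set()
    depth = -1
    for depth in range(len(lists[0])):
        for lst in lists:
            tid = lst[depth][0]
            if tid in seen:
                continue
            seen.add(tid)
            # Eager step: compute the full score as soon as the tuple is read.
            s = score([idx[tid] for idx in lookup])
            if len(top_k) < k:
                heapq.heappush(top_k, (s, tid))
            elif s > top_k[0][0]:
                heapq.heapreplace(top_k, (s, tid))
        # Hypothetical tuple: the last value read from each list.
        hypothetical = score([lst[depth][1] for lst in lists])
        # Stopping condition as stated on the slide (strict inequality).
        if len(top_k) == k and hypothetical < top_k[0][0]:
            break
    return sorted(top_k, reverse=True), depth + 1


A = [("t1", 10), ("t2", 8), ("t3", 5), ("t4", 3), ("t5", 1)]
B = [("t5", 9), ("t4", 7), ("t1", 6), ("t3", 2), ("t2", 1)]
top, depth = threshold_algorithm([A, B], k=1)
print(top, depth)   # [(16, 't1')] 2
```

After two rows the hypothetical tuple scores 8 + 7 = 15, which is below the buffered score of 16, so the scan stops without reading the rest of either list.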
Closing points on TA and FA
- FA stops only when ‘k’ objects are common to (lie in the intersection of) all the candidate sets.
- TA bounds the scores of all unseen tuples by the score of the hypothetical tuple in order to stop.
- Therefore, there is no way FA can stop earlier than TA. Hence, TA is instance optimal.
Query Specific Ranking
- The ranking functions we have discussed so far depend on the assumption of a total ordering on each attribute.
- E.g. total ordering of price:
  - high price is bad
  - low price is good
- In reality, this is not always true.
Query Specific Ranking
- Different people will have a different ideal price in mind.
- E.g. for one person, the ideal restaurant has price = $20 and capacity = 100.
- In this case, the ranking function can be: Score(<p, c>) = 5*|20-p| + 10*|100-c|
- Here the score is a weighted distance from the ideal, so a lower score means a closer match.
Query Specific Ranking
- The above ranking function is more realistic than a ranking function based on total ordering.
- But the above ranking function is not monotonic in price or capacity.
- How can we find the top-k restaurants in this case without looking at the whole data set?
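A quick check of the non-monotonic behaviour, using the restaurant scoring function from the slides. The helper name `score` is only for illustration.

```python
def score(p, c):
    """Weighted distance from the ideal restaurant (price $20, capacity 100)."""
    return 5 * abs(20 - p) + 10 * abs(100 - c)


# Hold capacity at the ideal value and raise the price step by step.
scores = [score(p, 100) for p in (10, 20, 30)]
print(scores)   # [50, 0, 50]
```

As the price increases, the score first falls and then rises, so the function is neither non-decreasing nor non-increasing in price.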
Solution
- Assume the data set is sorted on all the attributes of interest.
- First, create transformed attributes from the original attributes involved in the ranking function, such that the ranking function is monotonic in the transformed attributes.
- Second, simulate sorted access on the transformed attributes.
Transformed attributes
- Consider the restaurant example where: Score(<p, c>) = 5*|20-p| + 10*|100-c|
- Transformed attributes are:
  - ∆p = |20-p|, the difference between the price and the ideal price
  - ∆c = |100-c|, the difference between the capacity and the ideal capacity
- Suppose:
  - tid1 = <$30, 120>, then <∆p, ∆c> = <10, 20>
  - tid2 = <$15, 85>, then <∆p, ∆c> = <5, 15>
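The transformation can be written out for the two sample tuples. This is a small sketch; the names `transform` and `score` and the tuple layout (price, capacity) are assumptions made for illustration.

```python
IDEAL = (20, 100)    # ideal (price, capacity) from the example
WEIGHTS = (5, 10)    # weights from the ranking function


def transform(row, ideal=IDEAL):
    """Map (price, capacity) to (delta_p, delta_c): absolute distance from the ideal."""
    return tuple(abs(q - v) for v, q in zip(row, ideal))


def score(deltas, weights=WEIGHTS):
    """Weighted sum of the deltas; it can only grow when any delta grows."""
    return sum(w * d for w, d in zip(weights, deltas))


tid1 = transform((30, 120))
tid2 = transform((15, 85))
print(tid1, score(tid1))   # (10, 20) 250
print(tid2, score(tid2))   # (5, 15) 175
```

Expressed over ∆p and ∆c, the score is a weighted sum with positive weights, which is monotonic, so tid2 (175) is a closer match than tid1 (250).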
Simulating sorted access
- Achieving monotonicity is just part of the problem.
- We also need sorted access on the transformed attributes (∆p and ∆c).
- Suppose the data is presorted on the ‘price’ attribute.
- Without re-sorting the whole data set on ∆p, we can go directly to the ‘sweet spot’ (i.e. price = $20, and likewise capacity = 100 in the capacity ordering) using a B+ tree index.
- From this point, do 2 walks in opposite directions and merge them to obtain the tuples in sorted order of ∆p (and likewise ∆c).
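The two-walk merge can be sketched as below. A Python list sorted on price, searched with `bisect`, stands in for the B+ tree index here; that substitution, the function name `sorted_by_distance`, and the sample rows are assumptions for illustration only.

```python
from bisect import bisect_left


def sorted_by_distance(rows, ideal):
    """rows: [(value, tid)] sorted ascending by value.
    Yields (delta, tid) in ascending order of delta = |ideal - value|."""
    right = bisect_left(rows, (ideal,))   # first row with value >= ideal
    left = right - 1                      # last row with value < ideal
    while left >= 0 or right < len(rows):
        # Take the left row if the right walk is exhausted or the left row is at least as close.
        take_left = right >= len(rows) or (
            left >= 0 and ideal - rows[left][0] <= rows[right][0] - ideal
        )
        if take_left:
            yield ideal - rows[left][0], rows[left][1]
            left -= 1
        else:
            yield rows[right][0] - ideal, rows[right][1]
            right += 1


by_price = [(12, "t4"), (15, "t2"), (22, "t3"), (30, "t1"), (45, "t5")]
print(list(sorted_by_distance(by_price, 20)))
# [(2, 't3'), (5, 't2'), (8, 't4'), (10, 't1'), (25, 't5')]
```

Because it is a generator, each ∆p value is produced only when requested, so a consumer such as TA can stop early without the walks ever covering the whole list.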
Adding Selection
- This explains how hard conditions are handled or added to a ranking function.
- E.g. look for restaurants in Arlington: location = “Arlington” is a hard condition.
Handling hard conditions
- The query will look like this:
    Select top[10]
    From restaurants
    Where location = “Arlington”
    Order by 5*abs(120 - price)
- How to solve this query?
Handling hard conditions
- The first method is to do selection first, then do ranking.
- This is not the best method, for the following reasons:
  - If selection produces a big result, it defeats the purpose of doing ranking.
  - If selection produces a small result, then doing ranking on it is overkill.
  - The raw data is presorted, and doing a selection first on this raw data destroys the order of tuples. TA requires data to be presorted.
Handling hard conditions
- The second method is to integrate selection as part of ranking.
- Score(<L, P, C>) = if L = “Arlington” then 5*|20-P| + 10*|100-C| else 0
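A minimal sketch of folding the hard condition into the score. One assumption differs from the slide: because the distance terms here are smaller-is-better, the sketch gives non-matching tuples an infinite score so they sort last, where the slide writes 0 for the else branch. The function name `score` and the sample restaurants are hypothetical.

```python
import math


def score(location, price, capacity, wanted="Arlington"):
    """Distance from the ideal restaurant; non-matching locations rank last.
    Assumption: lower is better, so the 'else' case is infinity rather than 0."""
    if location != wanted:
        return math.inf
    return 5 * abs(20 - price) + 10 * abs(100 - capacity)


restaurants = [
    ("r1", "Arlington", 30, 120),
    ("r2", "Dallas", 20, 100),
    ("r3", "Arlington", 15, 85),
]
ranked = sorted(restaurants, key=lambda r: score(*r[1:]))
print([r[0] for r in ranked])   # ['r3', 'r1', 'r2']
```

Under a higher-is-better convention the slide's 0 would play the same role, pushing non-matching tuples to the bottom of the ranking.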
Handling hard conditions
- Now we are no longer dealing with numeric values alone.
- Since the condition location = “Arlington” is part of it, the ranking function is no longer defined on numeric data only but also on categorical data.
- How do we deal with ranking functions that have categorical data?