
1
Addressing Diverse User Preferences in SQL-Query-Result Navigation
SIGMOD '07
Zhiyuan Chen, University of Maryland, Baltimore County
Tao Li, Florida International University

2
Recap…
– Exploratory queries on database systems are becoming common.
– These queries return a large set of results, most of them irrelevant to the user.
– Categorization (and ranking) helps users locate relevant records.
– A user typically does not expend any effort in specifying his/her preferences.

3
Motivation
– Previous work assumed that all users have the same preferences, which is not true in most scenarios.
– Ignoring user preferences leads to the construction of sub-optimal navigation trees.

4

5
Motivation (cont…)
Key challenges:
1. How to summarize diverse user preferences from the behaviors of all the users in the system?
2. How to decide the subset of preferences associated with a specific user?

6
System Architecture
[Figure: architecture diagram with the components Query History, Cluster Generation, Clusters over Data, Query, Query Execution, Results, and Navigation Tree Construction]

7
System Architecture (cont…)
Pre-processing step:
– Analyze the query history and generate a set of (non-overlapping) clusters over the data.
– Each cluster corresponds to one type of user preference.
– Each cluster has an associated probability of users being interested in it.
Assumption: individual preferences can be represented as a subset of these clusters.

8
System Architecture (cont…)
Generation of the navigation tree:
– Occurs when a specific user asks a query.
– Intersect the set of clusters generated in the pre-processing step with the answers of the given query.
– Construct a navigation tree over the intersected clusters on the fly.
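The intersection step can be illustrated with a minimal Python sketch. It assumes clusters are stored as a mapping from a label to a (record-id set, probability) pair; the function name `intersect_clusters`, the labels, and the sample data are hypothetical and not taken from the slides.

```python
def intersect_clusters(clusters, result_ids):
    """Restrict each precomputed cluster to the records returned by the query.

    clusters: dict mapping label -> (set of record ids, probability).
    result_ids: iterable of record ids in the query answer.
    """
    result = set(result_ids)
    restricted = {}
    for label, (records, prob) in clusters.items():
        kept = set(records) & result
        if kept:  # clusters with no record in the answer are dropped
            restricted[label] = (kept, prob)
    return restricted


# Hypothetical clusters from the pre-processing step and a query answer.
clusters = {"C1": ({1, 2, 3}, 0.5), "C2": ({4, 5}, 0.3), "C3": ({6}, 0.2)}
restricted = intersect_clusters(clusters, [2, 3, 4, 7])
```

Record 7 belongs to no cluster in this toy input and is simply ignored; how such records are handled in the actual system is not stated on the slide.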

9
Terminology and Definitions
Symbol    Meaning
D         Dataset (a relation with n records r_1, …, r_n and m attributes)
H         Query history
T         Navigation tree
A_i       Attribute i
Q_i       Query i
F_i       Frequency of query i
DQ_i      Results of query i
QC_i      Query cluster i
C_i       Preference cluster i
P_i       Probability of users being interested in C_i
t         Tree node

10
Terminology and Definitions
Query history H is of the form {(Q_1, U_1, F_1), …, (Q_k, U_k, F_k)}, in chronological order, where
– Q_i is a query
– U_i is a user session ID
– F_i is the weight associated with the query
Each query Q_i is of the form Cond(A_1) ∧ Cond(A_2) ∧ … ∧ Cond(A_m).
Each Cond(A_i) contains only range or equality conditions.
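As an illustration of these definitions, the sketch below models a conjunctive query as a mapping from attribute name to an inclusive range, with equality written as a range whose two ends coincide. The class name `Query`, the attribute names `price` and `beds`, and all values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Query:
    # attribute name -> (low, high), inclusive; equality is low == high
    conds: Dict[str, Tuple[float, float]]

    def matches(self, record: dict) -> bool:
        """True if the record satisfies every Cond(A_i) of the conjunction."""
        return all(lo <= record[a] <= hi for a, (lo, hi) in self.conds.items())


# History entries are (Q_i, U_i, F_i): query, session id, weight.
history = [
    (Query({"price": (100, 200), "beds": (3, 3)}), "s1", 1.0),
    (Query({"price": (150, 300)}), "s2", 2.0),
]
records = [{"price": 180, "beds": 3}, {"price": 250, "beds": 2}]

# DQ_i: the records returned by each query in the history.
answers = [[r for r in records if q.matches(r)] for q, _, _ in history]
```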

11
Terminology and Definitions (cont…)
– Data D is partitioned into a disjoint set of clusters C = {C_1, C_2, …, C_q}.
– Each C_i has an associated probability P_i, the probability that users are interested in that cluster.

12
Definition: Navigation Tree
A navigation tree T(V, E, L) satisfies the following conditions:
– Each node v has a label label(v) denoting a Boolean condition.
– v contains the records that satisfy the conditions on v and on all of its ancestors.
– The conditions associated with the children of a non-leaf node v are on the same attribute (called the split attribute).
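One possible in-memory shape for such a tree is sketched below. The `Node` class, its field names, and the sample conditions are assumptions made for illustration; the slides do not prescribe a data structure.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    label: str                               # label(v), a readable condition
    cond: Callable[[dict], bool]             # the Boolean condition itself
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)
    children: List["Node"] = field(default_factory=list)
    split_attr: Optional[str] = None         # attribute shared by all children

    def add_child(self, label, cond, attr):
        # All children of a node must carry conditions on one attribute.
        if self.split_attr not in (None, attr):
            raise ValueError("children must split on the same attribute")
        self.split_attr = attr
        child = Node(label, cond, parent=self)
        self.children.append(child)
        return child

    def contains(self, record):
        # A record is in v if it satisfies the conditions of v and its ancestors.
        node = self
        while node is not None:
            if not node.cond(record):
                return False
            node = node.parent
        return True


root = Node("all", lambda r: True)
a = root.add_child("price <= 200", lambda r: r["price"] <= 200, "price")
b = a.add_child("beds = 3", lambda r: r["beds"] == 3, "beds")
```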

13
Clusters over Data
– Two records r_i and r_j are indistinguishable if they always appear in the results of the same set of queries.
– Define a binary relation R: (r_i, r_j) ∈ R iff the above condition is satisfied.
– R is reflexive, symmetric and transitive, so R is an equivalence relation and partitions D into equivalence classes (clusters) {C_1, …, C_q}.

14
Clusters over Data (Example)
D = {r_1, …, r_13}
Q_1 = {r_1, …, r_10}; Q_2 = {r_1, …, r_9, r_11}; Q_3 = {r_12}
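The equivalence classes for this example can be computed by grouping records on the set of queries that return them. The sketch below is one straightforward way to do so; the function name `equivalence_classes` is hypothetical and records are represented by their integer indices.

```python
from collections import defaultdict


def equivalence_classes(record_ids, query_results):
    """Group records that appear in exactly the same set of query results."""
    groups = defaultdict(list)
    for r in record_ids:
        # The signature is the set of queries whose result contains r.
        signature = frozenset(
            i for i, res in enumerate(query_results) if r in res
        )
        groups[signature].append(r)
    return sorted(groups.values())


D = list(range(1, 14))            # r_1 .. r_13
Q1 = set(range(1, 11))            # r_1 .. r_10
Q2 = set(range(1, 10)) | {11}     # r_1 .. r_9 and r_11
Q3 = {12}
classes = equivalence_classes(D, [Q1, Q2, Q3])
```

On this input the grouping yields five classes: {r_1, …, r_9}, {r_10}, {r_11}, {r_12}, and {r_13} (the record returned by no query).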

15
Clusters over Data (Heuristics)
Problem: too many clusters!
Apply heuristics to decrease the number of clusters:
– Prune unimportant queries
  – Remove queries with empty answers
  – Retain the most specific query in a given session
– Merge similar queries
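A minimal sketch of the pruning heuristics follows. The slide does not define "most specific"; the code assumes it means the query with the smallest non-empty answer within a session, which is only one plausible reading. The function name `prune_queries` and the sample data are hypothetical.

```python
def prune_queries(history, results):
    """Drop empty queries and keep one query per session.

    history: list of (query_id, session_id, weight) in chronological order.
    results: dict mapping query_id -> set of record ids.
    Assumption: "most specific" = smallest non-empty answer in the session.
    """
    best = {}
    for qid, session, weight in history:
        res = results[qid]
        if not res:
            continue  # heuristic 1: remove queries with empty answers
        current = best.get(session)
        if current is None or len(res) < len(results[current[0]]):
            best[session] = (qid, session, weight)  # heuristic 2
    return list(best.values())


history = [("q1", "s1", 1), ("q2", "s1", 1), ("q3", "s1", 1), ("q4", "s2", 1)]
results = {"q1": {1, 2, 3}, "q2": {1, 2}, "q3": set(), "q4": {5}}
kept = prune_queries(history, results)
```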

16
Clusters over Data (Merge Similar Queries)
Algorithm:
1. Compute the result DQ_i for each query Q_i.
2. Compute the query cluster QC_i for each query Q_i.
3. Repeat until no more merging is possible:
   1. Compute the distance between each pair of query clusters.
   2. Merge two clusters QC_i and QC_j that have a distance less than B.
Distance: d(QC_i, QC_j) = 1 − |QC_i ∩ QC_j| / |QC_i ∪ QC_j|

17
Merge Similar Queries (Example)
Let B = 0.2.
d(QC_1, QC_2) = 1 − 9/11 ≈ 0.18, d(QC_1, QC_3) = 1, d(QC_2, QC_3) = 1
Merge QC_1 and QC_2.
This results in 2 query clusters: QC_1 = {r_1, …, r_11}, QC_2 = {r_12}.
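The example can be reproduced with a short Python sketch. It assumes the distance is the Jaccard distance on result sets (which matches the 1 − 9/11 figure above), that the closest pair is merged first, and that a merged cluster is represented by the union of its members' results; the last two choices are assumptions, not details given on the slides.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b|; two empty sets are treated as distance 0."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def merge_similar(results, threshold):
    """Greedily merge the closest pair until no pair is closer than threshold."""
    clusters = [set(r) for r in results]
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: jaccard_distance(clusters[p[0]], clusters[p[1]]),
        )
        if jaccard_distance(clusters[i], clusters[j]) >= threshold:
            break
        clusters[i] |= clusters[j]   # merged cluster = union of results
        del clusters[j]              # j > i, so index i stays valid
    return clusters


Q1 = set(range(1, 11))
Q2 = set(range(1, 10)) | {11}
Q3 = {12}
merged = merge_similar([Q1, Q2, Q3], threshold=0.2)
```

This naive loop recomputes all pairwise distances in every round, which is the behaviour the cubic term in the complexity analysis refers to.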

18
Merge Similar Queries (Complexity Results)
– O(|H||D| + |H|^3 t_d), where t_d is the time to compute a distance
– Can be improved by:
  – Sampling
  – Pre-computation of distances: O(|H||D| + |H|^2 t_d + |H|^2 log|H|)
  – Min-wise hashing: O(|H||D| + |H|^2 k + |H|^2 log|H|), where k is the hash signature size
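To show why a signature of size k replaces the distance cost t_d, here is a minimal min-wise hashing sketch. It uses simple linear hash functions modulo a prime, a common textbook choice and not necessarily what the authors used; all names and parameters are hypothetical, and result sets are assumed to be non-empty sets of integer record ids smaller than the prime.

```python
import random

PRIME = 2_147_483_647  # 2**31 - 1; record ids are assumed to be below this


def make_hashes(k, seed=0):
    """Draw k hash functions h(x) = (a*x + b) mod PRIME with a != 0."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]


def signature(record_ids, hashes):
    """Signature = minimum hash value of the set under each hash function."""
    return [min((a * x + b) % PRIME for x in record_ids) for a, b in hashes]


def estimated_distance(sig_a, sig_b):
    """Estimate the Jaccard distance as the fraction of differing positions."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return 1.0 - matches / len(sig_a)


hashes = make_hashes(k=128)
sig1 = signature(set(range(1, 11)), hashes)
sig2 = signature(set(range(1, 10)) | {11}, hashes)
sig3 = signature({12}, hashes)
approx = estimated_distance(sig1, sig2)
```

Comparing two signatures costs O(k) regardless of the size of the underlying result sets; the estimate is approximate and its accuracy grows with k.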

19
Generate Clusters
QC_1, …, QC_k are generated after query pruning and merging.
For each record r_i:
– Generate the set S_i of query clusters such that one of the queries in the cluster returns r_i.
– Group the records by S_i and assign a class label C_i to each group.
– Compute P_i: the sum of the frequencies of the queries in S_i divided by the sum of the frequencies of all queries in H (the history).
Example: P_1 = 2/3, P_2 = 1/3 and P_3 = 0
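The example probabilities can be reproduced with the sketch below, under the assumption that each of the three example queries has frequency 1, so that the merged query cluster {r_1, …, r_11} carries a total frequency of 2. The function name `generate_clusters` and the data layout are hypothetical.

```python
from collections import defaultdict


def generate_clusters(record_ids, query_clusters, total_frequency):
    """Group records by the set of query clusters that return them.

    query_clusters: list of (record id set, summed frequency of its queries).
    Returns a list of (records, probability) in order of first appearance.
    """
    groups = defaultdict(list)
    for r in record_ids:
        s = frozenset(
            i for i, (recs, _) in enumerate(query_clusters) if r in recs
        )
        groups[s].append(r)
    clusters = []
    for s, recs in groups.items():
        prob = sum(query_clusters[i][1] for i in s) / total_frequency
        clusters.append((recs, prob))
    return clusters


D = list(range(1, 14))
query_clusters = [(set(range(1, 12)), 2), ({12}, 1)]  # after merging Q_1 and Q_2
preference_clusters = generate_clusters(D, query_clusters, total_frequency=3)
```

The third cluster contains only r_13, which no query returns, hence its probability of 0.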

20
Navigation Tree Construction
Given D, Q and C, find a tree T(V, E, L) such that
– T contains all records in the result of Q
– There does not exist a tree T′ with NCost(T′, C) < NCost(T, C)
NCost(T, C): [equation not preserved in the transcript]

21
Navigation Tree Construction
– The optimal-tree construction problem is NP-hard.
– Observation: the navigation tree is very similar to a decision tree, so any decision-tree construction algorithm can be used.
– Decision-tree algorithms compute information gain to measure how well an attribute classifies the data. Here, the criterion is to minimize navigation cost.

22
Navigation Tree Construction (Decision Tree Construction)
Precondition: each record has a class label assigned in the clustering step.
Algorithm:
1. Create a root R and assign all records to it.
2. If all the records have the same class label, stop.
3. Select the attribute that maximizes the reduction in (global) navigation cost (the counterpart of information gain) to expand the tree for the next level.
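The recursion behind these steps can be sketched as follows. The navigation-cost formula is not preserved in this transcript, so the code takes the split score as a parameter and, purely as a stand-in, uses a simple majority-count gain; `placeholder_gain`, `build_tree`, the binary threshold splits, and the sample data are all assumptions.

```python
from collections import Counter


def majority(labels):
    """Size of the largest class in a list of labels (0 if empty)."""
    return max(Counter(labels).values()) if labels else 0


def placeholder_gain(parent, left, right):
    # Stand-in score, NOT the navigation-cost gain from the slides.
    return majority(left) + majority(right) - majority(parent)


def build_tree(records, labels, attrs, gain=placeholder_gain):
    """Greedy top-down construction with binary splits attr <= value."""
    leaf = {"leaf": True, "records": records, "labels": labels}
    if len(set(labels)) <= 1:
        return leaf                       # step 2: all labels equal, stop
    best = None
    for a in attrs:                       # step 3: pick the best split
        for v in sorted({r[a] for r in records})[:-1]:
            left = [i for i, r in enumerate(records) if r[a] <= v]
            right = [i for i, r in enumerate(records) if r[a] > v]
            g = gain(labels, [labels[i] for i in left],
                     [labels[i] for i in right])
            if best is None or g > best[0]:
                best = (g, a, v, left, right)
    if best is None or best[0] <= 0:
        return leaf                       # no split improves the score
    _, a, v, left, right = best
    return {
        "leaf": False, "attr": a, "value": v,
        "le": build_tree([records[i] for i in left],
                         [labels[i] for i in left], attrs, gain),
        "gt": build_tree([records[i] for i in right],
                         [labels[i] for i in right], attrs, gain),
    }


records = [{"price": 100, "beds": 2}, {"price": 150, "beds": 3},
           {"price": 300, "beds": 2}, {"price": 400, "beds": 3}]
labels = ["C1", "C1", "C2", "C2"]
tree = build_tree(records, labels, ["price", "beds"])
```

Passing a different `gain` callable is the intended extension point for a cost-based splitting criterion.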

23
Navigation Tree Construction (Splitting Criteria)
Navigation cost includes:
– Cost of visiting leaf nodes (the results)
– Cost of visiting intermediate nodes (category labels)

24
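Splitting Criteria (Example)
[Figure: two candidate splits of the same node. Split on A_1 (A_1 <= v_1 / A_1 > v_1), shown with class labels (C1, C2, C1, C2); split on A_2 (A_2 <= v_2 / A_2 > v_2), shown with class labels (C1, C2, C2, C2) and (C1, C1, C1, C2).]
P(C1) = P(C2) = 0.5
Navigation cost = [value not preserved in the transcript]
Which split is better?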

25
Splitting Criteria: Cost of Visiting Leaf Nodes
– Let t be the node to be split and N(t) the number of records in t.
– Let P_i be the probability that users are interested in cluster C_i.
– The gain (reduction in navigation cost) when t is split into t_1 and t_2 is: [equation not preserved in the transcript]

26
Splitting Criteria: Cost of Visiting Intermediate Nodes
Observation: given a perfect tree T with N records and k classes, where each class C_i in T has N_i records, the entropy E(T) = −Σ_{i=1..k} (N_i/N) log(N_i/N) approximates the average length of root-to-leaf paths for all records in T.

27
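Splitting Criteria: Cost of Visiting Intermediate Nodes
[Figure: perfect tree rooted at t whose leaves are the pure groups (C1, C1, …, C1) … (Ck, Ck, …, Ck); the tree is annotated with log N and the subtree of class C_i with log N_i.]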
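One reading of the figure annotations, offered here as an assumption since the slide's own formula is not preserved, is that a perfect tree over N records has height about log N, while the records of class C_i occupy a subtree of height about log N_i. The path from the root down to that subtree then has length log N − log N_i, and averaging over all records gives the entropy E(T) used in the information-gain formula:

```latex
\begin{align*}
\text{path length to the } C_i \text{ subtree}
  &\approx \log N - \log N_i = -\log\frac{N_i}{N}, \\
\text{average over all } N \text{ records}
  &\approx \sum_{i=1}^{k} \frac{N_i}{N}\left(-\log\frac{N_i}{N}\right)
   = -\sum_{i=1}^{k} \frac{N_i}{N}\log\frac{N_i}{N} = E(T).
\end{align*}
```

Each class contributes with weight N_i/N because that is the fraction of records whose path ends in its subtree.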

28
Splitting Criteria: Combining the Two Costs
Gain when a node t is split into t_1 and t_2
Information gain due to a split: IGain(t, t_1, t_2) = E(t) − (N_1/N) E(t_1) − (N_2/N) E(t_2)
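The information-gain formula can be checked numerically with a small Python sketch. It assumes E is the Shannon entropy of the class labels with base-2 logarithms (the slides do not state the base). The uneven label lists are taken from the splitting-criteria example; the even split is an assumed contrast case.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(parent, left, right):
    """IGain(t, t1, t2) = E(t) - N1/N * E(t1) - N2/N * E(t2)."""
    n = len(parent)
    return (entropy(parent)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))


parent = ["C1"] * 4 + ["C2"] * 4
uneven = info_gain(parent, ["C1", "C2", "C2", "C2"], ["C1", "C1", "C1", "C2"])
even = info_gain(parent, ["C1", "C2", "C1", "C2"], ["C1", "C2", "C1", "C2"])
```

The split that separates the classes more cleanly receives the larger gain, while a split that leaves both children with the parent's class mix receives a gain of zero.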
