On the Designing of Popular Packages

Yangjun Chen and Wei Shi, Department of Applied Computer Science, University of Winnipeg

Outline
- Motivation
  - Data mining
  - Most popular packages
- Signature trees and modified signature trees
- Single package design (SPD)
  - Basic algorithm
  - Heuristic signature tree method
- Multiple package design (MPD)
- Experiments
- Conclusion and Future Work

Motivation
Given a query log recording customers’ preferences on items or activities, design a package which satisfies as many customers as possible.
A query log kept by a travel agency:
[Table: queries Q1 to Q6 (rows) against the attributes Hot Spring, Ride, Glacier, Hiking, Airline, Boating (columns); entries are 1, 0, or ?; the individual cell values are not recoverable.]
Most popular package: Hot Spring, Hiking, Airline. It satisfies three queries: Q1, Q3, Q5.

Signature Files and Signature Trees
[Figure: signatures s1 to s7 and their signature tree; node labels 1, 2, 4, 5, 7.]
Each path represents the identifier of a signature.
Identifier(s7): (1, 1)(4, 1)(5, 0)(7, 0)
Y. Chen and Y.B. Chen, On the Signature Tree Construction and Analysis, IEEE Transactions on Knowledge and Data Engineering, Vol. 18, No. 9, 2006, pp

Modified signature tree over a query log
Construction of the signature tree for single package design (SPD): Let Q = {q1, …, qn} be a query log. We use qi[j] to represent the value of the jth attribute in qi (i = 1, …, n). Starting from the first attribute value, we divide all queries in Q into two branches. For query qi (1 ≤ i ≤ n), if qi[1] = ‘0’, we put qi into the left branch. If qi[1] = ‘1’, we put it into the right branch. However, if qi[1] = ‘?’, we put it into both the left and the right branch, which is quite different from traditional signature tree construction. Each branch is then divided in the same way on the next attribute.
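The splitting rule can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes each query is a string over '0', '1', '?', uses 0-based attribute positions, and takes m as the number of attributes. The names split and exhaustive_spd and the five-query log are hypothetical.

```python
def split(queries, j):
    """Divide queries on attribute j; a '?' query is copied into both branches."""
    left = [q for q in queries if q[j] in "0?"]    # value 0 or don't care
    right = [q for q in queries if q[j] in "1?"]   # value 1 or don't care
    return left, right


def exhaustive_spd(queries, m, j=0, prefix=""):
    """Expand the whole tree and return (package, satisfied queries) of the best leaf."""
    if not queries:                      # empty branch: nothing can be satisfied
        return prefix + "0" * (m - j), []
    if j == m:                           # leaf: prefix is a complete package
        return prefix, list(queries)
    left, right = split(queries, j)
    best_left = exhaustive_spd(left, m, j + 1, prefix + "0")
    best_right = exhaustive_spd(right, m, j + 1, prefix + "1")
    # Keep the branch whose best leaf holds more queries (ties go left).
    return best_right if len(best_right[1]) > len(best_left[1]) else best_left


# Hypothetical log with three attributes (not the travel agency example).
log = ["1?0", "10?", "?11", "0?1", "1??"]
package, satisfied = exhaustive_spd(log, 3)
```

Each root-to-leaf path spells out one candidate package, and the queries that reach the leaf are exactly those the package satisfies, so the fullest leaf gives a most popular package.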

Single Package Design Tree (SPD)
[Figure: SPD tree with root S0 and nodes S10 to S69.] Query sets at the nodes, level by level:
S0 = {q1, q2, q3, q4, q5, q6}
S10 = {q3, q4, q5, q6} S11 = {q1, q2, q3, q5, q6}
S20 = {q3, q4, q5} S21 = {q4, q5} S22 = {q1, q2, q3, q5} S23 = {q1, q5}
S30 = {q3, q5} S31 = {q4} S32 = {q6} S33 = {q4, q6} S34 = {q1, q3, q5} S35 = {q2} S36 = {q1, q6} S37 = {q6}
S40 = {q3} S41 = {q4, q5} S42 = {q4, q6} S43 = {q4} S44 = {q1, q5} S45 = {q1, q3, q5} S46 = {q1, q6} S47 = {q1}
S50 = {q3} S51 = {q3, q5} S52 = {q6} S53 = {q4, q6} S54 = {q5} S55 = {q1, q5} S56 = {q5} S57 = {q1, q3, q5} S58 = {q1} S59 = {q1, q6}
S60 = {q3, q5} S61 = {q3} S62 = {q4} S63 = {q4, q6} S64 = {q1, q5} S65 = {q1} S66 = {q1, q3, q5} S67 = {q1, q3} S68 = {q1} S69 = {q1, q6}

Approximate Algorithm with Heuristics
Computational complexities of SPD:
- time complexity: O(2^(m-1) · n)
- space complexity: O(mn)
where m = number of attributes in Q, n = number of queries in Q.
Approximate algorithm with a general rule to cut off subtrees: If the number of queries in any branch of a subtree is smaller than the number of queries in the candidate result, the subtree should be pruned.
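The general cut-off rule can be illustrated with a depth-first search that remembers the best leaf found so far. This is only a sketch under assumptions: queries are strings over '0', '1', '?', attributes are checked in their given order, and the name pruned_spd and the sample log are hypothetical.

```python
def pruned_spd(queries, m):
    """Depth-first SPD search that skips branches too small to beat the candidate."""
    best = {"package": None, "satisfied": []}

    def visit(branch, j, prefix):
        # General rule: a branch holding fewer queries than the current
        # candidate result cannot lead to a better package, so cut it off.
        if not branch or len(branch) < len(best["satisfied"]):
            return
        if j == m:                                   # complete package reached
            if len(branch) > len(best["satisfied"]):
                best["package"], best["satisfied"] = prefix, list(branch)
            return
        visit([q for q in branch if q[j] in "0?"], j + 1, prefix + "0")
        visit([q for q in branch if q[j] in "1?"], j + 1, prefix + "1")

    visit(list(queries), 0, "")
    return best["package"], best["satisfied"]


# Hypothetical log with three attributes.
result = pruned_spd(["1?0", "10?", "?11", "0?1", "1??"], 3)
```

The number of queries in a branch never grows on the way down, so the size of a branch bounds the size of every leaf below it; that is why the comparison against the candidate result is safe.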

Derived rules:
- If, for all the queries represented by a node v, the attribute to be checked contains only ‘0’ or ‘?’, the subtree rooted at the right child of v can be pruned.
- If, for all the queries represented by a node v, the attribute to be checked does not contain ‘0’, and at least one of them contains ‘1’, the subtree rooted at the left child of v can be cut off.
- If, for all the queries represented by a node v, the attribute to be checked contains only ‘?’, we cut off the subtree rooted at the right child of v. (Notice that we could also prune the subtree rooted at the left child of v instead; the result would be the same.)
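A compact way to read the three derived rules is as a function that decides which children of a node are worth expanding. The sketch below assumes string-encoded queries with 0-based positions; children_to_expand is a hypothetical name.

```python
def children_to_expand(queries, j):
    """Return the branch labels ('0' = left, '1' = right) still worth expanding."""
    values = {q[j] for q in queries}
    if "1" not in values:
        # Only '0' or '?' (including the all-'?' case): prune the right child.
        return ["0"]
    if "0" not in values:
        # No '0' and at least one '1': prune the left child.
        return ["1"]
    return ["0", "1"]                    # both values occur: keep both children
```

For example, children_to_expand(["0?", "?1"], 0) keeps only the left child, because no query asks for a 1 on the first attribute.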

Choose the attribute with the minimum number of ‘?’ values, to minimize the copying of don’t-care queries into both branches. If more than one column contains the same number of ‘?’, we count the number of 1s and the number of 0s in each of them, and select the column in which these two numbers are closest to each other, to keep the tree balanced.
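The attribute selection heuristic amounts to a two-part sort key. The following is a sketch under the same string encoding as before; choose_attribute and the remaining parameter (the 0-based positions not yet checked) are hypothetical names.

```python
def choose_attribute(queries, remaining):
    """Pick the next attribute: fewest '?', then most balanced counts of 1s and 0s."""
    def key(j):
        column = [q[j] for q in queries]
        dont_care = column.count("?")
        imbalance = abs(column.count("1") - column.count("0"))
        return (dont_care, imbalance)    # compared left to right
    return min(remaining, key=key)


# Hypothetical logs: the first is decided by the '?' count, the second by balance.
first = choose_attribute(["1?0", "10?", "?11", "0?1", "1??"], [0, 1, 2])
second = choose_attribute(["10?", "11?", "10?", "01?"], [0, 1, 2])
```

When two attributes tie on both counts, min returns the one listed first in remaining; the slides do not say how such ties are broken, so that choice is an assumption.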

Example (Step 1 to Step 3):
[Figure: the tree after Step 1, Step 2, and Step 3; node labels in the figure include (S0, 3), (S10, 2), (S11, 1), (S20, 5).]
S0 = {q1, q2, q3, q4, q5, q6}
S10 = {q1, q3, q5, q6} S11 = {q2, q4, q6}
S20 = {q1, q3, q5} S21 = {q1, q5} S22 = {q4, q6} S23 = {q2, q6}
S31 = {q1, q3, q5}

Example (Step 4 to Step 6):
[Figure: the tree after Step 4, Step 5, and Step 6; node labels in the figure include (S0, 3), (S10, 2), (S11, 1), (S20, 5), (S31, 1), (S41, 4), (S51, 6), and S60.]
S41 = {q1, q3, q5}
S51 = {q1, q3, q5}
S60 = {q1, q3, q5}

Multiple Package Design Tree (MPD)
Algorithm 5. MPD based on modified signature trees
Input: a set of queries Q.
Output: a set of packages P satisfying all queries.
begin
  P ← ∅;
  while (Q ≠ ∅) {
    create the root node v;
    {P', Q'} ← ConstructSPD(v, Q, 1);
    Q ← Q \ Q';
    P ← P ∪ P';
  }
end
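The loop of the MPD algorithm can be sketched in Python as below. To keep the block self-contained, ConstructSPD is replaced by a brute-force stand-in (best_single_package) that tries every package, which is an assumption made for illustration and not the signature tree method. All function names and the sample log are hypothetical.

```python
from itertools import product


def satisfies(package, query):
    """A package satisfies a query if it agrees on every attribute that is not '?'."""
    return all(c == "?" or c == p for p, c in zip(package, query))


def best_single_package(queries, m):
    """Brute-force stand-in for ConstructSPD: try all 2^m packages."""
    best_pkg, best_sat = None, []
    for bits in product("01", repeat=m):
        pkg = "".join(bits)
        sat = [q for q in queries if satisfies(pkg, q)]
        if len(sat) > len(best_sat):
            best_pkg, best_sat = pkg, sat
    return best_pkg, best_sat


def multiple_package_design(queries, m):
    """Greedy MPD: repeatedly design the most popular package for the queries left."""
    remaining = list(queries)             # Q
    packages = []                         # P
    while remaining:                      # while Q is not empty
        pkg, _ = best_single_package(remaining, m)
        packages.append(pkg)              # P <- P ∪ P'
        remaining = [q for q in remaining if not satisfies(pkg, q)]  # Q <- Q \ Q'
    return packages


# Hypothetical log with three attributes.
packages = multiple_package_design(["1?0", "10?", "?11", "0?1", "1??"], 3)
```

The loop terminates because every query is satisfied by at least one package, so each round removes at least one query from the remaining set.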

Experiments
- Signature tree for SPD: It works in two steps. In the first step, we construct a signature-tree-like structure, called an SPD-tree. Then, in the second step, we search the SPD-tree to find the most popular package.
- Heuristic signature tree for SPD: The basic algorithm can be dramatically improved by integrating the SPD-tree construction and the SPD-tree search into a single process. By doing this, we improve both response time and package quality.
- Heuristic SPD: This algorithm was proposed by Miah [6]. It is in fact an algorithm to find an approximate solution to an NP-complete problem, the so-called MINSAT problem: given a set U of Boolean variables and a collection of disjunctive clauses over U, find a truth assignment that minimizes the number of satisfied disjunctive clauses. In [6], this algorithm is referred to as MINSAT HeuristicPD.

Experiments
All the experiments were performed on a Sony notebook with a 2.53 GHz Intel Core i3 CPU, a 300 GB hard disk, and 8.0 GB of memory. The code is written in C++ and runs on 32-bit Windows 7 Professional.
Real data: the favourites of 100 customers at a Chinese restaurant, surveyed during a large party. The survey was designed with 10 attributes such as lemon chicken, ginger beef, honey garlic shrimp, broccoli with seafood, and so on. The customers responded “yes”, “no”, or “don’t care” to each attribute to indicate their preferences.
Synthetic data: queries with up to 30 attributes. Each query is represented by a string in which each position is ‘0’, ‘1’, or ‘?’, evenly populated. We may increase the number of ‘?’ to obtain different experimental results.
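A synthetic query log of this kind could be generated along the following lines. This is a sketch, not the generator used in the experiments: the name synthetic_log, the p_dont_care parameter (the probability of ‘?’ at each position), and the fixed seed are all assumptions.

```python
import random


def synthetic_log(n, m, p_dont_care=1 / 3, seed=0):
    """Generate n query strings of length m over '0', '1', '?'.

    With the default p_dont_care the three symbols are evenly populated;
    raising it increases the share of '?' values.
    """
    rng = random.Random(seed)             # fixed seed for repeatable runs
    p_bit = (1 - p_dont_care) / 2         # remaining mass split between 0 and 1
    weights = [p_bit, p_bit, p_dont_care]
    return ["".join(rng.choices("01?", weights=weights, k=m)) for _ in range(n)]


log = synthetic_log(n=1000, m=30)
```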

Experiments (SPD on real data)
Test results on real data sets for SPD

Experiments (SPD on synthetic data)
Test results for varying attributes on SPD

Experiments (SPD on synthetic data)
Test results for varying query log size on SPD

Experiments (MPD on real data)
Test results on real data sets for MPD

Experiments (MPD on synthetic data)
Test results on varying attributes for MPD

Experiments (MPD on synthetic data)
Test results on varying query log sizes for MPD

Conclusion and Future Work
Main contribution
- Signature tree based method for SPD
- Approximate algorithm for SPD
- Approximate algorithm for MPD
- Extensive tests
Future work
- Theoretical analysis of the ratio of the approximate solutions to the optimal solution

Thank you!
