Lin Lu, Margaret Dunham, and Yu Meng

1 Discovery of Significant Usage Patterns from Clusters of Clickstream Data
Lin Lu, Margaret Dunham, and Yu Meng
Department of Computer Science and Engineering
Southern Methodist University, Dallas, Texas
WebKDD’05

2 Introduction
Significant Usage Patterns (SUP)
- SUPs are extracted from clusters of abstracted user sessions
- Use a unique two-phase abstraction technique
- With desired beginning and/or ending Web pages
- With normalized probability

Comparison of pattern types (table; most cell values were lost in extraction)
Columns: Clustering, Abstraction, Beginning/ending Web page(s), Normalized
Rows: Sequential Pattern; Maximal Frequent Sequence; Maximal Frequent Forward Sequence; User Preferred Navigational Trail [1,2]; Significant Usage Pattern
Surviving cell values: N, Y*, -, Y

3 Model
Sessionized Web Log + Abstraction Hierarchy
-> Sub-abstract URLs
-> Sub-Abstracted Sessions
-> Apply Needleman-Wunsch global alignment algorithm
-> Similarity Matrix
-> Apply nearest neighbor clustering algorithm
-> Clusters of User Sessions
Clusters of User Sessions + Abstraction Hierarchy
-> Concept-based Abstracted URLs
-> Concept-based Abstracted Sessions per Cluster
-> Build Markov model for each cluster
-> Transition Matrix per Cluster
-> Pattern Discovery
-> SUPs per Cluster
WebKDD’05

4 Alignment of Web sessions
Create sub-abstracted Web sessions
URL -> {<Concept hierarchy keyword> <Unique ID> <|>}

Fig 1. Hierarchy of J.C. Penney Web site
JCPenney Homepage
- Department level: D1, D2, ..., Dn
- Category level: C1, ..., Cn
- Item level: I1, ..., In

Example: D0|C875|I D0|C875|I P P P

5 Alignment of Web sessions
Computing the similarity between any two Web pages
The higher a level is in the hierarchy, the more important it is in determining the similarity of two Web pages, so it should be given more weight.
Scoring scheme
- Step 1: determine the longer of the two Web page representation strings.
- Step 2: assign a weight to each level in the hierarchy. The lowest level in the longer page representation string gets weight 2 for its abstract level, the second-lowest level gets weight 4 for its abstract level, and so on. The corresponding ID always gets weight 1.

6 Alignment of Web sessions
Computing the similarity between any two Web pages
- Step 1: compare the two Web page representation strings from left to right and stop at the first pair in which they differ.
- Step 2: compute the ratio of the sum of the weights of the matching parts to the total weight of the longer page representation string.
Example:
Page 1: D0|C875|I, weight = 14
Page 2: D0|C875, weight = 12
Similarity = 12/14 = 0.857
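
The weighting and prefix comparison described on the two slides above can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, "longer" is taken to mean "has more levels", a level whose keyword matches but whose ID differs still earns the keyword weight, and a level written without an ID (as `I` appears in this transcript) contributes no ID weight. Under those assumptions the example pages give 12/14.

```python
def split_page(page):
    """Split 'D0|C875|I' into [('D', '0'), ('C', '875'), ('I', '')]."""
    return [(level[0], level[1:]) for level in page.split("|")]


def page_similarity(a, b):
    """Ratio of matched prefix weight to the total weight of the longer page."""
    pa, pb = split_page(a), split_page(b)
    longer = pa if len(pa) >= len(pb) else pb
    n = len(longer)

    # Keyword weights count down from the top: ..., 6, 4, 2 (lowest level = 2).
    # Assumption: an ID adds weight 1 only when it is actually present.
    total = sum(2 * (n - i) + (1 if ident else 0)
                for i, (_, ident) in enumerate(longer))

    matched = 0
    for i, ((key_a, id_a), (key_b, id_b)) in enumerate(zip(pa, pb)):
        if key_a != key_b:
            break                      # first differing keyword: stop
        matched += 2 * (n - i)
        if id_a != id_b:
            break                      # same level, different ID: stop
        matched += 1 if id_a else 0
    return matched / total


if __name__ == "__main__":
    print(round(page_similarity("D0|C875|I", "D0|C875"), 3))
```

Pages from unrelated branches of the hierarchy (for example a general page `P...` against a department page `D...`) differ at the first keyword and therefore score 0 in this sketch.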

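8 Alignment of Web sessions
Computing the optimal alignment of two sequences using the Needleman-Wunsch algorithm

Alignment matrix (figure): rows X1, ..., Xi-1, Xi, ..., Xm; columns Y1, ..., Yj-1, Yj, ..., Yn.
First row initialized to -d, ..., -(j-1)d, -jd, ..., -nd.
First column initialized to -d, ..., -(i-1)d, -id, ..., -md.
Each cell A(i, j) is computed from A(i-1, j-1), A(i-1, j), and A(i, j-1); the final alignment score is A(m, n).

A(i, j) = max[A(i-1, j-1) + s(Xi, Yj); A(i-1, j) - d; A(i, j-1) - d]

where s(Xi, Yj) is the similarity between Xi and Yj, and d is the penalty for aligning Xi (or Yj) with a gap.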
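
The recurrence above translates directly into a small dynamic program. The sketch below is illustrative only: `needleman_wunsch`, `score`, and `gap` are names chosen here, and the toy scoring function in the usage lines (20 for equal symbols, -10 otherwise) is a stand-in for the page-similarity score.

```python
def needleman_wunsch(xs, ys, score, gap):
    """Return the global alignment score A(m, n) of sequences xs and ys.

    score(x, y) is the pairwise similarity s(Xi, Yj);
    gap is the positive penalty d subtracted for aligning with a gap.
    """
    m, n = len(xs), len(ys)
    A = [[0.0] * (n + 1) for _ in range(m + 1)]

    # Boundary: aligning a prefix against nothing costs one gap per element.
    for i in range(1, m + 1):
        A[i][0] = -i * gap
    for j in range(1, n + 1):
        A[0][j] = -j * gap

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            A[i][j] = max(
                A[i - 1][j - 1] + score(xs[i - 1], ys[j - 1]),  # align Xi with Yj
                A[i - 1][j] - gap,                              # Xi against a gap
                A[i][j - 1] - gap,                              # Yj against a gap
            )
    return A[m][n]


if __name__ == "__main__":
    toy = lambda x, y: 20.0 if x == y else -10.0
    print(needleman_wunsch("abc", "ac", toy, 10.0))
```

Only the final cell is returned here; recovering the alignment itself would additionally require a traceback through the matrix.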

9 Alignment of Web sessions
Apply Needleman-Wunsch global alignment algorithm
Scoring scheme [3]
    if (matching) score = 20;            // a pair of Web pages with similarity 1
    else if (mis-matching) score = –10;  // a pair of Web pages with similarity 0
    else if (gap) score = –10;           // a Web page aligns with a gap
    else score = –10 ~ 20;               // a pair of Web pages with similarity between 0 and 1
Example:
Session 1: P D0|C0|I D469|C469 D2652|C2652
Session 2: D469|C16758|I D0|C0|I D469|C469 P47104
Alignment matrix (values surviving in the transcript):
Column headers: D0|C0|I D469|C469 D2652|C2652
Gap row: -10 -20 -30 -40
Row D469|C16758|I: 5.7 -4.3 -14.3
Remaining entries: 10 17.1 7.1 30 32.1
Thus, session similarity = 32.1/4 = 8.025
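
One way to read the scoring scheme is as a mapping from page similarity in [0, 1] to a score in [-10, 20]. The sketch below assumes a linear interpolation between the two endpoints, which the slide does not state explicitly, and assumes the final score is divided by the length of the longer session (both sessions in the example have four pages). All names are hypothetical.

```python
MATCH_SCORE = 20.0      # pair of pages with similarity 1
MISMATCH_SCORE = -10.0  # pair of pages with similarity 0
GAP_SCORE = -10.0       # a page aligned with a gap


def pair_score(similarity):
    """Map a page similarity in [0, 1] onto [-10, 20].

    Assumption: intermediate similarities are interpolated linearly.
    """
    return MISMATCH_SCORE + (MATCH_SCORE - MISMATCH_SCORE) * similarity


def session_similarity(alignment_score, len_x, len_y):
    """Normalize an alignment score by the longer session's length (assumed)."""
    return alignment_score / max(len_x, len_y)


if __name__ == "__main__":
    print(pair_score(1.0), pair_score(0.0))
    print(session_similarity(32.1, 4, 4))
```

With these definitions, `pair_score` can serve as the pairwise scoring function of a Needleman-Wunsch alignment with a gap penalty of 10.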

12 Create Concept-based Abstracted Sessions
Represent the abstracted page accesses in a session as a sequence like: P1 D1 C1 I1 P2 D2 C2 I2 …
Within a session, the same Pi, Di, Ci, or Ii (i = 1, 2, …) always represents the same page. In different sessions, however, the same page may be represented by different elements.
Example:
Original session: D7107|C7121 D7107|C7126|I076bdf3 D7107|C7131|I084fc96 D7107|C7131 P55730 P96 P27 P14 P27592 P28 P33711
Abstracted session: C1 I1 I2 C2 P1 P2 P3 P4 P5 P6 P
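
The relabeling in the example can be sketched as follows. This is an illustration under assumptions rather than the authors' code: the concept of a page is taken to be the keyword of its lowest level, and distinct pages of the same concept are numbered in order of first appearance within the session. The function name is hypothetical.

```python
def abstract_session(pages):
    """Relabel each page as <concept><index>, e.g. 'D7107|C7121' -> 'C1'."""
    labels = {}    # page string -> label already assigned in this session
    counters = {}  # concept keyword -> number of distinct pages seen so far
    abstracted = []
    for page in pages:
        if page not in labels:
            concept = page.split("|")[-1][0]   # keyword of the lowest level
            counters[concept] = counters.get(concept, 0) + 1
            labels[page] = f"{concept}{counters[concept]}"
        abstracted.append(labels[page])
    return abstracted


if __name__ == "__main__":
    session = [
        "D7107|C7121", "D7107|C7126|I076bdf3", "D7107|C7131|I084fc96",
        "D7107|C7131", "P55730", "P96", "P27", "P14", "P27592", "P28", "P33711",
    ]
    print(" ".join(abstract_session(session)))
```

Because the label table is rebuilt for every session, the same page can receive different labels in different sessions, which matches the remark on the slide.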

13 Generating Significant Usage Patterns
Use a Markov model to represent the sessions in each cluster
Example sessions:
(1) 1, 2, 3, 5, 4
(2) 2, 4, 3, 5
(3) 3, 2, 4, 5
(4) 1, 3, 4, 3
(5) 4, 2, 3, 4, 5
Markov model (figure): states S, 1, 2, 3, 4, 5, E; transition probabilities shown: 0.4, 0.17, 0.2, 0.5, 0.33, 0.25, 0.75
The probability of a path is normalized (the formula is not preserved in this transcript), where Pti is the transition probability between two adjacent states.
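
A first-order transition matrix can be estimated from the sessions by counting adjacent pairs, with an artificial start state S and end state E added to every session. The sketch below assumes plain relative-frequency estimates; the function name is hypothetical. Applied to the five listed sessions, it yields the values 0.4, 0.17, 0.2, 0.5, 0.33, 0.25, and 0.75 (after rounding) among its entries.

```python
from collections import Counter, defaultdict


def transition_matrix(sessions):
    """Estimate P[a][b] = count(a -> b) / count(a -> anything)."""
    counts = defaultdict(Counter)
    for session in sessions:
        path = ["S", *session, "E"]          # add start and end states
        for a, b in zip(path, path[1:]):
            counts[a][b] += 1
    return {
        a: {b: c / sum(row.values()) for b, c in row.items()}
        for a, row in counts.items()
    }


if __name__ == "__main__":
    sessions = [
        [1, 2, 3, 5, 4],
        [2, 4, 3, 5],
        [3, 2, 4, 5],
        [1, 3, 4, 3],
        [4, 2, 3, 4, 5],
    ]
    P = transition_matrix(sessions)
    for state, row in P.items():
        print(state, {nxt: round(p, 2) for nxt, p in row.items()})
```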

14 Generating Significant Usage Patterns
Example (probabilities missing from the transcript are left blank):

Threshold > 0.4, end state is 4
SUP      Probability
S1234    0.45
S12354   0.53
S124
S134     0.43
S1354
S2354
S354

Threshold > 0.4, beginning state is 1, end state is 4
SUP      Probability
1234     0.46
12354    0.56
124      0.5
134
1354     0.58
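
Since the normalization formula did not survive in this transcript, the sketch below assumes one common choice: the geometric mean of the transition probabilities along the path, so that long paths are not penalized merely for having more transitions. That choice, the function names, and the small transition table `P` are all assumptions made for illustration.

```python
import math


def normalized_probability(path, P):
    """Geometric mean of the transition probabilities along `path` (assumed)."""
    probs = [P.get(a, {}).get(b, 0.0) for a, b in zip(path, path[1:])]
    if not probs or min(probs) == 0.0:
        return 0.0                      # empty path or impossible transition
    return math.prod(probs) ** (1.0 / len(probs))


def is_significant(path, P, threshold, begin=None, end=None):
    """Check the optional beginning/ending states and the threshold."""
    if begin is not None and path[0] != begin:
        return False
    if end is not None and path[-1] != end:
        return False
    return normalized_probability(path, P) > threshold


# Hypothetical transition probabilities, for illustration only.
P = {"S": {1: 0.4}, 1: {2: 0.5}, 2: {4: 0.5}}

if __name__ == "__main__":
    print(round(normalized_probability([1, 2, 4], P), 2))
    print(is_significant(["S", 1, 2, 4], P, 0.4, end=4))
```

Under this assumption, a path whose two transitions both have probability 0.5 has normalized probability 0.5, whereas its unnormalized probability would be 0.25.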

15 Experimental Result
(Figure: purchase sessions compared with sessions without purchase)
On average, purchase sessions are longer than sessions without purchase:
- review the information and compare price, quality, etc.
- fill out the billing and shipping information to commit the purchase

16 Experimental Result
SUPs in non-purchase clusters

Cluster 1: 1746 sessions, threshold 0.3, average session length 9.6, 98 states
1. S-C1-C1-C2-C3-C4-C5-C6-C7-E
2. S-C1-C1-C2-C3-C4-C5-E
3. S-C1-C1-C2-C3-E
4. S-C1-C2-C3-C3-C4-C5-C6-C7-E
5. S-C1-C2-C3-C4-C4-C5-C6-C7-E
Interested in gathering information about products in different categories.

Cluster 2: 241 sessions, threshold 0.37, average session length 6.6, 38 states
1. S-P1-P2-P3-P3-E
2. S-P1-P2-P3-P4-P4-P5-E
3. S-P1-P2-P3-P4-P4-E
4. S-P1-P2-P3-P4-P5-P4-E
5. S-P1-P2-P3-P4-P5-P5-E
Interested in reviewing general pages (to gather general information).

Cluster 3: 13 sessions, average session length 3.0, 6 states
1. S-C1-P1-P2-E
2. S-C1-P1-E
3. S-I1-P1-P1-P2-E
4. S-I1-P1-P1-E
5. S-I1-P1-E
Not serious visitors (the average session length is 3).

Additional SUPs shown on the slide:
S-C1-C1-C2-C3-C4-C5-C5-I1-E
S-C1-C1-I1-C1-C2-C3-C4-C5-E
S-I1-C1-C2-C3-C4-C5-C6-C7-E

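17 Experimental Result
SUPs in BNF notation, by cluster and beginning Web page

Non-purchase clusters
Cluster 1: 1746 sessions, average session length 9.6, 98 states
- Threshold 0.3, beginning Web page S: S-{C}-E
- Threshold 0.25, beginning Web page P86806: P86806-{C}-E
Cluster 2: 241 sessions, average session length 6.6, 38 states
- Threshold 0.37, beginning Web page S: S-{P}-[C]-E
- Threshold 0.34, beginning Web page P86806: P86806-[I]-{P}-E
Cluster 3: 13 sessions, average session length 3.0, 6 states
- Beginning Web page S: S-<C | I>-{P}-E
- Threshold 0.2, beginning Web page P86806: P86806-[{P}-[P86806]]-E

Purchase clusters (cluster numbers not preserved)
1858 sessions, average session length 14.9, 55 states
- Threshold 0.47, beginning Web page S: S-[C]-[I]-{P}-E
- Threshold 0.51, beginning Web page P86806
132 sessions, average session length 39.1, 100 states
- Threshold 0.457, beginning Web page S: S-[{{C}|{I}}]-{P}-E
- Threshold 0.434, beginning Web page P86806: P86806-[{C}]-{P}-E
10 sessions, average session length 31.6, 47 states
- Threshold 0.52, beginning Web page S: S-{P}-[{I}]-[{P}]-{C}-E
- Threshold 0.43, beginning Web page P86806: P86806-[I]-[{P}]-{C}-E

Observations
- Purchase sessions: review the information, compare among products, and fill out the payment and shipping information.
- The average length of SUPs is longer in the purchase clusters than in the non-purchase clusters.
- SUPs in the purchase clusters have higher probability than those in the non-purchase clusters: visitors with a purchase in mind vs. random browsing behavior.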

18 Conclusion and Future Work
Summary
- Applying clustering to abstracted user sessions makes it more likely to find groups of users with similar motivations for visiting a specific website.
- Letting users specify the beginning and/or ending Web page(s) gives them more control over generating the patterns that interest them.
Future
- Scalability
- Cluster to identify different user groups
- Online assignment of a user to a predefined cluster

19 References
[1] J. Borges and M. Levene, “Data Mining of User Navigation Patterns”, in Proc. of the Workshop on Web Usage Analysis and User Profiling (WEBKDD'99), pp. 31-36, San Diego, August 15, 1999.
[2] J. Borges and M. Levene, “An average linear time algorithm for web data mining”, International Journal of Information Technology and Decision Making, 3 (2004).
[3] W. Wang and O. R. Zaïane, “Clustering Web Sessions by Sequence Alignment”, Third International Workshop on Management of Information on the Web, in conjunction with the 13th International Conference on Database and Expert Systems Applications (DEXA'2002), Aix en Provence, France, September 2-6, 2002.

20 Thank you
Questions?

