Discovery of Significant Usage Patterns from Clickstream Data


Presentation on theme: "Discovery of Significant Usage Patterns from Clickstream Data"— Presentation transcript:

1 Discovery of Significant Usage Patterns from Clickstream Data
Margaret H. Dunham, Lin Lu
CSE Department, Southern Methodist University, Dallas, Texas
This material is based upon work supported by the National Science Foundation under Grant No. IIS
05/04/05, Travelocity

2 Outline
- Web Usage Mining Overview
- Our Work: Significant Usage Patterns
- Ongoing/Future Research

3 Web Usage Mining Applications
- Personalization
- Improve the structure of a site’s Web pages
- Aid in caching and prediction of future page references
- Improve the design of individual pages
- Improve the effectiveness of e-commerce (sales and advertising)

4 Web Usage Mining Activities
- Preprocessing the Web log
  - Cleanse: remove extraneous information
  - Sessionize: a session is the sequence of pages referenced by one user at a sitting
- Pattern Discovery
  - Count patterns that occur in sessions
  - A pattern is a sequence of pages referenced in a session
- Pattern Analysis
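The sessionize step can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the `(user, timestamp, page)` record layout, the `sessionize` helper, and the 30-minute inactivity timeout are all assumptions, since the slide defines a session only as the pages one user references at a sitting.

```python
from collections import defaultdict

def sessionize(records, timeout=1800):
    """Group (user, timestamp_seconds, page) records into per-user sessions.

    A new session starts when the gap between two consecutive requests
    from the same user exceeds `timeout` seconds (an assumed cutoff).
    """
    by_user = defaultdict(list)
    for user, ts, page in records:
        by_user[user].append((ts, page))

    sessions = []
    for user, hits in by_user.items():
        hits.sort()  # order each user's requests by time
        current = [hits[0][1]]
        for (prev_ts, _), (ts, page) in zip(hits, hits[1:]):
            if ts - prev_ts > timeout:
                sessions.append((user, current))  # close the previous sitting
                current = []
            current.append(page)
        sessions.append((user, current))
    return sessions

# Hypothetical cleansed log records: (user, seconds since start, page)
log = [
    ("u1", 0, "P1"), ("u1", 60, "P2"), ("u2", 30, "C1"),
    ("u1", 4000, "P3"),  # more than 30 minutes after u1's previous request
]
result = sessionize(log)
```

Here user `u1` ends up with two sessions because of the long pause before `P3`. In practice the user key would itself be an approximation (for example IP address plus user agent), since exact user identification is not possible.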

5 Pattern Types
- Association Rules: none of the properties hold
- Episodes: only ordering holds
- Sequential Patterns: ordered and maximal
- Forward Sequences: ordered, consecutive, and maximal
- Maximal Frequent Sequences: all properties hold
- User Preferred Navigation Trail: not a true pattern, but representative of many

6 Web Usage Mining Issues
- Exact identification of the user is not possible.
- The exact sequence of pages referenced by a user cannot be recovered because of caching.
- A session is not well defined.
- Security, privacy, and legal issues

7 The BIG PICTURE: CAN’T SEE THE FOREST FOR THE TREES
[Figure: Web log excerpt; only fragments such as the search term "corduroy+coats" survive]
S-P1-P2-P3-P4-P5-P6-C1-C2-E
S-P1-P2-P3-P4-P5-C4-I6-I7-I8-E

8 Solution: SIGNIFICANT USAGE PATTERNS
- Clustering
- Abstraction
- User Preferred Navigation Trails

9 Interests… Motivations…
[Diagram labels: Web Server; Web Log; Preprocess Web Data (Cleanse, Sessionize); URL Abstraction; Cluster Web Sessions; Markov Model per Cluster; User defined beginning/ending Web pages; Normalized Probability; Significant Usage Pattern; Markov Model; User Preferred Navigation Trail]

10 Significant Usage Pattern (SUP):
A SUP is a path extracted from a Markov model with user-defined starting and ending states whose normalized product of probabilities along the path satisfies a given threshold.
Differences from previous research:
- SUPs are extracted from clusters of user sessions
- User sessions are abstracted sessions
- SUPs start and end at specific Web pages of interest to the user
A SUP need not be an exact pattern found in any session; rather, it is representative of the patterns found.

11 Model
[Flow diagram labels, in order:]
- Sessionized Web Log, Abstraction Hierarchy
- Sub-abstract URLs
- Sub-Abstracted Sessions
- Apply Needleman-Wunsch global alignment algorithm
- Similarity Matrix
- Apply Nearest neighbor clustering algorithm
- Clusters of User Sessions
- Concept-based Abstracted URLs
- Concept-based Abstracted Sessions per Cluster
- Build Markov model for each cluster
- Transition Matrix per Cluster
- Pattern Discovery
- Patterns per Cluster
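The alignment and clustering steps named in the model can be sketched as follows. The slides do not give the scoring scheme, so several things here are assumptions made only for illustration: the `exact_match` page score (+1 / -1), the gap penalty of -1, the normalization of the alignment score by the longer session length, and the clustering threshold of 0.5. All function names are hypothetical.

```python
def needleman_wunsch(a, b, sim, gap=-1.0):
    """Return the optimal global alignment score of sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap  # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sim(a[i - 1], b[j - 1]),  # align pages
                score[i - 1][j] + gap,                          # gap in b
                score[i][j - 1] + gap,                          # gap in a
            )
    return score[-1][-1]

def exact_match(p, q):
    # Assumed page score; a hierarchy-aware page similarity could be used instead.
    return 1.0 if p == q else -1.0

def session_similarity(a, b):
    # Assumed normalization: divide by the longer session's length.
    return needleman_wunsch(a, b, exact_match) / max(len(a), len(b))

def nearest_neighbor_clusters(items, similarity, threshold):
    """Give each item the label of its most similar earlier item, or a
    new label when no earlier item reaches `threshold`."""
    labels = []
    for i, item in enumerate(items):
        best_label, best_sim = None, threshold
        for j in range(i):
            s = similarity(item, items[j])
            if s >= best_sim:
                best_label, best_sim = labels[j], s
        if best_label is None:
            best_label = max(labels, default=-1) + 1
        labels.append(best_label)
    return labels

# Hypothetical abstracted sessions
s1 = ["S", "P1", "P2", "C1", "E"]
s2 = ["S", "P1", "C1", "E"]
s3 = ["S", "I6", "I7", "I8", "E"]
alignment_score = needleman_wunsch(s1, s2, exact_match)
labels = nearest_neighbor_clusters([s1, s2, s3], session_similarity, 0.5)
```

With these assumed scores, `s1` and `s2` align with four matches and one gap and fall into the same cluster, while `s3` starts a cluster of its own. Note that this style of nearest neighbor clustering depends on the order in which sessions are processed.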

12 Abstract Web session data
[Fig 2. Hierarchy of JCPenney Web site: JCPenney Homepage; Department level (D1, D2, ..., Dn); Category level (C1, ..., Cn); Item level (I1, ..., In)]
Web session example: D0|C875|I D0|C875|I P27593 P27592 P

13 Alignment of Web Sessions
Compute the similarity between any two Web pages. The higher a level is in the hierarchy, the more important it is in determining the similarity of two Web pages, so it should be given more weight.
- Step 1: compare the two Web page representation strings from left to right and stop at the first pair where they differ.
- Step 2: compute the ratio of the sum of the weights of the matching parts to the sum of the total weights.
Example:
Page 1: D0|C875|I, weight = 14
Page 2: D0|C875, weight = 12
Similarity = 12/14 = 0.857
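The two steps can be sketched in code. The per-level weights did not survive in the transcript, so the values 8, 4, 2 below are an assumption chosen only so that the totals match the slide's 14 and 12; dividing by the larger of the two page weights is likewise an assumption that reproduces 12/14. The item identifier `I999` is a hypothetical placeholder for the truncated one on the slide.

```python
# Assumed weights for department, category, and item level (higher level, more weight).
LEVEL_WEIGHTS = (8, 4, 2)

def page_weight(page):
    """Total weight of a page string such as 'D0|C875'."""
    parts = page.split("|")
    return sum(LEVEL_WEIGHTS[: len(parts)])

def page_similarity(page1, page2):
    parts1, parts2 = page1.split("|"), page2.split("|")
    matched = 0
    # Step 1: compare left to right and stop at the first differing pair.
    for weight, a, b in zip(LEVEL_WEIGHTS, parts1, parts2):
        if a != b:
            break
        matched += weight
    # Step 2: ratio of matching weight to total weight (assumed: the larger page).
    total = max(page_weight(page1), page_weight(page2))
    return matched / total

sim = page_similarity("D0|C875|I999", "D0|C875")
```

Under these assumptions the example on the slide comes out as 12/14. Two pages in different departments score 0, and a page compared with itself scores 1.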

14 Generating Significant Usage Patterns
[Figure: Markov model graph with states 1, 2, 3, 4, 5 and E; surviving edge probabilities: 0.4, 0.2, 0.5, 0.6]
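A Markov model of this kind can be estimated from the sessions of one cluster by counting page-to-page transitions and normalizing each state's outgoing counts. The sketch below assumes a first-order model and uses hypothetical sessions; `transition_probabilities` is an illustrative name, not one taken from the slides.

```python
from collections import Counter, defaultdict

def transition_probabilities(sessions):
    """Estimate first-order transition probabilities from a list of sessions."""
    counts = defaultdict(Counter)
    for session in sessions:
        for src, dst in zip(session, session[1:]):
            counts[src][dst] += 1  # one observed transition src -> dst
    return {
        src: {dst: n / sum(dsts.values()) for dst, n in dsts.items()}
        for src, dsts in counts.items()
    }

# Hypothetical sessions from one cluster, with S and E as start and end markers
cluster_sessions = [
    ["S", "1", "2", "E"],
    ["S", "1", "3", "E"],
    ["S", "2", "E"],
]
model = transition_probabilities(cluster_sessions)
```

The result is a nested dictionary, a sparse stand-in for the transition matrix per cluster: `model["1"]` holds the probabilities of moving from state 1 to each state that follows it in the sessions.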

15 Examples
Threshold > 0.4, end state is 4:
SUP     Normalized probability
S1234   0.45
S12354  0.53
S124    -
S134    0.43
S1354   -
S2354   -
S354    -
Threshold > 0.4, beginning state is 1, end state is 4:
SUP     Normalized probability
1234    0.46
12354   0.56
124     0.5
134     -
1354    0.58
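Extracting SUPs for a given beginning state, end state, and threshold can be sketched as a path enumeration over the Markov model. Two assumptions are made here because the slides do not spell them out: the normalized product of probabilities is taken to be the geometric mean of the transition probabilities along the path, and only loop-free paths are enumerated. The transition probabilities in `model` are hypothetical and are not the ones from the slide's figure.

```python
import math

def significant_usage_patterns(model, start, end, threshold):
    """List loop-free paths from start to end whose normalized probability
    (assumed: geometric mean of edge probabilities) exceeds threshold."""
    if start == end:
        raise ValueError("start and end states must differ")
    found = []

    def walk(state, path, log_prob):
        if state == end:
            edges = len(path) - 1
            score = math.exp(log_prob / edges)  # geometric mean of the edges
            if score > threshold:
                found.append(("-".join(path), round(score, 3)))
            return
        for nxt, p in model.get(state, {}).items():
            if nxt not in path and p > 0:  # skip revisits and impossible moves
                walk(nxt, path + [nxt], log_prob + math.log(p))

    walk(start, [start], 0.0)
    return sorted(found)

# Hypothetical transition probabilities for one cluster
model = {
    "S": {"1": 0.6, "2": 0.4},
    "1": {"2": 0.5, "3": 0.5},
    "2": {"3": 0.2, "4": 0.8},
    "3": {"4": 1.0},
}
sups = significant_usage_patterns(model, "1", "4", threshold=0.5)
```

With this model the paths 1-2-4 and 1-3-4 pass the threshold, while 1-2-3-4 does not. Normalizing by path length keeps longer paths comparable with shorter ones, which a raw product of probabilities would penalize.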

16 Experimental Result
Cluster       Cluster No.  No. of Sessions  Average Session Length  No. of States  Threshold  Beginning Web page  SUPs in BNF Notation
Non-Purchase  1            1746             9.6                     98             0.3        S                   S-{C}-E
Non-Purchase  1            1746             9.6                     98             0.25       P86806              P86806-{C}-E
Non-Purchase  2            241              6.6                     38             0.37       S                   S-{P}-[C]-E
Non-Purchase  2            241              6.6                     38             0.34       P86806              P86806-[I]-{P}-E
Non-Purchase  3            13               3.0                     6              -          S                   S-<C | I>-{P}-E
Non-Purchase  3            13               3.0                     6              0.2        P86806              P86806-[{P}-[P86806]]-E
Purchase      -            1858             14.9                    55             0.47       S                   S-[C]-[I]-{P}-E
Purchase      -            1858             14.9                    55             0.51       P86806              -
Purchase      -            132              39.1                    100            0.457      S                   S-[{{C}|{I}}]-{P}-E
Purchase      -            132              39.1                    100            0.434      P86806              P86806-[{C}]-{P}-E
Purchase      -            10               31.6                    47             0.52       S                   S-{P}-[{I}]-[{P}]-{C}-E
Purchase      -            10               31.6                    47             0.43       P86806              P86806-[I]-[{P}]-{C}-E

17 Future/Ongoing Research
- Scalability
  - Fewer patterns
  - Smaller patterns
  - MM less space than table
- Clusters to identify Behaviors
  - Business vs Leisure
  - Cloaked Crawler
- Online Identification of Cluster

