A Web Usage Mining Framework for Mining Evolving User Profiles in Dynamic Web Sites Nasraoui, Soliman, Saka, Badia, Germain IEEE Transactions on Knowledge.


1 A Web Usage Mining Framework for Mining Evolving User Profiles in Dynamic Web Sites
Nasraoui, Soliman, Saka, Badia, Germain. IEEE Transactions on Knowledge and Data Engineering, 2008
Ozer Ozdikis, Huseyin Candan

2
• Extraction of user profiles using
  ◦ Web usage data
  ◦ Web site hierarchy
  ◦ External data, etc.
• Evolution of user profiles over time
  ◦ Introducing new profiles, killing invalid ones
  ◦ Validation of the profile evolution

3
1. Introduction
2. Web Usage Mining
  ◦ Handling Profile Evolution
  ◦ Integrating Semantics
3. Profile Discovery based on Usage
  ◦ Preprocessing
  ◦ Clustering sessions
  ◦ Similarity measure in clustering
  ◦ Post processing and Enrichment
4. Profile Evolution
  ◦ Tracking the Evolving User Profiles
  ◦ Validating the Profile Evolution
  ◦ Experimental Results

5 Key features of the paper
• Dynamic content (a portal for companies)
• Clustering of user sessions extracted from web logs into homogeneous groups of similar activities
• Session similarity is calculated from the navigated URLs and the website hierarchy (from the URL structure and the site taxonomy)
• Generate mass user profiles
• Repeat this generation periodically
• Track the changes between the previous and the new profiles, and evaluate their evolution

7 Web Usage Mining stages:
1. Collect usage data/clickstreams
2. Preprocess (reformat, filter irrelevant data)
3. Analyze and discover interesting patterns
4. Evaluate discovered profiles
5. Track the evolution of profiles
• Web Usage Mining has been used for personalization, predicting navigation patterns, building data cubes for OLAP, etc.

8
• Previous studies related to profile evolution
  ◦ Machine learning based (another dimension for learning evolving concepts)
  ◦ Time-based forgetting approaches
  ◦ Separate user profiles for short-term and long-term interests

9
• Some concepts related to profile evolution
  ◦ Evolutionary / Revolutionary / Hybrid learning, regarding the adaptation to change
  ◦ No-memory / Partial-memory / Full-memory
  ◦ Supervised / Unsupervised
  ◦ Single user / Mass user

10
• For user modeling, web usage data can be supplemented with
  ◦ Keywords representing web page content
  ◦ The website's hierarchical structure (pages that differ but are semantically related, e.g. under the same group)
  ◦ Semantic enrichment of navigated URLs (semantically enhanced web logs -> C-Logs)
  ◦ A taxonomy, which can be “defined explicitly” or “inferred implicitly” via URL tokenization (http://a/b/c.htm)

12
• Preprocess the web log to identify sessions and produce their vector representations
• Produce profiles using H-UNC (Hierarchical Unsupervised Niche Clustering), a genetic algorithm (GA) approach
• Enrich profiles with additional facets (external knowledge)
• Track profile evolution, and measure the validity of the discovered profiles

14 Session Identification
• Sessions are extracted using only the web log files (no login data, no cookies)
• Access time, IP address, URL viewed, and referrer are used for session identification
Session Representation
• Each valid URL in the web site is given a unique number j ∈ {1, 2, …, N_u}
• Each session is represented as a binary vector of size N_u. Navigation order is not considered.
• Example (number of valid URLs = 4): 1001 -> the user accessed URLs 1 and 4
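The binary session representation can be sketched in a few lines of Python. The URL names and the `url_index` mapping are hypothetical and chosen only for illustration; positions are 0-based in the code, whereas the slide numbers URLs from 1.

```python
def session_vector(visited_urls, url_index):
    """Return a binary list of length N_u with a 1 for each visited valid URL."""
    vec = [0] * len(url_index)
    for url in visited_urls:
        j = url_index.get(url)
        if j is not None:  # URLs outside the valid set are ignored
            vec[j] = 1     # repeat visits and visit order leave no trace
    return vec


# Hypothetical site with N_u = 4 valid URLs.
url_index = {"/home": 0, "/products": 1, "/about": 2, "/contact": 3}
vec = session_vector(["/contact", "/home", "/home"], url_index)
print("".join(str(bit) for bit in vec))
```

Visiting the first and fourth URL, in any order and any number of times, gives the same vector as the 1001 example on the slide.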

16 Unsupervised Niche Clustering (UNC)
• The goal is to find
  ◦ profiles p_i (sets of URLs representing session clusters) and
  ◦ scales σ_i (variance/dispersion of the sessions in a cluster around the cluster's representative profile)
• w_ij: robust weight of a profile p_i on a session s_j. If this value is large (i.e. the profile is “close” to the session), p_i is a strong representative of s_j, which has a positive effect on the fitness value of p_i.

17
• Randomly select N_p sessions as the initial p_i's
• Initialize the variable to some small value
• Repeat:
  ◦ Calculate the distance(!) d_ij between every profile p_i and every session s_j
  ◦ Calculate the robust weight w_ij for every profile p_i and every session s_j
  ◦ Calculate the scale σ_i for every profile p_i
  ◦ Calculate the fitness f_i for every profile
  ◦ Repeat (GA loop):
    ▪ Randomly select parent profiles
    ▪ Generate child profiles (through crossover and mutation)
    ▪ Calculate the fitness values of the child profiles
    ▪ Apply deterministic crowding as the replacement policy
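The slide does not give the formulas for the weight, scale, and fitness steps. The sketch below assumes a Gaussian-style robust weight, a weighted-average scale update, and a density-style fitness; these specific forms are assumptions for illustration, not taken from the slides. The GA loop itself (crossover, mutation, deterministic crowding) is not shown.

```python
import math


def robust_weights(dists, sigma_sq):
    """Assumed form: w_ij = exp(-d_ij / (2 * sigma_i^2)); close sessions get weights near 1."""
    return [math.exp(-d / (2.0 * sigma_sq)) for d in dists]


def updated_scale(dists, weights):
    """Assumed form: sigma_i^2 as the weighted average distance of the sessions."""
    return sum(w * d for w, d in zip(weights, dists)) / sum(weights)


def fitness(weights, sigma_sq):
    """Assumed form: f_i = sum_j w_ij / sigma_i^2, high for dense, tight niches."""
    return sum(weights) / sigma_sq


# Distances of one profile to three sessions (illustrative values).
dists = [0.0, 0.25, 1.0]
sigma_sq = 0.25
w = robust_weights(dists, sigma_sq)
print(w, updated_scale(dists, w), fitness(w, sigma_sq))
```

Under these assumptions a far-away session contributes little to either the scale or the fitness, which is what makes the weights "robust" to noisy sessions.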

18 Hierarchical Unsupervised Niche Clustering
• A divisive hierarchical version of UNC
• Repeatedly divide clusters into smaller clusters, taking into account
  ◦ the required hierarchy level (L_max)
  ◦ the maximum allowed cluster cardinality (N_split)
  ◦ the maximum allowed scale (σ_split)
• The result is a set of profile vectors and their scales. Sessions are assigned to the closest profiles.

20
• Cosine similarity:
• Web session similarity (web site structure):
• S_u(l,k): URL-to-URL similarity*
• Distance used in UNC is then d = (1 − S_web)^2
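The formulas on this slide are not preserved in the transcript. As a minimal sketch, the standard cosine similarity between two binary session vectors and the distance d = (1 − S)^2 can be computed as follows. The structure-aware similarity S_web, which also uses the URL-to-URL similarity S_u, is not reproduced here; plain cosine similarity stands in for it as an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two 0/1 session vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    # For 0/1 vectors, the squared norm equals the number of ones.
    norm = math.sqrt(sum(a) * sum(b))
    return dot / norm if norm else 0.0


def unc_distance(similarity):
    """Distance used in UNC, as stated on the slide: d = (1 - S)^2."""
    return (1.0 - similarity) ** 2


s = cosine_similarity([1, 0, 0, 1], [1, 1, 0, 0])
print(s, unc_distance(s))
```

The two sessions share one of their two URLs each, so the similarity is 0.5 and the distance 0.25.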

21
• URL structure (tokenized URL paths P): http://a/b/c.html
• For dynamic content, relations with an externally defined taxonomy (“is-a” relation): http://products.php?id=1&category=x

22
• For dynamic content (dynamic URLs), preprocess the data and map each dynamic URL to a string of tokens separated by “/” using an ontology.
• Given such a table (taxonomy data), a hierarchical structure can be defined even for dynamic URLs.
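The mapping from dynamic URLs to "/"-separated paths might look like the sketch below. The `TAXONOMY` table and its category paths are hypothetical; the `category` and `id` parameter names follow the example URL on the previous slide. The shared-prefix length is shown only as one possible input to a URL-to-URL similarity, not as the paper's definition of S_u.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical taxonomy table: category value -> "is-a" path in the site hierarchy.
TAXONOMY = {"x": "products/hardware", "y": "products/software"}


def to_path_tokens(url):
    """Map a static or dynamic URL to a list of hierarchy tokens."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    category = params.get("category", [None])[0]
    if category in TAXONOMY:
        item = params.get("id", ["unknown"])[0]
        return TAXONOMY[category].split("/") + [item]
    # Static URL: tokenize the path itself.
    return [token for token in parsed.path.split("/") if token]


def shared_prefix_len(p, q):
    """Number of leading tokens two paths have in common."""
    n = 0
    for a, b in zip(p, q):
        if a != b:
            break
        n += 1
    return n


p1 = to_path_tokens("/products.php?id=1&category=x")
p2 = to_path_tokens("/products.php?id=2&category=x")
print(p1, p2, shared_prefix_len(p1, p2))
```

Two product pages in the same category end up as siblings in the hierarchy even though their raw URLs differ only in the query string.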

24
• After H-UNC, we have clusters of sessions.
• Summarize the sessions in each cluster as a profile vector p_i, where p_ik is the relative frequency with which URL k was accessed in the sessions belonging to cluster i. Example:
  ◦ For cluster 1, let s_11 = 1001, s_12 = 1100, s_13 = 1001
  ◦ Then p_1 = (1, 0.33, 0, 0.67)
• Convert the p_i's to binary vectors so that only URLs with some minimum weight remain. Example:
  ◦ Let the minimum URL weight be 0.5
  ◦ Then p_1 = 1001
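The worked example on this slide can be reproduced with a short sketch. The function names are illustrative, and treating a weight equal to the minimum as kept (`>=`) is an assumption.

```python
def profile_vector(sessions):
    """Relative frequency of each URL across the sessions of one cluster."""
    n = len(sessions)
    return [sum(column) / n for column in zip(*sessions)]


def binarize(profile, min_weight):
    """Keep only URLs whose weight reaches the minimum URL weight."""
    return [1 if weight >= min_weight else 0 for weight in profile]


cluster_1 = [[1, 0, 0, 1], [1, 1, 0, 0], [1, 0, 0, 1]]
p1 = profile_vector(cluster_1)
print([round(weight, 2) for weight in p1])
print(binarize(p1, 0.5))
```

URL 2 appears in only one of the three sessions, so it falls below the 0.5 threshold and drops out of the binary profile.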

25 Extend for Robust Profiles
• Calculate the weights w_ij for all sessions in a cluster (between profile i and session j), as in UNC
• Assign sessions with high (robust) weights to the cluster's “core”.
• So a cluster's “core” is the group of sessions that are very similar to the representative profile.
• Thus, noisy sessions are eliminated.

26
• Enrich the profiles with facets (additional profile descriptors) such as:
  ◦ Search queries
  ◦ Inquiring companies
  ◦ Inquired companies
  using IP addresses, whois.com, a registration database, etc. for the sessions in the cluster

28
• Profile boundaries: the p_i vectors and σ_i (scale/variance/dispersion) values determine the boundaries
• Profile compatibility: how much the boundaries of two profiles overlap
• The “TrackProfiles” algorithm. The idea is:
  ◦ Divide time into periods, and generate profiles for each period.
  ◦ Compare the profile vectors of consecutive time periods T_i and T_i+1 using S_web
  ◦ If the distance (i.e. 1 − similarity) is < σ_profile1, then the two profiles found in T_i and T_i+1 are related.

29
• Birth: a new profile is incompatible with all old profiles
• Persistence: a new profile is compatible with an old profile
  ◦ One-to-one
  ◦ Bifurcation (splitting)
  ◦ Merging
• Death: no new profile is found for an old profile
• Atavism (reappearance): an old profile disappears, then reappears
• Volatile: dead profiles that were never persistent
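A minimal sketch of the tracking idea and of the birth, persistence, and death categories follows. It assumes profiles are given as URL sets with a scale, and it uses Jaccard overlap as a stand-in for S_web; both choices, and all names in the code, are assumptions for illustration. Atavism and volatility need a history over more than two periods and are not covered.

```python
def jaccard(a, b):
    """Stand-in similarity between two profiles given as URL sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def track_profiles(old, new, similarity=jaccard):
    """old, new: dict of profile name -> (url_set, sigma).

    A new profile is related to an old one when 1 - similarity < sigma of the old profile.
    """
    links = {name: [] for name in old}
    births = []
    for new_name, (new_urls, _) in new.items():
        matched = False
        for old_name, (old_urls, old_sigma) in old.items():
            if 1.0 - similarity(old_urls, new_urls) < old_sigma:
                links[old_name].append(new_name)
                matched = True
        if not matched:
            births.append(new_name)
    deaths = [name for name, matches in links.items() if not matches]
    persistent = {name: matches for name, matches in links.items() if matches}
    return births, persistent, deaths


old = {"A": ({"/a", "/b"}, 0.5), "B": ({"/x"}, 0.3)}
new = {"A2": ({"/a", "/b", "/c"}, 0.5), "N": ({"/q"}, 0.4)}
print(track_profiles(old, new))
```

In the returned mapping, an old profile linked to several new profiles corresponds to a split, and a new profile appearing under several old profiles corresponds to a merge.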

30
• Example of a profile merge

32
• How close are the profiles to the original input data?
  ◦ Precision: a profile with high precision should include “only” the true items
  ◦ Coverage (recall): a profile with high coverage should include all the data items
• Example: let session = 1001; then the profiles
  ◦ 1000, 0001: high precision, low coverage
  ◦ 1101, 1011: low precision, high coverage
  ◦ 1001: high precision, high coverage (ideal but unrealistic case -> every session would have to be a profile)

33
• So we need to balance precision and coverage using a small number of profiles to get a high quality Q_ij for session j and profile i.
• Define
  ◦ Precision: Prec_ij = |s_j ∩ p_i| / |p_i|
  ◦ Coverage: Cov_ij = |s_j ∩ p_i| / |s_j|
• A combined quality measure is defined as Q_ij = F_1,ij = 2 · Prec_ij · Cov_ij / (Prec_ij + Cov_ij)
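The three measures can be checked against the session 1001 example with a short sketch. Sessions and profiles are taken as binary lists; returning zeros when there is no overlap is an assumption made to avoid division by zero.

```python
def quality(session, profile):
    """Return (precision, coverage, F1) of a binary profile for a binary session."""
    common = sum(s & p for s, p in zip(session, profile))
    if common == 0:
        return 0.0, 0.0, 0.0
    prec = common / sum(profile)  # |s ∩ p| / |p|
    cov = common / sum(session)   # |s ∩ p| / |s|
    return prec, cov, 2 * prec * cov / (prec + cov)


session = [1, 0, 0, 1]
for profile in ([1, 0, 0, 0], [1, 1, 0, 1], [1, 0, 0, 1]):
    print(profile, quality(session, profile))
```

Profile 1000 has full precision but covers only half of the session, profile 1101 covers the whole session at lower precision, and profile 1001 scores 1 on all three measures.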

34
• So far, we have defined the quality measure between a profile and a session.
• Now, how do we capture concept drift?
• The idea is:
  ◦ Choose a minimum quality threshold Q_min to be satisfied
  ◦ Discover the profiles at time period T_2
  ◦ Take the sessions at the next time period T_1, and for each session s_j find the maximum quality Q_ij over the profiles from the previous time period
  ◦ If this quality is higher than Q_min, add the session s_j to the set of quality sessions, denoted s*(T_1, T_2)

35
• As a result, we can measure quality as the fraction of the sessions at T_1 that belong to s*(T_1, T_2).
• As long as most of the sessions at T_1 are successfully represented by profiles found at T_2, this rate will be high.
• If the minimum quality threshold Q_min is low, the rate will be high; the best case is 1. As Q_min increases, the number of sessions satisfying it decreases.
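The cross-period validation rate can be sketched as the share of sessions whose best-matching profile exceeds Q_min. The function names and the sample data are illustrative assumptions; the F1 quality is the one defined two slides earlier.

```python
def f1(session, profile):
    """F1 quality between a binary session and a binary profile."""
    common = sum(s & p for s, p in zip(session, profile))
    if common == 0:
        return 0.0
    prec = common / sum(profile)
    cov = common / sum(session)
    return 2 * prec * cov / (prec + cov)


def validation_rate(sessions, profiles, q_min):
    """Fraction of sessions whose best profile quality is higher than q_min."""
    good = sum(
        1 for s in sessions
        if max(f1(s, p) for p in profiles) > q_min
    )
    return good / len(sessions)


profiles_t2 = [[1, 0, 0, 1]]                                # profiles mined earlier
sessions_t1 = [[1, 0, 0, 1], [1, 0, 0, 0], [0, 1, 1, 0]]    # sessions seen later
for q_min in (0.5, 0.9):
    print(q_min, validation_rate(sessions_t1, profiles_t2, q_min))
```

Raising Q_min from 0.5 to 0.9 drops the partially matched session, so the rate falls from two thirds to one third, which matches the downward-sloping curves described in the following slides.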

36
• Example graph showing
  ◦ the sessions satisfying the minimum quality (y-axis)
  ◦ the minimum F1 quality (x-axis)
  ◦ Dark line (lower): the cross-period validation
  ◦ Light line (upper): the validation of the profiles against the sessions in the same time frame

39
• The authors generate profiles with facets (such as search queries, inquired companies, inquiring companies, etc.)

40
• Profiles are generated from the first half of September
• Light lines compare the profiles with the sessions in the same time frame, i.e. the first half of September (they are identical in all graphs)
• Dark lines compare the same profiles with the sessions in the following time frames (cross-validations).

41
• In this experiment, the profiles from period T are validated against the sessions in the immediately following period T+1
• For Figure 1:
  ◦ Profiles are generated using the sessions in the first half of September
  ◦ The light line shows the validation using the sessions in the same time period
  ◦ The dark line shows the validation using the sessions in the following time period, i.e. the second half of September

42
• There is one more experiment; its only difference from the previous one is that it uses a shorter time period (1 week) for the observations. The idea is the same.

43
• The work presented in this paper is an unsupervised learning approach that learns mass anonymous user profiles
• The profiles are mined in a no-memory, revolutionary scheme.
• The evolving profiles are validated in a full-memory mode.

44
• In the paper, facets are used to support the profiles with additional information, but it is not explained how they are used (e.g. the most searched companies, etc.).

