Published by Esther McDaniel. Modified over 8 years ago.
1
A Web Usage Mining Framework for Mining Evolving User Profiles in Dynamic Web Sites
Nasraoui, Soliman, Saka, Badia, Germain
IEEE Transactions on Knowledge and Data Engineering, 2008
Ozer Ozdikis, Huseyin Candan
2
Extraction of user profiles using
◦ Web usage data
◦ Web site hierarchy
◦ External data, etc.
Evolution of user profiles over time
◦ Introducing new profiles, killing invalid ones, etc.
◦ Validation of the profile evolution
3
1. Introduction
2. Web Usage Mining
◦ Handling Profile Evolution
◦ Integrating Semantics
3. Profile Discovery based on Usage
◦ Preprocessing
◦ Clustering sessions
◦ Similarity measure in clustering
◦ Post processing and Enrichment
4. Profile Evolution
◦ Tracking the Evolving User Profiles
◦ Validating the Profile Evolution
◦ Experimental Results
5
Key features of the paper
◦ Dynamic content (a portal for companies)
◦ Clustering of user sessions extracted from web logs into homogeneous groups of similar activities
◦ Session similarity is calculated using the navigated URLs and the website hierarchy (from URL and site taxonomy)
◦ Generate mass user profiles
◦ Repeat this generation periodically
◦ Track the changes between the previous profiles and the new profiles, and evaluate their evolution
7
Web Usage Mining stages:
1. Collect usage data/clickstreams
2. Preprocess (reformat, filter irrelevant data)
3. Analyze and discover interesting patterns
4. Evaluate discovered profiles
5. Track the evolution of profiles
Web Usage Mining has been used for personalization, predicting navigation patterns, building data cubes for OLAP, etc.
8
Previous studies related to profile evolution
◦ Machine learning based (another dimension for learning evolving concepts)
◦ Time-based forgetting approaches
◦ Separate user profiles for short-term and long-term interests
9
Some concepts related to profile evolution
◦ Evolutionary / Revolutionary / Hybrid learning, regarding the adaptation to change
◦ No-memory / Partial-memory / Full-memory
◦ Supervised / Unsupervised
◦ Single user / Mass user
10
For user modeling, web usage data can be supplemented with
◦ Keywords representing web page content
◦ The website's hierarchical structure (pages that are different but semantically related, e.g. under the same group)
◦ Semantic enrichment of navigated URLs (semantically enhanced web logs -> C-Logs)
◦ A taxonomy, which can be "defined explicitly" or "inferred implicitly" via URL tokenization (http://a/b/c.htm)
12
◦ Preprocess the web log to identify sessions and produce their vector representations
◦ Produce profiles using H-UNC (Hierarchical Unsupervised Niche Clustering) -> a genetic algorithm (GA) approach
◦ Enrich profiles with additional facets (external knowledge)
◦ Track profile evolution, and validate the discovered profiles
14
Session Identification
◦ Sessions are extracted using just the web log files (no login data, no cookies)
◦ Access time, IP address, URL viewed, and REFERRER are used for session identification
Session Representation
◦ Each valid URL in the web site is given a unique number j ∈ {1, 2, …, N_u}
◦ Each session is represented as a binary vector of size N_u. Navigation order is not considered.
◦ Example (number of valid URLs = 4): 1001 -> user accessed URLs 1 and 4
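A minimal Python sketch of the binary session representation described on this slide. The URL-to-index table and the sample URLs are hypothetical, and the code uses 0-based indices where the slide numbers URLs from 1.

```python
def session_to_vector(visited_urls, url_index):
    """Return a binary list of length N_u; position j is 1 if URL j was visited."""
    vector = [0] * len(url_index)
    for url in visited_urls:
        j = url_index.get(url)
        if j is not None:  # skip URLs that are not valid pages of the site
            vector[j] = 1
    return vector


# Hypothetical site with four valid URLs.
url_index = {"/a.htm": 0, "/b.htm": 1, "/c.htm": 2, "/d.htm": 3}

# Visit order and repeated visits do not matter in this representation.
vec = session_to_vector(["/d.htm", "/a.htm", "/a.htm"], url_index)
print("".join(str(bit) for bit in vec))  # 1001
```

Because order and repetition are discarded, two sessions that touch the same pages in different orders map to the same vector.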
16
Unsupervised Niche Clustering (UNC)
The goal is to find
◦ profiles p_i (a set of URLs representing session clusters) and
◦ scales σ_i (variance/dispersion of sessions in a cluster around the cluster's representative profile)
w_ij: robust weight of a profile p_i on a session item s_j. If this value is large (i.e. a profile is "close" to a session), p_i is a strong representative of s_j, which has a positive effect on the fitness value of p_i.
17
Randomly select N_p sessions as initial p_i's
Initialize the variable to some small value
Repeat:
◦ Calculate distance(!) d_ij between every profile p_i and every session s_j
◦ Calculate robust weight w_ij for every profile p_i and every session s_j
◦ Calculate scale σ_i for every profile p_i
◦ Calculate fitness f_i for every profile
◦ Repeat (GA loop):
    ◦ Randomly select parent profiles
    ◦ Generate child profiles (through crossover and mutation)
    ◦ Calculate fitness values of the child profiles
    ◦ Apply deterministic crowding for replacement policy
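The slides name the per-profile quantities (robust weight, scale, fitness) but do not give their formulas. The sketch below shows how one pass over a single profile might look, assuming a Gaussian-style weight, a weighted-mean scale update, and a density-like fitness. All three formulas are assumptions made for illustration, not taken from the slides.

```python
import math


def robust_weights(sq_dists, sigma2):
    """Assumed Gaussian-style weight: w_ij = exp(-d_ij / (2 * sigma_i^2))."""
    return [math.exp(-d / (2.0 * sigma2)) for d in sq_dists]


def update_scale(sq_dists, weights):
    """Assumed scale update: weighted mean of the distances to the profile."""
    return sum(w * d for w, d in zip(weights, sq_dists)) / sum(weights)


def fitness(weights, sigma2):
    """Assumed fitness: total robust weight divided by the scale."""
    return sum(weights) / sigma2


# Hypothetical distances d_ij from one profile to four sessions.
sq_dists = [0.0, 0.01, 0.04, 0.81]
sigma2 = 0.05  # hypothetical current scale of the profile

w = robust_weights(sq_dists, sigma2)
print(w)                          # far-away sessions get weights near 0
print(update_scale(sq_dists, w))  # new scale estimate
print(fitness(w, sigma2))         # higher when many sessions are close
```

Under these assumptions, a profile surrounded by many close sessions gets both large weights and a small scale, which is what makes it a fit "niche" in the GA loop.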
18
Hierarchical Unsupervised Niche Clustering (H-UNC)
A divisive hierarchical version of UNC
Repeatedly divide clusters into smaller clusters, hierarchically, considering
◦ the required hierarchy level (L_max)
◦ the maximum allowed cluster cardinality (N_split)
◦ the maximum allowed scale (σ_split)
As a result, we have profile vectors and their scales. Sessions are assigned to the closest profiles.
20
Similarity measures used in clustering:
◦ Cosine similarity between session vectors
◦ Web session similarity, which takes the web site structure into account
◦ S_u(l,k): URL-to-URL similarity*
Distance used in UNC is then d = (1 - S_web)²
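A small Python sketch of the two pieces of this slide that can be stated without the missing formulas: cosine similarity between binary session vectors, and the distance d = (1 - S)². Plain cosine similarity stands in for S_web here. That substitution is an assumption, since S_web also uses the site structure.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length session vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm else 0.0


def unc_distance(similarity):
    """Distance used in UNC: d = (1 - S)^2."""
    return (1.0 - similarity) ** 2


s = cosine_similarity([1, 0, 0, 1], [1, 1, 0, 0])
print(s)                # 0.5 (one shared URL, two URLs in each session)
print(unc_distance(s))  # 0.25
```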
21
URL structure (tokenized URL paths P)
◦ http://a/b/c.html
For dynamic content, relations with an externally defined taxonomy ("is-a" relation)
◦ http://products.php?id=1&category=x
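To illustrate how tokenized URL paths can yield a URL-to-URL similarity, here is a hedged Python sketch. The measure (shared leading path tokens divided by the longer path length) is a hypothetical stand-in chosen for simplicity. It is not the S_u formula from the paper, which the slides do not reproduce.

```python
from urllib.parse import urlparse


def path_tokens(url):
    """Split the path of a URL into its non-empty components."""
    return [token for token in urlparse(url).path.split("/") if token]


def url_similarity(url_a, url_b):
    """Hypothetical measure: shared leading tokens / length of the longer path."""
    a, b = path_tokens(url_a), path_tokens(url_b)
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    longest = max(len(a), len(b))
    return shared / longest if longest else 1.0


print(path_tokens("http://a/b/c.html"))                          # ['b', 'c.html']
print(url_similarity("http://a/b/c.html", "http://a/b/d.html"))  # 0.5
print(url_similarity("http://a/b/c.html", "http://a/x/y.html"))  # 0.0
```

Pages under the same directory share a prefix and score higher, which is the intuition behind treating the URL hierarchy as an implicit taxonomy.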
22
For dynamic content (dynamic URLs), preprocess the data and map the dynamic URLs to strings separated by "/" using an ontology. If we have such a table (taxonomy data), we can define a hierarchical structure even for the dynamic URLs.
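The mapping step could look like the following Python sketch. The taxonomy table, the `category` parameter as lookup key, and the example.com URL are all hypothetical. The slides only say that such a table maps dynamic URLs to "/"-separated strings.

```python
from urllib.parse import parse_qs, urlparse

# Hypothetical taxonomy table: category value -> "/"-separated "is-a" path.
TAXONOMY = {
    "x": "products/hardware/x",
    "y": "products/software/y",
}


def dynamic_url_to_path(url, taxonomy=TAXONOMY):
    """Return the taxonomy path for a dynamic URL, or None if it is not covered."""
    query = parse_qs(urlparse(url).query)
    categories = query.get("category", [])
    if not categories:
        return None
    return taxonomy.get(categories[0])


print(dynamic_url_to_path("http://example.com/products.php?id=1&category=x"))
# products/hardware/x
```

Once every dynamic URL has such a path string, it can be tokenized on "/" exactly like a static URL.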
24
After H-UNC, we have clusters of sessions.
Summarize the sessions in each cluster as a profile vector p_i, where p_ik is the frequency with which URL k was accessed in sessions belonging to cluster i.
Example:
◦ For cluster 1, let s_11 = 1001, s_12 = 1100, s_13 = 1001
◦ Then p_1 = (1, 0.33, 0, 0.67)
Convert the p_i's to binary vectors so that only URLs with some minimum weight remain.
Example:
◦ Let the minimum URL weight be 0.5,
◦ then p_1 = 1001
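The slide's worked example can be reproduced with a few lines of Python. The function names are hypothetical, and treating the minimum weight as an inclusive threshold (`>=`) is an assumption.

```python
def summarize_cluster(sessions):
    """Profile vector: per-URL access frequency over the sessions of a cluster."""
    n = len(sessions)
    return [sum(column) / n for column in zip(*sessions)]


def binarize(profile, min_weight):
    """Keep only URLs whose weight reaches the minimum (inclusive, by assumption)."""
    return [1 if weight >= min_weight else 0 for weight in profile]


cluster_1 = [
    [1, 0, 0, 1],  # s_11
    [1, 1, 0, 0],  # s_12
    [1, 0, 0, 1],  # s_13
]
p_1 = summarize_cluster(cluster_1)
print([round(weight, 2) for weight in p_1])  # [1.0, 0.33, 0.0, 0.67]
print(binarize(p_1, 0.5))                    # [1, 0, 0, 1]
```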
25
Extend for Robust Profiles
◦ Calculate weights w_ij for all sessions in a cluster (between profile i and session j), as in UNC
◦ Assign sessions with high weights (robust weights) to the cluster's "core"
So a cluster's "core" is the group of sessions that are very similar to the representative profile. Thus, noisy sessions are eliminated.
26
Enrich the profiles with facets (additional profile descriptors) like:
◦ Search queries
◦ Inquiring companies
◦ Inquired companies
using IP addresses, whois.com, registration database, etc. for the sessions in the cluster
28
Profile boundaries: p_i vectors and σ_i (scale/variance/dispersion) values are used to determine the boundaries
Profile compatibility: how much the boundaries of two profiles overlap
Algorithm to "TrackProfiles". The idea is:
◦ Divide time into periods, and generate profiles for each period.
◦ Compare the profile vectors of consecutive time periods T_i and T_i+1 using S_web
◦ If the distance (i.e. 1 - similarity) is < σ_profile1, then the two profiles found in T_i and T_i+1 are related.
29
Birth: new profile incompatible with old profiles
Persistence: new profile compatible with an old profile
◦ One-to-one
◦ Bifurcation (splitting)
◦ Merging
Death: no new profile found for an old profile
Atavism (reappearance): old profile disappears, then reappears
Volatile: dead profiles that have never been persistent
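A rough Python sketch of how births, deaths, and persistence could be detected between two consecutive periods. It assumes that profiles are binary vectors with a scale, that distance is 1 minus cosine similarity (standing in for 1 - S_web), and that the scale of the old profile is the threshold. All names and sample data are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; a stand-in for 1 - S_web."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return 1.0 - (dot / norm if norm else 0.0)


def track_profiles(old, new, distance):
    """old/new map a profile name to (vector, sigma). Returns detected events."""
    links = [
        (old_name, new_name)
        for old_name, (old_vec, old_sigma) in old.items()
        for new_name, (new_vec, _) in new.items()
        if distance(old_vec, new_vec) < old_sigma  # compatible profiles
    ]
    linked_old = {old_name for old_name, _ in links}
    linked_new = {new_name for _, new_name in links}
    return {
        "persistence": links,
        "birth": [name for name in new if name not in linked_new],
        "death": [name for name in old if name not in linked_old],
    }


old = {"A": ([1, 1, 0, 0, 0], 0.3), "B": ([0, 0, 1, 1, 0], 0.3)}
new = {"A2": ([1, 1, 0, 0, 0], 0.3), "C": ([0, 0, 0, 0, 1], 0.3)}
print(track_profiles(old, new, cosine_distance))
# {'persistence': [('A', 'A2')], 'birth': ['C'], 'death': ['B']}
```

In this representation, one old profile linked to several new ones would correspond to a bifurcation, and several old profiles linked to one new profile to a merge. Atavism and volatility need the history of more than two periods, which the sketch does not keep.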
30
[Figure: example of a profile merge]
32
How close are the profiles to the original input data?
◦ Precision: a profile with high precision should include "only" the true items
◦ Coverage (Recall): a profile with high coverage should include all data items
Example: let session = 1001, then profiles
◦ 1000, 0001: high precision, low coverage
◦ 1101, 1011: low precision, high coverage
◦ 1001: high precision, high coverage (ideal but unrealistic case -> every session must be a profile)
33
So we need to balance precision and coverage with some small number of profiles to get a high quality Q_ij for session j and profile i. Define
◦ Precision: Prec_ij = |s_j ∩ p_i| / |p_i|
◦ Coverage: Cov_ij = |s_j ∩ p_i| / |s_j|
A combined measure for quality is defined as
Q_ij = F1_ij = 2 * Prec_ij * Cov_ij / (Prec_ij + Cov_ij)
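The three measures translate directly into Python for binary session and profile vectors. The function name is hypothetical. Returning zeros when there is no overlap avoids a division by zero and is a convention chosen for this sketch.

```python
def quality(session, profile):
    """Return (precision, coverage, F1) for binary session and profile vectors."""
    common = sum(1 for s, p in zip(session, profile) if s and p)  # |s_j ∩ p_i|
    if common == 0:
        return 0.0, 0.0, 0.0
    prec = common / sum(profile)  # |s_j ∩ p_i| / |p_i|
    cov = common / sum(session)   # |s_j ∩ p_i| / |s_j|
    f1 = 2 * prec * cov / (prec + cov)
    return prec, cov, f1


session = [1, 0, 0, 1]
for profile in ([1, 0, 0, 0], [1, 1, 0, 1], [1, 0, 0, 1]):
    prec, cov, f1 = quality(session, profile)
    print(profile, round(prec, 2), round(cov, 2), round(f1, 2))
# [1, 0, 0, 0] 1.0 0.5 0.67
# [1, 1, 0, 1] 0.67 1.0 0.8
# [1, 0, 0, 1] 1.0 1.0 1.0
```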
34
So, we defined the quality measure between a profile and a session. Now, how do we capture concept drift? The idea is:
◦ Decide on a minimum quality threshold Q_min to be satisfied
◦ Discover the profiles at time period T_2
◦ Take the sessions at the next time period T_1, and for each session s_j find the maximum quality Q_ij over the profiles from the previous time frame
◦ If the quality is higher than Q_min, add this session s_j to our set of quality sessions, denoted by s*(T_1, T_2)
35
As a result, we can measure quality as the rate of sessions at T_1 that belong to s*(T_1, T_2).
◦ As long as most of the sessions at T_1 are successfully represented by profiles found at T_2, this rate will be high.
◦ If the minimum quality threshold Q_min is set low, the rate will be high. The best case is 1.
◦ If Q_min is increased, the number of sessions satisfying this quality decreases.
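A minimal Python sketch of the cross-period validation, assuming that the rate is the fraction of T_1 sessions whose best F1 quality against the T_2 profiles reaches Q_min. Function names and sample data are hypothetical, and treating "higher than Q_min" as "at least Q_min" is an assumption.

```python
def f1_quality(session, profile):
    """F1 of precision and coverage for binary vectors (0.0 if no overlap)."""
    common = sum(1 for s, p in zip(session, profile) if s and p)
    if common == 0:
        return 0.0
    prec = common / sum(profile)
    cov = common / sum(session)
    return 2 * prec * cov / (prec + cov)


def validation_rate(sessions_t1, profiles_t2, q_min):
    """Fraction of T_1 sessions whose best-matching T_2 profile reaches q_min."""
    if not sessions_t1:
        return 0.0
    good = sum(
        1
        for session in sessions_t1
        if max((f1_quality(session, p) for p in profiles_t2), default=0.0) >= q_min
    )
    return good / len(sessions_t1)


profiles_t2 = [[1, 0, 0, 1], [0, 1, 1, 0]]
sessions_t1 = [[1, 0, 0, 1], [0, 1, 0, 0], [1, 1, 1, 1]]
print(validation_rate(sessions_t1, profiles_t2, 0.5))  # 1.0
print(validation_rate(sessions_t1, profiles_t2, 0.7))  # 0.3333333333333333
```

Raising the threshold from 0.5 to 0.7 drops two of the three sessions, matching the slide's point that the rate falls as Q_min grows.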
36
Example graph showing
◦ the sessions satisfying the minimum quality (y-axis)
◦ minimal F1 quality (x-axis)
◦ Dark line (lower) is the cross-period validation.
◦ Light line (upper) is the validation of profiles with the sessions in the same time frame.
39
The authors generate profiles with facets (like search queries, inquired companies, inquiring companies, etc.)
40
Profiles are generated from the first half of September
◦ Light lines compare the profiles with the sessions in the same time frame, i.e. the first half of September (they are identical in all graphs)
◦ Dark lines compare the same profiles with the sessions in the following time frames (cross-validations).
41
In this experiment, profiles from T are validated against the sessions in the immediately following time period T+1
For Figure 1:
◦ Profiles are generated using the sessions in the first half of September
◦ Light line shows the validation using the sessions in the same time period
◦ Dark line shows the validation using the sessions in the following time period, i.e. the second half of September
42
There is one more experiment, but the only difference from the previous one is that it uses a shorter time period (1 week) for the observations. The idea is the same.
43
◦ The work presented in this paper is an unsupervised learning approach that tries to learn mass anonymous user profiles.
◦ The profiles are mined in a no-memory revolutionary scheme.
◦ The evolving profiles are validated in a full-memory mode.
44
In the paper, facets are used to support profiles with additional information, but it is not mentioned how they are used (e.g. the most searched companies, etc.).