
A Web Usage Mining Framework for Mining Evolving User Profiles in Dynamic Web Sites
Nasraoui, Soliman, Saka, Badia, Germain
IEEE Transactions on Knowledge and Data Engineering, 2008
Ozer Ozdikis, Huseyin Candan

• Extraction of user profiles using
  ◦ Web usage data
  ◦ Web site hierarchy
  ◦ External data, etc.
• Evolution of user profiles over time
  ◦ Introducing new profiles, killing invalid ones, etc.
  ◦ Validation of the profile evolution

1. Introduction
2. Web Usage Mining
  ◦ Handling Profile Evolution
  ◦ Integrating Semantics
3. Profile Discovery based on Usage
  ◦ Preprocessing
  ◦ Clustering sessions
  ◦ Similarity measure in clustering
  ◦ Post-processing and Enrichment
4. Profile Evolution
  ◦ Tracking the Evolving User Profiles
  ◦ Validating the Profile Evolution
  ◦ Experimental Results

Key features of the paper
• Dynamic content (a portal for companies)
• Clustering of user sessions extracted from web logs into homogeneous groups of similar activities
• Session similarity is calculated from the navigated URLs and the website hierarchy (from the URL and the site taxonomy)
• Generate mass user profiles
• Repeat this generation periodically
• Track the changes between the previous and the new profiles, and evaluate their evolution

Web Usage Mining stages:
1. Collect usage data/clickstreams
2. Preprocess (reformat, filter irrelevant data)
3. Analyze and discover interesting patterns
4. Evaluate discovered profiles
5. Track the evolution of profiles
• Web Usage Mining has been used for personalization, predicting navigation patterns, building data cubes to apply OLAP, etc.

• Previous studies related to evolution
  ◦ Machine learning based (another dimension for learning evolving concepts)
  ◦ Time-based forgetting approaches
  ◦ Separate user profiles for short-term and long-term interests

• Some concepts related to profile evolution
  ◦ Evolutionary / Revolutionary / Hybrid learning, regarding the adaptation to change
  ◦ No-memory / Partial-memory / Full-memory
  ◦ Supervised / Unsupervised
  ◦ Single user / Mass user

• For user modeling, web usage data can be supplemented with
  ◦ Keywords representing web page content
  ◦ The website's hierarchical structure (pages that are different but semantically related, e.g. under the same group)
  ◦ Semantic enrichment of navigated URLs (semantically enhanced web logs -> C-Logs)
  ◦ A taxonomy, which can be "defined explicitly" or "inferred implicitly" via URL tokenization (http://a/b/c.htm)

• Preprocess the web log to identify sessions and produce their vector representations
• Produce profiles using H-UNC (Hierarchical Unsupervised Niche Clustering), a genetic algorithm (GA) approach
• Enrich profiles with additional facets (external knowledge)
• Track profile evolution, and validate the discovered profiles

Session Identification
• Sessions are extracted using just the web log files (no login data, no cookies)
• Access time, IP address, URL viewed, and referrer are used for session identification
Session Representation
• Each valid URL in the web site is given a unique number j ∈ {1, 2, …, N_u}
• Each session is represented as a binary vector of size N_u. Navigation order is not considered.
• Example (number of valid URLs = 4): 1001 -> the user accessed URLs 1 and 4
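The session representation above can be sketched in a few lines of Python. The URL names and the `url_index` mapping are made up for illustration, and the code numbers URLs from 0 rather than from 1 as on the slide.

```python
def session_to_vector(visited_urls, url_index):
    """Return a binary list of length N_u; visit order and repeats are ignored."""
    vec = [0] * len(url_index)
    for url in visited_urls:
        j = url_index.get(url)
        if j is not None:  # skip URLs that are not valid pages of the site
            vec[j] = 1
    return vec


# Hypothetical site with N_u = 4 valid URLs.
url_index = {"/a.html": 0, "/b.html": 1, "/c.html": 2, "/d.html": 3}

# The user visited the fourth URL once and the first URL twice.
vec = session_to_vector(["/d.html", "/a.html", "/a.html"], url_index)
print("".join(map(str, vec)))  # prints 1001
```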

Unsupervised Niche Clustering (UNC)
• The goal is to find
  ◦ profiles p_i (sets of URLs representing session clusters) and
  ◦ scales σ_i (variance/dispersion of the sessions in a cluster around the cluster's representative profile)
• w_ij: robust weight of a profile p_i on a session s_j. If this value is large (i.e. the profile is "close" to the session), p_i is a strong representative of s_j, which has a positive effect on the fitness value of p_i.

• Randomly select N_p sessions as the initial p_i's
• Initialize the variable to some small value
• Repeat:
  ◦ Calculate the distance(!) d_ij between every profile p_i and every session s_j
  ◦ Calculate the robust weight w_ij for every profile p_i and every session s_j
  ◦ Calculate the scale σ_i for every profile p_i
  ◦ Calculate the fitness f_i for every profile
  ◦ Repeat (GA loop):
    ▪ Randomly select parent profiles
    ▪ Generate child profiles (through crossover and mutation)
    ▪ Calculate the fitness values of the child profiles
    ▪ Apply deterministic crowding as the replacement policy
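The slides do not give the formulas for the weight, scale, and fitness steps. The sketch below assumes the Gaussian-style definitions commonly associated with UNC: w_ij = exp(-d_ij / (2 σ_i^2)), σ_i^2 = Σ_j w_ij d_ij / Σ_j w_ij, and f_i = Σ_j w_ij / σ_i^2, where d_ij is the (already squared) distance. These formulas, the function names, and the sample numbers are assumptions for illustration, not taken from the transcript.

```python
import math


def robust_weights(d2, sigma2):
    """Assumed weight: large when a session is close to the profile."""
    return [math.exp(-d / (2.0 * sigma2)) for d in d2]


def update_scale(d2, weights):
    """Assumed scale update: weighted average of the squared distances."""
    return sum(w * d for w, d in zip(weights, d2)) / sum(weights)


def fitness(weights, sigma2):
    """Assumed density fitness: many close sessions and a small scale score high."""
    return sum(weights) / sigma2


# Squared distances from one profile to four sessions (made-up values).
d2 = [0.0, 0.01, 0.04, 0.81]
sigma2 = 0.05  # current scale of the profile (made-up value)

w = robust_weights(d2, sigma2)
new_sigma2 = update_scale(d2, w)
f = fitness(w, sigma2)
print(w, new_sigma2, f)
```

Under these assumptions the far-away session (distance 0.81) gets a weight near zero, so it barely affects the updated scale.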

Hierarchical Unsupervised Niche Clustering
• A divisive hierarchical version of UNC
• Repeatedly divide clusters into smaller clusters, hierarchically, considering
  ◦ the required hierarchy level (L_max)
  ◦ the maximum allowed cluster cardinality (N_split)
  ◦ the maximum allowed scale (σ_split)
• As a result, we have profile vectors and their scales. Sessions are assigned to the closest profiles.

• Cosine similarity between session vectors
• Web session similarity, which also takes the web site structure into account
• S_u(l,k): URL-to-URL similarity*
• The distance used in UNC is then d = (1 - S_web)^2
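As a minimal sketch, the cosine similarity of two binary session vectors and the distance d = (1 - S)^2 can be computed as follows. Only the plain cosine case is shown; the structure-aware similarity S_web is not reproduced in the transcript, so it is left out here. Function names are illustrative.

```python
import math


def cosine_similarity(s_k, s_l):
    """Cosine similarity of two binary session vectors of equal length."""
    shared = sum(a * b for a, b in zip(s_k, s_l))
    # For 0/1 vectors the squared norm equals the number of visited URLs.
    norm = math.sqrt(sum(s_k)) * math.sqrt(sum(s_l))
    return shared / norm if norm else 0.0


def distance(similarity):
    """Distance used for clustering: d = (1 - S)^2."""
    return (1.0 - similarity) ** 2


s1 = [1, 0, 0, 1]  # session 1001
s2 = [1, 1, 0, 0]  # session 1100
sim = cosine_similarity(s1, s2)  # one shared URL out of two each: about 0.5
print(sim, distance(sim))
```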

• URL structure (tokenized URL paths P): http://a/b/c.html
• For dynamic content, relations with an externally defined taxonomy ("is-a" relation): http://products.php?id=1&category=x

• For dynamic content (dynamic URLs), preprocess the data and map the dynamic URLs to strings separated by "/" using the ontology.
• If such a table (taxonomy data) is available, a hierarchical structure can be defined even for the dynamic URLs.
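The two ideas above, mapping dynamic URLs to "/"-separated paths and comparing URLs by their position in the hierarchy, can be sketched as follows. The taxonomy table, the example host, and the similarity formula (shared path prefix divided by the longer path length minus one, capped at 1) are assumptions for illustration; the slides do not give the exact definition of S_u.

```python
from urllib.parse import urlparse, parse_qs


def dynamic_url_to_path(url, taxonomy):
    """Look up the hierarchical path of a dynamic URL by its 'id' parameter."""
    query = parse_qs(urlparse(url).query)
    item_id = query.get("id", [None])[0]
    return taxonomy.get(item_id)


def path_tokens(path):
    return [t for t in path.strip("/").split("/") if t]


def url_similarity(path_a, path_b):
    """Assumed URL-to-URL similarity based on the shared path prefix."""
    a, b = path_tokens(path_a), path_tokens(path_b)
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    longest = max(len(a), len(b))
    return min(1.0, common / max(1, longest - 1))


# Hypothetical taxonomy table: item id -> position in the site hierarchy.
taxonomy = {
    "1": "/products/x/item1",
    "2": "/products/x/item2",
    "7": "/services/y/item7",
}

p1 = dynamic_url_to_path("http://example.com/products.php?id=1&category=x", taxonomy)
p2 = dynamic_url_to_path("http://example.com/products.php?id=2&category=x", taxonomy)
p7 = dynamic_url_to_path("http://example.com/products.php?id=7&category=y", taxonomy)
print(p1, url_similarity(p1, p2), url_similarity(p1, p7))
```

With this assumed formula, two items under the same category are treated as fully similar, while items under unrelated branches have similarity 0.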

• After H-UNC, we have clusters of sessions.
• Summarize the sessions in each cluster as a profile vector p_i, where p_ik is the frequency with which URL k was accessed in the sessions belonging to cluster i. Example:
  ◦ For cluster 1, let s_11 = 1001, s_12 = 1100, s_13 = 1001
  ◦ Then p_1 = (1, 0.33, 0, 0.67)
• Convert the p_i's to binary vectors so that only URLs with some minimum weight remain. Example:
  ◦ Let the minimum URL weight be 0.5
  ◦ Then p_1 = 1001
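The summarization step above can be reproduced with a short sketch. The function name is illustrative, and treating a URL whose weight equals the minimum as kept (>=) is an assumption.

```python
def summarize_cluster(sessions, min_weight=0.5):
    """Return (URL frequencies, thresholded binary profile) for one cluster."""
    n = len(sessions)
    # Column-wise mean: fraction of sessions in the cluster that visited each URL.
    weights = [sum(col) / n for col in zip(*sessions)]
    binary = [1 if w >= min_weight else 0 for w in weights]
    return weights, binary


cluster_1 = [
    [1, 0, 0, 1],  # s_11 = 1001
    [1, 1, 0, 0],  # s_12 = 1100
    [1, 0, 0, 1],  # s_13 = 1001
]
weights, binary = summarize_cluster(cluster_1)
print([round(w, 2) for w in weights])  # prints [1.0, 0.33, 0.0, 0.67]
print("".join(map(str, binary)))       # prints 1001
```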

Extend for Robust Profiles
• Calculate the weights w_ij for all sessions in a cluster (between profile i and session j), as in UNC
• Assign the sessions with high weights (robust weights) to the cluster's "core".
• So, a cluster's "core" is the group of sessions that are very similar to the representative profile.
• Thus, noisy sessions are eliminated.

• Enrich the profiles with facets (additional profile descriptors) such as:
  ◦ Search queries
  ◦ Inquiring companies
  ◦ Inquired companies
• The facets are obtained for the sessions in the cluster using IP addresses, whois.com, the registration database, etc.

• Profile boundaries: the p_i vectors and σ_i (scale/variance/dispersion) values are used to determine the boundaries
• Profile compatibility: how much the boundaries of two profiles overlap
• Algorithm "TrackProfiles". The idea is:
  ◦ Divide time into periods, and generate profiles for each period.
  ◦ Compare the profile vectors of consecutive time periods T_i and T_i+1 using S_web
  ◦ If the distance (i.e. 1 - similarity) is less than σ_profile1, then the two profiles found in T_i and T_i+1 are related.

• Birth: a new profile that is incompatible with all old profiles
• Persistence: a new profile that is compatible with an old profile
  ◦ One-to-one
  ◦ Bifurcation (splitting)
  ◦ Merging
• Death: no new profile is found for an old profile
• Atavism (reappearance): an old profile disappears and then reappears
• Volatile: dead profiles that have never been persistent
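A minimal sketch of how birth, persistence, and death could be detected between two consecutive periods, assuming a precomputed distance matrix between old and new profiles and using the old profile's scale as the compatibility threshold. The function name, data layout, and numbers are hypothetical; atavism and volatility need the history of more than two periods and are not covered.

```python
def track_profiles(dist, old_sigma):
    """dist[i][j]: distance between old profile i and new profile j.
    old_sigma[i]: scale of old profile i, used as its compatibility radius."""
    n_old = len(dist)
    n_new = len(dist[0]) if dist else 0
    # New profiles that fall inside the boundary of each old profile.
    matches = {
        i: [j for j in range(n_new) if dist[i][j] < old_sigma[i]]
        for i in range(n_old)
    }
    matched_new = {j for js in matches.values() for j in js}
    return {
        "persistent": [i for i, js in matches.items() if js],
        "dead": [i for i, js in matches.items() if not js],
        "born": [j for j in range(n_new) if j not in matched_new],
    }


# Two old and two new profiles (made-up distances and scales).
dist = [
    [0.1, 0.9],
    [0.8, 0.7],
]
old_sigma = [0.3, 0.3]
events = track_profiles(dist, old_sigma)
print(events)  # {'persistent': [0], 'dead': [1], 'born': [1]}
```

In this layout, an old profile with several compatible new profiles corresponds to a bifurcation, and a new profile matched by several old profiles corresponds to a merge.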

• Example of a profile merge (figure)

• How close are the profiles to the original input data?
  ◦ Precision: a profile with high precision should include "only" the true items
  ◦ Coverage (recall): a profile with high coverage should include all data items
• Example: let session = 1001, then the profiles
  ◦ 1000, 0001: high precision, low coverage
  ◦ 1101, 1011: low precision, high coverage
  ◦ 1001: high precision, high coverage (ideal but unrealistic case -> every session would have to be a profile)

• So we need to balance precision and coverage with a small number of profiles to get a high quality Q_ij for session j and profile i.
• Define
  ◦ Precision: Prec_ij = |s_j ∩ p_i| / |p_i|
  ◦ Coverage: Cov_ij = |s_j ∩ p_i| / |s_j|
• A combined quality measure is defined as Q_ij = F_1,ij = 2 * Prec_ij * Cov_ij / (Prec_ij + Cov_ij)
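The definitions above can be checked against the session 1001 example with a small sketch. The helper names are illustrative; returning 0 when the session and profile share no URL is an assumption made to avoid division by zero.

```python
def bits(s):
    """Turn a string such as '1001' into a list of 0/1 integers."""
    return [int(c) for c in s]


def quality(session, profile):
    """Return (precision, coverage, F1) of a binary profile for a binary session."""
    overlap = sum(s & p for s, p in zip(session, profile))
    if overlap == 0:
        return 0.0, 0.0, 0.0
    prec = overlap / sum(profile)
    cov = overlap / sum(session)
    return prec, cov, 2 * prec * cov / (prec + cov)


session = bits("1001")
for profile in ("1000", "1101", "1001"):
    prec, cov, f1 = quality(session, bits(profile))
    print(profile, round(prec, 2), round(cov, 2), round(f1, 2))
# 1000 1.0 0.5 0.67
# 1101 0.67 1.0 0.8
# 1001 1.0 1.0 1.0
```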

• So far, the quality measure is defined between one profile and one session.
• Now, how do we capture the concept drift?
• The idea is:
  ◦ Decide a minimum quality threshold Q_min to be satisfied
  ◦ Discover the profiles at time period T_2
  ◦ Take the sessions at the next time period T_1 and, for each session s_j, find the maximum quality Q_ij over the profiles from the previous time frame T_2
  ◦ If this quality is higher than Q_min, add session s_j to the set of quality sessions, denoted s*(T_1, T_2)

• As a result, quality can be measured as the fraction of the sessions at T_1 that belong to s*(T_1, T_2).
• As long as most of the sessions at T_1 are successfully represented by profiles found at T_2, this rate will be high.
• If the minimum quality threshold Q_min is low, the rate will be high; the best case is 1. As Q_min is increased, the number of sessions satisfying it decreases.
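A minimal sketch of this cross-period validation, assuming the rate is the fraction of sessions whose best F1 against any profile reaches Q_min. The function names, the use of >= rather than a strict inequality, and the sample data are assumptions for illustration.

```python
def f1(session, profile):
    """F1 quality of a binary profile for a binary session (0 if no overlap)."""
    overlap = sum(s & p for s, p in zip(session, profile))
    if overlap == 0:
        return 0.0
    prec = overlap / sum(profile)
    cov = overlap / sum(session)
    return 2 * prec * cov / (prec + cov)


def validation_rate(sessions, profiles, q_min):
    """Fraction of sessions represented by some profile with quality >= q_min."""
    if not sessions:
        return 0.0
    good = sum(
        1
        for s in sessions
        if max((f1(s, p) for p in profiles), default=0.0) >= q_min
    )
    return good / len(sessions)


# Profiles mined in the earlier period, sessions from the following period.
profiles = [[1, 0, 0, 1], [0, 1, 1, 0]]
sessions = [[1, 0, 0, 1], [1, 1, 0, 1], [0, 0, 1, 0]]

for q_min in (0.5, 0.7, 0.9):
    print(q_min, round(validation_rate(sessions, profiles, q_min), 2))
# 0.5 1.0
# 0.7 0.67
# 0.9 0.33
```

Sweeping Q_min in this way gives the kind of curve described for the validation graphs: the rate starts near 1 for low thresholds and drops as the threshold increases.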

• Example graph showing
  ◦ the sessions satisfying the minimum quality (y-axis)
  ◦ the minimal F1 quality (x-axis)
  ◦ Dark line (lower): the cross-period validation.
  ◦ Light line (upper): the validation of the profiles against the sessions in the same time frame.


• The authors generate profiles with facets (such as search queries, inquired companies, inquiring companies, etc.)

• Profiles are generated from the first half of September
• Light lines compare the profiles with the sessions in the same time frame, i.e. the first half of September (they are identical in all graphs)
• Dark lines compare the same profiles with the sessions in the following time frames (cross-validation).

• In this experiment, profiles from T are validated against the sessions in the immediately following time period T+1
• For Figure 1:
  ◦ Profiles are generated using the sessions in the first half of September
  ◦ The light line shows the validation using the sessions in the same time period
  ◦ The dark line shows the validation using the sessions in the following time period, i.e. the second half of September

• There is one more experiment; the only difference from the previous one is that a shorter time period (1 week) is used in the observations. The idea is the same.

• The work presented in this paper is an unsupervised learning approach that learns mass anonymous user profiles
• The profiles are mined in a no-memory, revolutionary scheme.
• The evolving profiles are validated in a full-memory mode.

• In the paper, facets are used to support profiles with additional information (e.g. the most searched companies), but it is not mentioned how they are used.
