A Web Usage Mining Framework for Mining Evolving User Profiles in Dynamic Web Sites. Nasraoui, Soliman, Saka, Badia, Germain. IEEE Transactions on Knowledge and Data Engineering, 2008.

Presentation transcript:

A Web Usage Mining Framework for Mining Evolving User Profiles in Dynamic Web Sites
Nasraoui, Soliman, Saka, Badia, Germain
IEEE Transactions on Knowledge and Data Engineering, 2008
Ozer Ozdikis, Huseyin Candan

• Extraction of user profiles using
  ◦ Web usage data
  ◦ Web site hierarchy
  ◦ External data, etc.
• Evolution of user profiles over time
  ◦ Introducing new profiles, killing invalid ones…
  ◦ Validation of the profile evolution

1. Introduction
2. Web Usage Mining
  ◦ Handling Profile Evolution
  ◦ Integrating Semantics
3. Profile Discovery based on Usage
  ◦ Preprocessing
  ◦ Clustering sessions
  ◦ Similarity measure in clustering
  ◦ Post processing and Enrichment
4. Profile Evolution
  ◦ Tracking the Evolving User Profiles
  ◦ Validating the Profile Evolution
  ◦ Experimental Results

Key features of the paper
• Dynamic content (a portal for companies)
• Clustering of user sessions extracted from web logs into homogeneous groups of similar activities
• Session similarity is calculated using the navigated URLs and the website hierarchy (from URL and site taxonomy)
• Generate mass user profiles
• Repeat this generation periodically
• Track the changes between the previous and the new profiles, and evaluate their evolution

Web Usage Mining stages:
1. Collect usage data/clickstreams
2. Preprocess (reformat, filter irrelevant data)
3. Analyze and discover interesting patterns
4. Evaluate discovered profiles
5. Track the evolution of profiles
• Web Usage Mining has been used for personalization, predicting navigation patterns, building datacubes to apply OLAP, etc.

• Previous studies related to evolution
  ◦ Machine learning based (another dimension for learning evolving concepts)
  ◦ Time-based forgetting approaches
  ◦ Separate user profiles for short-term and long-term interests

• Some concepts related to profile evolution
  ◦ Evolutionary / Revolutionary / Hybrid learning, regarding the adaptation to change
  ◦ No-memory / Partial memory / Full memory
  ◦ Supervised / Unsupervised
  ◦ Single user / Mass user

• For user modeling, web usage data can be supported with
  ◦ Keywords representing web page content
  ◦ The website's hierarchical structure (different pages that are semantically relevant, e.g. under the same group)
  ◦ Semantic enrichment of navigated URLs (semantically enhanced web logs -> C-Logs)
  ◦ A taxonomy that is either "defined explicitly" or "inferred implicitly" via URL tokenization

• Preprocess the web log to identify sessions and produce their vector representations
• Produce profiles using H-UNC (Hierarchical Unsupervised Niche Clustering) -> a GA approach
• Enrich profiles with additional facets (external knowledge)
• Track profile evolution, and validate the discovered profiles

Session Identification
• Sessions are extracted using just the web log files (no login data, no cookies)
• Access time, IP address, URL viewed, and REFERRER are used for session identification
Session Representation
• Each valid URL in the web site is given a unique number j ∈ {1, 2, …, N_u}
• Each session is represented as a binary vector of size N_u. Navigation order is not considered.
• Example (number of valid URLs = 4): 1001 -> user accessed URLs 1 and 4
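The session representation above can be sketched in a few lines of Python. The URL names and the `url_index` mapping are hypothetical and only serve the illustration; the slides specify just the numbering j = 1..N_u and the binary encoding.

```python
def session_to_vector(visited_urls, url_index):
    """Encode a session as a binary list of length N_u.

    url_index maps each valid URL to its number j in 1..N_u (assumed mapping).
    Visit order and repeat visits are ignored, as stated on the slide.
    """
    vector = [0] * len(url_index)
    for url in visited_urls:
        j = url_index.get(url)
        if j is not None:  # skip URLs that are not valid pages of the site
            vector[j - 1] = 1
    return vector


# Hypothetical site with N_u = 4 valid URLs
url_index = {"/home": 1, "/products": 2, "/contact": 3, "/search": 4}
print(session_to_vector(["/home", "/search", "/home"], url_index))  # [1, 0, 0, 1]
```

The printed vector corresponds to the slide's example 1001: URLs 1 and 4 were accessed.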

Unsupervised Niche Clustering (UNC)
• The goal is to find
  ◦ profiles p_i (a set of URLs representing session clusters) and
  ◦ scales σ_i (variance/dispersion of the sessions in a cluster around the cluster's representative profile)
• w_ij: robust weight of a profile p_i on a session s_j. If this value is large (i.e. the profile is "close" to the session), p_i is a strong representative of s_j, which has a positive effect on the fitness value of p_i.

• Randomly select N_p sessions as the initial p_i's
• Initialize the variable to some small value
• Repeat:
  ◦ Calculate the distance(!) d_ij between every profile p_i and every session s_j
  ◦ Calculate the robust weight w_ij for every profile p_i and every session s_j
  ◦ Calculate the scale σ_i for every profile p_i
  ◦ Calculate the fitness f_i for every profile
  ◦ Repeat (GA loop):
    • Randomly select parent profiles
    • Generate child profiles (through crossover and mutation)
    • Calculate the fitness values of the child profiles
    • Apply deterministic crowding as the replacement policy
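The slides do not give the formulas for the robust weight, the scale, or the fitness. The sketch below fills them with a Gaussian-style weight, a weighted-average scale update, and a density-style fitness. These three formulas are assumptions made for illustration, not taken from the slides, and the GA loop (crossover, mutation, deterministic crowding) is left out.

```python
import math


def robust_weights(distances, sigma2):
    """Assumed weight: w_ij = exp(-d_ij / (2 * sigma_i^2)), d_ij already squared."""
    return [math.exp(-d / (2.0 * sigma2)) for d in distances]


def update_scale(distances, weights):
    """Assumed scale update: weighted mean of the distances to the profile."""
    return sum(w * d for w, d in zip(weights, distances)) / sum(weights)


def fitness(weights, sigma2):
    """Assumed fitness: total robust weight divided by the scale (a density)."""
    return sum(weights) / sigma2


# Distances d_ij from one profile to four sessions (made-up numbers)
d_i = [0.0, 0.05, 0.1, 0.9]
w_i = robust_weights(d_i, sigma2=0.1)
print([round(w, 3) for w in w_i])
print(round(fitness(w_i, 0.1), 3), round(update_scale(d_i, w_i), 3))
```

Under these assumptions, sessions close to the profile get weights near 1 and distant ones contribute almost nothing, which matches the slide's remark that close sessions raise the fitness of p_i.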

Hierarchical Unsupervised Niche Clustering
• A divisive hierarchical version of UNC
• Repeatedly divide clusters into smaller clusters, hierarchically, considering
  ◦ the required hierarchy level (L_max)
  ◦ the maximum allowed cluster cardinality (N_split)
  ◦ the maximum allowed scale (σ_split)
• As a result, we have profile vectors and their scales. Sessions are assigned to the closest profiles.

• Cosine similarity between session vectors
• Web session similarity S_web, which takes the web site structure into account
• S_u(l, k): URL-to-URL similarity
• The distance used in UNC is then d = (1 - S_web)^2
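As a minimal sketch, the cosine similarity of two binary session vectors and the distance d = (1 - S)^2 can be computed as follows. Plain cosine similarity stands in for S_web here, which is a simplification: the structure-aware S_web is not defined in this transcript.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two binary session vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    # For 0/1 vectors the squared norm equals the number of ones.
    norm = math.sqrt(sum(a) * sum(b))
    return dot / norm if norm else 0.0


def unc_distance(similarity):
    """Distance from the slide: d = (1 - S)^2."""
    return (1.0 - similarity) ** 2


s1, s2 = [1, 0, 0, 1], [1, 1, 0, 0]
sim = cosine_similarity(s1, s2)
print(sim, unc_distance(sim))  # 0.5 0.25
```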

• URL structure (tokenized URL paths P)
• For dynamic content, relations with an externally defined taxonomy ("is-a" relation).

• For dynamic content (dynamic URLs), preprocess the data and map the dynamic URLs to strings separated by "/" using the ontology.
• If we have such a table (taxonomy data), we can define a hierarchical structure even for the dynamic URLs.
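A rough sketch of the idea, under stated assumptions: the `taxonomy` table and its entries are hypothetical, and the similarity used here (shared leading path tokens divided by the longer path length) is an illustrative stand-in, not the paper's S_u formula, which is not reproduced in this transcript.

```python
def tokenize(url_path):
    """Split a '/'-separated path into its non-empty tokens."""
    return [token for token in url_path.split("/") if token]


def url_similarity(path_a, path_b):
    """Illustrative similarity: shared leading tokens / longer path length."""
    a, b = tokenize(path_a), tokenize(path_b)
    if not a or not b:
        return 0.0
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return shared / max(len(a), len(b))


# Hypothetical taxonomy table mapping dynamic URLs to hierarchical paths
taxonomy = {
    "/item.asp?id=17": "/products/pumps/item17",
    "/item.asp?id=42": "/products/pumps/item42",
    "/item.asp?id=90": "/services/repair/item90",
}


def hierarchical_path(url):
    """Return the taxonomy path of a dynamic URL, or the URL itself if unmapped."""
    return taxonomy.get(url, url)


a = hierarchical_path("/item.asp?id=17")
b = hierarchical_path("/item.asp?id=42")
c = hierarchical_path("/item.asp?id=90")
print(url_similarity(a, b), url_similarity(a, c))
```

Two items under the same taxonomy branch come out as similar even though their raw dynamic URLs share no path structure, which is the point of the mapping.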

• After H-UNC, we have clusters of sessions.
• Summarize the sessions in each cluster as a profile vector p_i, where p_ik is the frequency with which URL k was accessed in the sessions belonging to cluster i. Example:
  ◦ For cluster 1, let s_11 = 1001, s_12 = 1100, s_13 = 1001
  ◦ Then p_1 = (1, 0.33, 0, 0.67)
• Convert the p_i's to binary vectors so that only URLs with some minimum weight remain. Example:
  ◦ Let the minimum URL weight be 0.5,
  ◦ then p_1 = 1001
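The summarization step can be reproduced with a short Python sketch. The function names are made up for the illustration, and treating a weight exactly equal to the threshold as "kept" is an assumption, since the slide only says "some minimum weight".

```python
def profile_vector(sessions):
    """p_ik = fraction of the cluster's sessions in which URL k was accessed."""
    n = len(sessions)
    return [sum(column) / n for column in zip(*sessions)]


def binarize(profile, min_weight):
    """Keep only URLs whose weight reaches min_weight (>= is an assumption)."""
    return [1 if weight >= min_weight else 0 for weight in profile]


cluster_1 = [[1, 0, 0, 1], [1, 1, 0, 0], [1, 0, 0, 1]]
p_1 = profile_vector(cluster_1)
print([round(w, 2) for w in p_1])  # [1.0, 0.33, 0.0, 0.67]
print(binarize(p_1, 0.5))          # [1, 0, 0, 1]
```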

Extend for Robust Profiles
• Calculate the weights w_ij for all sessions in a cluster (between profile i and session j), as in UNC
• Assign sessions with high weights (robust weights) to the cluster's "core".
• So, a cluster's "core" is the group of sessions that are very similar to the representative profile.
• Thus, noisy sessions are eliminated.

• Enrich the profiles with facets (additional profile descriptors) like:
  ◦ Search queries
  ◦ Inquiring companies
  ◦ Inquired companies
• Facets are obtained for the sessions in the cluster using IP addresses, whois.com, the registration database, etc.

• Profile boundaries: the p_i vectors and σ_i (scale/variance/dispersion) values are used to determine the boundaries
• Profile compatibility: how much the boundaries of two profiles overlap
• Algorithm to "TrackProfiles". The idea is:
  ◦ Divide the time into time periods, and generate profiles for each time period.
  ◦ Compare the similarity of the profile vectors for consecutive time periods T_i and T_i+1 using S_web
  ◦ If the distance (i.e. 1 - similarity) is < σ_profile1, then the two profiles found in T_i and T_i+1 are related.

• Birth: new profile incompatible with the old profiles
• Persistence: new profile compatible with an old profile
  ◦ One-to-one
  ◦ Bifurcation (splitting)
  ◦ Merging
• Death: no new profile found for an old profile
• Atavism (reappearance): old profile disappears, then reappears
• Volatile: dead profiles that have never been persistent
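The tracking idea and the birth/persistence/death events can be sketched as below. This is a simplified illustration, not the paper's TrackProfiles algorithm: compatibility is tested in one direction only (distance below the old profile's scale), a cosine-based distance stands in for 1 - S_web, and atavism and volatility are not handled because they need more than two periods.

```python
import math


def distance(a, b):
    """Stand-in distance: (1 - cosine similarity)^2 on binary vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(a) * sum(b))
    similarity = dot / norm if norm else 0.0
    return (1.0 - similarity) ** 2


def track_profiles(old_profiles, new_profiles):
    """old_profiles: list of (vector, sigma); new_profiles: list of vectors.

    Returns (births, persistent, deaths) as indices into the input lists.
    """
    matches = {i: [] for i in range(len(old_profiles))}
    births = []
    for k, p_new in enumerate(new_profiles):
        related = [i for i, (p_old, sigma) in enumerate(old_profiles)
                   if distance(p_old, p_new) < sigma]
        if not related:
            births.append(k)  # incompatible with every old profile
        for i in related:
            matches[i].append(k)
    persistent = {i: ks for i, ks in matches.items() if ks}
    deaths = [i for i, ks in matches.items() if not ks]
    return births, persistent, deaths


old = [([1, 0, 0, 1], 0.2), ([0, 1, 1, 0], 0.2)]
new = [[1, 0, 0, 1], [0, 0, 1, 1]]
print(track_profiles(old, new))  # ([1], {0: [0]}, [1])
```

In the returned mapping, an old profile matched by several new ones would indicate a bifurcation, and several old profiles matched by the same new one would indicate a merge.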

• Example of a profile merge (figure)

• How close are the profiles to the original input data?
  ◦ Precision: a profile with high precision should include "only" the true items
  ◦ Coverage (Recall): a profile with high coverage should include all data items
• Example: let session = 1001, then the profiles
  ◦ 1000, 0001: high precision, low coverage
  ◦ 1101, 1011: low precision, high coverage
  ◦ 1001: high precision, high coverage (ideal but unrealistic case -> every session must be a profile)

• So we need to balance precision and coverage with some small number of profiles to get a high quality Q_ij for session j and profile i.
• Define
  ◦ Precision: Prec_ij = |s_j ∩ p_i| / |p_i|
  ◦ Coverage: Cov_ij = |s_j ∩ p_i| / |s_j|
• A combined measure for quality is defined as Q_ij = F_1,ij = 2 * Prec_ij * Cov_ij / (Prec_ij + Cov_ij)
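The three measures can be checked on the earlier example session 1001 with a small Python sketch. The function name is illustrative, and returning zeros when the session and profile share no URL is a convention chosen here to avoid division by zero.

```python
def quality(session, profile):
    """Return (precision, coverage, F1) for a binary session and profile."""
    overlap = sum(s & p for s, p in zip(session, profile))
    if overlap == 0:
        return 0.0, 0.0, 0.0  # convention for disjoint vectors (assumption)
    precision = overlap / sum(profile)  # |s ∩ p| / |p|
    coverage = overlap / sum(session)   # |s ∩ p| / |s|
    f1 = 2 * precision * coverage / (precision + coverage)
    return precision, coverage, f1


session = [1, 0, 0, 1]
for profile in ([1, 0, 0, 0], [1, 1, 0, 1], [1, 0, 0, 1]):
    print(profile, [round(v, 2) for v in quality(session, profile)])
```

Profile 1000 gets precision 1 but coverage 0.5, profile 1101 gets coverage 1 but precision 0.67, and only 1001 reaches 1 on both, in line with the slide's example.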

• So, we defined the quality measure between a profile and a session.
• Now, how do we capture the concept drift?
• The idea is:
  ◦ Decide on a minimum quality threshold Q_min to be satisfied
  ◦ Discover the profiles at time period T_2
  ◦ Take the sessions at the next time period T_1, and for each session s_j find the maximum quality Q_ij over the profiles from the previous time frame
  ◦ If this quality is higher than Q_min, add the session s_j to the set of quality sessions, denoted by s*(T_1, T_2)

• As a result, we can measure quality as the rate of sessions at T_1 that belong to s*(T_1, T_2).
• As long as most of the sessions at T_1 are successfully represented by profiles found at T_2, this rate will be high.
• If the minimum quality threshold Q_min is set low, the rate will be high. The best case is 1. As Q_min is increased, the number of sessions satisfying this quality decreases.
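A minimal sketch of this validation rate, assuming it is the fraction of T_1 sessions whose best F1 quality against the T_2 profiles exceeds Q_min. The function names and the toy data are made up for the illustration.

```python
def f1_quality(session, profile):
    """F1 of precision and coverage between a binary session and profile."""
    overlap = sum(s & p for s, p in zip(session, profile))
    if overlap == 0:
        return 0.0
    precision = overlap / sum(profile)
    coverage = overlap / sum(session)
    return 2 * precision * coverage / (precision + coverage)


def quality_rate(sessions_t1, profiles_t2, q_min):
    """Fraction of sessions whose best-matching profile has quality > q_min."""
    if not sessions_t1:
        return 0.0
    covered = sum(
        1 for s in sessions_t1
        if max((f1_quality(s, p) for p in profiles_t2), default=0.0) > q_min
    )
    return covered / len(sessions_t1)


sessions = [[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0]]
profiles = [[1, 0, 0, 1]]
for q_min in (0.0, 0.4, 0.6):
    print(q_min, round(quality_rate(sessions, profiles, q_min), 2))
```

Raising q_min from 0.4 to 0.6 drops the third session (best F1 of 0.5) out of the quality set, so the rate falls from 0.67 to 0.33, which is the decreasing behaviour described above.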

• Example graph showing
  ◦ the sessions satisfying the minimum quality (y-axis)
  ◦ the minimal F1 quality (x-axis)
  ◦ The dark line (lower) is the cross-period validation.
  ◦ The light line (upper) is the validation of the profiles with the sessions in the same time frame.

• They generate profiles with facets (like search queries, inquired companies, inquiring companies, etc.)

• Profiles are generated in the first half of September
• Light lines compare the profiles with the sessions in the same time frame, i.e. the first half of September (they are identical in all graphs)
• Dark lines compare the same profiles with the sessions in the following time frames (cross validations).

• In this experiment, profiles from T are validated against the sessions in the immediately following time period T+1
• For Figure 1:
  ◦ Profiles are generated using the sessions in the first half of September
  ◦ The light line shows the validation using the sessions in the same time period
  ◦ The dark line shows the validation using the sessions in the following time period, i.e. the second half of September

• There is one more experiment; the only difference from the previous one is that a shorter time period (1 week) is used in the observations. The idea is the same.

• The work presented in this paper is an unsupervised learning approach that tries to learn mass anonymous user profiles
• The profiles are mined in a no-memory revolutionary scheme.
• The evolving profiles are validated in a full-memory mode.

• In the paper, facets are used to support the profiles with additional information (e.g. the most searched companies, etc.), but it is not mentioned how they are used.