Overview on Web Mining and Recommendation. June 13, 2016. CENG 770.


Overview on Web Mining and Recommendation (CENG 770)

Web Mining
Web Mining is the use of data mining techniques to automatically discover and extract information from web documents and services.

Examples of Discovered Patterns
- Association rules: 75% of Facebook users also have FourSquare accounts
- Classification: people with age less than 40 and salary > 40k trade on-line
- Clustering: users A and B access similar URLs
- Outlier detection: user A spends more than twice the average amount of time surfing on the Web

Why is Web Mining Different?
- The Web is a huge collection of documents, plus:
  - Hyper-link information
  - Access and usage information
- The Web is very dynamic: new pages are constantly being generated
- Challenge: develop new Web mining algorithms and adapt traditional data mining algorithms to:
  - Exploit hyper-links and access patterns
  - Be incremental

Web Mining
- Web Content Mining: web page content mining, search result mining
- Web Structure Mining: search
- Web Usage Mining: access patterns, customized usage patterns

Web Content Mining
- Crawler: a program that traverses the hypertext structure of the Web
  - Seed URL: page or set of pages that the crawler starts with
  - Links from each visited page are saved in a queue
  - Builds an index
- Focused crawlers
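The crawl loop above can be sketched as a breadth-first traversal. This is a minimal illustration, not a production crawler: the in-memory `graph` and the `get_links` callback are made-up stand-ins for real HTTP fetching and HTML link extraction.

```python
from collections import deque

def crawl(seed_urls, get_links, max_pages=100):
    """Breadth-first crawl: start from seed URLs, save the links found
    on each visited page in a queue, and record the visit order."""
    queue = deque(seed_urls)
    visited = []                      # crawl order (a stand-in for the index)
    seen = set(seed_urls)
    while queue and len(visited) < max_pages:
        url = queue.popleft()         # next page to visit
        visited.append(url)
        for link in get_links(url):   # links found on the page
            if link not in seen:      # avoid re-crawling the same page
                seen.add(link)
                queue.append(link)
    return visited

# Toy link graph standing in for real pages
graph = {"a": ["b", "c"], "b": ["c", "d"], "c": [], "d": ["a"]}
order = crawl(["a"], lambda u: graph.get(u, []))
```

A focused crawler would differ only in the queue discipline: instead of FIFO order, pages would be prioritized by estimated relevance to the target topic.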

Basic Measures for Text Retrieval
- Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., "correct" responses)
- Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved
[Figure: Venn diagram over all documents, showing the retrieved set, the relevant set, and their intersection (relevant & retrieved)]
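Both measures fall out directly from set intersection; a small sketch with made-up document ids:

```python
def precision_recall(retrieved, relevant):
    """Precision = |retrieved ∩ relevant| / |retrieved|;
    Recall    = |retrieved ∩ relevant| / |relevant|."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# 4 documents retrieved, 3 relevant overall, 2 of them retrieved
p, r = precision_recall(retrieved={"d1", "d2", "d3", "d4"},
                        relevant={"d1", "d3", "d5"})
```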

Information Retrieval Techniques
- Basic concepts
  - A document can be described by a set of representative keywords called index terms
  - Different index terms have varying relevance when used to describe document contents
  - This effect is captured through the assignment of numerical weights to each index term of a document (e.g. frequency, tf-idf)
- DBMS analogy
  - Index terms → attributes
  - Weights → attribute values

Indexing
- Inverted index: a data structure for supporting text queries, similar to the index in a book
  - document_table: a set of document records
  - term_table: a set of term records
  - Answering a query: find all docs associated with one or a set of terms
  - (+) easy to implement
  - (-) does not handle synonymy and polysemy well, and posting lists can become very long (storage can be very large)

Inverted index (built by indexing the documents stored on disk):

  aalborg   → 3452, 11437, …
  …
  arm       → 4, 19, 29, 98, 143, …
  armada    → 145, 457, 789, …
  armadillo → 678, 2134, 3970, …
  armani    → 90, 256, 372, 511, …
  …
  zz        → 602, 1189, 3209, …
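A minimal inverted index along these lines (the documents are made-up toy text; a conjunctive query is answered by intersecting the posting lists of its terms):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {term: sorted posting list of doc ids}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {t: sorted(ids) for t, ids in index.items()}

def query(index, *terms):
    """Find all docs associated with the given set of terms."""
    postings = [set(index.get(t, ())) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

docs = {1: "web mining and data mining",
        2: "web search engines",
        3: "data structures"}
idx = build_inverted_index(docs)
```

Note how the synonymy/polysemy limitation shows up: the index matches exact strings only, so `query(idx, "internet")` would miss documents about the "web".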

Vector Space Model
- Documents and user queries are represented as m-dimensional vectors, where m is the total number of index terms in the document collection.
- The degree of similarity of the document d with regard to the query q is calculated as the correlation between the vectors that represent them, using measures such as the Euclidean distance or the cosine of the angle between the two vectors.

Vector Space Model
- Represent a doc by a term vector
  - Term: basic concept, e.g. word or phrase
  - Each term defines one dimension; N terms define an N-dimensional space
  - Each element of the vector corresponds to a term weight
  - E.g., d = (x1, …, xN), where xi is the "importance" of term i

VS Model: Illustration
[Figure: documents plotted in a 3-dimensional term space with axes Java, Microsoft, and Starbucks, grouped into clusters C1 (Category 1), C2 (Category 2), and C3 (Category 3); a new document is assigned to the most likely category based on vector similarity.]

Issues to be handled
- How to select terms to capture "basic concepts"
  - Word stopping: e.g. "a", "the", "always", "along"
  - Word stemming: e.g. "computer", "computing", "computerize" => "compute"
  - Latent semantic indexing
- How to assign weights: not all words are equally important; some are more indicative than others (e.g. "algebra" vs. "science")
- How to measure the similarity

Latent Semantic Indexing
- Basic idea
  - Similar documents have similar word frequencies
  - Difficulty: the term frequency matrix is very large
  - Use singular value decomposition (SVD) to reduce the size of the frequency table
  - Retain the K most significant rows of the frequency table
- Method
  - Create a term x document weighted frequency matrix A
  - SVD construction: A = U * S * V'
  - Choose K and obtain U_K, S_K, and V_K
  - Create a query vector q'
  - Project q' into the term-document space: Dq = q' * U_K * S_K^(-1)
  - Calculate similarities: cos α = (Dq · D) / (||Dq|| * ||D||)
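The method can be sketched with NumPy's SVD. The term-by-document matrix A and the query q below are made-up toy data; NumPy is assumed to be available.

```python
import numpy as np

# Term x document weighted frequency matrix A (4 terms, 3 documents)
A = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 2., 0.],
              [0., 0., 3.]])

# SVD construction: A = U * S * V'
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Retain the K most significant singular values
K = 2
Uk, Sk = U[:, :K], np.diag(s[:K])

# Project a query vector q into the K-dimensional term-document space:
# Dq = q' * U_K * S_K^(-1)
q = np.array([1., 0., 0., 1.])          # query containing terms 0 and 3
Dq = q @ Uk @ np.linalg.inv(Sk)

# Documents in the reduced space are the columns of S_K * V_K'
docs_k = (Sk @ Vt[:K]).T

# cos α = (Dq · D) / (||Dq|| * ||D||) against each document
sims = [Dq @ d / (np.linalg.norm(Dq) * np.linalg.norm(d)) for d in docs_k]
```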

How to Assign Weights
Two-fold heuristics based on frequency:
- TF (term frequency): more frequent within a document → more relevant to its semantics
- IDF (inverse document frequency): less frequent among documents → more discriminative

TF Weighting
- Weighting: more frequent → more relevant to the topic
- Raw TF = f(t, d): how many times term t appears in doc d
- Normalization: document length varies, so relative frequency is preferred, e.g. maximum frequency normalization: TF(t, d) = f(t, d) / max over t' of f(t', d)

IDF Weighting
- Idea: less frequent among documents → more discriminative
- Formula: IDF(t) = 1 + log(n/k)
  - n: total number of docs
  - k: number of docs containing term t

TF-IDF Weighting
- TF-IDF weighting: weight(t, d) = TF(t, d) * IDF(t)
  - Frequent within a doc → high tf → high weight
  - Selective among docs → high idf → high weight
- Recall the VS model
  - Each selected term represents one dimension
  - Each doc is represented by a feature vector
  - The term-t coordinate of document d is the TF-IDF weight weight(t, d)
- Many complex and more effective weighting variants exist in practice
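A sketch of this weighting scheme, using raw TF and the slide's IDF(t) = 1 + log(n/k); the documents are made-up token lists:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists.
    weight(t, d) = TF(t, d) * IDF(t), with raw TF and IDF(t) = 1 + log(n/k)."""
    n = len(docs)
    # k = number of docs containing each term
    df = Counter(t for doc in docs for t in set(doc))
    idf = {t: 1 + math.log(n / k) for t, k in df.items()}
    # one sparse feature vector (term -> weight) per document
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

docs = [["text", "mining", "text"], ["web", "search"], ["text", "web"]]
vecs = tfidf_vectors(docs)
```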

How to Measure Similarity?
Given two documents d1 and d2, similarity is defined via:
- dot product: sim(d1, d2) = d1 · d2
- normalized dot product (cosine): sim(d1, d2) = (d1 · d2) / (||d1|| * ||d2||)
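Both similarity functions, sketched for the sparse term-weight vectors (dicts) used above; the example vectors are made up:

```python
import math

def dot(d1, d2):
    """Dot product of two sparse term-weight vectors (term -> weight dicts)."""
    return sum(w * d2.get(t, 0.0) for t, w in d1.items())

def cosine(d1, d2):
    """Normalized dot product: dot(d1, d2) / (||d1|| * ||d2||)."""
    norm1 = math.sqrt(dot(d1, d1))
    norm2 = math.sqrt(dot(d2, d2))
    return dot(d1, d2) / (norm1 * norm2) if norm1 and norm2 else 0.0

a = {"text": 2.0, "mining": 1.0}
b = {"text": 1.0, "web": 3.0}
```

Cosine is usually preferred for documents because it ignores length: a document concatenated with itself gets the same cosine similarity to any query, while its dot product doubles.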

Illustrative Example

doc1 = "text mining search engine text"
doc2 = "travel text map travel"
doc3 = "government president congress"
newdoc = "text mining"

TF-IDF weights, shown as tf(tf*idf), with each term's IDF in the first row:

          text    mining  travel  map     search  engine  govern  president  congress
IDF       2.4     4.5     2.8     3.3     2.1     5.4     2.2     3.2        4.3
doc1      2(4.8)  1(4.5)                  1(2.1)  1(5.4)
doc2      1(2.4)          2(5.6)  1(3.3)
doc3                                                      1(2.2)  1(3.2)     1(4.3)
newdoc    1(2.4)  1(4.5)

Sim(newdoc, doc1) = 2.4*4.8 + 4.5*4.5
Sim(newdoc, doc2) = 2.4*2.4
Sim(newdoc, doc3) = 0

Web Structure Mining
- PageRank (Google '00)
- Clever (IBM '99)

Search Engine – Two Rank Functions
- Importance ranking: based on link structure analysis
- Relevance ranking: similarity based on content or text

The PageRank Algorithm
- Basic idea: the significance of a page is determined by the significance of the pages linking to it
  - Link i → j: i considers j important; the more important i is, the more important j becomes
  - If i has many out-links, each individual link counts for less
- PR(A) = p + (1 - p)(PR(T1)/C(T1) + … + PR(Tn)/C(Tn)), where T1…Tn are the pages linking to A and C(Ti) is the number of out-links of page Ti
  - p is the probability that the surfer gets bored and starts on a new random page
  - (1 - p) is the probability that the random surfer follows a link on the current page
- Intuition: PageRank can be seen as the probability that a "random surfer" visits a page
- Initially all importances PR = 1; the values are then refined iteratively
- Brin, S. and Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. In Proc. WWW Conference, pages 107–117
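A sketch of the iterative refinement, using the slide's formula directly on a made-up three-page graph (for simplicity, every page is assumed to have at least one out-link, so no dangling-node handling is needed):

```python
def pagerank(out_links, p=0.15, iterations=50):
    """Iteratively refine PR(A) = p + (1 - p) * sum(PR(T) / C(T))
    over all pages T that link to A."""
    pages = list(out_links)
    pr = {page: 1.0 for page in pages}          # initially all importances 1
    for _ in range(iterations):
        new = {}
        for page in pages:
            # sum over in-neighbours, each diluted by its out-degree C(T)
            rank = sum(pr[t] / len(out_links[t])
                       for t in pages if page in out_links[t])
            new[page] = p + (1 - p) * rank
        pr = new
    return pr

# Toy graph: a -> b, a -> c, b -> c, c -> a; c gets two in-links
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
pr = pagerank(links)
```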

The HITS Algorithm
- Hyperlink-Induced Topic Search (HITS). Kleinberg, J. M. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632
- Basic idea: sufficiently broad topics contain communities consisting of two types of hyperlinked pages:
  - Authority: best source for the requested info; highly-referenced pages on a topic
  - Hub: contains links to authoritative pages

The HITS Algorithm
- Collect a seed set of pages S (returned by a search engine)
- Expand the seed set to contain pages that point to or are pointed to by pages in the seed set (removing links inside a single site)
- Iteratively update the hub weight h(p) and authority weight a(p) for each page:
  - a(p) = Σ h(q), over all pages q with a link q → p
  - h(p) = Σ a(q), over all pages q with a link p → q
- After a fixed number of iterations, the pages with the highest hub/authority weights form the core of the community
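The update step can be sketched as follows. The seed collection and expansion are skipped, the link graph is made up, and the weights are normalized after each round so they stay bounded (a common convention not spelled out on the slide):

```python
import math

def hits(links, iterations=20):
    """links: {page: pages it points to}. Iteratively apply
    a(p) = sum of h(q) over links q -> p,
    h(p) = sum of a(q) over links p -> q."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = dict.fromkeys(pages, 1.0)
    auth = dict.fromkeys(pages, 1.0)
    for _ in range(iterations):
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ()))
                for p in pages}
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        for weights in (auth, hub):             # normalize each round
            norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
            for p in weights:
                weights[p] /= norm
    return auth, hub

# Two hub pages both linking to two authority pages
links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"]}
auth, hub = hits(links)
```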

Problems with Web Search Today
Today's search engines are plagued by problems:
- The abundance problem (99% of info is of no interest to 99% of people)
- Limited coverage of the Web (internet sources hidden behind search interfaces); the largest crawlers cover < 18% of all web pages
- Limited query interface, based on keyword-oriented search
- Limited customization to individual users

Problems with Web Search Today
Today's search engines are plagued by problems:
- The Web is highly dynamic: lots of pages are added, removed, and updated every day
- Very high dimensionality

Web Usage Mining
- Pages contain information; links are 'roads'
- How do people navigate the Internet? → Web Usage Mining (clickstream analysis)
- Information on navigation paths is available in log files
- Logs can be mined from a client or a server perspective

Website Usage Analysis
Why analyze Website usage? Knowledge about how visitors use a Website can:
- Provide guidelines for Website reorganization and help prevent disorientation
- Help designers place important information where visitors look for it
- Support pre-fetching and caching of web pages
- Provide an adaptive Website (personalization)
Questions that could be answered:
- What are the differences in usage and access patterns among users?
- Which user behaviors change over time?
- How do usage patterns change with quality of service (slow/fast)?
- What is the distribution of network traffic over time?

Website Usage Analysis

There are analysis services, such as Analog and Google Analytics, that give basic statistics such as:
- number of hits
- average hits per time period
- the most popular pages on your site
- who is visiting your site
- what keywords users are searching for to reach your site
- what is being downloaded

Web Usage Mining Process

Data Preparation (from Data Mining: Concepts and Techniques)
- Data cleaning: discard irrelevant entries by checking the suffix of the URL name, for example all log entries with filename suffixes such as gif or jpeg
- User identification
  - If a page is requested that is not directly linked to the previous pages, multiple users are assumed to exist on the same machine
  - Other heuristics use a combination of IP address, machine name, browser agent, and temporal information to identify users
- Transaction identification
  - All of the page references made by a user during a single visit to a site
  - The size of a transaction can range from a single page reference to all of the page references in the visit

Sessionizing
- Main questions:
  - How to identify unique users
  - How to identify/define a user transaction
- Problems:
  - User ids are often suppressed due to security concerns
  - Individual IP addresses are sometimes hidden behind proxy servers
  - Client-side and proxy caching makes server log data less reliable
- Standard solutions/practices:
  - User registration (practical?)
  - Client-side cookies (not foolproof)
  - Cache busting (increases network traffic)

Sessionizing
- Time oriented
  - By total duration of session: not more than 30 minutes
  - By page stay times (good for short sessions): not more than 10 minutes per page
- Navigation oriented (good for short sessions and when timestamps are unreliable)
  - Referrer is the previous page in the session, or
  - Referrer is undefined but the request is within 10 seconds, or
  - There is a link from the previous to the current page in the web site
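A sketch of time-oriented sessionizing with the two cutoffs above (30-minute total session duration, 10-minute page stay); the timestamps and URLs are made up, and the requests are assumed to belong to one already-identified user, sorted by time:

```python
from datetime import datetime, timedelta

def sessionize(requests, max_gap=timedelta(minutes=10),
               max_session=timedelta(minutes=30)):
    """Split one user's (timestamp, url) stream into sessions: start a
    new session when a page stay exceeds max_gap or the total session
    duration exceeds max_session."""
    sessions, current, start = [], [], None
    for ts, url in requests:
        if current and (ts - current[-1][0] > max_gap
                        or ts - start > max_session):
            sessions.append(current)    # close the current session
            current = []
        if not current:
            start = ts                  # first request of a new session
        current.append((ts, url))
    if current:
        sessions.append(current)
    return sessions

t0 = datetime(2016, 6, 13, 9, 0)
reqs = [(t0, "/"), (t0 + timedelta(minutes=5), "/a"),
        (t0 + timedelta(minutes=40), "/b")]
sessions = sessionize(reqs)   # the 35-minute page stay splits the stream
```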

Web Usage Mining: Different Types of Traversal Patterns
- Association rules: which pages are accessed together; support(X) = freq(X) / number of transactions
- Episodes: frequent partially ordered sets of pages; support(X) = freq(X) / number of time windows
- Sequential patterns: frequent ordered sets of pages; support(X) = freq(X) / number of sessions/customers
- Forward sequences: remove backward traversals, reloads, and refreshes; support(X) = freq(X) / number of forward sequences
- Maximal forward sequences: support(X) = freq(X) / number of clicks
- Clustering: user clusters (similar navigational behaviour), page clusters (grouping conceptually related pages)

Recommender Systems

Recommender Systems
- RS as an information filtering problem
- RS as a machine learning problem: seeks to predict the 'rating' that a user would give to an item she/he has not yet considered
- Enhance the user experience: assist users in finding information; reduce search and navigation time

Types of RS
Three broad types:
1. Content based RS
2. Collaborative RS
3. Hybrid RS

Types of RS – Content based RS
Content based RS highlights:
- Recommend items similar to those the user preferred in the past
- User profiling is the key
- Items/content usually denoted by keywords
- Matching "user preferences" with "item characteristics" … works for textual information
- Vector Space Model widely used

Types of RS – Content based RS
Content based RS - Limitations:
- Not all content is well represented by keywords, e.g. images
- Items represented by the same set of features are indistinguishable
- Users with thousands of purchases are a problem
- New user: no history available

Types of RS – Collaborative RS
Collaborative RS highlights:
- Use other users' recommendations (ratings) to judge an item's utility
- Key is to find users/user groups whose interests match those of the current user
- Vector Space model widely used (vector dimensions are user-specified ratings)
- More users and more ratings give better results
- Can account for items dissimilar to the ones seen in the past, too
- Example: Movielens.org

Types of Collaborative Filtering
- User-based collaborative filtering
- Item-based collaborative filtering

User-based Collaborative Filtering
- Idea: people who agreed in the past are likely to agree again
- To predict a user's opinion of an item, use the opinions of similar users
- Similarity between users is determined by the overlap in their opinions on other items

Example: User-based Collaborative Filtering

          Item 1  Item 2  Item 3  Item 4  Item 5
User 1      8       1       ?       2       7
User 2      2       ?       5       7       5
User 3      …
User 4      …
User 5      …
User 6      …

Similarity between users

          Item 1  Item 2  Item 3  Item 4  Item 5
User 1      8       1       ?       2       7
User 2      2       ?       5       7       5
User 5      …

- How similar are users 1 and 2?
- How similar are users 1 and 5?
- How do you calculate similarity?

Similarity between users: simple way

          Item 1  Item 2  Item 3  Item 4  Item 5
User 1      8       1       ?       2       7
User 2      2       ?       5       7       5

- Only consider items both users have rated
- For each such item, calculate the difference in the users' ratings
- Take the average of this difference over the items:
  average over items j rated by both users of | rating(User 1, Item j) - rating(User 2, Item j) |
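A sketch of this measure on the two users from the example table. Ratings are dicts keyed by hypothetical item ids; the "?" entries are simply absent, so the co-rated items fall out of the key intersection:

```python
def user_distance(ratings_a, ratings_b):
    """Mean absolute rating difference over co-rated items;
    a smaller distance means more similar users."""
    common = set(ratings_a) & set(ratings_b)   # items both users rated
    if not common:
        return None                            # no overlap: undefined
    return sum(abs(ratings_a[i] - ratings_b[i]) for i in common) / len(common)

user1 = {"item1": 8, "item2": 1, "item4": 2, "item5": 7}
user2 = {"item1": 2, "item3": 5, "item4": 7, "item5": 5}
d = user_distance(user1, user2)   # co-rated items: 1, 4, 5
```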

Algorithm 1: using the entire matrix
- Aggregation function: often a weighted sum
- Weight depends on similarity

Algorithm 2: K-Nearest-Neighbour
- Aggregation function: often a weighted sum
- Weight depends on similarity
- Neighbours are people who have historically had the same taste as our user

Item-based Collaborative Filtering
- Idea: a user is likely to have the same opinion of similar items [the same idea as in content-based filtering]
- Similarity between items is decided by looking at how other users have rated them [different from content-based filtering, where item features are used]
- Advantages (compared to user-based CF):
  - Prevents the user cold-start problem
  - Improves scalability (similarity between items is more stable than between users)

Example: Item-based Collaborative Filtering

          Item 1  Item 2  Item 3  Item 4  Item 5
User 1      8       1       ?       2       7
User 2      2       ?       5       7       5
User 3      …
User 4      …
User 5      …
User 6      …

Similarity between items

          Item 3  Item 4  Item 5
User 1      ?       2       7
User 2      5       7       5

- How similar are items 3 and 4?
- How similar are items 3 and 5?
- How do you calculate similarity?

Similarity between items: simple way

          Item 3  Item 4
User 1      ?       2
User 2      5       7

- Only consider users who have rated both items
- For each such user, calculate the difference in ratings for the two items
- Take the average of this difference over the users:
  average over users i who rated both items of | rating(User i, Item 3) - rating(User i, Item 4) |

Algorithms
- As in user-based CF: can use nearest neighbours or all items
- Aggregation function: often a weighted sum
- Weight depends on similarity
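A sketch of the weighted-sum aggregation for the user-based variant with k nearest neighbours. The similarity function here is one simple choice among many: it turns the mean-absolute-difference distance from the earlier slide into a weight via 1 / (1 + distance); all data is made up:

```python
def predict(target, others, item, k=2):
    """Predict target user's rating for item as a similarity-weighted
    average of the k most similar neighbours' ratings for that item."""
    def sim(a, b):
        common = set(a) & set(b)
        if not common:
            return 0.0
        mad = sum(abs(a[i] - b[i]) for i in common) / len(common)
        return 1.0 / (1.0 + mad)       # distance -> similarity weight

    # keep only neighbours who rated the item, most similar first
    neighbours = sorted(((sim(target, r), r) for r in others if item in r),
                        key=lambda x: -x[0])[:k]
    total = sum(s for s, _ in neighbours)
    if total == 0:
        return None                     # no usable neighbours
    return sum(s * r[item] for s, r in neighbours) / total

target = {"i1": 8, "i2": 1}
others = [{"i1": 7, "i2": 2, "i3": 9},  # very similar to target
          {"i1": 1, "i2": 8, "i3": 2}]  # very different
pred = predict(target, others, "i3")
```

Because the first neighbour is far more similar, the prediction lands much closer to its rating of 9 than to the dissimilar user's rating of 2.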

Types of RS – Collaborative RS
Collaborative RS - Limitations:
- Different users might use different rating scales. Possible solution: weighted ratings, i.e. deviations from the average rating
- Finding similar users/user groups is not very easy
- New user: no preferences available (user cold-start problem)
- New item: no ratings available (item cold-start problem)
- Demographic filtering is required

Some ways to make a Hybrid RS
- Weighted: ratings of several recommendation techniques are combined to produce a single recommendation
- Switching: the system switches between recommendation techniques depending on the current situation
- Mixed: recommendations from several different recommenders are presented simultaneously (e.g. Amazon)
- Cascade: one recommender refines the recommendations given by another

Model-based collaborative filtering
- Instead of using ratings directly, develop a model of user ratings
- Use the model to predict ratings for new items
- To build the model:
  - Bayesian network (probabilistic)
  - Clustering (classification)
  - Rule-based approaches (e.g. association rules between co-purchased items)