Center for E-Business Technology Seoul National University Seoul, Korea BrowseRank: letting the web users vote for page importance Yuting Liu, Bin Gao,


Center for E-Business Technology, Seoul National University, Seoul, Korea

BrowseRank: letting the web users vote for page importance
Yuting Liu, Bin Gao, Tie-Yan Liu, Ying Zhang, Zhiming Ma, Shuyuan He, Hang Li
SIGIR
Summarized & presented by Babar Tareen, IDS Lab., Seoul National University

Copyright 2008 by CEBT

Introduction
- Page importance is a key factor for web search
- Currently, page importance is measured using the link graph
  - HITS
  - PageRank
- If many important pages link to a page, then that page is also likely to be important

PageRank Drawbacks
- The link graph is not reliable
  - Links can easily be created and deleted on the web
  - It can easily be manipulated by web spammers using link farms
- PageRank does not consider the length of time that a web surfer spends on a web page

BrowseRank
- Utilizes the user browsing graph
  - Generated from user behavior data
  - Behavior data can be recorded by Internet browsers at web clients and collected at web servers
  - Behavior data includes
    - URL
    - Time
    - Method of visiting (URL input or hyperlink click)

BrowseRank (2)
- More visits to a page and a longer time spent on it indicate that the page is important
- Uses a continuous-time Markov process as the model on the user browsing graph
- A Markov process is a process in which the likelihood of a given future state, at any given moment, depends only on its present state, and not on any past states

[Diagram: Past, Present, Future]

Originality
- Proposes the use of the browsing graph for computing page importance
- Proposes the use of a continuous-time Markov process to model a random walk on the user browsing graph

User Behavior Data
- When a user surfs the web, the user can
  - Input a URL
  - Click on a hyperlink
- Behavior data can be stored as triples

User Behavior Data (2)
- Session segmentation
  - Time rule: if the time of the current record is 30 minutes or more after that of the previous record, the current record starts a new session
  - Type rule: if the type of the record is 'INPUT', the record starts a new session
- URL pair construction
  - Within a session, the URLs of each two adjacent records form a URL pair
  - A pair indicates that the user transits from the first page to the second page
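The two segmentation rules and the pair construction can be sketched as follows. This is a minimal illustration, not the authors' code: the `Record` class, its field names, the `"INPUT"`/`"CLICK"` labels, and the use of seconds as the time unit are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple

SESSION_GAP_SECONDS = 30 * 60  # time-rule threshold (30 minutes)


@dataclass
class Record:
    """One behavior record; field names are hypothetical."""
    url: str
    time: float  # seconds since an arbitrary epoch (assumed unit)
    type: str    # "INPUT" (URL typed) or "CLICK" (hyperlink), assumed labels


def segment_sessions(records: List[Record]) -> List[List[Record]]:
    """Split one user's time-ordered records into sessions."""
    sessions: List[List[Record]] = []
    for rec in records:
        starts_new = (
            not sessions
            or rec.type == "INPUT"  # type rule
            or rec.time - sessions[-1][-1].time >= SESSION_GAP_SECONDS  # time rule
        )
        if starts_new:
            sessions.append([rec])
        else:
            sessions[-1].append(rec)
    return sessions


def url_pairs(session: List[Record]) -> List[Tuple[str, str]]:
    """Pair the URLs of adjacent records: (from_page, to_page)."""
    return [(a.url, b.url) for a, b in zip(session, session[1:])]
```

A session with a single record yields no URL pairs, since there is no transition to record.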

User Behavior Data (3)
- Reset probability estimation
  - For sessions segmented by the type rule, the first URL is input by the user
  - Assign reset probabilities to those URLs
- Staying time extraction
  - For each URL pair, use the time difference between the second and the first page as the staying time of the first page
  - For the last page of a session, either use a random time (time rule) or the time difference to the next session (type rule)
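A rough sketch of both estimation steps is given below. Several details are assumptions, since the slide does not spell them out: reset probabilities are taken as normalized counts of session-opening 'INPUT' URLs, the "random time" is sampled from the staying times already observed, and a session boundary counts as a time-rule boundary whenever the gap is 30 minutes or more. Records are plain `(url, time, type)` tuples with times in seconds.

```python
import random
from collections import Counter, defaultdict

GAP = 30 * 60  # time-rule threshold in seconds


def reset_probabilities(sessions):
    """Normalized counts of URLs that open a session via 'INPUT' (assumed estimator)."""
    counts = Counter(s[0][0] for s in sessions if s and s[0][2] == "INPUT")
    total = sum(counts.values())
    return {url: c / total for url, c in counts.items()} if total else {}


def staying_times(sessions, rng=None):
    """Map each URL to a list of staying times for one user's ordered sessions."""
    rng = rng or random.Random(0)
    times = defaultdict(list)
    # Within a session: staying time = time of next record - time of this record.
    for s in sessions:
        for (url, t, _), (_, t_next, _) in zip(s, s[1:]):
            times[url].append(t_next - t)
    observed = [d for ds in times.values() for d in ds]
    # Last page of each session: heuristic depends on how the session ended.
    for s, nxt in zip(sessions, sessions[1:] + [None]):
        last_url, last_time, _ = s[-1]
        if nxt is not None and nxt[0][1] - last_time < GAP:
            # Type-rule boundary: use the gap to the next session's first record.
            times[last_url].append(nxt[0][1] - last_time)
        elif observed:
            # Time-rule boundary (or no next session): sample an observed time.
            times[last_url].append(rng.choice(observed))
    return dict(times)
```

Passing a seeded `random.Random` keeps the sampled staying times reproducible between runs.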

User Browsing Graph
- Vertex: represents a URL
  - Metadata: reset probability, staying time
- Directed edge: represents a transition between pages
- Edge weight: number of transitions
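One possible in-memory layout for this graph is sketched below. The class and method names are hypothetical, and turning edge weights into transition probabilities by normalizing the outgoing counts is an assumption about how the weights are used, not something stated on the slide.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class BrowsingGraph:
    """Hypothetical container for the user browsing graph."""
    edge_weight: Counter = field(default_factory=Counter)  # (src, dst) -> number of transitions
    reset_prob: dict = field(default_factory=dict)         # url -> reset probability
    staying_times: dict = field(default_factory=dict)      # url -> list of observed staying times

    def add_transition(self, src: str, dst: str) -> None:
        """Record one observed transition src -> dst."""
        self.edge_weight[(src, dst)] += 1

    def vertices(self) -> set:
        """All URLs that appear on an edge or carry metadata."""
        urls = set(self.reset_prob) | set(self.staying_times)
        for src, dst in self.edge_weight:
            urls.update((src, dst))
        return urls

    def transition_probs(self, src: str) -> dict:
        """Outgoing edge weights of src, normalized to sum to 1 (assumed usage)."""
        out = {d: w for (s, d), w in self.edge_weight.items() if s == src}
        total = sum(out.values())
        return {d: w / total for d, w in out.items()} if total else {}
```

Feeding every URL pair from the sessions into `add_transition` yields the edge weights; a page with no outgoing transitions gets an empty probability map.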

Model
- Continuous-time, time-homogeneous Markov process model
- Assumptions
  - Independence of users and sessions
  - Markov property
  - Time-homogeneity

Continuous-time Markov Model
- X_s represents the page that the surfer is visiting at time s, s > 0
- Continuous-time, time-homogeneous Markov process
- P_ij(t) denotes the transition probability from page i to page j for time interval t
- The stationary probability distribution π is unique and independent of t
- Computing the matrix P is difficult because it is hard to get information for all time intervals
- The algorithm is based on [formula not captured in the transcript]
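The formula the slide refers to is not in the transcript, so the sketch below relies only on a standard result for continuous-time Markov chains: if the embedded (jump) chain has stationary distribution π̃ and page i has mean staying time T_i, then π_i is proportional to π̃_i · T_i. Whether this matches the slide's missing formula exactly is an assumption. The damping factor `alpha` and the mixing with reset probabilities are likewise assumptions borrowed from PageRank-style power iteration.

```python
def embedded_stationary(P, reset, alpha=0.85, iters=200):
    """Power iteration on the jump chain mixed with reset probabilities.

    P      -- row-stochastic transition matrix as a list of lists
    reset  -- reset probability per page (sums to 1)
    alpha  -- assumed damping factor
    """
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        nxt = [(1 - alpha) * reset[j] for j in range(n)]
        for i in range(n):
            for j in range(n):
                nxt[j] += alpha * pi[i] * P[i][j]
        pi = nxt
    return pi


def stationary_with_staying_time(pi_embedded, mean_stay):
    """Weight each page by its mean staying time and renormalize."""
    weighted = [p * t for p, t in zip(pi_embedded, mean_stay)]
    total = sum(weighted)
    return [w / total for w in weighted]
```

In a two-page graph where the surfer alternates between the pages, the jump chain gives each page the same probability, and the page with the longer mean staying time ends up with the higher final score.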

Algorithm

Experiments
- Website-level BrowseRank
  - Finding important websites and demoting spam sites
- Page-level BrowseRank
  - Improving relevance ranking
- Dataset
  - 3 billion records
  - 950 million unique URLs
  - Website-level graph
    - 5.6 million vertices
    - 53 million edges
    - 40 million websites

Top-20 Websites

Spam fighting
- 2,714 websites were labeled as spam by human experts

Page Level Testing
- Adopted 3 measures to evaluate performance
  - Mean Average Precision (MAP)
  - Precision
  - Normalized Discounted Cumulative Gain (NDCG)
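For reference, the three measures can be computed as in the sketch below. The slide does not give the exact variants used, so the choices here are assumptions: precision is cut off at rank k, average precision is normalized by the number of relevant results in the ranked list, and DCG uses the common `2^rel - 1` gain with a log2 discount. Input is a list of relevance grades in ranked order.

```python
import math


def precision_at_k(rels, k):
    """Fraction of the top-k results that are relevant (grade > 0)."""
    return sum(1 for r in rels[:k] if r > 0) / k


def average_precision(rels):
    """Mean of precision values at the ranks of the relevant results."""
    hits, total = 0, 0.0
    for rank, r in enumerate(rels, start=1):
        if r > 0:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(rels_per_query):
    """MAP: average of average_precision over all queries."""
    return sum(average_precision(r) for r in rels_per_query) / len(rels_per_query)


def dcg_at_k(rels, k):
    """Discounted cumulative gain with gain 2^rel - 1 (one common variant)."""
    return sum((2 ** r - 1) / math.log2(i + 1) for i, r in enumerate(rels[:k], start=1))


def ndcg_at_k(rels, k):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal else 0.0
```

All three measures reward rankings that place relevant pages near the top; NDCG additionally distinguishes between grades of relevance.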

Results (1)

Results (2)

Technical Issues
- User behavior data tends to be sparse
- User behavior data can lead to reliable importance calculation for the head web pages, but not for the tail web pages
- The time-homogeneity assumption is mainly for technical convenience
- Content information and metadata were not used in BrowseRank

Discussion
- A better approach to finding page importance
- The paper already highlights its own technical issues
- Spammers can alter BrowseRank by sending fake user behavior data. This would also be easy, since behavior data is collected from the client.