User Browsing Graph: Structure, Evolution and Application Yiqun Liu, Yijiang Jin, Min Zhang, Shaoping Ma, Liyun Ru State Key Lab of Intelligent Technology.

Presentation transcript:

User Browsing Graph: Structure, Evolution and Application Yiqun Liu, Yijiang Jin, Min Zhang, Shaoping Ma, Liyun Ru State Key Lab of Intelligent Technology and Systems Tsinghua University, Beijing, China 2009/02/10

Search Engine vs. Users
How many pages can a search engine provide?
– 1 trillion pages in the index (official Google blog, 2008/07)
How many pages can users consume?
– 235 M searches per day for Google (comScore, 2008/07)
– 7 billion searches per month
– Even if all searches were unique (NOT possible!), tens of billions of pages could meet all user requests
– For the foreseeable future, what people can consume is millions, not billions, of pages (Mei et al., WSDM 2008)
Page quality estimation is important for all search engines

Web Page Quality Estimation
Previous research
– Hyperlink analysis algorithms
  PageRank, Topic-Sensitive PageRank, TrustRank …
– Two assumptions proposed by Craswell et al. (2001), each illustrated with a link from page A to page B
  Recommendation
  Topic locality

Web Page Quality Estimation
The Web graph may be misleading

Web Page Quality Estimation
Improve with the help of user behavior analysis
– Implicit feedback information from Web users
– Objective and reliable, without interrupting users
– Information source: Web access log
  Record of users' Web browsing history
  Mining the search trails of surfing crowds: identifying relevant websites from user activity (Bilenko et al., WWW 2008)
  BrowseRank: letting web users vote for page importance (Liu et al., SIGIR 2008)

Web Page Quality Estimation
Construct a user browsing graph from the Web access log
– Hyperlink graph filtering
– The user-accessed part is more reliable

Web access log
Data preparation
– Collected with the help of a commercial search engine in China, using browser toolbar software
– Collected from Aug. 3rd, 2008 to Oct. 6th, 2008
– Over 2.8 billion click-through events

Name              Description
Session ID        A randomly assigned ID for each user session
Source URL        URL of the page which the user is visiting
Destination URL   URL of the page which the user navigates to
Time Stamp        Date/time of the click event

Construction of User Browsing Graph
Construction process
For each record in the Web access log, if the source URL is A and the destination URL is B, then
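The rest of the construction rule is not preserved in this transcript. A minimal sketch of one plausible reading, assuming that each record adds the sites of A and B as vertices and adds (or strengthens) a directed edge from A to B. The function names, the use of the URL host as the site-level vertex, and the click-count edge weight are all assumptions for illustration, not details taken from the slides.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def site_of(url):
    """Return the lower-cased host of a URL (assumed here to identify a site)."""
    return urlsplit(url).netloc.lower()


def build_browsing_graph(records):
    """Build (vertices, edge_weights) from (source_url, destination_url) pairs.

    Assumption: every record contributes one click to the edge A -> B,
    so the weight of an edge is the number of times it was followed.
    """
    vertices = set()
    edge_weights = defaultdict(int)
    for source_url, destination_url in records:
        a, b = site_of(source_url), site_of(destination_url)
        if not a or not b:
            continue  # skip records whose URLs have no usable host
        vertices.update((a, b))
        edge_weights[(a, b)] += 1
    return vertices, dict(edge_weights)
```

Whether intra-site clicks (A and B on the same site) should be kept as self-loops is left open here; this sketch keeps them.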

Structure of User Browsing Graph
User Browsing Graph UG(V,E)
– Constructed with the Web access log collected by a search engine from Aug. 3rd to Sept. 2nd
– Vertex set: 4,252,495 Web sites
– Edge set: 10,564,205 edges
– Much smaller than the whole hyperlink graph
– Possible to perform PageRank/TrustRank within a few hours (very efficient!)
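To make the PageRank step concrete, here is a minimal power-iteration sketch over a directed edge list. It is a textbook formulation, not the authors' implementation; the damping factor 0.85, the fixed iteration count, and the even redistribution of rank from vertices without out-links are assumptions.

```python
def pagerank(vertices, edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank on a directed graph.

    vertices: iterable of hashable vertex ids
    edges: iterable of (source, destination) pairs between those vertices
    """
    nodes = list(vertices)
    n = len(nodes)
    out_links = {v: [] for v in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by vertices without out-links is spread evenly over all vertices.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            if out_links[v]:
                share = damping * rank[v] / len(out_links[v])
                for dst in out_links[v]:
                    new_rank[dst] += share
        rank = new_rank
    return rank
```

Each iteration touches every edge once, so the cost grows with the number of edges; that is why a graph with about 10.5 million edges is far cheaper to process than one with billions of pages.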

Structure of User Browsing Graph
Comparison: Hyperlink Graph HG(V,E)
– Same vertex set as UG(V,E)
– Edge set: extracted from a hyperlink graph composed of over 3 billion Web pages

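Structure of User Browsing Graph
[Figure: overlap between the edge sets of the hyperlink graph and the user browsing graph]
– Hyperlink graph: 139M edges; user browsing graph: 10.5M edges
– 2.6M edges appear in both graphs (24.53% of the user browsing graph, 1.86% of the hyperlink graph)
– Edges only in the hyperlink graph: links not clicked by users
– Edges only in the user browsing graph: search engine result page links, links in protected sessions, links which are not crawled
– Part of the user browsing graph is the user-accessed part of the hyperlink graph
– The user browsing graph also contains some other important information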
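The overlap figures on this slide are shares of a common edge set relative to each graph. A small sketch of that computation, assuming both graphs are available as sets of (source, destination) pairs; the function name and the returned keys are hypothetical.

```python
def edge_overlap(ug_edges, hg_edges):
    """Return the size of the shared edge set and its share of each graph."""
    ug, hg = set(ug_edges), set(hg_edges)
    shared = ug & hg
    return {
        "shared": len(shared),
        "share_of_ug": len(shared) / len(ug),
        "share_of_hg": len(shared) / len(hg),
    }
```

With the rounded counts from the slide, 2.6M / 10.5M is about 24.8% and 2.6M / 139M is about 1.87%, close to the reported 24.53% and 1.86%, which were presumably computed from the unrounded counts.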

Evolution of User Browsing Graph
Why should we look into the evolution over time?
– Can information collected during the first N days cover most user requests on the (N+1)th day?
[Figure: timeline. Browsing info on the 1st day, new info on the 2nd day, new info on the 3rd day, and so on up to new info on the Nth day make up the user browsing graph constructed with information from the first N days. User requests on the (N+1)th day may reach pages without previous browsing information.]

Evolution of User Browsing Graph
What percentage of vertices newly appears each day?
Most of these pages are of low quality and few users visit them (>80% of them are visited only once per day)
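One way to measure this, sketched under the assumption that the vertices seen on each day are available as one set per day, in chronological order. The slide does not state the denominator; this sketch divides by the number of vertices visited on that day, which is an assumption.

```python
def new_vertex_ratio_by_day(daily_vertices):
    """For each day, return the fraction of that day's vertices not seen on any earlier day.

    daily_vertices: list of sets of vertex ids, one set per day, oldest first.
    """
    seen = set()
    ratios = []
    for todays in daily_vertices:
        if todays:
            ratios.append(len(todays - seen) / len(todays))
        else:
            ratios.append(0.0)  # no traffic that day, so nothing new
        seen |= todays
    return ratios
```

A graph that is becoming stable shows this ratio falling toward a small, roughly constant value after the first days.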

Evolution of User Browsing Graph
Evolution of the graph
– It takes tens of days to construct a stable graph
– After that, only a small part of the graph changes each day, and newly appeared pages are mostly unimportant
– A user browsing graph constructed with data collected from the first N days can be adopted for the (N+1)th day

Page Quality Estimation
Experiment settings
– Performance of page quality estimation
– How do traditional algorithms (PageRank / TrustRank) perform on the user browsing graph?
– Is it possible to use the user browsing graph to replace the hyperlink graph?
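For readers unfamiliar with TrustRank, its propagation step can be viewed as PageRank whose random jump lands only on a set of trusted seed pages. The sketch below covers only that propagation step and assumes the seeds are already given; seed selection is not shown. The parameter values and the choice to return trust from vertices without out-links to the seeds are assumptions.

```python
def trustrank(vertices, edges, trusted_seeds, damping=0.85, iterations=50):
    """Propagate trust from seed vertices along directed edges (biased PageRank)."""
    nodes = list(vertices)
    seed_set = set(trusted_seeds) & set(nodes)
    if not seed_set:
        raise ValueError("at least one trusted seed must be a vertex of the graph")

    # Jump distribution: uniform over the seeds, zero everywhere else.
    teleport = {v: (1.0 / len(seed_set) if v in seed_set else 0.0) for v in nodes}
    out_links = {v: [] for v in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    trust = dict(teleport)
    for _ in range(iterations):
        # Trust held by vertices without out-links flows back to the seeds.
        dangling = sum(trust[v] for v in nodes if not out_links[v])
        new_trust = {v: (1.0 - damping + damping * dangling) * teleport[v] for v in nodes}
        for v in nodes:
            if out_links[v]:
                share = damping * trust[v] / len(out_links[v])
                for dst in out_links[v]:
                    new_trust[dst] += share
        trust = new_trust
    return trust
```

Vertices that cannot be reached from any seed keep a trust score of zero, which is what makes this kind of score useful for separating spam from reputable sites.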

Page Quality Estimation
Graph construction
– How do PageRank/TrustRank perform on these graphs?

Graph: Description
User Graph UG(V,E): Constructed with Web access data from Aug. 3rd to Sept. 2nd, 2008.
Hyperlink Graph extracted-HG(V,E): Vertices are from UG(V,E). Edges among them are extracted from hyperlink relations in whole-HG(V,E).
Combined Graph CG(V,E): Vertices are from UG(V,E). Edges among them are from UG(V,E) combined with those from extracted-HG(V,E).
Hyperlink Graph whole-HG(V,E): Constructed with over 3 billion pages (all pages in a certain search engine's index) and all hyperlinks among them.

Notes on the slide: same vertex set (user-accessed part); each represents a kind of user browsing graph.
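The extracted and combined graphs described above amount to simple set operations on edge sets. A minimal sketch, assuming edges are (source, destination) pairs and that "combined" means the union of the two edge sets; the function names are hypothetical.

```python
def extract_subgraph_edges(hyperlink_edges, vertex_set):
    """Keep only hyperlink edges whose two endpoints both lie in vertex_set."""
    return {(a, b) for a, b in hyperlink_edges if a in vertex_set and b in vertex_set}


def combine_edges(ug_edges, extracted_hg_edges):
    """Union of user browsing edges and extracted hyperlink edges."""
    return set(ug_edges) | set(extracted_hg_edges)
```

Because all three derived graphs share the vertex set of UG(V,E), any difference in ranking quality between them comes from the edge sets alone.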

Page Quality Estimation
Performance evaluation
– Metrics: ROC/AUC, pairwise orderedness accuracy
– Test set:

Page Type          Amount   Percentage
High Quality                %
Low Quality                 %
N/A pages          57       9.05%
Spam               22       3.49%
Non-GB2312 Pages            %
Illegal Pages               %
Total              630
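As a reminder of what the AUC metric measures, the sketch below computes it directly from its pairwise definition: the probability that a randomly chosen positive example (for instance a high quality page) receives a higher score than a randomly chosen negative one, with ties counted as half. This is a generic illustration, not the evaluation code used in the study, and the quadratic loop is only suitable for small test sets such as the one above.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve from scores and binary labels (truthy = positive)."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # a tie counts as half a correct ordering
    return wins / (len(positives) * len(negatives))
```

An AUC of 0.5 corresponds to random ordering and 1.0 to a ranking that places every positive page above every negative page.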

Experimental Results
[Table: high quality page identification. Rows: UG(V,E), extracted-HG(V,E), CG(V,E), whole-HG(V,E). Columns: PageRank, TrustRank.]
[Table: spam/illegal page identification. Rows: UG(V,E), extracted-HG(V,E), CG(V,E), whole-HG(V,E). Columns: PageRank, TrustRank.]
– User browsing graph TrustRank performs better
– Change in edge set doesn't affect much
– Combination of edge sets sometimes helps

Experimental Results
Pairwise orderedness accuracy test
– First proposed by Gyöngyi et al.
– 700 pairs of Web sites: [A, B], Q(A) > Q(B)
– Annotated by product managers from a survey company
– Performance of the PageRank algorithm on these graphs
[Table: pairwise orderedness accuracy. Rows: UG(V,E), extracted-HG(V,E), CG(V,E), whole-HG(V,E).]
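The pairwise test can be sketched as follows: for every annotated pair [A, B] with Q(A) > Q(B), check whether the algorithm's score for A is also higher than its score for B. The function name is hypothetical, and counting tied scores as errors is an assumption; the original definition may treat ties differently.

```python
def pairwise_orderedness_accuracy(scores, ordered_pairs):
    """Fraction of annotated pairs (a, b), with a judged better than b, that the scores order correctly.

    scores: dict mapping a site to its score (for example its PageRank value)
    ordered_pairs: list of (a, b) tuples where annotators judged Q(a) > Q(b)
    """
    if not ordered_pairs:
        raise ValueError("need at least one annotated pair")
    correct = sum(1 for a, b in ordered_pairs if scores[a] > scores[b])
    return correct / len(ordered_pairs)
```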

Conclusions
Important findings
– The user browsing graph can be regarded as the user-accessed part of the Web, but it also contains information usually not collected by search engines.
– The user browsing graph is significantly smaller than the whole hyperlink graph.
– A user browsing graph constructed with logs collected from the first N days can be adopted for the (N+1)th day.
– Traditional link analysis algorithms perform better on the user browsing graph than on the hyperlink graph.

Future work
– How will query-dependent link analysis algorithms (e.g. HITS) perform on the user browsing graph?
– What happens if we extract anchor text information from the user browsing graph and adopt it in retrieval?
– …

Thank you!

Evolution of User Browsing Graph
Why should we look into the evolution over time?
– It takes time to …
  Construct a user browsing graph
  Calculate page importance scores
– During this time period,
  New pages may appear
  People may visit new pages
  These pages are not included in the browsing graph

Structure of User Browsing Graph
Sites with the largest out-degrees in HG(V,E)
[Table columns: Rank, URL, out-degree in HG(V,E), out-degree in UG(V,E)]
1  cang.baidu.com
2  cache.baidu.com
3  zhidao.baidu.com
4  www.mapbar.com
5  blog.sina.com.cn
6  sq.qq.com
7  shuqian.qq.com
8  shuqian.soso.com
9  tieba.baidu.com
10 map.sogou.com

Structure of User Browsing Graph
Sites with the largest out-degrees in UG(V,E)
[Table columns: Rank, URL, out-degree in HG(V,E), out-degree in UG(V,E)]
imgcache.qq.com
zhidao.baidu.com
blog.163.com
image.baidu.com

Structure of User Browsing Graph
Search engine oriented edges

Search Engine   Number of Edges in UG(V,E)
Baidu           1,518,109
Google          1,169,647
Sogou           291,829
Soso            147,034
Yahoo           143,860
Gougou          47,099
Yodao           24,171
Total           3,341,749 (41.92%)