Ranking Web Sites with Real User Traffic Mark Meiss Filippo Menczer Santo Fortunato Alessandro Flammini Alessandro Vespignani Web Search and Data Mining.

Presentation transcript:

Ranking Web Sites with Real User Traffic Mark Meiss Filippo Menczer Santo Fortunato Alessandro Flammini Alessandro Vespignani Web Search and Data Mining Stanford, California February 11, 2008

Ranking Web Sites with Real User Traffic Mark Meiss Filippo Menczer Santo Fortunato Alessandro Vespignani Alessandro Flammini CNLL

Outline: Data collection; Structural properties; Behavioral patterns; PageRank validation; Temporal patterns

Sources for Ranking Data: The Link Graph

Sources for Ranking Data: Dynamic Sources (network flow data; Web server logs; toolbars and plugins)

Sources for Ranking Data: Web Server Logs

Sources for Ranking Data: Toolbars and Plugins

Sources for Ranking Data: Packet Inspection (ISP, ~100 K users)

Data Collection: HTTP (80), peak, anonymizer, GET requests from IU only. Recorded fields: Host, Path, Referer, User-Agent, Timestamp (h/p/r/a/t). Data sets: FULL (h/p/r/a/t) and HUMAN (h/p/r/a/t).
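
A minimal sketch of how one captured request could be turned into a directed click between hosts. It assumes the edge runs from the referer's host to the requested host, which the slide does not spell out. The `Request` tuple, `click_edge`, and the example hostnames are hypothetical names introduced for illustration only.

```python
from collections import namedtuple
from urllib.parse import urlsplit

# Hypothetical record layout mirroring the h/p/r/a/t fields on the slide.
Request = namedtuple("Request", "host path referer agent timestamp")


def click_edge(req):
    """Return (source_host, target_host).

    Assumption: the referer's host is the source of the click. The source is
    None when the referer is empty, i.e. the request did not follow a link.
    """
    source = urlsplit(req.referer).hostname if req.referer else None
    return source, req.host


# Made-up example request.
example = Request("www.example.org", "/index.html",
                  "http://search.example.com/?q=x", "Mozilla/5.0", 1202700000)
edge = click_edge(example)
```

Requests with an empty referer yield a `None` source, which is one way to separate jumps from link-following traffic.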


Structural properties: Degree

Caveat: Sampling Bias

Structural properties: Strength (Site Traffic)

Structural properties: Weights (Link Traffic)
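
The three structural quantities named on these slides (degree, strength as site traffic, weight as link traffic) can be illustrated on a toy click list. This is a sketch under the assumption that each click is a (source host, target host) pair; the hostnames and variable names are made up.

```python
from collections import Counter, defaultdict

# Toy click list of (source_host, target_host) pairs; hostnames are made up.
clicks = [("a", "b"), ("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]

# Weight (link traffic): number of clicks observed on each directed edge.
weights = Counter(clicks)

# Strength (site traffic): total clicks entering or leaving each site.
s_in, s_out = Counter(), Counter()
# Neighbor sets, used to derive the degree (number of distinct neighbors).
neighbors_in, neighbors_out = defaultdict(set), defaultdict(set)

for (src, dst), w in weights.items():
    s_out[src] += w
    s_in[dst] += w
    neighbors_out[src].add(dst)
    neighbors_in[dst].add(src)

in_degree = {host: len(n) for host, n in neighbors_in.items()}
out_degree = {host: len(n) for host, n in neighbors_out.items()}
```

Degree ignores how often a link is used, while strength sums the weights, so two sites with the same degree can carry very different traffic.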


Behavioral patterns (HUMAN) (Proportion of total out-strength)

Ratios are stable (figure; axis: Requests, x 10^6)

Ratios are stable

In-degree ~ PageRank; page traffic. Googlearchy: search engines amplify the rich-get-richer bias of the Web. Surfing without search engines: popularity reflects the rich-get-richer bias of the Web. Data: search mitigates the rich-get-richer bias of the Web (PNAS 2006).

Does search mitigate the rich-get-richer dynamics?


Validation of PageRank: PR is the stationary distribution of visit frequency for a modified random walk (with jumps) on the Web graph. Compare it with actual site traffic (in-strength). From an application perspective, we care about the resulting ranking of sites rather than the actual values.
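
As a reminder of what "random walk with jumps" means, here is a small power-iteration sketch of PageRank with uniform teleportation. The damping value 0.85, the uniform handling of dangling nodes, the function name, and the toy graph are assumptions for illustration; the slide does not state the parameters used in the study.

```python
def pagerank(out_links, damping=0.85, iterations=100):
    """Approximate the stationary visit frequencies of a random walk with jumps.

    out_links maps each node to the list of nodes it links to.
    damping=0.85 is a conventional choice, assumed here.
    """
    nodes = sorted(set(out_links) | {v for vs in out_links.values() for v in vs})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Teleportation: every node receives an equal share of the jumps.
        new = {v: (1.0 - damping) / n for v in nodes}
        for u in nodes:
            targets = out_links.get(u, [])
            if targets:
                # Each outgoing link is followed with equal probability.
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new[v] += share
            else:
                # Dangling node: spread its rank uniformly (an assumption).
                for v in nodes:
                    new[v] += damping * rank[u] / n
        rank = new
    return rank


# Toy graph with made-up node names.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
pr = pagerank(graph)
```

The uniform choices inside the loop are the modeling assumptions that can be checked against measured traffic.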

Kendall’s Rank Correlation
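
A short sketch of Kendall's rank correlation between two score lists, for example PageRank values and measured traffic for the same sites. This is the plain tau-a variant (tied pairs count as neither concordant nor discordant), chosen for brevity; the slide does not say which variant was used, and the numbers below are invented.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1   # both lists order the pair the same way
        elif sign < 0:
            discordant += 1   # the lists disagree on the pair's order
    pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / pairs


# Invented scores for four sites, listed in the same site order.
pagerank_scores = [0.40, 0.30, 0.20, 0.10]
traffic = [900, 1200, 300, 100]
tau = kendall_tau(pagerank_scores, traffic)
```

Because only the ordering matters, tau compares the rankings produced by the two measures rather than their actual values.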

PageRank Assumptions: 1. Equal probability of teleporting to each of the nodes. 2. Equal probability of teleporting from each of the nodes. 3. Equal probability of following each link from any given node.

Kendall’s Rank Correlation

Local Link Heterogeneity: HH index of concentration or disparity, ranging from perfect homogeneity to perfect concentration
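
The formula on this slide did not survive in the transcript. Assuming "HH" refers to the Herfindahl-Hirschman index in its usual form (the sum of squared traffic shares over a site's outgoing links), a sketch looks like this; the function name and the sample weights are made up.

```python
def hh_index(link_weights):
    """Sum of squared shares of a site's outgoing link traffic.

    Equals 1/k when k links carry equal traffic (perfect homogeneity)
    and approaches 1 when one link carries almost all of it
    (perfect concentration).
    """
    total = sum(link_weights)
    return sum((w / total) ** 2 for w in link_weights)


homogeneous = hh_index([10, 10, 10, 10])   # four equally used links
concentrated = hh_index([97, 1, 1, 1])     # one dominant link
```

Under PageRank's third assumption every site would sit at the homogeneous end, so values near 1 indicate sites where that assumption fails.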

Teleportation Target Heterogeneity

Teleportation Source Heterogeneity (“hubness”) (figure; regions: s_out < s_in and s_out > s_in; labels: teleport sources, browsing sinks, popular hubs; annotation: -2)

Navigation vs. Jumps: Sources of Popularity


How predictable are traffic patterns? -- Cache refreshing (e.g. proxies) -- Capacity allocation (e.g. peering and provisioning for spikes) -- Site design (e.g. expose content based on time of day)

Temporal patterns: predict the future host graph (clicks) from the current one, as a function of delay, using generalized temporal precision and recall
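
The slide's definition of generalized temporal precision and recall is not preserved in the transcript. The sketch below shows one plausible weighted generalization, not necessarily the authors': the current graph is treated as the prediction, and the overlap on each edge is the smaller of its current and future click counts. All names and numbers are hypothetical.

```python
def temporal_precision_recall(current, future):
    """Compare two weighted edge maps {(source, target): clicks}.

    Assumed definitions:
      precision = overlapping clicks / clicks in the current (predicted) graph
      recall    = overlapping clicks / clicks in the future (actual) graph
    """
    overlap = sum(min(w, future.get(edge, 0)) for edge, w in current.items())
    precision = overlap / sum(current.values())
    recall = overlap / sum(future.values())
    return precision, recall


# Made-up host graphs at time t and at time t + delay.
graph_now = {("a", "b"): 4, ("a", "c"): 2}
graph_later = {("a", "b"): 3, ("b", "c"): 1}
precision, recall = temporal_precision_recall(graph_now, graph_later)
```

Repeating the comparison for increasing delays gives the predictability curve as a function of delay.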

HUMAN host graph (FULL is about 10% more predictable)

Summary: Heterogeneity in incoming and outgoing site traffic and in link traffic. Less than half of traffic is from following links. Only 5% of traffic is directly from search engines. High temporal regularity. PageRank is a poor predictor of traffic: the random walk and random teleportation assumptions are violated.

Next: Sampling bias and search bias. From host graph to page graph. Modeling traffic: beyond random walk?

THANKS! Mark Meiss Filippo Menczer Santo Fortunato Alessandro Vespignani Alessandro Flammini CNLL?