Web as Network: A Case Study
Networked Life, CIS 112, Spring 2010
Prof. Michael Kearns
The Web as Network
Consider the web as a network
–vertices: individual (html) pages
–edges: hyperlinks between pages
–will view as both a directed and undirected graph
What is the structure of this network?
–connected components
–degree distributions
–etc.
What does it say about the people building and using it?
–page and link generation
–visitation statistics
What are the algorithmic consequences?
–web search
–community identification
Graph Structure in the Web [Broder et al. paper]
Report on the results of two massive “web crawls”
Executed by AltaVista in May and October 1999
Details of the crawls:
–automated script following hyperlinks (URLs) from pages found
–large set of starting points collected over time
–crawl implemented as breadth-first search
–have to deal with webspam, infinite paths, timeouts, duplicates, etc.
May ’99 crawl:
–200 million pages, 1.5 billion links
Oct ’99 crawl:
–271 million pages, 2.1 billion links
Unaudited, self-reported Sep ’03 stats:
–3 major search engines claim > 3 billion pages indexed
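
The breadth-first crawl described on this slide can be sketched on a toy link graph held in memory. This is only an illustration, not the AltaVista crawler: the names `bfs_crawl`, `toy_web`, and `max_pages` are made up here, no network requests are issued, and the page cap is just one simple stand-in for the guards against infinite paths mentioned above.

```python
from collections import deque

def bfs_crawl(links, seeds, max_pages=1000):
    """Visit pages breadth-first; links maps each page to the pages it links to."""
    start = list(dict.fromkeys(seeds))  # drop repeated seeds, keep their order
    seen = set(start)                   # duplicate detection: enqueue each page once
    queue = deque(start)
    order = []                          # pages in the order they were visited
    while queue and len(order) < max_pages:  # cap the crawl (hypothetical guard)
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order

# Hypothetical four-page web: a and c link to each other through b.
toy_web = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}
crawl_order = bfs_crawl(toy_web, ["a"])
print(crawl_order)
```

Because the queue is first-in first-out, pages one link from the seeds are visited before pages two links away, which is what makes the traversal breadth-first.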
Five Easy Pieces
Authors did two kinds of breadth-first search:
–ignoring link direction → weak connectivity
–only following forward links → strong connectivity
They then identify five different regions of the web:
–strongly connected component (SCC): can reach any page in SCC from any other in directed fashion
–component IN: can reach any page in SCC in directed fashion, but not reverse
–component OUT: can be reached from any page in SCC, but not reverse
–component TENDRILS: weakly connected to all of the above, but cannot reach SCC or be reached from SCC in directed fashion (e.g. pointed to by IN)
–SCC+IN+OUT+TENDRILS form weakly connected component (WCC)
–everything else is called DISC (disconnected from the above)
–here is a visualization of this structure
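
The five regions can be computed with three breadth-first searches from a single page, as in the sketch below. It assumes the starting page `seed` is already known to lie in the SCC, and it is not necessarily the procedure the authors used. The names `reach`, `bow_tie`, and the toy edge list are illustrative only.

```python
from collections import defaultdict, deque

def reach(adj, start):
    """Return the set of vertices reachable from start by BFS over adj."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def bow_tie(edges, seed):
    """Split the vertices into the five regions; seed is assumed to be in the SCC."""
    fwd, bwd, und = defaultdict(set), defaultdict(set), defaultdict(set)
    nodes = set()
    for u, v in edges:
        fwd[u].add(v)              # forward links
        bwd[v].add(u)              # the same links reversed
        und[u].add(v)              # direction ignored
        und[v].add(u)
        nodes.update((u, v))
    forward = reach(fwd, seed)     # pages the seed can reach
    backward = reach(bwd, seed)    # pages that can reach the seed
    wcc = reach(und, seed)         # pages weakly connected to the seed
    scc = forward & backward
    in_part = backward - scc
    out_part = forward - scc
    tendrils = wcc - scc - in_part - out_part
    disc = nodes - wcc
    return {"SCC": scc, "IN": in_part, "OUT": out_part,
            "TENDRILS": tendrils, "DISC": disc}

# Hypothetical toy web with one or more pages in every region.
toy_edges = [
    ("s1", "s2"), ("s2", "s3"), ("s3", "s1"),  # a directed cycle: the SCC
    ("i1", "s1"),                              # i1 reaches the SCC: IN
    ("s2", "o1"),                              # o1 is reached from the SCC: OUT
    ("i1", "t1"), ("t2", "o1"),                # hang off IN and OUT: TENDRILS
    ("d1", "d2"),                              # unconnected to the rest: DISC
]
regions = bow_tie(toy_edges, "s1")
for name, pages in regions.items():
    print(name, sorted(pages))
```

A page is in the SCC exactly when it appears in both the forward and the backward search, which is why intersecting the two reach sets is enough.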
Size of the Five
SCC: ~56M pages, ~28%
IN: ~43M pages, ~21%
OUT: ~43M pages, ~21%
TENDRILS: ~44M pages, ~22%
DISC: ~17M pages, ~8%
WCC > 91% of the web: the giant component
One interpretation of the pieces:
–SCC: the heart of the web
–IN: newer sites not yet discovered and linked to
–OUT: “insular” pages like corporate web sites
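
As a quick arithmetic check, the percentages follow from the rounded page counts on this slide. The variable names below are illustrative, and the total is simply the sum of the five rounded counts rather than an exact crawl size.

```python
# Rounded region sizes from the slide, in millions of pages.
sizes_millions = {"SCC": 56, "IN": 43, "OUT": 43, "TENDRILS": 44, "DISC": 17}

total = sum(sizes_millions.values())  # sum of the rounded counts
shares = {name: 100 * size / total for name, size in sizes_millions.items()}
wcc_share = 100 - shares["DISC"]      # WCC is everything except DISC

for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
print(f"WCC: {wcc_share:.1f}%")
```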
Diameter Measurements
Directed worst-case diameter of the SCC:
–at least 28
Directed worst-case diameter of IN ∪ SCC ∪ OUT:
–at least 503
Over 75% of the time, there is no directed path between a random start and finish page in the WCC
–when there is a directed path, average length is 16
Average undirected distance in the WCC is 7
Moral:
–web is a “small world” when we ignore direction
–otherwise the picture is more complex
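
The contrast between directed and undirected distances can be reproduced on a tiny graph. The sketch below runs a BFS from every vertex, which is only feasible for toy inputs; at web scale one would have to sample start pages instead. The names `bfs_distances`, `path_stats`, and the five-page example are assumptions made for illustration.

```python
from collections import defaultdict, deque

def bfs_distances(adj, start):
    """Shortest hop counts from start to every vertex it can reach."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def path_stats(edges, directed=True):
    """Return (fraction of ordered pairs joined by a path, average path length)."""
    adj = defaultdict(set)
    nodes = set()
    for u, v in edges:
        adj[u].add(v)
        if not directed:
            adj[v].add(u)          # ignore link direction
        nodes.update((u, v))
    lengths = []
    for s in nodes:
        dist = bfs_distances(adj, s)
        lengths.extend(d for t, d in dist.items() if t != s)
    total_pairs = len(nodes) * (len(nodes) - 1)
    connected = len(lengths) / total_pairs
    average = sum(lengths) / len(lengths) if lengths else float("nan")
    return connected, average

# Hypothetical example: a directed 4-cycle plus one page e that links into it.
toy_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("e", "a")]
directed_stats = path_stats(toy_edges, directed=True)
undirected_stats = path_stats(toy_edges, directed=False)
print("directed:  ", directed_stats)
print("undirected:", undirected_stats)
```

Even in this small example, ignoring direction connects every pair and shortens the average path, while following links forward leaves some pairs with no path at all.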
Degree Distributions
They are, of course, heavy-tailed
Power law distribution of component size
–consistent with the Erdős-Rényi model
Undirected connectivity of web not reliant on “connectors”
–what happens as we remove high-degree vertices?
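
One way to probe the question on this slide is to delete every vertex above a degree threshold and see how much of the graph stays weakly connected. The sketch below is an assumption-laden illustration, not the experiment reported in the paper: `largest_wcc_fraction` and `max_degree` are hypothetical names, and degree here means the number of distinct neighbors with direction ignored.

```python
from collections import defaultdict, deque

def largest_wcc_fraction(edges, max_degree=None):
    """Share of all vertices in the largest weakly connected component
    after deleting every vertex whose undirected degree exceeds max_degree."""
    und = defaultdict(set)
    for u, v in edges:
        und[u].add(v)
        und[v].add(u)
    nodes = set(und)
    if max_degree is None:
        kept = nodes
    else:
        kept = {n for n in nodes if len(und[n]) <= max_degree}
    seen = set()
    best = 0
    for start in kept:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:                       # BFS restricted to surviving vertices
            u = queue.popleft()
            size += 1
            for v in und[u]:
                if v in kept and v not in seen:
                    seen.add(v)
                    queue.append(v)
        best = max(best, size)
    return best / len(nodes)

# Hypothetical graphs: a pure star, and the same star with its leaves also in a ring.
star = [("h", "l1"), ("h", "l2"), ("h", "l3"), ("h", "l4")]
ring = [("l1", "l2"), ("l2", "l3"), ("l3", "l4"), ("l4", "l1")]

star_after = largest_wcc_fraction(star, max_degree=3)
ringed_after = largest_wcc_fraction(star + ring, max_degree=3)
print(star_after, ringed_after)
```

The pure star falls apart once its hub is removed, while the ringed version stays mostly connected. The slide's claim is that the web behaves more like the second case.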
Here is a 2005 update on all this stuff.