1
Sampling a web subgraph
Paraskevas V. Lekeas
Proceedings of the 5th Algorithms, Scientific Computing, Modeling and Simulation (ASCOMS) Web conference, New York, USA, Sept. 15-17, 2003.
2
Web Sampling
To study the web we have to crawl it. We cannot crawl the whole web exhaustively because
i) it is very big
ii) it grows exponentially
Instead, we use sampling techniques to collect representative samples (pages) of the web and then study those pages.
There are 2 main methods of web sampling:
i) “stochastic sampling” (random walks)
ii) “deterministic sampling” (IP sampling)
3
Stochastic Sampling
A stochastic sampler starts from a node of the web graph (pages are nodes, links are edges), picks (with some probability) a link in that node, follows it to another node, and so on.
The sampler stops when it reaches the equilibrium distribution and outputs the sample (all the visited nodes). If the transition matrix of the process is P and the sampler's current distribution over nodes is π, the equilibrium distribution is the π for which π = πP.
Problems are
i) We need connectivity (links) between nodes
ii) We don't know how to choose a node uniformly at random to start the stochastic sampler
iii) We don't know how long it takes to reach the equilibrium distribution
iv) There is statistical dependency among the nodes that the sampler visits (no clean statistics)
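The equilibrium condition π = πP can be illustrated on a toy graph. The following is a minimal sketch, not the sampler from the talk: the 3-page transition matrix is hypothetical, and power iteration from a uniform start is just one simple way to approximate π.

```python
def stationary_distribution(P, tol=1e-12, max_iter=10_000):
    """Iterate pi <- pi P from the uniform distribution until pi stops changing."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


# Hypothetical 3-page graph: row i holds the probabilities of
# following a link from page i to each other page.
P = [
    [0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
]

pi = stationary_distribution(P)
print(pi)
```

On the real web graph P is unknown and far too large to iterate on, which is why problem iii) above (how long the walk must run) has no easy answer.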
4
Deterministic Sampling
A deterministic sampler does not sample the web graph but the IPv4 (Internet Protocol version 4) address space.
The sampler collects IPs from the IPv4 space (pre-sample) and converts them into their web representation (final sample).
Problems are
i) difficulties in accessing many hosts when converting the IP addresses into web nodes
ii) multihosting (one IP may belong to several web nodes, but the resolution mechanism shows only one node)
iii) scalability problems (the new internet, IPv6)
5
Sampling a web Subgraph 1/4
Say we want to study a web subgraph (for example a country code Top Level Domain: .gr, .uk etc.).
We can't use a stochastic sampler: if we start it from a node inside the domain, the sampler is not going to stay there (and if we force it to stay inside, we ruin the stochasticity of the process).
We also can't use a deterministic sampler as it is, since IPv4 is a huge pool of IPs and our subgraph contains only a small part of them.
In this work we built a modified deterministic sampler that solves the above problem.
6
Sampling a web Subgraph 2/4
[Diagram labels: random number generator; IP addresses of web subgraph; pre-sample (IP addresses); Resolver; final sample (web nodes)]
The sampler gets as input the IP addresses of the subgraph (the population). The IPs of the subgraph are collected from Regional Internet Registries (such as RIPE).
7
Sampling a web Subgraph 3/4
The sampler uses sampling theory to compute the size of the sample, generates the required number of random numbers, and draws a pre-sample of IP addresses.
8
Sampling a web Subgraph 4/4
The sampler resolves the pre-sample and outputs the final sample, which contains web nodes (pages).
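The three stages (population of IPs, random pre-sample, resolution into web nodes) might be sketched as follows. This is an illustration under stated assumptions, not the implementation from the talk: the function names, the CIDR-block input format, and the `probe` callback are all hypothetical, the blocks are assumed not to overlap, and the addresses used are reserved documentation ranges rather than real registry data.

```python
import ipaddress
import random


def draw_presample(cidr_blocks, sample_size, seed=None):
    """Draw sample_size distinct IPv4 addresses uniformly from the CIDR blocks."""
    networks = [ipaddress.ip_network(block) for block in cidr_blocks]
    sizes = [net.num_addresses for net in networks]
    rng = random.Random(seed)
    # Sample positions in the concatenated address space, without replacement.
    positions = rng.sample(range(sum(sizes)), sample_size)
    presample = []
    for pos in positions:
        # Walk the blocks to find the one that contains this position.
        for net, size in zip(networks, sizes):
            if pos < size:
                presample.append(net[pos])
                break
            pos -= size
    return presample


def resolve(presample, probe):
    """Keep the addresses for which probe(ip) reports a web node."""
    return [ip for ip in presample if probe(ip)]


# Hypothetical population: two documentation-only blocks.
population = ["192.0.2.0/24", "198.51.100.0/24"]
presample = draw_presample(population, sample_size=20, seed=1)

# Stand-in probe so the sketch runs offline; a real probe would try
# to contact a web server at the address.
final_sample = resolve(presample, probe=lambda ip: int(ip) % 2 == 0)
```

Passing the probe in as a function keeps the random draw separate from the network access, so the sampling step can be checked without contacting any host.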
9
Testing the Sampler (test 1)
We want to estimate the percentage of web nodes in a domain (.gr). Say that this domain contains N IPs; some of them are web nodes and some are not.
Define a variable y_i that equals 1 if the i-th IP is a web node and 0 otherwise. Then the sum of the y_i over all N IPs is the total number of web nodes in N.
An estimator of the percentage of web nodes in N is the proportion of web nodes in the sample, (y_1 + ... + y_n) / n.
The size n of the sample we need to draw in order to estimate p with error of magnitude B is [formula not preserved in the transcript] (q = 1 - p).
From the above we estimate that in late 2002 [estimate not preserved in the transcript], which agrees with RIPE statistics for the same period.
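The sample-size formula on this slide did not survive in the transcript. As an illustration only, the sketch below assumes the common textbook formula for estimating a proportion under simple random sampling without replacement, n = Npq / ((N - 1)D + pq) with D = B²/4 and q = 1 - p; whether the talk used exactly this form is an assumption. The function names and the numbers in the example are hypothetical.

```python
import math


def sample_size_for_proportion(N, B, p=0.5):
    """Sample size n for estimating a proportion p within bound B.

    Assumed formula: n = N*p*q / ((N - 1)*D + p*q), with D = B**2 / 4.
    p = 0.5 is the conservative choice when p is unknown.
    """
    q = 1.0 - p
    D = B * B / 4.0
    return math.ceil(N * p * q / ((N - 1) * D + p * q))


def estimate_proportion(indicators):
    """Sample proportion: indicators are 1 for a web node and 0 otherwise."""
    return sum(indicators) / len(indicators)


# Hypothetical population of 100,000 IPs and a bound of 5 percentage points.
n = sample_size_for_proportion(N=100_000, B=0.05)
print(n)  # 399

# Hypothetical resolved pre-sample: 3 web nodes among 10 IPs.
print(estimate_proportion([1, 0, 0, 1, 0, 0, 0, 1, 0, 0]))  # 0.3
```

Under this formula the required sample size grows very slowly with N, which is what makes sampling an entire country-code domain practical.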
10
Testing the Sampler (test 2)
The out-degree distribution of the sample obeys a power law, which is an intrinsic property of the web graph.
[Plot: out degrees, InTree links chopped; (x, y) = (log degree, log rank); fit: y = 11.2456 - 0.0085x]
The roughly linear plot is skewed at y = 4; this is due to a porn site with hundreds of repetitions of the same links.
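A straight-line fit through (log degree, log rank) points could be computed along these lines. This is a sketch under assumptions: ordinary least squares and natural logarithms are choices made here (the slide states neither the fitting method nor the log base), and the degree list is synthetic, not the data from the talk.

```python
import math


def fit_log_log(degrees):
    """Least-squares line through (log degree, log rank); returns (slope, intercept)."""
    # Rank 1 is the largest out-degree; zero degrees have no logarithm.
    ranked = sorted((d for d in degrees if d > 0), reverse=True)
    xs = [math.log(d) for d in ranked]
    ys = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic degrees following an exact power law: degree = 1000 / rank.
demo_degrees = [1000 / r for r in range(1, 51)]
slope, intercept = fit_log_log(demo_degrees)
print(slope, intercept)
```

A power law shows up as a roughly straight line on such a plot, so a run of pages with identical link lists, like the one noted above, appears as a visible departure from the line.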
11
Uses of the sampler
The above sampler
i) can help us collect information about web communities or validate laws in internet domains
ii) can be used as input to stochastic samplers, which need to start from random sets of web nodes
iii) can be used as a crawler if we force it not to draw samples but to exhaustively visit all the IP addresses that we give it