Presentation on theme: "Web Basics". Presentation transcript:
1 Web Basics
Slides adapted from: Information Retrieval and Web Search, Stanford University, Christopher Manning and Prabhakar Raghavan; CS345A, Winter 2009: Data Mining, Stanford University, Anand Rajaraman and Jeffrey D. Ullman.
2 Web search
Due to the large size of the Web, it is not easy to find the needle in the haystack.
Solutions:
- Classification
- Early search engines
- Modern search engines
…
3 Early solutions to web search
Classification of web pages (Yahoo): mostly done by humans; difficult to scale.
Early keyword-based engines: Altavista, Excite, Infoseek, Inktomi, Lycos.
- They had to decide how queries match pages.
- Most queries match a large number of pages: which page is more authoritative?
Paid search ranking: Goto.com (a.k.a. Overture.com, acquired by Yahoo).
- Your search ranking depended on how much you paid.
- Keywords were auctioned: "casino" was expensive!
4 Ranking of web pages
1998+: link-based ranking, pioneered by Google.
- Blew away all early engines save Inktomi.
- A great user experience in search of a business model.
- Meanwhile, Goto/Overture's annual revenues were nearing $1 billion.
6 Key components in web search
- Links and graph: the web is a hyperlinked document collection, i.e., a graph.
- Queries: web queries are different, more varied, and there are a lot of them. How many? 10^8 every day, approaching 10^9.
- Users: users are different, more varied, and there are a lot of them. How many? 10^9.
- Documents: documents are different, more varied, and there are a lot of them. How many? 10^11; indexed: 10^10.
- Context: context is more important on the web than in many other IR applications.
- Ads and spam.
7 Web as graph
Web graph:
- Node: web page
- Edge: hyperlink
8 Why web graph
An example of a large, dynamic and distributed graph.
- Possibly similar to other complex graphs in social, biological and other systems.
- Reflects how humans organize information (relevance, ranking) and their societies.
- Efficient navigation algorithms.
- Study the behavior of users as they traverse the web graph (e-commerce).
9 In-degree and out-degree
In-degree: the number of incoming edges of a node.
Out-degree: the number of outgoing edges of a node.
E.g., in the example graph, node 8 has in-degree 3 and out-degree 0; node 2 has in-degree 2 and out-degree 4.
Degree distribution.
10 Degree distribution
The degree distribution is the fraction of nodes that have degree i, i.e., P(i) = n_i / n, where n_i is the number of nodes with degree i and n is the total number of nodes.
The degree of the Web graph obeys a power law distribution: P(i) is proportional to i^(-a).
A study at Notre Dame University reported:
- a = 2.45 for the out-degree distribution
- a = 2.1 for the in-degree distribution
Random graphs have a Poisson degree distribution.
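The definition above is easy to compute directly. A minimal sketch in Python (rather than the slides' Matlab), using a hypothetical list of node degrees:

```python
from collections import Counter

def degree_distribution(degrees):
    """Return P(i) = (number of nodes with degree i) / (total nodes)."""
    n = len(degrees)
    counts = Counter(degrees)
    return {k: c / n for k, c in sorted(counts.items())}

# Hypothetical out-degrees of a small 5-node graph
out_degrees = [2, 2, 1, 0, 3]
print(degree_distribution(out_degrees))  # {0: 0.2, 1: 0.2, 2: 0.4, 3: 0.2}
```

For a power-law graph, plotting these fractions against degree on log-log axes gives the straight line discussed on the following slides.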
11 Graph example, Matlab (or Octave)
% The adjacency matrix values were lost in the transcript; any 0/1 matrix
% with G(i,j) = 1 for an edge i -> j will do, e.g.:
G = [0 1 0 0; 0 0 1 1; 1 0 0 1; 0 0 0 0];
indegree = sum(G)     % column sums: edges coming into each node
outdegree = sum(G')   % row sums: edges going out of each node
bin = 0:4;
h = hist(indegree, bin);
subplot(1,2,1); bar(bin, h); title('indegree');
h = hist(outdegree, bin);
subplot(1,2,2); bar(bin, h); title('outdegree');
12 Power law plotted
500 random numbers are generated, following a power law with xmin = 1, alpha = 2.
Subplots (C) and (D) are produced using equal bin sizes (bin size = 5).
To remove the noise in the tail of subplot (D), we need to use logarithmic bin sizes.
Subplot (F) shows a straight line, as desired.
Try the Matlab program to experiment with the power law.
13 Generate random numbers
Generate uniform random numbers: rand(n,1).
Generate power law random numbers using the transformation (inverse-transform) method:
n = 500; alpha = 2; xmin = 1;
% generate n random numbers following a power law
rawData = xmin*(1-rand(n,1)).^(-1/(alpha-1));
</rawData>
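The same inverse-transform trick can be sketched in Python with NumPy (a translation of the slide's Matlab one-liner; the seed is an assumption added for reproducibility):

```python
import numpy as np

def power_law_samples(n, alpha=2.0, xmin=1.0, seed=0):
    """Inverse-transform sampling: x = xmin * (1 - r)^(-1/(alpha - 1)),
    where r is uniform on [0, 1). All samples are >= xmin by construction."""
    rng = np.random.default_rng(seed)
    r = rng.random(n)
    return xmin * (1.0 - r) ** (-1.0 / (alpha - 1.0))

x = power_law_samples(500)
```

Because (1 - r) lies in (0, 1], raising it to a negative power always yields values at or above xmin, matching the power law's support.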
14 Plot the power law data
subplot(3,2,1); scatter(1:n, rawData);
title('(A) Scatter plot of 500 random data');
subplot(3,2,2); scatter(1:n, rawData, rawData.^(0.5), rawData);
title('(B) Crowded dots are plotted in smaller size');
b = 5; bins = 1:b:n;
h = hist(rawData, bins);
subplot(3,2,3); plot(h, 'o');
xlabel('value'); ylabel('frequency');
title('(C) Histogram of equal bin size');
15 Loglog plot
subplot(3,2,4); loglog(bins, h, 'o');
xlabel('value'); ylabel('frequency');
binslog(1) = 1;
for j = 1:7
  b2(j) = 2^j;
  binslog(j+1) = binslog(j) + b2(j);
end;
subplot(3,2,5);
h = hist(rawData, binslog);
plot(binslog, h, 'o');
title('(E) Histogram of log bin size');
subplot(3,2,6);
h = hist(rawData, binslog);
plot(log10(binslog), log10(h), 'o');
xlabel('value'); ylabel('frequency');
title('(F) log-log plot of (E)');
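The doubling-bin idea in that loop (edges 1, 3, 7, 15, …, i.e., edge k = 2^(k+1) - 1) can be sketched in Python; normalising counts by bin width, which the Matlab snippet omits, makes densities comparable across bins of different size:

```python
import numpy as np

def log_binned_density(data, n_edges=8):
    """Histogram over doubling bin edges 1, 3, 7, 15, ...,
    with counts divided by bin width to de-noise the tail on a log-log plot."""
    edges = 2.0 ** np.arange(1, n_edges + 1) - 1.0   # 1, 3, 7, 15, ...
    counts, _ = np.histogram(data, bins=edges)
    widths = np.diff(edges)
    return edges[:-1], counts / widths

left_edges, density = log_binned_density(np.array([1, 1, 2, 3, 5, 8, 13, 21, 34]))
```

Plotting log10(left_edges) against log10(density) then plays the role of subplot (F).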
16 Power law of web graph in 1999
Note that the in- and out-degree distributions are slightly different.
The out-degree may be better fitted by the Mandelbrot law.
What about the current web? The ClueWeb data consist of 4 billion web pages.
17 Scale-free networks
A network is scale-free if its degree distribution follows a power law.
The mathematical model behind it: preferential attachment.
Many networks obey a power law:
- The Internet at the router and inter-domain level
- Citation networks / co-author networks
- Collaboration networks of actors
- Networks formed by interacting genes and proteins
- …
- The web graph
- Online social networks
- The semantic web
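Preferential attachment ("the rich get richer") can be sketched in a few lines of Python: each new node attaches m edges, picking targets with probability proportional to current degree. This is a Barabasi-Albert-style sketch, not tuned to reproduce the web's exact exponents; the seed and starting clique are assumptions.

```python
import random

def preferential_attachment(n, m=2, seed=42):
    """Grow a graph where new nodes attach m edges to existing nodes,
    chosen with probability proportional to their current degree."""
    random.seed(seed)
    # start from a small clique of m + 1 nodes
    edges = [(i, j) for i in range(m + 1) for j in range(i + 1, m + 1)]
    # a list of edge endpoints: uniform sampling from it is degree-proportional
    endpoints = [v for e in edges for v in e]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(random.choice(endpoints))
        for t in targets:
            edges.append((new, t))
            endpoints.extend([new, t])
    return edges

edges = preferential_attachment(100)
```

Early nodes accumulate many links while most nodes keep few, which is exactly the heavy-tailed degree distribution the slide describes.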
18 Other graph properties
Distance from A to B: the length of the shortest path connecting A to B.
E.g., the distance from node 0 to node 9 in the example graph is 1.
Length: the average of the distances between all pairs of nodes.
Diameter: the maximum of the distances.
Strongly connected: for any ordered pair of nodes, there is a directed path from one to the other.
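These distances are exactly what breadth-first search computes. A minimal Python sketch over a hypothetical 4-node directed graph (the adjacency dict is illustrative, not the slide's figure):

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path distance (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

adj = {0: [1, 2], 1: [3], 2: [3], 3: [0]}   # strongly connected example
d = bfs_distances(adj, 0)                   # {0: 0, 1: 1, 2: 1, 3: 2}
# "length" would average these distances; the diameter is their maximum:
diameter = max(max(bfs_distances(adj, s).values()) for s in adj)
```

On a strongly connected graph like this one, every BFS reaches every node, so the diameter is well defined and finite.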
19 Small world
It is a 'small world':
- Millions of people, yet separated by "six degrees" of acquaintance relationships.
- Popularized by Milgram's famous experiment (1967).
Mathematically:
- The diameter of the graph is small compared to the overall size N.
- The length (average distance) is proportional to ln(N) for a fixed average degree.
- (The diameter of a complete graph never grows; it is always 1.)
- This property also holds in random graphs.
20 Bow-tie structure of the Web
A study of 200 million nodes and 1.5 billion links:
- SCC: a strongly connected component in the center.
- Upstream (IN): lots of pages that link to other pages but don't get linked to.
- Downstream (OUT): lots of pages that get linked to but don't link out.
- Tendrils, tubes, islands.
The small-world property is not applicable to the entire web: some parts are unreachable, and others have long paths.
Power-law connectivity holds, though: page in-degree (alpha = 2.1) and out-degree (alpha = 2.72).
21 Empirical numbers for the bow-tie
- Maximal diameter: 28 for the SCC, 500 for the entire graph.
- Probability of a path between any 2 nodes: about one quarter (0.24).
- Average length: 16 (when a directed path exists), 7 (undirected).
- Shortest directed path between 2 nodes in the SCC: links on average.
22 Component properties
- Each component is roughly the same size, ~50 million nodes.
- Tendrils: not connected to the SCC, but reachable from IN and able to reach OUT.
- Tubes: directed paths IN -> Tendrils -> OUT.
- Disconnected components.
- The maximal and average diameter is infinite.
23 Statistics of web graph
- Distribution of incoming and outgoing connections.
- Diameter of the graph: average and maximal length of the shortest path between any two vertices.
- Web sites and the distribution of pages per site.
  Consider in project: the distribution of concepts/classes per file/site in the semantic web?
- Size of the web graph.
  Consider in project: what is the size of the semantic web?
25 Web site size
Simple estimates suggest over a billion nodes.
The distribution of site sizes, measured by the number of pages, follows a power law distribution.
Note that the degree distribution also follows a power law.
Observed over several orders of magnitude, with an exponent a in the range.
26 Web size
The web keeps growing. But growth is no longer exponential?
Who cares?
- Media, and consequently the user.
- Engine design.
- Engine crawl policy; impact on recall.
What is size?
- Number of web servers / web sites?
- Number of pages?
- Terabytes of data available?
- Size of the search engine index?
27 Difficulties in defining the web size (Sec. 19.5)
- Some servers are seldom connected. Example: your laptop running a web server. Is it part of the web?
- The "dynamic" web is infinite.
  - Soft 404s: a request for a non-existent page still returns a valid page.
  - Dynamic content, e.g., weather forecasts, calendars.
  - Any sum of two numbers is its own dynamic page on Google. Example: "2+4".
- Deep web content, e.g., all the articles in nytimes.
- Duplicates: the static web contains syntactic duplication, mostly due to mirroring (~30%).
28 What can we attempt to measure? (Sec. 19.5)
The relative sizes of search engines.
The notion of a page being indexed is still reasonably well defined.
Already there are problems:
- Document extension: e.g., engines index pages not yet crawled, by indexing anchor text.
- Document restriction: all engines restrict what is indexed (first n words, only relevant words, etc.).
29 "Search engine index contains N pages": issues
- Can I claim a page is in the index if I only index its first n bytes? Usually long documents are not fully indexed; the bottom parts are ignored.
- Can I claim a page is in the index if I only index anchor text pointing to the page? E.g., the Apple web site may not contain the keyword 'computer', but much of the anchor text pointing to Apple contains 'computer'. Hence when people search for 'computer', the Apple page may be returned.
- There used to be (and still are?) billions of pages that are only indexed by anchor text.
30 Indexable web (Sec. 19.5)
The statically indexable web is whatever search engines index.
Different engines have different preferences: max URL depth, max count/host, anti-spam rules, priority rules, etc.
Different engines index different things under the same URL:
- frames (e.g., some frames are navigational and should be indexed in a different way);
- meta-keywords, e.g., putting more weight on the title;
- document restrictions, document extensions, …
31 Relative size from overlap of engines A and B (Sec. 19.5)
A ∩ B: sample URLs randomly from A and check if they are contained in B, and vice versa.
Suppose A ∩ B = (1/2) × Size A and A ∩ B = (1/6) × Size B.
Then (1/2) × Size A = (1/6) × Size B, so Size A / Size B = (1/6)/(1/2) = 1/3.
Each test involves: (i) sampling, (ii) checking.
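The arithmetic on this slide is a capture-recapture estimate and fits in one line of Python; the function name is mine, not standard terminology:

```python
def relative_size(frac_a_in_b, frac_b_in_a):
    """Overlap estimate: |A ∩ B| = frac_a_in_b * |A| = frac_b_in_a * |B|,
    hence |A| / |B| = frac_b_in_a / frac_a_in_b."""
    return frac_b_in_a / frac_a_in_b

# Slide's numbers: half of A's sample is in B, a sixth of B's sample is in A.
ratio = relative_size(1/2, 1/6)   # 1/3
```

The estimate needs no absolute sizes at all, only the two sampled overlap fractions, which is why the sampling-and-checking procedure below is sufficient.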
32 Sampling URLs (Sec. 19.5)
Ideal strategy: generate a random URL and check for containment in each index.
Problem: random URLs are hard to find! It is enough to generate a random URL contained in a given engine.
Approach 1: generate a random URL contained in a given engine. This suffices for the estimation of relative size.
Approach 2: random walks / IP addresses. In theory, these might give us a true estimate of the size of the web (as opposed to just the relative sizes of indexes).
33 Random URLs from random queries (Sec. 19.5)
Generate a random query: how?
- Lexicon: 400,000+ words from a web crawl (not an English dictionary).
- Conjunctive queries: w1 AND w2, e.g., "vocalists AND rsi".
Get 100 result URLs from engine A.
Choose a random URL as the candidate to check for presence in engine B:
- Download the document D and get its list of words.
- Use 8 low-frequency words as an AND query to B.
- Check if D is present in the result set.
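The "8 low-frequency words" checking step can be sketched in Python. This is only the query-construction part (actually querying engine B is out of scope); the tokenisation and the stopword list are naive assumptions:

```python
from collections import Counter

def rare_word_query(text, k=8, stopwords=frozenset({"the", "a", "and", "of"})):
    """Build an AND query from the k least frequent words of a document,
    breaking frequency ties alphabetically for determinism."""
    words = [w.lower() for w in text.split() if w.isalpha()]
    freq = Counter(w for w in words if w not in stopwords)
    rare = sorted(freq, key=lambda w: (freq[w], w))[:k]
    return " AND ".join(rare)
```

Sending this query to engine B and scanning the results for D's URL completes the containment check.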
34 Biases induced by random queries (Sec. 19.5)
- Query bias: large documents have a higher probability of being captured by queries. Solution: reject some large documents using, e.g., the rejection sampling method.
- Ranking bias: the search engine ranks the matched documents and returns only the top k. Solution: use conjunctive queries and fetch all results. Another solution: modify the estimator.
- Checking bias: duplicates and impoverished pages are omitted.
- Document or query restriction bias: an engine might not deal properly with an 8-word conjunctive query.
- Malicious bias: sabotage by the engine.
- Operational problems: time-outs, failures, engine inconsistencies, index modification.
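The rejection-sampling fix for query bias can be sketched as follows. This is an illustrative acceptance rule (keep a captured document with probability min_len / doc_len, so long documents, which match more queries, are downweighted), not the exact estimator from the literature; min_len is an assumed tuning constant:

```python
import random

def accept_document(doc_len, min_len=100, seed=None):
    """Rejection step against query bias: accept a captured document
    with probability proportional to 1 / doc_len (capped at 1)."""
    if seed is not None:
        random.seed(seed)
    p = min(1.0, min_len / doc_len)
    return random.random() < p
```

Documents of length min_len or shorter are always kept; a document ten times longer survives the step only a tenth of the time, cancelling its tenfold higher chance of being hit by a random query.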
35 Random IP addresses (Sec. 19.5)
- Generate random IP addresses.
- Find a web server at the given address, if there is one.
- Collect all pages from that server.
- From these, choose a page at random.
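The first step above can be sketched in Python. This sketch skips only a few well-known reserved IPv4 blocks (private, loopback, multicast); a real probe would need the full reserved list, and the seed parameter is an assumption for reproducibility:

```python
import random

def random_public_ipv4(seed=None):
    """Draw a uniform random IPv4 address, retrying on a few reserved
    ranges. Simplified: the complete list of reserved blocks is longer."""
    rng = random.Random(seed)
    while True:
        octets = [rng.randrange(256) for _ in range(4)]
        a, b = octets[0], octets[1]
        if a == 10 or a == 127 or a >= 224:      # private 10/8, loopback, multicast+
            continue
        if a == 172 and 16 <= b <= 31:           # private 172.16/12
            continue
        if a == 192 and b == 168:                # private 192.168/16
            continue
        return ".".join(map(str, octets))
```

Each returned address would then be probed on port 80 to see whether a web server answers.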
36 Random IP addresses (cont.) (Sec. 19.5)
Ignored: empty, authorization-required, or excluded servers.
[Lawr99] estimated from observing 2,500 servers:
- 2.8 million IP addresses running crawlable web servers;
- 16 million total servers;
- 800 million pages.
Also estimated the use of metadata descriptors: meta tags (keywords, description) in 34% of home pages, Dublin Core metadata in 0.3%.
OCLC, using IP sampling, found 8.7 M hosts in 2001.
Netcraft [Netc02] accessed 37.2 million hosts in July 2002.
37 Advantages & disadvantages (Sec. 19.5)
Advantages:
- Clean statistics.
- Independent of crawling strategies.
Disadvantages:
- Doesn't deal with duplication.
- Many hosts might share one IP, or not accept requests.
- No guarantee that all pages are linked to the root page, e.g., employee pages.
- The power law for #pages/host generates a bias towards sites with few pages. But the bias can be accurately quantified if the underlying distribution is understood.
- Potentially influenced by spamming (multiple IPs for the same server to avoid an IP block).
38 Random walks (Sec. 19.5)
View the Web as a directed graph and build a random walk on this graph:
- Include various "jump" rules back to visited sites, so the walk does not get stuck in spider traps and can follow all links.
- The walk converges to a stationary distribution. We must assume the graph is finite and independent of the walk; these conditions are not satisfied (cookie crumbs, flooding). The time to convergence is not really known (it may be too long).
- Sample from the stationary distribution of the walk.
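The walk-with-jumps idea can be sketched on a toy graph in Python. The jump probability, step count, and the 4-node graph are illustrative assumptions; on the real web the convergence caveats above apply:

```python
import random
from collections import Counter

def walk_distribution(adj, steps=200000, jump=0.15, seed=0):
    """Random surfer on a directed graph: with probability `jump` (or at a
    dead end) teleport to a uniform random node, otherwise follow a random
    out-link. Visit frequencies approximate the stationary distribution."""
    rng = random.Random(seed)
    nodes = list(adj)
    cur = nodes[0]
    visits = Counter()
    for _ in range(steps):
        if rng.random() < jump or not adj[cur]:
            cur = rng.choice(nodes)
        else:
            cur = rng.choice(adj[cur])
        visits[cur] += 1
    return {v: c / steps for v, c in visits.items()}

# Hypothetical graph; node 3 is a sink, handled by the jump rule.
pi = walk_distribution({0: [1, 2], 1: [3], 2: [3], 3: []})
```

Nodes that many links point to (here node 3) get visited far more often than nodes reachable only by teleporting (node 0), which is the intuition behind sampling pages from the walk's stationary distribution.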
39 Advantages & disadvantages (Sec. 19.5)
Advantages:
- A "statistically clean" method, at least in theory!
- Could work even for an infinite web (assuming convergence) under certain metrics.
Disadvantages:
- The list of seeds is a problem.
- The practical approximation might not be valid.
- Non-uniform distribution.
- Subject to link spamming.
40 Conclusions (Sec. 19.5)
No sampling solution is perfect. Lots of new ideas…
…but the problem is getting harder.
Quantitative studies are fascinating and a good research problem.
41 Another estimation method
OR-query of frequent words in a number of languages.
According to such a query:
- Size of web > 21,450,000,000 on
- Size of web > 25,350,000,000 on
But the page counts of Google search results are only rough estimates.
42 The Web document collection
- No design/coordination.
- Distributed content creation and linking; democratization of publishing.
- Content includes truth, lies, obsolete information, contradictions, …
- Unstructured (text, HTML, …), semi-structured (XML, annotated photos), structured (databases), …
- Scale much larger than previous text collections … but corporate records are catching up.
- Growth: slowed down from the initial "volume doubling every few months", but still expanding.
- Content can be dynamically generated (see the next slide).
43 Documents: dynamically generated content (deep web)
Dynamic pages are generated from scratch when the user requests them, usually from underlying data in a database. Example: the current status of flight LH 454.
Most (truly) dynamic content is ignored by web spiders: it's too much to index it all.
Actually, a lot of "static" content is also assembled on the fly (ASP, PHP, etc.: headers, date, ads, etc.).
46 Queries
Queries have a power law distribution. (Power law again!)
A few very frequent queries, and a large number of very rare queries.
Examples of rare queries: searches for names, towns, books, etc.
47 Types of queries
- Informational user needs: I need information on something. (~40% / 65%) E.g., "web service", "information retrieval".
- Navigational user needs: I want to go to this web site. (~25% / 15%) E.g., "hotmail", "myspace", "United Airlines".
- Transactional user needs: I want to make a transaction. (~35% / 20%)
  - Buy something: "MacBook Air"
  - Download something: "Acrobat Reader"
  - Chat with someone: "live soccer chat"
- Gray areas: finding a good hub; exploratory search ("see what's there").
Difficult problem: how can the search engine tell what the user's need or intent for a particular query is?
48 How far do people look for results?
40% of users look at the first page only.
(Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)
49 User's evaluation of results
Classic IR relevance (as measured by F, or precision and recall) can also be used for web IR.
- Precision: the fraction of retrieved instances that are relevant.
- Recall: the fraction of relevant instances that are retrieved.
(In the standard diagram: relevant items lie to the left of the straight line, retrieved items within the oval. The red regions are errors: on the left, relevant items not retrieved (false negatives); on the right, retrieved items that are not relevant (false positives). Precision and recall are the quotients of the left green region by the oval and by the left region, respectively.)
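The two definitions reduce to set intersection. A minimal sketch in Python, with hypothetical document-ID sets:

```python
def precision_recall(retrieved, relevant):
    """Precision = |retrieved ∩ relevant| / |retrieved|;
    recall    = |retrieved ∩ relevant| / |relevant|."""
    tp = len(set(retrieved) & set(relevant))  # true positives
    return tp / len(retrieved), tp / len(relevant)

# Hypothetical result: engine returns docs {1,2,3,4}; docs {2,4,6} are relevant.
p, r = precision_recall(retrieved={1, 2, 3, 4}, relevant={2, 4, 6})
# p = 2/4 = 0.5 ; r = 2/3
```

The retrieved-but-irrelevant documents ({1, 3}) are the false positives, and the relevant-but-missed document ({6}) is the false negative, matching the diagram description above.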
50 Users' empirical evaluation of results (cont.)
On the web, precision is more important than recall.
- Precision is relative to the top k results. Precision at page 1 or page 10? Precision for the first 20 results?
- Comprehensiveness: the engine must be able to deal with obscure queries. Recall matters when the number of matches is very small.
The quality of pages varies widely, so relevance is not enough. Other desirable qualities (non-IR!):
- Content: trustworthy, objective, diverse, non-duplicated, well maintained, with coverage of all topics of polysemous queries.
- Web readability: pages display correctly and fast.
- No annoyances: pop-ups, etc.
User perceptions may be unscientific, but they are significant over a large aggregate.
51 Users' empirical evaluation of engines
- Relevance and validity of results (discussed).
- UI: simple, no clutter, error tolerant.
- Pre-/post-processing tools provided:
  - Mitigate user errors (auto spell check, search assist, …).
  - Explicit: search within results, more like this, refine, …
  - Anticipative: related searches.
- Deal with idiosyncrasies: web-specific vocabulary (impact on stemming, spell-check, etc.); web addresses typed in the search box.