1 Web Basics Slides adapted from –Information Retrieval and Web Search, Stanford University, Christopher Manning and Prabhakar Raghavan –CS345A, Winter.


Presentation transcript:

1 Web Basics Slides adapted from –Information Retrieval and Web Search, Stanford University, Christopher Manning and Prabhakar Raghavan –CS345A, Winter 2009: Data Mining. Stanford University, Anand Rajaraman, Jeffrey D. Ullman

2 Web search Due to the large size of the Web, it is not easy to find the needle in the haystack. Solutions –Classification –Early search engines –Modern search engines –Semantic web – …

3 Early solutions to web search Classification of web pages –Yahoo –Mostly done by humans. Difficult to scale. Paid search ranking: GOTO –Your search ranking depended on how much you paid –Auction for keywords: casino was expensive! Ranking pages by their relevance to the query –Early keyword-based engines ca –Altavista, Excite, Infoseek, Inktomi, Lycos –Decide how queries match pages, mostly based on the vector space model –Most queries match a large number of pages –Which page is more authoritative?

4 Ranking of web pages by popularity Originated from graph theory and social network analysis Jon Kleinberg at IBM developed HITS (Hypertext Induced Topic Search) in 1998 Larry Page and Sergey Brin developed PageRank algorithm in 1998 –Blew away all early engines save Inktomi –Great user experience in search of a business model

5 Web search overall picture [Figure: architecture diagram with the components The Web, Web spider, Indexer, Indexes, Ad indexes, Search, and User; labels: links, queries]

6 Key components in web search Links and graph: The web is a hyperlinked document collection, a graph. Queries: Web queries are different, more varied and there are a lot of them. How many? –10^8 every day, approaching 10^9 Users: Users are different, more varied and there are a lot of them. How many? –10^9 Documents: Documents are different, more varied and there are a lot of them. How many? – Indexed: Context: Context is more important on the web than in many other IR applications. Ads and spam

7 Web as graph Web Graph –Node: web page –Edge: hyperlink

8 Why web graph Example of a large, dynamic and distributed graph Possibly similar to other complex graphs in social, biological and other systems Reflects how humans organize information (relevance, ranking) and their societies Efficient navigation algorithms Study behavior of users as they traverse the web graph (e-commerce)

9 In-degree and out-degree In-degree: number of incoming edges of a node Out-degree: number of outgoing edges of a node E.g., –Node 8 has in-degree 3 and out-degree 0 –Node 2 has in-degree 2 and out-degree 4 Degree distribution
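The two definitions above can be illustrated with a minimal sketch. The edge list here is a hypothetical toy graph, not the graph from the slide's figure, and the function name `degrees` is an assumption made for illustration.

```python
from collections import defaultdict

def degrees(edges):
    """Return (in_degree, out_degree) dicts for directed (source, target) edges."""
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    for src, dst in edges:
        out_deg[src] += 1   # one more outgoing edge for the source
        in_deg[dst] += 1    # one more incoming edge for the target
    return dict(in_deg), dict(out_deg)

# Hypothetical toy graph: each pair is a hyperlink from the first page to the second.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")]
in_deg, out_deg = degrees(edges)

# Nodes without incoming (or outgoing) edges are absent from the dict, so use .get().
print("c:", in_deg.get("c", 0), "in,", out_deg.get("c", 0), "out")
```

In this toy graph, page `c` plays the role of node 8 on the slide: it is linked to by three pages and links to none.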

10 Degree distribution The degree distribution $P(i)$ is the fraction of the nodes that have degree i, i.e. $P(i) = n_i / N$, where $n_i$ is the number of nodes with degree i and N is the total number of nodes The degree distribution of the Web graph obeys a power law, $P(i) \propto i^{-\alpha}$ A study at the University of Notre Dame reported –α = 2.45 for the out-degree distribution –α = 2.1 for the in-degree distribution Random graphs have a Poisson degree distribution

11 Power law plotted 500 random numbers are generated following a power law with xmin=1, alpha=2 Subplots C and D are produced using equal bin size (bin size=5) To remove the noise in the tail of subplot (D), we need to use logarithmic bin sizes Subplot (F) shows a straight line as desired. You can download the Matlab program to experiment with power laws
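The experiment on this slide can be approximated without the Matlab program. The sketch below is not the original program. It assumes a continuous power law with density proportional to x^(-alpha) for x >= xmin, draws samples by inverse-transform sampling, and bins them in bins that double in width (one possible choice of logarithmic binning). The function names and the fixed seed are assumptions.

```python
import math
import random

def sample_power_law(n, xmin=1.0, alpha=2.0, seed=0):
    """Draw n samples with density proportional to x**(-alpha) for x >= xmin."""
    rng = random.Random(seed)
    # Inverse CDF: x = xmin * (1 - u) ** (-1 / (alpha - 1)), with u uniform in [0, 1).
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]

def log_binned_density(samples, xmin=1.0):
    """Histogram with bin edges xmin * 2**k; returns (bin_center, density) pairs."""
    counts = {}
    for x in samples:
        k = int(math.floor(math.log2(x / xmin)))   # index of the doubling bin
        counts[k] = counts.get(k, 0) + 1
    n = len(samples)
    result = []
    for k in sorted(counts):
        lo, hi = xmin * 2 ** k, xmin * 2 ** (k + 1)
        # Divide by the bin width so wide bins in the tail are not over-weighted.
        result.append((math.sqrt(lo * hi), counts[k] / (n * (hi - lo))))
    return result

samples = sample_power_law(500)   # same setting as the slide: 500 numbers, xmin=1, alpha=2
for center, density in log_binned_density(samples):
    print(f"{center:12.2f}  {density:.6f}")
```

Plotting the printed pairs on log-log axes should give points scattered around a straight line with slope close to -alpha, which is the effect the slide attributes to logarithmic binning.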

12 Power law of web graph in 1999 Note that the in/out distributions are slightly different Out-degree may be better fitted by a Mandelbrot law What about the current web? –ClueWeb data consists of 4 billion web pages.

13 Scale-free networks A network is scale free if the degree distribution follows a power law –Mathematical model behind it: preferential attachment Many networks obey a power law –Internet at the router and inter-domain level –Citation network/co-author network –Collaboration network of actors –Networks associated with metabolic pathways –Networks formed by interacting genes and proteins –Web graph –Microblogs such as Twitter –Semantic web

14 Other graph properties –Distance from A to B: the length of the shortest path connecting A to B –Distance from node 0 to node 9: 1 –Length: the average of the distances between all the pairs of nodes –Diameter: the maximum of the distances –Strongly connected: for any pair of nodes, there is a path connecting them –Clustering coefficient –Betweenness
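Distance, length and diameter as defined above can be computed with breadth-first search on an unweighted graph. The following is a minimal sketch on a hypothetical four-node path graph. The function names are assumptions, and pairs with no connecting path are skipped rather than counted as infinite.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path distance (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def length_and_diameter(adj):
    """Average and maximum distance over all ordered pairs joined by a path."""
    nodes = set(adj) | {v for targets in adj.values() for v in targets}
    total, pairs, diameter = 0, 0, 0
    for s in nodes:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                total += d
                pairs += 1
                diameter = max(diameter, d)
    average = total / pairs if pairs else float("inf")
    return average, diameter

# Hypothetical toy graph: the undirected path 0 - 1 - 2 - 3, stored in both directions.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
average, diameter = length_and_diameter(adj)
print(average, diameter)
```

Running one BFS per node costs O(N·(N+E)) time, which is fine for a toy graph but not for a web-scale one, where sampling of source nodes is the usual workaround.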

15 Small world It is a ‘small world’ –Millions of people. Yet, separated by “six degrees” of acquaintance relationships –Popularized by Milgram’s famous experiment (1967) Mathematically –Diameter of graph is small (log N) as compared to overall size –For a fixed average degree –The diameter of a complete graph never grows (always 1) –This property also holds in random graphs

16 Bow tie structure of Web Study of 200 million nodes & 1.5 billion links –SCC: Strongly connected component (SCC) in the center –Up stream: Lots of pages that link to other pages, but don’t get linked to (IN) –Down stream: Lots of pages that get linked to, but don’t link (OUT) –Tendrils, tubes, islands Small-world property not applicable to entire web –Some parts unreachable –Others have long paths Power-law connectivity holds though –Page in-degree (alpha = 2.1), out-degree (alpha = 2.72)
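One way to see the bow-tie components concretely is to take a page known to lie in the central SCC and compare what it can reach with what can reach it. The sketch below is an illustration under that assumption, on a hypothetical toy graph, and is not the procedure used in the cited study. Tendrils, tubes and islands are lumped together as "OTHER".

```python
def reachable(adj, start):
    """Set of nodes reachable from start by following edges in adj."""
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def bow_tie(edges, core_node):
    """Split nodes into SCC, IN, OUT and OTHER relative to core_node's component."""
    fwd, rev, nodes = {}, {}, set()
    for s, t in edges:
        fwd.setdefault(s, []).append(t)   # forward links
        rev.setdefault(t, []).append(s)   # reversed links
        nodes.update((s, t))
    forward = reachable(fwd, core_node)    # pages the core can reach
    backward = reachable(rev, core_node)   # pages that can reach the core
    scc = forward & backward
    return {
        "SCC": scc,
        "IN": backward - scc,
        "OUT": forward - scc,
        "OTHER": nodes - forward - backward,
    }

# Hypothetical toy web: i -> a <-> b -> o, plus a disconnected pair x -> y.
edges = [("i", "a"), ("a", "b"), ("b", "a"), ("b", "o"), ("x", "y")]
parts = bow_tie(edges, "a")
for name in ("SCC", "IN", "OUT", "OTHER"):
    print(name, sorted(parts[name]))
```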

17 Empirical numbers for bow-tie Maximal diameter – 28 for SCC, 500 for entire graph Probability of a path between any 2 nodes –~1 quarter (0.24) Average length –16 (directed path exists), 7 (undirected) Shortest directed path between 2 nodes in SCC: links on average

18 Component properties Each component is roughly same size –~50 million nodes Tendrils not connected to SCC – But reachable from IN and can reach OUT Tubes: directed paths IN->Tendrils->OUT Disconnected components –Diameter/length is infinite

19 Where we are in web graph Distribution of incoming and outgoing connections Power law, scale free network Small world, diameter and length of the graph Web site and distribution of pages per site Size of the graph

20 Web site Simple estimates suggest billions of nodes The distribution of site sizes, measured by the number of pages, follows a power law –Note that the degree distribution also follows a power law Observed over several orders of magnitude with an exponent α in the range

21 Web Size The web keeps growing. But growth is no longer exponential? Who cares? –Media, and consequently the user –Engine design –Engine crawl policy. Impact on recall. What is size? –Number of web servers/web sites? –Number of pages? –Terabytes of data available? –Size of search engine index?

22 Difficulties in defining the web size Some servers are seldom connected. –Example: Your laptop running a web server –Is it part of the web? The “dynamic” web is infinite. –Soft 404: is a valid page –Dynamic content, e.g., –Weather forecast –Calendar –Any sum of two numbers is its own dynamic page on Google. Example: “2+4” Deep web content –E.g., all the articles in nytimes. Duplicates –Static web contains syntactic duplication, mostly due to mirroring (~30%)

23 Two sizes (web and search engine index) The (relative) sizes of search engines –The notion of a page being indexed is still reasonably well defined. –Already there are problems –Document extension: e.g. engines index pages not yet crawled, by indexing anchor text. –Document restriction: All engines restrict what is indexed (first n words, only relevant words, etc.)

24 “Search engine index contains N pages”: Issues Can I claim a page is in the index if I only index the first 4000 bytes? –Usually long documents are not fully indexed. Bottom parts are ignored. Can I claim a page is in the index if I only index anchor text pointing to the page? –E.g., the Apple web site may not contain the keyword ‘computer’, but many anchor texts pointing to Apple contain ‘computer’. –Hence when people search for ‘computer’, the Apple page may be returned There used to be (and still are?) billions of pages that are only indexed by anchor text.

25 Size of search engine The statically indexable web is whatever search engines index. Large index is not everything –Different engines have different preferences – max url depth, max count/host, anti-spam rules, priority rules, etc. –Different engines index different things under the same URL: –Frames (e.g., some frames are navigational, should be indexed in a different way) –meta-keywords, e.g., put more weight on the title –document restrictions, document extensions,...

26 Estimate index size by queries Basic idea: send two random queries and count the number of results returned for each, and the number of duplicates between them The size can be estimated by the MLE (maximum likelihood estimator) $\hat{N} = n_1 n_2 / d$, where $n_i$ is the number of matches of query i (i = 1, 2) and d is the number of duplicates between the two sets of matches. It is called the capture-recapture method, inspired by ecology. This model can be extended to multiple queries (multiple capture-recapture) It is unbiased if the data is homogeneous, i.e., every document has equal probability of being matched (and returned)
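A minimal sketch of the two-query capture-recapture idea follows. It assumes the standard two-sample (Lincoln-Petersen) form of the estimator, N ≈ n1·n2 / d, and it simulates "queries" as uniform random subsets of a hypothetical index, which is exactly the homogeneity assumption the slide mentions. The function name and the simulated index are illustrative only.

```python
import random

def estimate_size(n1, n2, d):
    """Two-sample capture-recapture estimate: n1 * n2 / d."""
    if d == 0:
        raise ValueError("no overlap between the two result sets; cannot estimate")
    return n1 * n2 / d

# Direct use with hypothetical counts: 200 and 300 matches, 60 documents in common.
print(estimate_size(200, 300, 60))

# Simulation under the homogeneity assumption: every document is equally
# likely to be returned by a query.
rng = random.Random(42)
true_size = 10_000
index = range(true_size)                    # hypothetical document ids
first = set(rng.sample(index, 1_000))       # documents matched by query 1
second = set(rng.sample(index, 1_000))      # documents matched by query 2
overlap = len(first & second)               # duplicates between the two
estimate = estimate_size(len(first), len(second), overlap)
print(overlap, round(estimate))
```

The estimate is noisy when the overlap d is small, which is one reason the slide mentions extending the model to multiple queries.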

27 Biases induced by random query Query bias: Large documents have a higher probability of being captured by queries –Solution 1: produce a uniform sample by some sampling method –e.g., rejection sampling: reject large documents with some probability –Solution 2: modify the estimator Ranking bias: The search engine ranks the matched documents and returns only the top-k documents. –Try to use queries whose number of matches is commensurate with k. Operational problems –Time-outs, failures, engine inconsistencies, index modification.

28 Random IP addresses Generate random IP addresses Find a web server at the given address –If there’s one Collect all pages from server –From this, choose a page at random

29 Random IP addresses Ignored: empty or authorization required or excluded [Lawr99] Estimated from observing 2500 servers –2.8 million IP addresses running crawlable web servers –16 million total servers –800 million pages –Also estimated use of metadata descriptors: –Meta tags (keywords, description) in 34% of home pages, Dublin core metadata in 0.3% OCLC using IP sampling found 8.7 M hosts in 2001 Netcraft [Netc02] accessed 37.2 million hosts in July 2002

30 Question: estimate social network size In some microblog services the account number is a (random) number E.g., Thus we can obtain a random sample and estimate the –Size, average degree, etc.
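One possible reading of this question, in the same spirit as random IP sampling: probe uniformly random account numbers, record the fraction that correspond to real accounts, and scale by the size of the id space. The sketch below simulates this. The id space, the set of existing accounts and the `account_exists` check are all hypothetical stand-ins for whatever lookup a real service would provide.

```python
import random

def estimate_network_size(id_space, probes, account_exists, rng):
    """Estimate the number of accounts from the hit rate of random id probes."""
    hits = sum(1 for _ in range(probes) if account_exists(rng.randrange(id_space)))
    return id_space * hits / probes

# Simulated service (assumption): 100,000 accounts scattered over 1,000,000 possible ids.
rng = random.Random(7)
id_space = 1_000_000
accounts = set(rng.sample(range(id_space), 100_000))

estimate = estimate_network_size(id_space, 5_000, accounts.__contains__, rng)
print(round(estimate))
```

Because the probed ids form a uniform random sample of accounts, the same sample could also be used to estimate other quantities such as the average degree, as the slide suggests.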

31 Advantages & disadvantages Advantages –Clean statistics –Independent of crawling strategies Disadvantages –Doesn’t deal with duplication –Many hosts might share one IP, or not accept requests –No guarantee all pages are linked to root page. –Eg: employee pages –Power law for # pages/hosts generates bias towards sites with few pages. –But bias can be accurately quantified IF underlying distribution understood –Potentially influenced by spamming (multiple IP’s for same server to avoid IP block)

32 Random walks View the Web as a directed graph Build a random walk on this graph –Start from one or more seed pages –Follow the links randomly –There are several strategies to select the link –Better to follow the less ‘important’ link with higher probability –Includes various “jump” rules back to visited sites –Mimic the behavior of a web surfer –Avoid being stuck in spider traps –Converges to a stationary distribution (ref PageRank and Markov chain) –Time to convergence may be long Sample from stationary distribution of walk
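A minimal sketch of such a walk is given below, under simplifying assumptions that differ from the slide in one respect: the next link is chosen uniformly at random rather than favouring less 'important' links. The jump probability of 0.15, the function name and the toy graph are assumptions. Jumps go back to a page visited earlier, which also lets the walk leave dead ends.

```python
import random
from collections import Counter

def random_walk_sample(adj, seeds, steps, jump_prob=0.15, rng=None):
    """Walk the directed graph adj and return how often each page was visited."""
    rng = rng or random.Random(0)
    visited = list(seeds)          # pages seen so far; these are the jump targets
    visited_set = set(seeds)
    current = rng.choice(visited)
    visits = Counter()
    for _ in range(steps):
        visits[current] += 1
        links = adj.get(current, [])
        if not links or rng.random() < jump_prob:
            current = rng.choice(visited)   # jump back to an already visited page
        else:
            current = rng.choice(links)     # follow a uniformly chosen outgoing link
        if current not in visited_set:
            visited_set.add(current)
            visited.append(current)
    return visits

# Hypothetical toy web: a cycle a -> b -> c -> a, and a page d that nothing links to.
adj = {"a": ["b"], "b": ["c"], "c": ["a"], "d": []}
visits = random_walk_sample(adj, ["a"], 1000)
print(visits)
```

Page `d` is never visited because no seed lies in its component. The visit counts approximate the walk's stationary distribution only after enough steps.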

33 Advantages & disadvantages Advantages –“Statistically clean” method, at least in theory –Could work even for an infinite web (assuming convergence) under certain metrics. Disadvantages –The web may not be (is not) connected –Isolated components cannot be sampled if no seeds are in those components –List of seeds is a problem. –Pages do not all have the same probability of being sampled. –Subject to link spamming

34 The Web document collection Architecture –No design/co-ordination –Distributed content creation, linking, democratization of publishing Content –Includes truth, lies, obsolete information, contradictions … Structure –Unstructured (text, html, …), semi-structured (XML, annotated photos), structured (databases) … Scale –Much larger than previous text collections … but corporate records are catching up Growth –Slowed down from initial “volume doubling every few months” but still expanding Semantics –Mostly no semantic descriptions Dynamic –Content can be dynamically generated

35 Documents Dynamically generated content (deep web) –Dynamic pages are generated from scratch when the user requests them – usually from underlying data in a database. –Example: current status of flight LH 454 –Most (truly) dynamic content is ignored by web spiders. –It’s too much to index it all. –Actually, a lot of “static” content is also assembled on the fly (asp, php etc.: headers, date, ads etc)

36 Web search overall picture [Figure: architecture diagram with the components The Web, Web spider, Indexer, Indexes, Ad indexes, Search, and User; labels: links, queries]

37 Users Use short queries (average < 3) Rarely use operators Don’t want to spend a lot of time on composing a query Only look at the first couple of results Want a simple UI, not a search engine start page overloaded with graphics Extreme variability in terms of user needs, user expectations, experience, knowledge,... –Industrial/developing world, English/Estonian, old/young, rich/poor, differences in culture and class One interface for hugely divergent needs

38 Users’ evaluation of search engines Classic IR relevance (as measured by F, or precision and recall) can also be used for web IR. Equally important: trust, duplicate elimination, readability, fast loading, no pop-ups On the web, precision is more important than recall. –Precision at 1, precision at 10, precision on the first 2-3 pages –But there is a subset of queries where recall matters.
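Precision at k, mentioned above, is the fraction of the top k results that are relevant. Recall at k is the fraction of all relevant documents that appear in the top k. A minimal sketch with hypothetical document ids and relevance judgments:

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)

# Hypothetical result list and relevance judgments for one query.
ranked = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d9"}

print(precision_at_k(ranked, relevant, 1))
print(precision_at_k(ranked, relevant, 5))
print(recall_at_k(ranked, relevant, 5))
```

Here precision at 1 is perfect even though one relevant document (`d9`) is never returned, which illustrates why a user who reads only the first few results experiences precision rather than recall.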

39 Users’ empirical evaluation of engines Relevance and validity of results UI – Simple, no clutter, error tolerant Trust – Results are objective Coverage of topics for polysemic queries Pre/Post process tools provided –Mitigate user errors (auto spell check, search assist,…) –Explicit: Search within results, more like this, refine... –Anticipative: related searches Deal with idiosyncrasies –Web specific vocabulary –Impact on stemming, spell-check, etc –Web addresses typed in the search box “The first, the last, the best and the worst …”

40 Queries Queries have a power law distribution –Power law again! Same here: a few very frequent queries, a large number of very rare queries Examples of rare queries: search for names, towns, books etc

41 Types of queries Informational user needs: I need information on something. (~40% / 65%) –“web service”, “information retrieval” Navigational user needs: I want to go to this web site. (~25% / 15%) –“hotmail”, “myspace”, “United Airlines” Transactional user needs: I want to make a transaction. (~35% / 20%) –Buy something: “MacBook Air” –Download something: “Acrobat Reader” –Chat with someone: “live soccer chat” Gray areas –Find a good hub –Exploratory search “see what’s there” Difficult problem: How can the search engine tell what the user’s need or intent is for a particular query?

42 How far do people look for results? (Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)

43 Web search overall picture [Figure: architecture diagram with the components The Web, Web spider, Indexer, Indexes, Ad indexes, Search, and User; labels: links, queries]