Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005 Slide 5.1 Chapter 5 : How Does a Search Engine Work.

Presentation on theme: "Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005 Slide 5.1 Chapter 5 : How Does a Search Engine Work."— Presentation transcript:

Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005 Slide 5.1 Chapter 5 : How Does a Search Engine Work How do we measure relevance of a search result to a query? Search engine evaluation. –Content relevance (TF-IDF). –Link-based metrics. –PageRank. –Hits, hubs and authorities. Search engine evaluation.

Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005 Slide 5.2 Content Relevance - Vector Space Model

Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005 Slide 5.3 Term Frequency (TF) Count number of occurrences of each term. Bag of words approach. Ignore stopwords such as is, a, of, the, … Stemming - computer is replaced by comput, as are its variants: computers, computing computation,computer and computed. Normalise TF by dividing by doc length, byte size of doc or max num of occurrences of a word in the bag. chess computer programming chess game chess game is a

Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005 Slide 5.4 Inverse Document Frequency (IDF) N is number of documents in the corpus. ni is number of docs in which word i appears. Log dampens the effect of IDF. IDF is also number of bits to represent the term.

Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005 Slide 5.5 Ranking with TF-IDF i – refers to document i j – refers to word (or term) j in doc i q – is the query which is a sequence of terms scorej - is the score for document j given q Rank results according to the scoring function.

Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005 Slide 5.6 Content Relevance Phrase matching. Synonyms. URL analysis. Date last updated. Spell checking. Home page detection.

Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005 Slide 5.7 Link Text (Anchor Text) Include link text for a link pointing to a web page, say P, as part of the content of P. Link text is very useful in finding home pages. Link text behaves like user queries –They act as short summaries. –They often match query terms.

Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005 Slide 5.8 HTML Weighting Class NameHTML tags 1) Plain TextNone of the above 2) StrongSTRONG, B, EM, I, U 3) ListDL, OL, UL 4) HeaderH1, H2, H3, H4, H5, H6 5) AnchorA 6) TitleTITLE Normal retrieval = (111101) ranking with TF-IDF (181882) – 39.6% improvement. (181782) – 48.3% improvement – C2, C4 and C5. (181582) - 43.5% improvement Meta tag text is mostly ignored by search engines

Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005 Slide 5.9 Link-Based Metrics A link from A to B can be viewed as a recommendation, a vote or a citation. Links can be –referential, or –informational Links effect the ranking of web pages and thus have commercial value.

Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005 Slide 5.10 Web site to explain PageRank b1 a1 b3 b4 d1 d2 e1 e2 c1 b2

Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005 Slide 5.11 PageRank - Motivation The number incoming links to a page is a measure of importance and authority of the page. Also take into account the quality of recommendation, so a page is more important if the sources of its incomoing links are important.

Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005 Slide 5.12 The Random Surfer Assume the web is a Markov chain. Surfers randomly click on links, where the probability of an outlink from page A is 1/m, where m is the number of outlinks from A. The surfer occasionally gets bored and is teleported to another web page, say B, where B is equally likely to be any page. Using the theory of Markov chains it can be shown that if the surfer follows links for long enough, the PageRank of a web page is the probability that the surfer will visit that page.

Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005 Slide 5.13 Dangling Pages Problem: A and B have no outlinks. Solution: Assume A and B have links to all web pages with equal probability.

Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005 Slide 5.14 Rank Sink Problem: Pages in a loop accumulate rank but do not distribute it. Solution: Teleportation, i.e. with a certain probability the surfer can jump to any other web page to get out of the loop.

Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005 Slide 5.15 PageRank (PR)PageRank (PR) - Definition W is a web page Wi are the web pages that have a link to P O(Wi) is the number of outlinks from Pi T is the teleportation probability N is the size of the web

Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005 Slide 5.16 Example web site

Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005 Slide 5.17 Iteratively Computing PageRank Replace T/N in the def. of PR(W) by T, so PR will take values between 1 and N. T is normally set to 0.15, but for simplicity lets set it to 0.5 Set initial PR values to 1 Solve the following equations iteratively:

Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005 Slide 5.18 Example Computation of PR IterationPR(A)PR(B)PR(C) 0111 110.751.125 21.06250.7656251.1484375 31.074218750.768554691.15283203 41.076416020.769104001.15365601 51.076828000.769207001.15381050 ………… 121.076923080.769230771.15384615

Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005 Slide 5.19 The Largest Matrix Computation in the World Computing PageRank can be done via matrix multiplication, where the matrix has over 8 billion rows and columns. The matrix is sparse as average number of outlinks is between 7 and 8. Setting T = 0.15 or above requires about 100 iterations to convergence. Researchers are still trying to speed-up the computation.

Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005 Slide 5.20 Factor in Link Metrics to Relevance of Page Multilply by PageRank of document (web page). We do not know exactly how Google factors in the PR, it may be that log(PR) is used.

Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005 Slide 5.21 HITS – Hubs and Authorities - Hyperlink-Induced Topic Search A on the left is an authority A on the right is a hub

Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005 Slide 5.22 Pre-processing for HITS 1)Collect the top t pages (say t = 200) based on the input query; call this the root set. 2)Extend the root set into a base set as follows, for all pages p in the root set: 1) add to the root set all pages that p points to, and 2)add to the root set up-to q pages that point to p (say q = 50). 3)Delete all links within the same web site in the base set resulting in a focused sub-graph.

Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005 Slide 5.23 Expanding the Root Set

Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005 Slide 5.24 HITS Algorithm – Iterate until Convergence B is the base set q and p are web pages in B A(p) is the authority score for p H(p) is the hub score for p

Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005 Slide 5.25 Applications of HITS Search engine querying (speed is an issue). Finding web communities. Finding related pages. Populating categories in web directories. Citation analysis.

Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005 Slide 5.26 Communities on the Web A densely linked focused sub-graph of hubs and authorities is called a community. Over 100,000 emerging web communities have been discovered from a web crawl (a process called trawling). Alternatively, a community is a set of web pages W having at least as many links to pages in W as to pages outside W.

Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005 Slide 5.27 Weblogs influence on PageRank A weblog (or blog) is a frequently updated web site on a particular topic, made up of entries in reverse chronological order. Blogs are a rich source of links, and therfore their links influence PageRank. A google bomb is an attempt to influence the ranking of a web page for a given phrase by adding links to the page with the phrase as its anchor text.

Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005 Slide 5.28 Link Spamming to Improve PageRank Spam is the act of trying unfairly to gain a high ranking on a search engine for a web page without improving the user experience. Link farms - join the farm by copying a hub page which links to all members. Selling links from sites with high PageRank.

Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005 Slide 5.29 Popularity Based Metrics Factor in users opinions as represented in the query logs. Document space modification adjusts the weights of keywords in popular pages. Clickthrough data can also be taken into account to improve the ranking of search engine query results.

Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005 Slide 5.30 Evaluating Search Engines Precision – top-n precision most important, say for n = 10 (i.e. a page of query results). Recall – related to search engine coverage. Mean reciprocal rank for Q&A systems. Evaluation can be carried out on test collections, e.g. TREC.

Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005 Slide 5.31 Typical Recall-Precision Curve Top-n precision – proportion of relevant pages from top n ranked results. Measure top-n precision at fixed recall point for n being 0% to 100% of the ranked results.

Similar presentations