Mark Levene, An Introduction to Search Engines and Web Navigation © Pearson Education Limited 2005 Slide 5.1 Chapter 5 : How Does a Search Engine Work.

Slide 5.1 Chapter 5: How Does a Search Engine Work?
How do we measure the relevance of a search result to a query?
–Content relevance (TF-IDF).
–Link-based metrics.
–PageRank.
–HITS, hubs and authorities.
Search engine evaluation.

Slide 5.2 Content Relevance - Vector Space Model

Slide 5.3 Term Frequency (TF)
Count the number of occurrences of each term (the bag-of-words approach).
Ignore stopwords such as is, a, of, the, …
Stemming: computer is replaced by comput, as are its variants computers, computing, computation and computed.
Normalise TF by dividing by the document length, the byte size of the document, or the maximum number of occurrences of any word in the bag.
Example words: chess computer programming chess game chess game is a
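The counting and normalisation steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the example words that survive on the slide are treated as one document, the stopword list is just the four words named on the slide, normalisation uses the maximum count in the bag (one of the three options listed), and stemming is left out. The function name `term_frequencies` is hypothetical.

```python
from collections import Counter

# Assumption: a tiny stopword list taken from the words named on the slide.
STOPWORDS = {"is", "a", "of", "the"}

def term_frequencies(text):
    """Return TF values normalised by the most frequent term in the bag."""
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    counts = Counter(tokens)
    if not counts:
        return {}
    max_count = max(counts.values())
    return {term: count / max_count for term, count in counts.items()}

# The example words from the slide, treated here as a single document.
tf = term_frequencies("chess computer programming chess game chess game is a")
for term, value in sorted(tf.items(), key=lambda item: -item[1]):
    print(f"{term}: {value:.2f}")
```

Dividing by the maximum count keeps every TF value between 0 and 1, so long documents do not win simply by repeating a term more often.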

Slide 5.4 Inverse Document Frequency (IDF)
N is the number of documents in the corpus.
n_i is the number of documents in which word i appears.
The log dampens the effect of IDF.
IDF can also be read as the number of bits needed to represent the term.
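The formula itself is not present in the transcript. The standard definition that matches the symbols N and n_i above is given here as a reconstruction, not as a quotation from the slide. Taking the logarithm to base 2 is an assumption that fits the "number of bits" reading.

```latex
\mathrm{IDF}_i = \log \frac{N}{n_i}
```

A word that appears in every document has n_i = N and therefore an IDF of 0, while a word that appears in only one document gets the largest value, log N.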

Slide 5.5 Ranking with TF-IDF
i refers to document i.
j refers to word (or term) j in document i.
q is the query, which is a sequence of terms.
score_i is the score for document i given q.
Rank results according to the scoring function.
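The scoring function is missing from the transcript. A common form, assumed here, scores a document by summing TF times IDF over the query terms. The sketch below follows that assumption. The three documents, the query, and the function name `tf_idf_scores` are hypothetical; TF is normalised by document length and the logarithm is base 2, both of which are choices made for this example only.

```python
import math
from collections import Counter

def tf_idf_scores(docs, query):
    """Score each document as the sum of TF * IDF over the query terms."""
    n_docs = len(docs)
    bags = [Counter(doc.lower().split()) for doc in docs]
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for bag in bags for term in bag)
    scores = []
    for bag in bags:
        length = sum(bag.values())
        score = 0.0
        for term in query.lower().split():
            if bag[term] == 0:
                continue  # term absent from this document
            tf = bag[term] / length
            idf = math.log2(n_docs / doc_freq[term])
            score += tf * idf
        scores.append(score)
    return scores

# Hypothetical mini corpus and query.
docs = ["chess computer programming", "chess game", "computer science"]
scores = tf_idf_scores(docs, "chess game")
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking, [round(s, 3) for s in scores])
```

The second document ranks first because it contains both query terms, and "game" is rarer in the corpus than "chess", so it carries the higher IDF.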

Slide 5.6 Content Relevance
Phrase matching. Synonyms. URL analysis. Date last updated. Spell checking. Home page detection.

Slide 5.7 Link Text (Anchor Text)
Include the link text of a link pointing to a web page, say P, as part of the content of P.
Link text is very useful in finding home pages.
Link text behaves like a user query:
–It acts as a short summary.
–It often matches query terms.

Slide 5.8 HTML Weighting
Class Name      HTML tags
1) Plain Text   None of the above
2) Strong       STRONG, B, EM, I, U
3) List         DL, OL, UL
4) Header       H1, H2, H3, H4, H5, H6
5) Anchor       A
6) Title        TITLE
Normal retrieval = (111101), ranking with TF-IDF.
(181882) – 39.6% improvement.
(181782) – 48.3% improvement – C2, C4 and C5.
(181582) % improvement
Meta tag text is mostly ignored by search engines.

Slide 5.9 Link-Based Metrics
A link from A to B can be viewed as a recommendation, a vote or a citation.
Links can be
–referential, or
–informational.
Links affect the ranking of web pages and thus have commercial value.

Slide 5.10 Web site to explain PageRank
[Figure: example web site with pages a1, b1, b2, b3, b4, c1, d1, d2, e1, e2]

Slide 5.11 PageRank - Motivation
The number of incoming links to a page is a measure of the importance and authority of the page.
Also take into account the quality of the recommendation, so a page is more important if the sources of its incoming links are important.

Slide 5.12 The Random Surfer
Assume the web is a Markov chain.
Surfers randomly click on links, where the probability of following any particular outlink from page A is 1/m, and m is the number of outlinks from A.
The surfer occasionally gets bored and is teleported to another web page, say B, where B is equally likely to be any page.
Using the theory of Markov chains it can be shown that, if the surfer follows links for long enough, the PageRank of a web page is the probability that the surfer will visit that page.
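The random surfer can be simulated directly, which gives a feel for why visit frequencies settle down to fixed values. The sketch below is a Monte Carlo illustration only. The three-page graph, the seed, the step count and the function name `random_surfer` are all hypothetical, and a page without outlinks is handled by teleporting, which is one possible convention.

```python
import random

def random_surfer(out_links, teleport=0.15, steps=200_000, seed=0):
    """Estimate visit probabilities by simulating a single random surfer.

    Assumes every link target also appears as a key of out_links.
    """
    rng = random.Random(seed)
    pages = sorted(out_links)
    visits = dict.fromkeys(pages, 0)
    current = pages[0]
    for _ in range(steps):
        links = out_links[current]
        if not links or rng.random() < teleport:
            # Bored surfer (or a page with no outlinks): jump to any page.
            current = rng.choice(pages)
        else:
            # Follow one of the m outlinks, each with probability 1/m.
            current = rng.choice(links)
        visits[current] += 1
    return {page: count / steps for page, count in visits.items()}

# Hypothetical three-page web: A links to B and C, B to C, C to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
freq = random_surfer(graph)
print({page: round(value, 3) for page, value in freq.items()})
```

In this small graph page B is visited least often, because its only incoming link comes from A, which splits its clicks between B and C.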

Slide 5.13 Dangling Pages
Problem: A and B have no outlinks.
Solution: Assume A and B have links to all web pages with equal probability.

Slide 5.14 Rank Sink
Problem: Pages in a loop accumulate rank but do not distribute it.
Solution: Teleportation, i.e. with a certain probability the surfer can jump to any other web page to get out of the loop.

Slide 5.15 PageRank (PR) - Definition
W is a web page.
W_i are the web pages that have a link to W.
O(W_i) is the number of outlinks from W_i.
T is the teleportation probability.
N is the size of the web.
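The defining equation is missing from the transcript. The usual form of PageRank, written with the symbols listed above, is given below as a reconstruction. The T/N term agrees with the later slide that speaks of replacing T/N in the definition of PR(W); the (1 - T) factor is the standard weighting and is an assumption here.

```latex
PR(W) = \frac{T}{N} + (1 - T) \sum_{i} \frac{PR(W_i)}{O(W_i)}
```

Each page W_i shares its own rank equally among its O(W_i) outlinks, and W collects one share from every page that links to it.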

Slide 5.16 Example web site

Slide 5.17 Iteratively Computing PageRank
Replace T/N in the definition of PR(W) by T, so PR will take values between 1 and N.
T is normally set to 0.15, but for simplicity let us set it to 0.5.
Set the initial PR values to 1.
Solve the following equations iteratively:
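The equations and the example web site they refer to are not in the transcript, so the sketch below uses a hypothetical three-page graph (A links to B and C, B links to C, C links to A) instead. It assumes the update PR(W) = T + (1 - T) × (sum of PR(W_i)/O(W_i) over pages linking to W), with T = 0.5 and all initial values 1 as described above. The function name `iterate_pagerank` and the stopping tolerance are illustrative choices, and pages without outlinks are not handled.

```python
def iterate_pagerank(out_links, teleport=0.5, tol=1e-10, max_iter=1000):
    """Repeat the PR update until no value changes by more than tol.

    Assumes every page has at least one outlink (no dangling pages).
    """
    pages = list(out_links)
    pr = dict.fromkeys(pages, 1.0)  # initial PR values set to 1
    iteration = 0
    for iteration in range(1, max_iter + 1):
        new_pr = {}
        for page in pages:
            # Each source page passes PR(source) / outlinks(source) to its targets.
            incoming = sum(
                pr[src] / len(targets)
                for src, targets in out_links.items()
                if page in targets
            )
            new_pr[page] = teleport + (1 - teleport) * incoming
        change = max(abs(new_pr[p] - pr[p]) for p in pages)
        pr = new_pr
        if change < tol:
            break
    return pr, iteration

# Hypothetical graph, not the one on the slide.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pr, iterations = iterate_pagerank(graph)
print(iterations, {page: round(value, 4) for page, value in pr.items()})
```

For this graph the values settle at 14/13, 10/13 and 15/13 for A, B and C, and they add up to 3, the number of pages.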

Slide 5.18 Example Computation of PR
Iteration   PR(A)   PR(B)   PR(C)
…           …       …       …

Slide 5.19 The Largest Matrix Computation in the World
Computing PageRank can be done via matrix multiplication, where the matrix has over 8 billion rows and columns.
The matrix is sparse, as the average number of outlinks is between 7 and 8.
Setting T = 0.15 or above requires about 100 iterations to converge.
Researchers are still trying to speed up the computation.

Slide 5.20 Factor in Link Metrics to Relevance of Page
Multiply the content relevance score by the PageRank of the document (web page).
We do not know exactly how Google factors in the PR; it may be that log(PR) is used.

Slide 5.21 HITS – Hubs and Authorities - Hyperlink-Induced Topic Search
A on the left is an authority.
A on the right is a hub.

Slide 5.22 Pre-processing for HITS
1) Collect the top t pages (say t = 200) based on the input query; call this the root set.
2) Extend the root set into a base set as follows, for all pages p in the root set:
   1) add to the base set all pages that p points to, and
   2) add to the base set up to q pages that point to p (say q = 50).
3) Delete all links within the same web site in the base set, resulting in a focused sub-graph.
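The three steps above can be sketched as follows. This is an illustration under assumptions: the link data is passed in as two plain dictionaries (`out_links` and `in_links`), "same web site" is approximated by comparing host names, and the function names and example URLs are hypothetical.

```python
from urllib.parse import urlparse

def build_base_set(root_set, out_links, in_links, q=50):
    """Steps 1 and 2: start from the root set and extend it into the base set."""
    base = set(root_set)
    for p in root_set:
        base.update(out_links.get(p, []))     # all pages that p points to
        base.update(in_links.get(p, [])[:q])  # up to q pages that point to p
    return base

def focused_subgraph(base, out_links):
    """Step 3: keep links between base-set pages on different web sites."""
    edges = set()
    for src in base:
        for dst in out_links.get(src, []):
            # Assumption: two pages are on the same site if their hosts match.
            if dst in base and urlparse(src).netloc != urlparse(dst).netloc:
                edges.add((src, dst))
    return edges

# Hypothetical link data around a one-page root set.
out_links = {
    "http://a.example/1": ["http://a.example/2", "http://b.example/1"],
    "http://c.example/1": ["http://a.example/1"],
    "http://d.example/1": ["http://a.example/1"],
}
in_links = {"http://a.example/1": ["http://c.example/1", "http://d.example/1"]}
base = build_base_set(["http://a.example/1"], out_links, in_links, q=1)
edges = focused_subgraph(base, out_links)
print(sorted(base))
print(sorted(edges))
```

With q = 1 only the first page that points to the root page is added, and the link between the two pages on a.example is dropped in step 3.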

Slide 5.23 Expanding the Root Set

Slide 5.24 HITS Algorithm – Iterate until Convergence
B is the base set.
q and p are web pages in B.
A(p) is the authority score for p.
H(p) is the hub score for p.
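The update equations are missing from the transcript. In the standard HITS formulation, A(p) is the sum of H(q) over the pages q in B that link to p, and H(p) is the sum of A(q) over the pages q in B that p links to, with both score vectors normalised after every round. The sketch below assumes that formulation. The edge list, the fixed number of rounds and the function names are hypothetical.

```python
import math

def normalise(scores):
    """Scale the scores so that their squares sum to 1."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {p: v / norm for p, v in scores.items()}

def hits(edges, iterations=50):
    """Run a fixed number of HITS rounds over (source, target) link pairs."""
    pages = {p for edge in edges for p in edge}
    auth = dict.fromkeys(pages, 1.0)
    hub = dict.fromkeys(pages, 1.0)
    for _ in range(iterations):
        # A(p): sum of the hub scores of the pages that link to p.
        auth = normalise(
            {p: sum(hub[q] for q, r in edges if r == p) for p in pages}
        )
        # H(p): sum of the authority scores of the pages that p links to.
        hub = normalise(
            {p: sum(auth[r] for q, r in edges if q == p) for p in pages}
        )
    return auth, hub

# Hypothetical focused sub-graph: h1 links to a1 and a2, h2 links to a1.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1")]
auth, hub = hits(edges)
print({p: round(v, 3) for p, v in sorted(auth.items())})
print({p: round(v, 3) for p, v in sorted(hub.items())})
```

Here a1 ends up as the strongest authority because both hubs point to it, and h1 is the stronger hub because it points to both authorities.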

Slide 5.25 Applications of HITS
Search engine querying (speed is an issue). Finding web communities. Finding related pages. Populating categories in web directories. Citation analysis.

Slide 5.26 Communities on the Web
A densely linked focused sub-graph of hubs and authorities is called a community.
Over 100,000 emerging web communities have been discovered from a web crawl (a process called trawling).
Alternatively, a community is a set of web pages W having at least as many links to pages in W as to pages outside W.
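The alternative definition can be checked mechanically. The sketch below assumes the condition is applied to each page in W separately, which is one reading of the sentence above. The link data and the function name `is_community` are hypothetical.

```python
def is_community(pages, out_links):
    """Check that every page has at least as many links inside the set as outside.

    Assumption: the condition is tested page by page, not on the set as a whole.
    """
    members = set(pages)
    for page in members:
        targets = out_links.get(page, [])
        inside = sum(1 for target in targets if target in members)
        outside = len(targets) - inside
        if inside < outside:
            return False
    return True

# Hypothetical link data: a, b and c mostly link to one another.
out_links = {"a": ["b", "c", "x"], "b": ["a"], "c": ["a", "b"]}
print(is_community({"a", "b", "c"}, out_links))
print(is_community({"a", "x"}, out_links))
```

The first set qualifies because each of its pages keeps most of its links inside the set; the second does not, since a sends two of its three links elsewhere.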

Slide 5.27 Weblogs' Influence on PageRank
A weblog (or blog) is a frequently updated web site on a particular topic, made up of entries in reverse chronological order.
Blogs are a rich source of links, and therefore their links influence PageRank.
A Google bomb is an attempt to influence the ranking of a web page for a given phrase by adding links to the page with the phrase as their anchor text.

Slide 5.28 Link Spamming to Improve PageRank
Spam is the act of trying unfairly to gain a high ranking on a search engine for a web page without improving the user experience.
Link farms: join the farm by copying a hub page which links to all members.
Selling links from sites with high PageRank.

Slide 5.29 Popularity Based Metrics
Factor in users' opinions as represented in the query logs.
Document space modification adjusts the weights of keywords in popular pages.
Clickthrough data can also be taken into account to improve the ranking of search engine query results.

Slide 5.30 Evaluating Search Engines
Precision – top-n precision is most important, say for n = 10 (i.e. a page of query results).
Recall – related to search engine coverage.
Mean reciprocal rank for Q&A systems.
Evaluation can be carried out on test collections, e.g. TREC.
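The three measures named above are simple to compute once relevance judgements are available. The sketch below uses their usual definitions. The document identifiers and the function names are hypothetical, and precision is divided by the number of results actually returned when fewer than n are available, which is a choice made for this example.

```python
def precision_at_n(ranked, relevant, n=10):
    """Proportion of the top n ranked results that are relevant."""
    top = ranked[:n]
    if not top:
        return 0.0
    return sum(1 for doc in top if doc in relevant) / len(top)

def recall(ranked, relevant):
    """Proportion of all relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return sum(1 for doc in relevant if doc in ranked) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0 if there is none."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

# Hypothetical judgements for two queries.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}
print(precision_at_n(ranked, relevant, n=2))
print(recall(ranked, relevant))
print(mean_reciprocal_rank([(ranked, relevant), (["d9"], {"d9"})]))
```

Recall needs the full set of relevant documents, which is why it depends on how much of the web a search engine covers, while top-n precision only needs judgements for the first page of results.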

Slide 5.31 Typical Recall-Precision Curve
Top-n precision – the proportion of relevant pages among the top n ranked results.
Measure top-n precision at fixed recall points, for n ranging from 0% to 100% of the ranked results.