Search Engines: How Google Works (Connectivity and Link Analysis). Thanks to R. Mooney, S. White, W. Arms, C. Manning, P. Raghavan, H. Schütze.

What we have covered:
–What is IR
–Evaluation
–Tokenization and properties of text
–Web crawling
–Query models
–Vector methods
–Measures of similarity
–Indexing; inverted files
–Web characteristics: graph, links; spam, SEO, duplication
–Google and link analysis

What we will cover:
–Search engines: classifications, business models
–Ranking by importance: link analysis (graph search), citations, PageRank, hubs/authorities
–Google: role in search, architecture

Google Facts

Google search

Web Search
Goal: provide information discovery for large amounts of open-access material on the web.
Challenges:
–Volume of material: several billion items, growing steadily
–Items created dynamically or in databases
–Great variety: length, formats, quality control, purpose, etc.
–Inexperience of users: range of needs
–Economic models to pay for the service

Search Engine Strategies: Types of search engines
Broad classes:
–Subject hierarchies: Yahoo!, dmoz (use of human indexing)
–Web crawling + automatic indexing: general engines such as Google, Ask, Exalead, Bing
–Mixed models: graphs (KartOO); clusters (Clusty, now Yippy)
–Metasearch
–New ones evolving

Many Aspects of a Web Search Service
Computational components: web crawler, indexing system, search system.
Social considerations: economics, scalability, legal issues.

A Typical Web Search Engine (diagram): users submit queries through an interface to the query engine; a crawler fetches pages from the web, an indexer builds the index the query engine searches, and results are ranked using the web graph.

Business Models
–Subscription: monthly fee with logon provides unlimited access (introduced by InfoSeek)
–Advertising: access is free, with display advertisements (introduced by Lycos); can lead to distortion of results to suit advertisers; focused advertising (Google, Overture)
–Licensing: costs of the company are covered by fees, licensing of software, and specialized services
–Others?

Business models for advertisers
For a related query such as "tennis shoes", your company either comes up first in the SERP (or at least on the first SERP page), or can say goodbye to its web business. Organic SEO is completely dependent on the search engines' ranking algorithms. When Google changes its ranking, will other search engines follow?

Generations of search engines
–0th: library catalog, based on human-created metadata
–1st: AltaVista, the first large comprehensive database; word-based index and ranking
–2nd: Google, high relevance; link (connectivity) based importance

Types of Search Queries (Broder, "A taxonomy of web search"): in the web context the "need behind the query" is often not informational in nature.
–Navigational (25%): the immediate intent is to reach a particular site.
–Informational (40%): the intent is to acquire some information assumed to be present on one or more web pages.
–Transactional (35%): the intent is to perform some web-mediated activity.

Motivation for Link Analysis
Early search engines mainly compared content similarity of the query and the indexed pages, i.e., they used information retrieval methods: cosine, TF-IDF, etc. From the mid-90's it became clear that content similarity alone was no longer sufficient:
–The number of pages grew rapidly in the mid-to-late 1990's. Try "classification methods": Google estimates millions of relevant pages. How to choose only a handful of pages and rank them suitably to present to the user?
–Content similarity is easily spammed. A page owner can repeat some words and add many related words to boost the rankings of his pages and/or to make the pages relevant to a large number of queries.

Early hyperlinks
Starting in the mid-90's, researchers began to work on the problem by resorting to hyperlinks. In February 1997, Yanhong Li (Scotch Plains, NJ) filed a hyperlink-based search patent; the method uses words in the anchor text of hyperlinks. Web pages are connected through hyperlinks, which carry important information:
–Manually made: a social network
–Some hyperlinks organize information at the same site.
–Other hyperlinks point to pages on other web sites. Such outgoing hyperlinks often indicate an implicit conveyance of authority to the pages being pointed to. Pages that are pointed to by many other pages are likely to contain authoritative information.

Hyperlink algorithms
During 1997-1998, the two most influential hyperlink-based search algorithms, PageRank and HITS, were reported. Both are related to social network analysis: they exploit the hyperlinks of the Web to rank pages according to their levels of "prestige" or "authority".
–HITS: Jon Kleinberg (Cornell University), at the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, January 1998
–PageRank: Sergey Brin and Larry Page, PhD students at Stanford University, at the Seventh International World Wide Web Conference (WWW7), April 1998. PageRank powers the Google search engine.
Impact of Stanford University in web search:
–Google: Sergey Brin and Larry Page (PhD candidates in CS)
–Yahoo!: Jerry Yang and David Filo (PhD candidates in EE)
–HP, Sun, Cisco, …

Other uses
Apart from search ranking, hyperlinks are also useful for finding Web communities. A Web community is a cluster of densely linked pages representing a group of people with a special interest. Beyond explicit hyperlinks on the Web, links in other contexts are useful too, e.g., for discovering communities of named entities (such as people and organizations) in free-text documents, and for analyzing social phenomena in emails.

The Web as a Directed Graph
Assumption 1: A hyperlink between pages denotes author-perceived relevance (a quality signal).
Assumption 2: The anchor text of the hyperlink describes the target page (textual context).
(Diagram: Page A links to Page B via a hyperlink whose anchor text describes B.)

Anchor Text Indexing
Extract the anchor text (the text between <a href="..."> and </a>) of each link followed. Anchor text is usually descriptive of the document to which it points. Add anchor text to the content of the destination page to provide additional relevant keyword indices. Used by Google: for example, anchor text such as "Evil Empire" or "IBM" pointing at a company's home page is indexed for that page.
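A minimal sketch of the idea in Python, using a toy two-page corpus with made-up URLs: the words of each link's anchor text are indexed under the destination page, so a page can match a query term it never contains itself.

```python
from collections import defaultdict

# Toy corpus: each page lists (target_url, anchor_text) pairs for its outgoing links.
pages = {
    "http://a.example": {"text": "A portal of useful links",
                         "links": [("http://ibm.example", "Big Blue"),
                                   ("http://ibm.example", "IBM")]},
    "http://ibm.example": {"text": "mostly images, little text", "links": []},
}

index = defaultdict(set)  # term -> set of URLs

for url, page in pages.items():
    # Index the page's own words.
    for term in page["text"].lower().split():
        index[term].add(url)
    # Index each link's anchor text under the *destination* page.
    for target, anchor in page["links"]:
        for term in anchor.lower().split():
            index[term].add(target)

print(index["ibm"])  # {'http://ibm.example'} even though that page never says "ibm"
```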

Anchor Text
WWW Worm (McBryan [Mcbr94]). For the query "ibm", how do we distinguish between:
–IBM's home page (mostly graphical)
–IBM's copyright page (high term frequency for 'ibm')
–a rival's spam page (arbitrarily high term frequency)?
Anchor text such as "ibm", "ibm.com", and "IBM home page" helps: a million pieces of anchor text containing "ibm" send a strong signal.

Indexing anchor text
When indexing a document D, include anchor text from links pointing to D. (Diagram: pages saying "Armonk, NY-based computer giant IBM announced today", "Joe's computer hardware links: Compaq, HP, IBM", and "Big Blue today announced record profits for the quarter" all link to IBM's home page, so their anchor and surrounding text is indexed for that page.)

Indexing anchor text
Can sometimes have unexpected side effects, e.g., "french military victories". Helps when descriptive text in the destination page is embedded in image logos rather than in accessible text. Many times anchor text is not useful: "click here". Increases content most for popular pages with many incoming links, increasing recall of these pages. May even give higher weights to tokens from anchor text.

Anchor Text Other applications –Weighting/filtering links in the graph HITS [Chak98], Hilltop [Bhar01] –Generating page descriptions from anchor text [Amit98, Amit00]

Focus on Google Why Google works How it works –Link analysis/connectivity

Google: Indexing the Web
Goal: precision. Short queries applied to very large numbers of items lead to large numbers of hits. The goal is that the first hits presented should satisfy the user's information need; this requires ranking hits in an order that fits the user's requirements. Recall is not an important criterion: completeness of the index is not an important factor, and comprehensive crawling is unnecessary.

Concept of Relevance Document measures Relevance, as conventionally defined, is binary (relevant or not relevant). It is usually estimated by the similarity between the terms in the query and each document. Importance measures documents by their likelihood of being useful to a variety of users. It is usually estimated by some measure of popularity. Web search engines rank documents by combination of relevance and importance. The goal is to present the user with the most important of the relevant documents.

Ranking Options
1. Paid advertisers
2. Manually created classification
3. Vector space ranking with corrections for document length
4. Extra weighting for specific fields, e.g., title, anchors, etc.
5. Popularity or importance, e.g., PageRank
Many of these factors are NOT made public.

History of link analysis
Bibliometrics: citation analysis since the 1960's; citation links to and from documents. The basis of the PageRank idea.

Bibliometrics
Techniques that use citation analysis to measure the similarity of journal articles or their importance.
–Bibliographic coupling: two papers that cite many of the same papers
–Co-citation: two papers that were cited by many of the same papers
–Impact factor (of a journal): frequency with which the average article in a journal has been cited in a particular year or period
–Citation frequency

Citation Analysis
–Citation frequency
–Co-citation coupling frequency: co-citations with a given author measure "impact"; co-citation analysis [Mcca90]
–Bibliographic coupling frequency: articles that cite the same articles are related
–Citation indexing: who is this author cited by? (Garfield [Garf72])
–PageRank preview: Pinski and Narin, '60s

Citation Graph (diagram): a paper "cites" earlier papers and "is cited by" later ones; the graph also illustrates bibliographic coupling and co-citation. Note that academic citations nearly always refer to earlier work.

Graphical Analysis of Hyperlinks on the Web This page links to many other pages (hub) Many pages link to this page (authority)

Search Engines
What is connectivity/linking? Role of connectivity in ranking:
–Academic paper analysis
–HITS (IBM)
–Ask
–Facebook
–Google
–Google Scholar
–CiteSeer (CiteSeerX), ChemXSeer, ArchSeer, etc.
–…

HTML Structure & Feature Weighting
Weight tokens under particular HTML tags more heavily:
–<title> tokens (Google seems to like title matches)
–<h1>, <h2>, … tokens
–<meta> keyword tokens
Parse the page into conceptual sections (e.g., navigation links vs. page content) and weight tokens differently based on section. A sketch follows below.
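A small illustrative sketch: token counts are boosted according to the HTML section they occur in. The tag names and weight values here are assumptions for illustration; real engines do not publish their weights.

```python
from collections import Counter

# Illustrative per-section weights (assumed values, not any engine's real ones).
TAG_WEIGHTS = {"title": 10.0, "h1": 5.0, "meta_keywords": 3.0, "body": 1.0}

def weighted_term_counts(sections):
    """sections: dict mapping a tag/section name to the text found under it."""
    counts = Counter()
    for tag, text in sections.items():
        weight = TAG_WEIGHTS.get(tag, 1.0)
        for term in text.lower().split():
            counts[term] += weight
    return counts

doc = {"title": "Web search engines",
       "h1": "How ranking works",
       "body": "ranking uses links and text"}
print(weighted_term_counts(doc)["ranking"])  # 5.0 (h1) + 1.0 (body) = 6.0
```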

Link Analysis What is link analysis? For academic documents –CiteSeer is an example of such a search engine –Others Google Scholar Rexa Scirus ACM portal …

Bibliometrics: Citation Analysis Many standard documents include bibliographies (or references), explicit citations to other previously published documents. Using citations as links, standard corpora can be viewed as a graph. The structure of this graph, independent of content, can provide interesting information about the similarity of documents and the structure of information. Impact of paper!

Impact Factor
Developed by Garfield in 1972 to measure the importance (quality, influence) of scientific journals. A measure of how often papers in the journal are cited by other scientists. Computed and published annually by the Institute for Scientific Information (ISI). The impact factor of a journal J in year Y is the average number of citations (from indexed documents published in year Y) to a paper published in J in year Y−1 or Y−2. It does not account for the quality of the citing article.
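A tiny worked example of the definition above, with made-up numbers for a hypothetical journal:

```python
def impact_factor(citations_in_year_y, papers_published_prev_two_years):
    """IF of journal J in year Y = citations received in Y to items J published
    in Y-1 and Y-2, divided by the number of items J published in those two years."""
    return citations_in_year_y / papers_published_prev_two_years

# Hypothetical journal: 300 citations in year Y to papers it published in Y-1 and Y-2,
# and 150 papers published in that window -> impact factor 2.0.
print(impact_factor(300, 150))
```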

Bibliographic Coupling
A measure of similarity of documents introduced by Kessler in 1963. The bibliographic coupling of two documents A and B is the number of documents cited by both A and B, i.e., the size of the intersection of their bibliographies. Maybe we want to normalize by the size of the bibliographies? (Diagram: A and B both cite a set of common papers.)

Co-Citation
An alternative citation-based measure of similarity, introduced by Small in 1973: the number of documents that cite both A and B. Maybe we want to normalize by the total number of documents citing either A or B? (Diagram: a set of common papers cite both A and B.)
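A short sketch computing both citation-based similarity measures over a toy citation graph (papers and references are invented):

```python
# cites[p] = the set of papers that p cites (its bibliography).
cites = {
    "A": {"X", "Y", "Z"},
    "B": {"Y", "Z", "W"},
    "X": set(), "Y": set(), "Z": set(), "W": set(),
    "C": {"A", "B"},          # C cites both A and B
    "D": {"A", "B", "X"},     # D cites both A and B
}

def bibliographic_coupling(p, q):
    """Number of papers cited by BOTH p and q."""
    return len(cites[p] & cites[q])

def cocitation(p, q):
    """Number of papers that cite BOTH p and q."""
    return sum(1 for refs in cites.values() if p in refs and q in refs)

print(bibliographic_coupling("A", "B"))  # 2 (shared references Y and Z)
print(cocitation("A", "B"))              # 2 (cited together by C and D)
```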

Citations vs. Web Links
Web links are a bit different from citations:
–Many links are navigational.
–Many pages with high in-degree are portals, not content providers.
–Not all links are endorsements.
–Company web sites don't point to their competitors.
–Citation of relevant literature is enforced by peer review.

Ranking: query (in)dependence
Query-independent ranking: important pages regardless of the query; trusted pages? PageRank can do this.
Query-dependent ranking: combine importance with query evaluation. HITS is query based; so is "GoogleRank" (PageRank combined with some similarity measure).

PageRank vs. HITS
Web site classification: hubs and authorities.
–Hubs & authorities → HITS
–Authorities only → PageRank

Authorities Authorities are pages that are recognized as providing significant, trustworthy, and useful information on a topic. In-degree (number of pointers to a page) is one simple measure of authority. However in-degree treats all links as equal. Should links from pages that are themselves authoritative count more?

Hubs
Hubs are index pages that provide lots of useful links to relevant content pages (topic authorities). Example: a course home page that links to the pages included in the course.

HITS
Algorithm developed by Kleinberg in 1998 as part of an IBM search engine project. It attempts to computationally determine hubs and authorities on a particular topic through analysis of a relevant subgraph of the web. Based on mutually recursive facts:
–Hubs point to lots of authorities.
–Authorities are pointed to by lots of hubs.

Hubs and Authorities
Together they tend to form a bipartite graph. (Diagram: hubs on one side point to authorities on the other.)

HITS Algorithm Computes hubs and authorities for a particular topic specified by a normal query. –Thus query dependent ranking First determines a set of relevant pages for the query called the base set S. Analyze the link structure of the web subgraph defined by S to find authority and hub pages in this set.

Constructing a Base Subgraph
For a specific query Q, let the set of documents returned by a standard search engine be called the root set R. Initialize S to R. Add to S all pages pointed to by any page in R. Add to S all pages that point to any page in R. (Diagram: S contains R plus its in- and out-neighbors.) A sketch of this expansion follows below.
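A minimal sketch of the expansion step, assuming the graph is given as out-link and in-link dictionaries; the bounded slice stands in for the "random set of at most 50 back-pointer pages" mentioned on the next slide:

```python
def build_base_set(root_set, out_links, in_links, max_back=50):
    """Expand the root set R (top results for the query) into the HITS base set S.
    out_links[p] / in_links[p] are sets of pages p points to / is pointed to by."""
    S = set(root_set)
    for page in root_set:
        S |= set(out_links.get(page, ()))          # everything pages in R point to
        backers = list(in_links.get(page, ()))     # everything pointing to pages in R...
        S |= set(backers[:max_back])               # ...bounded, standing in for the random sample
    return S
```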

Base Limitations
To limit computational expense:
–Limit the number of root pages to the top 200 pages retrieved for the query.
–Limit the number of "back-pointer" pages to a random set of at most 50 pages returned by a "reverse link" query.
To eliminate purely navigational links:
–Eliminate links between two pages on the same host.
To eliminate "non-authority-conveying" links:
–Allow only m (m ≈ 4 to 8) pages from a given host as pointers to any individual page.

Authorities and In-Degree Even within the base set S for a given query, the nodes with highest in-degree are not necessarily authorities (may just be generally popular pages like Yahoo or Amazon). True authority pages are pointed to by a number of hubs (i.e. pages that point to lots of authorities).

Iterative Algorithm
Use an iterative algorithm to slowly converge on a mutually reinforcing set of hubs and authorities. Maintain for each page p ∈ S:
–Authority score: a_p (vector a)
–Hub score: h_p (vector h)
Initialize all a_p = h_p = 1. Keep the scores normalized, e.g. so that Σ_p a_p² = 1 and Σ_p h_p² = 1. A runnable sketch follows below.
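A runnable sketch of the iteration, assuming the base-set graph is given as a dict of out-links; scores are L2-normalized each round as on the slide:

```python
def hits(out_links, iterations=20):
    """Iterative hub/authority computation.
    out_links: dict mapping each page to the set of pages it points to."""
    pages = set(out_links) | {q for targets in out_links.values() for q in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    # Precompute in-links so the authority update is a simple sum.
    in_links = {p: set() for p in pages}
    for q, targets in out_links.items():
        for p in targets:
            in_links[p].add(q)
    for _ in range(iterations):
        # a_p = sum of hub scores of pages pointing to p
        auth = {p: sum(hub[q] for q in in_links[p]) for p in pages}
        # h_p = sum of authority scores of pages p points to
        hub = {p: sum(auth[q] for q in out_links.get(p, ())) for p in pages}
        # Normalize both vectors to unit L2 norm.
        na = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        nh = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return hub, auth
```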

Convergence
The algorithm converges to a fixed point if iterated indefinitely. Define A to be the adjacency matrix for the subgraph defined by S: A_ij = 1 for i ∈ S, j ∈ S iff i → j. The authority vector a converges to the principal eigenvector of AᵀA, and the hub vector h converges to the principal eigenvector of AAᵀ. In practice, 20 iterations produce fairly stable results.

HITS Results
An ambiguous query can result in the principal eigenvector covering only one of the possible meanings; non-principal eigenvectors may contain hubs and authorities for other meanings. Example, "jaguar":
–Atari video game (principal eigenvector)
–NFL football team (2nd non-principal eigenvector)
–automobile (3rd non-principal eigenvector)
Reportedly first used commercially by Teoma.com.

Hyperlink-Induced Topic Search (HITS) In response to a query, instead of an ordered list of pages each meeting the query, find two sets of inter-related pages: –Hub pages are good lists of links on a subject. e.g., “Bob’s list of cancer-related links.” –Authority pages occur recurrently on good hubs for the subject. Best suited for “broad topic” queries rather than for page-finding queries. Gets at a broader slice of common opinion.

Google Background
"Our main goal is to improve the quality of web search engines." The name Google comes from googol = 10^100. Originally part of the Stanford digital library project known as WebBase; commercialized in 1999.

Initial Design Goals Deliver results that have very high precision even at the expense of recall Make search engine technology transparent, i.e. advertising shouldn’t bias results Bring search engine technology into academic realm in order to support novel research activities on large web data sets Make system easy to use for most people, e.g. users shouldn’t have to specify more than a couple words

Google Search Engine Features
Two main features to increase result precision:
–Uses the link structure of the web (PageRank)
–Uses text surrounding hyperlinks to improve accurate document retrieval
Other features include:
–Takes into account word proximity in documents
–Uses font size, word position, etc. to weight words
–Storage of full raw HTML pages

PageRank in Words Intuition: Imagine a web surfer doing a simple random walk on the entire web for an infinite number of steps. Occasionally, the surfer will get bored and instead of following a link pointing outward from the current page will jump to another random page. At some point, the percentage of time spent at each page will converge to a fixed value. This value is known as the PageRank of the page.

PageRank Link-analysis method used by Google (Brin & Page, 1998). Does not attempt to capture the distinction between hubs and authorities. Ranks pages just by authority. Query independent Applied to the entire web rather than a local neighborhood of pages surrounding the results of a query.

Initial PageRank Idea
Just measuring in-degree (citation count) doesn't account for the authority of the source of a link. Initial PageRank equation for page p: R(p) = c · Σ over pages q that link to p of R(q)/N_q, where:
–N_q is the total number of out-links from page q.
–A page q "gives" an equal fraction of its authority to all the pages it points to (e.g., p).
–c is a normalizing constant set so that the rank of all pages always sums to 1.

Initial PageRank Idea (cont.) Can view it as a process of PageRank “flowing” from pages to the pages they cite

Initial Algorithm
Iterate the rank-flowing process until convergence. Let S be the total set of pages.
Initialize ∀p ∈ S: R(p) = 1/|S|
Until ranks do not change (much) (convergence):
  For each p ∈ S: R′(p) = Σ over q that link to p of R(q)/N_q
  For each p ∈ S: R(p) = cR′(p) (normalize)
A sketch in code follows below.
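A sketch of this rank-flowing iteration (no rank source yet), assuming the graph is a dict mapping each page to the set of pages it links to:

```python
def simple_pagerank(out_links, iterations=50):
    """Rank 'flows' from each page equally to the pages it cites; there is no rank
    source yet, so this version still suffers from rank sinks (see later slides)."""
    pages = set(out_links) | {q for ts in out_links.values() for q in ts}
    R = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for q, targets in out_links.items():
            if targets:
                share = R[q] / len(targets)   # q gives an equal fraction to each target
                for p in targets:
                    new[p] += share
        total = sum(new.values()) or 1.0
        R = {p: v / total for p, v in new.items()}  # normalize so ranks sum to 1
    return R
```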

Sample Stable Fixed Point

Linear Algebra Version
Treat R as a vector over web pages. Let A be a 2-d matrix over pages where A_vu = 1/N_u if u → v, else A_vu = 0. Then R = cAR, and R converges to the principal eigenvector of A.

Problem with Initial Idea A group of pages that only point to themselves but are pointed to by other pages act as a “rank sink” and absorb all the rank in the system. Rank flows into cycle and can’t get out

Rank Source Introduce a “rank source” E that continually replenishes the rank of each page, p, by a fixed amount E(p).

PageRank Algorithm
Let S be the total set of pages. Let ∀p ∈ S: E(p) = α/|S| (for some 0 < α < 1, e.g. 0.15).
Initialize ∀p ∈ S: R(p) = 1/|S|
Until ranks do not change (much) (convergence):
  For each p ∈ S: R′(p) = Σ over q that link to p of R(q)/N_q + E(p)
  For each p ∈ S: R(p) = cR′(p) (normalize)
A sketch in code follows below.
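The same iteration with the rank source added. The value α = 0.15 is a conventional choice used here for illustration; dangling pages are handled by spreading their rank uniformly, which is one common convention rather than something stated on the slide.

```python
def pagerank(out_links, alpha=0.15, iterations=50):
    """Power iteration with a uniform rank source E(p) = alpha/|S|."""
    pages = set(out_links) | {q for ts in out_links.values() for q in ts}
    n = len(pages)
    R = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: alpha / n for p in pages}              # the rank source E
        for q in pages:
            targets = out_links.get(q, ())
            if targets:
                share = (1.0 - alpha) * R[q] / len(targets)
                for p in targets:
                    new[p] += share
            else:                                        # dangling page: spread rank uniformly
                for p in pages:
                    new[p] += (1.0 - alpha) * R[q] / n
        total = sum(new.values())
        R = {p: v / total for p, v in new.items()}       # normalize: ranks sum to 1
    return R
```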

Linear Algebra Version
R = c(AR + E). Since ||R||₁ = 1, R = c(A + E × 1ᵀ)R, where 1 is the vector consisting of all 1's. So R is an eigenvector of (A + E × 1ᵀ).
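A small NumPy illustration of the matrix form on a made-up 3-page graph: R is computed as the principal eigenvector of the modified transition matrix.

```python
import numpy as np

links = {0: [1, 2], 1: [2], 2: [0]}   # assumed toy graph: page -> pages it links to
n = 3
A = np.zeros((n, n))
for u, targets in links.items():
    for v in targets:
        A[v, u] = 1.0 / len(targets)  # A[v, u] = 1/N_u if u -> v

alpha = 0.15
M = (1 - alpha) * A + alpha * np.ones((n, n)) / n   # add the uniform rank source

eigvals, eigvecs = np.linalg.eig(M)
principal = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
R = principal / principal.sum()       # normalize so the ranks sum to 1
print(R)
```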

Random Surfer Model PageRank can be seen as modeling a “random surfer” that starts on a random page and then at each point: –With probability E(p) randomly jumps to page p. – Otherwise, randomly follows a link on the current page. R(p) models the probability that this random surfer will be on page p at any given time. “E jumps” are needed to prevent the random surfer from getting “trapped” in web sinks with no outgoing links.
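A Monte-Carlo illustration of the random-surfer reading: simulate the walk and the visit frequencies approximate PageRank (the graph, jump probability, and step count are arbitrary choices for the sketch).

```python
import random

def random_surfer(out_links, jump_prob=0.15, steps=100_000, seed=0):
    """Fraction of steps the surfer spends on each page approximates its PageRank."""
    rng = random.Random(seed)
    pages = list(set(out_links) | {q for ts in out_links.values() for q in ts})
    visits = {p: 0 for p in pages}
    page = rng.choice(pages)
    for _ in range(steps):
        visits[page] += 1
        targets = list(out_links.get(page, []))
        if not targets or rng.random() < jump_prob:
            page = rng.choice(pages)      # bored (or stuck in a sink): jump to a random page
        else:
            page = rng.choice(targets)    # otherwise follow a random out-link
    return {p: v / steps for p, v in visits.items()}
```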

Justifications for using PageRank Attempts to model user behavior Captures the notion that the more a page is pointed to by “important” pages, the more it is worth looking at Takes into account global structure of web

Speed of Convergence Early experiments on Google used 322 million links. PageRank algorithm converged (within small tolerance) in about 52 iterations. Number of iterations required for convergence is empirically O(log n) (where n is the number of links). Therefore calculation is quite efficient.

Google Ranking Complete Google ranking includes (based on university publications prior to commercialization). –Vector-space similarity component. –Keyword proximity component. –HTML-tag weight component (e.g. title preference). –PageRank component. Details of current commercial ranking functions are trade secrets. –PageRank becomes GoogleRank!

Personalized PageRank
PageRank can be biased (personalized) by changing E to a non-uniform distribution, restricting "random jumps" to a set of specified relevant pages. For example, let E(p) = 0 except for one's own home page, for which E(p) = α. This results in a bias toward pages that are closer in the web graph to your own home page. A sketch follows below.
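A sketch of the personalized variant: identical to the earlier PageRank iteration except that jumps follow a user-supplied distribution E (the example home-page URL is hypothetical).

```python
def personalized_pagerank(out_links, E, alpha=0.15, iterations=50):
    """Same power iteration as before, but the rank source is the distribution E
    (e.g. all mass on one's own home page) instead of the uniform vector."""
    pages = set(out_links) | {q for ts in out_links.values() for q in ts}
    R = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: alpha * E.get(p, 0.0) for p in pages}
        for q, targets in out_links.items():
            for p in targets:
                new[p] += (1.0 - alpha) * R[q] / len(targets)
        total = sum(new.values()) or 1.0
        R = {p: v / total for p, v in new.items()}
    return R

# Bias all jumps toward a single (hypothetical) home page:
# personalized_pagerank(graph, E={"http://my.home/": 1.0})
```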

Google PageRank-Biased Crawling Use PageRank to direct (focus) a crawler on “important” pages. Compute page-rank using the current set of crawled pages. Order the crawler’s search queue based on current estimated PageRank.

Information Retrieval Using PageRank
Remember: no queries are used when computing PageRank. PageRank is precomputed (static rank); it is a query-independent ranking. How are queries put into the final ranking? Simple method: combine term similarity (dynamic rank) with PageRank, e.g. final rank = R(p) × similarity. In a Lucene index this can be added as a boosting term.

Combining Term Weighting with PageRank – another approach
Combined method:
1. Find all documents that share a term with the query vector.
2. The similarity, using conventional term weighting, between the query and document j is s_j.
3. The rank of document j using PageRank or another graph ranking is p_j.
4. Calculate a combined rank c_j = λs_j + (1−λ)p_j, where λ is a constant.
5. Display the hits ranked by c_j.
This method is used in several commercial systems, but the details have not been published. A sketch follows below.
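A sketch of step 4, with an illustrative λ = 0.7 (the real mixing constants are unpublished):

```python
def combined_rank(similarity, pagerank, lam=0.7):
    """c_j = lam * s_j + (1 - lam) * p_j for every document j that matched the query."""
    return {doc: lam * s + (1 - lam) * pagerank.get(doc, 0.0)
            for doc, s in similarity.items()}

ranked = sorted(combined_rank({"d1": 0.9, "d2": 0.4}, {"d1": 0.1, "d2": 0.8}).items(),
                key=lambda kv: kv[1], reverse=True)
print(ranked)
```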

Why do we need fast link-based ranking?
"…The link structure of the Web is significantly more dynamic than the contents on the Web. Every week, about 25% new links are created. After a year, about 80% of the links on the Web are replaced with new ones. This result indicates that search engines need to update link-based ranking metrics very often…" [Cho et al., 04]

Accelerating PageRank Web Graph Compression to fit in internal memory [Boldi et al., 04] Efficient External memory implementation [Haveliwala, 99; Chen et al., 02] Mathematical approaches Combination of the above strategies

Accelerating PageRank
Adaptive power method: let C = the set of pages that have converged and N = the set of pages not yet converged; run the power method while detecting converged components (the paper describes many other adaptive strategies). Slow-converging pages have high PageRank [Kamvar et al., 03]. Reported speedup: about 22% time reduction.

Link Analysis Conclusions Link analysis uses information about the structure of the web graph to aid search. It is one of the major innovations in web search. It is the primary reason for Google’s success. Still lots of research regarding improvements

Limits of Link Analysis Stability –Adding even a small number of nodes/edges to the graph has a significant impact Topic drift –A top authority may be a hub of pages on a different topic resulting in increased rank of the authority page Content evolution –Adding/removing links/content can affect the intuitive authority rank of a page requiring recalculation of page ranks

Google Architecture - preliminary Implemented in Perl, C and C++ on Solaris and Linux

Preliminary “Hitlist” is defined as list of occurrences of a particular word in a particular document including additional meta info: - position of word in doc - font size - capitalization - descriptor type, e.g. title, anchor, etc.
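A rough sketch of what such a hit record might look like as a data structure; the field names and types are assumptions, not Google's actual encoding:

```python
from dataclasses import dataclass

@dataclass
class Hit:
    """One occurrence of a word in a document, mirroring the fields listed above."""
    position: int        # word offset within the document
    font_size: int       # relative font-size bucket
    capitalized: bool
    descriptor: str      # e.g. "plain", "title", "anchor"

# A hitlist is then simply the list of Hit records for one (word, document) pair.
hitlist = [Hit(position=3, font_size=2, capitalized=True, descriptor="title")]
```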

Google Architecture (cont.)
Keeps track of URLs that have been and need to be crawled. Compresses and stores web pages. Multiple crawlers run in parallel; each crawler keeps its own DNS lookup cache and ~300 connections open at once. Uncompresses and parses documents; stores link information in an anchors file. Stores each link and the text surrounding the link; converts relative URLs into absolute URLs. Contains the full HTML of every web page; each document is prefixed by docID, length, and URL.

Google Architecture (cont.) Maps absolute URLs into docIDs stored in Doc Index. Stores anchor text in “barrels”. Generates database of links (pairs of docIds). Parses & distributes hit lists into “barrels.” Creates inverted index whereby document list containing docID and hitlists can be retrieved given wordID. In-memory hash table that maps words to wordIds. Contains pointer to doclist in barrel which wordId falls into. Partially sorted forward indexes sorted by docID. Each barrel stores hitlists for a given range of wordIDs. DocID keyed index where each entry includes info such as pointer to doc in repository, checksum, statistics, status, etc. Also contains URL info if doc has been crawled. If not just contains URL.

Google Architecture (cont.) List of wordIds produced by Sorter and lexicon created by Indexer used to create new lexicon used by searcher. Lexicon stores ~14 million words. New lexicon keyed by wordID, inverted doc index keyed by docID, and PageRanks used to answer queries 2 kinds of barrels. Short barrell which contain hit list which include title or anchor hits. Long barrell for all hit lists.

Scalability ,000 10, ,000 1,000,000 10,000, ,000,000 1,000,000,000 10,000,000, The growth of the web

Scalability
Web search services are centralized services built on large distributed systems. Over the past 9 years, Moore's Law has enabled the services to keep pace with the growth of the web and the number of users, while adding extra functionality. Will this continue? Possible areas for concern are staff costs, telecommunications costs, and disk access rates.

Growth of Web Searching
In November 1997, AltaVista was handling 20 million searches/day; Google's forecast for 2000 was hundreds of millions of searches/day. In 2004, Google reported 250 million web searches/day and estimated that the total over all engines was 500 million searches/day.
Moore's Law and web searching: in 7 years, Moore's Law predicts computer power will increase by a factor of at least 2^4 = 16. It appears that computing power is growing at least as fast as web searching.

Scalability: Performance Very large numbers of commodity computers Algorithms and data structures scale linearly Storage –Scale with the size of the Web –Compression/decompression System –Crawling, indexing, sorting simultaneously Searching –Bounded by disk I/O

Growth of Google
In 2000: 85 people, 50% technical, 14 with a Ph.D. in Computer Science. Equipment in 2000: 2,500 Linux machines, 80 terabytes of spinning disks, 30 new machines installed daily. A 2005 estimate by Paul Strassmann put the number of servers at 200,000, while unspecified sources claimed upwards of 450,000, and later estimates around 900,000 servers. By fall 2002, Google had grown to over 400 people; in 2004, Google hired 1,000 new people. 2008: 16,800 employees, $15 billion in sales (roughly $1 million in sales per employee). 2011: 32,500 employees, $40 billion in sales. 2012: 53,900 employees, $50 billion in sales.

Platform
Custom software:
–Google Web Server (GWS): custom Linux-based web server that Google uses for its online services
–Storage systems: Google File System and its successor, Colossus
–BigTable: structured storage built upon GFS/Colossus
–Spanner: planet-scale structured storage system, the next generation of the BigTable stack
–Google F1: a distributed, quasi-SQL DBMS based on Spanner, replacing a custom version of MySQL
–Chubby lock service
–Borg: job scheduling and monitoring system
–MapReduce and the Sawzall programming language
–Indexing/search systems: TeraGoogle, Google's large search index (launched in early 2006), designed by Anna Patterson of Cuil fame; Caffeine (Percolator), a continuous indexing system (launched in 2010)
Private networks: 3rd largest ISP. Data centers throughout the world.

Scalability: Numbers of Computers Very rough calculation In March 2000, 5.5 million searches per day, required 2,500 computers In fall 2004, computers are about 8 times more powerful. Estimated number of computers for 250 million searches per day: = (250/5.5) x 2,500/8 = about 15,000 Some industry estimates suggest that today Google may have as many as 500,000 computers.

Scalability: Staff Programming: Have very well trained staff. Isolate complex code. System maintenance: Organize for minimal staff (e.g., automated log analysis, do not fix broken computers). Customer service: Automate everything possible, but complaints, large collections, etc. require staff.

Evaluation of Web Searching
–The test corpus must be dynamic: the web is dynamic (10%-20% of URLs change every month)
–Spam methods change continually
–Queries are time sensitive: topics are hot and then not; need a sample of real queries
–New emphasis, 2007, QDF: "query deserves freshness"
–Languages: at least 90 different languages, reflected in cultural and technical differences
(Amit Singhal, Google, 2004)

Other Uses of Web Crawling and Associated Technology The technology developed for web search services has many other applications. Conversely, technology developed for other Internet applications can be applied in web searching Related objects (e.g., Amazon's "Other people bought the following"). Recommender and reputation systems (e.g., ePinion's reputation system).

Google Status July 2006 Nearly 500,000 linux boxes (servers) 20 billion pages and counting 100 million queries a day

Google Status - August 2007

Google Status - October 2008

Google Status - October 2009

Google Status - July 2010

Number of searches handled

Number of searches handled (chart)

Number of searches handled – 2010 (chart)

ComScore global share: number of search engine queries in the US, about 500M per day.

ComScore global share: number of search engine queries in the US, about half a billion per day.

Dec billion internet users

2012

Search Status

Search Status

Second generation Web search Ranking using web specific data –HTML tag information –click stream information (DirectHit) people vote with their clicks –directory information (Yahoo! directory) –anchor text –link analysis

Link Analysis Ranking - Google Intuition: a link from q to p denotes endorsement –people vote with their links Popularity count –rank according to the incoming links PageRank algorithm –perform a random walk on the Web graph. The pages visited most often are the ones most important.

Second generation SE performance Good performance for answering navigational queries –“finding needle in a haystack” … and informational queries –e.g “oscar winners” Resistant to text spamming Generated substantial amount of research Latest trend: specialized search engines

Result evaluation
–Recall becomes useless
–Precision is measured over the top 10/20 results
–Shift of interest from "relevance" to "authoritativeness/reputation" or importance
–Ranking becomes critical

What’s coming? More personal search Social search Mobile search Specialty search Freshness search 3 rd generation search? Will anyone replace GoogleWill anyone replace Google? “Search as a problem is only 5% solved” Udi Manber, 1st Yahoo, 2nd Amazon, now Google

Are Google rankings fair? Are any search engine's rankings fair? Try out certain queries:
–"Jew" in the USA vs. "Jew" in Germany
–"tiananmen square" on Google.com vs. Google.cn vs. Baidu.com

What is the role of search engines in society?
Should these queries be banned?
–"how do I make a dirty bomb"
–"how do I make a black hole"
Do search engines have a social responsibility?

Social implications of public search engines
What is the role of web search in society?
–Human flesh search engines
–What is the responsibility of search engines? The "googlization" of everything
Should governments control search engines? Positions:
–Search is too important to be left to the private sector
–Search will only work from 8 to 5 on weekdays and will be off on holidays
–New Chinese government search engine: Panguso

Google vs Bing

Google of the future

What next? GoogleFace or Faceoogle

What we covered
–Search engine types and business models
–Ranking by importance: link analysis, citations, PageRank (query independent; must be combined with a lexical similarity), hubs/authorities (query dependent)
–Google: role in search, architecture
We covered the basics of modern search engines; search engines are constantly evolving.