
Slide 1: HITS and PageRank; Google
CIS 455/555: Internet and Web Systems, University of Pennsylvania
April 4, 2016. © 2016 A. Haeberlen, Z. Ives

Slide 2: Announcements
- HW3 is due today! The Basic Testing Guide is now available; please test your solution carefully!
- Last call for team registrations! If you don't register a team by ~3pm today, I will assign you to one.
- Reading: see web page

Slide 3: Web search before 1998
- Based on information retrieval: Boolean / vector model, etc.
- Based purely on 'on-page' factors, i.e., the text of the page
- Results were not very good:
  - The web doesn't have an editor to control quality
  - The web contains deliberately misleading information (→ SEO)
  - Great variety in types of information: phone books, catalogs, technical reports, slide shows, ...
  - Many languages, partial descriptions, jargon, ...
- How can we improve the results?

Slide 4: Plan for today
- HITS: hubs and authorities (NEXT)
- PageRank: iterative computation; random-surfer model; refinements: sinks and hogs
- Google: how Google worked in 1998; Google over the years; SEOs

Slide 5: Goal: Find authoritative pages
- Many queries are relatively broad: "cats", "harvard", "iphone", ...
- Consequence: an abundance of results. There may be thousands or even millions of pages that contain the search term, incl. personal homepages, rants, ...
- IR-type ranking isn't enough; this is still way too much for a human user to digest. We need to further refine the ranking!
- Idea: look for the most authoritative pages. But how do we tell which pages these are?
- Problem: there is no endogenous measure of authoritativeness, so it is hard to tell just by looking at the page. We need some 'off-page' factors.

Slide 6: Idea: Use the link structure
- Hyperlinks encode a considerable amount of human judgment. What does it mean when a web page links to another web page?
  - Intra-domain links: often created primarily for navigation
  - Inter-domain links: confer some measure of authority
- So, can we simply boost the rank of pages with lots of inbound links?

Slide 7: Relevance ≠ Popularity!
[Diagram: an example link graph connecting Mr. T's page, the "A-Team" page, Hollywood's "Series to Recycle" page, the Yahoo Directory, Wikipedia, a Team Sports page, and a Cheesy TV Shows page]

Slide 8: Hubs and authorities
- Idea: give more weight to links from hub pages that point to lots of other authorities
- Mutually reinforcing relationship:
  - A good hub is one that points to many good authorities
  - A good authority is one that is pointed to by many good hubs
[Diagram: a hub page A pointing to an authority page B]

Slide 9: HITS
Algorithm for a query Q:
1. Start with a root set R, e.g., the t highest-ranked pages from the IR-style ranking for Q
2. For each p ∈ R, add all the pages p points to, and up to d pages that point to p. Call the resulting set S.
3. Assign each page p ∈ S an authority weight x_p and a hub weight y_p; initially, set all weights to be equal and sum to 1
4. For each p ∈ S, compute new weights x_p and y_p as follows:
   - New x_p := sum of all y_q such that q → p is an inter-domain link
   - New y_p := sum of all x_q such that p → q is an inter-domain link
5. Normalize the new weights such that both the sum of all the x_p and the sum of all the y_p are 1
6. Repeat from step 4 until a fixpoint is reached
If A is the adjacency matrix, the fixpoints are the principal eigenvectors of A^T A and A A^T, respectively.
[Diagram: the root set R inside the expanded base set S]
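
A minimal Python sketch of the iteration in steps 3-6, assuming the base set S is given as a dict mapping each page to the set of pages it links to; for simplicity it treats every link as eligible (ignoring the inter-domain restriction) and runs a fixed number of rounds instead of testing for the fixpoint:

    def hits(graph, iterations=50):
        # Collect all pages in S (sources and targets of links)
        pages = set(graph) | {q for targets in graph.values() for q in targets}
        auth = {p: 1.0 / len(pages) for p in pages}  # authority weights x_p
        hub = {p: 1.0 / len(pages) for p in pages}   # hub weights y_p
        for _ in range(iterations):
            # New x_p: sum of hub weights y_q over all pages q that link to p
            auth = {p: sum(hub[q] for q in pages if p in graph.get(q, ()))
                    for p in pages}
            # New y_p: sum of authority weights x_q over all pages q that p links to
            hub = {p: sum(auth[q] for q in graph.get(p, ())) for p in pages}
            # Normalize so that each weight vector sums to 1
            a_sum, h_sum = sum(auth.values()), sum(hub.values())
            auth = {p: x / a_sum for p, x in auth.items()}
            hub = {p: y / h_sum for p, y in hub.items()}
        return auth, hub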

Slide 10: Recap: HITS
- Improves the ranking based on link structure. Intuition: links confer some measure of authority. The overall ranking is a combination of the IR ranking and this.
- Based on the concept of hubs and authorities:
  - Hub: points to many good authorities
  - Authority: is pointed to by many good hubs
- Iterative algorithm to assign hub/authority scores
- Query-specific: there is no notion of the 'absolute quality' of a page; the ranking needs to be computed for each new query

Slide 11: Plan for today
- HITS: hubs and authorities
- PageRank (NEXT): iterative computation; random-surfer model; refinements: sinks and hogs
- Google: how Google worked in 1998; Google over the years; SEOs

Slide 12: Google's PageRank (Brin/Page 98)
- A technique for estimating page quality, based on the web link graph, just like HITS
- Like HITS, relies on a fixpoint computation
- Important differences from HITS:
  - No hub/authority distinction; just a single value per page
  - Query-independent
- Results are combined with the IR score; think of it as: TotalScore = IR score * PageRank
- In practice, search engines use many other factors (for example, Google says it uses more than 200)

Slide 13: PageRank: Intuition
- Imagine a contest for The Web's Best Page:
  - Initially, each page has one vote
  - Each page votes for all the pages it links to
  - To ensure fairness, a page voting for more than one page must split its vote equally between them
  - Voting proceeds in rounds; in each round, each page has the number of votes it received in the previous round
- In practice, it's a little more complicated, but not much!
[Diagram: an example graph with pages A through J. Shouldn't E's vote be worth more than F's? How many levels should we consider?]

Slide 14: PageRank
- Each page i is given a rank x_i
- Goal: assign the x_i such that the rank of each page is governed by the ranks of the pages linking to it:

      x_i = \sum_{j \in B(i)} x_j / N(j)

  where B(i) is the set of pages j that link to i, and N(j) is the number of links out from page j
- How do we compute the rank values?

Slide 15: Iterative PageRank (simplified)
- Initialize all ranks to be equal, e.g.: x_i^{(0)} = 1/n for each of the n pages
- Iterate until convergence: x_i^{(k+1)} = \sum_{j \in B(i)} x_j^{(k)} / N(j)
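
A minimal sketch of this loop in Python, under the assumption that the graph is a dict mapping each page to the set of pages it links to (sinks are simply skipped here, which is exactly the problem addressed a few slides later):

    def pagerank_simple(graph, tol=1e-6):
        n = len(graph)
        rank = {p: 1.0 / n for p in graph}          # all ranks start equal
        while True:
            new_rank = {p: 0.0 for p in graph}
            for j, out_links in graph.items():
                if not out_links:
                    continue                        # sink: its rank is lost ("Oops #1")
                share = rank[j] / len(out_links)    # x_j / N(j)
                for i in out_links:
                    new_rank[i] += share
            if max(abs(new_rank[p] - rank[p]) for p in graph) < tol:
                return new_rank                     # converged to the fixpoint
            rank = new_rank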

Slide 16: Example: Step 0
- Initialize all ranks to be equal
[Diagram: a three-page graph; each page starts with rank 0.33]

Slide 17: Example: Step 1
- Propagate weights across out-edges
[Diagram: each page's rank is split equally across its out-edges, producing edge weights 0.17, 0.33, and 0.17]

Slide 18: Example: Step 2
- Compute weights based on in-edges
[Diagram: summing the incoming edge weights gives new ranks 0.17, 0.50, and 0.33]

Slide 19: Example: Convergence
[Diagram: after repeated iterations, the ranks converge to 0.2, 0.4, and 0.4]

Slide 20: Naive PageRank Algorithm Restated
- Let N(p) = the number of outgoing links from page p, and B(p) = the set of back-links to page p (the pages that link to p)
- Each page b distributes its importance to all of the pages it points to (so we scale by 1/N(b))
- Page p's importance is increased by the importance of its back set:

      x_p = \sum_{b \in B(p)} x_b / N(b)

Slide 21: In Linear Algebra Formulation
- Create an m x m matrix M to capture links:
  - M(i, j) = 1 / n_j if page i is pointed to by page j, where page j has n_j outgoing links
  - M(i, j) = 0 otherwise
- Initialize all PageRanks to 1, then multiply by M repeatedly until all values converge:

      x^{(k+1)} = M x^{(k)}

- This computes the principal eigenvector via power iteration
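
This computation is a few lines with numpy; the matrix layout follows the definition above (column j holds page j's outgoing shares), and the fixed iteration count is an assumption for illustration:

    import numpy as np

    def power_iteration(M, iterations=50):
        x = np.ones(M.shape[0])   # all PageRanks start at 1
        for _ in range(iterations):
            x = M @ x             # one round of rank propagation
        return x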

Slide 22: A Brief Example
[Diagram: a three-page web of Google, Yahoo, and Amazon; the link matrix below is reconstructed from the iteration values shown on the slide]

      (g')   (0   0.5  0.5)   (g)
      (y') = (0   0    0.5) * (y)
      (a')   (1   0.5  0  )   (a)

Running for multiple iterations, starting from (g, y, a) = (1, 1, 1):
(1, 0.5, 1.5), (1, 0.75, 1.25), ..., converging to (1, 0.67, 1.33)
Note that the total rank always sums to the number of pages.
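
Plugging this matrix into the power_iteration sketch from the previous slide reproduces the sequence above:

    M = np.array([[0.0, 0.5, 0.5],     # Google is pointed to by Yahoo and Amazon
                  [0.0, 0.0, 0.5],     # Yahoo is pointed to by Amazon
                  [1.0, 0.5, 0.0]])    # Amazon is pointed to by Google and Yahoo
    print(power_iteration(M).round(2)) # -> [1.   0.67 1.33]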

Slide 23: Oops #1 - PageRank Sinks
[Diagram: Yahoo becomes a 'dead end' with no outgoing links; the matrix below is reconstructed from the iteration values shown on the slide]

      (g')   (0    0  0.5)   (g)
      (y') = (0.5  0  0.5) * (y)
      (a')   (0.5  0  0  )   (a)

Running for multiple iterations: (1, 1, 1), (0.5, 1, 0.5), (0.25, 0.5, 0.25), ..., (0, 0, 0)
Yahoo is a 'dead end': the PageRank that flows into it is lost after each round.

Slide 24: Oops #2 - PageRank Hogs
[Diagram: Yahoo now links only to itself; the matrix below is reconstructed from the iteration values shown on the slide]

      (g')   (0  0  0.5)   (g)
      (y') = (1  1  0  ) * (y)
      (a')   (0  0  0.5)   (a)

Running for multiple iterations: (1, 1, 1), (0.5, 2, 0.5), (0.25, 2.5, 0.25), ..., (0, 3, 0)
PageRank cannot flow out of Yahoo, so it accumulates there.

Slide 25: Improved PageRank
- Remove nodes with out-degree 0 (or consider them to refer back to their referrers)
- Add a decay factor d to deal with sinks; typical value: d = 0.85

      x_i = (1 - d) + d \sum_{j \in B(i)} x_j / N(j)

  With this update, the total rank still sums to the number of pages.
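
A sketch of the damped update in Python (same dict-of-sets graph representation as before; it assumes out-degree-0 nodes have already been removed, as the slide suggests):

    def pagerank_damped(graph, d=0.85, iterations=50):
        rank = {p: 1.0 for p in graph}              # ranks start at 1 and sum to n
        for _ in range(iterations):
            new_rank = {p: 1.0 - d for p in graph}  # baseline share for every page
            for j, out_links in graph.items():
                for i in out_links:
                    new_rank[i] += d * rank[j] / len(out_links)
            rank = new_rank
        return rank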

Slide 26: Random Surfer Model
- PageRank has an intuitive basis in random walks on graphs
- Imagine a random surfer, who starts on a random page and, in each step:
  - with probability d, clicks on a random link on the current page
  - with probability 1-d, jumps to a random page (bored?)
- The PageRank of a page can be interpreted as the fraction of steps the surfer spends on the corresponding page
- The transition matrix can be interpreted as a Markov chain
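
The random-surfer reading can be checked directly with a small Monte Carlo simulation (a sketch; the graph representation and step count are arbitrary choices):

    import random

    def simulate_surfer(graph, d=0.85, steps=100_000):
        pages = list(graph)
        visits = {p: 0 for p in pages}
        page = random.choice(pages)                    # start on a random page
        for _ in range(steps):
            visits[page] += 1
            if random.random() < d and graph[page]:
                page = random.choice(list(graph[page]))  # click a random link
            else:
                page = random.choice(pages)            # bored: jump to a random page
        # The fraction of steps spent on each page approximates its normalized PageRank
        return {p: v / steps for p, v in visits.items()}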

Slide 27: Stopping the Hog
[Diagram: Google, Yahoo, and Amazon, with Yahoo still linking only to itself]

      (g')          (0  0  0.5)   (g)   (0.15)
      (y') = 0.85 * (1  1  0  ) * (y) + (0.15)
      (a')          (0  0  0.5)   (a)   (0.15)

Running for multiple iterations: (0.57, 1.85, 0.57), (0.39, 2.21, 0.39), (0.32, 2.36, 0.32), ..., converging to (0.26, 2.48, 0.26)
The hog no longer absorbs all of the rank... though does this seem right?

Slide 28: Reminder: Search-Engine Optimization
- White-hat techniques: Google webmaster tools; adding meta tags to documents; etc.
- Black-hat techniques:
  - Link farms
  - Keyword stuffing, hidden text, meta-tag stuffing, ...
  - Spamdexing. Initial solution: ... Some people started to abuse this to improve their own rankings
  - Doorway pages / cloaking: special pages just for search engines. BMW Germany and Ricoh Germany were banned in February 2006
  - Link buying
- You need countermeasures for these; otherwise they can degrade the quality of your results

Slide 29: Recap: PageRank
- Estimates the absolute 'quality' or 'importance' of a given page based on inbound links
  - Query-independent
  - Can be computed via fixpoint iteration
  - Can be interpreted as the fraction of time a 'random surfer' would spend on the page
  - Several refinements, e.g., to deal with sinks
- Considered relatively stable, but vulnerable to black-hat SEO
- An important factor, but not the only one:
  - The overall ranking is based on many factors (Google: >200)
  - Need to perform rank merging, e.g., with TF/IDF scores; TF/IDF can ensure high precision, and PageRank high quality

Slide 30: What could be the other 200 factors?
Note: this is entirely speculative!
- On-page, positive: keyword in the title? in the URL? keyword in the domain name? quality of the HTML code; page freshness; rate of change; ...
- On-page, negative: links to a 'bad neighborhood'; keyword stuffing; over-optimization; hidden content (text in the same color as the background); automatic redirect/refresh; ...
- Off-page, positive: high PageRank; anchor text of inbound links; links from authority sites; links from well-known sites; domain expiration date; ...
- Off-page, negative: fast increase in the number of inbound links (link buying?); link farming; serving different pages to users and spiders; content duplication; ...
Source: Web Information Systems, Prof. Beat Signer, VU Brussels

Slide 31: Beyond PageRank
- PageRank assumes a "random surfer" who starts at any node, and estimates the likelihood that the surfer will end up at a particular page
- A more general notion: label propagation
  - Take a set of start nodes, each with a different label
  - Estimate, for every node, the distribution of arrivals from each label
  - In essence, this captures the relatedness or influence of nodes
  - Used in YouTube video matching, schema matching, ... (see the sketch below)
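
One way to make this concrete is a personalized random walk per label; this sketch (the names and the restart-based formulation are assumptions, not a specific production algorithm) computes, for each label, where a surfer starting at that label's seed node tends to arrive:

    import numpy as np

    def label_propagation(M, seeds, d=0.85, iterations=50):
        # M: column-stochastic link matrix; seeds: {label: seed node index}
        n = M.shape[0]
        arrivals = {}
        for label, start in seeds.items():
            restart = np.zeros(n)
            restart[start] = 1.0       # jumps return to this label's start node
            x = restart.copy()
            for _ in range(iterations):
                x = d * (M @ x) + (1 - d) * restart
            arrivals[label] = x        # arrival distribution for this label
        return arrivals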

Slide 32: Plan for today
- HITS: hubs and authorities
- PageRank: iterative computation; random-surfer model; refinements: sinks and hogs
- Google (NEXT): how Google worked in 1998; Google over the years; SEOs

Slide 33: Google Architecture [Brin/Page 98]
- The focus was on scalability to the size of the Web
- The first engine to really exploit link analysis
- Started as an academic project @ Stanford; became a startup
- Our discussion will be of early Google; today they keep things secret!

Slide 34: The Heart of Google Storage
- The "BigFile" system for storing indices and tables
  - Supports 2^64 bytes across multiple drives and filesystems
  - Manages its own file descriptors and resources
  - This was the predecessor to GFS
- First use: the Repository
  - Basically a warehouse of every HTML page (this is the 'cached page' entry), compressed with zlib (faster than bzip)
  - Useful for doing additional processing and any necessary rebuilds
  - Repository entry format: [DocID][ECode][UrlLen][PageLen][Url][Page]
  - The repository is indexed (not inverted here)

Slide 35: Repository Index
- One index for looking up documents by DocID
  - Done in ISAM (think of this as a B+ tree without smart re-balancing)
  - The index points to repository entries (or to the URL entry if not crawled)
- One index for mapping a URL to a DocID
  - Sorted by checksum of the URL: compute the checksum, then perform a binary search by checksum (see the sketch below)
  - Allows update by merging with another similar file. Why is this done?
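
An illustrative sketch of the URL-to-DocID path (the checksum function is an assumption; the paper does not say which one was used). Keeping entries sorted by checksum supports both binary-search lookups and batch updates by merging two sorted files:

    import hashlib
    from bisect import bisect_left

    def url_checksum(url):
        # Any fixed-width checksum works for the sketch; an md5 prefix is used here
        return int.from_bytes(hashlib.md5(url.encode()).digest()[:8], 'big')

    def lookup_docid(index, url):
        # index: list of (checksum, docid) pairs, sorted by checksum
        target = url_checksum(url)
        pos = bisect_left(index, (target,))
        if pos < len(index) and index[pos][0] == target:
            return index[pos][1]
        return None                    # not crawled yet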

Slide 36: Lexicon
- The list of searchable words (presumably, today it is also used to suggest alternative words)
- The "root" of the inverted index
- As of 1998, 14 million "words"; kept in memory (256MB at the time)
- Two parts:
  - A hash table of pointers to words and the "barrels" (partitions) they fall into
  - The list of words themselves (null-separated)

Slide 37: Indices, Inverted and "Forward"
- The inverted index is divided into "barrels" (partitions by range)
  - Indexed by the lexicon; for each DocID, it consists of a Hit List of entries in the document
  - Two barrels: short (anchor and title text) and full (all text)
- The forward index uses the same barrels
  - Indexed by DocID, then a list of WordIDs in this barrel and this document, then the Hit Lists corresponding to the WordIDs
(Original tables from http://www.cs.huji.ac.il/~sdbi/2000/google/index.htm)

Slide 38: Hit Lists (Not Mafia-Related)
- Used in the inverted and forward indices
- The goal was to minimize size, since the bulk of the data is in the hit entries
- For the 1998 version, they got it down to 2 bytes per hit (though that has likely climbed since then). Field widths in bits:
  - Plain:  cap: 1 | font: 3 | position: 12
  - Fancy:  cap: 1 | font: 7 | type: 4 | position: 8
  - Anchor: cap: 1 | font: 7 | type: 4 | hash: 4 | pos: 4
- Anchor hits are special-cased fancy hits; "font: 7" means the 3-bit font field is set to the value 7 to flag a fancy hit
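
For instance, a plain hit fits in exactly 16 bits. A hypothetical packing in Python (the field order within the two bytes is an assumption for illustration):

    def pack_plain_hit(cap, font, position):
        # cap: 1 bit, font: 3 bits, position: 12 bits -> 2 bytes total
        assert 0 <= font < 8 and 0 <= position < 4096
        return (cap & 0x1) << 15 | (font & 0x7) << 12 | (position & 0xFFF)

    def unpack_plain_hit(hit):
        return hit >> 15, (hit >> 12) & 0x7, hit & 0xFFF

    assert unpack_plain_hit(pack_plain_hit(1, 5, 1234)) == (1, 5, 1234)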

Slide 39: Google's Distributed Crawler
- A single URL Server acts as the coordinator: a queue that farms out URLs to crawler nodes
- Implemented in Python!
  - Crawlers had 300 open connections apiece
  - Each needs its own DNS cache, since DNS lookup is a major bottleneck, as we have seen
  - Based on asynchronous I/O
- Many caveats in building a "friendly" crawler (remember the robot exclusion protocol?)

Slide 40: Theory vs. practice
- Expect the unexpected: they accidentally crawled an online game
- Huge array of possible errors: typos in HTML tags, non-ASCII characters, kilobytes of zeroes in the middle of a tag, HTML tags nested hundreds deep, ...
- Social issues: lots of email and phone calls, since most people had not seen a crawler before:
  - "Wow, you looked at a lot of pages from my web site. How did you like it?"
  - "This page is copyrighted and should not be indexed"
  - ...
- Typical of new services deployed "in the wild"; we had similar experiences with our ePOST system and our measurement study of broadband networks

Slide 41: Google's Search Algorithm
1. Parse the query
2. Convert words into wordIDs
3. Seek to the start of the doclist in the short barrel for every word
4. Scan through the doclists until there is a document that matches all of the search terms
5. Compute the rank of that document
   - IR score: dot product of count weights and type weights
   - Final rank: IR score combined with PageRank
6. If we're at the end of the short barrels, start at the doclists of the full barrel, unless we have enough results
7. If not at the end of any doclist, go to step 4
8. Sort the documents by rank; return the top k
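
Step 5 might look like the following sketch; the type weights and the way the IR score is combined with PageRank are assumptions for illustration (the paper does not publish the actual values):

    TYPE_WEIGHTS = {'title': 8.0, 'anchor': 6.0, 'url': 4.0, 'plain': 1.0}

    def rank_document(hit_counts, pagerank):
        # hit_counts: {hit type: tapered occurrence count} for one document
        ir_score = sum(TYPE_WEIGHTS[t] * c for t, c in hit_counts.items())
        return ir_score * pagerank    # combined with PageRank, as on slide 12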

Slide 42: Ranking in Google
- Considers many types of information:
  - Position, font size, capitalization
  - Anchor text
  - PageRank
  - Count of occurrences (basically, TF), in a way that tapers off
  - (It is not clear whether they did IDF at the time)
- Multi-word queries also consider proximity. How?

Slide 43: Google's Resources
- In 1998:
  - 24M web pages
  - About 55GB of data without the repository; about 110GB with it
  - 293MB lexicon
  - Worked quite well on a low-end PC
- In 2007: >27 billion pages, >1.2B queries/day
  - Don't attempt to include all barrels on every machine! E.g., the 5+TB repository lives on special servers, separate from the index servers
  - Many special-purpose indexing services (e.g., images)
  - Much greater distribution of data (~500K PCs?), huge network bandwidth
  - Advertising needs to be tied in (>1M advertisers in 2007)

Slide 44: Google over the years
- August 2001: search algorithm revamped to incorporate additional ranking criteria more easily
- February 2003: local connectivity analysis gives more weight to links from experts' sites (Google's first patent)
- Summer 2003: "Fritz": the index is updated incrementally, rather than in big batches
- June 2005: personalized results; users can let Google mine their own search behavior
- December 2005: engine update allows for more comprehensive web crawling
Source: http://www.wired.com/magazine/2010/02/ff_google_algorithm/all/1

Slide 45: Google over the years (continued)
- May 2007: universal search; users can get links to any medium (images, news, books, maps, etc.) on the same results page
- December 2009: real-time search displays results from Twitter and blogs as they are posted
- August 2010: "Caffeine", a new indexing system with "50 percent fresher results"
- February 2011: major change to the algorithm: the "Panda" update (revised since; Panda 4.2 in July 2015), "designed to reduce the rankings of low-quality sites"
- The algorithm is still updated frequently
Source: http://www.wired.com/magazine/2010/02/ff_google_algorithm/all/1

