The Technology Behind. The World Wide Web In July 2008, Google announced that they found 1 trillion unique webpages! Billions of new web pages appear.

Slides:



Advertisements
Similar presentations
Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:
Advertisements

1 The PageRank Citation Ranking: Bring Order to the web Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd Presented by Fei Li.
© All Rights Reserved Web Browser A software application that enables you to view and interact with pages on the World Wide Web. Examples.
How does a web search engine work?. search  google (started 1998 … now worth $365 billion)  bing  amazon  web, images, news, maps, books, shopping,
These are slides with a history. I found them on the web... They are apparently based on Dan Weld’s class at U. Washington, (who in turn based his slides.
Presented by Li-Tal Mashiach Learning to Rank: A Machine Learning Approach to Static Ranking Algorithms for Large Data Sets Student Symposium.
How Google Relies on Discrete Mathematics Gerald Kruse Juniata College Huntingdon, PA
Google’s Map Reduce. Commodity Clusters Web data sets can be very large – Tens to hundreds of terabytes Cannot mine on a single server Standard architecture.
© nCode 2000 Title of Presentation goes here - go to Master Slide to edit - Slide 1 Anatomy of a Large-Scale Hypertextual Web Search Engine ECE 7995: Term.
ISP 433/633 Week 7 Web IR. Web is a unique collection Largest repository of data Unedited Can be anything –Information type –Sources Changing –Growing.
Google’s Map Reduce. Commodity Clusters Web data sets can be very large – Tens to hundreds of terabytes Standard architecture emerging: – Cluster of commodity.
1 COMP4332 Web Data Thanks for Raymond Wong’s slides.
Chapter 8 Web Structure Mining Part-1 1. Web Structure Mining Deals mainly with discovering the model underlying the link structure of the web Deals with.
Google and the Page Rank Algorithm Székely Endre
Search Engine Optimization
Google’s PageRank: The Math Behind the Search Engine Author:Rebecca S. Wills, 2006 Instructor: Dr. Yuan Presenter: Wayne.
Map Reduce Architecture
Web Intelligence Search and Ranking. Today The anatomy of search engines (read it yourself) The key design goal(s) for search engines Why google is good:
Λ14 Διαδικτυακά Κοινωνικά Δίκτυα και Μέσα
SIDDHARTH MEHTA PURSUING MASTERS IN COMPUTER SCIENCE (FALL 2008) INTERESTS: SYSTEMS, WEB.
Map Reduce and Hadoop S. Sudarshan, IIT Bombay
1 The Map-Reduce Framework Compiled by Mark Silberstein, using slides from Dan Weld’s class at U. Washington, Yaniv Carmeli and some other.
Roshnika Fernando P AGE R ANK. W HY P AGE R ANK ?  The internet is a global system of networks linking to smaller networks.  This system keeps growing,
SLIDE 1IS 240 – Spring 2013 Prof. Ray Larson University of California, Berkeley School of Information Principles of Information Retrieval Lecture.
Web Search. Structure of the Web n The Web is a complex network (graph) of nodes & links that has the appearance of a self-organizing structure  The.
When Experts Agree: Using Non-Affiliated Experts To Rank Popular Topics Meital Aizen.
CS315 – Link Analysis Three generations of Search Engines Anchor text Link analysis for ranking Pagerank HITS.
The PageRank Citation Ranking: Bringing Order to the Web Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd Presented by Anca Leuca, Antonis Makropoulos.
Web software. Two types of web software Browser software – used to search for and view websites. Web development software – used to create webpages/websites.
Center for E-Business Technology Seoul National University Seoul, Korea BrowseRank: letting the web users vote for page importance Yuting Liu, Bin Gao,
The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval David Goldschmidt, Ph.D. from Search Engines: Information Retrieval in Practice,
All About Nutch Michael J. Cafarella CSE 454 April 14, 2005.
Intro to Web Search Michael J. Cafarella December 5, 2007.
Web Search Algorithms By Matt Richard and Kyle Krueger.
CompSci 100E 3.1 Random Walks “A drunk man wil l find his way home, but a drunk bird may get lost forever”  – Shizuo Kakutani Suppose you proceed randomly.
Search Engines By: Faruq Hasan.
SLIDE 1IS 240 – Spring 2013 MapReduce, HBase, and Hive University of California, Berkeley School of Information IS 257: Database Management.
Google Cluster Computing Faculty Training Workshop Module III: Nutch This presentation © Michael Cafarella Redistributed under the Creative Commons Attribution.
Information Retrieval Part 2 Sissi 11/17/2008. Information Retrieval cont..  Web-Based Document Search  Page Rank  Anchor Text  Document Matching.
1 1 COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani.
The World Wide Web. What is the worldwide web? The content of the worldwide web is held on individual pages which are gathered together to form websites.
Data Parallel and Graph Parallel Systems for Large-scale Data Processing Presenter: Kun Li.
CompSci 100E 4.1 Google’s PageRank web site xxx web site yyyy web site a b c d e f g web site pdq pdq.. web site yyyy web site a b c d e f g web site xxx.
The Nutch Open-Source Search Engine CSE 454 Slides by Michael J. Cafarella.
MapReduce: Simplied Data Processing on Large Clusters Written By: Jeffrey Dean and Sanjay Ghemawat Presented By: Manoher Shatha & Naveen Kumar Ratkal.
1 Student Date Time Wei Li Nov 30, 2015 Monday 9:00-9:25am Shubbhi Taneja Nov 30, 2015 Monday9:25-9:50am Rodrigo Sanandan Dec 2, 2015 Wednesday9:00-9:25am.
PageRank Google : its search listings always seemed deliver the “good stuff” up front. 1 2 Part of the magic behind it is its PageRank Algorithm PageRank™
COMP7330/7336 Advanced Parallel and Distributed Computing MapReduce - Introduction Dr. Xiao Qin Auburn University
1 CS 430 / INFO 430: Information Retrieval Lecture 20 Web Search 2.
Search Engine Marketing Science Writers Conference 2009.
Crawling When the Google visit your website for the purpose of tracking, Google does this with help of machine, known as web crawler, spider, Google bot,
Search Engine Optimization
Lecture 4. MapReduce Instructor: Weidong Shi (Larry), PhD
PageRank & Random Walk “The important of a Web page is depends on the readers interest, knowledge and attitudes…” –By Larry Page, Co-Founder of Google.
OCR A-Level Computing - Unit 01 Computer Systems Lesson 1. 3
MapReduce: Simplified Data Processing on Large Clusters
The Map-Reduce Framework
SEARCH ENGINES & WEB CRAWLER Akshay Ghadge Roll No: 107.
Map Reduce.
Auburn University COMP7330/7336 Advanced Parallel and Distributed Computing MapReduce - Introduction Dr. Xiao Qin Auburn.
Introducing the World Wide Web
Aidan Hogan CC Procesamiento Masivo de Datos Otoño 2017 Lecture 7: Information Retrieval II Aidan Hogan
PageRank and Markov Chains
Lecture 3. MapReduce Instructor: Weidong Shi (Larry), PhD
Aidan Hogan CC Procesamiento Masivo de Datos Otoño 2018 Lecture 7 Information Retrieval: Ranking Aidan Hogan
The Anatomy of a Large-Scale Hypertextual Web Search Engine
PageRank & Random Walk “The important of a Web page is depends on the readers interest, knowledge and attitudes…” –By Larry Page, Co-Founder of Google.
Information retrieval and PageRank
MapReduce: Simplified Data Processing on Large Clusters
PageRank PAGE RANK (determines the importance of webpages based on link structure) Solves a complex system of score equations PageRank is a probability.
Presentation transcript:

The Technology Behind

The World Wide Web In July 2008, Google announced that they found 1 trillion unique webpages! Billions of new web pages appear each day! About 1 billion Internet users (and growing)!! The World Wide Web in Image source: 2

Use a huge number of computers – Data Centers An ordinary Google Search uses machines! Search through a massive number of webpages – MapReduce Find which webpages match your query - PageRank 3

Data Centers Google’s Data Center in The Dalles, Oregon Buildings that house computing equipment Contain 1000s of computers Inside a Microsoft Data Center 4

Google’s 36 Data Centers 5

Why do we need so many computers? Searching the Internet is like looking for a needle in a haystack! – There are a trillion webpages – There are millions of users – Imagine writing a for or while- loop to search the contents of each webpage!  Use the 1000s of computers in parallel to speed up the search 6

Map/Reduce Adapted from the Lisp programming language Easy to distribute across many computers 7 MapReduce slides adapted from Dan Weld’s slides at U. Washington:

Map/Reduce in Lisp ( map f list [list 2 list 3 …]) ( map square ‘( )) o ( ) ( reduce + ‘( )) o (+ 16 (+ 9 (+ 4 1) ) ) o 30 Unary operator Binary operator 8

Map/Reduce ala Google map(key, val) is run on each item in set – emits new-key / new-val pairs reduce(key, vals) is run for each unique key emitted by map() – emits final output 9

Example: Counting words in webpages – Input consists of (url, contents) pairs – map(key=url, val=contents): For each word w in contents, emit (w, “1”) – reduce(key=word, values=uniq_counts): Sum all “1”s in values list Emit result “(word, sum)” 10

Count, Illustrated map(key=url, val=contents): For each word w in contents, emit (w, “1”) reduce(key=word, values=uniq_counts): Sum all “1”s in values list Emit result “(word, sum)” see bob throw see spot run see1 bob1 run1 see 1 spot 1 throw1 bob1 run1 see 2 spot 1 throw1 11

Map/Reduce Job Processing Master Worker 0 Worker 1Worker 2 Worker 3Worker 4Worker 5 1.Client submits the “count” job, indicating code and input data 2.Master breaks input data into 6 chunks and assigns work to Workers. 3.After map(), Workers exchange map-output so that they can do the reduce() function 4.Master breaks reduce() keyspace into 6 chunks and assigns work to the Workers 5.Final reduce() step is done by the Master “count”

13 Finding the Right Websites for a Query Relevance - Is the document similar to the query term? Importance - Is the document useful to a variety of users? Search engine approaches – Paid advertisers – Manually created classification – Feature detection, based on title, text, anchors, … – "Popularity"

Google’s PageRank™ Algorithm Measure popularity of pages based on hyperlink structure of Web. 14 Google Founders – Larry Page and Sergei Brin

90-10 Rule Model. Web surfer chooses next page: – 90% of the time surfer clicks random hyperlink. – 10% of the time surfer types a random page. Crude, but useful, web surfing model. – No one chooses links with equal probability. – The breakdown is just a guess. – It does not take the back button or bookmarks into account. 15

Basic Ideas Behind PageRank PageRank is a probability distribution that denotes the likelihood that the “random surfer” will arrive at a particular webpage. Links coming from important pages convey more importance to a page. – If a web page has a link off the CNN home page, it may be just one link but it is a very important one. A page has high rank if the sum of the ranks of its inbound links is high. – Covers the cases where a page has many inbound links and also when a page has a few highly ranked inbound links. 16

The PageRank Algorithm Assume that there are only 4 pages – A, B, C, D and that the distribution is evenly divided among the pages – PR(A) = PR(B) = PR(C) = PR(D) = 0.25 A B C D 17

If B, C, D each only link to A B, C, and D each confer their 0.25 PageRank to A PR(A) = PR(B) + PR(C) + PR(D) = 0.75 A B C D 18

Assume B links to C and D links to B and C Value of link-votes divided amongst the outbound links on a page – B gives vote worth to A and C – D gives vote worth to A, B, C PR(A) = (PR(B)/2) + (PR(C)/1) + (PR(D)/3) A B C D 19

PageRank The PageRank for any page u: PR(u) is dependent on the PageRank values for each page v out of the set B u (this set contains all pages linking to page u), divided by the number L(v) of links from page v 20

The Life of a Google Query 21 Image Source: MapReduce + PageRank

References The paper by Larry Page and Sergei Brin that describes their Google prototype: The paper by Jeffrey Dean and Sanjay Ghemawat that describes MapReduce: Wikipedia article on PageRank: