Page Rank

Overview Two dimensional arrays Monte Carlo algorithms Searching the world wide web Big data Page rank Goal: we will write a program to compute the relevancy of WWW documents based on the static structure of the WWW.

Two Dimensional Arrays Significance (a topic on the AP Computer Science A exam) Syntax Example of matrix multiplication Arrays of arrays

Significance of Two Dimensional Arrays Tables; for instance, assignments for each student in a class, quarterly sales for each item in inventory, etc. Matrices and binary relations in mathematics. For example, is there a direct road from city1 in USA to city2 in USA? For our goal in this section, we will need the number of links from doc1 in the WWW to doc2 in the WWW.

Syntax int[][] frequency = new int[26][26]; Elements are accessed as frequency[4][7], not frequency[4,7]. Array indices in Java (as in C, C++, and C#) always begin with 0; in other words, the element with index 1 is the second element of the array.

Matrix multiplication

Matrix Multiplication Exercise http://cs.fit.edu/~ryan/java/programs/basic_algorithms/MatrixMultiplication2.java
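One possible shape for a solution, as a minimal sketch: the class and method names (MatrixSketch, multiply) are illustrative and are not taken from the linked file. It assumes rectangular arrays whose inner dimensions agree.

```java
class MatrixSketch {
    // Returns the product a*b, assuming a is n-by-m and b is m-by-p.
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length, m = b.length, p = b[0].length;
        double[][] c = new double[n][p];   // Java initializes every entry to 0.0
        for (int i = 0; i < n; i++)
            for (int j = 0; j < p; j++)
                for (int k = 0; k < m; k++)
                    c[i][j] += a[i][k] * b[k][j];
        return c;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = {{0, 1}, {1, 0}};   // swaps the two columns of a
        System.out.println(java.util.Arrays.deepToString(multiply(a, b)));
        // prints [[2.0, 1.0], [4.0, 3.0]]
    }
}
```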

Arrays of Arrays Logically: arrays of arrays in the tradition of C and C++. Very simple. Unfortunately: introduces pointers, memory allocation, etc. Very complicated.

Monte Carlo Methods Introduction The example of a Monte Carlo estimate for Pi (Java exercise). Fair shuffling (Java exercise). Random walk (important in financial analysis) Used in path tracing to create realistic images Percolation – an example of the power of a Monte Carlo algorithm Goal: we will write a Monte Carlo algorithm to estimate the relevancy of WWW documents based on the static structure of the WWW.

Monte Carlo Casino The name refers to the grand casino in the Principality of Monaco at Monte Carlo, which is well-known around the world as an icon of gambling.

Monte Carlo estimate for Pi Java exercise: http://cs.fit.edu/~ryan/java/programs/basic_algorithms/ComputePi2.java Since we know the value of pi, it is not really necessary to invent an algorithm to estimate its value.
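A minimal sketch of the usual dart-throwing estimate, assuming points drawn uniformly from the unit square; the fraction landing inside the quarter circle of radius 1 approaches pi/4. The names PiSketch and estimatePi are hypothetical and may differ from the linked exercise.

```java
import java.util.Random;

class PiSketch {
    // Throws `trials` random points into the unit square and counts those
    // that fall inside the quarter circle x*x + y*y <= 1.
    static double estimatePi(long trials, Random rng) {
        long hits = 0;
        for (long t = 0; t < trials; t++) {
            double x = rng.nextDouble();
            double y = rng.nextDouble();
            if (x * x + y * y <= 1.0) hits++;
        }
        return 4.0 * hits / trials;
    }

    public static void main(String[] args) {
        // The result varies from run to run; more trials give a tighter estimate.
        System.out.println(estimatePi(1_000_000, new Random()));
    }
}
```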

Fair shuffling (Java exercise) How would you test an algorithm for shuffling, say, cards? In particular, how would you know if all of the many possible results are equally likely? Main program http://cs.fit.edu/~ryan/java/programs/basic_algorithms/Experiment.java. Nothing to write; requires the method to shuffle. http://cs.fit.edu/~ryan/java/programs/basic_algorithms/Shuffle.java contains two methods of shuffling cards. Run the experiment with multiple trials and convince yourself both methods are fair.
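One way to run such an experiment, sketched under assumptions: shuffle a three-card deck many times and count how often each of the 6 orderings appears; a fair shuffle should give roughly equal counts. The shuffle shown is the Fisher-Yates method, which may or may not be one of the two methods in Shuffle.java, and the names here are hypothetical.

```java
import java.util.Random;

class ShuffleSketch {
    // Fisher-Yates shuffle: swap each position with a uniformly chosen
    // position at or before it.
    static void shuffle(int[] a, Random rng) {
        for (int i = a.length - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1);   // uniform in 0..i
            int tmp = a[i];
            a[i] = a[j];
            a[j] = tmp;
        }
    }

    // Shuffles {0, 1, 2} `trials` times; counts[k] is how often ordering k occurred.
    static int[] countOrderings(int trials, Random rng) {
        int[] counts = new int[6];
        for (int t = 0; t < trials; t++) {
            int[] a = {0, 1, 2};
            shuffle(a, rng);
            // Encode the ordering as 0..5: first card, then order of the other two.
            int index = a[0] * 2 + (a[1] < a[2] ? 0 : 1);
            counts[index]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // With 60000 trials each count should be close to 10000.
        System.out.println(java.util.Arrays.toString(countOrderings(60_000, new Random())));
    }
}
```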

Percolation Theory Percolation. Pour liquid on top of some porous material. Will liquid reach the bottom? Many applications in chemistry, materials science, etc. Spread of forest fires. Natural gas through semi-porous rock. Flow of electricity through network of resistors. Permeation of gas in coal mine through a gas mask filter.

Percolation Theory Given an N-by-N system where each site is vacant with probability p, what is the probability that the system percolates? Remark. Famous open question in statistical physics. No known mathematical solution. Computational thinking creates new science. Recourse. Take a computational approach: Monte Carlo simulation. The simulation uses a recursive depth-first search (DFS) algorithm, which diverges from the present topic. (Recursion is a topic on the AP Computer Science A exam.) Figure panels: p = 0.3 (does not percolate), p = 0.4 (does not percolate), p = 0.5 (does not percolate), p = 0.6 (percolates), p = 0.7 (percolates).
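A rough sketch of the Monte Carlo approach, assuming liquid may flow between vacant sites that share an edge (up, down, left, right) and that the system percolates when some bottom-row site is reached from the top row. The class and method names are hypothetical, and small N is assumed so the recursion stays shallow.

```java
import java.util.Random;

class PercolationSketch {
    // Recursive depth-first search: marks every vacant site reachable from (i, j).
    static void flow(boolean[][] vacant, boolean[][] full, int i, int j) {
        int n = vacant.length;
        if (i < 0 || i >= n || j < 0 || j >= n) return;   // off the grid
        if (!vacant[i][j] || full[i][j]) return;          // blocked or already visited
        full[i][j] = true;
        flow(vacant, full, i + 1, j);
        flow(vacant, full, i, j + 1);
        flow(vacant, full, i, j - 1);
        flow(vacant, full, i - 1, j);
    }

    // True if liquid poured on the top row reaches the bottom row.
    static boolean percolates(boolean[][] vacant) {
        int n = vacant.length;
        boolean[][] full = new boolean[n][n];
        for (int j = 0; j < n; j++) flow(vacant, full, 0, j);
        for (int j = 0; j < n; j++) if (full[n - 1][j]) return true;
        return false;
    }

    // Fraction of `trials` random n-by-n systems (site vacancy probability p) that percolate.
    static double estimate(int n, double p, int trials, Random rng) {
        int count = 0;
        for (int t = 0; t < trials; t++) {
            boolean[][] vacant = new boolean[n][n];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    vacant[i][j] = rng.nextDouble() < p;
            if (percolates(vacant)) count++;
        }
        return (double) count / trials;
    }

    public static void main(String[] args) {
        Random rng = new Random();
        for (double p = 0.3; p < 0.75; p += 0.1)
            System.out.printf("p = %.1f  estimate = %.3f%n", p, estimate(20, p, 1000, rng));
    }
}
```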

We will examine a Monte Carlo algorithm for estimating the relevancy of WWW documents.

Random Walk Page rank can be computed much like a random walk. See the Java applet (1 dim) at http://www.math.uah.edu/stat/applets/RandomWalkExperiment.html See the Java applet (2 dim) at http://vlab.infotech.monash.edu.au/simulations/swarms/random-walk/

Searching the World Wide Web History of Search Engines Hypertext Crawling the World Wide Web Indexing

History of Search Engines History of Search by Larry Kim of WordStream

Markup and Hypertext Documents served up through the WWW are generally “marked up” for presentation in a structured, standard format called Hypertext Markup Language (HTML). The most important feature of HTML is the referencing (via URLs) of other WWW documents, which enables easy, non-sequential, and varied paths of reading the documents.

Hypertext

WWW Spiders Google and others continually crawl around the WWW, recording what they see to enable searching.

On the WWW server cs.fit.edu in July 2013 (up to 9:30 am, 26 Jul 2013), 44% of hits and 35% of bandwidth are attributable to bots (and other odd things). Figure label: Russian search engine.

Indexing Finding a relevant document in a vast ocean of linked HTML documents requires a very large index. An index is a (sorted) list of keywords (terms), each with the list of documents (URLs) that contain it.

An example index of WWW documents
Bourgeois: .../manifesto.txt
Hero: …/lilwomen.txt, …/muchado.txt, …/war+peace.txt
His: .../manifesto.txt, …/lilwomen.txt, …/mobydick.txt, …/muchado.txt, …/war+peace.txt
Treachery: …/war+peace.txt
Whale: …/mobydick.txt
Yellowish: …/lilwomen.txt, …/war+peace.txt
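An index of this kind can be sketched with a sorted map from each keyword to the sorted set of documents containing it. This is only an illustration: the document names and texts in main are made up, the word-splitting rule is a simplification, and IndexSketch and build are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class IndexSketch {
    // Maps each lowercase keyword to the sorted set of document names that contain it.
    static TreeMap<String, TreeSet<String>> build(Map<String, String> docs) {
        TreeMap<String, TreeSet<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            // Simplified tokenizer: split on anything that is not a letter, digit, or apostrophe.
            for (String word : doc.getValue().toLowerCase().split("[^a-z0-9']+")) {
                if (word.isEmpty()) continue;
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new HashMap<>();
        docs.put("a.txt", "The white whale");        // made-up sample text
        docs.put("b.txt", "A hero and his whale");   // made-up sample text
        build(docs).forEach((word, urls) -> System.out.println(word + " " + urls));
    }
}
```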

Several Issues Pick out the words from the mark-up. What’s a word? 2nd, abc’s, CSTA. Normalize: lowercase, stemming. Some words are not worth indexing: “the”, “a”, etc. A so-called stop list, e.g., the words ignored in Wikipedia search. Java exercise: http://cs.fit.edu/~ryan/java/programs/xml/URLtoText.java First, some preliminary remarks before doing the exercise.

Searching and Sorting Problem: Determine if the word is in the stop list. What is the best approach? Searching: linear search, binary search. (These are topics on the AP Computer Science A exam.) Binary search requires the data (the index, for example) to be sorted. Sorting: selection sort, insertion sort, merge sort, quick sort; external sorting. (The first three of these sorts are topics on the AP Computer Science A exam.)
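As a small illustration of the stop-list problem, the sketch below keeps a tiny, made-up stop list in sorted order and uses the library binary search. The class name, the method name, and the list contents are assumptions for the example only.

```java
import java.util.Arrays;

class StopListSketch {
    // A tiny, hypothetical stop list. It must stay sorted for binary search to work.
    static final String[] STOP = {"a", "an", "and", "of", "the", "to"};

    // Normalizes to lowercase, then binary-searches the sorted stop list.
    static boolean isStopWord(String word) {
        return Arrays.binarySearch(STOP, word.toLowerCase()) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(isStopWord("The"));     // prints true
        System.out.println(isStopWord("whale"));   // prints false
    }
}
```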

Linear versus Binary Search Suppose each comparison takes one millisecond (0.001 seconds).

Linear versus Binary Search

Linear versus Binary Search
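The comparison can be worked out with a short program. This sketch uses worst-case comparison counts (n for linear search, floor(log2 n) + 1 for binary search) and the one-millisecond cost assumed above; the collection sizes in main are arbitrary choices, not figures from the slides.

```java
class SearchCostSketch {
    // Worst case for linear search: every one of the n items is compared.
    static long linearComparisons(long n) {
        return n;
    }

    // Worst case for binary search: the range is halved until it is empty,
    // which takes floor(log2 n) + 1 comparisons.
    static long binaryComparisons(long n) {
        long count = 0;
        while (n > 0) {
            n /= 2;
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        long[] sizes = {1_000L, 1_000_000L, 1_000_000_000L};
        for (long n : sizes) {
            // Each comparison is assumed to take 0.001 seconds.
            System.out.printf("n = %d: linear %.3f s, binary %.3f s%n",
                    n, linearComparisons(n) * 0.001, binaryComparisons(n) * 0.001);
        }
    }
}
```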

Obama at Google https://www.youtube.com/watch?v=k4RRi_ntQc8

Sorting Demo http://cs.fit.edu/~ryan/cse1002/sort.html See also sorting illustrated by Algo-rythmics and folk dancers: http://algo-rythmics.ms.sapientia.ro

Now do the exercise Java exercise: http://cs.fit.edu/~ryan/java/programs/xml/URLtoText.java PS. How do students really program? http://xkcd.com/1185 Observe the tool tip!

OK, we have a keyword index. It is likely we still have a “gazillion” documents for most of the terms. (See Googlewhacks and Googlewhackblatts: one- and two-word search terms that return a single document.) How do we find the most relevant pages?

Big Data The problem Count-Min Algorithm

The problem with Big Data “Consider a popular website which wants to keep track of statistics on the queries used to search the site. One could keep track of the full log of queries, and answer exactly the frequency of any search query at the site. However, the log can quickly become very large. This problem is an instance of the count tracking problem. Even known sophisticated solutions for fast querying such as a tree-structure or hash table to count up the multiple occurrences of the same query, can prove to be slow and wasteful of resources. Notice that in this scenario, we can tolerate a little imprecision. In general, we are interested only in the queries that are asked frequently. So it is acceptable if there is some fuzziness in the counts. Thus, we can tradeoff some precision in the answers for a more efficient and lightweight solution. This tradeoff is at the heart of sketches.” (Cormode and Muthukrishnan, 2011)
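The Count-Min idea can be sketched as a small table of counters: each row has its own hash function, an update increments one counter per row, and a query returns the minimum of the counters it touches, so the estimate is never below the true count. The per-row hash used here is an illustrative stand-in chosen for brevity, not the hash family a production implementation would use, and the table dimensions in main are arbitrary.

```java
class CountMinSketch {
    private final long[][] table;   // one row of counters per hash function
    private final int width;

    CountMinSketch(int depth, int width) {
        this.table = new long[depth][width];
        this.width = width;
    }

    // Illustrative per-row hash (an assumption of this sketch): mixes the
    // string's hashCode with the row number, then maps into 0..width-1.
    private int bucket(String item, int row) {
        int h = item.hashCode() + row * 0x9E3779B9;
        h ^= (h >>> 16);
        h *= 0x85EBCA6B;
        h ^= (h >>> 13);
        return Math.floorMod(h, width);
    }

    // Records one occurrence of the item: one counter per row is incremented.
    void add(String item) {
        for (int row = 0; row < table.length; row++)
            table[row][bucket(item, row)]++;
    }

    // Collisions can only inflate a counter, so the minimum over the rows
    // is the tightest available overestimate of the true count.
    long estimate(String item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < table.length; row++)
            min = Math.min(min, table[row][bucket(item, row)]);
        return min;
    }

    public static void main(String[] args) {
        CountMinSketch queries = new CountMinSketch(4, 1024);
        for (int i = 0; i < 5; i++) queries.add("page rank");
        queries.add("percolation");
        System.out.println(queries.estimate("page rank"));   // at least 5
    }
}
```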

Page Rank Gave Google a Competitive Advantage Not based on the WWW surfer as voter (popularity), but on the WWW author as voter (hence relatively static) Random surfer mindlessly follows the hyperlinks of the WWW authors Markov chains

S&W Tiny Hypertext

S&W Tiny Graph

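S&W Tiny: Adj list & Adj matrix
Adjacency list (first the number of pages, then "from to" pairs):
5
0 1
1 2 1 2
1 3 1 3 1 4
2 3
3 0
4 0 4 2
Adjacency matrix (first the dimensions, then link counts; row = from, column = to):
5 5
0 1 0 0 0
0 0 2 2 1
0 0 0 1 0
1 0 0 0 0
1 0 1 0 0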
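The step from the list of links to the matrix of link counts can be sketched as follows, using the S&W Tiny data above. AdjacencySketch and counts are hypothetical names, and the links are hard-coded here rather than read from a file.

```java
class AdjacencySketch {
    // counts[i][j] = number of links from page i to page j.
    static int[][] counts(int n, int[][] links) {
        int[][] c = new int[n][n];
        for (int[] link : links) c[link[0]][link[1]]++;
        return c;
    }

    // The ten links of the S&W Tiny example, as {from, to} pairs.
    static final int[][] TINY = {
        {0, 1}, {1, 2}, {1, 2}, {1, 3}, {1, 3},
        {1, 4}, {2, 3}, {3, 0}, {4, 0}, {4, 2}
    };

    public static void main(String[] args) {
        for (int[] row : counts(5, TINY))
            System.out.println(java.util.Arrays.toString(row));
    }
}
```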

Wiki2 Hypertext

Wiki2 Graph

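Wiki2: Adj List & Adj Matrix
Adjacency list (first the number of pages, then "from to" pairs):
7
0 1 0 2 0 3 0 4 0 6
1 0
2 0 2 1
3 1 3 2 3 4
4 0 4 2 4 3 4 5
5 0 5 4
6 4
Adjacency matrix (first the dimensions, then link counts; row = from, column = to):
7 7
0 1 1 1 1 0 1
1 0 0 0 0 0 0
1 1 0 0 0 0 0
0 1 1 0 1 0 0
1 0 1 1 0 1 0
1 0 0 0 1 0 0
0 0 0 0 1 0 0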

Wiki1 Hypertext

Wiki1 Graph

Java Exercise Modify Adajency1.java to: print the adjacency matrix; print the probability matrix; print the probability matrix with the 90-10 rule.
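A hedged sketch of the last step and of the random surfer it feeds, not the contents of Adajency1.java. It assumes the 90-10 rule means that 90% of the time the surfer follows a link chosen uniformly from the current page and 10% of the time jumps to a uniformly random page. The treatment of a page with no links, the starting page, and all names are assumptions of this sketch.

```java
import java.util.Random;

class SurferSketch {
    // Link counts for the S&W Tiny example (row = from, column = to).
    static final int[][] TINY = {
        {0, 1, 0, 0, 0},
        {0, 0, 2, 2, 1},
        {0, 0, 0, 1, 0},
        {1, 0, 0, 0, 0},
        {1, 0, 1, 0, 0}
    };

    // p[i][j] = probability that a surfer on page i moves to page j next.
    static double[][] transition(int[][] counts) {
        int n = counts.length;
        double[][] p = new double[n][n];
        for (int i = 0; i < n; i++) {
            int outDegree = 0;
            for (int j = 0; j < n; j++) outDegree += counts[i][j];
            for (int j = 0; j < n; j++) {
                // Assumption: a page with no links jumps to a uniformly random page.
                p[i][j] = (outDegree == 0) ? 1.0 / n
                        : 0.90 * counts[i][j] / outDegree + 0.10 / n;
            }
        }
        return p;
    }

    // Monte Carlo estimate: fraction of moves that land on each page.
    static double[] surf(double[][] p, int moves, Random rng) {
        int n = p.length;
        int[] visits = new int[n];
        int page = 0;                          // arbitrary starting page
        for (int t = 0; t < moves; t++) {
            double r = rng.nextDouble(), sum = 0.0;
            int next = n - 1;                  // fallback guards against rounding
            for (int j = 0; j < n; j++) {
                sum += p[page][j];
                if (r < sum) { next = j; break; }
            }
            page = next;
            visits[page]++;
        }
        double[] rank = new double[n];
        for (int j = 0; j < n; j++) rank[j] = (double) visits[j] / moves;
        return rank;
    }

    public static void main(String[] args) {
        double[] rank = surf(transition(TINY), 1_000_000, new Random());
        for (int j = 0; j < rank.length; j++)
            System.out.printf("page %d: %.4f%n", j, rank[j]);
    }
}
```

Running more moves tightens the estimate; the exact limiting values could instead be obtained by repeatedly multiplying a rank vector by the probability matrix.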

Interactive WWW Page for PageRank http://williamcotton.com/pagerank-explained-with-javascript

Reachability, Markov Theory Can node 2 reach node 4? Yes, using a path of length 2 through node 3.

Final Challenge Raise the page rank of page “23” by modifying only the links on page “23”. Decrease the page rank of page “23” by modifying only the links on page “23”. Can you find the maximum/minimum page rank?

Search engine optimization, link schemes, link farming, Google bombs

Ted Talks: Brin & Page: The Genesis of Google http://www.ted.com/talks/sergey_brin_and_larry_page_on_google.html