Presentation on theme: "Page Rank". Presentation transcript:
3 Overview: two-dimensional arrays; Monte Carlo algorithms; searching the World Wide Web; big data; page rank. Goal: we will write a program to compute the relevancy of WWW documents based on the static structure of the WWW.
4 Two Dimensional Arrays: significance (a topic on the AP Computer Science A exam); syntax; example of matrix multiplication; arrays of arrays.
5 Significance of Two Dimensional Arrays. Tables; for instance, assignments for each student in a class, quarterly sales for each item in inventory, etc. Matrices and binary relations in mathematics. For example, is there a direct road from city1 in USA to city2 in USA? For our goal in this section, we will need the number of links from doc1 in the WWW to doc2 in the WWW.
6 Syntax: int[][] frequency = new int[M][N]; for M rows and N columns. Elements are accessed as frequency[4][7], not frequency[4,7]. Array indices in Java (as in C, C++, and C#) always begin with 0; in other words, the element with index 1 is the second element of the array.
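The syntax above, together with the matrix multiplication example named in the outline, might be sketched as follows. The array name frequency comes from the slide; the 10-by-10 size, the class name MatrixDemo, and the method name multiply are assumptions made only for illustration.

```java
import java.util.Arrays;

// Illustrative sketch; the class name, method name, and array sizes are assumptions.
class MatrixDemo {
    // Multiplies an (n x m) matrix a by an (m x p) matrix b, giving an (n x p) matrix.
    static int[][] multiply(int[][] a, int[][] b) {
        int rows = a.length;
        int inner = b.length;
        int cols = b[0].length;
        int[][] c = new int[rows][cols];   // elements start at 0
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                for (int k = 0; k < inner; k++) {
                    c[i][j] += a[i][k] * b[k][j];
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        int[][] frequency = new int[10][10]; // assumed size: 10 rows, 10 columns
        frequency[4][7] = 3;                 // row index 4, column index 7
        System.out.println(frequency[4][7]);

        int[][] a = {{1, 2}, {3, 4}};
        int[][] b = {{5, 6}, {7, 8}};
        System.out.println(Arrays.deepToString(multiply(a, b)));
    }
}
```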
9 Arrays of Arrays. Logically: arrays of arrays in the tradition of C and C++. Very simple. Unfortunately: introduces pointers, memory allocation, etc. Very complicated.
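One way to see that a Java two-dimensional array really is an array of arrays is to allocate the rows separately, so that they may even have different lengths. The following is a small sketch; the class name RaggedArrayDemo and the triangular shape are assumptions chosen for illustration.

```java
// Illustrative sketch; names and the triangular shape are assumptions.
class RaggedArrayDemo {
    // Builds a triangular array in which row i has i + 1 elements.
    static int[][] triangle(int rows) {
        int[][] t = new int[rows][];   // only the outer array of row references exists so far
        for (int i = 0; i < rows; i++) {
            t[i] = new int[i + 1];     // each row is its own separately allocated array
        }
        return t;
    }

    public static void main(String[] args) {
        int[][] t = triangle(4);
        for (int i = 0; i < t.length; i++) {
            System.out.println("row " + i + " has length " + t[i].length);
        }
    }
}
```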
10 Monte Carlo Methods. Introduction. The example of a Monte Carlo estimate for Pi (Java exercise). Fair shuffling (Java exercise). Random walk (important in financial analysis). Used in path tracing to create realistic images. Percolation, an example of the power of a Monte Carlo algorithm. Goal: we will write a Monte Carlo algorithm to estimate the relevancy of WWW documents based on the static structure of the WWW.
11 Monte Carlo Casino. The name refers to the grand casino at Monte Carlo in the Principality of Monaco, which is well known around the world as an icon of gambling.
12 Monte Carlo estimate for Pi (Java exercise). Since we know the value of pi, it is not really necessary to invent an algorithm to estimate its value.
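A common way to do this exercise is to throw random points into the unit square and count how many land inside the quarter circle of radius 1; that fraction approaches pi/4. The sketch below assumes this dartboard approach, which the slide does not spell out. The class name PiEstimate, the method name estimatePi, and the seed parameter are hypothetical.

```java
import java.util.Random;

// Illustrative sketch of the usual dartboard method; names are assumptions.
class PiEstimate {
    // Estimates pi from `trials` random points in the unit square.
    static double estimatePi(int trials, long seed) {
        Random rng = new Random(seed);   // fixed seed makes runs repeatable
        int inside = 0;
        for (int i = 0; i < trials; i++) {
            double x = rng.nextDouble();
            double y = rng.nextDouble();
            if (x * x + y * y <= 1.0) {  // point lies inside the quarter circle
                inside++;
            }
        }
        return 4.0 * inside / trials;    // quarter-circle area is pi/4
    }

    public static void main(String[] args) {
        System.out.println(estimatePi(1_000_000, 42L));
    }
}
```

More trials give a better estimate, but the error shrinks only in proportion to the square root of the number of trials.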
13 Fair shuffling (Java exercise). How would you test an algorithm for shuffling, say, cards? In particular, how would you know whether all of the many possible results are equally likely? The main program is provided (nothing to write); it requires the method to shuffle. The supplied code contains two methods of shuffling cards. Run the experiment with multiple trials and convince yourself that both methods are fair.
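The experiment described above can be sketched by shuffling a tiny deck many times and counting how often each ordering appears. This is not the exercise's own code: the Fisher-Yates shuffle, the 3-card deck, and the names ShuffleTest, shuffle, and countOrders are assumptions for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Illustrative sketch; the shuffle method and all names are assumptions.
class ShuffleTest {
    // Fisher-Yates shuffle: swaps position i with a uniformly chosen position in 0..i.
    static void shuffle(int[] a, Random rng) {
        for (int i = a.length - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1);
            int tmp = a[i];
            a[i] = a[j];
            a[j] = tmp;
        }
    }

    // Shuffles a fresh 3-card deck `trials` times and counts each resulting order.
    static Map<String, Integer> countOrders(int trials, long seed) {
        Random rng = new Random(seed);
        Map<String, Integer> counts = new HashMap<>();
        for (int t = 0; t < trials; t++) {
            int[] deck = {0, 1, 2};
            shuffle(deck, rng);
            String key = "" + deck[0] + deck[1] + deck[2];
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // With a fair shuffle, each of the 3! = 6 orders should appear about equally often.
        System.out.println(countOrders(60_000, 7L));
    }
}
```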
14 Percolation Theory. Percolation: pour liquid on top of some porous material. Will the liquid reach the bottom? Many applications in chemistry, materials science, etc.: spread of forest fires; natural gas through semi-porous rock; flow of electricity through a network of resistors; permeation of gas in a coal mine through a gas mask filter.
15 Percolation Theory. Given an N-by-N system where each site is vacant with probability p, what is the probability that the system percolates? Remark: famous open question in statistical physics. No known mathematical solution. Computational thinking creates new science. Recourse: take a computational approach, Monte Carlo simulation. The simulation uses a recursive depth-first search (DFS) algorithm, but that diverges from the present topic. (Recursion is a topic on the AP Computer Science A exam.) Figure panels: p = 0.3 (does not percolate); p = 0.4 (does not percolate); p = 0.5 (does not percolate); p = 0.6 (percolates); p = 0.7 (percolates).
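A Monte Carlo estimate of the percolation probability might look like the sketch below. It assumes that liquid moves between vacant sites that share an edge (up, down, left, right) and that the system percolates when some bottom-row site is reachable from the top row. The class name PercolationEstimate and its method names are hypothetical.

```java
import java.util.Random;

// Illustrative sketch; the neighbor rule and all names are assumptions.
class PercolationEstimate {
    // Recursive depth-first search: marks every vacant site reachable from (i, j).
    static void flow(boolean[][] vacant, boolean[][] full, int i, int j) {
        int n = vacant.length;
        if (i < 0 || i >= n || j < 0 || j >= n) return;  // off the grid
        if (!vacant[i][j] || full[i][j]) return;          // blocked or already visited
        full[i][j] = true;
        flow(vacant, full, i + 1, j);
        flow(vacant, full, i, j + 1);
        flow(vacant, full, i, j - 1);
        flow(vacant, full, i - 1, j);
    }

    // True if liquid poured on the top row reaches the bottom row.
    static boolean percolates(boolean[][] vacant) {
        int n = vacant.length;
        boolean[][] full = new boolean[n][n];
        for (int j = 0; j < n; j++) flow(vacant, full, 0, j);
        for (int j = 0; j < n; j++) if (full[n - 1][j]) return true;
        return false;
    }

    // Fraction of random n-by-n systems (site vacancy probability p) that percolate.
    static double estimate(int n, double p, int trials, long seed) {
        Random rng = new Random(seed);
        int count = 0;
        for (int t = 0; t < trials; t++) {
            boolean[][] vacant = new boolean[n][n];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    vacant[i][j] = rng.nextDouble() < p;
            if (percolates(vacant)) count++;
        }
        return (double) count / trials;
    }

    public static void main(String[] args) {
        for (double p = 0.3; p < 0.85; p += 0.1) {
            System.out.println("p = " + p + ": " + estimate(20, p, 1000, 1L));
        }
    }
}
```

The recursion depth can grow with the number of sites, so this sketch is meant for small grids only.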
16 We will examine a Monte Carlo algorithm for estimating the relevancy of WWW documents.
17 Random Walk. Page rank can be computed much like a random walk. See the Java applets for the one-dimensional and two-dimensional random walks.
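For readers without the applets, a one-dimensional random walk can be sketched in a few lines. The walker starts at 0 and moves one unit left or right with equal probability; a well-known property is that the mean squared distance from the start after n steps is n. The class name RandomWalk and its method names are assumptions.

```java
import java.util.Random;

// Illustrative sketch of a one-dimensional random walk; names are assumptions.
class RandomWalk {
    // Final position after `steps` unit steps, each left or right with probability 1/2.
    static int walk(int steps, Random rng) {
        int position = 0;
        for (int s = 0; s < steps; s++) {
            position += rng.nextBoolean() ? 1 : -1;
        }
        return position;
    }

    // Average of the squared final position over many independent walks.
    static double meanSquaredDistance(int steps, int trials, long seed) {
        Random rng = new Random(seed);
        double total = 0.0;
        for (int t = 0; t < trials; t++) {
            long d = walk(steps, rng);
            total += d * d;
        }
        return total / trials;
    }

    public static void main(String[] args) {
        System.out.println(meanSquaredDistance(100, 20_000, 1L));
    }
}
```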
18 Searching the World Wide Web: history of search engines; hypertext; crawling the World Wide Web; indexing.
19 History of Search Engines: History of Search, by Larry Kim of WordStream.
20 Markup and Hypertext. Documents served up through the WWW are generally "marked up" for presentation in a structured standard called Hypertext Markup Language (HTML). The most important feature of HTML is the referencing (via URLs) of other WWW documents, which enables easy, non-sequential, and varied paths of reading the documents.
22 WWW Spiders. Google and others continually crawl around the WWW, recording what they see to enable searching.
23 44% of hits and 35% of bandwidth are attributable to bots (and other odd things): July 2013 (up to 9:30 am 26 Jul 2013) on the WWW server cs.fit.edu. Russian search engine.
24 Indexing. Finding a relevant document in a vast ocean of linked HTML documents requires a very large index. An index is a (sorted) list of keywords (terms), each with the list of values (URLs) of the documents that contain it.
25 An example index of WWW documents:
Bourgeois: .../manifesto.txt
Hero: .../lilwomen.txt, .../muchado.txt, .../war+peace.txt
His: .../manifesto.txt, .../lilwomen.txt, .../mobydick.txt, .../muchado.txt, .../war+peace.txt
Treachery: .../war+peace.txt
Whale: .../mobydick.txt
Yellowish: .../lilwomen.txt, .../war+peace.txt
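An index of this shape can be sketched as a sorted map from each term to a sorted set of documents. The class name InvertedIndex and its methods are hypothetical; the terms and file names come from the slide, with the elided path prefixes left out.

```java
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative sketch of a keyword index; class and method names are assumptions.
class InvertedIndex {
    // TreeMap keeps the terms sorted; TreeSet keeps each document list sorted and duplicate-free.
    private final TreeMap<String, TreeSet<String>> index = new TreeMap<>();

    // Records that `term` occurs in the document named `url`.
    void add(String term, String url) {
        index.computeIfAbsent(term, k -> new TreeSet<>()).add(url);
    }

    // Returns the documents containing `term`, or an empty set if the term is not indexed.
    Set<String> lookup(String term) {
        return index.getOrDefault(term, new TreeSet<>());
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.add("Whale", "mobydick.txt");
        idx.add("Hero", "war+peace.txt");
        idx.add("Hero", "lilwomen.txt");
        idx.add("Hero", "muchado.txt");
        System.out.println(idx.lookup("Hero"));
        System.out.println(idx.lookup("Treachery"));
    }
}
```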
26 Several Issues. Pick out the words from the mark-up. What's a word? 2nd, abc's, CSTA. Normalize: lowercase, stemming. Some words are not worth indexing: "the", "a", etc. A so-called stop list, e.g., the words ignored in Wikipedia search. Java exercise: first, some preliminary remarks before doing the exercise.
27 Searching and Sorting. Problem: determine if the word is in the stop list. What is the best approach? Searching: linear search, binary search. (These are topics on the AP Computer Science A exam.) Binary search requires the data (the index, for example) to be sorted. Sorting: selection sort, insertion sort, merge sort, quick sort; external sorting. (The first three of these sorts are topics on the AP Computer Science A exam.)
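The stop-list problem can be sketched with both search methods side by side. The words "the" and "a" come from the slides; the other stop words, the class name StopList, and the method names are assumptions for illustration. The array is written in sorted order because binary search depends on it.

```java
import java.util.Locale;

// Illustrative sketch; the stop words beyond "the" and "a" and all names are assumptions.
class StopList {
    // Must stay in sorted (lexicographic) order for binary search to work.
    static final String[] STOP_WORDS = {"a", "an", "and", "of", "the", "to"};

    // Linear search: examines the words one by one from the front.
    static boolean linearSearch(String[] words, String key) {
        for (String w : words) {
            if (w.equals(key)) return true;
        }
        return false;
    }

    // Binary search: halves the remaining range after each comparison.
    static boolean binarySearch(String[] sorted, String key) {
        int lo = 0;
        int hi = sorted.length - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            int cmp = key.compareTo(sorted[mid]);
            if (cmp == 0) return true;
            if (cmp < 0) hi = mid - 1;
            else lo = mid + 1;
        }
        return false;
    }

    // Normalizes to lowercase before searching, as suggested for indexing.
    static boolean isStopWord(String word) {
        return binarySearch(STOP_WORDS, word.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(isStopWord("The"));
        System.out.println(isStopWord("whale"));
    }
}
```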
28 Linear versus Binary search. Suppose each comparison takes one millisecond (0.001 seconds).
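Under that one-millisecond assumption, the worst-case running times can be worked out directly: linear search may need n comparisons, while binary search needs about log2(n) + 1. The sketch below does the arithmetic; the class name SearchCost, the method names, and the sample sizes are assumptions.

```java
// Illustrative arithmetic sketch; names and sample sizes are assumptions.
class SearchCost {
    // Worst case for linear search: every one of the n items is compared.
    static long linearWorstCase(long n) {
        return n;
    }

    // Worst case for binary search: the number of times n can be halved before reaching 0.
    static long binaryWorstCase(long n) {
        long comparisons = 0;
        while (n > 0) {
            comparisons++;
            n /= 2;
        }
        return comparisons;
    }

    public static void main(String[] args) {
        double secondsPerComparison = 0.001; // one millisecond, as on the slide
        long[] sizes = {1_000L, 1_000_000L, 1_000_000_000L};
        for (long n : sizes) {
            System.out.println(n + " items: linear "
                + linearWorstCase(n) * secondsPerComparison + " s, binary "
                + binaryWorstCase(n) * secondsPerComparison + " s");
        }
    }
}
```

For a million items, the worst case is 1,000 seconds for linear search against 20 milliseconds for binary search.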
31 Obama at Google: https://www.youtube.com/watch?v=k4RRi_ntQc8
32 Sorting Demo: http://cs.fit.edu/~ryan/cse1002/sort.html (see also sorting illustrated by Algo-rythmics, rythmics.ms.sapientia.ro, and folk dancers)
33 Now do the exercise (Java exercise). PS. How do students really program? Observe the tool tip!
34 OK, we have a keyword index. It is likely we still have a "gazillion" documents for most of the terms. (See Googlewhack and Googlewhackblatt: two-word and one-word search terms, respectively, that return a single document.) How do we find the most relevant pages?
36 The problem with Big Data. Consider a popular website which wants to keep track of statistics on the queries used to search the site. One could keep track of the full log of queries, and answer exactly the frequency of any search query at the site. However, the log can quickly become very large. This problem is an instance of the count tracking problem. Even known sophisticated solutions for fast querying, such as a tree structure or hash table to count up the multiple occurrences of the same query, can prove to be slow and wasteful of resources. Notice that in this scenario, we can tolerate a little imprecision. In general, we are interested only in the queries that are asked frequently. So it is acceptable if there is some fuzziness in the counts. Thus, we can trade off some precision in the answers for a more efficient and lightweight solution. This tradeoff is at the heart of sketches. (Cormode and Muthukrishnan, 2011)
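One well-known sketch for the count tracking problem is the Count-Min sketch. The version below is a minimal illustration, not a tuned implementation: the class name, the table dimensions, and the simple hash family (a*x + b mod a prime) are assumptions. Its estimates never fall below the true count, but collisions can push them above it.

```java
import java.util.Random;

// Illustrative Count-Min sketch; sizes, hash family, and names are assumptions.
class CountMinSketch {
    private static final long PRIME = 2147483647L; // 2^31 - 1
    private final int depth;
    private final int width;
    private final long[][] counts;
    private final long[] a;
    private final long[] b;

    CountMinSketch(int depth, int width, long seed) {
        this.depth = depth;
        this.width = width;
        this.counts = new long[depth][width];
        this.a = new long[depth];
        this.b = new long[depth];
        Random rng = new Random(seed);
        for (int i = 0; i < depth; i++) {
            a[i] = 1 + rng.nextInt(Integer.MAX_VALUE - 1); // in 1..PRIME-1
            b[i] = rng.nextInt(Integer.MAX_VALUE);          // in 0..PRIME-1
        }
    }

    // Column used for `item` in the given row.
    private int bucket(int row, String item) {
        long x = item.hashCode() & 0x7fffffffL;             // non-negative 31-bit value
        return (int) (((a[row] * x + b[row]) % PRIME) % width);
    }

    // Counts one more occurrence of `item` by bumping one counter per row.
    void add(String item) {
        for (int row = 0; row < depth; row++) {
            counts[row][bucket(row, item)]++;
        }
    }

    // The smallest of the item's counters: never below the true count.
    long estimate(String item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, counts[row][bucket(row, item)]);
        }
        return min;
    }

    public static void main(String[] args) {
        CountMinSketch sketch = new CountMinSketch(4, 1024, 1L);
        for (int i = 0; i < 5; i++) sketch.add("page rank");
        sketch.add("monte carlo");
        System.out.println(sketch.estimate("page rank"));
    }
}
```

The memory used is fixed at depth times width counters, no matter how many distinct queries arrive, which is the tradeoff the quotation describes.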
39 Page Rank Gave Google a Competitive Advantage. Not based on the WWW surfer as voter (popularity), but on the WWW author as voter (hence relatively static). A random surfer mindlessly follows the hyperlinks of the WWW authors. Markov chains.
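The random surfer can be simulated with the two-dimensional array of link counts introduced earlier: at each move the surfer follows one of the current page's links, chosen at random, and the fraction of visits to each page estimates its rank. The sketch below adds one assumption the slides do not state: with a small probability (the jump parameter), or whenever a page has no outgoing links, the surfer jumps to a page chosen uniformly at random. The class name RandomSurfer and the method estimateRanks are hypothetical.

```java
import java.util.Random;

// Illustrative Monte Carlo sketch; the random-jump rule and all names are assumptions.
class RandomSurfer {
    // links[i][j] is the number of links from page i to page j.
    static double[] estimateRanks(int[][] links, int moves, double jump, long seed) {
        int n = links.length;
        Random rng = new Random(seed);
        int[] visits = new int[n];
        int page = 0;
        for (int m = 0; m < moves; m++) {
            int outDegree = 0;
            for (int j = 0; j < n; j++) outDegree += links[page][j];
            if (outDegree == 0 || rng.nextDouble() < jump) {
                page = rng.nextInt(n);            // assumed rule: jump to a random page
            } else {
                int r = rng.nextInt(outDegree);   // pick one outgoing link uniformly
                int j = 0;
                while (r >= links[page][j]) {
                    r -= links[page][j];
                    j++;
                }
                page = j;
            }
            visits[page]++;
        }
        double[] ranks = new double[n];
        for (int i = 0; i < n; i++) ranks[i] = (double) visits[i] / moves;
        return ranks;
    }

    public static void main(String[] args) {
        // Three pages: 0 links to 2, 1 links to 2, 2 links to 0.
        int[][] links = {{0, 0, 1}, {0, 0, 1}, {1, 0, 0}};
        double[] ranks = estimateRanks(links, 100_000, 0.10, 1L);
        for (int i = 0; i < ranks.length; i++) {
            System.out.println("page " + i + ": " + ranks[i]);
        }
    }
}
```

In this small example no page links to page 1, so it is reached only by random jumps and ends up with the lowest rank.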
57 Final Challenge. Raise the page rank of page "23" by modifying only the links on page "23". Decrease the page rank of page "23" by modifying only the links on page "23". Can you find the maximum/minimum page rank?
58 Search engine optimization, link schemes, link farming, Google bombs.
59 TED Talks: Brin & Page: The Genesis of Google