Structure
– Google “theory”, see the essay by Brin and Page, http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm
– Google query language, from Calishain and Dornfest
– Next week: Google for special interests

web information retrieval
We can think of the web as a pile of documents called pages.
Some "pages" are hard to index
– PDF documents
– Pictures
– Sound files
But a majority of pages are written in HTML
– easy to index
– have a loose structure

Google uses the structure of HTML
Google finds the title of the page, i.e. the contents of the <title> element.
Google analyzes headings and large font sizes and gives priority weight to terms found there.
Most importantly, Google uses the link structure of the web to find important pages.

classic IR and the web
In classic information retrieval, every document has the same importance. Documents differ only in their relevance to a query.
In classic information retrieval, a document d is relevant if the query terms appear relatively more frequently in d than in other documents.
If a web page contains only the words "Bill Clinton sucks" and a picture, it is not a useful hit for "Bill Clinton", even though the query terms are frequent in it.

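The classic notion of relevance above can be illustrated with a small TF-IDF scorer. This is a minimal sketch, not the scoring of any particular search engine: the function name `tf_idf_scores`, the whitespace tokenizer, and the smoothed inverse document frequency are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf_scores(query, documents):
    """Score each document against the query with a simple TF-IDF sum.

    Assumptions (not from the slides): whitespace tokenization,
    lowercasing, and a smoothed IDF of log((1 + N) / (1 + df)) + 1.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    scores = []
    for words in tokenized:
        counts = Counter(words)
        score = 0.0
        for term in query.lower().split():
            # Number of documents that contain the term at all.
            df = sum(1 for other in tokenized if term in other)
            if df == 0 or not words:
                continue
            tf = counts[term] / len(words)  # relative frequency in this document
            idf = math.log((1 + n_docs) / (1 + df)) + 1
            score += tf * idf
        scores.append(score)
    return scores


docs = [
    "bill clinton biography of bill clinton",
    "the weather is nice today",
    "clinton visited the town",
]
scores = tf_idf_scores("bill clinton", docs)
```

Note that every document is treated as equally important here: a very short page that merely repeats the query terms scores well, which is the weakness the slide points at.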
Google finds important pages
The idea is that the documents on the web have different degrees of "importance".
Google will show the most important pages first.
The reasoning is that more important pages are likely to be more relevant to any query than unimportant pages.

Google's monkey
Imagine that the web has P pages. Each page has its own address (URL).
Imagine a monkey who sits at a terminal. He follows links at random, but on rare occasions he gets bored and types in an address of a random page out of those P.
Will the monkey visit all pages with equal probability?

PageRank
The Google page rank of a page is the probability that Google's monkey will visit the page.
– The monkey will come frequently to pages that have a lot of links to them.
– Once he is there, he will likely go on to a page that the important page links to.
The structure of all the links on the entire web reveals the importance of the page.

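The monkey's visiting probabilities can be approximated by repeatedly redistributing probability along the links. The following is a textbook power-iteration sketch under stated assumptions, not Google's implementation: the function name `pagerank`, the parameter `boredom` (the chance that the monkey types a random address, set to 0.15 purely for illustration), the fixed iteration count, and the treatment of pages without outgoing links are all hypothetical choices.

```python
def pagerank(links, boredom=0.15, iterations=50):
    """Approximate the probability that the random monkey is on each page.

    links maps each page to the list of pages it links to.
    Assumption: a page with no outgoing links sends the monkey
    to a random page, exactly as if he had become bored.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal probability
    for _ in range(iterations):
        # Every page receives the bored monkey with equal probability.
        new_rank = {page: boredom / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # The monkey follows one of the outgoing links at random.
                share = (1 - boredom) * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                share = (1 - boredom) * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Tiny example web: three pages link to "c", and "c" links to "a".
web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
```

In this toy web, "c" collects the most probability because three pages link to it, and "a" outranks "b" and "d" because its single incoming link comes from the important page "c".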
many PageRanks
There is an infinite number of ways to calculate the page rank, depending on
– how likely the monkey is to get bored
– the probability with which the bored monkey picks each page
Potentially, there is a page rank for each user of the web.
Google tries to observe users and may be assigning personal page ranks to them.

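To see how the two choices above change the result, the sketch below lets the bored monkey pick pages from a non-uniform distribution. It is an illustration only: the function name `personalized_pagerank`, the `teleport` dictionary, and the `boredom` value are assumptions, and nothing here is claimed to be how Google computes personal page ranks.

```python
def personalized_pagerank(links, teleport, boredom=0.15, iterations=100):
    """Page rank where the bored monkey picks pages according to teleport.

    teleport maps every page to the probability that the bored monkey
    types its address; the values are assumed to sum to 1 and the keys
    are assumed to cover every page mentioned in links.
    """
    pages = sorted(teleport)
    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {page: 0.0 for page in pages}
        for page in pages:
            targets = links.get(page, [])
            follow = (1 - boredom) * rank[page]  # mass that follows a link
            jump = boredom * rank[page]          # mass that types an address
            if targets:
                for target in targets:
                    new_rank[target] += follow / len(targets)
            else:
                jump += follow  # no links to follow, so the monkey jumps
            for other in pages:
                new_rank[other] += jump * teleport[other]
        rank = new_rank
    return rank


web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
uniform = {page: 0.25 for page in "abcd"}
fan_of_b = {"a": 0.0, "b": 1.0, "c": 0.0, "d": 0.0}  # hypothetical user profile

everyone = personalized_pagerank(web, uniform)
personal = personalized_pagerank(web, fan_of_b)
```

With the hypothetical `fan_of_b` profile, page "b" gains rank and page "d" loses it, although the link structure is unchanged.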
interfaces
The simple interface has command-driven features that make it more advanced than the advanced interface.
The advanced interface is a form interface to the query language available on the simple interface.
There are extensive language settings
– preferences for finding pages in a certain language
– preferences for the language of the interface

query language I
– default Boolean AND between terms
– case insensitive
– terms can be ORed with "OR" or "|"
– terms that must appear next to each other have to be put in double quotes
– Boolean NOT can be expressed with -
– Example: "krichel -thomas"

query language II
– * is a wildcard for any word
– +stopword requires the presence of the stop word "stopword", but the list of stop words has not been published
– there is a limit of 10 words, but a * does not count towards the limit

query treatment
Google prefers pages that have the search terms
– in close proximity
– in the same order as in the query
Repeating a query term once adds weight to it; repeating it twice has no further effect.

special syntax I
– intitle: find in title only, "intitle:google"
– intext: find in text only, "intext:html"
– inanchor: find in link text, "inanchor:Palmer"
– link: pages that link to a page, "link:openlib.org"
– cache: pages that are in the Google cache, useful if the query result has nothing to do with the query terms
– filetype: file suffix, "filetype:ppt"
– related: pages related to a page, "related:liu.edu"
– info: information about a page

site: and inurl: special syntax
inurl: find in URL only, "inurl:help"
– can use the * as a wildcard, as in inurl:"*.openlib.org"
site: domain of page, "site:liu.edu"
– breaks down if a path is included
– cannot be used on its own, only with other query expressions

daterange: special syntax
daterange: limits the search to pages indexed within a range of dates.
When the crawler visits a page, changed pages are reindexed; unchanged pages are not.
Dates are expressed as Julian day numbers, i.e. the number of days since noon UTC on 1 January 4713 BC in the Julian calendar.
Today is 2452739.
Example: daterange:2452640-2452739

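Julian day numbers are awkward to work out by hand, so a small conversion helper may be useful. This is a minimal sketch: the function names `julian_day_number` and `daterange` are made up for illustration, and the constant 1721425 is the offset between Python's `date.toordinal()` (proleptic Gregorian calendar) and the Julian day number.

```python
from datetime import date

# Offset between date.toordinal() and the Julian day number.
ORDINAL_TO_JDN = 1721425


def julian_day_number(day):
    """Return the Julian day number of a calendar date."""
    return day.toordinal() + ORDINAL_TO_JDN


def daterange(start, end):
    """Build a daterange: expression covering start to end (hypothetical helper)."""
    return f"daterange:{julian_day_number(start)}-{julian_day_number(end)}"


today = julian_day_number(date(2003, 4, 9))
expression = daterange(date(2002, 12, 31), date(2003, 4, 9))
```

The "today" value on the slide, 2452739, corresponds to 9 April 2003, and the example range starts on 31 December 2002.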
mixing special syntax expressions
The link: syntax does not mix with others.
Other bad ideas:
– "site:openlib.org -inurl:openlib"
– "site:edu site:com"
Things that work well:
– intitle:search
– intitle:biology inurl:help

Examples
– George Bush site:nytimes.com
– "Copyright * The New York Times" "George Bush"
– intitle:"directory * * trees"
– Botany intitle:"directory of" site:edu
– "powered by blogger" or site:blogspot.com
– "classical music" (inurl:mailman | inurl:listserv)

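Queries like the examples above can also be assembled programmatically. The sketch below is an assumption-laden illustration: `build_query` and `search_url` are hypothetical helper names, and the only external fact relied on is that the search page accepts the query in the `q` parameter of `https://www.google.com/search`.

```python
from urllib.parse import urlencode


def build_query(terms=(), phrase=None, exclude=(), **operators):
    """Assemble a query string from plain terms and special syntax.

    operators are keyword arguments such as site="edu" or
    intitle="directory of"; values containing a space are quoted.
    """
    parts = list(terms)
    if phrase:
        parts.append(f'"{phrase}"')
    # Boolean NOT is a minus sign directly in front of the term.
    parts.extend(f"-{word}" for word in exclude)
    for name, value in operators.items():
        if " " in value:
            value = f'"{value}"'
        parts.append(f"{name}:{value}")  # no space after the colon
    return " ".join(parts)


def search_url(query):
    """Return a search URL with the query percent-encoded."""
    return "https://www.google.com/search?" + urlencode({"q": query})


q1 = build_query(["George", "Bush"], site="nytimes.com")
q2 = build_query(["Botany"], intitle="directory of", site="edu")
url = search_url(build_query(["krichel"], exclude=["thomas"]))
```

Here `q1` and `q2` reproduce the first and fourth example queries, and `url` encodes the "krichel -thomas" query from the query language slide.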
http://openlib.org/home/krichel Thank you for your attention!