
1 The College of Saint Rose CIS 521 / MBA 541 – Introduction to Internet Development David Goldschmidt, Ph.D. selected material from Search Engines: Information Retrieval in Practice, 1st edition by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0

2  What is search?  What are we searching for?  How many searches are processed per day?  What is the average number of words in text-based searches?

3  Applications and varieties of search:  Web search  Site search  Vertical search  Enterprise search  Desktop search  Peer-to-peer search

4  Where do we search next?  How do we acquire new documents?

5  How do we best convert documents to their index terms?  How do we make acquired documents searchable?

6

7  Relevance  Search results contain information the searcher was looking for  Problems with vocabulary mismatch ▪ Homonyms (e.g. “Jersey shore”)  User relevance  Search results relevant to one user may be completely irrelevant to another user

8  Precision  Proportion of retrieved documents that are relevant  How precise were the results?  Recall (and coverage)  Proportion of relevant documents that were actually retrieved  Did we retrieve all of the relevant documents? http://trec.nist.gov
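As a quick illustration (not from the slides), precision and recall can be computed directly from the sets of retrieved and relevant documents; the document IDs below are made up.

```python
# Minimal sketch: precision and recall over sets of document IDs.
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant                      # relevant documents we actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 3 of the 4 retrieved documents are relevant,
# but 2 relevant documents were missed.
print(precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d3", "d5", "d6"]))
# -> (0.75, 0.6)
```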

9  Timeliness and freshness  Search results contain information that is current and up-to-date  Performance  Users expect subsecond response times  Media  Users increasingly use cellphones, mobile devices

10  Scalability  Designs that work must continue to perform well as the system grows and expands ▪ Increased number of documents, number of users, etc.  Flexibility (or adaptability)  Tune search engine components to keep up with a changing landscape  Spam-resistance

11  Gerard Salton (1927-1995)  Pioneer in information retrieval  Defined information retrieval as:  “a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information”  This was 1968 (before the Internet and Web!)

12  Structured information:  Often stored in a database  Organized via predefined tables, columns, etc.  Select all accounts with balances less than $200  Unstructured information  Document text (headings, words, phrases)  Images, audio, video (often relies on textual tags)

account number    balance
7004533711        $498.19
7004533712        $781.05
7004533713        $147.15
7004533714        $195.75
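The “balances less than $200” query above maps naturally onto SQL. A minimal sketch using Python’s built-in sqlite3 module, with the account data from the table above loaded into an in-memory database (the table and column names are my own choices):

```python
import sqlite3

# In-memory database with a small accounts table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (account_number TEXT, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [
    ("7004533711", 498.19),
    ("7004533712", 781.05),
    ("7004533713", 147.15),
    ("7004533714", 195.75),
])

# Structured query: all accounts with balances less than $200.
for row in conn.execute("SELECT * FROM accounts WHERE balance < 200"):
    print(row)   # ('7004533713', 147.15) and ('7004533714', 195.75)
```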

13  Search and IR have largely focused on text processing and documents  Search typically uses the statistical properties of text  Word counts  Word frequencies  But ignores linguistic features (noun, verb, etc.)

14  Image search currently relies on textual tags  Therefore just another form of text-based search  Example tags for a photo: Edie; little girl; kid’s laptop; bare foot; drink; sippy cup

15  A URL identifies a resource on the Web, and consists of:  A scheme or protocol (e.g. http, https)  A hostname (e.g. academic2.strose.edu)  A resource (e.g. /math_and_science/goldschd)  e.g. http://cs.strose.edu/courses-ug.html
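Python’s standard urllib.parse module splits a URL into exactly these pieces; a minimal sketch using the example URL from the slide:

```python
from urllib.parse import urlparse

url = "http://cs.strose.edu/courses-ug.html"
parts = urlparse(url)
print(parts.scheme)   # 'http'              (scheme or protocol)
print(parts.netloc)   # 'cs.strose.edu'     (hostname)
print(parts.path)     # '/courses-ug.html'  (resource)
```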

16  When a client requests a Web page, the client uses either a GET or POST request  (followed by the protocol and a blank line)  We hopefully receive a 200 response

GET / HTTP/1.0
GET /subdir/stuff.html HTTP/1.0
GET /images/icon.png HTTP/1.0
GET /docs/paper.pdf HTTP/1.0

HTTP/1.1 200 OK
Date: current date and time
Last-Modified: last modified date and time
etc.
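A minimal sketch of issuing one of these GET requests from Python with the standard http.client module (example.com is a placeholder host, and http.client speaks HTTP/1.1 rather than the HTTP/1.0 shown above):

```python
import http.client

# Open a connection and send a bare GET request, as a crawler or browser would.
conn = http.client.HTTPConnection("example.com", 80, timeout=10)
conn.request("GET", "/")                    # GET / HTTP/1.1
resp = conn.getresponse()

print(resp.status, resp.reason)             # hopefully: 200 OK
print(resp.getheader("Date"))               # current date and time
print(resp.getheader("Last-Modified"))      # last modified date and time (may be None)
conn.close()
```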

17  Web crawlers adhere to a politeness policy:  GET requests are sent every few seconds or minutes  A robots.txt file specifies what crawlers are allowed to crawl (see the sketch below)
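Python’s urllib.robotparser can read a site’s robots.txt and answer whether a given URL may be crawled; a minimal sketch (the host, crawler name, and page URL are placeholders):

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt (placeholder host).
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Ask whether our crawler may fetch a particular page.
print(rp.can_fetch("MyCrawler", "https://example.com/private/report.html"))
print(rp.crawl_delay("MyCrawler"))   # politeness hint, if the site specifies one
```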

18  In a sitemap, the default priority of a URL is 0.5  Sitemaps can expose URLs that might not otherwise be discovered by the crawler

19  How do we best convert documents to their index terms?  How do we make acquired documents searchable?

20  The simplest approach is find, which requires no text transformation  Useful in user applications, but not in search (why?)  An optional transformation handled during the find operation: case sensitivity

21  English documents are predictable:  Top two most frequently occurring words are “the” and “of” (10% of word occurrences)  Top six most frequently occurring words account for 20% of word occurrences  Top fifty most frequently occurring words account for 50% of word occurrences  Given all unique words in a (large) document, approximately 50% occur only once

22  Zipf’s law:  Rank words in order of decreasing frequency  The rank (r) of a word times its frequency (f) is approximately equal to a constant (k): r × f ≈ k  In other words, the frequency of the rth most common word is inversely proportional to r  George Kingsley Zipf (1902-1950)

23  The probability of occurrence (Pr) of a word is the word frequency divided by the total number of words in the document  Revise Zipf’s law as: r × Pr = c (for English, c ≈ 0.1)
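A minimal sketch of checking r × Pr ≈ c on any plain-text collection (corpus.txt is a placeholder filename):

```python
import re
from collections import Counter

# Count word occurrences in an arbitrary text file (placeholder filename).
text = open("corpus.txt", encoding="utf-8").read().lower()
words = re.findall(r"[a-z]+", text)
counts = Counter(words)
total = sum(counts.values())

# For the top-ranked words, rank * probability should be roughly constant (~0.1 for English).
for rank, (word, freq) in enumerate(counts.most_common(10), start=1):
    p = freq / total
    print(f"{rank:2d}  {word:10s}  r*Pr = {rank * p:.3f}")
```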

24  Verify Zipf’s law using the AP89 dataset:  Collection of Associated Press (AP) news stories from 1989 (available at http://trec.nist.gov):

Total documents                    84,678
Total word occurrences         39,749,179
Vocabulary size                   198,763
Words occurring > 1000 times        4,169
Words occurring once               70,064

25  Top 50 words of AP89

26  For each document we process, the goal is to isolate each word occurrence  This is called tokenization or lexical analysis  We might also recognize various types of content, including:  Metadata (i.e. invisible tags)  Images and video (via textual tags)  Document structure (sections, tables, etc.)

27  Before we tokenize the given sequence of characters, we might normalize the text by:  Converting to lowercase  Omitting punctuation and special characters  Omitting words less than 3 characters long  Omitting HTML/XML/other tags  What do we do with numbers?
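A minimal sketch of the tokenization and normalization steps just described. The specific choices (dropping words shorter than 3 characters, keeping numbers) mirror the slide’s suggestions and are assumptions that are easy to change:

```python
import re

def tokenize(text):
    """Normalize raw document text and split it into index terms."""
    text = re.sub(r"<[^>]+>", " ", text)         # strip HTML/XML/other tags
    text = text.lower()                          # convert to lowercase
    words = re.findall(r"[a-z0-9]+", text)       # drop punctuation and special characters
    return [w for w in words if len(w) >= 3]     # omit words less than 3 characters long

print(tokenize("<p>The Quick, Brown Fox -- est. 1989!</p>"))
# -> ['the', 'quick', 'brown', 'fox', 'est', '1989']
```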

28  Certain function words (e.g. “the” and “of”) are typically ignored during text processing  These are called stopwords, because processing stops when they are encountered  Alone, stopwords rarely help identify document relevance  Stopwords occur so frequently that indexing them would bog down the index

29  Top 50 words of AP89  Mostly stopwords!

30  Constructing stopword lists:  Created manually (by a human!)  Created automatically using word frequencies ▪ Mark the top n most frequently occurring words as stopwords  What about “to be or not to be?”
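A minimal sketch of the automatic approach: mark the top n most frequently occurring words in the collection as stopwords (n and the sample token list are placeholders):

```python
from collections import Counter

def build_stopword_list(tokens, n=50):
    """Treat the n most frequently occurring words as stopwords."""
    return {word for word, _ in Counter(tokens).most_common(n)}

# Hypothetical usage: `tokens` would come from tokenizing the whole collection.
tokens = ["the", "of", "to", "be", "or", "not", "to", "be", "the", "of"]
print(build_stopword_list(tokens, n=3))   # e.g. {'the', 'of', 'to'}
```

Note how this illustrates the “to be or not to be” problem: a query made entirely of stopwords loses all of its terms.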

31  Stopword lists may differ based on what part of the document we are processing  Additional stopwords for anchor (<a>) tag text: ▪ click ▪ here ▪ more ▪ information ▪ read ▪ link ▪ view ▪ document

32  Stemming reduces different forms of a word down to a common stem ▪ Stemming reduces the number of unique words in each document ▪ Stemming increases the accuracy of search (by 5-10% for English)

33  A stem might not be an actual valid word

34  How do we implement stemming?  Use a dictionary-based approach to map words to their stems (http://wordnet.princeton.edu/)http://wordnet.princeton.edu/  Use an algorithmic approach ▪ Suffix-s stemming: remove last ‘s’ if present ▪ Suffix-ing stemming: remove trailing ‘ing’ ▪ Suffix-ed stemming: remove trailing ‘ed’ ▪ Suffix-er stemming: remove trailing ‘er’ ▪ etc.
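A minimal sketch of the algorithmic approach in its crudest form, applying the suffix rules listed above in order (real stemmers such as Porter’s add many conditions that these rules ignore):

```python
def naive_suffix_stem(word):
    """Strip one common suffix, if present -- a deliberately crude stemmer."""
    for suffix in ("ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

for w in ["fishing", "fished", "fisher", "fishes", "fish"]:
    print(w, "->", naive_suffix_stem(w))
# fishing -> fish, fished -> fish, fisher -> fish, fishes -> fishe, fish -> fish
```

The “fishes -> fishe” case shows the earlier point: a stem might not be an actual valid word.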

35  The Porter stemmer is an algorithmic stemmer developed in the 1970s/80s  http://tartarus.org/~martin/PorterStemmer/  Consists of a sequence of rules and steps focused on reducing or eliminating suffixes ▪ There are 5 steps, each with many “sub-steps”  Used in a variety of IR experiments  Effective at stemming TREC datasets  (photo: Dr. Martin Porter)
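If the NLTK library is installed (an assumption; it is not part of the course materials), its implementation of the Porter stemmer can be tried directly:

```python
from nltk.stem import PorterStemmer   # assumes `pip install nltk`

stemmer = PorterStemmer()
for word in ["consign", "consigned", "consigning", "consignment"]:
    print(word, "->", stemmer.stem(word))
# all four reduce to the stem 'consign'
```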

36

37  Nothing is perfect... ▪ also see http://snowball.tartarus.org  False positive: detects a relationship where one does not actually exist (same stem)  False negative: does not detect a relationship where one does exist (different stem)

38  An n-gram refers to any consecutive sequence of n words ▪ The more frequently an n-gram occurs, the more likely it is to correspond to a meaningful phrase in the language  Overlapping n-grams with n = 2 (a.k.a. bigrams) for “World news about the United States”: World news / news about / about the / the United / United States
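A minimal sketch of extracting overlapping n-grams from a token list, reproducing the bigram example above:

```python
def ngrams(tokens, n=2):
    """Return all overlapping n-word sequences (bigrams when n=2)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "World news about the United States".split()
for bigram in ngrams(tokens, 2):
    print(" ".join(bigram))
# World news / news about / about the / the United / United States
```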

39  Phrases are:  More precise than single words ▪ e.g. “black sea” instead of “black” and “sea”  Less ambiguous than single words ▪ e.g. “big apple” instead of “apple”  Drawback:  Phrases and n-grams tend to make ranking more difficult

40  By applying a part-of-speech (POS) tagger, high-frequency noun phrases are detected  (but too slow!)

41  Word n-grams follow a Zipf distribution, much like single word frequencies

42  A sampling from Google: ▪ Most common English trigram: all rights reserved ▪ see http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html

43  Computers store and retrieve information  Retrieval first requires finding information  Once we find the data, we often must extract what we need...

44  Web server logs record each and every access to the Web server  Use the data to answer questions:  Which pages are the most popular?  How much spam is the site experiencing?  Are certain days/times busier than others?  Are there any missing pages (bad links)?  Where is the traffic coming from?

45  Apache software records an access_log file, e.g.:

75.194.143.61 - - [26/Sep/2010:22:38:12 -0400] "GET /cis460/wordfreq.php HTTP/1.1" 200 566

 The fields are: requesting IP (or host), username/password, access timestamp, HTTP request, server response code, and size in bytes of data returned  (for server response codes, see http://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
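A minimal sketch of pulling those fields out of an access_log line with a regular expression; the sample line is the one from the slide, and the field names are my own labels:

```python
import re

# Fields of the (Common) Log Format: host, identity, user, timestamp, request, status, size.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)

line = '75.194.143.61 - - [26/Sep/2010:22:38:12 -0400] "GET /cis460/wordfreq.php HTTP/1.1" 200 566'
m = LOG_PATTERN.match(line)
print(m.group("host"))      # requesting IP (or host)
print(m.group("request"))   # GET /cis460/wordfreq.php HTTP/1.1
print(m.group("status"))    # 200
print(m.group("size"))      # 566 bytes returned
```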

46  Links are useful to us humans for navigating Web sites and finding things  Links are also useful to search engines  e.g. in a link, “Latest News” is the anchor text and the link’s target is the destination link (URL)

47  How does anchor text apply to ranking?  Anchor text describes the content of the destination page  Anchor text is short, descriptive, and often coincides with query text  Anchor text is typically written by a non-biased third party

48  We often represent Web pages as vertices and links as edges in a webgraph  (image: http://www.openarchives.org/ore/0.1/datamodel-images/WebGraphBase.jpg)

49  An example webgraph  (image: http://www.growyourwritingbusiness.com/images/web_graph_flower.jpg)

50  Links may be interpreted as describing a destination Web page in terms of its:  Popularity  Importance  We focus on incoming links (inlinks)  And use this for ranking matching documents  A drawback is obtaining incoming link data  A simple measure of authority is the incoming link count (see the sketch below)
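A minimal sketch of computing inlink counts from a list of (source, destination) link pairs; the page names and edges are hypothetical:

```python
from collections import Counter

# Hypothetical webgraph edges: (source page, destination page).
links = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "C"), ("C", "A")]

inlinks = Counter(dst for _, dst in links)
print(inlinks.most_common())   # [('C', 3), ('B', 1), ('A', 1)] -- C has the most inlinks
```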

51  PageRank is a link analysis algorithm  PageRank is credited to Sergey Brin and Lawrence Page (the Google guys!)  The original PageRank paper: ▪ http://infolab.stanford.edu/~backrub/google.html

52  Browse the Web as a random surfer:  Choose a random number r between 0 and 1  If r < λ then go to a random page  else follow a random link from the current page  Repeat!  The PageRank of page A (denoted PR(A)) is the probability that this “random surfer” will be looking at that page
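A minimal sketch of the iterative form of this computation (my own simplification, not the original Brin/Page code); λ is the random-jump probability from the slide, and the tiny three-page graph is made up:

```python
def pagerank(graph, lam=0.15, iterations=50):
    """graph maps each page to the list of pages it links to."""
    pages = list(graph)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}                    # start from a uniform distribution

    for _ in range(iterations):
        new = {p: lam / n for p in pages}               # probability of jumping to a random page
        for p, outlinks in graph.items():
            targets = outlinks if outlinks else pages   # page with no links: jump anywhere
            share = (1 - lam) * pr[p] / len(targets)
            for q in targets:
                new[q] += share                         # probability of following a random link
        pr = new
    return pr

# Hypothetical three-page web: A <-> B, B -> C, C -> A.
print(pagerank({"A": ["B"], "B": ["A", "C"], "C": ["A"]}))
```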

53  Jumping to a random page avoids getting stuck in:  Pages that have no links  Pages that only have broken links  Pages that loop back to previously visited pages

54  A cycle tends to negate the effectiveness of the PageRank algorithm

55  A retrieval model is a formal (mathematical) representation of the process of matching a query and a document  Forms the basis of ranking results  (diagram: a user’s query terms matched against candidate documents doc 123 through doc 972)
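As a toy illustration only (not one of the retrieval models covered in the textbook), here is the simplest possible matching-and-ranking scheme: score each document by how many query terms it contains. The documents and query are made up:

```python
def score(query_terms, doc_terms):
    """Toy retrieval model: score = number of query terms appearing in the document."""
    return len(set(query_terms) & set(doc_terms))

docs = {                      # hypothetical tokenized documents
    "doc 123": ["abraham", "lincoln", "civil", "war"],
    "doc 456": ["apple", "pie", "recipe"],
}
query = ["abraham", "lincoln"]
ranked = sorted(docs, key=lambda d: score(query, docs[d]), reverse=True)
print(ranked)                 # ['doc 123', 'doc 456']
```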

56  Goal: Retrieve exactly the documents that users want (whether they know it or not!)  A good retrieval model finds documents that are likely to be considered relevant by the user submitting the query (i.e. user relevance)  A good retrieval model also often considers topical relevance

57  Given a query, topical relevance identifies documents judged to be on the same topic  Even though keyword-based document scores might show a lack of relevance!  (diagram: for the query “Abraham Lincoln”, related topics include the Civil War, U.S. Presidents, stovepipe hats, and tall guys with beards)

58  User relevance is difficult to quantify because of each user’s subjectivity  Humans often have difficulty explaining why one document is more relevant than another  Humans may disagree about a given document’s relevance in relation to the same query

