The College of Saint Rose CIS 521 / MBA 541 – Introduction to Internet Development David Goldschmidt, Ph.D. selected material from Search Engines: Information Retrieval in Practice, 1st edition by Croft, Metzler, and Strohman, Pearson, 2010, ISBN

 What is search?  What are we searching for?  How many searches are processed per day?  What is the average number of words in text-based searches?

 Applications and varieties of search:  Web search  Site search  Vertical search  Enterprise search  Desktop search  Peer-to-peer search

where do we search next? how do we acquire new documents?

how do we best convert documents to their index terms? how do we make acquired documents searchable?

 Relevance  Search results contain information the searcher was looking for  Problems with vocabulary mismatch ▪ Homonyms (e.g. “Jersey shore”)  User relevance  Search results relevant to one user may be completely irrelevant to another user

 Precision  Proportion of retrieved documents that are relevant  How precise were the results?  Recall (and coverage)  Proportion of relevant documents that were actually retrieved  Did we retrieve all of the relevant documents?
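To make these measures concrete, here is a minimal sketch (hypothetical document IDs, not from the book) that computes precision and recall for a single query:

    # precision and recall for one query, given sets of document IDs
    def precision_recall(retrieved, relevant):
        hits = len(retrieved & relevant)              # relevant documents actually returned
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        return precision, recall

    retrieved = {"d1", "d2", "d3", "d4"}              # hypothetical results
    relevant = {"d2", "d4", "d7"}                     # hypothetical relevance judgments
    print(precision_recall(retrieved, relevant))      # (0.5, 0.666...)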

 Timeliness and freshness  Search results contain information that is current and up-to-date  Performance  Users expect subsecond response times  Media  Users increasingly use cellphones, mobile devices

 Scalability  Designs that work must continue to perform well as the system grows and expands ▪ Increased number of documents, number of users, etc.  Flexibility (or adaptability)  Tune search engine components to keep up with a changing landscape  Spam-resistance

 Gerard Salton (1927–1995)  Pioneer in information retrieval  Defined information retrieval as:  “a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information”  This was 1968 (before the Internet and Web!)

 Structured information:  Often stored in a database  Organized via predefined tables, columns, etc.  Select all accounts with balances less than $200  Unstructured information  Document text (headings, words, phrases)  Images, audio, video (often relies on textual tags) [example table: “account number” and “balance” columns, e.g. an account with a balance of $195.75]

 Search and IR have largely focused on text processing and documents  Search typically uses the statistical properties of text  Word counts  Word frequencies  But ignores linguistic features (noun, verb, etc.)
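A minimal sketch of such statistics (a toy document, not the book’s code):

    from collections import Counter

    # word counts and relative frequencies for a toy document
    text = "the cat sat on the mat and the dog sat too"
    words = text.split()
    counts = Counter(words)
    total = len(words)
    for word, count in counts.most_common(3):
        print(word, count, count / total)   # e.g. "the" occurs 3 times, frequency 3/11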

 Image search currently relies on textual tags  Therefore it is just another form of text-based search [example image tags: “Edie”, “little girl”, “kid’s laptop”, “bare foot”, “drink”, “sippy cup”]

 A URL identifies a resource on the Web, and consists of:  A scheme or protocol (e.g. http, https)  A hostname (e.g. academic2.strose.edu)  A resource (e.g. /math_and_science/goldschd)  e.g. http://academic2.strose.edu/math_and_science/goldschd

 When a client requests a Web page, the client uses either a GET or POST request  (followed by the protocol version and a blank line), e.g.:
    GET / HTTP/1.0
    GET /subdir/stuff.html HTTP/1.0
    GET /images/icon.png HTTP/1.0
    GET /docs/paper.pdf HTTP/1.0
 We hopefully receive a 200 response:
    HTTP/1.0 200 OK
    Date: current date and time
    Last-Modified: last modified date and time
    etc.
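As an illustrative sketch (not part of the slides), Python’s standard http.client module can issue such a GET request; example.com is a placeholder host, and note that http.client speaks HTTP/1.1 rather than HTTP/1.0:

    import http.client

    conn = http.client.HTTPConnection("example.com", 80)   # placeholder host
    conn.request("GET", "/index.html")          # sends the request line plus headers
    response = conn.getresponse()
    print(response.status, response.reason)     # e.g. 200 OK
    print(response.getheader("Last-Modified"))  # may be None if the header is absent
    body = response.read()
    conn.close()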

 Web crawlers adhere to a politeness policy:  GET requests sent every few seconds or minutes  A robots.txt file specifies what crawlers are allowed to crawl:
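A minimal sketch, with a hypothetical site: a robots.txt might disallow certain paths, and a polite crawler can check it using Python’s standard urllib.robotparser before fetching:

    # hypothetical robots.txt at http://www.example.com/robots.txt:
    #   User-agent: *
    #   Disallow: /private/
    #   Crawl-delay: 10
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://www.example.com/robots.txt")
    rp.read()   # fetch and parse the file
    print(rp.can_fetch("MyCrawler", "http://www.example.com/private/page.html"))
    # False if /private/ is disallowed for all user agents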

Sitemaps tell crawlers about URLs that might not otherwise be discovered, and can assign each URL a priority (the default priority is 0.5)
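A minimal sketch of a sitemap entry (the URL and values are hypothetical); when no <priority> element is given, crawlers assume the default of 0.5:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.example.com/docs/paper.pdf</loc>
        <lastmod>2010-09-26</lastmod>
        <changefreq>monthly</changefreq>
        <priority>0.8</priority>
      </url>
    </urlset>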

how do we best convert documents to their index terms? how do we make acquired documents searchable?

 Simplest approach is find, which requires no text transformation  Useful in user applications, but not in search (why?)  Optional transformation handled during the find operation: case sensitivity

 English documents are predictable:  Top two most frequently occurring words are “the” and “of” (10% of word occurrences)  Top six most frequently occurring words account for 20% of word occurrences  Top fifty most frequently occurring words account for 50% of word occurrences  Given all unique words in a (large) document, approximately 50% occur only once

 Zipf’s law:  Rank words in order of decreasing frequency  The rank (r) of a word times its frequency (f) is approximately equal to a constant (k): r × f ≈ k  In other words, the frequency of the rth most common word is inversely proportional to r George Kingsley Zipf (1902–1950)

 The probability of occurrence (Pr) of a word is the word frequency divided by the total number of words in the document  Revise Zipf’s law as: r × Pr = c, where for English, c ≈ 0.1

 Verify Zipf’s law using the AP89 dataset:  Collection of Associated Press (AP) news stories from 1989 (distributed as part of the TREC collections)
    Total documents:              84,678
    Total word occurrences:       39,749,179
    Vocabulary size:              198,763
    Words occurring > 1000 times: 4,169
    Words occurring once:         70,064
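The same check can be run on any plain-text corpus. A minimal sketch in Python (the filename is a placeholder; AP89 itself is distributed through TREC and not bundled here):

    from collections import Counter

    # count words, then print r * Pr for the top-ranked words
    words = open("corpus.txt").read().lower().split()   # placeholder filename
    counts = Counter(words)
    total = len(words)
    for r, (word, f) in enumerate(counts.most_common(10), start=1):
        print(r, word, round(r * f / total, 3))   # r * Pr is roughly constant (~0.1 for English)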

 Top 50 words of AP89

 For each document we process, the goal is to isolate each word occurrence  This is called tokenization or lexical analysis  We might also recognize various types of content, including:  Metadata (i.e. invisible tags)  Images and video (via textual tags)  Document structure (sections, tables, etc.)

 Before we tokenize the given sequence of characters, we might normalize the text by:  Converting to lowercase  Omitting punctuation and special characters  Omitting words less than 3 characters long  Omitting HTML/XML/other tags  What do we do with numbers?
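A minimal sketch of these steps (the regex-based approach is an assumption, not the book’s code); here numbers are kept as tokens, which is one possible answer to the question above:

    import re

    # normalize and tokenize a string of (possibly tagged) text
    def tokenize(text):
        text = re.sub(r"<[^>]+>", " ", text)       # strip HTML/XML/other tags
        text = text.lower()                        # convert to lowercase
        tokens = re.findall(r"[a-z0-9]+", text)    # drop punctuation/special characters
        return [t for t in tokens if len(t) >= 3]  # omit words under 3 characters

    print(tokenize("<p>The Cat in the Hat, by Dr. Seuss (1957)</p>"))
    # ['the', 'cat', 'the', 'hat', 'seuss', '1957']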

 Certain function words (e.g. “the” and “of”) are typically ignored during text processing  These are called stopwords, because processing stops when they are encountered  Alone, stopwords rarely help identify document relevance  Stopwords occur very frequently, and indexing them would bog down the index

 Top 50 words of AP89  Mostly stopwords!

 Constructing stopword lists:  Created manually (by a human!)  Created automatically using word frequencies ▪ Mark the top n most frequently occurring words as stopwords  What about “to be or not to be?”

 Stopword lists may differ based on what part of the document we are processing  Additional stopwords for anchor (<a>) text: ▪ click ▪ here ▪ more ▪ information ▪ read ▪ link ▪ view ▪ document

 Stemming reduces different forms of a word down to a common stem ▪ Stemming reduces the number of unique words in each document ▪ Stemming increases the accuracy of search (by 5-10% for English)

 A stem might not be an actual valid word  e.g. the Porter stemmer reduces “computer” and “computation” to the stem “comput”

 How do we implement stemming?  Use a dictionary-based approach to map words to their stems  Use an algorithmic approach (see the sketch below) ▪ Suffix-s stemming: remove last ‘s’ if present ▪ Suffix-ing stemming: remove trailing ‘ing’ ▪ Suffix-ed stemming: remove trailing ‘ed’ ▪ Suffix-er stemming: remove trailing ‘er’ ▪ etc.
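A minimal sketch of the suffix-stripping rules above (illustrative only; real stemmers such as Porter apply many more rules and conditions):

    # strip one common suffix, longest first, keeping a stem of 3+ characters
    def simple_stem(word):
        for suffix in ("ing", "ed", "er", "s"):
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[:-len(suffix)]
        return word

    for w in ["running", "jumped", "taller", "cats", "news"]:
        print(w, "->", simple_stem(w))
    # running -> runn, jumped -> jump, taller -> tall, cats -> cat, news -> new

Note that “news” wrongly becomes “new”, an instance of the stemming errors described two slides below.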

 The Porter stemmer is an algorithmic stemmer developed in the 1970s and ’80s  Consists of a sequence of rules and steps focused on reducing or eliminating suffixes ▪ There are 5 steps, each with many “sub-steps”  Used in a variety of IR experiments  Effective at stemming TREC datasets [photo: Dr. Martin Porter]

 Nothing is perfect...  A stemmer may detect a relationship where one does not actually exist (unrelated words reduce to the same stem), or fail to detect a relationship where one does exist (related words reduce to different stems)

 An n-gram refers to any consecutive sequence of n words ▪ The more frequently an n-gram occurs, the more likely it is to correspond to a meaningful phrase in the language  e.g. overlapping n-grams with n = 2 (a.k.a. bigrams) for “World news about the United States”: World news / news about / about the / the United / United States
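A minimal sketch of generating overlapping n-grams (assuming simple whitespace tokenization):

    # produce all overlapping n-grams from a list of tokens
    def ngrams(tokens, n=2):
        return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    print(ngrams("World news about the United States".split()))
    # ['World news', 'news about', 'about the', 'the United', 'United States']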

 Phrases are:  More precise than single words ▪ e.g. “black sea” instead of “black” and “sea”  Less ambiguous than single words ▪ e.g. “big apple” instead of “apple”  Drawback:  Phrases and n-grams tend to make ranking more difficult

 By applying a part-of-speech (POS) tagger, high-frequency noun phrases are detected  (but too slow!)

 Word n-grams follow a Zipf distribution, much like single word frequencies

 A sampling from Google: ▪ Most common English trigram: all rights reserved ▪ see http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html

 Computers store and retrieve information  Retrieval first requires finding information  Once we find the data, we often must extract what we need...

 Weblogs record each and every access to the Web server  Use the data to answer questions  Which pages are the most popular?  How much spam is the site experiencing?  Are certain days/times busier than others?  Are there any missing pages (bad links)?  Where is the traffic coming from?

 Apache software records an access_log file; each entry includes the requesting IP (or host), username/password, access timestamp, the HTTP request, the server response code, and the size in bytes of the data returned, e.g.:
    [26/Sep/2010:22:38: ] "GET /cis460/wordfreq.php HTTP/1.1"
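A minimal sketch of parsing such entries with a regular expression (the sample line is reconstructed; the IP address, seconds, timezone, response code, and size are hypothetical):

    import re

    # Apache common log format: host ident user [timestamp] "request" status size
    LOG_PATTERN = re.compile(r'(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)')

    line = ('192.0.2.1 - - [26/Sep/2010:22:38:52 -0400] '
            '"GET /cis460/wordfreq.php HTTP/1.1" 200 8701')
    host, ident, user, when, request, status, size = LOG_PATTERN.match(line).groups()
    print(host, when, request, status, size)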

 Links are useful to us humans for navigating Web sites and finding things  Links are also useful to search engines  e.g. in <a href="...">Latest News</a>, “Latest News” is the anchor text and the href value is the destination link (URL)
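As an illustrative sketch (hypothetical URL, not from the slides), Python’s standard html.parser can pull out (anchor text, destination URL) pairs:

    from html.parser import HTMLParser

    # collect (anchor text, href) pairs from HTML
    class AnchorExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links, self._href, self._text = [], None, []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self._href = dict(attrs).get("href")
                self._text = []

        def handle_data(self, data):
            if self._href is not None:
                self._text.append(data)

        def handle_endtag(self, tag):
            if tag == "a" and self._href is not None:
                self.links.append(("".join(self._text).strip(), self._href))
                self._href = None

    p = AnchorExtractor()
    p.feed('<a href="http://example.com/news">Latest News</a>')
    print(p.links)   # [('Latest News', 'http://example.com/news')]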

 How does anchor text apply to ranking?  Anchor text describes the content of the destination page  Anchor text is short, descriptive, and often coincides with query text  Anchor text is typically written by an unbiased third party

 We often represent Web pages as vertices and links as edges in a webgraph

 An example: [figure: a small webgraph of pages (vertices) and links (edges)]

 Links may be interpreted as describing a destination Web page in terms of its:  Popularity  Importance  We focus on incoming links (inlinks)  And use this for ranking matching documents  Drawback is obtaining incoming link data  A page’s authority is estimated from its incoming link count

 PageRank is a link analysis algorithm  PageRank is attributed to Sergey Brin and Lawrence Page (the Google guys!)  Described in the original PageRank paper

 Browse the Web as a random surfer:  Choose a random number r between 0 and 1  If r < λ, then go to a random page  else, follow a random link from the current page  Repeat!  The PageRank of page A (denoted PR(A)) is the probability that this “random surfer” will be looking at that page
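A minimal sketch that estimates PageRank by simulating this random surfer on a tiny hypothetical webgraph (λ = 0.15 is a commonly used jump probability, an assumption here):

    import random

    # estimate PageRank as the fraction of surfer steps spent on each page
    def random_surfer(graph, steps=100_000, lam=0.15):
        pages = list(graph)
        visits = {page: 0 for page in pages}
        page = random.choice(pages)
        for _ in range(steps):
            if random.random() < lam or not graph[page]:   # random jump (also escapes dead ends)
                page = random.choice(pages)
            else:                                          # follow a random outgoing link
                page = random.choice(graph[page])
            visits[page] += 1
        return {p: v / steps for p, v in visits.items()}

    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}      # hypothetical webgraph
    print(random_surfer(graph))   # roughly {'A': 0.39, 'B': 0.21, 'C': 0.40}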

 Jumping to a random page avoids getting stuck in:  Pages that have no links  Pages that only have broken links  Pages that loop back to previously visited pages

 A cycle tends to negate the effectiveness of the PageRank algorithm

 A retrieval model is a formal (mathematical) representation of the process of matching a query and a document  Forms the basis of ranking results [figure: user query terms matched against candidate documents doc 123 through doc 972]

 Goal: Retrieve exactly the documents that users want (whether they know it or not!)  A good retrieval model finds documents that are likely to be considered relevant by the user submitting the query (i.e. user relevance)  A good retrieval model also often considers topical relevance

 Given a query, topical relevance identifies documents judged to be on the same topic  Even though keyword-based document scores might show a lack of relevance! [figure: the query “Abraham Lincoln” relates to the topics Civil War, U.S. Presidents, Stovepipe Hats, and Tall Guys with Beards]

 User relevance is difficult to quantify because of each user’s subjectivity  Humans often have difficulty explaining why one document is more relevant than another  Humans may disagree about a given document’s relevance in relation to the same query