1
The College of Saint Rose
CIS 460 – Search and Information Retrieval
David Goldschmidt, Ph.D.
from Search Engines: Information Retrieval in Practice, 1st edition, by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0
2
For each document we process, the goal is to isolate each word occurrence
This is called tokenization or lexical analysis
We might also recognize various types of content, including:
▪ Metadata (i.e. invisible tags)
▪ Images and video (via textual tags)
▪ Document structure (sections, tables, etc.)
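As a rough illustration of this step, here is a minimal tokenizer sketch in Python, assuming we treat every maximal run of letters and digits as a word occurrence (real systems refine these rules considerably):

```python
import re

def tokenize(text):
    """Isolate word occurrences: any maximal run of letters or digits.

    Whitespace, punctuation, and other special characters act as
    separators; recognizing tags, structure, etc. is handled elsewhere.
    """
    return re.findall(r"[A-Za-z0-9]+", text)

print(tokenize("World news about the United States."))
# ['World', 'news', 'about', 'the', 'United', 'States']
```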
3
Before we tokenize the given sequence of characters, we might normalize the text by:
▪ Converting to lowercase
▪ Omitting punctuation and special characters
▪ Omitting words less than 3 characters long
▪ Omitting HTML/XML/other tags
What do we do with numbers?
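A minimal sketch of these normalization rules, assuming tag stripping happens before tokenization and that numbers are kept (dropping them is an equally valid choice):

```python
import re

def normalize(text, min_length=3):
    """Normalize document text before indexing.

    Mirrors the list above: strip HTML/XML tags, convert to lowercase,
    omit punctuation/special characters, and omit words shorter than
    min_length characters. Digits are kept here, but could be dropped.
    """
    text = re.sub(r"<[^>]+>", " ", text)             # omit HTML/XML/other tags
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # lowercase, drop punctuation
    return [t for t in tokens if len(t) >= min_length]

print(normalize("<h1>Search Engines</h1> are used by most of us!"))
```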
4
Certain function words (e.g. "the" and "of") are typically ignored during text processing
These are called stopwords, because processing stops when they are encountered
Alone, stopwords rarely help identify document relevance
Stopwords occur very frequently, so indexing every occurrence would bog down our indexes
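A minimal sketch of how stopwords might be dropped from the token stream at indexing time; the tiny stopword set here is purely illustrative:

```python
# Purely illustrative stopword set; real lists contain many more entries.
STOPWORDS = {"the", "of", "a", "an", "and", "to", "in", "is", "for"}

def remove_stopwords(tokens):
    """Drop function words that, alone, rarely help identify relevance."""
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(["the", "tropical", "fish", "of", "the", "amazon"]))
# ['tropical', 'fish', 'amazon']
```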
5
Top 50 words of the AP89 collection: mostly stopwords!
6
Constructing stopword lists:
Created manually (by a human!)
Created automatically using word frequencies
▪ Mark the top n most frequently occurring words as stopwords (see the sketch below)
What about "to be or not to be?"
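A sketch of the automatic approach, assuming the collection has already been tokenized and normalized; choosing n is left to the implementer (n = 50 echoes the AP89 table above):

```python
from collections import Counter

def build_stopword_list(documents, n=50):
    """Mark the top n most frequently occurring words as stopwords.

    `documents` is an iterable of token lists; n = 50 mirrors the
    "top 50 words of AP89" example above.
    """
    counts = Counter()
    for tokens in documents:
        counts.update(tokens)
    return {word for word, _ in counts.most_common(n)}
```

Note that such a list would also swallow every word of the query "to be or not to be", which is exactly the problem the question above points at.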
7
Stopword lists may differ based on what part of the document we are processing
Additional stopwords for anchor (<a>) text:
▪ click ▪ here ▪ more ▪ information ▪ read ▪ link ▪ view ▪ document
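One simple way to realize per-field stopword lists is a lookup table keyed by document part; the field names and the small general list below are assumptions for illustration only:

```python
GENERAL_STOPWORDS = {"the", "of", "a", "an", "and", "to", "in"}

# Extra stopwords applied only when processing anchor (<a>) text.
ANCHOR_STOPWORDS = GENERAL_STOPWORDS | {
    "click", "here", "more", "information", "read", "link", "view", "document",
}

STOPWORDS_BY_FIELD = {
    "body": GENERAL_STOPWORDS,
    "anchor": ANCHOR_STOPWORDS,
}

def stopwords_for(field):
    """Return the stopword set for a field, defaulting to the general list."""
    return STOPWORDS_BY_FIELD.get(field, GENERAL_STOPWORDS)
```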
8
Stemming reduces different forms of a word down to a common stem
▪ Stemming reduces the number of unique words in each document
▪ Stemming increases the accuracy of search (by 5-10% for English)
9
A stem might not be an actual valid word (e.g. "computer", "computing", and "computation" may all reduce to the stem "comput")
10
How do we implement stemming?
Use a dictionary-based approach to map words to their stems (http://wordnet.princeton.edu/)
Use an algorithmic approach:
▪ Suffix-s stemming: remove last 's' if present
▪ Suffix-ing stemming: remove trailing 'ing'
▪ Suffix-ed stemming: remove trailing 'ed'
▪ Suffix-er stemming: remove trailing 'er'
▪ etc.
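A naive sketch of the algorithmic approach listed above: strip one of a few common suffixes, with a crude length guard so very short words are left alone. It is nowhere near a real stemmer, but it shows the basic idea:

```python
def suffix_stem(word):
    """Naive suffix stemmer: remove a trailing 'ing', 'ed', 'er', or 's'.

    The length guard keeps short words (e.g. 'sing', 'red') intact, but
    plenty of errors remain; for instance 'news' becomes 'new'.
    """
    for suffix in ("ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print([suffix_stem(w) for w in ["fishing", "fished", "fisher", "fishes", "fish"]])
# ['fish', 'fish', 'fish', 'fishe', 'fish']
```

Note the stem 'fishe' in the output, another reminder that a stem need not be a valid word.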
11
The Porter stemmer is an algorithmic stemmer developed by Dr. Martin Porter in the 1970s/80s
http://tartarus.org/~martin/PorterStemmer/
Consists of a sequence of rules and steps focused on reducing or eliminating suffixes
▪ There are 5 steps, each with many "sub-steps"
Used in a variety of IR experiments
Effective at stemming TREC datasets
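In practice we would reuse an existing implementation rather than re-code the five steps; a sketch assuming the NLTK package and its port of the Porter stemmer are installed:

```python
from nltk.stem import PorterStemmer  # assumes nltk is installed (pip install nltk)

stemmer = PorterStemmer()
for word in ["searching", "searched", "searches", "relational", "relate"]:
    print(word, "->", stemmer.stem(word))
```

On this sample the first three words should all conflate to the stem "search", while "relational" and "relate" both reduce to the non-word stem "relat".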
13
Nothing is perfect: the Porter stemmer sometimes detects a relationship where one does not actually exist (unrelated words reduced to the same stem), and sometimes fails to detect a relationship where one does exist (related words reduced to different stems)
▪ also see http://snowball.tartarus.org
14
An n-gram refers to any consecutive sequence of n words
▪ The more frequently an n-gram occurs, the more likely it is to correspond to a meaningful phrase in the language
Example: "World news about the United States"
Overlapping n-grams with n = 2 (a.k.a. bigrams):
▪ World news
▪ news about
▪ about the
▪ the United
▪ United States
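A minimal sketch of overlapping n-gram extraction from a token list; with n = 2 it produces exactly the bigrams shown above:

```python
def ngrams(tokens, n=2):
    """Return all overlapping n-grams (consecutive sequences of n words)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["World", "news", "about", "the", "United", "States"]
print(ngrams(tokens, n=2))
# [('World', 'news'), ('news', 'about'), ('about', 'the'),
#  ('the', 'United'), ('United', 'States')]
```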
15
Phrases are:
More precise than single words
▪ e.g. "black sea" instead of "black" and "sea"
Less ambiguous than single words
▪ e.g. "big apple" instead of "apple"
Drawback: phrases and n-grams tend to make ranking more difficult
16
By applying a part-of-speech (POS) tagger, we can detect high-frequency noun phrases (but POS tagging is too slow for large collections!)
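A rough sketch of this idea, assuming NLTK together with its tokenizer ("punkt") and POS tagger models have been downloaded; here a "noun phrase" is approximated as any maximal run of noun tags (NN, NNS, NNP, NNPS), which is a simplification of what a real phrase detector does:

```python
import nltk  # assumes nltk plus its tokenizer and tagger data are available

def noun_phrases(text):
    """Approximate noun phrases as maximal runs of noun-tagged tokens."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    phrases, current = [], []
    for word, tag in tagged:
        if tag.startswith("NN"):        # NN, NNS, NNP, NNPS
            current.append(word)
        else:
            if len(current) > 1:        # keep multi-word runs only
                phrases.append(" ".join(current))
            current = []
    if len(current) > 1:
        phrases.append(" ".join(current))
    return phrases

print(noun_phrases("The search engine indexes news stories about the United States."))
```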
17
Word n-grams follow a Zipf distribution, much like single word frequencies
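A quick way to see this empirically: count bigram frequencies over a collection and compare rank times frequency, which stays roughly constant for a Zipf-like distribution (a sketch, assuming token lists as in the earlier examples):

```python
from collections import Counter

def bigram_rank_frequency(token_lists, top=20):
    """Print rank, bigram, frequency, and rank*frequency.

    If word bigrams follow a Zipf distribution, frequency is roughly
    proportional to 1/rank, so rank*frequency stays roughly constant.
    """
    counts = Counter()
    for tokens in token_lists:
        counts.update(zip(tokens, tokens[1:]))   # overlapping bigrams
    for rank, (bigram, freq) in enumerate(counts.most_common(top), start=1):
        print(f"{rank:>4}  {' '.join(bigram):<30} {freq:>8} {rank * freq:>10}")
```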
18
A sampling from Google:
▪ Most common English trigram: "all rights reserved"
▪ see http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
19
Read and study Chapter 4