1
The College of Saint Rose
CIS 460 – Search and Information Retrieval
David Goldschmidt, Ph.D.
from Search Engines: Information Retrieval in Practice, 1st edition, by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0
2
For each document we process, the goal is to isolate each word occurrence
This is called tokenization or lexical analysis
We might also recognize various types of content, including:
▪ Metadata (i.e. invisible tags)
▪ Images and video (via textual tags)
▪ Document structure (sections, tables, etc.)
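As a rough illustration of this step, here is a minimal tokenizer sketch in Python, assuming we treat every maximal run of letters and digits as a word occurrence (real systems refine these rules considerably):

```python
import re

def tokenize(text):
    """Isolate word occurrences: any maximal run of letters or digits.

    Whitespace, punctuation, and other special characters act as
    separators; recognizing tags, structure, etc. is handled elsewhere.
    """
    return re.findall(r"[A-Za-z0-9]+", text)

print(tokenize("World news about the United States."))
# ['World', 'news', 'about', 'the', 'United', 'States']
```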
3
Before we tokenize the given sequence of characters, we might normalize the text by:
▪ Converting to lowercase
▪ Omitting punctuation and special characters
▪ Omitting words less than 3 characters long
▪ Omitting HTML/XML/other tags
What do we do with numbers?
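A minimal sketch of these normalization rules, assuming tag stripping happens before tokenization and that numbers are kept (dropping them is an equally valid choice):

```python
import re

def normalize(text, min_length=3):
    """Normalize document text before indexing.

    Mirrors the list above: strip HTML/XML tags, convert to lowercase,
    omit punctuation/special characters, and omit words shorter than
    min_length characters. Digits are kept here, but could be dropped.
    """
    text = re.sub(r"<[^>]+>", " ", text)             # omit HTML/XML/other tags
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # lowercase, drop punctuation
    return [t for t in tokens if len(t) >= min_length]

print(normalize("<h1>Search Engines</h1> are used by most of us!"))
```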
4
Certain function words (e.g. "the" and "of") are typically ignored during text processing
These are called stopwords, because processing stops when they are encountered
Alone, stopwords rarely help identify document relevance
Stopwords occur very frequently, so indexing every occurrence would bog down our indexes
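A minimal sketch of how stopwords might be dropped from the token stream at indexing time; the tiny stopword set here is purely illustrative:

```python
# Purely illustrative stopword set; real lists contain many more entries.
STOPWORDS = {"the", "of", "a", "an", "and", "to", "in", "is", "for"}

def remove_stopwords(tokens):
    """Drop function words that, alone, rarely help identify relevance."""
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(["the", "tropical", "fish", "of", "the", "amazon"]))
# ['tropical', 'fish', 'amazon']
```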
5
Top 50 words of the AP89 collection: mostly stopwords!
6
Constructing stopword lists:
Created manually (by a human!)
Created automatically using word frequencies
▪ Mark the top n most frequently occurring words as stopwords (see the sketch below)
What about "to be or not to be?"
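A sketch of the automatic approach, assuming the collection has already been tokenized and normalized; choosing n is left to the implementer (n = 50 echoes the AP89 table above):

```python
from collections import Counter

def build_stopword_list(documents, n=50):
    """Mark the top n most frequently occurring words as stopwords.

    `documents` is an iterable of token lists; n = 50 mirrors the
    "top 50 words of AP89" example above.
    """
    counts = Counter()
    for tokens in documents:
        counts.update(tokens)
    return {word for word, _ in counts.most_common(n)}
```

Note that such a list would also swallow every word of the query "to be or not to be", which is exactly the problem the question above points at.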
7
Stopword lists may differ based on what part of the document we are processing
Additional stopwords for anchor (<a>) text:
▪ click ▪ here ▪ more ▪ information ▪ read ▪ link ▪ view ▪ document
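One simple way to realize per-field stopword lists is a lookup table keyed by document part; the field names and the small general list below are assumptions for illustration only:

```python
GENERAL_STOPWORDS = {"the", "of", "a", "an", "and", "to", "in"}

# Extra stopwords applied only when processing anchor (<a>) text.
ANCHOR_STOPWORDS = GENERAL_STOPWORDS | {
    "click", "here", "more", "information", "read", "link", "view", "document",
}

STOPWORDS_BY_FIELD = {
    "body": GENERAL_STOPWORDS,
    "anchor": ANCHOR_STOPWORDS,
}

def stopwords_for(field):
    """Return the stopword set for a field, defaulting to the general list."""
    return STOPWORDS_BY_FIELD.get(field, GENERAL_STOPWORDS)
```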
8
Stemming reduces different forms of a word down to a common stem
▪ Stemming reduces the number of unique words in each document
▪ Stemming increases the accuracy of search (by 5-10% for English)
9
A stem might not be an actual valid word (e.g. "computer", "computing", and "computation" may all reduce to the stem "comput")
10
How do we implement stemming?
Use a dictionary-based approach to map words to their stems (http://wordnet.princeton.edu/)
Use an algorithmic approach:
▪ Suffix-s stemming: remove last 's' if present
▪ Suffix-ing stemming: remove trailing 'ing'
▪ Suffix-ed stemming: remove trailing 'ed'
▪ Suffix-er stemming: remove trailing 'er'
▪ etc.
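A naive sketch of the algorithmic approach listed above: strip one of a few common suffixes, with a crude length guard so very short words are left alone. It is nowhere near a real stemmer, but it shows the basic idea:

```python
def suffix_stem(word):
    """Naive suffix stemmer: remove a trailing 'ing', 'ed', 'er', or 's'.

    The length guard keeps short words (e.g. 'sing', 'red') intact, but
    plenty of errors remain; for instance 'news' becomes 'new'.
    """
    for suffix in ("ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print([suffix_stem(w) for w in ["fishing", "fished", "fisher", "fishes", "fish"]])
# ['fish', 'fish', 'fish', 'fishe', 'fish']
```

Note the stem 'fishe' in the output, another reminder that a stem need not be a valid word.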
11
The Porter stemmer is an algorithmic stemmer developed by Dr. Martin Porter in the 1970s/80s
http://tartarus.org/~martin/PorterStemmer/
Consists of a sequence of rules and steps focused on reducing or eliminating suffixes
▪ There are 5 steps, each with many "sub-steps"
Used in a variety of IR experiments
Effective at stemming TREC datasets
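In practice we would reuse an existing implementation rather than re-code the five steps; a sketch assuming the NLTK package and its port of the Porter stemmer are installed:

```python
from nltk.stem import PorterStemmer  # assumes nltk is installed (pip install nltk)

stemmer = PorterStemmer()
for word in ["searching", "searched", "searches", "relational", "relate"]:
    print(word, "->", stemmer.stem(word))
```

On this sample the first three words should all conflate to the stem "search", while "relational" and "relate" both reduce to the non-word stem "relat".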
13
Nothing is perfect: the Porter stemmer sometimes detects a relationship where one does not actually exist (unrelated words reduced to the same stem), and sometimes fails to detect a relationship where one does exist (related words reduced to different stems)
▪ also see http://snowball.tartarus.org
14
An n-gram refers to any consecutive sequence of n words
▪ The more frequently an n-gram occurs, the more likely it is to correspond to a meaningful phrase in the language
Example: "World news about the United States"
Overlapping n-grams with n = 2 (a.k.a. bigrams):
▪ World news
▪ news about
▪ about the
▪ the United
▪ United States
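A minimal sketch of overlapping n-gram extraction from a token list; with n = 2 it produces exactly the bigrams shown above:

```python
def ngrams(tokens, n=2):
    """Return all overlapping n-grams (consecutive sequences of n words)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["World", "news", "about", "the", "United", "States"]
print(ngrams(tokens, n=2))
# [('World', 'news'), ('news', 'about'), ('about', 'the'),
#  ('the', 'United'), ('United', 'States')]
```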
15
Phrases are:
More precise than single words
▪ e.g. "black sea" instead of "black" and "sea"
Less ambiguous than single words
▪ e.g. "big apple" instead of "apple"
Drawback: phrases and n-grams tend to make ranking more difficult
16
By applying a part-of-speech (POS) tagger, we can detect high-frequency noun phrases (but POS tagging is too slow for large collections!)
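A rough sketch of this idea, assuming NLTK together with its tokenizer ("punkt") and POS tagger models have been downloaded; here a "noun phrase" is approximated as any maximal run of noun tags (NN, NNS, NNP, NNPS), which is a simplification of what a real phrase detector does:

```python
import nltk  # assumes nltk plus its tokenizer and tagger data are available

def noun_phrases(text):
    """Approximate noun phrases as maximal runs of noun-tagged tokens."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    phrases, current = [], []
    for word, tag in tagged:
        if tag.startswith("NN"):        # NN, NNS, NNP, NNPS
            current.append(word)
        else:
            if len(current) > 1:        # keep multi-word runs only
                phrases.append(" ".join(current))
            current = []
    if len(current) > 1:
        phrases.append(" ".join(current))
    return phrases

print(noun_phrases("The search engine indexes news stories about the United States."))
```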
17
Word n-grams follow a Zipf distribution, much like single word frequencies
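A quick way to see this empirically: count bigram frequencies over a collection and compare rank times frequency, which stays roughly constant for a Zipf-like distribution (a sketch, assuming token lists as in the earlier examples):

```python
from collections import Counter

def bigram_rank_frequency(token_lists, top=20):
    """Print rank, bigram, frequency, and rank*frequency.

    If word bigrams follow a Zipf distribution, frequency is roughly
    proportional to 1/rank, so rank*frequency stays roughly constant.
    """
    counts = Counter()
    for tokens in token_lists:
        counts.update(zip(tokens, tokens[1:]))   # overlapping bigrams
    for rank, (bigram, freq) in enumerate(counts.most_common(top), start=1):
        print(f"{rank:>4}  {' '.join(bigram):<30} {freq:>8} {rank * freq:>10}")
```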
18
A sampling from Google:
▪ Most common English trigram: "all rights reserved"
▪ see http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
19
Read and study Chapter 4