Lecture 2: Vocabulary size and term distribution (tokenization, text normalization, and stemming)

Overview
- Getting started: tokenization, stemming, compounds, end-of-sentence detection
- Collection vocabulary: terms, tokens, types; vocabulary size; term distribution
- Stop words
- Vector representation of text and term weighting

Tokenization
Input:  Friends, Romans, Countrymen, lend me your ears;
Output: Friends | Romans | Countrymen | lend | me | your | ears
- Token: an instance of a sequence of characters grouped together as a useful semantic unit for processing
- Type: the class of all tokens containing the same character sequence
- Term: a (normalized) type that is included in the system's dictionary
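To make the token/type distinction concrete, here is a minimal Python sketch; the punctuation handling is deliberately naive:

```python
text = "Friends, Romans, Countrymen, lend me your ears"

# Tokens: every occurrence of a character sequence after splitting.
tokens = text.replace(",", "").lower().split()

# Types: the distinct character sequences among the tokens.
types = set(tokens)

print(len(tokens), "tokens:", tokens)
print(len(types), "types:", sorted(types))
```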

Example: The cat slept peacefully in the living room. It's a very old cat.

Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing.
How should we handle special cases involving apostrophes, hyphens, etc.?
- C++, C#, URLs, email addresses, phone numbers, dates
- Multi-word names: San Francisco, Los Angeles
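As a sketch of how such cases can be handled, here is an illustrative regex-based tokenizer; the patterns are assumptions for demonstration and far from exhaustive:

```python
import re

# Ordered alternatives: match complex cases (URLs, emails, C++/C#, dates)
# before falling back to words with internal apostrophes or hyphens.
TOKEN_RE = re.compile(r"""
    https?://\S+                   # URLs
  | [\w.+-]+@[\w-]+\.[\w.]+        # email addresses
  | [Cc]\+\+ | [Cc]\#              # programming-language names
  | \d{1,2}/\d{1,2}/\d{2,4}        # simple dates, e.g. 20/07/2000
  | \w+(?:[-']\w+)*                # words with internal hyphens/apostrophes
""", re.VERBOSE)

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("Mr. O'Neill thinks that the boys' stories "
               "about Chile's capital aren't amusing."))
print(tokenize("Email me at jane@example.org about C++ on 20/07/2000."))
```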

Issues of tokenization are language-specific
- Requires the language of the document to be known
- Language identification based on classifiers that use short character subsequences as features is highly effective
- Most languages have distinctive signature patterns
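A minimal sketch of profile-based language identification using character trigrams; the two tiny "training" strings are toy assumptions standing in for real corpora:

```python
from collections import Counter

def char_ngrams(text, n=3):
    text = f"  {text.lower()}  "  # pad so word boundaries count as features
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Toy "training" profiles; a real system would build these from large corpora.
profiles = {
    "en": char_ngrams("the cat slept peacefully in the living room"),
    "de": char_ngrams("die katze schlief friedlich im wohnzimmer"),
}

def identify(text):
    grams = char_ngrams(text)
    # Score each language by the overlap of n-gram counts with its profile.
    scores = {lang: sum(min(c, prof[g]) for g, c in grams.items())
              for lang, prof in profiles.items()}
    return max(scores, key=scores.get)

print(identify("the old cat"))     # -> 'en'
print(identify("die alte katze"))  # -> 'de'
```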

Tokenization is very important for information retrieval
Splitting tokens on whitespace alone can cause bad retrieval results:
- A search for York University returns pages containing New York University
German compound nouns:
- Retrieval systems for German benefit greatly from a compound-splitter module
- It checks whether a word can be subdivided into words that appear in the vocabulary (see the segmentation sketch below)
East Asian languages (Chinese, Japanese, Korean, Thai):
- Text is written without any spaces between words
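A minimal sketch of a dictionary-based splitter using greedy longest-match segmentation (the same idea applies to unsegmented East Asian text); the vocabulary here is an assumption for illustration, and real compound splitters also use corpus frequencies:

```python
def max_match(text, vocab):
    """Greedy longest-match segmentation, scanning left to right."""
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest piece first
            if text[i:j] in vocab or j == i + 1:  # single char as fallback
                words.append(text[i:j])
                i = j
                break
    return words

# Toy German vocabulary (an assumption for illustration).
vocab = {"lebens", "versicherungs", "gesellschafts", "angestellter"}
print(max_match("lebensversicherungsgesellschaftsangestellter", vocab))
# -> ['lebens', 'versicherungs', 'gesellschafts', 'angestellter']
```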

Stop words
Very common words that have no discriminatory power

Building a stop word list by collection frequency
- Sort terms by collection frequency and take the most frequent ones (as sketched below)
- Caveat: in a collection about insurance practices, "insurance" would become a stop word
Why do we need stop lists?
- Smaller indices for information retrieval
- Better approximation of term importance for summarization, etc.
But their use is problematic for phrase searches (e.g., "President of the United States")
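A minimal sketch of building a stop list by collection frequency; note how "cat" surfaces as a stop word in this toy cat-heavy collection, mirroring the "insurance" caveat above:

```python
from collections import Counter

def build_stop_list(documents, k=25):
    """Take the k terms with highest collection frequency as stop words."""
    cf = Counter()
    for doc in documents:
        cf.update(doc.lower().split())
    return {term for term, _ in cf.most_common(k)}

docs = ["the cat slept peacefully in the living room",
        "it is a very old cat"]
print(build_stop_list(docs, k=3))  # e.g. {'the', 'cat', 'slept'}
```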

Trends in IR systems over time:
- Large stop lists (200-300 terms)
- Very small stop lists (7-12 terms)
- No stop list whatsoever
The 30 most common words account for about 30% of the tokens in written text, yet modern systems can do without stop lists because:
- Good compression techniques keep the index space taken by common words small
- Term weighting ensures that very common words have little impact on document representation

Normalization
Token normalization: canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens
- U.S.A. vs. USA
- anti-discriminatory vs. antidiscriminatory
- car vs. automobile? (synonymy: beyond what character-level normalization can capture)
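A minimal sketch of equivalence classing with a hand-written normalization function; the specific rules are illustrative assumptions:

```python
def normalize(token):
    """Map superficially different tokens to one equivalence class."""
    token = token.lower()
    token = token.replace(".", "")  # U.S.A. -> usa
    token = token.replace("-", "")  # anti-discriminatory -> antidiscriminatory
    return token

assert normalize("U.S.A.") == normalize("USA")
assert normalize("Anti-discriminatory") == normalize("antidiscriminatory")
# Note: car vs. automobile is synonymy, which character-level
# normalization like this cannot capture.
```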

Normalization is sensitive to the query: the same word form should match different sets of terms depending on what the user typed.

Query term | Terms that should match
-----------|--------------------------
Windows    | Windows
windows    | Windows, windows, window
window     | window, windows

Capitalization / case folding
Good for:
- Allowing instances of Automobile at the beginning of a sentence to match a query for automobile
- Helping a search engine when most users type ferrari while interested in a Ferrari car
Bad for:
- Proper names vs. common nouns: General Motors, Associated Press, Black (e.g., as a surname) vs. black
Heuristic solution: lowercase only words at the beginning of a sentence (a sketch follows below); truecasing via machine learning
In IR, lowercasing everything is often most practical because of the way users issue their queries
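A minimal sketch of the sentence-initial lowercasing heuristic; the end-of-sentence test here is deliberately naive:

```python
def heuristic_case_fold(text):
    """Lowercase only sentence-initial words; keep mid-sentence
    capitalization (likely proper names) intact."""
    out, sentence_start = [], True
    for tok in text.split():
        out.append(tok.lower() if sentence_start else tok)
        sentence_start = tok.endswith((".", "!", "?"))
    return " ".join(out)

print(heuristic_case_fold("The engineers at General Motors agreed. "
                          "Automobiles need testing."))
# -> 'the engineers at General Motors agreed. automobiles need testing.'
```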

Other languages
- About 60% of webpages are in English, but:
- Less than one third of Internet users speak English
- Less than 10% of the world's population primarily speaks English
- Only about one third of blog posts are in English

Stemming and lemmatization
Goal: reduce inflectional and derivational variants of a word to a common base form
- organize, organizes, organizing
- democracy, democratic, democratization
- am, are, is → be
- car, cars, car's, cars' → car

Stemming:
- A crude heuristic process that chops off the ends of words
- e.g., democratic → democrat (the output need not be a real word)
Lemmatization:
- Uses a vocabulary and morphological analysis to return the base form of a word (the lemma)
- e.g., democratic → democracy; sang → sing
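For comparison, a small sketch using NLTK's stemmer and lemmatizer; this assumes nltk is installed (pip install nltk) and the WordNet data has been downloaded via nltk.download("wordnet"):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["organizing", "democratic", "sang", "are"]:
    print(word,
          "| stem:", stemmer.stem(word),
          "| lemma:", lemmatizer.lemmatize(word, pos="v"))
# Stemming chops suffixes (organizing -> organ), while lemmatization
# returns real base forms (sang -> sing, are -> be).
```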

Porter stemmer
The most common algorithm for stemming English: five phases of word reduction. Example rules from the first phase:
- SSES → SS    caresses → caress
- IES → I      ponies → poni
- SS → SS      caress → caress
- S → (drop)   cats → cat
A later rule with a condition on the measure m of the stem:
- (m > 1) EMENT → (drop)   replacement → replac, but cement → cement
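A minimal sketch of the first rule group above (step 1a of the algorithm); the full stemmer has four more phases plus measure conditions like (m > 1):

```python
def porter_step_1a(word):
    """Apply the plural rules listed above, longest suffix first."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```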

Vocabulary size
- Dictionaries: 600,000+ words
- But they do not include names of people, locations, products, etc.

Heaps' law: estimating the number of terms
M = k * T^b
- M: vocabulary size (number of distinct terms)
- T: number of tokens in the collection
- Typical values: 30 < k < 100 and b ≈ 0.5
In log-log space this is a linear relation: log M = log k + b log T
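A minimal sketch that fits k and b by least squares in log-log space; the token/vocabulary measurements below are hypothetical numbers consistent with k = 44, b = 0.5:

```python
import math

def fit_heaps(token_counts, vocab_sizes):
    """Least-squares fit of log M = log k + b * log T."""
    xs = [math.log(t) for t in token_counts]
    ys = [math.log(m) for m in vocab_sizes]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    k = math.exp(my - b * mx)
    return k, b

# Hypothetical measurements taken while scanning a growing collection.
tokens = [10_000, 100_000, 1_000_000, 10_000_000]
vocab = [4_400, 14_000, 44_000, 140_000]
k, b = fit_heaps(tokens, vocab)
print(f"k = {k:.1f}, b = {b:.2f}")  # roughly k = 44, b = 0.5
```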

Zipf's law: modeling the distribution of terms
The collection frequency cf_i of the i-th most common term is proportional to 1/i:
cf_i ∝ 1/i
If the most frequent term occurs cf_1 times, then the second most frequent term has half as many occurrences, the third most frequent term a third as many, and so on.
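A minimal sketch for eyeballing Zipf's law on any tokenized text; corpus.txt in the usage comment is a hypothetical file:

```python
from collections import Counter

def zipf_check(tokens, top=10):
    """Print observed frequencies next to the cf_1 / rank prediction."""
    ranked = Counter(tokens).most_common(top)
    cf1 = ranked[0][1]
    for rank, (term, freq) in enumerate(ranked, start=1):
        print(f"{rank:2d}  {term:12s} observed={freq:8d}  "
              f"predicted={cf1 / rank:10.1f}")

# Usage with any large tokenized text, e.g.:
# zipf_check(open("corpus.txt", encoding="utf-8").read().lower().split())
```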

Problems with normalization
- A change in the stop word list can dramatically alter term weightings
- A document may contain an outlier term