1 I256 Applied Natural Language Processing Fall 2009 Lecture 3 Morphology Stemming Tokenization Segmentation Barbara Rosario.

Presentation transcript:

1 I256 Applied Natural Language Processing Fall 2009 Lecture 3 Morphology Stemming Tokenization Segmentation Barbara Rosario

2 Morphology Morphology is the study of the internal structure of words, of the way words are built up from smaller meaning units. Morpheme: –The smallest meaningful unit in the grammar of a language. Two classes of morphemes –Stems: “main” morpheme of the word, supplying the main meaning (e.g. establish in the example below) –Affixes: add additional meaning Prefixes: anti-, dis- in antidisestablishmentarianism Suffixes: -ment, -arian, -ism in antidisestablishmentarianism Infixes: hingi (borrow) – humingi (borrower) in Tagalog, with the infix -um- Circumfixes: sagen (say) – gesagt (said) in German, with the circumfix ge- ... -t

3 Morphology: examples Unladylike –The word unladylike consists of three morphemes and four syllables. –Morpheme breaks: un- 'not' lady '(well behaved) female adult human' -like 'having the characteristics of' –None of these morphemes can be broken up any more without losing all sense of meaning. Lady cannot be broken up into "la" and "dy," even though "la" and "dy" are separate syllables. Note that each syllable has no meaning on its own. Dogs –The word dogs consists of two morphemes and one syllable: dog, and -s, a plural marker on nouns –Note that a morpheme like "-s" can just be a single phoneme and does not have to be a whole syllable. Technique –The word technique consists of only one morpheme having two syllables. –Even though the word has two syllables, it is a single morpheme because it cannot be broken down into smaller meaningful parts. Adapted from

4 Types of morphological processes Inflection: –Systematic modification of a root form by means of prefixes and suffixes to indicate grammatical distinctions like singular and plural. –Stems: also called lemma, base form, root, lexeme –Doesn’t change the word class –New grammatical role –Usually produces a predictable, non-idiosyncratic change of meaning. run → runs | running | ran; hope+ing → hoping; hop → hopping

5 Types of morphological processes Derivation: –Ex: compute → computer → computerization –Less systematic than inflection –It can involve a change of meaning: wide → widely –Suffix -en transforms adjectives into verbs: weak → weaken, soft → soften –Suffix -able transforms verbs into adjectives: understand → understandable –Suffix -er transforms verbs into nouns (nominalization): teach → teacher –Difficult cases: building → from which sense of “build”?

6 Types of morphological processes Compounding: –Merging of two or more words into a new word Downmarket, (to) overtake

7 Stemming The removal of the inflectional ending from words (strip off any affixes) Laughing, laugh, laughs, laughed → laugh –Problems Can conflate semantically different words –Gallery and gall may both be stemmed to gall –A further step is to make sure that the resulting form is a known word in a dictionary, a task known as lemmatization.

8 Regular Expressions for Detecting Word Patterns Many linguistic processing tasks involve pattern matching. Regular expressions (RE) give us a powerful and flexible method for describing the character patterns we are interested in. To use regular expressions in Python we need to import the re library using: import re. $ is a meta-character that matches the end of the string. re.search(p, s) is a function to check whether the pattern p can be found somewhere inside the string s.
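A minimal sketch of the two ideas on this slide, using only the standard re module. The word list is made up for illustration; it is not the word list used in the lecture.

```python
import re

# Toy word list (an assumption, chosen only to show the effect of "$").
words = ["walked", "walk", "bed", "editor", "jumped"]

# "ed$" matches only when "ed" is at the end of the string.
ends_in_ed = [w for w in words if re.search(r"ed$", w)]

# Without "$", re.search finds "ed" anywhere inside the string.
contains_ed = [w for w in words if re.search(r"ed", w)]

print(ends_in_ed)   # ['walked', 'bed', 'jumped']
print(contains_ed)  # ['walked', 'bed', 'editor', 'jumped']
```

Note that "editor" is kept only by the second pattern, because its "ed" is not at the end of the word.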

9 Regular Expressions Basic Regular Expression Meta-Characters

10 Regular Expressions for Stemming Note: the star operator is "greedy" and the .* part of the expression tries to consume as much of the input as possible. If we use the "non-greedy" version of the star operator, written *?, we get what we want:
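The slide's code did not survive in the transcript. The following sketch shows the greedy versus non-greedy behaviour it describes; the suffix list and the example word follow the corresponding example in the NLTK book and are an assumption about what the slide showed.

```python
import re

SUFFIXES = r"(ing|ly|ed|ious|ies|ive|es|s|ment)"

# Greedy: ".*" grabs as much as it can, leaving only "s" for the suffix group.
greedy = re.findall(r"^(.*)" + SUFFIXES + r"$", "processes")

# Non-greedy: ".*?" grabs as little as it can, so the longer suffix "es" wins.
non_greedy = re.findall(r"^(.*?)" + SUFFIXES + r"$", "processes")

print(greedy)      # [('processe', 's')]
print(non_greedy)  # [('process', 'es')]
```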

11 Regular Expressions for Stemming Let’s define a function to perform stemming, and apply it to a whole text The RE removed the s from ponds but also from is and basis. It produced some non-words like distribut and deriv, but these are acceptable stems in some applications.
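The function itself is missing from the transcript. A plausible reconstruction, modelled on the NLTK book's regular-expression stemmer (so an assumption, not necessarily the slide's exact code), is:

```python
import re

# Non-greedy stem followed by an optional suffix; the suffix list is the
# one used in the NLTK book example and is assumed here.
STEM_RE = re.compile(r"^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$")

def stem(word):
    """Strip one known suffix from the end of word, if there is one."""
    return STEM_RE.match(word).group(1)

tokens = ["ponds", "is", "basis", "distributed", "derived", "laugh"]
stems = [stem(t) for t in tokens]
print(stems)  # ['pond', 'i', 'basi', 'distribut', 'deriv', 'laugh']
```

The output reproduces the behaviour described on the slide: the s is removed from ponds, but also from is and basis, and distributed and derived become the non-words distribut and deriv.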

12 NLTK Stemmers NLTK includes several off-the-shelf stemmers. The Porter and Lancaster stemmers follow their own rules for stripping affixes. Stemming is not a well-defined process, and we typically pick the stemmer that best suits the application we have in mind.

13 Porter Stemmer Lexicon-free stemmer Rewrite rules ATIONAL → ATE (e.g. relational, relate) FUL → ε (e.g. hopeful, hope) SSES → SS (e.g. caresses, caress) Errors of Commission Organization → organ Policy → police Errors of Omission Urgency (not stemmed to urgent) European (not stemmed to Europe)
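To make the idea of lexicon-free rewrite rules concrete, here is a deliberately simplified sketch that applies only the three rules listed on the slide. It is not the Porter algorithm: the real stemmer has several rule phases and extra conditions on the stem, all of which are omitted here. The names RULES and apply_rules are hypothetical.

```python
# (suffix, replacement) pairs copied from the slide; "" stands for epsilon.
RULES = [("ational", "ate"), ("sses", "ss"), ("ful", "")]

def apply_rules(word):
    """Apply the first matching suffix rewrite rule; no dictionary is consulted."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

print(apply_rules("relational"))  # relate
print(apply_rules("hopeful"))     # hope
print(apply_rules("caresses"))    # caress
print(apply_rules("rational"))    # rate
```

The last call shows why unconditioned rules cause errors of commission: without a check on the remaining stem, rational is rewritten to rate.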

14 NLTK Stemmers nltk.wordnet.morphy –A slightly more sophisticated approach –Use an understanding of inflectional morphology Use an Exception List for irregulars Handle collocations in a special way –Do the transformation, compare the result to the WordNet dictionary –If the transformation produces a real word, then keep it, else use the original word. –For more details, see http://wordnet.princeton.edu/man/morphy.7WN.html
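The steps above can be sketched in a few lines. This is an illustration of the control flow only, not the morphy implementation: EXCEPTIONS, DICTIONARY and DETACHMENT_RULES are tiny hypothetical stand-ins for WordNet's exception lists, its lexicon and its detachment rules, and collocation handling is left out.

```python
# Hypothetical toy resources; WordNet's real lists are far larger.
EXCEPTIONS = {"women": "woman", "geese": "goose"}
DICTIONARY = {"woman", "goose", "dog", "church", "study", "basis"}
DETACHMENT_RULES = [("ies", "y"), ("es", ""), ("s", "")]

def lemmatize(word):
    """Return a dictionary form of word if one can be found, else the word itself."""
    word = word.lower()
    # 1. Irregular forms are looked up in the exception list first.
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    # 2. Try each suffix transformation and keep it only if the result
    #    is a known word.
    for suffix, replacement in DETACHMENT_RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in DICTIONARY:
                return candidate
    # 3. Otherwise fall back to the original word.
    return word

print([lemmatize(w) for w in ["women", "dogs", "churches", "studies", "basis", "lying"]])
# ['woman', 'dog', 'church', 'study', 'basis', 'lying']
```

Unlike the regular-expression stemmer, this leaves basis alone, because basi is not in the dictionary.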

15 Is stemming useful? For IR performance, some improvement (especially for smaller documents) May help a lot for some queries, but on average (across all queries) it doesn’t help much (i.e. for some queries the results are worse) –Word sense disambiguation on query terms: business may be stemmed to busy, saw (the tool) to see –A truncated stem can be unintelligible to users –Most studies of stemming for IR were done for English (it may help more for other languages) –The possibility of letting people interactively influence the stemming has not been studied much Since the improvement is small, IR engines often don’t use stemming More on this when we talk about IR

16 Text Normalization Stemming Convert to lower case Identifying non-standard words including numbers, abbreviations, and dates, and mapping any such tokens to a special vocabulary. –For example, every decimal number could be mapped to a single token 0.0, and every acronym could be mapped to AAA. This keeps the vocabulary small and improves the accuracy of many language modeling tasks. Lemmatization –Make sure that the resulting form is a known word in a dictionary –WordNet lemmatizer only removes affixes if the resulting word is in its dictionary
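A minimal sketch of the lower-casing and non-standard-word mapping described above. The placeholder tokens 0.0 and AAA come from the slide; the function name normalize and the two regular expressions are assumptions made for this example, and real systems treat dates, abbreviations and other number formats as well.

```python
import re

DECIMAL_RE = re.compile(r"\d+\.\d+")   # e.g. 3.14, 22.50
ACRONYM_RE = re.compile(r"[A-Z]{2,}")  # e.g. USA, NLTK

def normalize(tokens):
    """Map decimals to '0.0', acronyms to 'AAA', and lower-case everything else."""
    normalized = []
    for token in tokens:
        if DECIMAL_RE.fullmatch(token):
            normalized.append("0.0")
        elif ACRONYM_RE.fullmatch(token):
            normalized.append("AAA")
        else:
            normalized.append(token.lower())
    return normalized

print(normalize(["The", "USA", "grew", "3.14", "percent"]))
# ['the', 'AAA', 'grew', '0.0', 'percent']
```

The acronym test has to run before lower-casing, since case is the only clue this sketch uses to spot an acronym.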

17 Lemmatization WordNet lemmatizer only removes affixes if the resulting word is in its dictionary The WordNet lemmatizer is a good choice if you want to compile the vocabulary of some texts and want a list of valid lemmas Notice that it doesn't handle lying, but it converts women to woman.

18 Tokenization Divide text into units called tokens (words, numbers, punctuation) (pages 124–136, Manning) What is a word? –Graphic word: a string of contiguous alphanumeric characters surrounded by white space (counterexample: $22.50) –Main clue (in English) is the occurrence of whitespace –Problems Periods: usually remove punctuation but sometimes it’s useful to keep periods (Wash. vs. wash) Single apostrophes, contractions (isn’t, didn’t, dog’s: for meaning extraction it could be useful to have 2 separate forms: is + n’t or not) Hyphenation: –Sometimes best as a single word: co-operate –Sometimes best as 2 separate words: 26-year-old, aluminum-export ban

19 Tokenization Whitespace often does not indicate a word break: sometimes we may want to lump together words that are separated by whitespace but that we want to regard as a single word –San Francisco –The New York-New Haven railroad –Wake up, work out: I couldn’t work the answer out

20 RE for Tokenizing Text The very simplest method for tokenizing text is to split on whitespace.
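A short sketch of whitespace splitting with the standard re module. The sample string is invented for the example; it contains a double space and a newline to show why splitting on a single space character is not enough.

```python
import re

raw = "The  cat sat.\nIt purred."

# Splitting on a single space leaves an empty string and misses the newline.
on_space = re.split(r" ", raw)

# Splitting on any run of whitespace handles spaces, tabs and newlines.
on_whitespace = re.split(r"\s+", raw)

print(on_space)       # ['The', '', 'cat', 'sat.\nIt', 'purred.']
print(on_whitespace)  # ['The', 'cat', 'sat.', 'It', 'purred.']
```

Even the second version leaves punctuation attached to words (sat., purred.), which is the motivation for a pattern-based tokenizer.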

21 NLTK's Regular Expression Tokenizer
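The slide's code is not in the transcript. The sketch below approximates it with the standard re module instead of nltk.regexp_tokenize, so that it runs without NLTK; the pattern is adapted from the NLTK book's example, with capturing groups changed to non-capturing ones so that re.findall returns whole tokens. Treat the pattern and the sample sentence as assumptions about what the slide showed.

```python
import re

TOKEN_PATTERN = re.compile(r"""(?x)      # verbose mode: layout and comments are ignored
      (?:[A-Z]\.)+          # abbreviations, e.g. U.S.A.
    | \w+(?:-\w+)*          # words with optional internal hyphens
    | \$?\d+(?:\.\d+)?%?    # currency and percentages, e.g. $12.40, 82%
    | \.\.\.                # ellipsis
    | [\]\[.,;"'?():_`-]    # these characters are separate tokens
""")

text = "That U.S.A. poster-print costs $12.40..."
tokens = TOKEN_PATTERN.findall(text)
print(tokens)  # ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
```

The alternatives are tried left to right, so their order matters: the abbreviation rule has to come before the general word rule, or U.S.A. would be split at the periods.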

22 Tokenization Tokenization turns out to be a far more difficult task than you might have expected. No single solution works well across-the-board, and we must decide what counts as a token depending on the application domain. When developing a tokenizer it helps to have access to raw text which has been manually tokenized, in order to compare the output of your tokenizer with high-quality (or "gold-standard") tokens. The NLTK corpus collection includes a sample of Penn Treebank data, including the raw Wall Street Journal text ( nltk.corpus.treebank_raw.raw()) and the tokenized version, nltk.corpus.treebank.words()
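One possible way to compare a tokenizer's output with gold-standard tokens is to turn each token list into character spans and compute precision and recall over the spans. This is a sketch of one evaluation scheme, not a method prescribed by the lecture; the helper names and the tiny example are hypothetical.

```python
def token_spans(tokens):
    """Map tokens to (start, end) offsets in the text with whitespace removed."""
    spans, position = set(), 0
    for token in tokens:
        spans.add((position, position + len(token)))
        position += len(token)
    return spans

def precision_recall(predicted, gold):
    """Score predicted tokens against gold tokens covering the same text."""
    if "".join(predicted) != "".join(gold):
        raise ValueError("token lists do not cover the same text")
    predicted_spans, gold_spans = token_spans(predicted), token_spans(gold)
    correct = len(predicted_spans & gold_spans)
    return correct / len(predicted_spans), correct / len(gold_spans)

gold = ["I", "did", "n't", "go", "."]   # Treebank-style gold tokens
predicted = ["I", "didn't", "go."]      # output of a whitespace tokenizer
precision, recall = precision_recall(predicted, gold)
print(precision, recall)
```

Here only the token I matches a gold span, so one of three predicted tokens and one of five gold tokens are correct.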

23 Segmentation Word segmentation –For languages that do not put spaces between words Chinese, Japanese, Korean, Thai, German (for compound nouns) Sentence segmentation –Divide text into sentences –Why?
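For word segmentation, one simple representation (used in Section 3.8 of the NLTK book) marks each character with a bit saying whether a word ends after it. The sketch below follows that idea; the example strings are made up, and finding a good bit string is the actual learning problem, which is not shown here.

```python
def segment(text, segs):
    """Split text into words; segs[i] == '1' means a word ends after text[i].

    segs has one bit per character except the last, since the text always
    ends with a word boundary.
    """
    words, last = [], 0
    for i, bit in enumerate(segs):
        if bit == "1":
            words.append(text[last : i + 1])
            last = i + 1
    words.append(text[last:])
    return words

print(segment("thecat", "00100"))      # ['the', 'cat']
print(segment("doyousee", "0100100"))  # ['do', 'you', 'see']
```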

24 Sentence Segmentation Sentence: –Something ending with a ., ?, or ! (and sometimes also :) –“You reminded me,” she remarked, “of your mother.” Nested sentences; note the .” Sentence boundary detection algorithms –Heuristic (see figure 4.1 page 135 Manning) –Statistical classification trees (Riley 1989) Probability of a word to occur before or after a boundary, case and length of words –Neural network (Palmer and Hearst 1997) Part of speech distribution of preceding and following words –Maximum Entropy (Mikheev 1998) For reference see Manning

25 Sentence Segmentation Sentence: –Something ending with a ., ?, or ! (and sometimes also :) –“You reminded me,” she remarked, “of your mother.” Nested sentences; note the .” Sentence boundary detection algorithms –Heuristic (see figure 4.1 page 135 Manning) –Statistical classification trees (Riley 1989) Probability of a word to occur before or after a boundary, case and length of words –Neural network (Palmer and Hearst 1997) Part of speech distribution of preceding and following words –Maximum Entropy (Mikheev 1998) Note: models and features
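A rough sketch of the heuristic family of approaches: put a candidate boundary after ., ? or !, move it past any closing quotation marks, and cancel it when the period belongs to a known abbreviation. This is a simplified illustration rather than the algorithm in the figure cited above; the ABBREVIATIONS set is a hypothetical toy list.

```python
import re

ABBREVIATIONS = {"mr.", "dr.", "prof.", "inc.", "co.", "etc."}  # toy list
CLOSERS = "\"'”)"

# Candidate boundary: ., ? or !, optional closing quotes or brackets,
# then whitespace or the end of the text.
BOUNDARY = re.compile(r"[.?!][\"\'”)]*(?=\s|$)")

def split_sentences(text):
    sentences, start = [], 0
    for match in BOUNDARY.finditer(text):
        end = match.end()
        last_word = text[start:end].split()[-1].lower().rstrip(CLOSERS)
        if last_word in ABBREVIATIONS:
            continue  # the period marks an abbreviation, not a sentence end
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

text = "Dr. Smith arrived. “You reminded me,” she remarked, “of your mother.” He left!"
for sentence in split_sentences(text):
    print(sentence)
```

The sketch fails whenever an abbreviation really does end a sentence, which is one reason the statistical approaches listed on the slide use richer features.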

26 Sentence Segmentation NLTK tools Some corpora already provide access at the sentence level. –In the following example, we compute the average number of words per sentence in the Brown Corpus:
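The slide's code is missing from the transcript. The computation itself is a simple ratio, sketched here on a toy sentence-segmented corpus so that it runs without NLTK or its data. The helper name average_sentence_length is hypothetical; the commented-out lines show how it would presumably be applied to the Brown Corpus, assuming NLTK and the corpus data are installed.

```python
def average_sentence_length(sentences):
    """Average number of tokens per sentence, given a list of token lists."""
    total_tokens = sum(len(sentence) for sentence in sentences)
    return total_tokens / len(sentences)

# Toy stand-in for a corpus that is already split into sentences.
toy_corpus = [
    ["The", "cat", "sat", "."],
    ["It", "purred", "."],
    ["Then", "it", "slept", "all", "day", "."],
]
print(average_sentence_length(toy_corpus))

# With NLTK and the Brown Corpus installed (not run here):
# import nltk
# print(average_sentence_length(nltk.corpus.brown.sents()))
```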

27 Sentence Segmentation NLTK tools In other cases, the text is only available as a stream of characters. Before tokenizing the text into words, we need to segment it into sentences. NLTK facilitates this by including the Punkt sentence segmenter (Kiss & Strunk, 2006)*, an unsupervised, language-independent approach to sentence boundary detection. –It is based on the assumption that a large number of ambiguities in the determination of sentence boundaries can be eliminated once abbreviations have been identified. Instead of relying on orthographic clues, the proposed system is able to detect abbreviations with high accuracy using three criteria that only require information about the candidate type itself and are independent of context: abbreviations can be defined as a very tight collocation consisting of a truncated word and a final period; abbreviations are usually short; abbreviations sometimes contain internal periods. –Example –CELLULAR COMMUNICATIONS INC. sold 1,550,000 common shares at $21.75 each yesterday, according to lead underwriter L.F. Rothschild & Co. * Tibor Kiss and Jan Strunk. Unsupervised multilingual sentence boundary detection. In Computational Linguistics, volume 32, pages 485–525, 2006.

28 Segmentation NLTK tools: Punkt sentence segmenter

29 Segmentation as classification Sentence segmentation can be viewed as a classification task for punctuation: –Whenever we encounter a symbol that could possibly end a sentence, such as a period or a question mark, we have to decide whether it terminates the preceding sentence. –We’ll return to this when we cover classification See Section 6.2 of the NLTK book For word segmentation see Section 3.8 of the NLTK book –Also page 180 of Speech and Language Processing, Jurafsky and Martin
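To make the classification view concrete, the sketch below extracts a feature dictionary for one punctuation token; a classifier trained on labelled examples would then decide whether that token ends a sentence. The feature set is adapted from Section 6.2 of the NLTK book, with an added guard for punctuation at the very end of the token list; the training step is not shown.

```python
def punct_features(tokens, i):
    """Features describing the punctuation token at position i."""
    has_next = i + 1 < len(tokens)
    return {
        "next-word-capitalized": has_next and tokens[i + 1][0].isupper(),
        "prev-word": tokens[i - 1].lower(),
        "punct": tokens[i],
        "prev-word-is-one-char": len(tokens[i - 1]) == 1,
    }

tokens = ["He", "met", "J", ".", "Smith", ".", "She", "left", "."]
for i, token in enumerate(tokens):
    if token in ".?!":
        print(i, punct_features(tokens, i))
```

For the period after J, the previous word is a single character, which is the kind of clue a classifier can use to recognise an initial rather than a sentence end.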

30 Next class Text corpora & corpus-based work Assignment 1 is due Readings: –Chapter 3 of Foundations of Statistical NLP –Chapter 2 of NLP-NLTK book