Stemming, tagging and chunking Text analysis short of parsing.

1 Stemming, tagging and chunking Text analysis short of parsing

2 Word-based analysis Whereas parsing gives a full syntactic analysis, sometimes it is sufficient to have less detailed information. In many applications we are more interested in words. But what do we mean by “word”?

3 Words Naïve definition of a word: a sequence of characters separated from other such sequences by spaces. But punctuation marks are usually attached to words, though not all punctuation marks are word delimiters, e.g. the possessive apostrophe and the hyphen.
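
To make the contrast concrete, here is a minimal sketch comparing the naïve whitespace definition with a regular expression that splits off punctuation but keeps word-internal apostrophes and hyphens. The pattern `TOKEN_RE` and both function names are illustrative assumptions, not something the slides prescribe.

```python
import re

# Assumed pattern: a word is a run of letters/digits, optionally continued by
# an internal apostrophe or hyphen plus more letters/digits; any other
# non-space character becomes a one-character punctuation token.
TOKEN_RE = re.compile(r"\w+(?:['’-]\w+)*|[^\w\s]")


def naive_tokens(text):
    """Naive definition: whatever is separated by whitespace."""
    return text.split()


def regex_tokens(text):
    """Split punctuation off, but keep possessives and hyphenated words whole."""
    return TOKEN_RE.findall(text)


sentence = "John's well-known book is on the table."
print(naive_tokens(sentence))  # last token is 'table.' with the full stop attached
print(regex_tokens(sentence))  # last two tokens are 'table' and '.'
```

Whether "well-known" should count as one word or two is exactly the kind of decision the pattern encodes; changing it is a one-line edit.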

4 Words We may want to treat hyphenated and compound words as one word, or as two. By the same token, we may want to treat word sequences as if they were a single word. In addition, a given “word” can have different word forms, depending on inflections or even conventions of orthography.

5 Tokenization The simplest form of analysis is to reduce different word forms into tokens. Also called “normalization”. This is useful, for example, if you want to count how many times a given word occurs in a text, or to search for texts containing certain words (e.g. Google).
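
The word-counting use case can be sketched as follows. The normalization here is only lower-casing, which is an assumption made for illustration; a real system might also fold spelling variants or inflected forms together.

```python
import re
from collections import Counter


def normalize(token):
    # Assumed normalization: case folding only.
    return token.lower()


def count_words(text):
    """Count occurrences of each normalized token in the text."""
    tokens = re.findall(r"\w+", text)
    return Counter(normalize(t) for t in tokens)


counts = count_words("The cat saw the other cat. THE END")
print(counts["the"])  # 3
print(counts["cat"])  # 2
```

Without the `normalize` step, "The", "the" and "THE" would be counted as three different words.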

6 Stemming Stemming is the particular case of tokenization which reduces inflected forms to a single base form or stem. (Recall our discussion of stem ~ base form ~ dictionary form ~ citation form.) Stemming algorithms are basic string-handling algorithms, which depend on rules that identify affixes that can be stripped.

7 Stemming As we know, morphology can be less than straightforward, so a stemmer has to “know” about rules such as consonant doubling, y→i, etc. It also has to know about irregularities, and it has to avoid overgeneration. For this it probably needs a dictionary.
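
A toy sketch of these ingredients follows: suffix stripping, undoing consonant doubling, reversing y→i, a table of irregular forms, and a dictionary check to block overgeneration. The word lists `IRREGULAR` and `LEXICON` and all function names are invented for illustration; this is not any published stemmer.

```python
# Hypothetical mini-dictionaries, just large enough for the examples.
IRREGULAR = {"went": "go", "saw": "see", "mice": "mouse"}
LEXICON = {"run", "stop", "try", "carry", "walk", "sing", "bed", "book"}


def undouble(stem):
    """Undo consonant doubling: 'stopp' -> 'stop', 'runn' -> 'run'."""
    if len(stem) >= 3 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
        return stem[:-1]
    return stem


def candidates(word):
    """Yield possible stems, most specific rules first."""
    if word.endswith(("ies", "ied")):
        yield word[:-3] + "y"          # reverse y -> i: 'tries' -> 'try'
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            yield stem
            yield undouble(stem)


def stem(word):
    word = word.lower()
    if word in IRREGULAR:              # irregular forms are simply listed
        return IRREGULAR[word]
    if word in LEXICON:                # already a base form: blocks 'bed' -> 'b'
        return word
    for cand in candidates(word):
        if cand in LEXICON:            # accept a stem only if the dictionary knows it
            return cand
    return word                        # unknown word: leave it unchanged


for w in ["running", "stopped", "tries", "carried", "went", "sing", "bed"]:
    print(w, "->", stem(w))
```

The dictionary check is what stops "sing" and "bed" from being stripped; a rule-only stemmer would need extra conditions to get the same effect.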

8 Stemming The best-known stemming algorithm for English is Martin Porter’s stemmer, published in 1980. Its original use was in information retrieval. In computational terms, it is really just a sophisticated string-handling algorithm. In linguistic terms, it is interesting in that it captures generalisations about English morphology.
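
To give a flavour of the string handling involved, here is a sketch of just the first step of Porter's algorithm (step 1a, which deals with plural endings). The full algorithm has several further steps with conditions on the shape of the remaining stem, none of which are shown here; the function name is an assumption.

```python
def porter_step_1a(word):
    """Step 1a of the Porter stemmer: SSES -> SS, IES -> I, SS -> SS, S -> ''."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```

Note that the output ("poni") need not be a real word: for information retrieval it only matters that related forms end up with the same stem.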

9 Word categories A.k.a. parts of speech (POSs). It is important and useful to identify words by their POS: –To distinguish homonyms –To enable more general word searches. POS are familiar (?) from school and/or language learning (noun, verb, adjective, etc.).

10 Word categories Recall that we distinguished: –Open-class categories (noun, verb, adjective, adverb) –Closed-class categories (preposition, determiner, pronoun, conjunction, …). While the big four are fairly clear-cut, it is less obvious exactly what and how many closed-class categories there may be.

11 POS tagging Labelling words for POS can be done by dictionary lookup and/or some sort of process. Identifying POS can be seen as a prerequisite to parsing, and/or a result of morphological analysis in its own right. However, there are some differences: –Parsers often work with the simplest set of word categories, subcategorized by feature (or attribute-value) schemes –Indeed, the parsing procedure may contribute to the disambiguation of homonyms.
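
Dictionary lookup on its own can be sketched like this. The tiny `LEXICON` and the category names are made up for the example; the point is only that lookup returns every category a word form can have, so homonyms come back ambiguous and need some further process to resolve.

```python
# Hypothetical lexicon: each word form maps to all categories it can belong to.
LEXICON = {
    "the": {"determiner"},
    "on": {"preposition"},
    "john": {"noun"},
    "book": {"noun", "verb"},
    "saw": {"noun", "verb"},
    "table": {"noun", "verb"},
}


def lookup(tokens):
    """Return (token, sorted list of possible categories) for each token."""
    return [(tok, sorted(LEXICON.get(tok.lower(), {"unknown"}))) for tok in tokens]


def ambiguous(tagged):
    """Tokens for which lookup alone cannot decide the category."""
    return [tok for tok, tags in tagged if len(tags) > 1]


tagged = lookup(["John", "saw", "the", "book"])
print(tagged)
print(ambiguous(tagged))  # ['saw', 'book']
```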

12 POS tagging POS tagging, per se, aims to identify word-category information somewhat independently of sentence structure … and typically uses rather different means. POS tags are generally shown as labels on words: John/NPN saw/VB the/AT book/NCN on/PRP the/AT table/NN ./PNC We’ll return to tagging in detail, but first let’s mention …
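
The word/TAG notation is easy to read into a program. The sketch below assumes tokens are separated by spaces and that the tag follows the last slash in each token; the function names are illustrative.

```python
def parse_tagged(text):
    """Turn 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for item in text.split():
        # Split on the LAST slash so a word that itself contains '/' survives.
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs


def format_tagged(pairs):
    """Inverse of parse_tagged."""
    return " ".join(word + "/" + tag for word, tag in pairs)


line = "John/NPN saw/VB the/AT book/NCN on/PRP the/AT table/NN ./PNC"
pairs = parse_tagged(line)
print(pairs[:3])   # [('John', 'NPN'), ('saw', 'VB'), ('the', 'AT')]
print(pairs[-1])   # ('.', 'PNC')
```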

13 Chunking Like parsing, except that it aims only to identify major constituents, and does not attempt to identify structure, either internal (within the chunk) or external (between chunks). Chunking will leave some parts of the text unanalysed. Example: [NP [NP G.K. Chesterton], [NP [NP author] of [NP [NP The Man] who was [NP Thursday]]]]

14 Chunking Chunks can be represented like tags or like parse trees
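
Both representations can be produced from the same data. In the sketch below, a chunked sentence is a list whose items are either plain words or `(label, words)` pairs. The tag-style output uses the common B-/I-/O convention, which is an assumption here since the slide does not name a scheme, and the chunks are taken to be the innermost NP brackets of the Chesterton example, which is also an assumption about how that example is meant to be read.

```python
# Chunked sentence: plain strings are unchunked words, tuples are chunks.
chunked = [
    ("NP", ["G.K.", "Chesterton"]), ",",
    ("NP", ["author"]), "of",
    ("NP", ["The", "Man"]), "who", "was",
    ("NP", ["Thursday"]),
]


def to_iob(chunked):
    """Tag-style view: B-X starts a chunk, I-X continues it, O is outside."""
    rows = []
    for item in chunked:
        if isinstance(item, str):
            rows.append((item, "O"))
        else:
            label, words = item
            for i, word in enumerate(words):
                rows.append((word, ("B-" if i == 0 else "I-") + label))
    return rows


def to_brackets(chunked):
    """Tree-style view: one flat bracket per chunk."""
    parts = []
    for item in chunked:
        if isinstance(item, str):
            parts.append(item)
        else:
            label, words = item
            parts.append("[" + label + " " + " ".join(words) + "]")
    return " ".join(parts)


print(to_iob(chunked)[:3])  # [('G.K.', 'B-NP'), ('Chesterton', 'I-NP'), (',', 'O')]
print(to_brackets(chunked))
```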

15 Chunk parser A “chunk” is a continuous non-overlapping sequence of words. A chunker finds such sequences, often using tagged text as input. Chunk rules can be as simple as regular expressions. Chunkers can allow embedding, but typically only to a shallow level. Another example: (S: (NP: I) saw (NP: the big dog) .)
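
A regular-expression chunker over tagged text can be sketched in a few lines. The tag names (DET, ADJ, NOUN, PRON, VERB, PUNCT) and the single rule `NP_RULE` are assumptions chosen for readability, not the tag set used on the earlier slide; the trick is to write the tag sequence as one string and let the regular expression find chunk spans in it.

```python
import re

# Assumed NP rule: optional determiner, any number of adjectives, then a noun;
# or a pronoun on its own.
NP_RULE = re.compile(r"(?:<DET>)?(?:<ADJ>)*<NOUN>|<PRON>")


def chunk(tagged, rule=NP_RULE, label="NP"):
    """Group (word, tag) pairs into flat chunks wherever the rule matches."""
    # Encode tags as '<TAG><TAG>...' and remember which token starts at each offset.
    token_at = {}
    pieces = []
    pos = 0
    for i, (_, tag) in enumerate(tagged):
        token_at[pos] = i
        piece = "<" + tag + ">"
        pieces.append(piece)
        pos += len(piece)
    token_at[pos] = len(tagged)
    tag_string = "".join(pieces)

    result = []
    done = 0
    for match in rule.finditer(tag_string):
        begin, end = token_at[match.start()], token_at[match.end()]
        result.extend(word for word, _ in tagged[done:begin])          # unchunked words
        result.append((label, [word for word, _ in tagged[begin:end]]))
        done = end
    result.extend(word for word, _ in tagged[done:])
    return result


tagged = [("I", "PRON"), ("saw", "VERB"), ("the", "DET"),
          ("big", "ADJ"), ("dog", "NOUN"), (".", "PUNCT")]
print(chunk(tagged))
# [('NP', ['I']), 'saw', ('NP', ['the', 'big', 'dog']), '.']
```

As in the slide's example, "saw" and the full stop are left unanalysed, and no structure is assigned inside or between the two NP chunks.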
