March 2006, Introduction to Computational Linguistics
CLINT Tokenisation

Information Food Chain
Inference
↑ Knowledge Representation
↑ Meaning Extraction
↑ Semantic Relationships
↑ Chunking (noun phrases; verb phrases)
↑ Part of Speech Annotation
↑ Paragraph and sentence identification
↑ Tokenisation
↑ Raw Text

Start with a Corpus
A corpus is an organised body of language materials that is used as a basis for empirical studies.
Corpora are classified according to
– Representativeness
– Medium
– Language
– Information Content
– Structure

Examples of Corpora
Project Gutenberg: public domain text resources. http://www.promo.net/pg
Brown Corpus: a tagged corpus of about 1M words, put together at Brown University, 1960-70.
Penn Treebank: a corpus of parsed sentences based on text from the Wall Street Journal (WSJ).
Canadian Hansards: a bilingual (En/Fr) corpus of the proceedings of the Canadian parliament.

Low Level Issues
Preprocessing: getting rid of junk such as whitespace, images, certain formatting information, etc.
Normalisation: deciding on standard character representations; adopting upper or lower case (or both).
Tokenisation

Tokenisation
Tokenisation is a process which divides input text into individual units called tokens.
Tokens are normally taken to be indivisible by the next level of analysis, but they can be associated with various kinds of information.
An example of such information is the type of the token: word, punctuation, number.
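A minimal sketch of a tokeniser that attaches a type to each token. The three classes follow the slide; the regular expressions and the name `tokenise` are illustrative assumptions, not part of the original course material.

```python
import re

# Hypothetical token classes; the first alternative that matches wins.
TOKEN_PATTERN = re.compile(
    r"(?P<number>\d+(?:\.\d+)?)"    # integers and simple decimals
    r"|(?P<word>[^\W\d_]+)"         # runs of letters
    r"|(?P<punctuation>[^\w\s])"    # any single non-space symbol
)

def tokenise(text):
    """Return a list of (token, type) pairs."""
    return [(m.group(), m.lastgroup) for m in TOKEN_PATTERN.finditer(text)]

print(tokenise("It cost 22.50, twice!"))
```

Whitespace is discarded here; a real tokeniser might need to record it, for example to reconstruct the original text.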

What counts as a word?
Words are quite tricky to define.
The standard definition: a string of contiguous alphanumeric characters with space on either side; may include hyphens and apostrophes but no other punctuation marks (Kucera and Francis 1967).
It is easy to find exceptions.
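The standard definition can be written down almost literally as a regular expression. The sketch below is one possible reading: it assumes ASCII letters and digits only, and allows hyphens and apostrophes only inside a word. The names `GRAPHIC_WORD` and `graphic_words` are made up for illustration.

```python
import re

# Alphanumerics, optionally joined by internal hyphens or apostrophes,
# with whitespace (or the start/end of the text) on both sides.
GRAPHIC_WORD = re.compile(r"(?<!\S)[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*(?!\S)")

def graphic_words(text):
    """Return the substrings that satisfy the strict definition."""
    return GRAPHIC_WORD.findall(text)

print(graphic_words("He won't pay $22.50 for a hi-fi, OK?"))
```

Applied to this sentence, the strict definition rejects "$22.50", "hi-fi," and "OK?" because each touches a punctuation mark other than a hyphen or apostrophe, which shows how easily exceptions arise.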

Problems Identifying Words
"VfB Stuttgart scored twice in quick success-ion early in the second half on their way to a deserved 2-1 victory over Manchester United in the Champions League on Wednesday." (example from Mary Dalrymple, University of London)
– VfB Stuttgart, Manchester United
– succession (hyphenated at a line break as success-ion)
– 2-1
– Wednesday

Problems Identifying Words: Problems Involving Spaces
Lack of spaces between words
– Lebensversicherungsgesellschaftsangestellter (life insurance company employee)
– Ix-Xemx
The presence of spaces may not indicate a word break
– Coca Cola; +356 21 456 457
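One common way to handle spaces that do not mark a word break is to merge known multiword expressions after an initial whitespace split. The sketch below assumes a hand-made list of such expressions; the function name `merge_multiword` and the list itself are hypothetical.

```python
def merge_multiword(tokens, multiwords):
    """Greedily merge token sequences listed in `multiwords` (a set of tuples)."""
    longest = max((len(m) for m in multiwords), default=1)
    merged, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first, down to length 2.
        for n in range(min(longest, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in multiwords:
                merged.append(" ".join(candidate))
                i += n
                break
        else:
            # No multiword expression starts here: keep the single token.
            merged.append(tokens[i])
            i += 1
    return merged

print(merge_multiword("I drank Coca Cola today".split(), {("Coca", "Cola")}))
```

The opposite problem, missing spaces inside long compounds, needs a lexicon-based splitter and is not covered by this sketch.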

Problems Involving Special Characters
Words often include non-alphanumeric characters which are actually part of the word.
– $22.50; www.di-ve.com.mt; BSc. IT; :-)
Words are often terminated by punctuation which is not part of the word.
Sometimes, terminating punctuation is part of the word.
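Special characters are typically handled by trying specific patterns before a generic fallback. The following sketch covers three of the examples above; the patterns are deliberately narrow assumptions (for instance, only addresses starting with "www." are recognised), and `tokenise_special` is a hypothetical name.

```python
import re

# Hypothetical special-case patterns, tried in order before the fallbacks.
SPECIAL = re.compile(
    r"""
    [$€£]\d+(?:\.\d+)?               # currency amounts such as $22.50
    | (?:www\.)[\w-]+(?:\.[\w-]+)+   # simple web addresses
    | [:;]-?[()]                     # a few emoticons such as :-)
    | \w+                            # fallback: run of word characters
    | [^\w\s]                        # fallback: single punctuation mark
    """,
    re.VERBOSE,
)

def tokenise_special(text):
    """Split text into tokens, keeping special-case items in one piece."""
    return SPECIAL.findall(text)

print(tokenise_special("See www.di-ve.com.mt for $22.50 :-)"))
```

Because alternatives are tried left to right, the order of the rules is itself a design decision: moving the fallbacks first would break "$22.50" into three tokens.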

Periods
In general, punctuation marks attach to words, and can be removed. However, there are special cases:
– Most periods mark the end of a sentence
– Others mark abbreviations, e.g. "e.g.", "Wash."
Note that when an abbreviation occurs at the end of a sentence there is only one period.
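A simple way to treat periods is to split a trailing period off every token unless the token is a known abbreviation. This is a sketch under the assumption of a small hand-made abbreviation list; `ABBREVIATIONS` and both function names are illustrative.

```python
# Hypothetical abbreviation list; a real one would be far larger.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "Wash.", "Dr.", "Mr."}

def split_final_period(token):
    """Split a trailing period off a token unless it is a known abbreviation."""
    if token.endswith(".") and len(token) > 1 and token not in ABBREVIATIONS:
        return [token[:-1], "."]
    return [token]

def tokenise_periods(text):
    """Whitespace-split the text, then detach sentence-final periods."""
    tokens = []
    for raw in text.split():
        tokens.extend(split_final_period(raw))
    return tokens

print(tokenise_periods("He lives in Wash. near the coast."))
```

When an abbreviation ends the sentence, as in "He lives in Wash.", this rule leaves "Wash." intact and emits no separate sentence-final period, which is exactly the single-period ambiguity noted above.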

Apostrophe
English contractions such as won't or I'll count as one word according to the classic definition.
However, there are reasons for wanting two separate tokens, such as interaction with grammar rules (S → NP VP).
The Penn Treebank splits such contractions into two words.

Apostrophe
This sometimes leaves odd words. For example, isn't yields is + n't.
's is ambiguous
– Abbreviation for is (he's strange)
– Possessive (John's car)
Word-final apostrophe is ambiguous
– end of quotation
– possessive of word ending in s
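Contraction splitting in the style described above can be sketched with one regular expression. The clitic list here is a small assumption and is not the Penn Treebank's actual rule set; `split_contraction` is a hypothetical name.

```python
import re

# n't is split off as a unit; other clitics are split at the apostrophe.
CLITIC = re.compile(r"^(.+?)(n't|'(?:s|ll|re|ve|d|m))$", re.IGNORECASE)

def split_contraction(token):
    """Split a token into [host, clitic], or return it unchanged."""
    match = CLITIC.match(token)
    if match:
        return [match.group(1), match.group(2)]
    return [token]

for word in ["isn't", "won't", "I'll", "John's", "cat"]:
    print(word, "->", split_contraction(word))
```

The sketch produces the odd word "wo" for won't, and it splits 's without deciding whether it stands for is or marks a possessive; that decision needs context from a later stage.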

Exercise
How is the apostrophe used in Maltese?
How should a Maltese tokeniser deal with it?

Hyphen
Issue: do sequences of words joined by hyphens count as one word or more?
Typesetting hyphens (at end of line) and hyphens in measure phrases (35-year-old) are usually removed.
Typesetting hyphens can be ambiguous.
Lexical hyphens are usually kept (e.g. hi-fi).
Hyphens – standing alone – are used as punctuation.
Texts are often inconsistent in their usage of hyphens.
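One heuristic for typesetting hyphens is to rejoin a word split across a line break only when the joined form is a known word. The sketch below assumes a lexicon of lowercase word forms is available; the function name and the lexicon are hypothetical.

```python
import re

def join_line_break_hyphens(text, lexicon):
    """Rejoin words split by a hyphen at the end of a line.

    The hyphen is dropped only when the joined form is in `lexicon`;
    otherwise it is kept, on the guess that it is a lexical hyphen.
    """
    def repl(match):
        left, right = match.group(1), match.group(2)
        joined = left + right
        if joined.lower() in lexicon:
            return joined
        return left + "-" + right

    return re.sub(r"(\w+)-\s*\n\s*(\w+)", repl, text)

lexicon = {"succession"}
print(join_line_break_hyphens("in quick success-\nion early", lexicon))
print(join_line_break_hyphens("a new hi-\nfi system", lexicon))
```

The ambiguity remains when both the joined and the hyphenated forms are plausible words, so a lexicon lookup is only a first approximation.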

Case
Types vs. Tokens
– How many tokens in the following sentence: The cat chased the rat on the table
– How many types?
Tokenisation should correctly identify word types, i.e.
– Tokens of the same type should be identified
– Tokens of different type should be distinguished
Case representation of ordinary words must be standardised.
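The effect of case on the type count can be checked directly. This is a minimal sketch using whitespace splitting; `count_tokens_and_types` is a hypothetical helper.

```python
from collections import Counter

def count_tokens_and_types(sentence, fold_case=True):
    """Return (number of tokens, number of distinct types)."""
    tokens = sentence.split()
    if fold_case:
        tokens = [t.lower() for t in tokens]
    counts = Counter(tokens)
    return len(tokens), len(counts)

sentence = "The cat chased the rat on the table"
print(count_tokens_and_types(sentence))                   # (8, 6)
print(count_tokens_and_types(sentence, fold_case=False))  # (8, 7)
```

Without case folding, "The" and "the" are counted as different types, which is why case representation has to be standardised before types are identified.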

Case
Heuristics
– Map first character of a sentence to standard case
– Map all words in titles to lowercase
Problems
– Identification of sentence boundaries
– Identification of proper names
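The first heuristic can be sketched as follows, assuming sentence boundaries have already been found and a list of proper names is available. Both assumptions correspond to the problems listed above; the function name and the name list are hypothetical.

```python
def normalise_sentence_case(tokens, proper_names):
    """Lowercase the sentence-initial token unless it is a known proper name."""
    if not tokens:
        return []
    first = tokens[0]
    if first not in proper_names:
        first = first.lower()
    return [first] + list(tokens[1:])

names = {"Manchester"}
print(normalise_sentence_case(["The", "cat", "saw", "Manchester"], names))
print(normalise_sentence_case(["Manchester", "won"], names))
```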

Normalisation
Standard character representations
Converting all letters to lower or upper case
Removing punctuation
Removing accent marks and other diacritics from letters
Expanding abbreviations
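Several of these steps can be sketched with the Python standard library. The version below folds case, strips combining diacritics via Unicode decomposition, and deletes ASCII punctuation; it does not expand abbreviations, and the function name `normalise` is an illustrative choice.

```python
import string
import unicodedata

def normalise(text):
    """Lowercase, strip diacritics and remove ASCII punctuation."""
    # Decompose accented letters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Fold case and delete ASCII punctuation characters.
    lowered = no_marks.lower()
    return lowered.translate(str.maketrans("", "", string.punctuation))

print(normalise("Schütze, Café!"))
```

Whether each step is desirable depends on the application: removing diacritics or case can merge words that a later stage needs to keep apart.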

Further Normalisation
Stemming: are eats and eating different words?
They are two different wordforms that have the same stem, eat, but different suffixes, -s and -ing.
Stemming versus full morphological analysis.
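A deliberately naive suffix stripper illustrates the idea of stemming and its limits. The two-entry suffix list is an assumption chosen to match the example; this is not an established stemming algorithm.

```python
SUFFIXES = ("ing", "s")  # hypothetical, deliberately tiny suffix list

def naive_stem(word):
    """Strip the first matching suffix, keeping at least two characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word

for word in ["eats", "eating", "eat", "running"]:
    print(word, "->", naive_stem(word))
```

The forms eats and eating both reduce to eat, but running becomes "runn": handling such cases properly is where stemming ends and full morphological analysis begins.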

Summary
The tokenisation problem interacts with design decisions at different levels concerning
– Handling of non-alphanumeric characters
– Case
– Punctuation
Typically many of these problems are dealt with by hand-crafting special rules which match a particular case.
Such rules are often built out of regular expressions.

Sources
Manning and Schütze, Foundations of Statistical Natural Language Processing, MIT Press, 1999