Tokenization & POS-Tagging

1 Tokenization & POS-Tagging
presented by: Yajing Zhang, Saarland University

2 Outline
Tokenization: importance, problems & solutions
POS tagging: HMM tagger, TnT statistical tagger

3 Why Tokenization?
Tokenization is the isolation of word-like units from a text. These units are the building blocks of other text processing, so the accuracy of tokenization affects the results of higher-level processing such as parsing.

4 Problems of tokenization
Definition of a token: United States, AT&T, 3-year-old
Ambiguity of punctuation as a sentence boundary: Prof. Dr. J.M.
Ambiguity in numbers: 123,456.78

5 Some Solutions
Using regular expressions to match numbers and abbreviations (see the sketch after this list):
  numbers: ([0-9]+[,])*[0-9]+([.][0-9]+)?
  abbreviations: [A-Z][bcdfghj-np-tvxz]+\.
Using a corpus as a filter to identify abbreviations
Using a lexical list (the most important abbreviations are listed)
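A small sketch applying the two patterns; the test tokens and the use of fullmatch are illustrative and not taken from the slides:

import re

# Number pattern (e.g. 123,456.78) and consonant-only abbreviation pattern (e.g. Dr.)
NUMBER = re.compile(r"([0-9]+[,])*[0-9]+([.][0-9]+)?")
ABBREV = re.compile(r"[A-Z][bcdfghj-np-tvxz]+\.")

for token in ["123,456.78", "3", "Dr.", "Mrs.", "word"]:
    if NUMBER.fullmatch(token):
        kind = "number"
    elif ABBREV.fullmatch(token):
        kind = "abbreviation"
    else:
        kind = "other"
    print(token, "->", kind)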

6 POS Tagging
Labeling each word in a sentence with its appropriate part of speech.
Information sources in tagging: the tags of the other words in the context, and the word itself.
Different approaches: rule-based taggers and stochastic POS taggers (the simplest stochastic tagger, the HMM tagger).

7 Simplest Stochastic Tagger
Each word is assigned its most frequent tag, i.e. the tag it receives most often in the training set (see the sketch below).
Problem: this may give each word a valid tag but still produce an unacceptable tag sequence:
  Time/NN flies/VBZ like/VB an/DT arrow/NN
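As a quick illustration of this baseline, a minimal Python sketch; the lookup table below is made up for the example, whereas a real tagger would count tags in a tagged training corpus:

# Toy table of each word's most frequent tag (counts are invented for illustration).
MOST_FREQUENT_TAG = {"Time": "NN", "flies": "VBZ", "like": "VB", "an": "DT", "arrow": "NN"}

def most_frequent_tagger(sentence, default="NN"):
    """Tag every word with its most frequent tag, ignoring all context."""
    return [(word, MOST_FREQUENT_TAG.get(word, default)) for word in sentence.split()]

print(most_frequent_tagger("Time flies like an arrow"))
# [('Time', 'NN'), ('flies', 'VBZ'), ('like', 'VB'), ('an', 'DT'), ('arrow', 'NN')]
# Each tag is plausible for its word in isolation, but NN VBZ VB DT NN is not an acceptable sequence.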

8 Markov Models (MM)
In a Markov chain, the next element of the sequence depends only on the current element, not on the earlier elements. Formally, X = (X1, …, XT) is a sequence of random variables over the state space S = {s1, …, sN}, and the Markov property requires
  P(X_{t+1} = s_k | X_1, …, X_t) = P(X_{t+1} = s_k | X_t).

9 Example of Markov Models (MM)
(Figure omitted; cf. Manning & Schütze, 1999, page 319.)

10 Hidden Markov Model
In a (visible) MM, we know which state sequence the model passes through, so the state sequence itself is regarded as the output.
In an HMM, we do not observe the state sequence, only some probabilistic function of it.
Markov models can be used wherever one wants to model the probability of a linear sequence of events; an HMM can even be trained from unannotated text.

11 HMM Tagger
Assumption: a word's tag depends only on the previous tag, and this dependency does not change over time.
The HMM tagger uses states to represent POS tags and outputs (symbol emissions) to represent the words.
The tagging task is to find the most probable tag sequence for a given sequence of words.

12 Finding the most probable sequence
(Derivation figure omitted; cf. Erhard Hinrichs & Sandra Kübler.)
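The original slide's derivation is not reproduced here. As a sketch of how the most probable tag sequence is found in practice, the following Python implements the standard first-order Viterbi algorithm; the tag set, probability tables, and function name are invented for illustration:

def viterbi(words, tags, init, trans, emit):
    """Return the most probable tag sequence for `words` under a first-order HMM.

    init[t]       : P(t | start of sentence)
    trans[(p, t)] : P(t | p), transition probability
    emit[(t, w)]  : P(w | t), emission probability
    Missing dictionary entries count as probability 0.0.
    """
    # delta[i][t]: probability of the best path over words[:i+1] ending in tag t
    delta = [{t: init.get(t, 0.0) * emit.get((t, words[0]), 0.0) for t in tags}]
    back = [{}]  # back[i][t]: best predecessor tag of t at position i
    for i in range(1, len(words)):
        delta.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda p: delta[i - 1][p] * trans.get((p, t), 0.0))
            delta[i][t] = (delta[i - 1][prev] * trans.get((prev, t), 0.0)
                           * emit.get((t, words[i]), 0.0))
            back[i][t] = prev
    last = max(tags, key=lambda t: delta[-1][t])   # best final tag
    path = [last]
    for i in range(len(words) - 1, 0, -1):         # follow the back-pointers
        path.append(back[i][path[-1]])
    return list(reversed(path))

tags = ["DT", "NN", "VBZ"]
init = {"DT": 0.6, "NN": 0.3, "VBZ": 0.1}
trans = {("DT", "NN"): 0.9, ("NN", "VBZ"): 0.7, ("NN", "NN"): 0.2}
emit = {("DT", "the"): 0.5, ("NN", "dog"): 0.4, ("VBZ", "barks"): 0.3}
print(viterbi(["the", "dog", "barks"], tags, init, trans, emit))  # ['DT', 'NN', 'VBZ']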

13 HMM tagging – an example
(Worked example figure omitted; cf. Erhard Hinrichs & Sandra Kübler.)

14 HMM tagging – an example
(Worked example figure omitted; cf. Erhard Hinrichs & Sandra Kübler.)

15 Calculating the most likely sequence
(Figure omitted; in it, green marked the transition probabilities and blue the emission probabilities.)

16 Dealing with unknown words
The simplest model assumes that an unknown word can have any POS tag, or simply assigns it the most frequent tag in the tagset.
In practice, morphological information such as the word's suffix is used as a hint (see the sketch below).
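A minimal, purely illustrative sketch of suffix-based guessing; the suffix list and tags below are hand-picked and not the mechanism of any particular tagger:

# Guess the tag of an unknown word from a few English suffixes,
# falling back to the most frequent open-class tag.
SUFFIX_HINTS = [("ing", "VBG"), ("ed", "VBD"), ("ly", "RB"), ("tion", "NN"), ("s", "NNS")]

def guess_tag(word, default="NN"):
    """Return a tag guess for an unknown word based on its suffix."""
    for suffix, tag in SUFFIX_HINTS:
        if word.lower().endswith(suffix):
            return tag
    return default

print(guess_tag("tokenizing"), guess_tag("quickly"), guess_tag("blorf"))
# VBG RB NN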

17 TnT (Trigrams’n’Tags)
A statistical tagger based on Markov models: states represent tags and outputs represent words.
Tagging amounts to finding the tag sequence that maximizes the probability given below.
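The formula image from the slide is not reproduced here; following Brants (2000), the trigram tagger selects the tag sequence

\[
\operatorname*{argmax}_{t_1 \ldots t_T}
\left[ \prod_{i=1}^{T} P(t_i \mid t_{i-1}, t_{i-2})\, P(w_i \mid t_i) \right] P(t_{T+1} \mid t_T)
\]

where t_{-1}, t_0, and t_{T+1} are sentence-boundary markers.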

18 Transition and emission probabilities
Transition and output probabilities are estimated as relative frequencies from a tagged corpus: bigram, trigram, and lexical probabilities (see the estimates below).
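The estimates themselves were shown as images on the slide; in Brants (2000) they are the relative frequencies

\[
\hat{P}(t_3 \mid t_2) = \frac{f(t_2, t_3)}{f(t_2)}, \qquad
\hat{P}(t_3 \mid t_1, t_2) = \frac{f(t_1, t_2, t_3)}{f(t_1, t_2)}, \qquad
\hat{P}(w_3 \mid t_3) = \frac{f(w_3, t_3)}{f(t_3)},
\]

where f(.) counts occurrences in the training corpus.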

19 Smoothing Technique
Needed because of the sparse-data problem: many trigrams receive zero probability when estimated from a limited corpus, and without smoothing a single zero trigram makes the probability of the complete sequence zero.
TnT therefore smooths the trigram probabilities by interpolation (see the formula below).
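The interpolation formula from the slide is not reproduced; in Brants (2000) the smoothed trigram probability is the weighted sum

\[
P(t_3 \mid t_1, t_2) = \lambda_1 \hat{P}(t_3) + \lambda_2 \hat{P}(t_3 \mid t_2) + \lambda_3 \hat{P}(t_3 \mid t_1, t_2),
\qquad \lambda_1 + \lambda_2 + \lambda_3 = 1,
\]

so the result remains a probability distribution; the lambda weights are estimated by deleted interpolation.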

20 Other techniques
Handling unknown words: the longest suffix (the final sequence of characters of a word) is a strong predictor of the word class. The tagger computes the probability of a tag t given the last m letters l_i of an n-letter word, where m depends on the specific word (see the formula below).
Capitalization: a capitalization feature is also used; it works better for English than for German.
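The suffix formula was shown as an image on the slide; in Brants (2000) the suffix-conditioned tag probabilities are smoothed recursively by successive abstraction, roughly

\[
P(t \mid l_{n-i+1}, \ldots, l_n) =
\frac{\hat{P}(t \mid l_{n-i+1}, \ldots, l_n) + \theta_i \, P(t \mid l_{n-i+2}, \ldots, l_n)}{1 + \theta_i},
\]

starting from the unconditional tag probability and working up to the longest suffix of length m.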

21 Evaluation
Corpora:
  German NEGRA corpus, around 355,000 tokens
  WSJ (Wall Street Journal) portion of the Penn Treebank, around 1.2 million tokens
Evaluation uses 10-fold cross-validation (see the sketch below).
The tagger assigns probabilities as well as tags to words, so different assignments can be ranked.
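A minimal illustration of the 10-fold protocol; the splitting scheme below is generic and does not reproduce the paper's exact partitioning:

def ten_fold_splits(sentences, k=10):
    """Yield (train, test) pairs for k-fold cross-validation over a list of sentences."""
    folds = [sentences[i::k] for i in range(k)]   # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test

# Example: the accuracies of the 10 runs would be averaged into the reported score.
data = [f"sentence_{n}" for n in range(50)]
for train, test in ten_fold_splits(data):
    pass  # train the tagger on `train`, evaluate accuracy on `test`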

22 Results for German and English

23 POS Learning Curve for NEGRA

24 Learning Curve for Penn Treebank

25 Conclusion
TnT gives good results for both the German and the English corpus.
The average accuracy TnT achieves is between 96% and 97%.
The accuracy for known tokens is significantly higher than for unknown tokens.

26 References
Grefenstette (1994): What is a Word, What is a Sentence?
Abney (1996): Part-of-Speech Tagging and Partial Parsing.
Brants (2000): TnT - A Statistical Part-of-Speech Tagger.
Manning & Schütze (1999): Foundations of Statistical Natural Language Processing.

