
Part II: The development of Automated Syntactic Taggers
Leif Grönqvist, Göteborg University (Colloquia Linguistica)

Overview
Some basic things about corpora (quick)
–What is a corpus
–What can we do with it
Part-of-speech tagging (slower)
–What is the problem
–Some common approaches
–A rule-based tagger
–A statistical tagger
Corpus tools
–Different tools
–Demonstration of Multitool

What is a corpus for a computational linguist?
Various properties are important, but the word ‘corpus’ is just Latin for ‘body’
These properties should be considered:
–Representativeness
–Size
–Form (annotation standard)
–Standard reference

Representativeness
A corpus used for analyzing spoken Swedish should ideally contain all utterances of Swedish ever spoken
But this is impossible, so there are at least two strategies, depending on purpose:
–Try to collect various dialogue types in sizes proportional to the “complete corpus”
–Collect big enough portions of each type to make sure all wanted phenomena are found
Regardless of which strategy you use, it is important to select the samples from each type carefully, preferably using random sampling

Corpus size: how big should it be?
Depends on purpose! Some strategies:
–Monitor corpus: as big as possible. The Bank of English (> 500 million tokens) is used for lexicography
–Finite size, big enough for the current task. POS tagging with ~100 tags: 1 million tokens; a language model for automatic speech recognition: 100 million tokens

Machine-readable form
Corpora have been used in linguistics for more than 100 years. Today, a corpus is assumed to be machine readable
The annotations should be made in a way that makes extraction of the wanted features as simple as possible

Standard reference (quick)
Typical content of a research article: “We used the corpus XX, took 90% for training and 10% for testing with our new algorithm. We then got 97.2% correctness, which is a significant improvement over the old tagger at the 99% level”
–The exact corpus XX must be available to other research groups

What to do with a corpus
–Check our linguistic intuition
–Annotate interesting features manually
–Use it for training of taggers and parsers, then annotate new data automatically
But be careful! A corpus is not the complete language

Text encoding
Various encoding schemes are around:
–Text based: human and machine readable; can be difficult to check for validity
–Word-processor based: only human readable; rarely used in computational linguistics
–XML/SGML based: machine readable; may be transformed to human-readable form using XSLT; formalisms and tools for free (well, more or less free); limitations of XML may be annoying sometimes

Some important properties (skip)
Important properties according to Geoffrey Leech:
–Possibility to extract the original corpus
–Possibility to separate the annotations
–Based on well-defined guidelines
–Make clear how the annotations were done
–Make clear that there may be errors in the corpus
–Widely agreed, theory-neutral annotation scheme
–No annotation scheme is the a priori standard scheme

Some annotation standards
–TEI (Text Encoding Initiative): huge standard for all types of texts and corpora, developed by the TEI Consortium since 1987; SGML based in the beginning but now XML
–(X)CES (XML Corpus Encoding Standard): highly inspired by the TEI; not as complicated, but only in beta version
–ISLE (International Standards for Language Engineering): developed by three working groups (lexicon, multimodality and evaluation)
–CDIF (Corpus Document Interchange Format): used by the British National Corpus; a lot in common with the TEI

Some typical results directly extracted from a corpus
–Concordances (KWIC)
–Frequency lists
–N-gram statistics
–Probabilities

Concordances
rer, matematiker och dataloger i Göteborgsregionen,
 bandavskrifter och dataloggar, skriver Feldt.|Si
, bandavskrifter och dataloggar.|Men den nya Palme
 Ahlberg, forskare i datalogi på Chalmers.|Av PER-
 Ahlberg, forskare i datalogi på Chalmers.|SIDAN 4
und blir professor i datalogi vid Umeå universitet
a fyra olika kurser: datalogi, pedagogik, teknik o
ybjer och Jan Smith, datalogi.|Sektionen för maski
atorer eller pluggar datalogi.|Så på fritiden leke
r det gäller trådlös datalogistik, nu kommer det ö
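A keyword-in-context (KWIC) concordance like the one above is easy to produce from a tokenized text. A minimal Python sketch; the function name, the window width, and the sample sentence are invented for illustration:

```python
def kwic(tokens, keyword, width=3):
    """Return keyword-in-context lines: `width` tokens on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Skip empty parts so lines at the text edges stay clean.
            lines.append(" ".join(p for p in (left, "[" + tok + "]", right) if p))
    return lines

text = "the cat sat on the mat while the dog slept".split()
for line in kwic(text, "the"):
    print(line)   # e.g. "cat sat on [the] mat while the"
```

A real concordancer would also align the keyword column and sort on the left or right context, as in the slide above.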

Frequency lists
74556 de, 48104 ja, 39947 e, 34342 å, 25694 så, 25639 att, 22378 va, 19134 som, 18679 vi, 18084 inte, 17611 på, 17214 man, 16870 i, 16846 då
77810 det, 36843 är, 35471 och, 32404 ja, 30439 att, 28628 jag, 26059 så, 19205 som, 18681 inte, 18469 har, 18421 vi, 17719 på, 17377 man, 17343 då
, 40438 och, 33978 i, 26358 att, 25634 det, 21830 en, 21333 som, 19743 på, 15754 är, 14333 med, 13837 för, 13683 av, 13547 jag

N-gram statistics
3395 det är, 2913 för att, 2451 det var, 1560 att det, 1351 är det, 1278 i en, 1174 att han, 1003 i den, 966 som en, 920 men det, 889 på en, 884 att jag, 882 är en, 882 med en
42 i stället för att, 36 för några år sedan, 35 men det är inte, 34 en stor del av, 33 på samma sätt som, 32 det var som om, 31 att det är en, 30 är en av de, 30 men det var inte, 28 vad är det för, 28 det är svårt att, 27 det är som om, 27 att det inte var, 26 för ett år sedan
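Frequency lists and n-gram statistics like the ones above fall out of a few lines of Python. A minimal sketch; the function name and the tiny token list are invented for illustration:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams (as tuples of tokens) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "det är det är det var".split()
unigrams = ngram_counts(tokens, 1)   # n=1 gives an ordinary frequency list
bigrams = ngram_counts(tokens, 2)
print(unigrams.most_common(2))       # [(('det',), 3), (('är',), 2)]
print(bigrams.most_common(1))
```

Dividing each count by the total number of n-grams turns these counts into the probabilities listed as the fourth item on the previous slide.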

Part-of-speech tagging
We want to assign the right part of speech (just as an example) to each word in a corpus
–Input is a tokenized corpus
–The tagset is determined in advance
The word types in the corpus have various properties in the training data:
–Some are unambiguous
–Some are ambiguous (typically 2-7 possible POS each)
–Some are unknown (not there at all)

An example
Tagset: noun, verb, pron, art, infmrk, prep
In: $A: you have to book a chair on deck
Out: pron verb infmrk verb art noun prep noun
But “book” and “chair” may be either verb or noun: the tagger has to disambiguate!
There are several approaches to do this, all based on patterns and regularities in the language

Terms used in tagging
–Tagging: put the right label (i.e. word class) on each token
–Tagset: all possible labels (word classes)
–Tokenizing: divide the corpus into tokens (words, sentence boundaries)
–Training: find the rules or probabilities needed by the tagger

Various approaches
Rule-based tagging
–Constraint-based tagging (SweTwol, EngTwol by Lingsoft)
–Transformation-based tagging (Eric Brill)
Stochastic tagging (HMM)
–Calculate the most probable tag sequence
–Training using maximum likelihood estimation
–Or some bootstrap-based training

Constraint-based tagging
Basic idea:
–Assign all possible tags to each word
–Remove tags according to a set of rules of the type: “if word+1 is an adj, adv or quantifier, and the following is a sentence boundary, and word-1 is not a verb like ‘consider’, then eliminate non-adv, else eliminate adv”
–Continue until no rule is applicable, but never remove the last tag on a word
Typically more than 1000 hand-written rules are used, but they may also be machine learned
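The elimination idea can be sketched in a few lines of Python. Everything here is a toy made up for illustration: the lexicon, the two context rules, and the function name; real constraint grammars like EngTwol use more than a thousand much richer rules:

```python
# Toy lexicon: each word maps to its set of possible tags (all invented).
LEXICON = {
    "you": {"pron"}, "have": {"verb"}, "to": {"infmrk"},
    "book": {"noun", "verb"}, "a": {"art"},
    "chair": {"noun", "verb"}, "on": {"prep"}, "deck": {"noun"},
}

def constraint_tag(words):
    # Step 1: assign every possible tag to each word.
    tags = [set(LEXICON[w]) for w in words]
    # Step 2: apply elimination rules; never remove the last tag on a word.
    for i in range(1, len(words)):
        if "infmrk" in tags[i - 1] and len(tags[i]) > 1:
            tags[i].discard("noun")   # after an infinitive marker: keep verb
        if "art" in tags[i - 1] and len(tags[i]) > 1:
            tags[i].discard("verb")   # after an article: keep noun
    return tags

print(constraint_tag("you have to book a chair on deck".split()))
```

With these two rules, “book” (after “to”) keeps only its verb reading and “chair” (after “a”) keeps only its noun reading, matching the example on the next slide.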

The example: Constraint grammar
Tagset: nn, vb, pron, art, infmrk, prep
First: look up all possible classes for each word. Rules will then remove unwanted tags.

Word    Step 1
you     pron
have    verb
to      infmrk
book    noun, verb
a       art
chair   noun, verb
on      prep
deck    noun

Transformation-based tagging
Basic idea:
–Set the most probable tag for each word as a start value
–Change tags according to rules of the type: “if a word is tagged as a verb and the word before is an article, then change the tag to noun”. Perform the rules in a specific order!
Training is done using a tagged corpus:
1. Write a set of rule templates of the type: “if word-1 or word+1 is an X, then change the tag for the word to Y”
2. Among the set of possible rules, find the one with the highest score
3. Continue from 2 until a lowest-score threshold is passed
4. Keep the ordered set of rules
Rules will make errors that are corrected by later rules
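The tagging step (applying an already-learned, ordered rule list) can be sketched in a few lines; the learning loop in steps 1-4 is omitted. The most-common-tag table, the single rule, and the names are all invented for illustration:

```python
# Toy table of each word's most common tag (all invented).
MOST_COMMON = {"you": "pron", "have": "verb", "to": "infmrk",
               "book": "noun", "a": "art", "chair": "noun",
               "on": "prep", "deck": "noun"}

# Learned rules, in order: (tag of previous word, from_tag, to_tag).
RULES = [("infmrk", "noun", "verb")]   # e.g. "to book" -> verb

def tbl_tag(words):
    tags = [MOST_COMMON[w] for w in words]   # start value: most probable tag
    for prev_tag, frm, to in RULES:          # apply rules in a specific order
        for i in range(1, len(tags)):
            if tags[i - 1] == prev_tag and tags[i] == frm:
                tags[i] = to
    return tags

print(tbl_tag("you have to book a chair on deck".split()))
```

The single rule corrects “book” from its start value noun to verb, exactly the change shown in the example on the next slide.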

The example: Transformation-based learning
Tagset: nn, vb, pron, art, infmrk, prep
First: look up the most common tag for each word. Rules will then change to the right tags.

Word    Step 1
you     pron
have    verb
to      infmrk
book    noun
a       art
chair   noun
on      prep
deck    noun

An HMM tagger: uses statistics (brief)
The problem may be formulated as:
  argmax over t(1..n) of P(t(1..n) | w(1..n))
Which may be reformulated, using Bayes’ rule, as:
  argmax over t(1..n) of P(w(1..n) | t(1..n)) · P(t(1..n)) / P(w(1..n))
But the denominator is constant and may be removed, and we get:
  argmax over t(1..n) of P(w(1..n) | t(1..n)) · P(t(1..n))

HMM tagger, cont. (brief)
The Markov assumption (for n=3) and the chain rule give us:
  P(t(1..n)) ≈ product over i of P(t(i) | t(i-2), t(i-1))
  P(w(1..n) | t(1..n)) ≈ product over i of P(w(i) | t(i))
What we need now is: the tag-sequence probabilities P(t(i) | t(i-2), t(i-1)) and the word probabilities P(w(i) | t(i))

The example: HMM

Word    Seq. 1   Seq. 2   Seq. 3   Seq. 4
you     pron     pron     pron     pron
have    verb     verb     verb     verb
to      infmrk   infmrk   infmrk   infmrk
book    noun     noun     verb     verb
a       art      art      art      art
chair   noun     verb     noun     verb
on      prep     prep     prep     prep
deck    noun     noun     noun     noun

Select the sequence with the highest probability!

Training of an HMM tagger
The best way is Maximum Likelihood Estimation, but it requires a hand-tagged corpus
A fancy name for a simple principle: expect the new data to be like the training data. Count the things there:
–P(c) = freq(c) / Ntok
–P(w,c) = freq(w,c) / Ntok
–P(w|c) = P(w,c) / P(c)
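The three estimates can be computed directly from a list of (word, tag) pairs. A sketch with a made-up four-token corpus; the function name is invented for illustration:

```python
from collections import Counter

def mle_estimates(tagged):
    """tagged: list of (word, tag) pairs from a hand-tagged corpus."""
    n_tok = len(tagged)
    tag_freq = Counter(t for _, t in tagged)
    pair_freq = Counter(tagged)
    p_tag = {t: f / n_tok for t, f in tag_freq.items()}        # P(c)
    p_word_given_tag = {(w, t): f / tag_freq[t]                # P(w|c) = P(w,c)/P(c)
                        for (w, t), f in pair_freq.items()}
    return p_tag, p_word_given_tag

corpus = [("book", "noun"), ("book", "verb"), ("a", "art"), ("book", "noun")]
p_tag, p_wt = mle_estimates(corpus)
print(p_tag["noun"])            # 0.5
print(p_wt[("book", "noun")])   # 1.0
```

Note that P(w|c) simplifies to freq(w,c) / freq(c), which is what the code computes in one step.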

Evaluation (skip)
The result is compared with the so-called “gold standard” (manually coded)
–Typically accuracy reaches 96-97%
–This may be compared with the result for a baseline tagger, for example a tagger not using context at all
–Similarity between two gold standards may be verified with the kappa measure
Important to note that 100% is impossible even for human annotators

Problems (quick)
Words and sequences are missing in the training data. This is cured using smoothing:
–Additive: add one occurrence to each event frequency
–Good-Turing estimation: try to estimate the number of unseen events to get a better estimate of their probabilities
–Back-off and linear interpolation
–Morphology may help (-arity, -s)
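The additive (add-one, or Laplace) smoothing in the first bullet can be sketched in a few lines; the function name, the toy counts and the vocabulary size are invented for illustration:

```python
from collections import Counter

def add_one_prob(counts, vocab_size):
    """Additive smoothing: pretend every possible event occurred once more."""
    total = sum(counts.values())
    def prob(event):
        return (counts.get(event, 0) + 1) / (total + vocab_size)
    return prob

counts = Counter({"det": 3, "är": 2})       # 5 tokens seen in training
p = add_one_prob(counts, vocab_size=4)      # assume 4 word types exist
print(p("det"))   # (3+1)/(5+4) = 4/9
print(p("var"))   # unseen, but still nonzero: (0+1)/(5+4) = 1/9
```

The same idea applies to tag n-grams and (word, tag) pairs; Good-Turing and back-off redistribute the probability mass less crudely.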

The Viterbi algorithm (quick)
Calculating the probabilities for all possible tag sequences would take too long: the number of sequences grows exponentially with the length of the text
The Viterbi algorithm finds the most probable path, using dynamic programming, in time linear in the length of the text and quadratic in the number of states
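A compact, unoptimized sketch of Viterbi for a bigram HMM tagger (the slides use trigrams; the bigram case keeps the code short). The toy model and all probabilities below are invented, and unseen events default to probability 0, i.e. no smoothing:

```python
def viterbi(words, tags, p_start, p_trans, p_emit):
    """Most probable tag sequence for `words`.

    Runs in time linear in len(words) and quadratic in len(tags),
    because only the best path into each tag is kept at every step.
    """
    # best[t] = (probability, path) of the best sequence ending in tag t
    best = {t: (p_start.get(t, 0.0) * p_emit.get((words[0], t), 0.0), [t])
            for t in tags}
    for w in words[1:]:
        best = {t: max(((best[s][0] * p_trans.get((s, t), 0.0)
                         * p_emit.get((w, t), 0.0), best[s][1] + [t])
                        for s in tags), key=lambda x: x[0])
                for t in tags}
    return max(best.values(), key=lambda x: x[0])[1]

# Toy model: after the infinitive marker, the verb reading of "book"
# is far more likely than the noun reading (numbers made up).
p_start = {"infmrk": 1.0}
p_trans = {("infmrk", "verb"): 0.9, ("infmrk", "noun"): 0.1}
p_emit = {("to", "infmrk"): 1.0, ("book", "verb"): 0.5, ("book", "noun"): 0.5}
print(viterbi(["to", "book"], ["infmrk", "verb", "noun"],
              p_start, p_trans, p_emit))   # ['infmrk', 'verb']
```

A production tagger would work with log probabilities to avoid underflow on long texts, and would plug in the smoothed estimates from the previous slide.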

Example of corpus tools at the linguistics department in Göteborg
–The Corpus Browser: a tool for searching (for words and expressions) and browsing in our transcriptions
–TraSA: a tool that counts things like number of words, utterances, overlaps, vocabulary richness, etc.
–Multitool: a tool for browsing and coding a transcription, with audio and video available at the same time. Demonstration?

Thank you!
Thank you for listening! Well, do we have any time left for questions?