Leif Grönqvist, Colloquia Linguistica
Part II: The development of Automated Syntactic Taggers
Leif Grönqvist, Göteborg University

Overview
Some basic things about corpora (quick)
–What is a corpus
–What can we do with it
Part-of-speech tagging (slower)
–What is the problem
–Some common approaches
  A rule-based tagger
  A statistical tagger
Corpus tools
–Different tools
–Demonstration of Multitool

What is a corpus for a computational linguist?
Various properties are important, but the word ‘corpus’ is just Latin for ‘body’
These properties should be considered:
–Representativeness
–Size
–Form (annotation standard)
–Standard reference

Representativeness
A corpus used for analyzing spoken Swedish should ideally contain all utterances of Swedish ever spoken
But this is impossible, so there are at least two strategies depending on purpose:
–Try to collect various dialogue types in sizes proportional to their share of the “complete corpus”
–Collect large enough portions of each type to make sure to find all wanted phenomena
Regardless of which strategy you use, it is important to select the samples from each type carefully, preferably by random selection

Corpus size: how big should it be?
Depends on purpose! Some strategies:
–Monitor corpus: as big as possible
  Bank of English > 500 million tokens
  Used for lexicography
–Finite size, big enough for the current task
  POS tagging, ~100 tags: 1 million tokens
  Language model for automatic speech recognition: 100 million tokens

Machine readable form
Corpora have been used in linguistics for more than 100 years
Now: a corpus => machine readable
The annotations should be made in a way that makes extraction of wanted features as simple as possible

Standard reference (quick)
Typical content of a research article:
“We used the corpus XX, took 90% for training, and 10% for testing with our new algorithm. We then got 97.2% correctness, which is a significant improvement over the old tagger at the 99% level”
–Exactly that corpus XX must be available to other research groups

What to do with a corpus
Check our linguistic intuition
Annotate interesting features manually
Use it for training of taggers and parsers
–Annotate new data automatically
But be careful! A corpus is not the complete language

Text encoding
Various encoding schemes around
–Text based
  Human and machine readable
  Could be difficult to check for validity
–Word processor based
  Only human readable
  Rarely used in computational linguistics
–XML/SGML based
  Machine readable
  May be transformed to human readable form using XSLT
  Formalisms and tools are free, or more or less free
  Limitations of XML may be annoying sometimes

Some important properties (skip)
Important properties according to Geoffrey Leech:
Possibility to extract original corpus
Possibility to separate annotations
Based on well defined guidelines
Make clear how the annotations were done
Make clear that there may be errors in the corpus
Widely agreed theory-neutral annotation scheme
No annotation scheme is the a priori standard scheme

Some annotation standards
TEI (Text Encoding Initiative)
–Huge standard for all types of texts and corpora, developed since 1987 and now maintained by the TEI Consortium
–SGML based in the beginning but now XML
(X)CES (XML Corpus Encoding Standard)
–Highly inspired by the TEI
–Not as complicated but only in beta version
ISLE (International Standards for Language Engineering)
–Developed by three working groups (lexicon, multimodality and evaluation)
CDIF (Corpus Document Interchange Format)
–Used by the British National Corpus
–A lot in common with the TEI

Some typical results directly extracted from a corpus
Concordances (KWIC)
Frequency lists
N-gram statistics
Probabilities

Concordances
rer, matematiker och dataloger i Göteborgsregionen,
  bandavskrifter och dataloggar, skriver Feldt.|Si
, bandavskrifter och dataloggar.|Men den nya Palme
 Ahlberg, forskare i datalogi på Chalmers.|Av PER-
 Ahlberg, forskare i datalogi på Chalmers.|SIDAN 4
und blir professor i datalogi vid Umeå universitet
a fyra olika kurser: datalogi, pedagogik, teknik o
ybjer och Jan Smith, datalogi.|Sektionen för maski
atorer eller pluggar datalogi.|Så på fritiden leke
r det gäller trådlös datalogistik, nu kommer det ö
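
A keyword-in-context listing like the one above can be produced with a few lines of code. This is only a minimal sketch: the function name `kwic`, the toy sentence, and the choice to cut the context by characters are assumptions for illustration, not the tool used for the slide.

```python
def kwic(tokens, keyword_prefix, width=21):
    """Return one line per hit, with the keyword aligned in a fixed column."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower().startswith(keyword_prefix):
            # Keep only the last `width` characters of the left context
            left = " ".join(tokens[:i])[-width:]
            # Keep only the first `width` characters of the right context
            right = " ".join(tokens[i + 1:])[:width]
            lines.append(f"{left:>{width}} {tok} {right}")
    return lines

# Hypothetical toy text, not taken from the corpus on the slide
text = "he studies datalogi at Chalmers and she teaches datalogi in Umeå"
for line in kwic(text.split(), "datalog"):
    print(line)
```

Matching on a prefix rather than the full word form is what lets one search return dataloger, dataloggar and datalogistik together, as in the listing.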

Frequency lists
74556 de      77810 det     90304 .
48104 ja      36843 är      56075 ,
39947 e       35471 och     40438 och
34342 å       32404 ja      33978 i
25694 så      30439 att     26358 att
25639 att     28628 jag     25634 det
22378 va      26059 så      21830 en
19134 som     19205 som     21333 som
18679 vi      18681 inte    19743 på
18084 inte    18469 har     15754 är
17611 på      18421 vi      14333 med
17214 man     17719 på      13837 för
16870 i       17377 man     13683 av
16846 då      17343 då      13547 jag

N-gram statistics
3395 det är      42 i stället för att
2913 för att     36 för några år sedan
2451 det var     35 men det är inte
1560 att det     34 en stor del av
1351 är det      33 på samma sätt som
1278 i en        32 det var som om
1174 att han     31 att det är en
1003 i den       30 är en av de
 966 som en      30 men det var inte
 920 men det     28 vad är det för
 889 på en       28 det är svårt att
 884 att jag     27 det är som om
 882 är en       27 att det inte var
 882 med en      26 för ett år sedan
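
Frequency lists and n-gram statistics are both just counts over a tokenized corpus. The sketch below assumes whitespace tokenization and a made-up sentence; `ngram_counts` is a hypothetical helper name, and a frequency list is simply the case n = 1.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every sequence of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Hypothetical toy corpus; a real corpus would be read from file
tokens = "det är bra att det är så men det var inte så".split()

unigrams = ngram_counts(tokens, 1)  # frequency list
bigrams = ngram_counts(tokens, 2)   # bigram statistics

for gram, freq in bigrams.most_common(3):
    print(freq, " ".join(gram))
```

Dividing a count by the total number of n-grams gives the relative frequency, which is the simplest probability estimate one can extract from a corpus.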

Part-of-speech tagging
We want to assign the right part-of-speech (just as an example) to each word in a corpus
Input is a tokenized corpus
The tagset is determined in advance
The word types in the corpus have various properties in the training data
–Some are unambiguous
–Some are ambiguous (typically 2-7 POS each)
–Some are unknown (not there)

An example
Tagset: noun, verb, pron, art, infmrk, prep
In: $A: you have to book a chair on deck
Out: pron verb infmrk verb art noun prep noun
But “book” and “chair” may each be either verb or noun, so the tagger has to disambiguate!
Several approaches to do this, all based on patterns and regularities in the language

Terms used in tagging
Tagging: put the right label (i.e. word class) on each token
Tagset: all possible labels (word classes)
Tokenizing: divide the corpus into tokens (words, sentence boundaries)
Training: find the rules or probabilities needed by the tagger

Various approaches
Rule based tagging
–Constraint based tagging (SweTwol, EngTwol by Lingsoft)
–Transformation-based tagging (Eric Brill)
Stochastic tagging (HMM)
–Calculate the most probable tag sequence
–Using maximum likelihood estimation
–Or some bootstrap based training

Constraint based tagging
Basic idea:
–Assign all possible tags to each word
–Remove tags according to a set of rules of the type: “if word+1 is an adj, adv or quantifier and the following is a sentence boundary and word-1 is not a verb like ‘consider’ then eliminate non-adv else eliminate adv.”
–Continue until no rule is applicable, but never remove the last tag on a word
Typically more than 1000 hand written rules, but they may also be machine learned

The example: Constraint grammar
Tagset: noun, verb, pron, art, infmrk, prep
First: look up all possible classes for each word
Rules will then remove unwanted tags
In      Step 1
you     pron
have    verb
to      infmrk
book    noun, verb
a       art
chair   noun, verb
on      prep
deck    noun
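
The elimination step can be sketched in code. The lexicon below is the one from the table; the two rules are invented for this example (real constraint grammars such as EngTwol use far richer hand-written rules), and `constraint_tag` is a hypothetical name.

```python
# Step 1 lexicon from the slide: all possible tags per word
LEXICON = {
    "you": {"pron"}, "have": {"verb"}, "to": {"infmrk"},
    "book": {"noun", "verb"}, "a": {"art"}, "chair": {"noun", "verb"},
    "on": {"prep"}, "deck": {"noun"},
}

# Assumed toy rules: (unambiguous tag of the previous word, tag to eliminate)
RULES = [
    ("infmrk", "noun"),  # after an infinitive marker, eliminate noun
    ("art", "verb"),     # after an article, eliminate verb
]

def constraint_tag(words):
    """Remove tags until no rule applies; never remove the last tag on a word."""
    readings = [set(LEXICON[w]) for w in words]
    changed = True
    while changed:
        changed = False
        for i in range(1, len(words)):
            for prev_tag, bad_tag in RULES:
                if (readings[i - 1] == {prev_tag}
                        and bad_tag in readings[i]
                        and len(readings[i]) > 1):
                    readings[i].discard(bad_tag)
                    changed = True
    return readings

words = "you have to book a chair on deck".split()
for word, tags in zip(words, constraint_tag(words)):
    print(word, sorted(tags))
```

With these two rules every word ends up with exactly one tag; with an incomplete rule set some words would simply stay ambiguous, which is the normal behaviour of a constraint based tagger.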

Transformation-based tagging
Basic idea:
–Set the most probable tag for each word as a start value
–Change tags according to rules of the type: “if a word is tagged as a verb and the word before is an article, then change the tag to noun”. Apply the rules in a specific order!
Training is done using a tagged corpus:
1. Write a set of rule templates of the type: “if word-1 or word+1 is an X then change the tag for word to Y”
2. Among the set of possible rules, find the one with the highest score
3. Continue from 2 until the best score falls below a threshold
4. Keep the ordered set of rules
Rules will make errors that are corrected by later rules

The example: Transformation based learning
Tagset: noun, verb, pron, art, infmrk, prep
First: look up the most common tag for each word
Rules will then change to the right tags
Word    Step 1
you     pron
have    verb
to      infmrk
book    noun
a       art
chair   noun
on      prep
deck    noun

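An HMM tagger: uses statistics (brief)
The problem may be formulated as finding the most probable class sequence $c_{1,n}$ for the words $w_{1,n}$:
$$\hat{c}_{1,n} = \arg\max_{c_{1,n}} P(c_{1,n} \mid w_{1,n})$$
Which may be reformulated as:
$$\hat{c}_{1,n} = \arg\max_{c_{1,n}} \frac{P(w_{1,n} \mid c_{1,n}) \, P(c_{1,n})}{P(w_{1,n})}$$
But the denominator is constant and may be removed and we get:
$$\hat{c}_{1,n} = \arg\max_{c_{1,n}} P(w_{1,n} \mid c_{1,n}) \, P(c_{1,n})$$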

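HMM tagger, cont. (brief)
The Markov assumption (for n=3) and the chain rule give us, for words $w_i$ and classes $c_i$:
$$P(c_{1,n}) \approx \prod_{i=1}^{n} P(c_i \mid c_{i-2}, c_{i-1}), \qquad P(w_{1,n} \mid c_{1,n}) \approx \prod_{i=1}^{n} P(w_i \mid c_i)$$
What we need now is:
$$P(c_i \mid c_{i-2}, c_{i-1}) \quad \text{and} \quad P(w_i \mid c_i)$$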

The example: HMM
Word    Seq.1   Seq.2   Seq.3   Seq.4
you     pron    pron    pron    pron
have    verb    verb    verb    verb
to      infmrk  infmrk  infmrk  infmrk
book    noun    noun    verb    verb
a       art     art     art     art
chair   noun    verb    noun    verb
on      prep    prep    prep    prep
deck    noun    noun    noun    noun
Select the sequence with the highest probability!

Training of an HMM tagger
The best way is maximum likelihood estimation, but it requires a hand-tagged corpus
A fancy name for a simple principle: expect the new data to be like the training data. Count the events there:
–P(c) = freq(c) / Ntok
–P(w,c) = freq(w,c) / Ntok
–P(w|c) = P(w,c) / P(c)
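
The three estimates above translate directly into counting code. This is a sketch under the assumption that the hand-tagged corpus is available as a list of (word, class) pairs; the tiny corpus and the name `train_mle` are made up for illustration.

```python
from collections import Counter

def train_mle(tagged_tokens):
    """Estimate P(c), P(w,c) and P(w|c) by relative frequency."""
    n_tok = len(tagged_tokens)
    class_freq = Counter(c for _, c in tagged_tokens)
    pair_freq = Counter(tagged_tokens)
    p_c = {c: f / n_tok for c, f in class_freq.items()}           # P(c)
    p_wc = {wc: f / n_tok for wc, f in pair_freq.items()}         # P(w,c)
    p_w_given_c = {(w, c): p / p_c[c] for (w, c), p in p_wc.items()}  # P(w|c)
    return p_c, p_wc, p_w_given_c

# Hypothetical hand-tagged toy corpus
corpus = [("you", "pron"), ("book", "verb"), ("a", "art"),
          ("book", "noun"), ("a", "art"), ("chair", "noun")]

p_c, p_wc, p_w_given_c = train_mle(corpus)
print(p_w_given_c[("book", "noun")])
```

Any (word, class) pair that never occurs in the training data is simply absent from the tables, which amounts to probability zero.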

Evaluation (skip)
The result is compared with the so-called “Gold Standard” (manually coded)
–Accuracy typically reaches 96-97%
–This may be compared with the result for a baseline tagger, for example a tagger not using context at all
–Similarity between two gold standards may be verified with the kappa measure
Important to note that 100% is impossible even for human annotators
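
Both measures are easy to compute once the tag sequences are aligned token by token. The sketch below assumes that "the kappa measure" means Cohen's kappa for two annotators; the baseline output is the context-free tagging of the running example.

```python
from collections import Counter

def accuracy(predicted, gold):
    """Share of tokens where the tagger agrees with the gold standard."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

def cohen_kappa(a, b):
    """Agreement between two annotations, corrected for chance agreement."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement: probability that both pick the same tag independently
    expected = sum(freq_a[t] * freq_b[t] for t in freq_a) / (n * n)
    # Undefined (division by zero) if both annotations use a single tag only
    return (observed - expected) / (1 - expected)

gold = "pron verb infmrk verb art noun prep noun".split()
baseline = "pron verb infmrk noun art noun prep noun".split()

print(accuracy(baseline, gold))
print(cohen_kappa(baseline, gold))
```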

Problems (quick)
Words and sequences are missing in the training data. This is handled using smoothing:
–Additive: add one occurrence to each event frequency
–Good-Turing estimation: try to calculate the number of unseen events to get a better estimation of their probabilities
–Back-off and linear interpolation
–Morphology may help (-arity, -s)
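
Additive smoothing is the simplest of these to write down. In this sketch the counts and the three-tag inventory are invented, `additive_prob` is a hypothetical name, and k = 1 corresponds to adding one occurrence to each event.

```python
from collections import Counter

def additive_prob(counts, event, n_events, k=1.0):
    """Relative frequency after adding k occurrences to each possible event."""
    total = sum(counts.values())
    return (counts[event] + k) / (total + k * n_events)

# Hypothetical counts: "adj" is a possible tag that was never observed
counts = Counter({"noun": 3, "verb": 1})
tags = ["noun", "verb", "adj"]

for tag in tags:
    print(tag, additive_prob(counts, tag, n_events=len(tags)))
```

The unseen event now gets a small non-zero probability, and the probabilities still sum to one because the denominator grows by k for every possible event.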

The Viterbi algorithm (quick)
Calculating the probabilities for all possible sequences of tags would take too long
The Viterbi algorithm uses dynamic programming to find the most probable path in time linear in the length of the text and quadratic in the number of states
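
A compact sketch of the dynamic programming idea follows. To keep it short it assumes a bigram model (each tag depends only on the previous tag) rather than the n=3 model of the earlier slides, and all probabilities in the tables are invented for illustration.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Most probable tag sequence under a bigram HMM."""
    # best[i][t] = (probability of the best path ending in t at word i, previous tag)
    best = [{t: (start_p.get(t, 0.0) * emit_p[t].get(words[0], 0.0), None)
             for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            # Only the best path into each previous tag has to be considered
            prob, prev = max((best[i - 1][p][0] * trans_p[p].get(t, 0.0), p)
                             for p in tags)
            column[t] = (prob * emit_p[t].get(words[i], 0.0), prev)
        best.append(column)
    # Trace back from the best final tag
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]

# Invented toy model; real values would come from training
TAGS = ["infmrk", "verb", "noun", "art"]
START = {"infmrk": 1.0}
TRANS = {"infmrk": {"verb": 0.9, "noun": 0.1},
         "verb": {"art": 0.8, "noun": 0.2},
         "noun": {"verb": 0.5, "art": 0.5},
         "art": {"noun": 0.9, "verb": 0.1}}
EMIT = {"infmrk": {"to": 1.0},
        "verb": {"book": 0.3, "chair": 0.1},
        "noun": {"book": 0.6, "chair": 0.4},
        "art": {"a": 1.0}}

print(viterbi("to book a chair".split(), TAGS, START, TRANS, EMIT))
```

The two nested loops over tags inside the loop over words give the quadratic and linear factors. In this toy model "book" is more likely as a noun when seen in isolation, but the transition from the infinitive marker makes the verb reading win.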

Example of corpus tools at the linguistics department in Göteborg
The Corpus Browser
–A tool for searching (for words and expressions) and browsing in our transcriptions
TraSA
–A tool that counts things like number of words, utterances, overlaps, vocabulary richness, etc.
Multitool
–A tool for browsing and coding a transcription, with audio and video available at the same time
–Demonstration?

Thank you!
Thank you for listening!
Well, do we have any time left for questions?
