Corpus annotation: what is it good for? LING306, week 3.

1 Corpus annotation: what is it good for? LING306, week 3

2 What’s in a corpus? Text: plain text (?) Markup: representing features of the text Metadata: data about data Annotation: linguistic analysis We will discuss all of these briefly today, but we will spend most of the time talking about annotation

3 Beyond plain text The most basic form of corpus contains nothing but raw text – the words of the original data and nothing else But many (most) corpora contain more than this – extra information of one form or another

4 Markup Can we preserve features of the original text other than just the words? YES – by adding in codes to indicate features of the original layout / structure of the text ◦ Paragraph begin / end ◦ Sentence begin / end ◦ Page breaks ◦ Headings versus normal paragraphs ◦ Etc. etc. We will see later what this actually looks like We can then use this in searches (e.g. the word “new” only in headings; the word “the” only at the start of a paragraph)
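A search restricted by markup, such as "the word new only in headings", simply means matching within certain zones. A toy Python sketch, assuming a hypothetical <head> element (not any particular corpus's actual markup):

```python
import re

# A tiny marked-up text (hypothetical <head>/<p> markup, for illustration only).
text = '<head>New horizons in corpus work</head><p>The new corpus is ready.</p>'

# Restrict the search for "new" to heading zones only.
heads = re.findall(r'<head>(.*?)</head>', text)
hits = [w for h in heads for w in h.lower().split() if w == 'new']
print(hits)
```

The "new" in the paragraph is never inspected, because the search runs only over the text extracted from heading zones.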

5 Metadata When building a corpus, we often record “extra info” about the text, where it came from, who wrote it (or spoke it), and so on ◦ This is called metadata Corpus files often have a blob of markup at the start of the file that contains metadata ◦ This is called the header ◦ We will see some examples later We can use the metadata to limit our searches to particular sorts of texts ◦ You have done this with the BNC (searching written texts versus spoken texts – this is based on metadata in the headers of the BNC texts)

6 Annotation Adding analysis to a corpus ◦ That is: material is added which is not part of the original text, but instead represents the results of some linguistic analysis being performed upon it Most obvious example: part-of-speech (POS) tagging ◦ The_D cat_N sat_V on_I the_D mat_N. But there are many other forms of annotation, as we will see
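The word_TAG format above is easy to process: split on spaces, then split each token at its last underscore. A minimal Python sketch (the final full stop is left off for simplicity):

```python
# The POS-tagged sentence from the slide (final full stop omitted for simplicity).
tagged = "The_D cat_N sat_V on_I the_D mat_N"

# Split on spaces, then split each token at its last underscore.
pairs = [tuple(tok.rsplit('_', 1)) for tok in tagged.split()]
print(pairs)
```

Splitting at the *last* underscore (rsplit) matters because a word form could itself contain an underscore, but a tag never does.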

7 Markup, annotation, tagging A warning: ◦ These terms are often used interchangeably – but in this lecture we will be using them as I have outlined above ◦ You should, however, be aware in your reading that the term “tagging” in particular can be used to mean any process that adds info to a corpus

8 Why have a lecture on annotation? Lancaster University (also UCL) led the way in several different sorts of corpus annotation – it is a very important part of our “version” of corpus linguistics Many of the corpus tools you have access to (BNCweb, CQPweb, Wmatrix) let you use a range of different kinds of annotation – so it will help you to know what analysis options you have

9 Annotate what? There are many kinds of annotation From the most basic to the most complex ◦ Tokenisation – word boundaries ◦ POS tagging – grammatical categories ◦ Semantic tagging – categories of meaning ◦ Lemmatisation – linking words to base forms ◦ Parsing – sentence structures ◦ Pragmatic / discourse / stylistic tagging – higher- level phenomena, longer-range relationships

10 Tokenisation Where do the words start and end? ◦ Don’t the spaces tell us this? ◦ Yes… but…  don’t, isn’t, I’m, ’tis, t’other, etc.  Instead of, because of, in terms of, of course … Token = a unit of language that can be analysed as a single word, regardless of whether it happens to be written as a single word or not ◦ Tokenising text = annotating text to indicate the token boundaries ◦ Often a first step to other sorts of annotation
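As a flavour of what a tokeniser has to do, here is a deliberately minimal Python sketch that splits off the clitic n't; real tokenisers (which must also handle 'tis, t'other and multiword tokens like "instead of") need much more than this:

```python
import re

def tokenise(text):
    """Deliberately minimal tokeniser: split on whitespace, then split
    off a clitic negation, e.g. "don't" -> "do" + "n't"."""
    tokens = []
    for chunk in text.split():
        match = re.match(r"(.+)(n't)$", chunk)
        if match:
            tokens.extend([match.group(1), match.group(2)])
        else:
            tokens.append(chunk)
    return tokens

print(tokenise("I don't know"))
```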

11 Part-of-speech (POS) tagging What is the grammatical category of each word? The_D cat_N sat_V on_I the_D mat_N. But most POS tagging schemes make finer distinctions than this. There are many POS taggers and as many different tagsets… ◦ We do not have time to describe more than one Lancaster’s tagger: CLAWS (Garside 1987) The tagsets it uses: C5, C7, C8

12 What is the difference between these tagsets? C5 tagset ◦ Used in the BNC ◦ Tags simplified a bit for non-linguists ◦ 62 different categories ◦ http://ucrel.lancs.ac.uk/claws5tags.html C7 tagset ◦ The one we use most of the time ◦ More linguistic distinctions: therefore, opens more opportunities for advanced analysis ◦ 137 different categories ◦ http://ucrel.lancs.ac.uk/claws7tags.html C8 ◦ A more detailed version of C7; you may see it occasionally

13 What is the difference between these tagsets? – nouns

Noun tags in C5:
NN0   noun (neutral for number) (e.g. AIRCRAFT, DATA)
NN1   singular noun (e.g. PENCIL, GOOSE)
NN2   plural noun (e.g. PENCILS, GEESE)
NP0   proper noun (e.g. LONDON, MICHAEL, MARS)

Noun tags in C7:
ND1   singular noun of direction (e.g. north, southeast)
NN    common noun, neutral for number (e.g. sheep, cod, headquarters)
NN1   singular common noun (e.g. book, girl)
NN2   plural common noun (e.g. books, girls)
NNA   following noun of title (e.g. M.A.)
NNB   preceding noun of title (e.g. Mr., Prof.)
NNL1  singular locative noun (e.g. Island, Street)
NNL2  plural locative noun (e.g. Islands, Streets)
NNO   numeral noun, neutral for number (e.g. dozen, hundred)
NNO2  numeral noun, plural (e.g. hundreds, thousands)
NNT1  temporal noun, singular (e.g. day, week, year)
NNT2  temporal noun, plural (e.g. days, weeks, years)
NNU   unit of measurement, neutral for number (e.g. in, cc)
NNU1  singular unit of measurement (e.g. inch, centimetre)
NNU2  plural unit of measurement (e.g. ins., feet)
NP    proper noun, neutral for number (e.g. IBM, Andes)
NP1   singular proper noun (e.g. London, Jane, Frederick)
NP2   plural proper noun (e.g. Browns, Reagans, Koreas)
NPD1  singular weekday noun (e.g. Sunday)
NPD2  plural weekday noun (e.g. Sundays)
NPM1  singular month noun (e.g. October)
NPM2  plural month noun (e.g. Octobers)

14 What is the difference between these tagsets? – determiners

Determiner tags in C5:
DPS   possessive determiner form (e.g. YOUR, THEIR)
DT0   general determiner (e.g. THESE, SOME)
DTQ   wh-determiner (e.g. WHOSE, WHICH)

Determiner tags in C7:
DA    after-determiner or post-determiner capable of pronominal function (e.g. such, former, same)
DA1   singular after-determiner (e.g. little, much)
DA2   plural after-determiner (e.g. few, several, many)
DAR   comparative after-determiner (e.g. more, less, fewer)
DAT   superlative after-determiner (e.g. most, least, fewest)
DB    before-determiner or pre-determiner capable of pronominal function (e.g. all, half)
DB2   plural before-determiner (e.g. both)
DD    determiner (capable of pronominal function) (e.g. any, some)
DD1   singular determiner (e.g. this, that, another)
DD2   plural determiner (e.g. these, those)
DDQ   wh-determiner (e.g. which, what)
DDQGE wh-determiner, genitive (e.g. whose)
DDQV  wh-ever determiner (e.g. whichever, whatever)

15 Semantic tagging Same principle as POS tagging… but the tags refer to categories of meaning rather than categories of grammar Lancaster developed a system of semantic tagging called “USAS” in the early 1990s (Andrew Wilson, Paul Rayson) Like POS tagging, this is now largely automated We will return to this in Wk 5 & afterwards

16 Lancaster’s semantic tagset (USAS) Plus subdivisions!! (see handout)

17 Semantic tagging: example (from a BNC fiction text) He_Z8m had_A9+ written_Q1.2 to_Z5 tell_Q2.2 them_Z8mfn about_Z5 Amy_Z1f, since_Z5 his_Z8m parents_S4mf constantly_T2++ threatened_E3-/Q2.2 to_Z5 visit_S1.1.1 and_Z5, unlikely_A7- as_Z5 this_M6 was_A3+ actually_A5.4+ to_Z5 happen_A2.1+, the_Z5 prospect_A7+ had_A9+ added_N5+/A2.1 an_Z5 intolerable_S7.4- anxiety_E6- to_Z5 his_Z8m already_T1.1.1 anxiety-ridden_Z99 life_L1+.

18 Points to note Z5 = Grammatical bin (most common tag) “anxiety” = E6- = Worry and confidence “anxiety-ridden” = Z99 = ????? “constantly” = T2++ = Time: Beginning “already” = T1.1.1 = Time: Past Note also gender marking (“m”, “f”, “n”)

19 Lemmatisation Lemma = a “word” in the lexicon i.e. a headword plus its inflection forms ◦ E.g. go, goes, went, going, gone are different word forms but they are part of the same lemma ◦ Headword = GO Lemmatisation = tagging each word in a text with its headword ◦ went => GO ◦ badgers => BADGER ◦ him=>HE ◦ n’t=>NOT
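In the simplest case lemmatisation is a table lookup from word form to headword; real lemmatisers add a large lexicon and morphological rules for unseen forms. A toy Python sketch built from the slide's examples:

```python
# A toy lemmatisation table (hand-built for illustration; a real lemmatiser
# uses a large lexicon plus morphological rules for forms it has not seen).
LEMMAS = {"went": "GO", "goes": "GO", "gone": "GO", "going": "GO",
          "badgers": "BADGER", "him": "HE", "n't": "NOT"}

def lemmatise(token):
    # Fall back on upper-casing the form itself (covers regular base forms).
    return LEMMAS.get(token.lower(), token.upper())

print([lemmatise(w) for w in ["went", "badgers", "him"]])
```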

20 Parsing There are two basic kinds of parsing… ◦ Phrase-structure parsing – indicating where syntactic phrases begin and end, which phrases are within each other, etc.  We will talk about this in more detail later in term ◦ Dependency parsing – showing grammatical links between words: each word in the sentence is dependent on another word

21 Parsing: phrase structure vs. dependency [N The cat N] [V sat [P on [N the mat N]P]V] http://www.connexor.eu/technology/machinese/demo/syntax/
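A dependency analysis can be stored very simply as one (dependent, relation, head) triple per word. A sketch for "The cat sat on the mat" (one possible analysis; the relation labels here are illustrative, not those of any particular parser):

```python
# One possible dependency analysis of "The cat sat on the mat", stored as
# (dependent, relation, head) triples; the relation labels are illustrative.
deps = [
    ("The", "det",  "cat"),
    ("cat", "subj", "sat"),
    ("on",  "adv",  "sat"),
    ("the", "det",  "mat"),
    ("mat", "pobj", "on"),
]

# Every word except the root ("sat") depends on exactly one other word.
heads = {dep: head for dep, rel, head in deps}
print(heads["cat"])
```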

22 Further levels of annotation Discourse annotation ◦ E.g. anaphora – tagging pronouns to indicate what they refer back to Pragmatic annotation ◦ Tagging utterances for their speech act; for their level of politeness; for their use of indirectness; etc. Stylistic annotation ◦ Tagging different kinds of speech and thought representation (Semino and Short 2004)  Elena will talk about this more in wk 7

23 Markup versus annotation: a fuzzy distinction? Markup is representation of the language of the text Annotation is metalinguistic ◦ But is there a clear dividing line? Questionable cases: ◦ Transcription of phonetic / prosodic features ◦ Identification of “levels” of heading in markup ◦ … are these representation or analysis? Transcription is itself an act of analysis ◦ Even deciding what is a sentence boundary is an act of interpretation ◦ “No!” she exclaimed, “it can’t be!”

24 Manual versus automatic annotation Some tagging can be done pretty reliably by machine ◦ Tokenisation, POS tagging, lemmatisation Some is a bit shakier ◦ Semantic tagging, parsing Some is difficult or impossible for a computer ◦ Pragmatic tagging For the user, there is in theory no difference ◦ In practice, there are differences of reliability (all automatic tagging makes mistakes, so do human beings, but different sorts!) ◦ Manual tagging is a huge task, so only relatively small corpora will come with (e.g.) pragmatic annotation

25 The great annotation debate The case against (John Sinclair inter alia) ◦ Tagging imposes theoretical preconceptions on the data – so you will never get anything out that you don’t put in ◦ Tagging destroys the integrity of the text ◦ Tagging can slow down the searching of a corpus

26 The great annotation debate The case in favour (Geoff Leech inter alia) ◦ There are some “philosophical” reasons why annotation is desirable ◦ There are also some strong practical reasons why tagging comes in very handy  (See next few slides!) ◦ As we will see, there are types of searches and analysis that you cannot do unless your corpus is annotated ◦ Contra Sinclair: the original text is preserved (it is always possible to hide tags – e.g. BNCweb) and tagging need not slow down searches

27 Why annotate? The “practical” reasons Disambiguation ◦ Frequency counts  How many instances of “may” as a modal are there in the BNC? ◦ Searches  Find only instances of “light” that are adjectives Higher levels of abstraction ◦ Frequency counts  How many words are there in each semantic category? ◦ Searches  Find noun phrases with a single premodifying adjective
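Once the corpus is tagged, disambiguation of this kind is a simple filter. A toy Python sketch over an invented mini-corpus with C5-style tags (VM0 = modal auxiliary, NP0 = proper noun):

```python
# A toy POS-tagged mini-corpus (C5-style tags; data invented for illustration).
tokens = [("May", "NP0"), ("may", "VM0"), ("come", "VVI"), ("in", "PRP"),
          ("May", "NP0"), ("or", "CJC"), ("may", "VM0"), ("not", "XX0")]

# "may" only as a modal verb, ignoring the month name.
modal_may = [w for w, tag in tokens if w.lower() == "may" and tag == "VM0"]
print(len(modal_may))
```

Without the tags, a plain word search for "may" would conflate the modal with the month.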

28 Why annotate? The “philosophical” reasons To preserve your analysis and make it available to others ◦ Hard work becomes reusable ◦ Each different analyst doesn’t reinvent the wheel To allow different people to work with the same basic analysis ◦ We know we are all starting from the same place ◦ We don’t have to worry about our analyses not being the same

29 How does annotation / markup actually appear in a corpus text? Many different ways of coding annotation have been invented The aim: to make sure the words of the text and the annotation can be kept apart

30 An early example: LOB-style annotation D01 7 ^ some_DTI critics_NNS,_, not_XNOT many_AP,_, argue_VB that_CS D01 7 the_ATI gospel_NN is_BEZ the_ATI product_NN of_IN D01 8 one_CD1 mind_NN and_CC one_CD1 hand_NN._. ^ for_IN them_PP3OS the_ATI D01 8 problems_NNS of_IN the_ATI fourth_OD gospel_NN D01 9 exist_VB only_RB in_IN the_ATI mind_NN of_IN its_PP$ detractors_NNS._. D01 9 ^ the_ATI difficulties_NNS which_WDTR are_BER D01 10 felt_VBN by_IN modern_JJ critics_NNS are_BER due_JJ to_IN the_ATI D01 10 book_NN being_BEG read_VBN and_CC examined_VBN as_CS D01 11 it_PP3 was_BEDZ never_RB meant_VBN to_TO be_BE._. ^ there_EX is_BEZ D01 11 some_DTI truth_NN in_IN this_DT contention_NN,_, and_CC D01 12 one_CD1 must_MD always_RB remember_VB that_CS no_ATI book_NN of_IN D01 12 the_ATI new_JJ testament_NN was_BEDZ written_VBN D01 13 with_IN the_ATI special_JJ interests_NNS of_IN a_AT modern_JJ critic_NN
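This horizontal format can be read back into (word, tag) pairs by splitting on whitespace and then at the final underscore. A rough Python sketch, assuming the line layout shown above (text reference, line number, then word_TAG tokens, with ^ marking sentence starts):

```python
def parse_lob_line(line):
    """Split a LOB-style line (text ref, line number, then word_TAG tokens,
    with '^' marking a sentence start) into (word, tag) pairs."""
    fields = line.split()
    ref, lineno, toks = fields[0], fields[1], fields[2:]
    pairs = [tuple(t.rsplit('_', 1)) for t in toks if t != '^']
    return ref, lineno, pairs

ref, n, pairs = parse_lob_line("D01 7 ^ some_DTI critics_NNS ,_, not_XNOT many_AP")
print(ref, pairs[0])
```

Note that punctuation is tagged too: the token ,_, yields the pair (",", ",").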

31 A more complex example: column-format annotation

He         PPHS1   Z8m          he
is         VBZ     Z5           be
standing   VVG     M6           stand
close      RR      A13:4        close
to         II      A13:4        to
the        AT      Z5           the
lazy       JJ      X5:2d~S1:2   lazy
gentleman  NN1     S2:2m        gentleman
,          ,       __UNDEF__    PUNC
and        CC      Z5           and
says       VVZ     Q2:1         say
with       IW      Z5           with
a          AT1     Z5           a
faint      JJ      A11:2d       faint
smile      NN1     E4:1u        smile
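Column-format files are easy to process: each line is one token carrying its word form, POS tag, semantic tag and lemma. A minimal Python sketch (field order as on the slide):

```python
def parse_column_line(line):
    """One token per line: word, POS tag, semantic tag, lemma
    (the four whitespace-separated columns shown on the slide)."""
    word, pos, sem, lemma = line.split()
    return {"word": word, "pos": pos, "sem": sem, "lemma": lemma}

token = parse_column_line("standing VVG M6 stand")
print(token["lemma"])
```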

32 The most recent way to do it: XML (Extensible Markup Language) XML is made up of tags in angled brackets ◦ Anything inside the angled brackets is part of the markup ◦ Anything outside the angled brackets is part of the text ◦ Zones are indicated like this: <s> … </s> ◦ You can use whatever XML elements you like Example: ◦ <p><s>This is a sentence.</s> <s>This is another sentence.</s></p> The two sentences are inside a paragraph. XML can also be used for headers (metadata) and marking extra info on words (annotation)
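Python's standard library can read such XML directly. A minimal sketch using <p> (paragraph) and <s> (sentence) elements of the kind described above:

```python
import xml.etree.ElementTree as ET

# A minimal fragment using <p> and <s> elements (element names as described above).
xml = "<p><s>This is a sentence.</s> <s>This is another sentence.</s></p>"
para = ET.fromstring(xml)

# Pull out the text of each sentence zone.
sentences = [s.text for s in para.findall("s")]
print(len(sentences))
```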

33 An XML header (example from the Corpus of English Dialogues) [The header’s XML tags have not survived transcription; its metadata values read:] D4FWARD / 4 / FICTION / D4FWARD / WHOLE PLEASURES / WARD / EDWARD / 4: 1680-1719 / 4: 1688/1710? 1714? / Fiction / WARD, EDWARD. (text of the actual corpus file goes here)

34 XML being used for annotation (example from the BNC) [The word-level XML tags have not survived transcription; the underlying sentence is:] Dalgliesh made good time and by three he was driving through Lydsett village.

35 How can we exploit the annotation? Searches based on tags not words ◦ In BNCweb you can do the following:  _NN1 or _{N} -- to find nouns  {sit} or {be} -- to find all forms of these verbs Quantitative analysis at other levels of abstraction ◦ Frequency lists of tags, not words ◦ Key tags, not key words ◦ Tag collocates: co-occurring tags, not co-occurring words We will see in future weeks how we can use these kinds of analysis to tell us interesting things about texts and genres
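Frequency lists of tags rather than words, for instance, are a one-liner once the corpus is held as (word, tag) pairs. A minimal sketch over toy data (C7-style tags assumed):

```python
from collections import Counter

# A toy tagged text as (word, tag) pairs (C7-style tags assumed).
tokens = [("the", "AT"), ("cat", "NN1"), ("sat", "VVD"),
          ("on", "II"), ("the", "AT"), ("mat", "NN1")]

# A frequency list of tags rather than words.
tag_freq = Counter(tag for _, tag in tokens)
print(tag_freq["NN1"])
```

The same Counter over the words instead of the tags would give an ordinary word-frequency list; swapping the column swaps the level of abstraction.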

36 Reservations about annotation We are effectively taking someone else’s analysis and using it – this is obviously hazardous We cannot simply assume accuracy We must be aware of the scheme and any problems it has We must be aware of the limits of the software (% accuracy?) A critical awareness is essential

37 Conclusion… Corpus annotation: what is it good for? It helps us disambiguate elements in our corpus It helps us do searches and frequency counts at higher levels of abstraction than just the word level It allows us to have a common basis of analysis for everyone who uses the corpus It opens up new options for corpus analysis

38 This week’s reading The set reading is two articles by Geoff Leech about the uses of corpus annotation: ◦ Leech (1997), Leech and Smith (1999) Further reading: ◦ Generally super book on annotation:  Garside, Leech and McEnery (1997) ◦ On the web re: POS tagging:  http://ucrel.lancs.ac.uk/claws/ ◦ On the web re: semantic tagging:  http://ucrel.lancs.ac.uk/usas/ ◦ On stylistic annotation: Semino and Short (2004) ◦ On POS tags and their uses: Hardie (2007)  (Shameless self-promotion)

