Syntactic annotation in CGN: lessons learned and to be learned
Ineke Schuurman, Centre for Computational Linguistics, Katholieke Universiteit Leuven
Paris2 This talk...
Why CGN: Spoken Dutch Corpus?
At that time …
Other layers
– Orthographic transcription
– PoS tagging
Syntactic annotation
– Dependencies and categories
Spoken language
– “standard” language, disfluencies
LASSY/SoNaR: Written Dutch Corpus
What to take into account when planning a ‘spoken treebank’
Paris3 Why CGN?
Dutch Language Union: Dutch/Flemish organization taking care of the common language
: report on the state of the art wrt Language & Speech Technology
1998: Spoken Dutch Corpus
– 5 years
– 2/3 Netherlands, 1/3 Flanders
– balanced
– 1000 hours, +/- 10M words
– 1M words with syntactic annotation
For both research purposes and services (EU) / industry
Paris4 At that time
This talk: focus on textual aspects!
No taggers or parsers that could be reused
Existing grammars cover(ed) the northern variant of Dutch
No ‘formal’ grammar
► start from scratch
Paris5 Other layers
Relevant for syntax:
– Orthographic transcription
– PoS tagging
All layers in parallel, but per fragment: layer A was finished before layer B started (except for error corrections)
Reason: time
But: this gave us the opportunity to express wishes/needs wrt other layers
Example: handling of specific types of words.
Paris6 Transcription and PoS An example:
Paris7 Specific types of words
*v words in another language (not 'adopted' in Dutch)
*a not fully realized words (gaan probe instead of gaan proberen)
*x words that could not be (fully) understood (also xxx, ggg)
*u mispronounced words (ploberen instead of proberen, om-uh-dat*u instead of omdat)
*d dialectal words
One or more words?
– zo’n vs zo ‘n (such a): one token!
– But hebde*d (lit. have you) realized as hebt*d de*d: two tokens
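To make the marker convention concrete, here is a minimal sketch of how a downstream tool might separate a transcribed token from its marker. It is only an illustration: the function name `split_marker` and the `MARKERS` table are assumptions made for this example, not part of the CGN tooling, and the descriptions paraphrase the slide.

```python
import re

# Marker letters as listed on the slide; descriptions paraphrase the slide.
MARKERS = {
    "v": "word in another language",
    "a": "not fully realized word",
    "x": "word that could not be (fully) understood",
    "u": "mispronounced word",
    "d": "dialectal word",
}

# A token optionally ends in '*' plus exactly one marker letter.
_TOKEN = re.compile(r"^(?P<form>.+?)\*(?P<marker>[vaxud])$")


def split_marker(token):
    """Return (form, marker); marker is None when the token carries none."""
    match = _TOKEN.match(token)
    if match is None:
        return token, None
    return match.group("form"), match.group("marker")


if __name__ == "__main__":
    for tok in "dat is een serieus plobleem*u".split():
        form, marker = split_marker(tok)
        print(form, MARKERS.get(marker, "-"))
```

Keeping the marker as a separate value, rather than as part of the word form, is one way to serve both the syntactic layer and groups that want the raw orthographic string.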
Paris8 Syntactic analysis: goal CGN
Annotation in a theory-neutral format, in order to be useful for as many people as possible
Categories: NP, PP, …
Functions/dependencies: subject, object1, …
As automatic as possible:
– Tool from the NEGRA corpus: Annotate
– developed for German
– same desiderata as CGN (contrary to the Dutch AMAZON parser).
Paris Annotate
Developed for the NEGRA project (Saarbrücken)
– Oliver Plaehn, Thorsten Brants
Semi-automatic annotation
– Works with tagger and parser
– Suggests structures
Combined with Cascaded Markov Models (Brants)
– Bootstrapping approach possible
Paris Annotate screen.
Paris11 Annotate ‘correction’ format
Paris12 Annotate export format.
Paris13 Principles of syntactic annotation
Structures as flat as possible
Only a new level when there is a new head
No branching when just one node is involved
No duplication of functions (1 SU, 1 OBJ1, …)
In principle just non-branching heads
Allowed:
– multiple branching
– crossing dependencies
Input: simplified PoS.
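Two of these principles lend themselves to a mechanical check: no phrasal node with only one daughter, and no function used twice under the same node. The sketch below is a hypothetical consistency checker, not the Annotate tool or any CGN script; the `Node` class, the `violations` function and the category labels `S`, `V` and `N` in the usage note are invented for illustration, and only SU and OBJ1 are taken from the slide as functions that must be unique.

```python
from collections import Counter
from dataclasses import dataclass, field

# Functions allowed at most once per node. The slide names SU and OBJ1;
# the rest of the list is deliberately left open here.
UNIQUE_FUNCTIONS = {"SU", "OBJ1"}


@dataclass
class Node:
    cat: str                      # category (NP, PP, ...) or simplified PoS tag
    func: str = "--"              # dependency label towards the parent
    children: list = field(default_factory=list)


def violations(node, path="root"):
    """Collect violations of the two principles in a (sub)tree."""
    found = []
    # Principle: no branching when just one node is involved.
    if len(node.children) == 1:
        found.append(f"{path}: phrasal node with a single daughter")
    # Principle: no duplication of functions under one node.
    counts = Counter(child.func for child in node.children)
    for func in sorted(UNIQUE_FUNCTIONS):
        if counts[func] > 1:
            found.append(f"{path}: function {func} used {counts[func]} times")
    for i, child in enumerate(node.children):
        found.extend(violations(child, f"{path}/{child.cat}[{i}]"))
    return found
```

For example, a flat clause such as `Node("S", children=[Node("VNW1", "SU"), Node("V", "HD"), Node("VNW1", "OBJ1")])` yields no violations, whereas wrapping a single noun in its own NP would be reported.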
Paris14 Fewer PoS tags: simplified PoS
PoS: over 300 tags
– Over 100 for pronouns
– Not problematic at all, often unique token/tag combinations
Not all details are necessary for SA
Example full tagset
– T501a VNW(pers,pron,nomin,vol,1,ev) ik (I)
– T501o VNW(pers,pron,nomin,vol,3,ev,masc) hij (he)
Example simplified tagset
– VNW1 VNW(pers,pron) personal pronoun
– In graph: both T501a and VNW1.
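The relation between the full and the simplified tag in the pronoun example can be sketched as a truncation of the feature list. This is an assumption made for illustration only: the slide gives a single example, and the real mapping from the 300+ full tags to the simplified tagset may well be a lookup table with different rules per word class. The function name `simplify` and its `keep` parameter are hypothetical.

```python
import re

# Matches tags of the form POS(feature,feature,...), e.g. VNW(pers,pron,nomin).
_TAG = re.compile(r"^(?P<pos>[A-Z]+)\((?P<features>[^)]*)\)$")


def simplify(tag, keep=2):
    """Keep the main category and only the first `keep` features.

    keep=2 is an assumption that merely reproduces the pronoun example
    on the slide; it is not the documented CGN mapping.
    """
    match = _TAG.match(tag)
    if match is None:
        return tag  # leave anything unexpected untouched
    features = [f for f in match.group("features").split(",") if f]
    return f"{match.group('pos')}({','.join(features[:keep])})"


if __name__ == "__main__":
    print(simplify("VNW(pers,pron,nomin,vol,1,ev)"))
    print(simplify("VNW(pers,pron,nomin,vol,3,ev,masc)"))
```

Both full tags from the slide collapse to the same simplified tag, which is the point of the simplification: the syntactic layer sees one personal-pronoun class instead of many.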
Paris15 Syntactic simplifications
Other simplifications
Obj2 – indirect object (dative)
– meewerkend voorwerp: Ik geef hem een boek / een boek aan hem (I give him a book)
– belanghebbend voorwerp: Ik koop hem een boek / een boek voor hem (I buy him a book)
Bepaling van gesteldheid (~predicative complement)
– Hij verft de deur blauw (he paints the door blue)
– Hij vindt het boek leuk (he likes the book)
– Hij nam het boek lachend aan (laughing, he accepted the book).
Paris16 Results
Even then: Annotate did most NPs and PPs very well, but often failed on the more complex parts
In some sense surprising, as the results for German were much better. However: those results were for written language. Training for spoken language is much harder!
Paris17 Details CGN corpus
Balanced corpus: types of documents (next slide)
Speaker characteristics
– Sex
– Age
– Geographic region
– Socio-economic class
– Level of education
2/3 Netherlands, 1/3 Belgium (Flanders)
Participants were asked to speak standard language (in case they agreed beforehand to participate in CGN).
Paris18 Details CGN corpus ►many types of documents Read-aloud written: Literature read aloud (library for the blind) Written to be spoken: News broadcasts Lectures Spoken (spontaneous) Interviews Phone calls Debates Spontaneous conversations with x people (over lunch etc).
Paris19 Variation
To some extent differences in written language, many more in the spoken variants, esp. in spontaneous speech
Separable verbs
– NL dat ze hem op wilde bellen (that she wanted to call him)
– VL dat ze hem wilde opbellen
Other choice of auxiliaries
– NL Ze is het komen brengen (she came and brought it)
– VL Ze heeft het komen brengen
Other words for the same concept, same words for different concepts
– Pompbak-gootsteen (sink), namiddag (afternoon-late afternoon)
Grammars/dictionaries: mostly northern written variant.
Paris Disfluencies
Partially realized words
– hilari*a instead of hilarisch (EN hilarious)
– Analyzed as if realized
***
Ik doe West- en Oost-Vlaanderen
– EN: I’ll take care of West and East Flanders
– Short for: West-Vlaanderen en Oost-Vlaanderen
– Analyzed completely regularly, as a conjunction (CONJ).
Paris Disfluencies
When too little of a token is realized, such a token is ignored
awel genen TV meer en genen boe*a gene voetbal meer.
EN: So no more tv and no more football.
Paris Ex of disfluency (repetition)
Paris Disfluencies
Mixed repetition/correction
Ze was bijna hileri*a hilari*a
– EN: She was almost hilarious
– hileri*a is corrected to hilari*a; only the corrected form is included in the analysis
Die verd*a die vervl*a die krankzinnige hond
– EN: That damn*, that cursed*, that crazy dog
– Only the last 3 words (that crazy dog) are included in the graph.
Paris24 Disfluencies Wrong pronunciation Dat is een serieus plobleem*u Dat is een serieus probleem That’s a serious problem Analysed as if the ‘correct’ word was involved ***
Paris25 Words in foreign language
In spoken and written language:
Words in another language that are not found in a Dutch dictionary:
– umbrella*v, plus*v de*v temps*v, à la carte
– not: rendez-vous, cinema, cognac (in Dutch dictionaries)
Single words: analyzed just like their Dutch counterpart
Strings: only the ‘top’ label is presented
Sentences: not analyzed.
Paris26 Pro and con markings
Markings (*a, etc) have proven to be useful for PoS and SA.
But: they should have been removed afterwards, i.e. all information should have been contained in the tags; the orthographic level should contain only orthography
Problem: other groups wanted them at the orthographic level for speech recognition purposes
Solution: add a field without markings.
Paris27 Syntactic annotation
Lacking and superfluous words
There are no ‘ungrammatical’ sentences, all sentences are to be analyzed!
Lacking elements: just accept it
Superfluous elements: just accept it
BUT there are some exceptions:
– repetition
– ‘accidental’ sentences.
Paris28 Not analyzed parts
Sometimes parts of a ‘sentence’ are ‘ignored’:
Repairs
– Ik zie hem morg*a overmorgen (I’ll see him the day after tomorrow)
Repetitions
– Hij is in in vergadering (He has a meeting)
Or not connected: ‘accidental’ sentences/units
– Ik heb nooit ik ben lerares (I have never I am a teacher)
Uh-insertion (hesitation marker)
– Ze heeft uh zeven dochters (She has seven daughters).
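Two of the ‘ignored’ cases, hesitation markers and immediate repetitions, are regular enough to pre-mark automatically. The sketch below is a simple heuristic written for this transcript, not the CGN procedure (the corpus was corrected by human annotators); the function name `mark_ignored` and the `hesitations` parameter are assumptions. Repairs such as morg*a overmorgen and ‘accidental’ units are deliberately not handled, since they need human judgment.

```python
def mark_ignored(tokens, hesitations=("uh",)):
    """Pair each token with True if this heuristic would leave it out.

    Only two cases are covered: hesitation markers, and a word that is
    immediately repeated (the earlier copy is dropped, the last is kept).
    """
    marked = []
    for i, token in enumerate(tokens):
        is_hesitation = token.lower() in hesitations
        is_repeated = (
            i + 1 < len(tokens) and token.lower() == tokens[i + 1].lower()
        )
        marked.append((token, is_hesitation or is_repeated))
    return marked


def kept(tokens):
    """Return only the tokens that would enter the syntactic analysis."""
    return [token for token, ignored in mark_ignored(tokens) if not ignored]


if __name__ == "__main__":
    print(kept("Hij is in in vergadering".split()))
    print(kept("Ze heeft uh zeven dochters".split()))
```

Marking tokens rather than deleting them keeps the transcription intact, so the ignored material stays available to the other annotation layers.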
Paris29 Examples More of the same
Paris30 Asyndetic conjunction
Paris31 Discourse phenomena Some examples of ‘discourse’ within a sentence
Paris32 Accidental unit ‘Accidental’ unit, discourse parts not connected
Paris33 Syntactic annotation sentence vs discourse
Paris34 Atypical ‘sentences’ Often: discourse
Paris35 Complicating factors
No punctuation apart from full stop, question mark, ellipsis
‘Wrong order’ of sentences when more people are talking at the same time!
► Tricky wrt coreference, temporal reasoning etc
Spelling: incorrect (but correct with another meaning)
– U zij de glorie (Thine be the glory)
– U zei de glorie (‘zei’ meaning ‘said’)
– Ik zal haar eraan houden (houden aan: to keep a promise)
– Ik zal haar er aanhouden (aanhouden: to arrest)
► context, recordings.
Paris36 Written corpus: Lassy/SoNaR
STEVIN programme (Flemish/Dutch )
D-Coi / LASSY / (SoNaR)
1M SA written text, manually corrected, plus 1.500M SA automatically
ALPINO parser (Groningen)
Largely inspired by CGN, based on HPSG
Some differences
Mentioning of ‘hidden’ subjects, objects
– Hij heeft een boek gekocht (He has bought a book).
Paris37 Alpino
Alpino grammar: HPSG-based
‘Constructional’ approach:
– rich lexical representations
– many detailed, construction specific lexical rules (+/- 600)
Grammar based parsing very efficient, esp when combined with specific rules
Large lexicon ( entries, NEs)
– Stored as perfect hash finite automaton (Daciuk)
Crucial: Integrated tagger (=/= CGN tagger!)
Left corner parser
Paris Alpino (as is) and CGN
Parsing the CGN corpus with Alpino gives very bad results
– reason might be: it uses a ‘wrong’ grammar, inadequate lexicon etc
As we wanted both CGN and Lassy to be searchable using the same tools, CGN was ‘translated’ into the Lassy format.
There are, however, still differences in the way a few phenomena are handled.
Paris Lassy vs CGN
Subjects/direct objects wrt infinitives and participles
Partitives (one of them said …): in CGN a separate label PART, in Lassy a combination of HD and MOD
LASSY: head always lexically anchored
In LASSY an SBAR complement always gets the VC label, in CGN either OBJ1 or VC
…
Analyses not fully identical, but 99% is!
Paris40 Syntactic annotation: Lassy.
Paris41 Syntactic annotation: CGN.
Paris42 To be taken into account
In general:
– Take care of IPR
– Be prepared to consult other layers
– Use a flexible bug reporting system
“Spoken language”:
– The grammar/system should be very flexible
– Alignment may be very time consuming
Be aware that, as far as consistency is concerned, the most important cases are not the really hard ones but those the correctors do not recognize as problematic (because in those cases they don’t consult others)
GOOD LUCK!