Syntactic annotation in CGN: lessons learned and to be learned
Ineke Schuurman
Centre for Computational Linguistics, Katholieke Universiteit Leuven

15-11-2011, Paris

This talk...
Why CGN (Spoken Dutch Corpus)?
At that time …
Other layers
– Orthographic transcription
– PoS tagging
Syntactic annotation
– Dependencies and categories
Spoken language
– “standard” language, disfluencies
LASSY/SoNaR: Written Dutch Corpus
What to take into account when planning a ‘spoken treebank’

Why CGN?
Dutch Language Union: Dutch/Flemish organization taking care of the common language
1997-8: report on the state of the art in Language & Speech Technology
1998: Spoken Dutch Corpus, 5 years, 2/3 Netherlands, 1/3 Flanders, balanced
1000 hours, +/- 10M words
1M words with syntactic annotation
For both research purposes and services (EU) / industry

At that time
This talk: focus on textual aspects!
--------------------------------------------------------
No taggers or parsers that could be reused
Existing grammars cover(ed) the northern variant of Dutch
No ‘formal’ grammar
► start from scratch

Other layers
Relevant for syntax:
– Orthographic transcription
– PoS tagging
All layers were annotated in parallel, but per fragment layer A was finished before layer B started (except for error corrections)
Reason: time
But: this gave us the opportunity to express wishes/needs with respect to the other layers
Example: handling of specific types of words

Transcription and PoS
An example:

Specific types of words
*v  words in another language (not ‘adopted’ in Dutch)
*a  words that are not fully realized (gaan probe instead of gaan proberen)
*x  words that could not be (fully) understood (also xxx, ggg)
*u  mispronounced words (ploberen instead of proberen, om-uh-dat*u instead of omdat)
*d  dialectal words
One or more words?
zo’n vs zo ‘n (such a): one token!
But hebde*d (lit. have you) realized as hebt*d de*d: two tokens
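The word markings listed above can be read off a token mechanically. The following is a minimal sketch, not the CGN tooling: the function name `split_marker` and the descriptions in `MARKERS` are illustrative, and it assumes that a marking is always a `*` plus one letter at the very end of a non-empty token.

```python
import re

# Marker letters and a short description, following the list on the slide.
MARKERS = {
    "v": "word in another language",
    "a": "not fully realized word",
    "x": "word not (fully) understood",
    "u": "mispronounced word",
    "d": "dialectal word",
}

# Assumption: a token optionally ends in '*' plus exactly one marker letter.
TOKEN_RE = re.compile(r"^(?P<form>.+?)(?:\*(?P<mark>[vaxud]))?$")


def split_marker(token):
    """Return (form, marker); marker is None for unmarked tokens."""
    match = TOKEN_RE.match(token)
    return match.group("form"), match.group("mark")


if __name__ == "__main__":
    for tok in ["hebt*d", "de*d", "om-uh-dat*u", "proberen"]:
        form, mark = split_marker(tok)
        print(tok, "->", form, MARKERS.get(mark, "unmarked"))
```

Keeping the form and the marking apart in this way is what allows later layers (PoS, syntax) to use the marking without having to re-parse the orthography.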

Syntactic analysis: goal CGN
Annotation in a theory-neutral format, in order to be useful for as many people as possible
Categories: NP, PP, …
Functions/dependencies: subject, object1, …
As automatic as possible:
– Tool from the NEGRA corpus: Annotate
– developed for German
– same desiderata as CGN (contrary to the Dutch AMAZON parser)

Annotate
Developed for the NEGRA project (Saarbrücken)
– Oliver Plaehn, Thorsten Brants
Semi-automatic annotation
– Works with tagger and parser
– Suggests structures
Combined with Cascaded Markov Models (Brants)
– Bootstrapping approach possible

Annotate screen

Annotate ‘correction’ format

Annotate export format

Principles of syntactic annotation
Structures as flat as possible
Only a new level when there is a new head
No branching when just one node is involved
No duplication of functions (1 SU, 1 OBJ1, …)
In principle just non-branching heads
Allowed:
– multiple branching
– crossing dependencies
Input: simplified PoS
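Two of these principles (no unary branching, no duplicated functions) lend themselves to an automatic consistency check. The sketch below is only an illustration under stated assumptions: the `Node` class is a hypothetical in-memory tree, not the Annotate export format, and the set `UNIQUE_FUNCTIONS` is an assumed list of labels that may occur once per node.

```python
from collections import Counter
from dataclasses import dataclass, field

# Assumption: these dependency labels may occur at most once under a node.
UNIQUE_FUNCTIONS = {"HD", "SU", "OBJ1", "OBJ2"}


@dataclass
class Node:
    label: str                  # category (phrase) or PoS (leaf)
    function: str = "--"        # dependency label relative to the parent
    children: list = field(default_factory=list)


def check(node, path="ROOT"):
    """Return a list of violations found in the tree rooted at node."""
    problems = []
    # "No branching when just one node is involved"
    if len(node.children) == 1:
        problems.append(f"{path}: unary branching under {node.label}")
    # "No duplication of functions (1 SU, 1 OBJ1, ...)"
    counts = Counter(child.function for child in node.children)
    for function, n in counts.items():
        if function in UNIQUE_FUNCTIONS and n > 1:
            problems.append(f"{path}: {n} x {function} under {node.label}")
    for child in node.children:
        problems.extend(check(child, f"{path}/{child.label}"))
    return problems
```

For example, `check(Node("NP", children=[Node("N", "HD")]))` reports one unary-branching violation, while a flat clause with one SU, one HD and one OBJ1 passes.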

Fewer PoS tags: simplified PoS
PoS: over 300 tags
– Over 100 for pronouns
– Not problematic at all, often unique token/tag combinations
Not all details are necessary for syntactic annotation (SA)
Example full tagset
– T501a  VNW(pers,pron,nomin,vol,1,ev)  ik (I)
– T501o  VNW(pers,pron,nomin,vol,3,ev,masc)  hij (he)
Example simplified tagset
– VNW1  VNW(pers,pron)  personal pronoun
– In graph: both T501a and VNW1
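The pronoun example suggests how a full tag relates to its simplified counterpart. The sketch below reproduces only that one example: the rule "keep the first two features" (the `keep` parameter) is an assumption made for illustration, and the actual CGN simplification may well be defined per word class rather than by a single cut-off.

```python
import re

# A tag such as VNW(pers,pron,nomin,vol,1,ev): main PoS plus feature list.
TAG_RE = re.compile(r"^(?P<pos>[A-Z]+)\((?P<feats>[^)]*)\)$")


def parse_tag(tag):
    """Split a tag into its main PoS and a list of features."""
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"not a PoS tag: {tag!r}")
    feats = [f for f in match.group("feats").split(",") if f]
    return match.group("pos"), feats


def simplify(tag, keep=2):
    """Keep only the first `keep` features (hypothetical simplification rule)."""
    pos, feats = parse_tag(tag)
    return f"{pos}({','.join(feats[:keep])})"


if __name__ == "__main__":
    for full in ["VNW(pers,pron,nomin,vol,1,ev)",
                 "VNW(pers,pron,nomin,vol,3,ev,masc)"]:
        print(full, "->", simplify(full))
```

Both full tags collapse to `VNW(pers,pron)`, which matches the simplified tag shown for personal pronouns.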

Syntactic simplifications
Other simplifications
Obj2: indirect object (dative)
– meewerkend voorwerp (recipient): Ik geef hem een boek / een boek aan hem (I give him a book)
– belanghebbend voorwerp (beneficiary): Ik koop hem een boek / een boek voor hem (I buy him a book)
Bepaling van gesteldheid (~predicative complement)
– Hij verft de deur blauw (he paints the door blue)
– Hij vindt het boek leuk (he likes the book)
– Hij nam het boek lachend aan (laughing, he accepted the book)

Results
Even then: Annotate handled most NPs and PPs very well, but often failed on the more complex parts
Somewhat surprising, as the results for German were much better
However: the German results concerned written language
Training for spoken language is much harder!

Details CGN corpus
Balanced corpus: types of documents (next slide)
Speaker characteristics
– Sex
– Age
– Geographic region
– Socio-economic class
– Level of education
2/3 Netherlands, 1/3 Belgium (Flanders)
Participants were asked to speak standard language (in case they agreed beforehand to participate in CGN)

Details CGN corpus
► many types of documents
Read-aloud written:
– Literature read aloud (library for the blind)
Written to be spoken:
– News broadcasts
– Lectures
Spoken (spontaneous):
– Interviews
– Phone calls
– Debates
– Spontaneous conversations with x people (over lunch etc.)

Variation
To some extent differences in written language, many more in the spoken variants, esp. in spontaneous speech
Separable verbs
– NL: dat ze hem op wilde bellen (that she wanted to call him)
– VL: dat ze hem wilde opbellen
Different choice of auxiliaries
– NL: Ze is het komen brengen (she came and brought it)
– VL: Ze heeft het komen brengen
Different words for the same concept, same words for different concepts
– pompbak vs gootsteen (sink), namiddag (afternoon vs late afternoon)
Grammars/dictionaries: mostly the northern written variant

Disfluencies
Partially realized words
– hilari*a instead of hilarisch (EN: hilarious)
– Analyzed as if realized
***
Ik doe West- en Oost-Vlaanderen
I’ll take care of West- and Oost-Vlaanderen
– Short for: West-Vlaanderen en Oost-Vlaanderen
– Analyzed completely regularly, as a conjunction (CONJ)

Disfluencies
When too little of a token is realized, the token is ignored
awel genen TV meer en genen boe*a gene voetbal meer
EN: So no more TV and no more football

Example of disfluency (repetition)

Disfluencies
Mixed repetition/correction
Ze was bijna hileri*a hilari*a
She was almost hilarious
– hileri*a is corrected to hilari*a; only the corrected form is included in the analysis
Die verd*a die vervl*a die krankzinnige hond
That damn*, that cursed*, that crazy dog
– Only the last 3 words (that crazy dog) are included in the graph

Disfluencies
Wrong pronunciation
Dat is een serieus plobleem*u
Dat is een serieus probleem
That’s a serious problem
Analyzed as if the ‘correct’ word was involved

Words in foreign language
In spoken and written language:
Words in another language that are not found in a Dutch dictionary:
– umbrella*v, plus*v de*v temps*v, à la carte
– not: rendez-vous, cinema, cognac (in Dutch dictionaries)
Single words: analyzed just like their Dutch counterpart
Strings: only the ‘top’ label is presented
Sentences: not analyzed

Pro and con markings
Markings (*a, etc.) have proven to be useful for PoS and SA
But: they should have been removed afterwards, i.e. all information should have been contained in the tags, and the orthographic level should contain only orthography
Problem: other groups wanted them at the orthographic level for speech recognition purposes
Solution: add a field without markings
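The proposed solution, a second field without markings, can be sketched as follows. The field names `orth` and `clean` are hypothetical, and the sketch assumes that markings are always a `*` plus one of the letters v, a, x, u, d at the end of a token.

```python
import re

# Assumption: a marking is '*' plus one marker letter at the end of a token.
MARK_RE = re.compile(r"\*[vaxud]$")


def add_clean_field(tokens):
    """Keep the marked form and add a parallel field with the marking removed."""
    return [{"orth": tok, "clean": MARK_RE.sub("", tok)} for tok in tokens]


if __name__ == "__main__":
    for record in add_clean_field("Dat is een serieus plobleem*u".split()):
        print(record)
```

Groups that need the markings (e.g. for speech recognition) read `orth`; everyone else reads `clean` and takes the marking information from the tags.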

Syntactic annotation
Lacking and superfluous words
There are no ‘ungrammatical’ sentences; all sentences are to be analyzed!
Lacking elements: just accept it
Superfluous elements: just accept it
BUT there are some exceptions:
– repetition
– ‘accidental’ sentences

Not analyzed parts
Sometimes parts of a ‘sentence’ are ‘ignored’:
Repairs
– Ik zie hem morg*a overmorgen (I’ll see him the day after tomorrow)
Repetitions
– Hij is in in vergadering (He is in a meeting)
Or not connected: ‘accidental’ sentences/units
– Ik heb nooit ik ben lerares (I have never I am a teacher)
Uh-insertion (hesitation marker)
– Ze heeft uh zeven dochters (She has seven daughters)
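As a rough illustration of which tokens end up outside the analysis, the sketch below applies three crude heuristics to the examples above: drop hesitation markers, drop a partially realized token (`*a`) when something follows it, and keep only the last token of an immediate repetition. This is an assumption-laden approximation, not the CGN procedure: in the corpus these decisions were made by annotators, and the heuristics would also remove legitimate repetitions or partial words that should be analyzed as if realized. `HESITATIONS` and `analysable_tokens` are hypothetical names.

```python
# Assumption: 'uh' is the only hesitation marker considered here.
HESITATIONS = {"uh"}


def analysable_tokens(tokens):
    """Return the tokens that would be kept for syntactic analysis (heuristic)."""
    kept = []
    for i, tok in enumerate(tokens):
        if tok.lower() in HESITATIONS:
            continue  # uh-insertion
        if tok.endswith("*a") and i + 1 < len(tokens):
            continue  # partial word followed by more material: treated as repaired
        kept.append(tok)
    result = []
    for i, tok in enumerate(kept):
        # Immediate repetition: keep only the last occurrence.
        if i + 1 < len(kept) and kept[i + 1].lower() == tok.lower():
            continue
        result.append(tok)
    return result


if __name__ == "__main__":
    for utterance in ["Ze heeft uh zeven dochters",
                      "Hij is in in vergadering",
                      "Ik zie hem morg*a overmorgen",
                      "Die verd*a die vervl*a die krankzinnige hond"]:
        print(utterance, "->", " ".join(analysable_tokens(utterance.split())))
```

‘Accidental’ units such as Ik heb nooit ik ben lerares are deliberately left alone: deciding where one unit breaks off and the next begins needs more than token-level rules.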

Examples
More of the same

Asyndetic conjunction

Discourse phenomena
Some examples of ‘discourse’ within a sentence

Accidental unit
‘Accidental’ unit, discourse parts not connected

Syntactic annotation
sentence vs discourse

Atypical ‘sentences’
Often: discourse

Complicating factors
No punctuation apart from full stop, question mark, ellipsis
‘Wrong order’ of sentences when several people are talking at the same time!
► Tricky wrt coreference, temporal reasoning etc.
Spelling: incorrect (but correct with another meaning)
– U zij de glorie (Thine be the glory)
– U zei de glorie (‘zei’ meaning ‘said’)
– Ik zal haar eraan houden (houden aan: to keep a promise)
– Ik zal haar er aanhouden (aanhouden: to arrest)
► context, recordings

Written corpus: Lassy/SoNaR
STEVIN programme (Flemish/Dutch, 2004-2011)
D-Coi / LASSY / (SoNaR)
1M words of written text with SA, manually corrected, plus 1.500M words with automatic SA
ALPINO parser (Groningen)
Largely inspired by CGN, based on HPSG
Some differences
– Mentioning of ‘hidden’ subjects, objects
– Hij heeft een boek gekocht (He has bought a book)

Alpino
Alpino grammar: HPSG-based
‘Constructional’ approach:
– rich lexical representations
– many detailed, construction-specific lexical rules (+/- 600)
Grammar-based parsing is very efficient, esp. when combined with specific rules
Large lexicon (100.000+ entries, 200.000+ NEs)
– Stored as perfect hash finite automaton (Daciuk)
Crucial: integrated tagger (=/= CGN tagger!)
Left-corner parser

Alpino (as is) and CGN
Parsing the CGN corpus with Alpino: very bad results
– possible reasons: it uses a ‘wrong’ grammar, an inadequate lexicon, etc.
As we wanted both CGN and Lassy to be searchable using the same tools, CGN was ‘translated’ into the Lassy format.
There are, however, still differences in the way a few phenomena are handled.

Lassy vs CGN
Subjects/direct objects wrt infinitives and participles
Partitives (one of them said …): in CGN a separate label PART, in Lassy a combination of HD and MOD
LASSY: head always lexically anchored
In LASSY an SBAR complement always gets the VC label, in CGN either OBJ1 or VC
…
Analyses are not fully identical, but 99% is!

Syntactic annotation: Lassy

Syntactic annotation: CGN

To be taken into account
In general:
– Take care of IPR
– Be prepared to consult other layers
– Use a flexible bug reporting system
“Spoken language”:
– grammar/system should be very flexible
– Alignment may be very time-consuming
Be aware that, as far as consistency is concerned, the most important cases are not the really hard ones but those the correctors don’t realize are problematic (because in those cases they don’t consult others)
GOOD LUCK!

