1 The Syntax-Morphology Interface and Natural Language Processing Veronika Vincze University of Szeged Hungary Thematic Training Course on Processing Morphologically Rich Languages April 2011

2 Outline
Introduction
Syntax vs. morphology from a linguistic viewpoint
Morphological coding systems in Hungarian
Morphosyntactic information in Hungarian corpora
Language-specific morphosyntactic problems
Effects on IE, NER and MT

3 Syntax vs. morphology
Typological differences among languages
Agglutinative languages: the role of morphology is stronger (a lot of information in morphemes)
Isolating languages: the role of syntax is stronger (fewer morphemes, more constructions)
Focus on Hungarian (agglutinative) and English (fusional/isolating)

4 Basic Hungarian syntax
A lot of information is encoded in morphemes
No fixed word order
Information structure is reflected in word order (theme-rheme, old-new)
Péter szereti Marit.
Peter love-3SgObj Mary-ACC
‘Peter loves Mary.’
Péter Marit szereti. ‘It is Mary who Peter loves.’
Marit szereti Péter. ‘It is Mary who Peter loves.’
Marit Péter szereti. ‘It is Peter who loves Mary.’
Szereti Péter Marit. ‘Peter LOVES Mary (and not hates).’
Szereti Marit Péter. ‘Peter LOVES Mary (and not hates).’

5 Morphosyntactic features of Hungarian
Nominal declension (nouns, adjectives, numerals)
Verbal conjugation
Several hundred word forms for each lemma
Grammatical relations are encoded primarily by morphemes -> morpho + syntactic

6 Nominal suffixes
A stem can be extended by:
–Derivational suffixes
–Plural
–Possessive
–Case suffixes
hat-ás-a-i-nak ‘to its effects’
stem-DERIV.SUFF-POSS-POSS.PL-DAT
egész-ség-ed-re ‘cheers’
stem-DERIV.SUFF-POSS.Sg2-SUB

7 Case suffixes in Hungarian
~20 cases ("rare" cases are not always counted: distributive-temporal (-nte), associative (-stul/-stül), …)
Always at the right end of the word form
Grammatical relations are encoded:
–Arguments of the verb
–Adjuncts (temporal and locative adverbials)

8 …and in English
Pisti szerdánként edzésre jár.
Steve Wednesday-DIST-TEMP training-SUB go-3Sg
‘Each Wednesday Steve goes to training.’
Szerdánként – each Wednesday
Edzésre – to training

9 Pisti bort iszik.
Steve wine-ACC drink-3Sg
‘Steve is drinking wine.’
Pisti-NOM – Steve – subject
Bort – wine – object

10 Possessive in Hungarian
A fiú kutyája
The boy dog-POSS
‘The boy’s dog’
A(z ő) kutyája
The (he) dog-POSS
‘His dog’
Possessor in nominative
Possessed with a possessive marker
A fiúnak a kutyája
The boy-DAT the dog-POSS
Possessor in dative
Possessed with a possessive marker

11 …and in English
The boy’s dog
His dog
Possessor with a possessive marker (pronoun)
Possessed with no marker
The dog of the boy
Possessive relation is marked by a preposition

12 Hungarian vs. English - nouns
Number of word forms: several hundred (HU) vs. 2-3 (EN)
Means to express grammatical relations:
–Suffixes (HU)
–Preposition, fixed position (word order), suffix, determiner (EN)
Methods for morphological parsing are very different for Hungarian and English

13 Verbal suffixes
A stem can be extended by:
–Derivational suffixes
–Mood markers
–Tense markers
–Person/number suffixes
–Objective markers
vág-at-ná-k
cut-CAUS-COND-3PlObj
‘they would have it cut’

14 Mood and tense in Hungarian
Mood:
–Indicative: default (not marked)
–Conditional: suffixes (present), analytic form (past)
–Imperative: suffixes
Tense:
–Present: default (not marked)
–Past: suffixes
–Future: analytic (auxiliary fog)

15 …and in English
Mood:
–Indicative: default (not marked)
–Conditional: past tense forms + analytic forms (auxiliary would)
–Imperative: auxiliaries + grammatical structure
Tense:
–Present: default (not marked)
–Past: suffix / irregular forms (suppletives or ablaut (vowel change))
–Future: analytic (auxiliary will)

16 Person & Number
Hungarian: suffixes
fut-ok, fut-sz, fut, fut-unk, fut-tok, fut-nak
3Sg is the default (not marked!)
English: 3Sg + pronouns / obligatory subject
I run, you run, he runs, we run, you run, they run
3Sg marked!

17 Derivational suffixes in Hungarian
Possibility/permission: fut-hat-ok, run-MOD-1Sg, ‘I may run’
Reflexive: mos-akod-unk, wash-REFL-1Pl, ‘we wash ourselves’
Frequentative: üt-öget-sz, hit-FREQ-2Sg, ‘you hit sg repeatedly’
Causative: csinál-tat-nak, do-CAUS-3Pl, ‘they have sg done’

18 …and in English
Possibility/permission: auxiliaries
Reflexive: pronominal objects
Frequentative: adverb
Causative: construction

19 Hungarian vs. English - verbs
Number of word forms: several hundred (HU) vs. 4-5 (EN)
Means to express grammatical relations:
–Suffixes + auxiliaries (HU)
–Auxiliaries + reflexive pronouns + constructions (EN)
A lot of syntactic information is encoded in Hungarian morphemes

20 Morphology | Syntax | English
Nominal suffix | verb-argument relation | word order, preposition
Nominal suffix | possessive | suffix, preposition
Verbal suffix | tense | suffix
Verbal suffix | agreement | pronoun, suffix
Verbal suffix | modality | auxiliary
Verbal suffix | causation | construction
Verbal suffix | aspect | construction
Verbal suffix | reflexivity | pronoun

21 Morphosyntactic coding systems
Language independent (?)
Language dependent
(Dis)advantages:
–comparability
–considering language-specific features
–complexity
Different information is necessary for each language

22 Hungarian coding systems
HUMOR
–recall Thursday Session 1
–in the Hungarian National Corpus
MSD
–In Szeged Treebank
–Parser and POS-tagger available at: http://www.inf.u-szeged.hu/rgai/magyarlanc
KR
–No database
–Parser and POS-tagger available at:

23 MSD
Morphosyntactic Description
International coding system:
–English
–Romanian
–Slovenian
–Czech
–Bulgarian
–Estonian
–Hungarian

24 MSD - 2
Positional codes
A given position encodes a given type of information
Position 0: part-of-speech
Position 1: (sub)type within POS
Further positions: other grammatical information (person, number, case, etc.)
Irrelevant positions are marked with a hyphen (-)

25 KR
Created for Hungarian
Hierarchical attribute-value matrices
Default values (3Sg, singular…)
Derivational information is encoded
Compounds are also segmented

26 MSD vs. KR
Differences between the two systems:
–derivation
–compounds
Harmonization efforts aim at a morphological parser (magyarlanc) whose output is fully consistent with the Szeged Treebank (Farkas et al. 2010)

27 Nouns in MSD
Word form | Lemma | Code | Gloss
kutya | kutya | Nc-sn | ‘dog’
kutyámat | kutya | Nc-sa---s1 | ‘my dog-ACC’
kutyaházaikról | kutyaház | Nc-ph---p3 | ‘about their doghouses’
Obamához | Obama | Np-st | ‘to Obama’

28 Verbs in MSD
Word form | Lemma | Code | Gloss
futok | fut | Vmip1s---n | ‘I run’
futhatsz | fut | Voip2s---n | ‘you can run’
ütögették | üt | Vfis3p---y | ‘they were hitting it’
csináltattunk | csinál | Vsis1p---n | ‘we had sg made’
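The positional layout of MSD codes can be sketched in a few lines of Python. This is only an illustration of the principle stated on the slides (position 0 is the part of speech, position 1 the subtype, a hyphen marks an irrelevant position); the function name `decode_msd` is hypothetical, and the meaning of the later positions is deliberately left uninterpreted because the slides do not list it.

```python
def decode_msd(code):
    """Split a positional MSD code into POS, subtype and the filled later positions.

    Only positions 0 and 1 are named, as on the slides; later positions are
    returned by index without interpretation.
    """
    if not code:
        raise ValueError("empty MSD code")
    pos = code[0]
    subtype = code[1] if len(code) > 1 and code[1] != "-" else None
    # Keep only positions that carry information (a hyphen means "irrelevant").
    features = {i: ch for i, ch in enumerate(code) if i >= 2 and ch != "-"}
    return {"pos": pos, "subtype": subtype, "features": features}


for word, code in [("kutyámat", "Nc-sa---s1"), ("futok", "Vmip1s---n")]:
    print(word, decode_msd(code))
```

A full decoder would map each position index to an attribute name per part of speech, which requires the MSD specification for Hungarian.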

29 Morphosyntactically annotated Hungarian corpora
Hungarian National Corpus
–100-million-word balanced reference corpus of present-day Hungarian
–Word forms automatically annotated for stem, part of speech and inflectional information
–http://corpus.nytud.hu/mnsz/index_eng.html
Szeged Treebank
–1 million words, 82K sentences
–Manually annotated for lemma, POS-tags
–Constituency and dependency trees
–http://www.inf.u-szeged.hu/rgai/nlp

30 Szeged Treebank
Manually annotated treebank for Hungarian
–Covers various linguistic styles: literature, newspapers, laws, student essays, computer books, etc.
–Multilingual connection: Orwell’s 1984; Win2000 manual in Hungarian
–Available free of charge for research
Developed by
–University of Szeged, HLT group
–MorphoLogic Ltd.
–Academy of Sciences, Research Institute for Linguistics

31 Szeged Treebank 2.
TEI XML format
Manually annotated
–sentence split & word segmentation
–morphological analysis
–PTB-style syntactic structure
–Verb argument structure
–converted / extended to Dependency Grammar format manually

32 Szeged Treebank 3.
Several versions
Constituency and dependency versions
Old MSD codes
New (harmonized) MSD codes
(Dependency) parser under development
Being extended with folklore texts

33 Dependency vs. constituency
Each node corresponds to a word -> no virtual nodes (CP, I’…) in dependency trees
Constituency grammars are said to be good for languages with fixed word order
Syntactic relations are determined
–by the position in the tree (constituency grammar)
–by dependency relations (labeled edges) (dependency)

34 Constituency trees in SzT2.0
Based on generative syntax (É. Kiss et al. 1999)
Syntactic features of Hungarian also considered (i.e. not hardcore Chomskyan trees)
Verb-argument relations are encoded by labels
Very detailed information: different grammatical role for each case suffix
Semantic information can also be found (temporal and locative adverbials)

35 Aggie all relative-POSS-ACC the day before yesterday see-PAST-3Sg-Obj guest-ESS
‘Aggie received all of her relatives the day before yesterday.’

37 Dependency trees in Szeged Dependency Treebank
Based on SzT2.0
Automatic conversion and manual correction
Word forms are the nodes of the tree
Simplified relations for nominal arguments: SUBJ, OBJ, DAT, OBL, ATT
Semantic information kept
Sentences without 3Sg copula are distinctively marked

38 Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions.

39 Virtual nodes
No overt copula in present tense 3Sg
Only the subject and the predicative noun/adjective are overt
No syntactic structure in SzT (grammatical roles are not marked)
Virtual nodes in SzDT

40 I like to go to school because it is good to be at school though not always.

41 Szeged Treebank vs. Szeged Dependency Treebank
Labeled relations in both cases -> not so sharp contrast
Virtual nodes in SzDT -> grammatical structure marked for every sentence (IE, MT)
No word order constraints in SzDT
Word forms are marked
Other possibilities: morpheme-based syntax (Prószéky et al. (1989), Koutny, Wacha (1991))

42 Language-specific morphosyntactic problems
Morphology vs. syntax:
–Pseudo-subjects
–Pseudo-objects
–Pseudo-datives
Morphological analysis of unknown words
Lemmatization of named entities

43 Pseudo-subjects
A noun in nominative is not the subject of the sentence -> special attention required when parsing
Possessor:
a kisfiú labdája
the boy ball-3SgPOSS
‘the boy’s ball’
Predicative noun:
István juhász maradt.
Stephen shepherd remain-PAST
‘Stephen remained a shepherd.’
Object:
A kutyám kergeti a macska.
The dog-POSS chase-3SgObj the cat
‘The cat is chasing my dog.’ (garden path sentence)
A fiam szereti a lányod.
The son-1SgPOSS love-3SgObj the daughter-2SgPOSS
‘My son loves your daughter’ or ‘Your daughter loves my son’

44 Solutions
Possessor:
–SzT: one NP includes the possessor and the possessed ((a kisfiú) labdája)
–SzDT: ATT relation
Predicative noun: PRED relation
–Virtual node in SzDT
Object: OBJ relation
–Sometimes contextual information is needed even for humans…

45 Pseudo-objects
Adverbials with an apparently accusative ending:
Futottam egy jót.
Run-PAST-1Sg a good-ACC
‘I have had a good run.’
Nagyot aludtam.
Big-ACC sleep-PAST-1Sg
‘I have slept a lot.’
Intransitive verbs -> cannot be an object -> MODE relation

46 Pseudo-datives
Not all (semantic) subjects are in nominative:
Dative subject:
Sándornak kell elrendeznie az ügyeket.
Alexander-DAT must arrange-INF-3Sg the issue-PL-ACC
‘Alexander has to arrange the issues.’
DAT in both corpora
Certain auxiliaries with dative subjects (exceptions)
Dative-nominative parallelism in possessive as well

47 Unknown words
Unknown words can be:
–Compounds
–Named entities
–Derivations
Examples: fémkapunk, félmillió, csokinyúl, NATO-hoz
Methods for analysis (Zsibrita et al. 2010):
–Segmentation into two or more analyzable parts
–Expert rules to filter impossible combinations (*V+N)
–Analysis of the last part goes to the whole word
–Substitution for hyphenated words (pre-defined patterns for each morphological class)

48 félmillió
fél+millió: Mc-snl
fél: N ‘half’, ADJ ‘half’, NUM ‘half’, V ‘be afraid’
millió: NUM ‘million’
Expert rules:
NUM + NUM
* non-NUM + NUM

49 fémkapunk
fém+kap+unk: Vmip1p---n
fém+kapu+nk: Nc-sn---p1
fém: N ‘metal’
kap: V ‘get’
kapu: N ‘gate’
unk: S 1Pl (verb)
nk: S 1PlPoss (noun)
Expert rules:
N + N
N-nonNOM + V
* N-NOM + V

50 csokinyúl
csoki+nyúl: Vmip3s---n, Nc-sn
cso+kinyúl (?): Vmip3s---n
csoki: N ‘chocolate’
nyúl: N ‘rabbit’, V ‘stretch’
kinyúl: V ‘stretch out’
Expert rules:
N + N
N-nonNOM + V
* N-NOM + V
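The segmentation-plus-expert-rules idea from the preceding slides can be sketched as follows. This is a toy illustration, not the method of Zsibrita et al. (2010): the lexicon contains only the bare stems shown on the slides, the names `LEXICON`, `ALLOWED` and `segment` are hypothetical, and the N-nonNOM + V rule is left out because the toy lexicon has no case-marked nouns.

```python
# Toy lexicon: stem -> possible POS tags, taken from the slide examples.
LEXICON = {
    "fél": {"N", "ADJ", "NUM", "V"},
    "millió": {"NUM"},
    "csoki": {"N"},
    "nyúl": {"N", "V"},
    "kinyúl": {"V"},
}

# Expert rules kept in this sketch: NUM + NUM and N + N are allowed.
# Everything else (e.g. non-NUM + NUM, N-NOM + V) is filtered out.
ALLOWED = {("NUM", "NUM"), ("N", "N")}


def segment(word):
    """Return all two-part analyses of `word` that pass the expert rules."""
    analyses = []
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left not in LEXICON or right not in LEXICON:
            continue
        for left_tag in sorted(LEXICON[left]):
            for right_tag in sorted(LEXICON[right]):
                if (left_tag, right_tag) in ALLOWED:
                    # The tag of the last part becomes the tag of the whole word.
                    analyses.append((left, left_tag, right, right_tag))
    return analyses


print(segment("félmillió"))
print(segment("csokinyúl"))
```

With this lexicon each example keeps a single analysis: the numeral reading for félmillió and the noun compound reading for csokinyúl.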

51 NATO-hoz
NATO: V -> Vmip3s---n
NATO-hoz (kalaphoz), NATO: N -> Np-st
Ordering of rules: 1. substitution 2. segmentation
NATO: ?
hoz: V ‘bring’, S ‘to’
Expert rules:
N S
N-nonNOM V
* N-NOM V
V V
Substitution: NATO- -> kalap ‘hat’

52 Lemmatization
Lemmatization (i.e. dividing the word form into its root and affixes) is not a trivial task in morphologically rich languages such as Hungarian
Common nouns: relying on a good dictionary
NEs: cannot be listed
Problem: the NE ends in an apparent suffix

53 Lemmatization of NEs
Each ending that seems to be a possible suffix is cut off the NE step by step
Citroenben:
Citroenben (lemma)
Citroen + ben ‘in (a) Citroen’
Citroenb + en ‘on (a) Citroenb’
Citroenbe + n ‘on (a) Citroenbe’
Each possible lemma undergoes a Google and a Yahoo search, and the most frequent one is chosen (Farkas et al. 2008)
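The candidate generation and frequency-based choice described above can be sketched like this. It is a minimal illustration under stated assumptions: the suffix list is limited to the three endings from the Citroenben example, the function names are hypothetical, and the frequency table holds invented placeholder counts rather than real search-engine results (no web search is performed).

```python
# Suffixes from the slide example only; a real system would use a full suffix list.
SUFFIXES = ["ben", "en", "n"]


def candidate_lemmas(word_form, suffixes=SUFFIXES):
    """Return the word form itself plus every stem obtained by cutting off a suffix."""
    candidates = [word_form]
    for suffix in suffixes:
        if word_form.endswith(suffix) and len(word_form) > len(suffix):
            candidates.append(word_form[: -len(suffix)])
    return candidates


def pick_lemma(candidates, frequency):
    """Choose the candidate with the highest frequency (unseen candidates count as 0)."""
    return max(candidates, key=lambda c: frequency.get(c, 0))


# Invented placeholder counts standing in for web hit counts.
toy_counts = {"Citroenben": 10, "Citroen": 1000, "Citroenb": 0, "Citroenbe": 1}

candidates = candidate_lemmas("Citroenben")
print(candidates)
print(pick_lemma(candidates, toy_counts))
```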

54 NLP applications
NER
–NEs with suffixes
Information extraction
–Modality, uncertainty
–Causation
Machine translation
–Morphemes vs. structures

55 Named Entities
NEs should be recognized
They should be morphosyntactically tagged -> proper syntactic/semantic analysis
A Citroenben a Peugeot meghatározó tulajdonhányadot szerez.
Mini dictionary + suffix list + semantic frame

56 Form | Category | Gloss
a | DET | the
ben | S | in
Citroenben | ? |
en | S | on
meghatározó | ADJ | dominant
n | S | on
ot | S | ACC
Peugeot | ? |
szerez | V | acquire
t | S | ACC
tulajdonrész | N | interest

57 Possible analyses
Citroenben
Citroen + ben ‘Citroen-INE’
Citroenb + en ‘Citroenb-SUP’
Citroenbe + n ‘Citroenbe-SUP’
Peugeot
Peugeo + t ‘Peugeo-ACC’
Peuge + ot ‘Peuge-ACC’

58 A semantic frame
[1=V("szerez"|"vásárol"|"vesz"|"megvesz"|"megvásárol"|"felvásárol")+subject=2+direct_object=3]
[2=N]
[3=N("részesedés"|"tulajdon"|"tulajdonrész"|"rész"|"tulajdonhányad")+compl1=4+modified_by_adj=5]
[4=N+case=ine+ceg]
[5=A+measure+modified_by_number=6]
[6=NB]

59 Analysis
A Citroenben a Peugeot meghatározó tulajdonhányadot szerez.
Tulajdonhányadot -> ACC/OBJ (3)
Citroenben -> INE (4)
Peugeot -> NOM/SUBJ (2)
‘Peugeot acquires a dominant interest in Citroen.’
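How a semantic frame can pick the right analysis among the competing segmentations may be sketched as below. This is a simplified stand-in for the frame notation on the slides, not its actual implementation: the names `ANALYSES`, `fits_frame` and `resolve` are hypothetical, the unsegmented reading of Peugeot is listed as NOM, and the frame is reduced to three requirements (one nominative subject, one accusative object from the frame's noun list, one inessive complement).

```python
from itertools import product

# Object nouns listed in slot 3 of the frame.
OBJECT_LEMMAS = {"részesedés", "tulajdon", "tulajdonrész", "rész", "tulajdonhányad"}

# Competing (lemma, case) analyses per word form, following the slides.
ANALYSES = {
    "Citroenben": [("Citroen", "INE"), ("Citroenb", "SUP"), ("Citroenbe", "SUP")],
    "Peugeot": [("Peugeot", "NOM"), ("Peugeo", "ACC"), ("Peuge", "ACC")],
    "tulajdonhányadot": [("tulajdonhányad", "ACC")],
}


def fits_frame(combo):
    """Check one combination of analyses against the simplified frame."""
    cases = [case for _, case in combo]
    objects = [lemma for lemma, case in combo if case == "ACC"]
    has_subject = cases.count("NOM") == 1
    has_object = len(objects) == 1 and objects[0] in OBJECT_LEMMAS
    has_complement = cases.count("INE") == 1
    return has_subject and has_object and has_complement


def resolve(analyses):
    """Return every assignment of analyses to tokens that satisfies the frame."""
    tokens = list(analyses)
    return [
        dict(zip(tokens, combo))
        for combo in product(*analyses.values())
        if fits_frame(combo)
    ]


print(resolve(ANALYSES))
```

Only one of the nine combinations survives, which matches the analysis on the slide: Citroen in the inessive, Peugeot as the subject.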

60 Uncertainty
Text mining:
–derive facts from free text
–uncertainty and negation have an impact on the quality/nature of the information extracted
–applications have to treat sentences / clauses containing uncertain or negated information differently from factual information
Uncertainty: possible existence of a thing (neither its existence nor its non-existence is claimed)

61 Uncertainty detection
Uncertainty detection in English: cues (words with uncertain content)
One typical means to express uncertainty in Hungarian: -hat/het
High school grades may influence health.
A középiskolai jegyek kihathatnak az egészségre.
Morphological analysis should reflect modality (Voip3s---n)

62 Causation
Semantic/thematic relations to be determined properly
AGENT != SUBJECT
Varrattam egy ruhát.
sew-CAUS-PAST-1Sg a dress-ACC
‘I had a dress sewn.’
Varrattam Marival egy ruhát.
sew-CAUS-PAST-1Sg Mari-INS a dress-ACC
‘I had Mary sew a dress.’
Varrtam Marival egy ruhát.
sew-PAST-1Sg Mari-INS a dress-ACC
‘I sewed a dress with Mary.’
Causative information should be encoded (Vsip3s---n)
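In the MSD examples on these slides, modality and causation both surface in position 1 of the verb code (`Vo…` for -hat/-het forms, `Vs…` for causatives). A downstream information-extraction step could read them off as sketched below; the function name `verb_flags` is hypothetical, and the reading of the two letters is inferred from the slide examples only.

```python
def verb_flags(msd):
    """Flag uncertainty (modal -hat/-het) and causation from a verb MSD code.

    Assumes, based on the slide examples, that position 1 is 'o' for
    modal verbs and 's' for causative verbs. Non-verbs get no flags.
    """
    if len(msd) < 2 or msd[0] != "V":
        return {"uncertain": False, "causative": False}
    return {"uncertain": msd[1] == "o", "causative": msd[1] == "s"}


print(verb_flags("Voip3s---n"))  # modal verb form
print(verb_flags("Vsip3s---n"))  # causative verb form
print(verb_flags("Vmip1s---n"))  # plain main verb
```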

63 Argument structure of causative verbs
Sentence | Agent | Beneficiary | Patient
Varrattam egy ruhát. | ? | I (NOM) | ruha (ACC)
Varrattam Marival egy ruhát. | Mari (INS) | I (NOM) | ruha (ACC)
Varrtam Marival egy ruhát. | I (NOM) + Mari (INS) | ? | ruha (ACC)

64 Machine translation
Morpheme-based translation would be ideal
Easier alignment of translational units
Good morphological parser needed
Easier to execute in dependency grammar
Morpheme-based dependency structures

65 Alignments
at | varr | t | ruha
have | sewn | dress
ban | ház | am
in | house | my

66 Problems
Not practical: no corpus available at the moment
Portmanteau morphs: alignment problems
Zero morphs: how many of them?
3 zero morphs in Hungarian nouns (Mel’cuk 2006):
könyv-Ø-Ø-Ø vs. könyveit
book-Ø-Ø-Ø vs. book-POSS-POSS.PL-ACC

67 Morphosyntactic codes might help
Csinálhattátok Vois2p---y
Reordering rules
Code | Morpheme | Gloss
V | csinál | do
o | hat | can
i | - | -
s | t | PAST
2p | tok | you
y | á | it
csinál-hat-t-á-tok: you could do it
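The reordering idea can be sketched as a small function that reads the relevant positions of a verb code and emits English words in English order. This is an illustrative toy, not an actual MT component: `gloss_verb` and `PRONOUNS` are hypothetical names, the position meanings (1 = subtype, 3 = tense, 4 and 5 = person and number, 9 = definite object) are inferred from the slide examples, and English verb agreement and past-tense morphology are ignored.

```python
# Person/number of the subject -> English pronoun.
PRONOUNS = {
    ("1", "s"): "I", ("2", "s"): "you", ("3", "s"): "he/she",
    ("1", "p"): "we", ("2", "p"): "you", ("3", "p"): "they",
}


def gloss_verb(msd, english_verb):
    """Produce a rough English gloss for a Hungarian verb form from its MSD code."""
    words = [PRONOUNS[(msd[4], msd[5])]]       # subject pronoun comes first
    past = msd[3] == "s"
    if msd[1] == "o":                          # modal -hat/-het
        words.append("could" if past else "can")
        words.append(english_verb)
    else:
        # Past-tense morphology is not generated in this sketch.
        words.append(f"{english_verb}[PAST]" if past else english_verb)
    if msd[9] == "y":                          # definite (objective) conjugation
        words.append("it")
    return " ".join(words)


print(gloss_verb("Vois2p---y", "do"))   # csinálhattátok
print(gloss_verb("Vmip1s---n", "run"))  # futok
```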

68 An example
Morpheme-based dependency trees (head -> dependents):
hat -> csinál -> (t, á, tok)
can -> do -> (d, Ø, you)
could -> (you, do)

69 Syntax vs. case suffix | Pseudo-subject | Extra rules; PRED, OBJ difficult for humans
Syntax vs. case suffix | Pseudo-object | List of adverbs with accusative ending
Syntax vs. case suffix | Pseudo-dative | List of verbs with dative subject
Unknown words (lemmas+suffixes) | | Guessing (rules)
Information extraction | Thematic/semantic relations | Proper morphosyntactic codes + rules
Information extraction | Uncertainty detection | Proper morphosyntactic codes
Machine translation (morpheme-based) | | Proper morphosyntactic codes

70 Summary
Syntax-morphology interface in Hungarian
Morphological coding systems
Syntactic annotation in Hungarian corpora
Morphosyntactic problems:
–NER
–IE
–MT

71 References
É. Kiss K., Kiefer F., Siptár P.: Új magyar nyelvtan. Osiris Kiadó, Bp.
Farkas Richárd, Szeredi Dániel, Varga Dániel, Vincze Veronika 2010: MSD-KR harmonizáció a Szeged Treebank 2.5-ben. In: Tanács Attila, Vincze Veronika (eds.): VII. Magyar Számítógépes Nyelvészeti Konferencia. Szeged, Szegedi Tudományegyetem.
Farkas, Richárd; Vincze, Veronika; Nagy, István; Ormándi, Róbert; Szarvas, György; Almási, Attila 2008: Web-based lemmatisation of Named Entities. In: Horák, Ales; Kopeček, Ivan; Pala, Karel; Sojka, Petr (eds.): Proceedings of the 11th International Conference on Text, Speech and Dialogue (TSD2008), Berlin, Heidelberg, Springer Verlag, LNCS 5246.
Koutny I., Wacha B.: Magyar nyelvtan függőségi alapon. Magyar Nyelv Vol. 87 No. 4. (1991) 393–404.
Mel’cuk, Igor 2006: Aspects of the Theory of Morphology. Mouton de Gruyter.
Prószéky, G., Koutny, I., Wacha, B.: Dependency Syntax of Hungarian. In: Maxwell, Dan; Klaus Schubert (eds.) Metataxis in Practice (Dependency Syntax for Multilingual Machine Translation), Foris, Dordrecht, The Netherlands (1989) 151–181.
Zsibrita János, Vincze Veronika, Farkas Richárd 2010: Ismeretlen kifejezések és a szófaji egyértelműsítés. In: Tanács Attila, Vincze Veronika (eds.): VII. Magyar Számítógépes Nyelvészeti Konferencia. Szeged, Szegedi Tudományegyetem.