1 Kenneth Beesley/ 30 July 2000 / page 1
Semitic Languages, Linguistics and Computers
Kenneth R. BEESLEY, Xerox Research Centre Europe (XRCE), ken.beesley@xrce.xerox.com
University of Malta, March 2001

2 Ken Beesley: Brief Introduction
B.A., Linguistics and Computer Science, Brigham Young University, 1978
Diploma, Linguistics and Phonetics, Univ. of Glasgow, 1979
D.Phil., “Epistemics” (Cognitive Science), Univ. of Edinburgh, 1983
ALPNET, computer-assisted translation, 1984-1990
1988-1990: Arabic morphology project; exposure to Finite-State Morphology (“Two-Level Morphology”) from Lauri Karttunen at COLING 1988
Microlytics (Xerox spinoff), 1990-1993
Xerox Corporation, 1993-present
Computational Morphology projects: Arabic, Spanish, Portuguese, Italian, Dutch, (Malay), (Aymara); also teaching finite-state programming techniques
Some people are into finite-state programming for the mathematics and algorithms; I’m in it because it lets me build working systems for interesting natural languages.

3 Overview of Today’s Talk
Formal Morphology
–Morphotactics: study and description of word formation
–Morphophonology: study and description of alternations
Challenges/Issues in Semitic morphology
Computational Morphology (Finite-State Morphology paradigm)
–General challenges/successes around the world
–Semitic languages: always seem to be a bit harder
Significant computational work already done on Semitic languages
Hope to inspire more

4 Concatenative-Polysynthetic (Inuktitut)
Lexical: natsiq+viniq+tuq+lauq+si+ma+vi+t+li
Surface: natsi viniq tu lauq si ma vi l li
natsiq = “seal” (open-class stem)
viniq = “meat” (closed-class substem)
tuq = “eat” (closed-class substem)
lauq = “before”
si = perfective
ma = resulting state
vi = question marker
t = you
li = but
“(but) have you ever eaten seal meat before?”

5 Inuktitut
Lexical: Paris+mut+nngau+juma+niraq+lauq+si+ma+nngit+junga
Surface: Pari mu nngau juma nira lauq si ma nngit tunga
Paris = ‘Paris’
mut = terminalis-case
nngau = direction-to
juma = want
niraq = declare that
lauq = past
si = perfective
ma = resulting state
nngit = negative
junga = 1P pres. indic
“I never said that I wanted to go to Paris”

6 Concatenative-Agglutinative (Aymara)
Lexical: uta+ma+na-ka+p+xa+raki-i+wa
Surface: uta ma n ka p xa rak i wa
uta = house (noun stem)
+ma = 2nd person possessive (your)
+na = in (case suffix)
-ka = locative (also verbalizes)
+p = plural
+xa = perfect aspect
+raki = also
-i = 3rd person present tense
+wa = affirmative sentential
“also they are in your house”

7 Aymara
Morphophonemic: Ch’uñu +: +wi +na -ka +si -ka -iri +: +ya:t(a) +wa
Surface: Ch’uñü wi n ka s k irï yät wa
ch’uñu = N, ‘freeze-dried potatoes’
+: = N>V, be/make …
+wi = V>N, place-of
+na = in (location)
-ka = N>V, be-in (location)
+si = continuative
-ka = imperfect
-iri = V>N, one who
+: = N>V, be
+ya:ta = 1P recent past
+wa = affirmative sentential
‘I was (one who was) always at the place for making ch’uñu’

8 Theory-Neutral Morphological Analysis
[Diagram: Words => Black-Box Morphological Analyzer => Analyses; example word: Ch’uñüwinkaskirïyätwa]
Analysis undoes the morphotactic and morphophonological processes, separating and identifying the morphemes.
Generation is ideally just the inverse of analysis.

9 The Claim/Goal of Xerox Finite-State Morphology
Both the morphotactics and the morphophonological alternations can be described with regular expressions, or equivalent shorthand notations, which are compiled into finite-state transducers (networks)
[Diagram: the Morphotactic Description (regular expression or lexc) and the Alternation Rules (regular expression) each go through the Compiler to give an FST; the two FSTs are combined via composition (.o.) at compile-time into a single FST, the “Lexical Transducer”]

10 A Properly Defined “Lexical Transducer” FST can Perform Morphological Analysis and Generation
[Diagram: a finite-state network relates the inflected form “veut” to the canonical form plus inflection codes “vouloir+IndP+SG+P3”, in both directions]
bidirectional
–same network for both analysis and generation
efficient
–process thousands of words/second
compact
–less than 1MB in compressed form
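
The bidirectionality point can be illustrated with a toy Python sketch. It is not a finite-state transducer: it simply stores the relation as explicit (lexical, surface) pairs and queries it from either side. The names RELATION, analyze and generate are hypothetical, and the only pair used is the one shown on the slide.

```python
# Toy stand-in for a lexical transducer: a relation stored as explicit pairs.
# A real FST encodes the same kind of relation compactly as a network.
RELATION = {
    ("vouloir+IndP+SG+P3", "veut"),  # pair taken from the slide
}

def analyze(surface):
    """Map an inflected (surface) form to its lexical analyses."""
    return sorted(upper for upper, lower in RELATION if lower == surface)

def generate(lexical):
    """Map a lexical analysis back to its surface forms (the inverse direction)."""
    return sorted(lower for upper, lower in RELATION if upper == lexical)

print(analyze("veut"))
print(generate("vouloir+IndP+SG+P3"))
```

Both functions read the same data, which is the sense in which one network serves for analysis and generation.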

11 Why is finite-state power interesting?
Formally constrained (not just a bunch of ad hoc code)
Flexible: grammars compile into finite-state automata (networks) that can themselves be combined and modified without needing to change the original grammar
Networks provide efficient storage
Networks can be “applied” very efficiently: morphological analyzers typically run at thousands of words per second on modern machines
Networks are bi-directional
The application code is language-independent

12 Some Aymara alternation rules
a -> ä, i -> ï, u -> ü || _ ":"
[ a | i | u ] -> 0 || _ "-"
c h i "-" -> s || _ [ t | s ]
s t ä (->) t ä s || k i _ t a
You can see and download the set of real Aymara alternation rules at http://www.xrce.xerox.com/research/mltt/aymara
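
As a rough illustration only, the first two rules can be approximated in Python with regular-expression lookahead. This is not xfst and does not reproduce rule composition or the full Aymara rule set; the function names are hypothetical, and the sketch assumes the conditioning symbol (":" or "-") immediately follows the vowel in the input string.

```python
import re

# Long-vowel counterparts used by the first rule.
LONG = {"a": "ä", "i": "ï", "u": "ü"}

def lengthen_before_colon(s):
    """Approximates: a -> ä, i -> ï, u -> ü || _ ":" """
    return re.sub(r"[aiu](?=:)", lambda m: LONG[m.group(0)], s)

def delete_before_hyphen(s):
    """Approximates: [ a | i | u ] -> 0 || _ "-" """
    return re.sub(r"[aiu](?=-)", "", s)

def show_morphs(s):
    """Display helper (an assumption, not a rule): turn boundaries into spaces."""
    return re.sub(r"[+-]", " ", s)

lexical = "uta+ma+na-ka+p+xa+raki-i+wa"
print(show_morphs(delete_before_hyphen(lexical)))
```

Applied to the lexical string from the earlier Aymara slide, the vowel-deletion rule alone yields the surface segmentation shown there (uta ma n ka p xa rak i wa).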

13 Finite-State Morphology
Software Implementations: Development Environments
–“Two-Level” Morphology (e.g. PC-KIMMO)
–Xerox Finite-State Morphology (lexc, xfst, twolc, …)
–AT&T Library, Lextools
–Univ. of Groningen, Fsa Utils 6
Morphological Applications
–All the commercially interesting Indo-European languages
–Also Finnish, Hungarian, Turkish, Swahili, Korean, Japanese
–Significant research in Irish, Basque, Malay, Aymara, …

14 Criticism of Traditional Finite-State Morphotactics
Two-Level and Finite-State Morphology in general have been widely criticized for handling only “concatenative morphotactics”.
“Only restricted infixation and reduplication can be handled adequately with the present system. Some extensions or revisions will be necessary for an adequate description of languages possessing extensive infixation or reduplication.” (Koskenniemi, 1983, p. 27)
In particular, it is often charged that finite-state morphology is not capable of handling Semitic languages.

15 The Challenge of Fixed-length Reduplication in Tagalog (Antworth 1990:156-162)
pili ‘choose’ => pipili
tahi ‘sew’ => tatahi
kuha ‘take’ => kukuha
Antworth defines a morphophonemic lexical prefix RE+ plus alternation rules that realize R as the first following consonant and E as the first following vowel.
Lexical: RE+pili RE+tahi RE+kuha
Surface: p i pili, t a tahi, k u kuha
This solution is adequate and even elegant for such fixed-length reduplication.
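
The effect of the RE+ prefix can be sketched procedurally in Python. This is only an illustration of the mapping, not Antworth's two-level rules; the function name realize_re and the five-vowel inventory are assumptions made for the example.

```python
VOWELS = set("aeiou")  # assumed vowel inventory for this sketch

def realize_re(lexical):
    """Realize R as the first following consonant and E as the first following vowel."""
    prefix, sep, stem = lexical.partition("+")
    if sep == "" or prefix != "RE":
        return lexical  # no reduplication prefix: leave the form unchanged
    first_consonant = next((ch for ch in stem if ch not in VOWELS), "")
    first_vowel = next((ch for ch in stem if ch in VOWELS), "")
    return first_consonant + first_vowel + stem

for form in ("RE+pili", "RE+tahi", "RE+kuha"):
    print(form, "=>", realize_re(form))
```

Because the copied material has a fixed length (one consonant, one vowel), the mapping stays within what ordinary alternation rules can express.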

16 Challenge: Malay/Indonesian Full-Stem Reduplication
Simple reduplication: buku+redup[~]
–Stem: buku (“book”); buku-buku “books”
Prefixed reduplication: bagi+meN[redup[~]]
–Stem: bagi (“divide”); membagi-bagi “divide into separate parts”
Prefixed reduplication: pijit+meN[redup[~]]
–Stem: pijit (“get a massage”); memijit-mijit “squeeze”
Redup, prefix-suffix: merah+ke[redup[~]]an
–Stem: merah (“red”); kemerah-merahan “reddish”
Prefix-suffix, redup: ubah+redup[per~an]
–Stem: ubah (“difference”); perubahan-perubahan “alternations/changes”

17 The Xerox compile-replace algorithm
An algorithm that takes a finite-state network as an argument and returns a modified (still finite-state) network
Can be applied to the upper-side and/or the lower-side of a network, perhaps multiple times.
compile-replace
–finds delimited substrings of the form ^[ string ^], where the string is just a string of symbols, joined by concatenation, but which happens to have the format of a regular expression
–compiles the string as a regular expression, and then
–replaces the delimited substring with the result of the compilation.

18 The (Xerox) finite-state iteration operator
^n: n concatenations, for any integer n
A^2 denotes two concatenations of the language A with itself, equivalent to [A A].
A = {bagi}
A^2 = {bagibagi}
Finite-state languages and relations are closed under n-ary concatenation.

19 Iteration in Morphotactics: Malay
define pref 0 .x. "^[" "{" ;
define root b a g i | p e r a t u r a n ;
define suff "+Noun":0 [ [ "+Pl" .x. "}" "^" "2" "^]" ] | [ "+Sg" .x. 0 ] ] ;
define Nouns (pref) root suff ;
The resulting intermediate FST will relate string pairs like the following (we filter out strings with unmatched delimiters ^[ and ^] )
Singular, Upper: bagi +Noun +Sg
Singular, Lower: bagi 0 0
Plural, Upper: 0 0 bagi +Noun +Pl
Plural, Lower: ^[ { bagi 0 }^2^]

20 compile-replace: before and after
Before:
Upper: bagi+Noun+Pl, peraturan+Noun+Pl
Lower: ^[{bagi}^2^], ^[{peraturan}^2^]
xfst[]: compile-replace lower
After:
Upper: bagi+Noun+Pl, peraturan+Noun+Pl
Lower: bagibagi, peraturanperaturan
And it applies similarly to all delimited regular-expression substrings on the lower side. There must be a finite number of them.
Note that this operation is performed just once at compile-time.
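
The before/after step can be mimicked in Python on explicit string pairs. This is a simplified sketch under strong assumptions: the Xerox algorithm operates on networks and compiles arbitrary regular expressions, whereas this code handles only delimited substrings of the single form {string}^n, and all function names are hypothetical.

```python
import re

DELIMITED = re.compile(r"\^\[(.*?)\^\]")          # finds ^[ ... ^]
ITERATION = re.compile(r"\{([^{}]*)\}\^(\d+)")    # matches {string}^n only

def compile_mini_regex(expr):
    """'Compile' the tiny expression language {string}^n into a plain string."""
    m = ITERATION.fullmatch(expr)
    if m is None:
        raise ValueError(f"unsupported expression: {expr!r}")
    return m.group(1) * int(m.group(2))

def compile_replace_lower(pairs):
    """Replace each delimited substring on the lower side with its compiled value."""
    return [
        (upper, DELIMITED.sub(lambda m: compile_mini_regex(m.group(1)), lower))
        for upper, lower in pairs
    ]

before = [
    ("bagi+Noun+Pl", "^[{bagi}^2^]"),
    ("peraturan+Noun+Pl", "^[{peraturan}^2^]"),
    ("bagi+Noun+Sg", "bagi"),
]
for upper, lower in compile_replace_lower(before):
    print(upper, "=>", lower)
```

Pairs without delimiters, such as the singular form, pass through unchanged, which mirrors the way the operation only touches delimited substrings.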

21 Another Challenge: Arabic Stem “Interdigitation”
wasayaktubuwnahaA
wa+ = “and”
sa+ = [future marker]
ya+ = [imperfect prefix]
ktb = root k t b
CCVC = Form I imperfect template
u = Active-voice vocalization
[Diagram: root k t b, template CCVC and vocalization u => ktub (stem)]
+u:na = they [masc. plural] (imperfect suffix)
+ha: = it/them (direct-object clitic pronoun suffix)
English gloss: “and they will write it”

22 Some Formal Analyses of Semitic Stems
Harris, 1944: “Root-Pattern”
–root: $ b r; k t b
–pattern: n_a_i_; _a_a_
–stem: n$abir; katab
McCarthy, 1981: “Root-Template-Vocalization”
–separate tier: n (for n$abir)
–root: $ b r; k t b
–template: CCVCVC; CVCVC
–vocalization: a i; a
–stem: n$abir; katab
Another alternative is simply to ignore or deny the concept of roots and treat stems as monolithic morphemes.

23 Finite-State Computational Semitic
Kay, 1987: Arabic stem interdigitation via multi-level transducers (Kiraz, 2000)
Lavie et al., 1988: Two-Level Morphology adapted to Hebrew verbs
Kataja & Koskenniemi, 1988: Ancient Akkadian
–Concatenating languages are just a special case
–Morphotactics defined using regular expressions/operations
–Roots and patterns formalized as regular languages
–Roots are INTERSECTED with patterns, rather than concatenated, to form stems
[Diagram: Sublexicon of Roots (?* k ?* t ?* b ?*) and Sublexicon of Patterns (CaCaC) are pre-intersected by awk scripts to give katab, then compiled by TwoL]
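
The idea of intersecting a root language with a pattern language can be tried out in a small Python sketch. It is a brute-force illustration, not the awk/TwoL pipeline from the slide: the pattern language is enumerated over a deliberately tiny, assumed consonant alphabet, and the root language ?* k ?* t ?* b ?* is approximated with an ordinary regular expression. All names are hypothetical.

```python
import itertools
import re

CONSONANTS = "bdkrst"  # toy consonant alphabet, assumed for this sketch

def pattern_language(pattern):
    """Enumerate every string of a pattern such as CaCaC (C = any toy consonant)."""
    slots = [CONSONANTS if ch == "C" else ch for ch in pattern]
    return {"".join(chars) for chars in itertools.product(*slots)}

def root_regex(root):
    """Build the counterpart of ?* k ?* t ?* b ?* for a given root."""
    return re.compile(".*" + ".*".join(re.escape(c) for c in root) + ".*")

def intersect(root, pattern):
    """Keep the pattern strings that also belong to the root language."""
    rx = root_regex(root)
    return {word for word in pattern_language(pattern) if rx.fullmatch(word)}

print(intersect("ktb", "CaCaC"))
```

With three consonant slots and three root consonants that must appear in order, the intersection for ktb and CaCaC contains the single stem katab.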

24 Beesley: Arabic Stem Intersection at Runtime
ALPNET (88-90)
[Diagram: root k t b intersected with the pattern in wa+sa+ya+CCuC+u:na+ha:]
–Roots and patterns resided in separate sublexicons
–Root and pattern sublexicons were traversed in parallel at runtime
–Intersection was simulated in C code (“detouring”) at runtime
–ktb and CCuC were returned as separate morphemes in the analyses
–Still mostly a “Two-Level” System
Beesley: Stem Intersection at Compile-time
Xerox (1996-98): Reimplementation using Xerox Finite-State Morphology
On-line demo available: http://www.xrce.xerox.com/research/mltt/arabic
Use any Java-enabled browser

25 Xerox Arabic Morphological Analyzer
About 4930 roots in the underlying dictionary
Each root is encoded to show which patterns it can combine with
Roots and patterns are intersected to form over 90,000 stems
With various combinations of prefixes and suffixes, the system encodes 72,000,000 fully-voweled words, with their morphological analyses
In addition, it analyzes unvoweled and partially voweled spellings
The compiled analyzer network is currently storable in about 5 MB
The web demo is Unicode based and renders Arabic script as you type
Roots, patterns and other affixes are separated and returned

26 Intersecting Stems on One Side of a Transducer at Compile Time
Start with a Two-Level Lexicon
Upper: wa+sa+ya+[ktb & CCuC]+u:na+ha:
Lower: wa+sa+ya+[ktb & CCuC]+u:na+ha:
Compose FS Intersecting Rules at Compile Time
.o. Finite-State Stem-Intersection Rules
Result
Upper: wa+sa+ya+[ktb & CCuC]+u:na+ha:
Lower: wa+sa+ya+ ktub +u:na+ha:
Then apply the finite-state morphophonological alternation/realization rules, handling weak roots, hamza orthography in general, assimilation, deletion, …

27 Finite-State Merge: fast special-case intersection
.m>. is the “merge to the right” operator and .<m. is the “merge to the left” operator
{ktb} .m>. {CVVCVC} .<m. … => kaatab
{ktb} .m>. {CVVCVC} .<m. … => kuutib
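
A slot-filling sketch in Python gives a feel for what merging a root and a vocalization into a template produces. It does not implement the Xerox merge operators, which work on networks and symbol classes; the function fill_slots is hypothetical, and the vowel sequences "aaa" and "uui" are written out by hand as an assumption, since the slide's vocalization expressions did not survive in this transcript.

```python
def fill_slots(template, slot, fillers):
    """Replace each occurrence of `slot` in `template` with the next filler, left to right."""
    if template.count(slot) != len(fillers):
        raise ValueError("number of fillers must match number of slots")
    remaining = iter(fillers)
    return "".join(next(remaining) if ch == slot else ch for ch in template)

# Step 1: put the root consonants into the C slots of the template.
skeleton = fill_slots("CVVCVC", "C", "ktb")

# Step 2: put hand-expanded vowel sequences into the V slots.
print(fill_slots(skeleton, "V", "aaa"))
print(fill_slots(skeleton, "V", "uui"))
```

The two results correspond to the stems kaatab and kuutib named on the slide.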

28 The compile-replace algorithm: before and after
[Slide content truncated in the source. Recoverable fragment: Upper: ^[{ktb}.m>.{CVVCVC}. …]

29 The compile-replace algorithm
A general compile-time technique that allows the regular-expression compiler to apply to and modify its own output.
Somewhat similar in operation to “eval” in LISP and Perl.
Appears to handle some classic examples of non-concatenative morphotactics: full-stem reduplication and Semitic stem interdigitation, either
–Two-way root-pattern theory, or
–Three-way root-template-vocalization theory
We’ve only begun to explore the possibilities.

30 What is Finite-State Computing Good For?
Mostly “lower-level” natural language processing
–Tokenization
–Spelling checking/correction
–Phonology
–Morphological Analysis/Generation
–Part-of-Speech Tagging
–“Shallow” Syntactic Parsing and “Chunking”
Finite-state techniques cannot do everything; but for tasks where they do apply, they are extremely attractive.

31 What about Maltese?
Necessary preliminary work has already started
–Corpora
–Lexicography
–Formal linguistic description
Finite-state implementation
–Xerox finite-state “calculus” already licensed at Univ. of Malta
–The compile-replace algorithm will soon be released
–The Book (Beesley and Karttunen, forthcoming)
Unique opportunity
–Semitic component
–Routinely written, in a culture with high literacy

32 Final Observations
Successful computational linguistic projects are often the result of cooperation between a computational linguist and a more traditional descriptive linguist
Computational linguistics can be commercially rewarding
Computational linguistics is a healthy discipline from the descriptive point of view
–Your grammars can literally be tested on millions of words
–Any mistakes or gaps in your grammars soon become apparent

