1 MT For Low-Density Languages. Ryan Georgi, Ling 575 – MT Seminar, Winter 2007

2 What is “Low Density”?

3 In NLP, languages are usually chosen for: economic value; ease of development; funding (NSA, anyone?).

4 What is “Low Density”? As a result, NLP work until recently has focused on a rather small set of languages, e.g. English, German, French, Japanese, Chinese.

5 What is “Low Density”? “Density” refers to the availability of resources (primarily digital) for a given language: parallel text; treebanks; dictionaries; chunked, semantically tagged, or other annotation.

6 What is “Low Density”? “Density” is not necessarily linked to speaker population; our favorite example is Inuktitut.

7 So, why study LDLs (low-density languages)?

8 Preserving endangered languages. Spreading the benefits of NLP to other populations (Tegic has T9 for Azerbaijani now). Benefits of wide typological coverage for cross-linguistic research (?).

9 The problem with LDLs?

10 “The fundamental problem for annotation of lower-density languages is that they are lower density” – Maxwell & Hughes. The easiest (and often best) NLP development is done with statistical methods; training requires lots of resources; resources require lots of money. A cost/benefit chicken-and-egg problem.

11 What are our options? Create corpora by hand: very time-consuming (= expensive); requires trained native speakers. Digitize printed resources: time-consuming; may require trained native speakers, e.g. for an orthography without Unicode entries.

12 What are our options? Traditional requirements are going to be difficult to satisfy, no matter how we slice it. We need, then, to maximize the information extracted from the resources we can get, and to reduce the requirements for building a system.

13 Maximizing Information with IGT

14 Interlinear Glossed Text (IGT): the traditional form of transcription for linguistic field researchers and grammarians. Example:
Rhoddodd yr athro lyfr i'r bachgen ddoe
gave-3sg the teacher book to-the boy yesterday
“The teacher gave a book to the boy yesterday”
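
A minimal sketch of how one IGT instance might be represented for processing; the class and field names are illustrative, not from any of the papers discussed here:

```python
from dataclasses import dataclass

@dataclass
class IGT:
    """One interlinear glossed text instance: three parallel lines."""
    source: list[str]       # language-line tokens
    gloss: list[str]        # gloss-line tokens, aligned 1:1 with source
    translation: list[str]  # free-translation tokens

welsh = IGT(
    source="Rhoddodd yr athro lyfr i'r bachgen ddoe".split(),
    gloss=["gave-3sg", "the", "teacher", "book", "to-the", "boy", "yesterday"],
    translation="The teacher gave a book to the boy yesterday".split(),
)
```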

15 Benefits of IGT: As IGT is frequently used in fieldwork, it is often available for low-density languages. IGT provides information about syntax and morphology. The translation line is usually in a high-density language that we can use as a pivot language.

16 Drawbacks of IGT: Data can be ‘abnormal’ in a number of ways: usually quite short; may be used by a grammarian to illustrate fringe usages; often purposely limited vocabularies. Still, in working with LDLs it might be all we’ve got.

17 Utilizing IGT: First, a big nod to Fei (this is her paper!). As we saw in HW#2, word alignment is hard. IGT, however, often gets us halfway there!

18 Utilizing IGT: Take the previous example:
Rhoddodd yr athro lyfr i'r bachgen ddoe
gave-3sg the teacher book to-the boy yesterday
“The teacher gave a book to the boy yesterday”

23 Utilizing IGT: In this example, the interlinear already aligns the source with the gloss, and the gloss often uses words found in the translation already (a sketch of this trick follows).
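
A sketch of that trick as code, under the assumption that the gloss line is tokenized 1:1 against the source line; exact string matches between gloss pieces and translation words then give us source-translation links for free (function and variable names are mine):

```python
def align_via_gloss(source, gloss, translation):
    """Align source words to translation words through the gloss line.

    The gloss line is positionally aligned 1:1 with the source line, so
    any gloss piece that literally appears in the translation links the
    source word at that position to the matching translation word.
    """
    trans_index = {}
    for j, w in enumerate(translation):
        trans_index.setdefault(w.lower(), j)  # keep first occurrence

    alignment = []
    for i, gloss_tok in enumerate(gloss):
        # Glosses like "gave-3sg" or "to-the" bundle several pieces;
        # try each hyphen-separated piece against the translation.
        for piece in gloss_tok.lower().split("-"):
            if piece in trans_index:
                alignment.append((source[i], translation[trans_index[piece]]))
    return alignment

src = "Rhoddodd yr athro lyfr i'r bachgen ddoe".split()
gls = ["gave-3sg", "the", "teacher", "book", "to-the", "boy", "yesterday"]
trn = "The teacher gave a book to the boy yesterday".split()
print(align_via_gloss(src, gls, trn))
# [('Rhoddodd', 'gave'), ('yr', 'The'), ('athro', 'teacher'),
#  ('lyfr', 'book'), ("i'r", 'to'), ("i'r", 'The'), ('bachgen', 'boy'),
#  ('ddoe', 'yesterday')]
```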

24 Utilizing IGT: Alignment isn’t always this easy…
xaraju mina lgurfati wa nah.nu nadxulu
xaraj-u: mina ?al-gurfat-i wa nah.nu na-dxulu
exited-3MPL from DEF-room-GEN and we 1PL-enter
‘They left the room as we were entering it’
(Source: Modern Arabic: Structures, Functions, and Varieties; Clive Holes)

25 Utilizing IGT: We can get a little more by stemming (see the sketch after slide 26)…

26 Utilizing IGT: …but we’re going to need more.
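
To make the stemming point concrete, a sketch assuming NLTK's Porter stemmer is available (any stemmer would do):

```python
from nltk.stem import PorterStemmer

stem = PorterStemmer().stem

def stemmed_matches(gloss_piece, translation_words):
    """Return translation positions whose stem matches the gloss piece's stem."""
    target = stem(gloss_piece.lower())
    return [j for j, w in enumerate(translation_words)
            if stem(w.lower()) == target]

trans = "They left the room as we were entering it".split()
print(stemmed_matches("enter", trans))   # -> [7]: "entering" stems to "enter"

# Note the limit: the gloss "exited" still won't match the translation's
# "left" -- stemming fixes inflection, not lexical choice, which is why
# we're going to need more than string matching.
```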

27 Utilizing IGT: Thankfully, with an English translation, we already have tools to get phrase and dependency structures that we can project. (Source: Will & Fei’s NAACL 2007 paper!)
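
A sketch of the projection step in its simplest form: parse the English translation with an off-the-shelf parser, then copy dependency arcs across the word alignment. This is the general direct-projection idea, not the exact algorithm from the NAACL paper:

```python
def project_dependencies(eng_deps, alignment):
    """Project English dependency arcs onto source-language words.

    eng_deps:  (head, dependent) index pairs over translation words
    alignment: (source_index, translation_index) pairs
    Returns source-side (head, dependent) pairs wherever both ends of
    an English arc are aligned to some source word.
    """
    t2s = {}                                # translation pos -> source positions
    for s, t in alignment:
        t2s.setdefault(t, []).append(s)
    projected = set()
    for head, dep in eng_deps:
        for s_head in t2s.get(head, []):
            for s_dep in t2s.get(dep, []):
                projected.add((s_head, s_dep))
    return sorted(projected)

# "The teacher gave a book...": arcs gave->teacher, gave->book
eng_deps = [(2, 1), (2, 4)]
# Rhoddodd~gave, athro~teacher, lyfr~book (from the gloss-based alignment)
alignment = [(0, 2), (2, 1), (3, 4)]
print(project_dependencies(eng_deps, alignment))
# -> [(0, 2), (0, 3)]: Rhoddodd -> athro, Rhoddodd -> lyfr
```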

29 Utilizing IGT: What can we get from this? Automatically generated CFGs; we can infer word order from these CFGs; we can infer possible constituents; …suggestions? From a small amount of data, this is a lot of information, but what about…
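
Reading CFG productions off a projected tree is just a tree walk, and the resulting rules already suggest word order. A sketch over a hypothetical projected tree (the tree encoding and labels are mine):

```python
from collections import Counter

def extract_rules(tree, counts=None):
    """Count CFG productions from a nested (label, children) tree.

    A tree looks like ("NP", [("DT", "the"), ("NN", "dog")]);
    a leaf's second element is a plain string.
    """
    if counts is None:
        counts = Counter()
    label, children = tree
    if isinstance(children, list):
        counts[(label, tuple(child[0] for child in children))] += 1
        for child in children:
            extract_rules(child, counts)
    return counts

# A (hypothetical) projected tree for "Rhoddodd yr athro ...":
t = ("S", [("V", "rhoddodd"), ("NP", [("DT", "yr"), ("NN", "athro")])])
print(extract_rules(t))
# Counter({('S', ('V', 'NP')): 1, ('NP', ('DT', 'NN')): 1})
# S -> V NP already tells us the verb comes first (Welsh is VSO).
```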

30 Reducing Data Requirements with Prototyping

31 Grammar Induction: So, we have a way to get production rules from a small amount of data. Is this enough? Probably not; CFGs aren’t known for their robustness. How about using what we have as a bootstrap?

32 Grammar Induction: Given unannotated text, we can derive PCFGs. Without annotation, though, we just have unlabelled trees, e.g.:
ROOT
└─ C2
   ├─ X0  the
   ├─ X1  dog
   └─ Y2
      ├─ Z3  fell
      └─ N4  asleep
(each production in the original figure carried a probability: p=0.003, p=0.02, p=5.3e-2, p=0.09, p=0.45e-4)
Such an unlabelled parse doesn’t give us S -> NP VP, though.
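
Where those probabilities come from: relative-frequency (MLE) estimation over rule counts, p(A -> beta) = count(A -> beta) / count(A). A worked sketch with made-up counts over unlabelled categories like the ones above:

```python
from collections import defaultdict

def estimate_pcfg(rule_counts):
    """MLE rule probabilities: p(A -> beta) = count(A -> beta) / count(A)."""
    lhs_totals = defaultdict(int)
    for (lhs, _), n in rule_counts.items():
        lhs_totals[lhs] += n
    return {(lhs, rhs): n / lhs_totals[lhs]
            for (lhs, rhs), n in rule_counts.items()}

# Hypothetical counts for the unlabelled category C2:
counts = {("C2", ("X0", "X1", "Y2")): 3, ("C2", ("X0", "Y2")): 1}
print(estimate_pcfg(counts))
# {('C2', ('X0', 'X1', 'Y2')): 0.75, ('C2', ('X0', 'Y2')): 0.25}
```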

33 Grammar Induction: Can we get labeled trees without annotated text? Haghighi & Klein (2006) propose a way in which production rules can be passed to a PCFG induction algorithm as “prototypical” constituents. Think of these prototypes as a rubric that could be given to a human annotator, e.g. for English, NP -> DT NN.

34 Grammar Induction: Let’s take the possible constituent DT NN. We could tell our PCFG algorithm to apply this as a constituent everywhere it occurs. But what about DT NN NN (“the train station”)? We would like to catch this as well.
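
A sketch of how a prototype like NP -> DT NN might seed labelling, and exactly where the DT NN NN problem shows up (the prototype table and function are mine, not H&K's implementation):

```python
PROTOTYPES = {("DT", "NN"): "NP"}   # e.g. for English, NP -> DT NN

def label_prototype_spans(pos_tags):
    """Label every span whose POS sequence exactly matches a prototype."""
    spans = []
    n = len(pos_tags)
    for i in range(n):
        for j in range(i + 1, n + 1):
            label = PROTOTYPES.get(tuple(pos_tags[i:j]))
            if label:
                spans.append((i, j, label))
    return spans

print(label_prototype_spans(["DT", "NN", "NN"]))   # "the train station"
# -> [(0, 2, 'NP')]: the exact match misses the full three-word NP,
#    which is precisely the problem the next slide addresses.
```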

35 Grammar Induction: H&K’s solution? Distributional clustering: “a similarity measure between two items on the basis of their immediate left and right contexts.” …to be honest, I lose them in the math here. Importantly, however, weighting the probability of a constituent with the right measure improves F-measure from the base unsupervised level of 35.3 to 62.2.
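
A sketch of the distributional idea as I read it, not H&K's exact math: represent each candidate span by the distribution of items immediately to its left and right, then score candidates against prototypes with a vector similarity such as cosine:

```python
import math
from collections import Counter

def context_signature(sentences, span_seq):
    """Count the (left word, right word) contexts of a tag sequence."""
    sig = Counter()
    k = len(span_seq)
    for sent in sentences:
        for i in range(len(sent) - k + 1):
            if sent[i:i + k] == list(span_seq):
                left = sent[i - 1] if i > 0 else "<s>"
                right = sent[i + k] if i + k < len(sent) else "</s>"
                sig[(left, right)] += 1
    return sig

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

sents = [["DT", "NN", "VBD"], ["DT", "NN", "NN", "VBD"]]
proto = context_signature(sents, ("DT", "NN"))
cand = context_signature(sents, ("DT", "NN", "NN"))
print(round(cosine(proto, cand), 2))   # ~0.71: DT NN NN shares DT NN's contexts
```

Spans whose contexts look like a prototype's contexts get a boost toward that prototype's label during induction, which is how "the train station" can be caught by the NP -> DT NN prototype.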

36 So… what now?

37 Next Steps: By extracting production rules from a very small amount of data using IGT, and applying Haghighi & Klein’s unsupervised methods, it may be possible to bootstrap an effective language model from very little data!

38 Next Steps: Possible applications: Automatic generation of language resources (while a system with the same goals would only compound error, automatically annotated data could be easier for a human to correct than to hand-generate). Assisting linguists in the field (better model performance could imply better grammar coverage). …you tell me!

