Statistical Alignment and Machine Translation

Artificial Intelligence Laboratory, 정 성 원

Contents
  Machine Translation
  Text Alignment
    Length-based methods
    Offset alignment by signal processing techniques
    Lexical methods of sentence alignment
  Word Alignment
  Statistical Machine Translation

Different Strategies for MT (1)
[Figure: levels of translation between English and French, from shallowest to deepest]
  Word-for-word: English text (word string) <-> French text (word string)
  Syntactic transfer: English (syntactic parse) <-> French (syntactic parse)
  Semantic transfer: English (semantic representation) <-> French (semantic representation)
  Knowledge-based translation: via an interlingua (knowledge representation)

Different Strategies for MT (2)
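Machine translation: an important but hard problem
Why is MT hard?
  Word-for-word translation
    Lexical ambiguity
    Different word order
  Syntactic transfer approach
    Can solve problems of word order
    Syntactic ambiguity
  Semantic transfer approaches
    Can fix cases of syntactic mismatch
    Unnatural, unintelligible output
  Interlingua
MT is one of the most important tasks in NLP, but it is very hard, and the systems currently available perform poorly. The ultimate goal is fluent translation, but today this is achievable only in restricted domains. Figure 13.1 helps explain why.
1. Word-for-word translation is the simplest approach. Its first problem is that words in different languages rarely match exactly; their meanings differ slightly. For example, suit can mean a lawsuit or a set of garments, so the choice must be made from a context larger than the single word. Its second problem is that word order differs between languages.
2. The syntactic transfer approach was proposed to deal with word order. The source text is parsed, the parse tree is transformed into a syntactic tree of the target language, and the translation is generated from that tree. This solves the word-order problem, but a syntactically correct translation sometimes has inappropriate semantics. For example, German Ich esse gern (I like to eat) is a verb-adverb construction, and no English verb-adverb construction expresses the same meaning.
3. Semantic transfer approaches represent the meaning of the source sentence. This fixes cases of syntactic mismatch, but not all of them. A translation can be literally correct and faithful to the original yet unnatural or unintelligible. For example, Spanish La botella entró a la cueva flotando is literally "the bottle entered the cave floating", whereas natural English is "The bottle floated into the cave."
4. The alternative that does not rely on literal translation is to translate through an interlingua. An interlingua is a knowledge representation formalism: an intermediate language for representing knowledge that is independent of how particular languages express meaning. Advantage: it is efficient when translating among many languages. Problems: it is difficult to design an efficient and comprehensive knowledge representation formalism, and a large amount of ambiguity has to be resolved when translating from natural language into the knowledge representation language.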

MT & Statistical Methods
In theory, each of the arrows in the prior figure can be implemented based on a probabilistic model.
Most MT systems are a mix of probabilistic and non-probabilistic components.
Text alignment
  Used to create lexical resources such as bilingual dictionaries and parallel grammars, to improve the quality of MT
  More work on text alignment than on MT in statistical NLP
Where do statistical methods come in? In theory, each arrow in Figure 13.1 can be implemented with a probabilistic model. Some components that do not appear in the figure can also be implemented statistically, for example a word sense disambiguator. Most systems mix probabilistic and non-probabilistic components, but a completely statistical translation system is described in Section 13.3.
Text alignment gets its own treatment because, although it is an important task within MT, it is mostly used to create lexical resources (for example bilingual dictionaries and parallel grammars). It has also received more attention in statistical NLP than MT itself, partly because MT depends on components such as parsers and disambiguators that are not MT-specific. We therefore look at text alignment systematically and then briefly at word alignment. Word alignment is the step after text alignment; its purpose is to derive a bilingual dictionary from parallel text.

Text Alignment
Parallel texts or bitexts
  Same content is available in several languages
  Official documents of countries with multiple official languages -> literal, consistent
Alignment
  Paragraph to paragraph, sentence to sentence, word to word
Usage of aligned text
  Bilingual lexicography
  Machine translation
  Word sense disambiguation
  Multilingual information retrieval
  Assisting tool for translators
Statistical NLP methods for multilingual text have been applied in a variety of studies, and most of this work uses parallel texts, or bitexts: the same content made available in several languages through document translation. Such documents are common in countries with more than one official language (Canada, Switzerland, Hong Kong). One reason for using them is that large quantities are easy to obtain. Their nature also helps statistical NLP: because accuracy is required, the translations are literal and consistent. Other kinds of translated text appear less regularly and are translated less literally, so they are less helpful. Parallel texts of this kind are also available online.
The first task is to align the text at a coarse level: to find which paragraph or sentence in one language corresponds to which paragraph or sentence in the other. This problem is well studied and very successful methods have been proposed. Once this is done, the second problem is to learn which words tend to be translated by which other words, that is, to acquire a bilingual dictionary from text. This section deals with text alignment; the next section covers word alignment and the derivation of bilingual dictionaries from aligned text.

Aligning sentences and paragraphs(1)
Problems
  Not always one sentence to one sentence
  Reordering
  Large pieces of material can disappear
Methods
  Length-based vs. lexical content based
  Match corresponding points vs. form sentence beads
Text alignment is almost always the first step in building a usable multilingual text corpus. It is needed not only for MT but also whenever multilingual corpora serve as knowledge sources in other domains, for example word sense disambiguation and multilingual information retrieval, and it can be a practical tool for assisting translators.
Text alignment is not trivial because translators do not always translate one sentence into one sentence, although this is the most common case. Human translators change, reorder, and expand material. In Figure 13.2, for example, content and organization differ considerably: there is a lot of reordering and some material has been dropped.
The sentence alignment problem is to decide which group of sentences in one language corresponds to which group of sentences in the other. Either group may be empty, to allow for insertions and deletions. Such a grouping is called a sentence alignment or bead. What matters is how much content overlaps: an overlap of one or two words does not count, but if a clause overlaps, the sentences are part of the alignment. In the most common case one sentence is translated as one sentence (1:1, about 90% of cases); the remaining cases include 1:2, 2:1, 1:3, and 3:1. In this framework each sentence can occur in only one bead. Figure 13.2 is a 2:2 alignment because parts of the sentences are interleaved; when aligning at the sentence level, material moved from one sentence into another can only be described by grouping the sentences involved.
A further problem is that crossing dependencies are very common in real text, and the algorithms discussed here cannot handle them exactly. Following the statistical string matching literature, we can distinguish the alignment problem from the correspondence problem by adding the restriction that alignment problems do not allow crossing dependencies. Under this restriction, reordered material has to be described as a many-to-many alignment (2:2, 3:3, 3:2, …). Finally, sentences may be added or deleted during translation, deliberately or by mistake, which gives 1:0 and 0:1 alignments.
The papers considered are listed in Table 13.1. The methods can be classified along two lines: 1. purely length-based vs. using lexical content; 2. giving an average alignment of positions vs. aligning sentences to form sentence beads. Many of the methods use dynamic programming to find the best alignment between the texts.

BEAD: n:m grouping
  S, T: texts in two languages
  S = (s1, s2, …, si), T = (t1, t2, …, tj)
  Bead types: 0:1, 1:0, 1:1, 2:1, 1:2, 2:2, 2:3, 3:2, …
  Each sentence can occur in only one bead
  No crossing
[Figure: S (s1 … si) and T (t1 … tj) side by side, partitioned into beads b1, b2, b3, b4, b5, …, bk]

Dynamic Programming(1)

Dynamic Programming(2)
Shortest-path computation: dmin(vij)
[Figure: staged graph with stages i = 1 … 5; each node vij is labeled with its minimum distance dmin(vij), with values ranging from 22 at the first stage down to 3 at the last]
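The shortest-path computation named on this slide can be sketched in a few lines. This is a minimal illustration, not the slide's own example: the graph, node names, and edge weights below are invented, and the function name `shortest_path_costs` is hypothetical.

```python
def shortest_path_costs(edges, goal):
    """Return d_min for every node: the cheapest cost from that node to `goal`.

    `edges` maps a node to a list of (successor, weight) pairs and is assumed
    to describe a directed acyclic graph in which every node reaches `goal`.
    """
    d_min = {goal: 0}

    def solve(node):
        if node not in d_min:
            # Dynamic-programming recurrence: the best first edge plus the
            # best (already memoized) cost of the rest of the path.
            d_min[node] = min(w + solve(nxt) for nxt, w in edges[node])
        return d_min[node]

    for node in edges:
        solve(node)
    return d_min


# Hypothetical staged graph (weights invented for illustration).
EDGES = {
    "start": [("a", 2), ("b", 5)],
    "a": [("c", 4), ("d", 1)],
    "b": [("c", 1), ("d", 3)],
    "c": [("goal", 3)],
    "d": [("goal", 6)],
}
D_MIN = shortest_path_costs(EDGES, "goal")
print(D_MIN["start"])
```

Each subproblem is solved once and reused, which is the same property the sentence alignment methods rely on.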

Length-based methods
Rationale
  Short sentence -> short sentence
  Long sentence -> long sentence
  Ignores the richer information but is quite effective
Length
  Number of words or number of characters
Pros
  Efficient (for similar languages)
  Rapid
Much of the earliest work on sentence alignment used models that simply compare the lengths of text units in the parallel corpora. Ignoring the useful information in the text seems odd, but the approach turned out to be very effective, and it is very fast on large amounts of text. The rationale is that short sentences are translated as short sentences and long sentences as long sentences. Length is usually defined as the number of words or the number of characters.
Gale and Church (1993): in the statistical approach, the most probable alignment A of two texts S and T is given by equation (13.3). To estimate the probabilities, the texts are decomposed into beads, each bead is treated as independent, and the bead probabilities are multiplied. The question is then how to estimate the probability of a given type of alignment bead for given sentences, that is, which bead type fits them best. Gale and Church make this depend simply on the number of characters in the source and translation sentences. The method works well for literal translations between similar languages. They used the UBS corpus, whose texts were already aligned at the paragraph level: paragraph structure is clearly marked in the corpus, and any confusion at that level was checked and removed by hand. This first step is important.
Brown et al. measured length in words rather than characters. Gale and Church argued against this because length in words varies more than length in characters. Brown et al., however, aimed at aligning a subset of the corpus for further research rather than entire articles.
Wu applied the method to Cantonese and obtained better results than expected, with the help of lexical cues.

Gale and Church (1)
Find the alignment A (S, T: parallel texts)
  $\arg\max_A P(A \mid S, T) = \arg\max_A P(A, S, T)$
Decompose the aligned texts into a sequence of aligned beads (B1, …, BK)
  $P(A, S, T) \approx \prod_{k=1}^{K} P(B_k)$
The method
  Length of source and translation sentences measured in characters
  Similar languages and literal translations
  Used for the Union Bank of Switzerland (UBS) corpus
    English, French, German
    Aligned at the paragraph level
Algorithm: sentence length is used to decide how likely it is that some number of sentences in L1 aligns with some number of sentences in L2. The possible alignments are limited to {1:1, 1:0, 0:1, 2:1, 1:2, 2:2}. This restriction makes it easy to find the most probable text alignment with a dynamic programming algorithm.

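Gale and Church (2)
D(i,j): the lowest cost alignment between sentences s1, …, si and t1, …, tj, with D(0,0) = 0
$D(i,j) = \min \begin{cases} D(i, j-1) + \mathrm{cost}(0{:}1\ \mathrm{align}\ \emptyset, t_j) \\ D(i-1, j) + \mathrm{cost}(1{:}0\ \mathrm{align}\ s_i, \emptyset) \\ D(i-1, j-1) + \mathrm{cost}(1{:}1\ \mathrm{align}\ s_i, t_j) \\ D(i-1, j-2) + \mathrm{cost}(1{:}2\ \mathrm{align}\ s_i, t_{j-1}, t_j) \\ D(i-2, j-1) + \mathrm{cost}(2{:}1\ \mathrm{align}\ s_{i-1}, s_i, t_j) \\ D(i-2, j-2) + \mathrm{cost}(2{:}2\ \mathrm{align}\ s_{i-1}, s_i, t_{j-1}, t_j) \end{cases}$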
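The dynamic program behind D(i,j) can be sketched as follows. This is an illustration under stated assumptions rather than Gale and Church's implementation: the function names `align` and `toy_cost` are hypothetical, sentences are represented only by their lengths, and `toy_cost` is a stand-in cost (length mismatch plus a fixed penalty for anything other than a 1:1 bead), not the probabilistic cost from the paper.

```python
def align(src_lens, tgt_lens, cost):
    """Lowest-cost alignment over the bead types 1:1, 1:0, 0:1, 2:1, 1:2, 2:2.

    src_lens, tgt_lens: sentence lengths of the two texts.
    cost(s_chunk, t_chunk): cost of placing those sentences in one bead.
    Returns (total_cost, beads); each bead is (source indices, target indices).
    """
    bead_types = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2), (2, 2)]
    n_src, n_tgt = len(src_lens), len(tgt_lens)
    inf = float("inf")
    D = [[inf] * (n_tgt + 1) for _ in range(n_src + 1)]
    back = [[None] * (n_tgt + 1) for _ in range(n_src + 1)]
    D[0][0] = 0.0
    for i in range(n_src + 1):
        for j in range(n_tgt + 1):
            for di, dj in bead_types:
                pi, pj = i - di, j - dj
                if pi < 0 or pj < 0 or D[pi][pj] == inf:
                    continue
                c = D[pi][pj] + cost(src_lens[pi:i], tgt_lens[pj:j])
                if c < D[i][j]:
                    D[i][j], back[i][j] = c, (di, dj)
    # Follow the back-pointers from the end to recover the beads.
    beads, i, j = [], n_src, n_tgt
    while (i, j) != (0, 0):
        di, dj = back[i][j]
        beads.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    return D[n_src][n_tgt], beads[::-1]


def toy_cost(s_chunk, t_chunk):
    # Stand-in cost (an assumption for this sketch only).
    penalty = 0 if (len(s_chunk), len(t_chunk)) == (1, 1) else 10
    return abs(sum(s_chunk) - sum(t_chunk)) + penalty


# Four source sentences, three target sentences (lengths invented).
TOTAL, BEADS = align([20, 30, 50, 40], [52, 49, 41], toy_cost)
print(TOTAL, BEADS)
```

With these invented lengths the first two source sentences end up in a single 2:1 bead with the first target sentence, followed by two 1:1 beads.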

Gale and Church (3)
[Figure: two candidate alignments of L1 sentences s1, s2, s3, s4 with L2 sentences t1, t2, t3]
  Alignment 1: cost(align(s1, s2, t1)) + cost(align(s3, t2)) + cost(align(s4, t3))
  Alignment 2: cost(align(s1, t1)) + cost(align(s2, t2)) + cost(align(s3, ∅)) + cost(align(s4, t3))

Gale and Church (4)
l1, l2: the lengths in characters of the sentences of each language in the bead
The ratio of character lengths between the two languages is modeled as a normal distribution ~ N(μ, s²)
Average 4% error rate
2% error rate for 1:1 alignments
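A length-based bead cost in the spirit of this slide can be sketched as below. The sketch assumes the usual formulation, δ = (l2 − l1·μ) / sqrt(l1·s²) with cost = −log P(bead type | δ), and approximates the length term by the two-sided tail probability of a standard normal. The default values μ = 1 and s² = 6.8 are the figures commonly quoted for Gale and Church (1993) and should be treated as assumptions here, as should the function names and the `prior` parameter for the bead type.

```python
import math


def delta(l1, l2, mu=1.0, s2=6.8):
    """Standardized length difference; requires l1 > 0."""
    return (l2 - l1 * mu) / math.sqrt(l1 * s2)


def length_cost(l1, l2, prior=1.0, mu=1.0, s2=6.8):
    """-log(prior * P(|Z| >= |delta|)) for a standard normal Z.

    `prior` stands for the probability of the bead type (1:1, 2:1, ...);
    it defaults to 1.0, so only the length term contributes.
    """
    # Two-sided tail of N(0, 1): 2 * (1 - Phi(|d|)) = erfc(|d| / sqrt(2)).
    tail = math.erfc(abs(delta(l1, l2, mu, s2)) / math.sqrt(2.0))
    tail = max(tail, 1e-300)  # avoid log(0) for extreme mismatches
    return -math.log(prior * tail)


# Similar lengths are cheap, very different lengths are expensive.
print(length_cost(100, 105), length_cost(100, 130))
```

Beads with an empty side (1:0, 0:1) have l1 = 0 or l2 = 0 and need separate handling that this sketch leaves out.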

Other Research
Brown et al. (1991c)
  Corpus: Canadian Hansard (English, French)
  Method: comparing sentence lengths in words rather than characters
  Goal: produce an aligned subset of the corpus
  Feature: EM algorithm
Wu (1994)
  Corpus: Hong Kong Hansard (English, Cantonese)
  Method: the Gale and Church (1993) method
  Result: the length assumptions are not as clearly met when dealing with unrelated languages
  Feature: uses lexical cues

Offset alignment by signal processing techniques
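Showing roughly what offset in one text aligns with what offset in the other.
Church (1993)
  Background: noisy text (OCR output)
  Method
    Cognates defined at the level of character sequences -> true cognates + proper names + numbers
    Dot-plot method (character 4-grams)
  Result: very small error rate
  Drawbacks
    Different character sets
    No or extremely few identical character sequences
Instead of aligning beads of sentences, these methods align position offsets in the two texts.
Church: the length-based methods are not robust to noise. In OCR output punctuation may be lost, and electronic documents may contain unwanted markup, so sentence boundaries become hard to find. Church induces the alignment from cognates, that is, similar words derived from the same linguistic ancestor, for example French superieur and English superior. The cognates are found at the level of character sequences, however. This relies on there being enough identical character sequences between the source and target languages, but Church observes that this is the case for most languages that use the Roman alphabet, and that even texts in non-Roman writing systems are liberally sprinkled with names and numbers.
The method builds a dot plot: 1. the source and translated texts are concatenated; 2. a two-dimensional graph is drawn with this text on both axes. Church matched units of character 4-grams. The informative parts are the two quadrants that pair one language with the other, called the bitext map. They show a diagonal line, which arises from the cognates shared by the two languages. A heuristic search finds the best path along the diagonal, and this gives the alignment in terms of offsets in the two texts. The n-grams are weighted inversely by frequency, because rare n-grams are more informative, and n-grams that occur too often are ignored. The method does not try to align whole sentences as beads, so its performance cannot be compared directly with the other methods; the notes give this as the reason the model is described as having no quantitative evaluation of performance.
Drawback: the method fails when identical character sequences are absent or extremely rare in the two texts, so the whole parallel text cannot be aligned. This problem arises between languages that use different character sets (for example Eastern European and Asian languages).
Fung and McKeown sought an algorithm that works 1. without having found sentence boundaries, 2. on only roughly parallel texts (some sections have no corresponding section, or several), and 3. with unrelated language pairs. In particular, they applied the technique to an English-Cantonese parallel corpus. A signal, the arrival vector, is built for each word and then processed. For example, a word that occurs at offsets (1, 263, 267, 519) has the arrival vector (262, 4, 252). Arrival vectors of English and Cantonese words are compared. Rather than keeping all pairs, a small number of pairs with very similar signals is selected. If frequency and position differ greatly, the two words are judged not to match; otherwise Dynamic Time Warping is used to compare them. The similarity with Church's method is that when a dot plot of the English and Cantonese texts is built from these word pairs, a similar diagonal pattern appears. In a second step a dynamic programming algorithm finds the best match between the two texts. The approach is purely language independent and sensitive to lexical content.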

DOT-PLOT
[Figure: two example dot plots over the symbols a, c, g, t, one matching unigrams and one matching bigrams]
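The dot-plot idea can be sketched with character n-grams. This is a simplified illustration, not Church's implementation: it computes only the cross-language part of the plot (source offsets against target offsets) instead of plotting the concatenated text against itself, and it replaces inverse-frequency weighting with a hard cutoff. The name `dot_plot` and the parameter `max_freq` are hypothetical.

```python
from collections import defaultdict


def dot_plot(source, target, n=4, max_freq=5):
    """Return (i, j) offsets where the same character n-gram starts in both texts.

    n-grams that occur more than `max_freq` times in the target are skipped,
    a crude stand-in for down-weighting frequent n-grams.
    """
    index = defaultdict(list)
    for j in range(len(target) - n + 1):
        index[target[j:j + n]].append(j)
    dots = []
    for i in range(len(source) - n + 1):
        hits = index.get(source[i:i + n], [])
        if 0 < len(hits) <= max_freq:
            dots.extend((i, j) for j in hits)
    return dots


# A cognate pair produces dots along the diagonal.
print(dot_plot("superieur", "superior"))
```

On real bitext the dots from cognates, names, and numbers line up roughly along a diagonal, and a search along that diagonal yields the offset alignment.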

Fung and McKeown
Conditions
  Without having found sentence boundaries
  In only roughly parallel texts
  With unrelated languages
Corpus: English and Cantonese
Method
  Arrival vector
    Word offsets (1, 263, 267, 519) => arrival vector (262, 4, 252)
  Small bilingual dictionary
    Choose English-Cantonese word pairs of high similarity => small bilingual dictionary => anchors for text alignment
  Strong signal in a line along the diagonal in the dot plot => good alignment
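The arrival vector and its comparison can be sketched as follows. `arrival_vector` reproduces the example on the slide; `dtw_distance` is a textbook Dynamic Time Warping distance with absolute difference as the local cost, offered as an assumption about how two arrival vectors might be compared rather than as Fung and McKeown's exact procedure.

```python
def arrival_vector(offsets):
    """Differences between successive occurrence offsets of one word."""
    return [b - a for a, b in zip(offsets, offsets[1:])]


def dtw_distance(u, v):
    """Classic dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(u), len(v)
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(u[i - 1] - v[j - 1])
            # Extend the cheapest of: stretch u, stretch v, or advance both.
            D[i][j] = step + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]


english = arrival_vector([1, 263, 267, 519])
# Hypothetical offsets of a candidate translation in the other text.
candidate = arrival_vector([3, 263, 268, 518])
print(english, candidate, dtw_distance(english, candidate))
```

Word pairs with a small distance are candidates for the small bilingual dictionary that anchors the text alignment.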

Lexical methods of sentence alignment(1)
Align beads of sentences in robust ways using lexical information
Kay and Röscheisen (1993)
  Features: lexical cues, a process of convergence
  Algorithm
    Set initial anchors
    Until most sentences are aligned:
      Form an envelope of possible alignments
      Choose pairs of words that tend to co-occur in these potential partial alignments
      Find pairs of source and target sentences which contain many possible lexical correspondences
Length-based methods are not robust to noise and imperfect input. For that reason we looked at methods that give up on aligning sentences and align text offsets instead. This section returns to methods that, like the first ones, produce beads, but that add lexical information to become robust to noise.
Kay and Röscheisen: the earlier methods are hard to use in practice because of noise and similar problems. Lexical information, in contrast, can guide the alignment in many situations, and it is essential in some cases, such as when a run of sentences of similar length occurs in both languages. Using lexical cues also removes the need for a higher-level paragraph alignment. The method is a process of convergence from a partial word alignment to the most probable sentence-level alignment. Word alignment is based on the assumption that two words correspond if their distributions are the same. The procedure is as follows. Take the first and last sentences of the texts as the initial anchors. Then repeat the following until most sentences are aligned:
1. Form all possible alignments from the Cartesian product of the list of source sentences and the list of target sentences. Exclude alignments that cross anchors or whose respective distances from an anchor differ too much. The allowed difference grows with the distance from the anchor (see Figure 13.5).
2. Choose pairs of words that tend to co-occur in these potential partial alignments, that is, words whose distributions are similar.
3. Find pairs of source and target sentences that contain many possible lexical correspondences. The most reliable of these pairs are used to induce a set of partial alignments that will be part of the final result. Add them to the list of anchors and repeat from step 1.

Results
  96% coverage after four passes on Scientific American articles
  7 errors after 5 passes on 1000 Hansard sentences
Drawbacks
  Computationally intensive
  Pillow-shaped envelope => problems when text is moved or deleted
The accuracy of this approach depends on the annealing schedule. If many pairs are accepted as reliable in each iteration, fewer iterations are needed but the results get worse. About 5 iterations are usually needed for satisfactory results. The method does not assume any limit on the possible alignment types, it is very robust, and its results are good: Kay and Röscheisen report 96% coverage after four passes. The method is computationally intensive, however. When it starts from a large text, a large envelope has to be searched. Moreover, if very large sections of the text have been moved or deleted, the correct alignment of some sentences lies outside the search envelope.

Chen (1993)
  Similar to the model of Gale and Church (1993)
  A simple translation model is used to estimate the cost of an alignment
  Corpora: Canadian Hansard, European Economic Community proceedings (millions of sentences)
  Estimated error rate: 0.4%
  Most of the errors are due to the sentence boundary detection method => no further improvement
Chen aligns sentences by constructing a simple word-to-word translation model. The best alignment is the one that maximizes the likelihood of generating the corpus given the translation model, and it can be found by dynamic programming. Length-based methods are not robust to noise and the preceding lexical method is too slow to be practical; this method is fast enough for practical use, robust to noise, and more accurate.
The model is essentially like that of Gale and Church (1993), except that a translation model is used to estimate the cost of an alignment. To align two texts S and T, they are divided into a sequence of sentence beads Bk, each containing zero or more sentences of each language, so that the sequence of beads covers the corpus. Assuming independence between sentence beads, the most probable alignment A is
  $\arg\max_A P(S, T, A) = \arg\max_A P(L) \prod_{k=1}^{L} P(B_k)$
where P(L) is the probability that an alignment of L beads is generated. The estimates are more accurate because a translation model is taken into account.
The translation model ignores word order and ignores the possibility that a word corresponds to more than one word in the translation. It uses word beads, restricted to 1:0, 0:1, and 1:1. The essence of the model is that when one word is commonly translated by another, the probability of the corresponding 1:1 word bead is much higher than the alternatives. The details of the translation model are omitted here; a closely related model appears in Section 13.3.
To compute the probability of an alignment, the program does not sum over all possible word beadings derived from the sentences in a bead but takes the best one, which it finds by greedy search. It starts with 1:0 and 0:1 beads and greedily replaces them with 1:1 beads as long as the probability increases. The parameters of Chen's model can be estimated with a Viterbi version of the EM algorithm. The model is bootstrapped from a small corpus of 100 sentence pairs aligned by hand.

Haruno and Yamazaki (1996)
  Align structurally different languages
  A variant of Kay and Röscheisen (1993)
  Do lexical matching on content words only
    POS tagger
  To align short texts, use an online dictionary
  Knowledge-rich approach
  The combined methods
    Good results even on short texts between very different languages
The method is essentially a variant of Kay and Röscheisen's, but it contains several interesting points. First, for structurally very different languages such as English and Japanese, including function words in lexical matching actually impedes alignment. The authors therefore remove all function words and do lexical matching on content words only, using part-of-speech taggers for the two languages to identify them. Second, when aligning short texts there are not enough repeated words, so an online dictionary is used to find matching word pairs. Both techniques move the method from a knowledge-poor approach toward a knowledge-rich one. In practice, knowledge sources such as taggers and online dictionaries are widely available, and using them avoids mistakes that come from working on an impoverished basis. On the other hand, when dealing with more technical texts it is important to find word correspondences in the text itself, and a dictionary cannot replace this. With the combined methods they achieve very good results, even on short texts between very different languages.

Word Alignment
Uses
  Terminology databases, bilingual dictionaries
Methods
  Text alignment -> word alignment
  χ² measure
  EM algorithm
  Use of existing bilingual dictionaries
Aligned text is used to derive bilingual dictionaries and terminology databases. This is usually done in two steps. First, the text alignment is extended to a word alignment. Then a criterion such as frequency is used to select the aligned pairs. For example, if the pair adeptes – produces occurs only once, we would probably not put it in the dictionary.
One approach to word alignment was briefly discussed in Section 5.3.3: word alignment based on measures of association. Association measures such as χ² are an efficient way of computing word alignments from a bitext. In many cases they are sufficient, especially when a high confidence threshold is used. Association measures can produce wrong results, however; house is one example. Pairs such as chambre-house can be sorted out by taking into account a source of information that association measures ignore: the fact that a given word is usually translated by a single word in the other language. This is of course only partly true of the words in aligned text, but assuming one-to-one correspondence gives highly accurate results. Most algorithms that incorporate this kind of information are implementations of the EM algorithm.
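The χ² association measure for a candidate word pair can be computed from a 2×2 contingency table over aligned sentence pairs. The sketch below uses the standard closed form for a 2×2 table; the function name and the counts are invented for illustration and assume that no row or column total is zero.

```python
def chi_square(o11, o12, o21, o22):
    """Pearson chi-square statistic for a 2x2 contingency table.

    o11: aligned pairs containing both the source word and the target word
    o12: pairs containing the source word only
    o21: pairs containing the target word only
    o22: pairs containing neither word
    """
    n = o11 + o12 + o21 + o22
    numerator = n * (o11 * o22 - o12 * o21) ** 2
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return numerator / denominator


# Hypothetical counts for one word pair in 1000 aligned sentence pairs.
score = chi_square(20, 5, 5, 970)
print(score)
```

A pair is kept only when the score exceeds a chosen threshold; 3.841 is the critical value for one degree of freedom at the 5% level, and a much stricter threshold is the natural choice when high confidence is wanted.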

Statistical Machine Translation(1)
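[Figure: noisy channel model. Language model P(e) -> e -> translation model P(f|e) -> f -> decoder, ê = argmax_e P(e|f) -> ê]
Noisy channel model in MT
  Language model
  Translation model
  Decoder
We use the noisy channel model introduced in Section 2.2.4. In this model the channel turns an English sentence e into a French sentence f, and the decoder recovers the most probable English sentence ê from f.
Language model: gives the probability P(e) of the English sentence e. It can be based on n-grams (Chapter 6) or on a probabilistic grammar (Chapter 11).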
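The decoder's objective follows from Bayes' rule. A short derivation, using only the quantities named on the slide (P(e), P(f|e), and ê):

```latex
\begin{align*}
\hat{e} &= \arg\max_{e} P(e \mid f) \\
        &= \arg\max_{e} \frac{P(e)\, P(f \mid e)}{P(f)} \\
        &= \arg\max_{e} P(e)\, P(f \mid e)
\end{align*}
```

The last step holds because P(f) does not depend on e, which is why the system needs only a language model and a translation model.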

Translation model
Compute P(f|e) by summing the probabilities of all alignments
  $P(f \mid e) = \frac{1}{Z} \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \prod_{j=1}^{m} P(f_j \mid e_{a_j})$
  e: English sentence
  l: the length of e in words
  f: French sentence
  m: the length of f in words
  fj: word j in f
  aj: the position in e that fj is aligned with
  eaj: the word in e that fj is aligned with
  P(wf|we): translation probability
  Z: normalization constant
The nested summations run over all possible alignments of French words to English words, so P(f|e) is obtained as the sum of the probabilities of all alignments. Two assumptions are made: 1. each French word is generated by exactly one English word; 2. each French word is generated independently of the other French words in the sentence.
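The sum over alignments can be computed directly for short sentences, and the independence assumption lets it be rewritten as a product of per-word sums. The sketch below shows both under stated assumptions: position 0 is taken to be an empty English word written "NULL", the normalization constant Z is passed in as `z` (default 1.0, that is, unnormalized), and the probability table `T` is invented.

```python
import itertools
import math


def p_f_given_e_bruteforce(f, e, t, z=1.0):
    """Sum the product of word translation probabilities over all alignments."""
    e0 = ["NULL"] + e  # position 0 is the empty word (an assumption)
    total = 0.0
    for a in itertools.product(range(len(e0)), repeat=len(f)):
        total += math.prod(t.get((fj, e0[aj]), 0.0) for fj, aj in zip(f, a))
    return total / z


def p_f_given_e_factored(f, e, t, z=1.0):
    """Same value, using sum-of-products = product-of-sums."""
    e0 = ["NULL"] + e
    return math.prod(sum(t.get((fj, ei), 0.0) for ei in e0) for fj in f) / z


# Hypothetical translation probabilities P(wf | we), keyed by (wf, we).
T = {
    ("la", "the"): 0.7, ("la", "house"): 0.1, ("la", "NULL"): 0.2,
    ("maison", "the"): 0.05, ("maison", "house"): 0.8, ("maison", "NULL"): 0.1,
}
F, E = ["la", "maison"], ["the", "house"]
print(p_f_given_e_bruteforce(F, E, T), p_f_given_e_factored(F, E, T))
```

The brute-force version enumerates (l+1)^m alignments, while the factored version needs only m·(l+1) table lookups, which is what makes the model trainable in practice.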

Decoder
  Search space is infinite => stack search
Translation probability: P(wf|we)
  Assume that we have a corpus of aligned sentences
  EM algorithm
The search space is infinite, so a heuristic search algorithm is needed. One possibility is stack search. Partial translation hypotheses are kept on a stack; at each step the hypotheses are extended, and the stack is then pruned back to its previous size. This does not guarantee finding the best translation, but it can be implemented efficiently.
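Estimating P(wf|we) from aligned sentences with EM can be sketched as below. This is a minimal version in the style of the simplest word-based models, written under assumptions: no empty word, uniform initialization, a fixed number of iterations, and an invented two-sentence corpus. It is not the training procedure of any particular system.

```python
from collections import defaultdict


def train_translation_probs(pairs, iterations=10):
    """EM estimate of t[(wf, we)] = P(wf | we) from (french, english) pairs."""
    f_vocab = {wf for f, _ in pairs for wf in f}
    t = defaultdict(lambda: 1.0 / len(f_vocab))  # uniform start
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: distribute each French word over the English words of
        # its sentence in proportion to the current probabilities.
        for f, e in pairs:
            for wf in f:
                norm = sum(t[(wf, we)] for we in e)
                for we in e:
                    frac = t[(wf, we)] / norm
                    count[(wf, we)] += frac
                    total[we] += frac
        # M-step: renormalize the expected counts per English word.
        for (wf, we), c in count.items():
            t[(wf, we)] = c / total[we]
    return t


# Hypothetical aligned corpus.
CORPUS = [
    (["la", "maison"], ["the", "house"]),
    (["la", "fleur"], ["the", "flower"]),
]
T_HAT = train_translation_probs(CORPUS)
print(T_HAT[("la", "the")], T_HAT[("maison", "house")])
```

Because "la" co-occurs with "the" in both sentence pairs, EM shifts probability mass toward that pairing, which in turn frees "maison" to pair with "house".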

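Problems
  Distortion
  Fertility: the number of French words one English word generates
Experiment
  48% of French sentences were decoded correctly
  Incorrect decodings
  Ungrammatical decodings
Distortion: if, for example, an English word at the beginning of the sentence is aligned with a French word at the end of the sentence, this distortion in the positions of the two words lowers the probability of the alignment.
Fertility: how many French words a single English word is expressed by.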

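Detailed Problems
Model problems
  Fertility is asymmetric
  Independence assumption
  Sensitivity to training data
  Efficiency
Lack of linguistic knowledge
  No notion of phrase
  Non-local dependencies
  Morphology
  Sparse data problems
Model problems
  Fertility is asymmetric: sometimes one French word corresponds to several English words.
  Independence assumption: the model assumes that one word does not influence another. The assumption is widely used, but it ignores general properties of language.
  Sensitivity to training data: the model is sensitive to its training data.
  Efficiency: sentences longer than 30 words were excluded for efficiency.
Lack of linguistic knowledge
  No notion of phrases: the model relates only individual words to each other, which is why the fertility phenomenon arises. Adding a notion of phrases would solve this to some degree.
  Non-local dependencies: these are affected by the independence assumption.
  Morphology: morphologically related words are treated as different symbols.
  Sparse data problems: parameters are estimated from the training corpus alone, without help from other information about words, so the estimates are unreliable.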
