
Entropy in Machine Transliteration & Phonology Bhargava Reddy 110050078 B.Tech Project.


1 Entropy in Machine Transliteration & Phonology Bhargava Reddy 110050078 B.Tech Project

2 Contents: Entropy (Information Theory); Mathematical Formulation; Cross Entropy; Transliterability and Transliteration Performance; WAVE; Phonology; Syllables; Some Syllabification Rules

3 What is Entropy? Entropy is the amount of information obtained in each message received. It characterizes our uncertainty about the source of information (its randomness). It is the expected value of the information content of a random variable. Based on Shannon's: A Mathematical Theory of Communication

4 Properties and Mathematical Formulation Based on Shannon's: A Mathematical Theory of Communication

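5 Explanation of property 3 [Figure: a choice among three outcomes with probabilities 1/2, 1/3, 1/6, redrawn as a first choice with probabilities 1/2, 1/2 followed by a second choice with probabilities 2/3, 1/3.] Based on Shannon's: A Mathematical Theory of Communication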
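Property 3 in Shannon's paper states that when a choice is broken into successive choices, the original entropy equals the weighted sum of the entropies of the individual choices. The following is a minimal sketch that checks this numerically for the probabilities on this slide. The function name `entropy` and the use of base-2 logarithms are assumptions made for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Direct three-way choice with probabilities 1/2, 1/3, 1/6.
direct = entropy([1/2, 1/3, 1/6])

# The same outcomes reached in two stages: a fair first choice, then,
# half of the time, a second choice with probabilities 2/3 and 1/3.
staged = entropy([1/2, 1/2]) + 0.5 * entropy([2/3, 1/3])

# The two values agree up to floating-point rounding.
print(direct, staged)
```

The factor 0.5 is the weight of the second choice, which only happens half of the time.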

6 The Formula for Entropy Based on Shannon's: A Mathematical Theory of Communication
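The formula shown on this slide did not survive in the transcript. For reference, the entropy of a source with outcome probabilities p_1, ..., p_n, as given in Shannon's paper, can be written as follows. The symbol names are Shannon's; K is a positive constant that only fixes the unit of measure (with K = 1 and base-2 logarithms, H is measured in bits).

```latex
H = -K \sum_{i=1}^{n} p_i \log p_i
```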

7 Properties Based on Shannon's: A Mathematical Theory of Communication

8 The Notion of Cross Entropy
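The slide's own definition is not preserved in the transcript. As a hedged illustration, the standard cross entropy between a true distribution p and a model distribution q is H(p, q) = -Σ p(x) log q(x). The sketch below assumes base-2 logarithms and uses made-up toy distributions; `cross_entropy`, `true_dist`, and `model` are names chosen for this example.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log2 q(x), in bits.

    Assumes q(x) > 0 wherever p(x) > 0; otherwise math.log2 raises.
    """
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

true_dist = [0.5, 0.25, 0.25]   # toy "true" distribution (made up)
model = [0.25, 0.25, 0.5]       # toy model distribution (made up)

h_pp = cross_entropy(true_dist, true_dist)  # equals the entropy of p
h_pq = cross_entropy(true_dist, model)      # never smaller than h_pp

print(h_pp, h_pq)
```

The gap between the two values measures how poorly the model matches the true distribution.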

9 Transliterability and Transliteration Performance Ref: Compositional Machine Transliteration (2010), A Kumaran, Mitesh M Khapra, Pushpak Bhattacharya

10 Transliterability Measure
A measure of the ease of transliteration between languages should have the following desirable qualities:
1. Rely purely on orthographic features of the languages (so that it is easily calculated from parallel names corpora).
2. Capture and weigh the inherent ambiguity in transliteration at the character level (i.e., the average number of character mappings).
3. Weigh the ambiguous transitions for a given character according to the transition frequencies, since highly ambiguous mappings may occur only rarely.
The transliterability measure Weighted AVerage Entropy (WAVE) meets these requirements. Ref: Compositional Machine Transliteration (2010), A Kumaran, Mitesh M Khapra, Pushpak Bhattacharya

11 WAVE Ref: Compositional Machine Transliteration (2010), A Kumaran, Mitesh M Khapra, Pushpak Bhattacharya

12 Motivation From the adjacent table we can conclude that the unigram ‘a’ occurs nearly 150 times more often than the unigram ‘x’, which implies that capturing the ambiguities of ‘a’ is more beneficial than capturing those of ‘x’. The term ‘frequency(i)’ captures this effect. Table IV shows the mappings from the source to the target language. We can observe that the unigram c maps to 2 characters, स and क, whereas p maps to only one, प. The term ‘Entropy(i)’ captures this information and ensures that c is weighted more than p. Ref: Compositional Machine Transliteration (2010), A Kumaran, Mitesh M Khapra, Pushpak Bhattacharya
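The two terms named on this slide can be combined into a small sketch. It assumes that WAVE is the frequency-weighted average of the per-unit mapping entropies, as the names ‘frequency(i)’ and ‘Entropy(i)’ suggest; the exact formulation in the paper may differ in detail. The function name `wave`, the base-2 logarithm, and the toy alignments are all assumptions made for illustration, not data from the paper.

```python
import math
from collections import Counter, defaultdict

def wave(aligned_pairs):
    """Frequency-weighted average entropy over source units.

    aligned_pairs: iterable of (source_unit, target_unit) alignments.
    """
    freq = Counter()
    mappings = defaultdict(Counter)
    for src, tgt in aligned_pairs:
        freq[src] += 1
        mappings[src][tgt] += 1

    total = sum(freq.values())
    score = 0.0
    for src, count in freq.items():
        # Entropy(i): ambiguity of the mappings of source unit i.
        ent = -sum((c / count) * math.log2(c / count)
                   for c in mappings[src].values())
        # frequency(i) / total: relative frequency of source unit i.
        score += (count / total) * ent
    return score

# Toy, made-up alignments: 'c' maps to either स or क, 'p' always to प.
toy = [("c", "स"), ("c", "क"), ("p", "प"), ("p", "प")]
print(wave(toy))
```

In this toy corpus only 'c' contributes to the score, because the single mapping of 'p' has zero entropy.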

13 Plot Between WAVE and Transliteration Quality The following plots are drawn between log(WAVE) and the accuracy measure (for a training corpus of approximately 15k) for the language pairs En-Hi, En-Ka, En-Ma, Hi-En, Ka-En, Hi-Ka, Ma-Ka. We can see that as the value of WAVE increases, the accuracy decreases exponentially. The two top-left points in each of the plots are for Hindi and Marathi, languages that share the same orthography and have a large number of one-to-one character mappings between them. We can observe that the different n-grams give almost similar results, which means we can choose the unigram model to generalize the model. Based on these observations we can term two languages with a small unigram WAVE measure (WAVE_1) as more easily transliterable.

14 Phonology Phonetics: Concerned with how speech sounds are produced in the vocal tract, as well as with the physical properties of the speech sound waves generated by the larynx and vocal tract. Phonology: Refers to the abstract principles that govern the distribution of sounds in a language. It is the subfield of linguistics that studies the structure and systematic patterning of sounds in human language. Ref: Linguistics, An Introduction to Language and Communication, Adrian Akmajian

15 Views of Phonology Phonology is broadly viewed in 2 ways: 1. As the description of the sounds of a particular language and the rules governing the distribution of these sounds (e.g., the phonology of English, German, or another language). 2. As the part of the general theory of human language that is concerned with the universal properties of natural language sound systems. The English language has 44 phonemes (20 vowel sounds and 24 consonant sounds). These phonemes can be generalized so that they can be adapted to many languages. Ref: Linguistics, An Introduction to Language and Communication, Adrian Akmajian

16 English Language Phonemes Generalized

17 Syllables A syllable is a unit of organization for a sequence of speech sounds. Syllables are often considered the phonological “building blocks” of words. Syllabic writing began several hundred years before the first letters. A word that consists of a single syllable is called a monosyllable. Similar terms include disyllable for a word of 2 syllables, trisyllable for a word of 3 syllables, and polysyllable, which may refer to a word of more than 3 syllables. Ref: Linguistics, An Introduction to Language and Communication, Adrian Akmajian

18 Syllable A syllable has the following structure: [Figure: a syllable (σ) branching into Onset (O), Nucleus (N), and Coda (C).] Across the world’s languages the most common type of syllable has the structure CV(C), that is, a single consonant C followed by a single vowel V, followed in turn (optionally) by a single consonant.

19 Syllable Grouping Consider the word napkin, which can be split as “nap-kin”. [Figure: napkin as two syllables, σ1 and σ2. σ1: onset n, nucleus æ, coda p. σ2: onset k, nucleus i, coda n.] Ref: Linguistics, An Introduction to Language and Communication, Adrian Akmajian

20 Some Syllabification Rules Aspiration Rule: Phonemes with the features [-continuant, -voiced] are aspirated in syllable-initial position. /p/ is a [-continuant, -voiced] phoneme. If the intervocalic consonant p in the sequence apa is the onset of the second syllable, it will be aspirated; if it is the coda of the first syllable, it will not be aspirated. As you pronounce the sequence apa, place your hand in front of your mouth. You will feel a small puff of air that accompanies the release of the p, regardless of whether you stress the first a or the second. The presence of aspiration is the evidence we need to conclude that the word apartment is syllabified as “a-part-ment”.
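The aspiration rule can be sketched as a check on syllable-initial segments. This is a rough illustration only: it uses letters as stand-ins for phonemes and takes English /p/, /t/, /k/ as the [-continuant, -voiced] segments. The names `VOICELESS_STOPS` and `aspirated_positions` are hypothetical.

```python
# English voiceless stops, i.e. [-continuant, -voiced] segments.
# Letters stand in for phonemes here, which is a simplification.
VOICELESS_STOPS = {"p", "t", "k"}

def aspirated_positions(syllables):
    """Return (syllable_index, segment) for each syllable-initial voiceless stop."""
    return [(i, syl[0]) for i, syl in enumerate(syllables)
            if syl and syl[0] in VOICELESS_STOPS]

# With p as the onset of the second syllable, the rule predicts aspiration.
print(aspirated_positions(["a", "part", "ment"]))  # [(1, 'p')]
# With p as the coda of the first syllable, no aspiration is predicted.
print(aspirated_positions(["ap", "art", "ment"]))  # []
```

Since the p in apartment is in fact aspirated, only the first syllabification is consistent with the rule.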

21 Maximal Onset Principle The Principle: The consonants that combine to form an onset with the vowel on their right are those that correspond to the maximal sequence that is available at the beginning of a syllable anywhere in the language. Illustration: Consider the word “constructs”, which is bisyllabic. Between the 2 vowels is the sequence n-s-t-r, which has to be split. Since the maximal sequence that occurs at the beginning of a syllable in English is str-, we split it as “n-str”. Therefore the word is syllabified as “con-structs”. Why not another split: Assume the split is “ns-tr”. Then the t would be in syllable-initial position and should be aspirated, which is not the case here. Other splits can be ruled out similarly.
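The principle can be sketched as a greedy split of each intervocalic consonant cluster. This is a minimal sketch working on spelling rather than phonemes, and `LEGAL_ONSETS` is a small hypothetical subset of English onsets chosen to cover the examples in these slides, not a complete inventory.

```python
import re

VOWELS = "aeiou"
# Illustrative subset of onsets (an assumption). The empty onset is
# included so that a cluster can fall entirely into the coda.
LEGAL_ONSETS = {"", "k", "m", "n", "p", "r", "s", "t", "st", "tr", "str"}

def syllabify(word):
    """Split a word so that each intervocalic cluster gives the
    following vowel the longest onset found in LEGAL_ONSETS."""
    # Alternating runs of vowels and consonants.
    runs = re.findall(f"[{VOWELS}]+|[^{VOWELS}]+", word)
    syllables, current = [], ""
    for idx, run in enumerate(runs):
        intervocalic = run[0] not in VOWELS and 0 < idx < len(runs) - 1
        if not intervocalic:
            # Vowel runs, word-initial onsets and word-final codas.
            current += run
            continue
        # Try the longest suffix of the cluster first; "" always matches.
        for k in range(len(run) + 1):
            if run[k:] in LEGAL_ONSETS:
                break
        syllables.append(current + run[:k])  # the rest becomes the coda
        current = run[k:]                    # maximal onset of next syllable
    syllables.append(current)
    return syllables

print(syllabify("constructs"))  # ['con', 'structs']
print(syllabify("napkin"))      # ['nap', 'kin']
print(syllabify("apartment"))   # ['a', 'part', 'ment']
```

Trying suffixes from longest to shortest is what makes the onset maximal: for n-s-t-r, the cluster nstr is rejected and str is accepted before tr is ever considered.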

22 References
A Mathematical Theory of Communication (1948), C. E. Shannon, The Bell System Technical Journal, July 1948
Compositional Machine Transliteration (2010), A Kumaran, Mitesh M Khapra, Pushpak Bhattacharya
Linguistics, An Introduction to Language and Communication, Adrian Akmajian, Richard A Demers, Ann K Farmer, Robert M Harnish
Wiki articles on entropy, phonology and transliteration

