1 Kovid Kapoor - 08005037 Aashimi Bhatia – 08D04008 Ravinder Singh – 08005018 Shaunak Chhaparia – 07005019

2 Examples of ancient languages which were lost Motivation: Why should we bother about such languages? The manual process of decipherment Motivation for a computational model A statistical method for decipherment Conclusions

3 A language is said to be “lost” when modern scholars cannot reconstruct text written in it. This is slightly different from a “dead” language – a language which people can translate to/from, but which no one uses in everyday life anymore. A language is generally lost when it is replaced by another; for example, Native American languages were replaced by English, Spanish, etc.

4 Egyptian Hieroglyphs A formal writing system used by the ancient Egyptians, containing logographic and alphabetic symbols. Finally deciphered in the early 19th century, following the lucky discovery of the Rosetta Stone. Ugaritic Language Tablets with engravings found in the lost city of Ugarit, Syria. Researchers recognized that it is related to Hebrew, and could identify some parallel words.

5 Indus Script Written in and around present-day Pakistan around 2500 BC. Over 4000 samples of the text have been found. Still not deciphered successfully! What makes it so difficult to decipher?

6 Historical knowledge expansion Very helpful in learning about the history of the place where the language was written. Alternate sources of information: coins, drawings, buried tombs. These sources are not as precise as reading the literature of the region, which gives a much clearer picture. Learning about the past explains the present A lot of the culture of a place is derived from ancient cultures. Boosts our understanding of our own culture.

7 From a linguistic point of view We can figure out how certain languages developed over time. The origin of some words can be explained.

8 Similar to a cryptographic decryption process Frequency-analysis-based techniques are used. First step: identify the writing system Logographic, alphabetic or syllabaries? Usually determined by the number of distinct symbols. Identify if there is a closely related known language Hope for finding bitexts: translations of a text of the language into a known language, like Latin, Hebrew, etc.
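The symbol-counting heuristic above can be sketched in code. This is a rough illustration, not a method from any paper; the thresholds are the conventional rules of thumb (a few dozen signs for an alphabet, roughly 50–100 for a syllabary, hundreds for a logographic script), not exact boundaries.

```python
# Rough sketch: guess the writing system from the size of the symbol inventory.
# Thresholds are conventional rules of thumb, assumed for illustration.
def guess_writing_system(corpus: str) -> str:
    symbols = set(corpus) - {" "}        # ignore word separators
    n = len(symbols)
    if n < 40:
        return "alphabetic"              # alphabets have a few dozen signs
    elif n < 100:
        return "syllabary"               # syllabaries typically have 50-100 signs
    else:
        return "logographic"             # logographic scripts have hundreds

print(guess_writing_system("abcd abce bcda"))  # alphabetic (5 distinct symbols)
```

Ugaritic, with only 30 distinct symbols, would fall squarely in the alphabetic band under this heuristic.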

9 The earliest attempt was made by Horapollo in the 5th century. However, his explanations were mostly wrong, and proved to be an impediment to the process for 1000 years! Arab historians were able to partly decipher the script in the 9th and 10th centuries. Major breakthrough: the discovery of the Rosetta Stone by Napoleon's troops.

10 The stone has a decree issued by the king in three scripts: hieroglyphs, Demotic, and ancient Greek! Finally deciphered in 1822 by Jean-François Champollion. Note that even with the availability of a bitext, full decipherment took 20 more years!

11 The inscribed words consist of only 30 distinct symbols, so the script is very likely alphabetic. The location of the tablets suggested that the language is closely related to the Semitic languages. Some words in Ugaritic had the same origin as words in Hebrew; for example, the Ugaritic word for king is the same as the Hebrew word.

12 Lucky discovery: Hans Bauer assumed that the writing on an axe that was found was the word “axe”! This led to the revision of some earlier hypotheses, and resulted in the decipherment of the entire script!

13 A very time-consuming exercise; years, even centuries, have been taken for successful decipherments. Even when some basic information about the language is known, such as the syntax structure or a closely related language, a long time is still required to produce character and word mappings.

14 Once some knowledge about the language has been learnt, is it possible to use a program to produce word mappings? Can the knowledge of a closely related language be used to decipher a lost language? If possible, this would save a lot of effort and time. “Successful archaeological decipherment has turned out to require a synthesis of logic and intuition…that computers do not (and presumably cannot) possess.” – Andrew Robinson

15 Notice that manual efforts have some guiding principles A common starting point is to compare letter and word frequencies with a known language Morphological analysis plays a crucial role as well Highly frequent morpheme correspondences can be particularly revealing. The model tries to capture these letter/word level mappings and morpheme correspondences.

16 We are given a corpus in the lost language, and a non-parallel corpus in a related language from the same family. Our primary goals: Finding the mapping between the alphabets of the lost and known languages. Translating words in the lost language into corresponding cognates in the known language.

17 We make several assumptions in this model: That the writing system is alphabetic in nature Can be easily verified by counting the number of distinct symbols in the found records. That the corpus has been transcribed into an electronic format This means that each character is uniquely identified. About the morphology of the language: Each word consists of a stem, prefix and suffix, where the latter two may be omitted This holds true for a large variety of human languages

18 The morpheme inventories and their frequencies in the known language are given. In essence, the input consists of two parts: A list of unanalyzed words in the lost language A morphologically analyzed lexicon in the known related language

19 Consider the following example, consisting of words in a lost language closely related to English, but written using numerals: 15234 → asked, 1525 → asks, 4352 → desk. Notice the pair of endings, -34 and -5, attached to the same initial sequence 152-. These might correspond to -ed and -s respectively. Thus, 3=e, 4=d and 5=s.
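The suffix-matching step above can be sketched directly in code; the digit words and English glosses are the slide's own toy example, and the stem/suffix alignment logic is a minimal illustration of the intuition, not the paper's algorithm.

```python
# Toy sketch of the suffix-matching intuition: two lost-language words share
# the stem "152", so their leftover endings are aligned with the English
# suffixes -ed and -s to recover character mappings.
a, b = "15234", "1525"   # "asked" and "asks" in the toy numeral script

# Find the longest common prefix (the shared stem).
stem = ""
for x, y in zip(a, b):
    if x != y:
        break
    stem += x
# stem is now "152"

# Align the leftover endings with the hypothesized suffixes -ed and -s.
mapping = {}
for lost_suffix, known_suffix in [(a[len(stem):], "ed"), (b[len(stem):], "s")]:
    for lc, kc in zip(lost_suffix, known_suffix):
        mapping[lc] = kc

print(mapping)  # {'3': 'e', '4': 'd', '5': 's'}
```

With these three character mappings in hand, the third word 4352 becomes "des?", which English knowledge completes to "desk", exactly as the next slide argues.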

20 Now we can say that 435=des, and using our knowledge of English, we can suppose that this word is very likely to be desk. As this example illustrates, we proceed by discovering both character- and morpheme-level mappings. Another intuition the model should capture is the sparsity of the mapping: the correct mapping will preserve phonetic relations between the two related languages, so each character in the unknown language will map to only a small number of characters in the related language.

21 We assume that each morpheme in the lost language is probabilistically generated jointly with a latent counterpart in the known language The challenge: Each level of correspondence can completely describe the observed data, so using a mechanism based on one leaves no room for the other. The solution: Using a Dirichlet process to model the probabilities (explained further).

22 There are four basic layers in the generative process: Structural sparsity Character-edit distribution Morpheme-pair distributions Word generation


24 We need to control the sparsity of the edit-operation probabilities, encoding the linguistic intuition that the character-level mapping should be sparse. The set of edit operations includes character substitutions, insertions and deletions. We assign a variable λ_e to every edit operation e. The set of character correspondences with the variable set to 1, { (u,h) : λ_(u,h) = 1 }, conveys a set of phonetically valid correspondences. We define a joint prior over these variables to encourage sparse character mappings.

25 This prior can be viewed as a distribution over binary matrices and is defined to encourage every row and column to sum to low integer values (typically 1). For a given matrix, define a count c(u), the number of corresponding letters that u has in that matrix. Formally, c(u) = ∑_h λ_(u,h). We now define a function f_i = max(0, |{u : c(u) = i}| − b_i). For any i other than 1, f_i should be as low as possible. The probability of the matrix is then given by P(λ) = (1/Z) exp(∑_i w_i f_i).
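The unnormalized form of this prior, exp(∑_i w_i f_i), is easy to compute for a concrete matrix. The tiny 3×3 matrix and the b and w values below are illustrative assumptions, not values from the paper.

```python
# Sketch of the sparsity prior over binary character-mapping matrices.
# Rows are lost-language characters u, columns are known-language characters h.
import math

matrix = [
    [1, 0, 0],   # u0 maps to exactly one character: unpenalized
    [1, 1, 0],   # u1 maps to two characters: penalized below
    [0, 0, 1],   # u2 maps to exactly one character: unpenalized
]

def unnormalized_prior(matrix, b, w):
    counts = [sum(row) for row in matrix]       # c(u) = sum_h lambda_(u,h)
    score = 0.0
    for i in w:
        # f_i = max(0, |{u : c(u) = i}| - b_i)
        f_i = max(0, sum(1 for c in counts if c == i) - b.get(i, 0))
        score += w[i] * f_i                     # w_i <= 0 penalizes non-sparse rows
    return math.exp(score)                      # proportional to P(lambda); Z omitted

b = {1: 3}               # assumed budget: up to 3 rows may have exactly one mapping
w = {1: 0.0, 2: -2.0}    # assumed weights: rows with two mappings are penalized
print(unnormalized_prior(matrix, b, w))  # exp(-2.0), about 0.135
```

Only the row with two mappings contributes a penalty here; a fully one-to-one matrix would score exp(0) = 1, the maximum.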

26 Here Z is the normalization factor and w is the weight vector. w_i is either zero or negative, to ensure that the probability is high for low values of f_i. The values of b_i and w_i can be adjusted depending on the number of characters in the lost language and the related language.

27 We now draw a base distribution G_0 over character edit sequences. The probability of a given edit sequence, P(e), depends on the indicator variables λ_e of the individual edit operations, and on a function of the number of insertions and deletions in the sequence, q(#ins(e), #del(e)). This factor depends on the average word lengths of the lost language and the related language.

28 Example: the average Ugaritic word is 2 letters longer than the average Hebrew word. Therefore, we set q to disallow any deletions and to allow 1 insertion per sequence, with probability 0.4. The part depending on the λ_e's makes the distribution spike at 0 if the indicator is 0 and leaves it unconstrained otherwise (a spike-and-slab prior).
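The length-correction factor just described can be written as a tiny function. The slide only fixes the 0.4 probability for one insertion; the 0.6 complement for zero insertions is an assumption made here so the two allowed cases sum to 1.

```python
# Sketch of the length-correction factor q(#ins, #del), tuned as the slide
# describes for the Ugaritic/Hebrew pair: deletions are disallowed, and at
# most one insertion per sequence is allowed, with probability 0.4.
def q(n_ins: int, n_del: int) -> float:
    if n_del > 0 or n_ins > 1:
        return 0.0                       # forbidden: any deletion, or 2+ insertions
    return 0.4 if n_ins == 1 else 0.6    # 0.6 is an assumed complement, not from the slide

print(q(0, 0), q(1, 0), q(0, 1))  # 0.6 0.4 0.0
```

Multiplying P(e) by this factor steers the model toward edit sequences whose lengths match the observed average word-length gap between the two languages.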

29 The base distribution G_0, along with a fixed concentration parameter α, defines a Dirichlet process, which provides a distribution over morpheme-pair distributions. The resulting distributions are likely to be skewed in favor of a few frequently occurring morpheme pairs, while remaining sensitive to the character-level probabilities of the base distribution. Our model distinguishes between the three kinds of morphemes (prefixes, stems and suffixes), and we therefore use a different value of α for each.
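The "skewed toward a few frequent pairs" behavior is the rich-get-richer property of a Dirichlet process, often pictured via the Chinese restaurant process. The sketch below illustrates only that predictive rule; the morpheme pairs, the α value, and the uniform base draw are assumptions for illustration.

```python
# Chinese-restaurant-process view of DP sampling: with probability
# alpha / (n + alpha) draw a fresh morpheme pair from the base G0,
# otherwise reuse one of the n earlier draws (favoring frequent pairs).
import random

def crp_draw(previous_draws, alpha, base_draw):
    n = len(previous_draws)
    if random.random() < alpha / (n + alpha):
        return base_draw()                    # new pair, guided by G0
    return random.choice(previous_draws)      # reuse: rich get richer

random.seed(0)
pairs = [("mlk", "melek"), ("bn", "ben"), ("b'l", "ba'al")]  # illustrative pairs
draws = []
for _ in range(20):
    draws.append(crp_draw(draws, alpha=1.0, base_draw=lambda: random.choice(pairs)))
# Pairs drawn early tend to dominate later draws, mimicking the skew the slide describes.
```

Using separate α values for prefixes, stems and suffixes simply means running three such processes with different propensities to introduce new pairs.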

30 Also, since the suffix and prefix depend on the part of speech of the stem, we draw a single distribution G_stm for stems, but maintain separate distributions G_suf|stm and G_pre|stm for each possible stem part of speech.

31 Once the morpheme-pair distributions have been drawn, actual word pairs may be generated. Based on a prior, we first decide whether a word in the lost language has a cognate in the known language. If it does, a cognate word pair (u, h) is produced; otherwise, a lone word u is generated.
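This final generation step is a simple two-way branch. In the sketch below, the 0.7 prior and the two sampler stubs are assumptions for illustration; in the real model the pair sampler would draw from the morpheme-pair distributions described on the previous slides.

```python
# Sketch of the word-generation step: with prior probability P_COGNATE a
# lost-language word is generated jointly with a known-language cognate h;
# otherwise a lone word u is generated with no counterpart.
import random

P_COGNATE = 0.7   # assumed prior that a lost word has a cognate

def generate_word(sample_pair, sample_lone):
    if random.random() < P_COGNATE:
        return sample_pair()          # cognate pair (u, h)
    return (sample_lone(), None)      # lone lost-language word u

random.seed(0)
words = [generate_word(lambda: ("mlk", "melek"), lambda: "xyz") for _ in range(10)]
```

At inference time this branch is what lets the model leave roughly two-thirds of the Ugaritic vocabulary unmatched (see the results slide) rather than forcing a cognate for every word.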

32 This model captures both character- and lexical-level correspondences, while utilizing morphological knowledge of the known language. An additional feature of this multi-layered model structure is that each distribution over morpheme pairs is derived from the single character-level base distribution G_0. As a result, any character-level mappings learned from one correspondence will be propagated to the other morpheme distributions. Also, the character-level mappings obey the sparsity constraints.

33 Applied to the Ugaritic language. The undeciphered corpus contains 7,386 unique word types. The Hebrew Bible was used as the known-language corpus; Hebrew is closely related to ancient Ugaritic. Morphological and POS annotations are assumed to be available for the Hebrew lexicon.

34 The method identifies Hebrew cognates for 2,155 words, covering almost one-third of the Ugaritic vocabulary. The baseline method correctly maps 22 out of 30 characters to their Hebrew counterparts, and translates only 29% of all cognates. This method correctly translates 60.4% of all cognates, and yields correct mappings for 29 out of 30 characters.

35 Even with correct character mappings, many words can be correctly translated only by examining their context. The model currently fails to take this contextual information into account.

36 We saw how language decipherment is an extremely complex task. Years of effort are required for the successful decipherment of each lost language. Success depends on the amount of available corpus in the unknown language, but availability alone does not make it easy. The statistical model has shown promise, and can be developed further and applied to more languages.

37 Wikipedia article on the decipherment of Egyptian hieroglyphs. Lost Languages: The Enigma of the World's Undeciphered Scripts, Andrew Robinson (2009). A Statistical Model for Lost Language Decipherment, Benjamin Snyder, Regina Barzilay, and Kevin Knight, ACL (2010).

38 A staff talk from the Straight Dope Science Advisory Board – How come we can't decipher the Indus Script? (2005). Wade Davis on Endangered Cultures (2008).
