Machine Translation across Indian Languages Dipti Misra Sharma LTRC, IIIT Hyderabad Patiala 15-11-2013.


1 Machine Translation across Indian Languages Dipti Misra Sharma LTRC, IIIT Hyderabad Patiala 15-11-2013

2 Outline Introduction Information Dynamics in language Machine Translation (MT) Approaches to MT Practical MT systems Challenges in MT Ambiguities Syntactic differences in L1 and L2 MT efforts in India – Sampark : IL to IL MT systems – Objective – Design – Issues Conclusions

3 Introduction Natural Language Processing (NLP) involves processing information contained in natural languages. Natural as opposed to formal/artificial. Formal languages : programming languages, logic, mathematics etc. Artificial : Esperanto

4 Natural Language Processing (NLP) Helps in communication between: Man – machine : question answering systems, e.g. interactive railway reservation. Man – man : machine translation

5 Communication Transfer of information from one to the other Language is a means of communication Therefore, one can say It encodes what is communicated We apply the processes of Analysis (decoding) for understanding Synthesis (encoding) for expression (speaking)

6 What do we communicate ? Information Spain delivered a football masterclass at Euro 2012 Intention Emphasis/focus  Euro 2012 bagged/won by Spain  Spain bags Euro 2012 Introduces variation

7 How do we communicate ? We use linguistic elements such as Words (country, park, the, is, Bandipur, of, as, and, considered, National, a, spot, beautiful, tourist, life, in, best, wild, sanctuaries, the, one) Arrangement of the words (Sentences) Words are related to each other to provide the composite meaning (Bandipur National park is a beautiful tourist spot and considered as one of the best wild life sanctuaries in the country)

8 How do we communicate ? Contd.. Arrangement of sentences (Discourse) Sentences or parts of sentences are related to each other to provide a cohesive meaning *Considered as one of the best wild life sanctuaries in the country. It is a national park covering an area of about 874 sq km. Bandipur National park is a beautiful tourist spot. Bandipur National park is a beautiful tourist spot and considered as one of the best wild life sanctuaries in the country. It is a national park covering an area of about 874 sq km. Languages differ in the way they organise information in these entities. All of these interact in the organisation of information

9 Information Dynamics in Language (1/4) Languages encode information Hindi: cuuhe maarate haiM kutte 'rat-pl' 'kill-hab' 'pres-pl' 'dog-pl' English: rats kill dogs The Hindi sentence is ambiguous. Possible interpretations: Dogs kill rats / Rats kill dogs However, the English sentence is not ambiguous

10 Information Dynamics in Language (2/4)‏ Ambiguity in Hindi is resolved if, cuuhe maarate haiM kuttoM ko rats kill-hab pres-pl dogs-obl acc Hindi encodes information in morphemes English encodes information in positions Languages encode information differently
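The contrast on slides 9 and 10 can be sketched in a few lines of Python. This is only a toy illustration, not a parser: the function names and the data layout (noun, marker pairs) are assumptions made for the example. It shows position deciding the roles in the English clause, and the postposition 'ko' deciding them in the Hindi clause regardless of word order.

```python
# Illustrative sketch only: toy role assignment, not a real parser.

def roles_by_position(tokens):
    """English-style: in a bare 'N V N' clause, position decides the roles."""
    subject, verb, obj = tokens
    return {"agent": subject, "verb": verb, "patient": obj}


def roles_by_case(nouns, verb):
    """Hindi-style: the postposition 'ko' marks the patient, wherever it stands.

    `nouns` is a list of two (noun, marker) pairs; marker is 'ko' or None.
    Returns every reading that is consistent with the markers.
    """
    marked = [n for n, m in nouns if m == "ko"]
    unmarked = [n for n, m in nouns if m is None]
    if len(marked) == 1 and len(unmarked) == 1:
        return [{"agent": unmarked[0], "verb": verb, "patient": marked[0]}]
    # No marker on either noun: both readings survive.
    a, b = unmarked
    return [
        {"agent": a, "verb": verb, "patient": b},
        {"agent": b, "verb": verb, "patient": a},
    ]


english = roles_by_position(["rats", "kill", "dogs"])
ambiguous = roles_by_case([("cuuhe", None), ("kutte", None)], "maarate haiM")
resolved = roles_by_case([("cuuhe", None), ("kuttoM", "ko")], "maarate haiM")
```

With no marker the Hindi clause keeps two readings; adding 'ko' leaves one, and swapping the order of the two nouns does not change it.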

11 English does not explicitly mark accusative case (except in pronouns) – no morpheme No lexical item/morpheme for yes-no questions (Eng: Is he coming ? Hindi : kyaa vah aa rahaa hai?) Position plays an important role in encoding information in English Subject is sacrosanct Hindi encodes information morphologically

12 Information Dynamics in Language (3/4)‏ Another example, This chair has been sat on – The chair has been used for sitting – Someone sat on this chair, and it is known – The sentence does not mention someone Languages encode information partially

13 Information Dynamics in Language (4/4) English pronouns : he, she, it Hindi pronoun : vaha He is going to Delhi ==> vaha dillii jaa rahaa hai She is going to Delhi ==> vaha dillii jaa rahii hai It broke ==> vaha TuuTa ?? Information does not always map fully from one language into another Conceptual worlds may be different Gender Information

14 Information in Language Languages encode information differently Languages code information only partially Tension between BREVITY and PRECISION

15 Human beings use  World knowledge  Context (both linguistic and extra-linguistic)  Cultural knowledge and  Language conventions to resolve ambiguities Can all this knowledge be provided to the machine ?

16 Languages differ Script (For written language)‏ Vocabulary Grammar These differences can be considered as a measure of language distance

17 Language Distance Script - Vocabulary - Grammar Urdu -> Hindi (script) Telugu -> Hindi (script, vocabulary) English -> Hindi (script, vocabulary, grammar)

18 Machine Translation Machine translation aims at automatic translation of a text in the source language to a text in the target language. Mohan gave Hari a book -> Mohan ne Hari ko kitAba dI

19 Machine Translation Let us view MT as a problem of Language encoding (source) - analysis Decoding (target) - synthesis

20 English to Hindi : An Example SL (Eng) sentence : I met a boy who plays cricket with you everyday Mapped to TL (Hin) : I a boy met who everyday with you cricket plays TL synthesis : mEM eka laDake se milA jo roza tumhAre sAtha kriketa khelatA hE OR mEM roza tumhAre sAtha kriketa khelanevAle eka laDake se milA OR mEM eka Ese laDake se milA jo roza tumhAre sAtha kriketa khelatA hE
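The intermediate "Mapped to TL" string on slide 20 can be reproduced with a very small reordering heuristic. The sketch below assumes the sentence has already been split into clauses and chunks by hand, and the rule used (keep what precedes the verb, mirror what follows it, put the verb last) is an assumption chosen because it reproduces this one example. It is not presented as the transfer grammar of any actual system.

```python
def to_verb_final(chunks):
    """Reorder one clause from English (verb-medial) to Hindi-like (verb-final) order.

    `chunks` is a list of (text, tag) pairs, with tag 'V' on the verb chunk.
    Pre-verbal chunks keep their place, post-verbal chunks are mirrored,
    and the verb moves to the end.
    """
    verb_idx = next(i for i, (_, tag) in enumerate(chunks) if tag == "V")
    before = chunks[:verb_idx]
    after = chunks[verb_idx + 1:]
    reordered = before + after[::-1] + [chunks[verb_idx]]
    return " ".join(text for text, _ in reordered)


# Hand-chunked clauses of: I met a boy who plays cricket with you everyday
main_clause = [("I", "NP"), ("met", "V"), ("a boy", "NP")]
relative_clause = [
    ("who", "NP"), ("plays", "V"), ("cricket", "NP"),
    ("with you", "PP"), ("everyday", "ADV"),
]

mapped = to_verb_final(main_clause) + " " + to_verb_final(relative_clause)
print(mapped)
```

The remaining steps on the slide, lexical substitution and the choice between a relative clause and a participial construction, are left out of this sketch.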

21 Machine Translation : Challenges Languages encode information differently Language codes information only partially Tension between BREVITY and PRECISION Brevity wins leading to inherent ambiguity at different levels

22 Linguistic Issues in MT (1/2)‏ Look at the word 'plot' in the following examples (a) The plot having rocks and boulders is not good. (b) The plot having twists and turns is interesting. 'plot' in (a) means 'a piece of land' and in (b) 'an outline of the events in a story'
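One simple way to picture how context selects a sense of 'plot' is to count overlaps between the sentence and a list of cue words for each sense. The sketch below is a minimal illustration under that assumption. The cue lists are invented for the example, and the Hindi equivalents 'zamiin' and 'kathanak' are the ones the deck gives for 'plot' on slide 34.

```python
# Hypothetical cue words per sense; a real system would learn or curate these.
SENSE_CUES = {
    "zamiin": {"rocks", "boulders", "land", "acre", "fence"},    # a piece of land
    "kathanak": {"twists", "turns", "story", "novel", "film"},   # outline of a story
}


def choose_sense(sentence, cues=SENSE_CUES):
    """Return the sense whose cue words overlap most with the sentence, or None."""
    words = set(sentence.lower().replace(".", "").split())
    scores = {sense: len(words & cue_words) for sense, cue_words in cues.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


sense_a = choose_sense("The plot having rocks and boulders is not good.")
sense_b = choose_sense("The plot having twists and turns is interesting.")
```

A sentence with no cue word, such as "The plot is new", returns None, which is the case where the other knowledge sources discussed in the deck would be needed.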

23 Linguistic Issues in MT (2/2)‏  Ambiguity in Language Lexical level Sentence level  Structural differences between SL and TL

24 Lexical ambiguity  Lexical ambiguity can be both for Content words – nouns, verbs etc Function words – prepositions, TAMs etc  Content words ambiguity is of two types Homonymy Polysemy

25 Homonymy ‏ A word has two or more unrelated senses Example : I was walking on the bank (river-bank)‏ I deposited the money in the bank (money-bank)‏

26 Polysemy 'Act', an English noun 1. It was a kind act to help the blind man across the road (kArya) 2. The hero died in Act four, scene three (aMka) 3. Don't take her seriously, it's all an act (aBinaya) 4. The parliament has passed an Act (dhArA)

27 Function words can also pose problems (1/5)‏ Prepositions English prepositions in the target language ‏ Tense Aspect Modality (TAM)‏ Lexical correspondence of TAM

28 Function words can also pose problems (2/5)‏ Function words can also be ambiguous For example – English preposition 'in' (a) I met him in the garden mEM usase bagIce meM milA (b) I met him in the morning mEM usase subaha 0 milA 'Ambiguity' here refers to the 'appropriate correspondence' in the target language.

29 Function words can also pose problems(3/5)‏  He bought a shirt with tiny collars. usane chote kOlaroM vAlI kamIza kharIdI ‘he tiny collars with shirt bought’ ‘with’ gets translated as ‘vAlI’ in hindi He washed a shirt with soap. usane sAbuna se kamIza dhoI ‘he soap with shirt washed’ ‘with’ gets translated as ‘se’.
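The examples for 'in' and 'with' suggest that the target function word depends on the kind of noun the preposition governs. A hedged sketch of that idea follows. The semantic classes ("place", "time", "part", "instrument") and both tables are assumptions introduced for illustration, and only the four slide examples are covered.

```python
# Hypothetical semantic class of the noun governed by the preposition.
NOUN_CLASS = {
    "garden": "place",
    "morning": "time",
    "collars": "part",
    "soap": "instrument",
}

# (preposition, noun class) -> Hindi function word; "" stands for the zero marker.
# 'vAlI' is the feminine form used on the slide (agreeing with 'kamIza').
PREP_MAP = {
    ("in", "place"): "meM",
    ("in", "time"): "",
    ("with", "part"): "vAlI",
    ("with", "instrument"): "se",
}


def translate_prep(prep, noun):
    """Pick the Hindi correspondence of an English preposition from its noun's class."""
    return PREP_MAP[(prep, NOUN_CLASS[noun])]


examples = {
    "in the garden": translate_prep("in", "garden"),
    "in the morning": translate_prep("in", "morning"),
    "with tiny collars": translate_prep("with", "collars"),
    "with soap": translate_prep("with", "soap"),
}
```

A lookup like this only moves the difficulty into assigning the noun class, which is itself a disambiguation problem.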

30 Function words can also pose problems (4/5) TAM Markers mark tense, aspect and modality Consist of inflections and/or auxiliary verbs in Hindi An important source of information Narrow down the meaning of a verb (e.g. lied, lay)

32 Function words can also pose problems (5/5) English Simple Past vs Habitual 1a. He stayed in the guest house during his visit to our University in Jan (rahA) 1b. He stayed in the guest house whenever he visited us (rahatA thA) 2a. He went to the school just now (gayA) 2b. He went to the school everyday (jAtA thA)
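The four sentences above differ in whether the context signals a repeated event. A minimal sketch of a cue-based choice between the two Hindi forms is given below. The cue list is an assumption made for the example, and the verb forms are the ones shown on the slide.

```python
# Hypothetical list of words that signal a habitual reading.
HABITUAL_CUES = {"whenever", "everyday", "always", "often"}

# Hindi forms taken from the slide's examples.
VERB_FORMS = {
    "stayed": {"simple": "rahA", "habitual": "rahatA thA"},
    "went": {"simple": "gayA", "habitual": "jAtA thA"},
}


def hindi_past(verb, sentence):
    """Choose the simple or habitual past form from cue words in the sentence."""
    words = set(sentence.lower().split())
    reading = "habitual" if words & HABITUAL_CUES else "simple"
    return VERB_FORMS[verb][reading]


form_1a = hindi_past("stayed", "He stayed in the guest house during his visit to our University in Jan")
form_1b = hindi_past("stayed", "He stayed in the guest house whenever he visited us")
form_2a = hindi_past("went", "He went to the school just now")
form_2b = hindi_past("went", "He went to the school everyday")
```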

33 Sentence level ambiguity I met the girl in the store. Possible readings: a) I met the girl who works in the store b) I met the girl while I was in the store Time flies like an arrow. Possible parses: a) Time flies like an arrow (N V Prep Det N) b) Time flies like an arrow (N N V Det N) c) Time flies like an arrow (V N Prep Det N) (flies are like an arrow) d) Time flies like an arrow (V N Prep Det N) (manner of timing)
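The tag sequences listed for "Time flies like an arrow" can be enumerated mechanically from an ambiguous lexicon. The sketch below assumes a toy lexicon and a deliberately crude filter (keep only sequences with exactly one verb). It recovers the three distinct tag sequences on the slide; readings (c) and (d) share one tag sequence and differ only in attachment, which this sketch does not model.

```python
from itertools import product

# Toy lexicon: each word with the tags it can take (an assumption for this example).
LEXICON = {
    "time": ["N", "V"],
    "flies": ["N", "V"],
    "like": ["Prep", "V"],
    "an": ["Det"],
    "arrow": ["N"],
}


def tag_sequences(words, lexicon=LEXICON):
    """Enumerate all tag sequences, keeping those with exactly one verb."""
    candidates = product(*(lexicon[w] for w in words))
    return [" ".join(seq) for seq in candidates if seq.count("V") == 1]


sequences = tag_sequences(["time", "flies", "like", "an", "arrow"])
for seq in sequences:
    print(seq)
```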

34 Differences in SL and TL Lexical level (a) One word may translate into different words in different contexts (WSD) English 'plot' → zamiin, kathanak (b) A SL word may not have a corresponding word in the TL (Gaps) English 'reads' in 'This book reads very well' (c) Pronouns across Indian languages Hindi 'vaha' → Telugu 'adi', 'atanu', 'aame'

35 Differences in SL and TL Structural differences (a) word order (English – Hindi) (b) nominal modification (Hindi – Tamil, Telugu etc) (i) relative clause vs relative participles Telugu 'nenu tinnina camcaa' Hindi : *meraa khaayaa cammaca Maine jis cammaca se khaayaa hai vah cammac (ii) missing copula (Hindi – Telugu, Bengali, Tamil etc) Telugu : raamudu mancivaadu Hindi : Ram acchaa ladakaa hai

36 Human beings use World Knowledge Context Cultural knowledge and Language conventions To resolve ambiguities and interpret meaning

37 What to do for the machine ? Challenging problem!!! Providing all the knowledge may: - take too much time and effort - be difficult/become complex - not be possible (world knowledge acquired from experience) Therefore, Break the problem into smaller problems Choose the solution as per the nature of the problem Build language resources to the extent possible and continue to add to them Engineer knowledge efficiently

38 Approaches to MT (1/2)‏ Rule-based or Transfer based  Uses linguistic rules to map SL and TL, such as Maps grammatical structures ‏ Disambiguation rules Knowledge-based Extensive knowledge of the domain Concepts in the language Ability to reason

39 Approaches to MT (2/2)‏ Example-based Mapping is based on stored example translations Translation memory based Uses phrases/words from earlier translation as examples Statistical Does not formulate explicit linguistic knowledge Develops rules based on probabilities Hybrid Mixes two or more techniques

40 A Glance at MT Efforts in India (1/4) Domain Specific Mantra system (C-DAC, Pune) Translation of govt. appointment letters Uses Tree Adjoining Grammar Public health campaign documents Angla Bharati approach (C-DAC Noida & IIT Kanpur)

41 A Glance at MT Efforts in India (2/4)‏  Application Specific Matra (Human aided MT) (NCST,now C-DAC, Mumbai)  General Purpose (not yet in use)‏ Angla Bharati approach (IIT Kanpur ) UNL based MT (IIT Bombay) Shiva: EBMT (IIIT Hyderabad/IISc Bangalore) Shakti: English-Hindi MT system (IIIT Hyderabad)

42 MT Efforts in India (3/4) Major Government funded MT projects in consortium mode Indian Language to Indian Language Machine Translation (ILMT) (Lead Institute - IIIT, Hyderabad) English to Indian Language Machine Translation Mantra, Shakti etc (Lead inst - C-DAC, Pune) Anglabharati (Lead inst – IIT, Kanpur) Sanskrit to Hindi MT System (Lead Inst – University of Hyderabad)

43 MT Efforts in India (4/4) Anusaaraka : Language Accessor cum MT System (IIIT, Hyderabad, Chinmaya Shodh Sansthan)

44 Our Focus Sampark : Indian Language to Indian Language MT systems

45 Sampark : Indian Language to Indian Language MT Systems Consortium mode project Funded by DeitY 11 Participating Institutes Nine language pairs 18 Systems

46 Participating institutions  IIIT, Hyderabad (Lead institute)  University of Hyderabad  IIT, Bombay  IIT, Kharagpur  AUKBC, Chennai  Jadavpur University, Kolkata  Tamil University, Thanjavur  IIIT, Trivandrum  IIIT, Allahabad  IISc, Bangalore  CDAC, Noida

47 Objectives  Develop general purpose MT systems from one IL to another for 9 language pairs Bidirectional  Deliver domain specific versions of the MT systems. Domains are: Tourism and pilgrimage One additional domain (health/agriculture, box office reviews, electronic gadgets instruction manuals, recipes, cricket reports)  By-products basic tools and lexical resources for Indian languages: POS taggers, chunkers, morph analysers, shallow parsers, NERs, parsers etc. Bidirectional bilingual dictionaries, annotated corpora, etc.

48 Language Pairs (Bidirectional)  Tamil-Hindi  Telugu-Hindi  Marathi-Hindi  Bengali-Hindi  Tamil-Telugu  Urdu-Hindi  Kannada-Hindi  Punjabi-Hindi  Malayalam-Tamil

49 User Scenario Web based system for tourism/pilgrimage domain. A common traveler/tourist/pilgrim to access info in his language. Access to selected Government portals in agriculture/health Automatic MT in domain General purpose web based translation Potential to attach to major search engines such as Google, Yahoo, Microsoft, Web-duniya

50 Design and Approach Largely transfer based – Analysis, Transfer, Generate Modular (modules could be statistical or rule based) Pipeline architecture Hybrid – some modules statistical, some rule based Analysis : Shallow parser No deep parsing in the first phase

51 Approach Largely transfer based – Analysis, Transfer, Generate Modular – Modules could be statistical or rule based depending on the nature of problem (Hybrid) Pipeline architecture Analysis : Shallow parsing followed by a simple parser

52 Design o Design decisions based on - the commonality in Indian languages - easy to extend to other languages o Phase the development - Phase 1 o Analysis at sentence level o Shallow parser o Simple parser o Transfer : map lexicon, structures, script o Generate the target

53 Design Contd Phase 2 Extend the analysis to discourse level Anaphora resolution Relations between clauses (discourse connectives) Word Sense Disambiguation (WSD) Named Entity Recognition (NER) Multi Word Expressions (MWE) Explore SMT for transfer rules

54 Transfer based MT [Figure: Source Sentence -> (Analysis) -> Source Analysis -> (Transfer) -> Analysis in Target Language -> (Generation) -> Target Sentence]
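The analysis, transfer and generation stages can be pictured as a pipeline of modules that pass a shared state along. The sketch below is a toy built around the deck's own example from slide 18 (Mohan gave Hari a book -> Mohan ne Hari ko kitAba dI). The module names, the state dictionary and the two-entry lexicon are assumptions for illustration, and the analysis step is hard-wired to one clause pattern. It is not the Sampark implementation.

```python
def analyse(state):
    """Toy analysis: assumes an English 'agent verb recipient det theme' clause."""
    agent, verb, recipient, _det, theme = state["source"].split()
    state["roles"] = {
        "agent": agent, "verb": verb, "recipient": recipient, "theme": theme,
    }
    return state


# Toy lexicon with the two content words of the slide's example.
LEXICON = {"gave": "dI", "book": "kitAba"}


def transfer(state):
    """Lexical transfer: look each word up; names pass through unchanged."""
    state["target_roles"] = {
        role: LEXICON.get(word, word) for role, word in state["roles"].items()
    }
    return state


def generate(state):
    """Hindi generation: verb-final order with the case markers 'ne' and 'ko'."""
    t = state["target_roles"]
    state["target"] = f'{t["agent"]} ne {t["recipient"]} ko {t["theme"]} {t["verb"]}'
    return state


PIPELINE = [analyse, transfer, generate]


def translate(sentence):
    """Run the modules in order, each reading and extending the shared state."""
    state = {"source": sentence}
    for module in PIPELINE:
        state = module(state)
    return state["target"]


output = translate("Mohan gave Hari a book")
print(output)
```

Because each module only reads and writes the shared state, any one of them could be swapped for a statistical or a rule-based version without touching the others, which is the point of the modular, hybrid design described on the preceding slides.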

55 [Figure: Form (input sentence/text) -> Analysis -> Meaning -> Generation -> Form] Various types of linguistic information help in arriving from form to meaning. It is complex. Modularization helps in simplifying it.

56 Modularize Word : Structure (Morph Analyser) In context : Syntactic, what it functions as (POS tagger) Semantic, what it means (WSD) Relations between words : Local (local word grouping/chunking) Non-local (subject, object/karaka)

57 Form (Input sentence/text) Meaning Analysis Form Generation Semantic analysis POS Chunking parsing Morph Analysis Formal semantics All this information is implicit in language. How to make it explicit? Build resources – Dictionaries, Verb frames, Treebanks

58 Sampark Architecture

59 Details Standards Annotation standards – POS and Chunk Input – output of each module Representation - SSF Data format – Dictionaries Emphasis on proper software engineering Development environment – Dashboard Blackboard architecture CVS for version control etc.

60 Machine Learning: Separating engines from language data Module for Task (T) Sentence in Language (L) Training data (lang. L) Engine for task T Out Manual Correction
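The separation of engine and language data can be illustrated with a deliberately simple tagger whose code contains no language-specific knowledge. Everything in the sketch is an assumption for illustration: the class name, the most-frequent-tag strategy, and the tiny training samples. The same engine class is trained once on Hindi data and once on English data.

```python
from collections import Counter, defaultdict


class MostFrequentTagEngine:
    """Language-independent engine: all language knowledge comes from training data."""

    def __init__(self, default_tag="UNK"):
        self.default_tag = default_tag
        self.best_tag = {}

    def train(self, tagged_sentences):
        """Learn, for every word, the tag it carries most often in the data."""
        counts = defaultdict(Counter)
        for sentence in tagged_sentences:
            for word, tag in sentence:
                counts[word][tag] += 1
        self.best_tag = {w: c.most_common(1)[0][0] for w, c in counts.items()}
        return self

    def tag(self, words):
        """Tag each word with its most frequent tag, or the default if unseen."""
        return [(w, self.best_tag.get(w, self.default_tag)) for w in words]


# Hypothetical training samples; real annotated corpora would be far larger.
hindi_data = [[("cuuhe", "NN"), ("maarate", "VM"), ("haiM", "VAUX"), ("kutte", "NN")]]
english_data = [[("rats", "NN"), ("kill", "VB"), ("dogs", "NN")]]

hindi_tagger = MostFrequentTagEngine().train(hindi_data)
english_tagger = MostFrequentTagEngine().train(english_data)

hindi_out = hindi_tagger.tag(["kutte", "maarate", "billii"])
english_out = english_tagger.tag(["dogs", "kill"])
```

In the diagram's loop, manually corrected output would be added back to the training data and `train` called again; the engine code itself stays unchanged across languages.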

61 Horizontal Tasks  H1 POS Tagging & Chunking engine  H2 Morph analyser engine  H3 Generator engine  H4 Lexical disambiguation engine  H5 Named entity engine  H6 Dictionary standards  H7 Corpora annotation standards  H8 Evaluation of output (comprehensibility)  H9 Testing & integration

62 Vertical Tasks for Each Language  V1 POS tagger & chunker  V2 Morph analyzer  V3 Generator  V4 Named entity recognizer  V5 Bilingual dictionary – bidirectional  V6 Transfer grammar  V7 Annotated corpus  V8 Evaluation  V9 Co-ordination

64 An Example : Hindi to Panjabi System ਭਾਰਤ ਵਿੱਚ ਆਰੀਆਂ ਦਾ ਆਗਮਨ ਈਸਾ ਦਾ ਕੋਈ 1500 ਸਾਲ ਪੂਰਵ ਹੋਇਆ. ਆਰੀਆਂ ਦਾ ਪਹਲੀ ਖੇਪ ਰਿਗਵੈਦਿਕ ਆਰੀਆ ਕਹਾ ਹੈਂ. ਰਿਗਵੇਦ ਦਾ ਰਚਨਾ ਇਹ ਸਮਾਂ ਹੋਈ. ਰਿਗਵੇਦ ਦਾ ਕਈ ਬਾਤੇ ਅਵੇਸਤਾ ਨਾਲ ਮਿਲਦੀ ਹਨ. ਅਵੇਸਤਾ ਈਰਾਨੀ ਭਾਸ਼ਾ ਦਾ ਪ੍ਰਾਚੀਨਤਮ ਗ੍ਰੰਥ ਹੈਂ. भारत में आर्यों का आगमन ईसा के कोई 1500 वर्ष पूर्व हुआ । आर्यों की पहली खेप ऋग्वैदिक आर्य कहलाती है । ऋग्वेद की रचना इसी समय हुई । ऋग्वेद की कई बाते अवेस्ता से मिलती हैं । अवेस्ता ईरानी भाषा के प्राचीनतम ग्रंथ है ।

65 Panjabi to Hindi सरदार उपासक सिंह भारत का एक प्रमुख स्वतंत्रता संगरामिया था. अमर बिंब बन जाने की कला में उन की कोई सानी नहीं. उन ने केंद्रीय असंबली की बैठक में बम फेंक कर भी भागने से अस्वीकार कर दिया था. उपासक सिंह को 23 मार्च 1931 को उन के साथियों, राजगुरू और सुखदेव का से फ़ांसी और लटका दिया गया था. संपूर्ण देश ने उन की शहादत को याद किया. ਸਰਦਾਰ ਭਗਤ ਸਿੰਘ ਭਾਰਤ ਦੇ ਇੱਕ ਪ੍ਰਮੁੱਖ ਅਜ਼ਾਦੀ ਸੰਗਰਾਮੀਏ ਸਨ। ਅਮਰ ਬਿੰਬ ਬਣ ਜਾਣ ਦੀ ਕਲਾ ਵਿੱਚ ਉਨ੍ਹਾਂ ਦਾ ਕੋਈ ਸਾਨੀ ਨਹੀਂ। ਉਨ੍ਹਾਂ ਨੇ ਕੇਂਦਰੀ ਅਸੰਬਲੀ ਦੀ ਬੈਠਕ ਵਿੱਚ ਬੰਬ ਸੁੱਟ ਕੇ ਵੀ ਭੱਜਣ ਤੋਂ ਇਨਕਾਰ ਕਰ ਦਿੱਤਾ ਸੀ। ਭਗਤ ਸਿੰਘ ਨੂੰ 23 ਮਾਰਚ 1931 ਨੂੰ ਉਨ੍ਹਾਂ ਦੇ ਸਾਥੀਆਂ, ਰਾਜਗੁਰੂ ਅਤੇ ਸੁਖਦੇਵ ਦੇ ਨਾਲ ਫ਼ਾਂਸੀ ਤੇ ਲਟਕਾ ਦਿੱਤਾ ਗਿਆ ਸੀ। ਸਾਰੇ ਦੇਸ਼ ਨੇ ਉਨ੍ਹਾਂ ਦੀ ਸ਼ਹਾਦਤ ਨੂੰ ਯਾਦ ਕੀਤਾ।

66 Panjabi to Hindi सरदार उपासक सिंह (NER) भारत का एक प्रमुख स्वतंत्रता संगरामिया था. अमर बिंब (WSD) बन जाने की कला में उन की कोई सानी (Agreement) नहीं. उन ने (word generation) केंद्रीय असंबली की बैठक में बम फेंक कर भी भागने से अस्वीकार कर दिया था. उपासक सिंह को 23 मार्च 1931 को उन के साथियों, राजगुरू और सुखदेव का से (function word substitution) फ़ांसी और लटका दिया गया था. संपूर्ण देश ने उन की शहादत को याद किया.

67 Evaluation Testing, system integration, and evaluation team – Involvement of industry Regular In-house subjective evaluation Third party evaluation on system submission

68 Achievements of ILMT Project Phase I 18 MT systems built among Indian languages Shallow parser for all 9 Indian languages Lexical resources for all 9 languages  Largely built from scratch  Developed standards for all stages  Developed open architecture

69 Achievements - Deployment Deployed and running over web – 8 systems (sampark.org.in) Others deployed over ILMT test site 4 more ready to go to Sampark soon Rest are being evaluated and tested internally (require a few more months to go to Sampark site after reaching quality levels) Constant quality improvement going on for various existing modules New modules are under testing and would be soon integrated

70 Future Tasks Enhance the quality of MT output Enhancing dictionaries Increasing coverage of grammar Adding new technology to ILMT systems Full sentence parsing Discourse processing - anaphora Target some users

71 Some Possibilities Possible tie up with search engine companies Possible tie up with content companies such as - Dainik Jagran, Web duniya, Rediff, Yahoo Identify translation bureaus and agencies Build MT workbench for their use, their domains, etc. Poised for major public impact with a unique technology.

72 Future Systems Add language pairs Gujarati – Hindi Kashmiri – Hindi Manipuri – Hindi Oriya – Hindi Etc

74 CONCLUSION Developing MT systems, though a challenging task, is a useful effort particularly in the multilingual context of India

