Multi-Lingual Wordnets: Coimbatore Workshop (11-14 June, 2009) at Amrita University Pushpak Bhattacharyya Computer Science and Engineering Department Indian.


1 Multi-Lingual Wordnets: Coimbatore Workshop (11-14 June, 2009) at Amrita University Pushpak Bhattacharyya Computer Science and Engineering Department Indian Institute of Technology Bombay amritesharyai namah: Obeisance to amma

2 Objective of the wordnet workshop PAN-Indian Wordnets Involving languages from the North East, the Western part, the Northern part and the Southern part of India –Sanskrit –Assamese, Bodo, Nepali, Manipuri –Hindi, Kashmiri –Marathi, Konkani –Tamil, Telugu, Kannada, Malayalam –English Meeting of minds of those who LOVE WORDS AND THEIR RELATIONSHIPS

3 Ambiguity The Crux of the problem

4 Stages of language processing Phonetics and phonology Morphology Lexical Analysis Syntactic Analysis Semantic Analysis Pragmatics Discourse

5 Phonetics Processing of speech Challenges –Homophones: bank (finance) vs. bank (river bank) –Near Homophones: maatraa vs. maatra (hin) –Word Boundary: aajaayenge (aa jaayenge ‘will come’ or aaj aayenge ‘will come today’); I got [ua]plate –Phrase boundary: mtech1 students are especially exhorted to attend as such seminars are integral to one's post-graduate education –Disfluency: ah, um, ahem etc.

6 Morphology Word formation rules from root words Nouns: Plural (boy-boys); Gender marking (czar-czarina) Verbs: Tense (stretch-stretched); Aspect (e.g. perfective sit-had sat); Modality (e.g. request khaanaa → khaaiie) Crucial first step in NLP Languages rich in morphology: e.g., Dravidian, Hungarian, Turkish Languages poor in morphology: Chinese, English Languages with rich morphology have the advantage of easier processing at higher stages of processing A task of interest to computer science: Finite State Machines for Word Morphology
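The finite-state view of word morphology mentioned on this slide can be illustrated with a very small machine. The sketch below is not from the talk: it assumes only the regular English plural rule (add "es" after s, x, z, ch, sh, otherwise "s"), and the state names and the `pluralize` function are hypothetical. Irregular nouns and exceptions such as "stomach" are deliberately ignored.

```python
# Hypothetical finite-state sketch for regular English noun plurals.
# The machine reads the stem left to right and only remembers what kind
# of ending it has seen so far; the final state selects the suffix.
OTHER, C, S, SIB = "other", "c", "s", "sibilant"


def step(state, ch):
    """Transition function: next state after reading one character."""
    if ch == "c":
        return C                      # may become "ch"
    if ch == "s":
        return S                      # sibilant, may also become "sh"
    if ch in "xz":
        return SIB
    if ch == "h" and state in (C, S):
        return SIB                    # "ch" or "sh"
    return OTHER


def pluralize(stem):
    """Copy the stem and emit the plural suffix chosen by the final state."""
    state = OTHER
    for ch in stem.lower():
        state = step(state, ch)
    suffix = "es" if state in (S, SIB) else "s"
    return stem + suffix


for noun in ["boy", "box", "church"]:
    print(noun, "->", pluralize(noun))
```

Morphologically rich languages need far larger machines, but the principle of a fixed set of states and character-driven transitions stays the same.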

7 Lexical Analysis Essentially refers to dictionary access and obtaining the properties of the word, e.g. dog: noun (lexical property), take-’s’-in-plural (morph property), animate (semantic property), 4-legged (-do-), carnivore (-do-) Challenge: Lexical or word sense disambiguation

8 Lexical Disambiguation First step: part of Speech Disambiguation Dog as a noun (animal) Dog as a verb (to pursue) Sense Disambiguation Dog (as animal) Dog (as a very detestable person) Needs word relationships in a context The chair emphasised the need for adult education Very common in day to day communications Satellite Channel Ad: Watch what you want, when you want (two senses of watch) e.g., Ground breaking ceremony/research

9 Technological developments bring in new terms, additional meanings/nuances for existing terms –Justify as in justify the right margin (word processing context) –Xeroxed: a new verb –Digital Trace: a new expression –Communifaking: pretending to talk on mobile when you are actually not –Discomgooglation: anxiety/discomfort at not being able to access internet –Helicopter Parenting: over parenting

10 Syntax Processing Stage Structure Detection [S [NP I] [VP [V like] [NP mangoes]]]

11 Parsing Strategy Driven by grammar
S -> NP VP
NP -> N | PRON
VP -> V NP | V PP
N -> Mangoes
PRON -> I
V -> like
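A grammar-driven parse of "I like mangoes" can be sketched with a small top-down parser. This is an illustration rather than the parser used in the talk: the function names are hypothetical, and the `VP -> V PP` alternative is left out because the slide gives no rule for PP.

```python
# Toy top-down parser for the slide's grammar (PP alternative omitted,
# since the slide defines no PP rule).
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["N"], ["PRON"]],
    "VP": [["V", "NP"]],
}
LEXICON = {"N": {"mangoes"}, "PRON": {"i"}, "V": {"like"}}


def parse(symbol, tokens, pos):
    """Yield (tree, next_pos) for each way `symbol` can start at `pos`."""
    if symbol in LEXICON:
        if pos < len(tokens) and tokens[pos].lower() in LEXICON[symbol]:
            yield (symbol, tokens[pos]), pos + 1
        return
    for rhs in GRAMMAR.get(symbol, []):
        for children, end in parse_sequence(rhs, tokens, pos):
            yield (symbol, *children), end


def parse_sequence(symbols, tokens, pos):
    """Yield (list_of_trees, next_pos) for a sequence of symbols."""
    if not symbols:
        yield [], pos
        return
    for tree, mid in parse(symbols[0], tokens, pos):
        for rest, end in parse_sequence(symbols[1:], tokens, mid):
            yield [tree] + rest, end


def parse_sentence(sentence):
    """Return every complete parse tree of the sentence."""
    tokens = sentence.split()
    return [tree for tree, end in parse("S", tokens, 0) if end == len(tokens)]


print(parse_sentence("I like mangoes"))
```

Because the parser returns all complete trees, a structurally ambiguous sentence under a richer grammar would simply produce more than one result.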

12 Challenges in Syntactic Processing: Structural Ambiguity Scope 1. The old men and women were taken to safe locations (old men and women) vs. ((old men) and women) 2. No smoking areas will allow hookahs inside Preposition Phrase Attachment I saw the boy with a telescope (who has the telescope?) I saw the mountain with a telescope (world knowledge: mountain cannot be an instrument of seeing) I saw the boy with the pony-tail (world knowledge: pony-tail cannot be an instrument of seeing) Ubiquitous: newspaper headline “20 years later, BMC pays father 20 lakhs for causing son’s death”

13 Structural Ambiguity… Overheard –I did not know my PDA had a phone for 3 months An actual sentence in the newspaper –The camera man shot the man with the gun when he was near Tendulkar (P.G. Wodehouse, Ring in Jeeves) Jill had rubbed ointment on Mike the Irish Terrier, taken a look at the goldfish belonging to the cook, which had caused anxiety in the kitchen by refusing its ant’s eggs… (Times of India, 26/2/08) Aid for kins of cops killed in terrorist attacks

14 Headache for Parsing: Garden Path sentences Garden Pathing –The horse raced past the garden fell. –The old man the boat. –Twin Bomb Strike in Baghdad kill 25 (Times of India 05/09/07)

15 Semantic Analysis Representation in terms of Predicate calculus/Semantic Nets/Frames/Conceptual Dependencies and Scripts John gave a book to Mary Give action: Agent: John, Object: Book, Recipient: Mary Challenge: ambiguity in semantic role labeling –(Eng) Visiting aunts can be a nuisance –(Hin) aapko mujhe mithaai khilaanii padegii (ambiguous in Marathi and Bengali too; not in Dravidian languages)

16 Pragmatics Very hard problem Model user intention –Tourist (in a hurry, checking out of the hotel, motioning to the service boy): Boy, go upstairs and see if my sandals are under the divan. Do not be late. I just have 15 minutes to catch the train. –Boy (running upstairs and coming back panting): yes sir, they are there. World knowledge –WHY INDIA NEEDS A SECOND OCTOBER (ToI, 2/10/07)

17 Discourse Processing of sequence of sentences Mother to John: John go to school. It is open today. Should you bunk? Father will be very angry. Ambiguity of open bunk what? Why will the father be angry? Complex chain of reasoning and application of world knowledge Ambiguity of father father as parent or father as headmaster

18 Complexity of Connected Text John was returning from school dejected – today was the math test He couldn’t control the class Teacher shouldn’t have made him responsible After all he is just a janitor

19 Lexical Knowledge Structures Indian Scenario

20 Hindi Wordnet Dravidian Language Wordnets North East Language Wordnet Marathi Wordnet Sanskrit Wordnet English Wordnet Bengali Wordnet Punjabi Wordnet Konkani Wordnet Linked Wordnets

21 Great Linguistic Diversity Major streams –Indo European –Dravidian –Sino Tibetan –Austro-Asiatic Some languages are ranked within the top 20 in the world in terms of the populations speaking them –Hindi and Urdu: 5th (~500 million) –Bangla: 7th (~300 million) –Marathi: 14th (~70 million)

22 Major Language Processing Initiatives Mostly from the Government: Ministry of IT, Ministry of Human Resource Development, Department of Science and Technology Recently great drive from the industry: NLP efforts with Indian languages in focus –Google –Microsoft –IBM Research Lab –Yahoo –TCS

23 Technology Development in Indian Languages (TDIL) Started by the Ministry of IT in 2000 13 resource centers across the country Each center responsible for two languages: one major and one minor For example, –IIT Bombay: Marathi and Konkani –IIT Kanpur: Hindi and Nepali –ISI Kolkata: Bangla and Santhaali –Anna University: Tamil

24 Achievements in TDIL: Lexical Resources Wordnets: Hindi and Marathi (IIT Bombay) Ontologies: Tamil concept hierarchy (Tanjavur University, AU-KBC) Semantically rich lexicons: IIT Kanpur, IIITH, IIT Bombay Corpora: Central Institute of Indian Languages (CIIL) Web Content: All 13 centers; Gujarati content is exhaustive and of good quality

25 Recent Initiatives NLP Association of India: 2 years old: recently efforts are on making tools and resources freely available on the website of NLPAI LDC-IL (like the Linguistic Data Consortium at UPenn) –Approved by the planning commission National Knowledge Commission: special drive on translation (human and machine)

26 Recent Initiatives contd. Consortia set up already for IL-IL MT, E-IL MT and CLIA SAALP: South Asian Association for Language Processing (formed with SAARC countries)

27 Industry Scenario: English How to use NLP to increase the search engine performance (precision, recall, speed) Google, Rediff, Yahoo, IRL, Microsoft: all have search engine, IR, IE R & D projects outsourced from USA and being carried out in India.

28 Industry Scenario: Indian Language English-Hindi MT is regarded as critical IBM Research lab has massive English-Hindi Parallel Corpora (news domain) –Statistical Machine Translation Microsoft India at Bangalore has opened a Multilingual Computing Division Google and Yahoo India are actively pursuing IL search engines

29 Related work EuroWordNet (Vossen, 1999) and BalkaNet (Christodoulakis, 2002) –where synsets of multiple languages are linked among themselves and to the Princeton Wordnet (Miller et al., 1990; Fellbaum, 1998) –through Inter-lingual Indices (ILI)

30 Our experience: Multilingual Wordnets for Indian Languages

31 Wordnet work at IIT Bombay http://www.cfilt.iitb.ac.in Follow the design principle(s) of the Princeton Wordnet for English paying particular attention to language specific phenomena (such as complex predicates) Hindi Wordnet –Total Number of Synsets: >30,000 –Total Number of Unique Words: >65,000 Marathi Wordnet –Total Number of Synsets: >18,000 –Total Number of Unique Words: >30,000

32 HWN and MWN created using different principles (Tatsam, i.e., Sanskrit words borrowed as such: very often)
HWN entry: {peR, vriksh, paadap, drum, taru, viTap, ruuksh, ruukh, adhrip, taruvar} ‘tree’
Gloss: jaR, tanaa, shaakhaa, tathaa pattiyo se yukt bahuvarshiya vanaspati ‘perennial woody plant having root, stem, branches and leaves’
Example: peR manushya ke lie bahut hi upayogii hai ‘trees are useful to men’
MWN entry: {jhaaR, vriksh, taruvar, drum, taruu, paadap} ‘tree’
Gloss: mule, khoR, phaanghaa, pane ityaadiinii yukt asaa vanaspativishesh ‘perennial woody plant having root, stem, branches and leaves’
Example: tii damuun jhaadacyaa saavlit baslii ‘Being tired/exhausted she sat under the shadow of the tree’

33 Hindi WN: recently made free

34 A glimpse of the wordnet
[Figure: Marathi Wordnet entry for the synset {झाड, वृक्ष, तरू} ‘tree’ with its semantic relations]
GLOSS: मुळे, खोड, फांद्या, पाने इत्यादींनी युक्त असा वनस्पतिविशेष: "झाडे पर्यावरण शुद्ध करण्याचे काम करतात"
HYPERNYMY: वनस्पती
HYPONYMY: आंबा, लिंबू
MERONYMY: मूळ, खोड
HOLONYMY: रान, बाग
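The synset shown in this figure can be written down as a small record with typed relation links. This is only a sketch: the `Synset` class and `related` helper are hypothetical names, and the assignment of the surrounding words to hypernymy, hyponymy, meronymy and holonymy is a reading of the flattened figure, not something the transcript states explicitly.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """One wordnet node: synonymous words, a gloss, an example, and links."""
    words: list
    gloss: str
    example: str
    relations: dict = field(default_factory=dict)


# Marathi 'tree' synset from the slide; relation targets are an assumed
# reading of the diagram (plant above, mango/lemon below, parts, places).
tree = Synset(
    words=["झाड", "वृक्ष", "तरू"],
    gloss="मुळे, खोड, फांद्या, पाने इत्यादींनी युक्त असा वनस्पतिविशेष",
    example="झाडे पर्यावरण शुद्ध करण्याचे काम करतात",
    relations={
        "hypernymy": ["वनस्पती"],
        "hyponymy": ["आंबा", "लिंबू"],
        "meronymy": ["मूळ", "खोड"],
        "holonymy": ["रान", "बाग"],
    },
)


def related(synset, relation):
    """Return the words linked by the given relation, or an empty list."""
    return synset.relations.get(relation, [])


print(related(tree, "hyponymy"))
```

In a full wordnet the relation targets would be other synsets rather than bare words; plain strings keep the sketch short.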

35 Marathi WN created from Hindi: expansion approach: issues For a concept, words exist in both Hindi and Marathi: most common For a concept, words exist in Hindi but not in Marathi –{ दादा [daadaa, grandfather], बाबा [baabaa, grandfather], आजा [aajaa, grandfather], दद्दा [daddaa, grandfather], पितामह [pitaamaha, grandfather], प्रपिता [prapitaa, grandfather]} are words in Hindi for paternal grandfather. There are no equivalents in Marathi. For a concept, words exist in Marathi and not in Hindi –{ गुढीपाडवा [gudhipaadvaa, new year], वर्षप्रतिपदा [varshpratipadaa, new year]} are words in Marathi which do not have any equivalents in Hindi.

36 Analogy with English {mama}: uncle from mother’s side {chacha}: uncle from father’s side No natural words in English Introduce multiwords –{mama, maternal uncle} and {chacha, paternal uncle} Makes the lexical resource look unnatural to a native speaker Pitfall of expansion approach? WN users tend to look upon and use the lexical resource as an ordinary dictionary.

37 Other concerns Identical word –Faux Amis: “false friends” or “false cognates” samaadhaan- solution (Hindi), satisfaction (Marathi) shikshaa- education (Hindi), punishment (Marathi) –Narrowing of meaning –Widening of meaning Identical Meaning –Richness of vocabulary in Hindi and not in Marathi and vice versa (like the words for snow in Eskimo language)

38 Narrowing and Widening of meaning [Figure: the same word in Hindi and in Marathi, with the two senses related by hypernymy/hyponymy]

39 Dictionary standardization

40 Large Scale Nation Wide Projects in Consortia Mode English to Indian Language Machine Translation Indian Language to Indian Language Machine Translation Cross Lingual Information Access –Each of about 800 Crores of Rupees, equivalent to about 200 million dollars –Each with participation by 10 different institutes across the length and breadth of the country

41 Adopted Standard
Senses | Hindi | Marathi | Bangali | Oriya | Tamil
Schematic row: (W1, W2, W3, W4, W5, W6) (W1, W2, W3) (W1, W2, W3, W4) (W1, W2, W3)
(sun) | ( सूर्य, सूरज, भानु, भास्कर, प्रभाकर, दिनकर, अंशुमान, अंशुमाली ) | ( सूर्य, भानु, दिवाकर, भास्कर, रवि, दिनेश, दिनमणी ) | ...
(cub, lad, laddie, sonny, sonny boy) | ( लड़का, बालक, बच्चा, छोकड़ा, छोरा, छोकरा, लौंडा ) | ( मुलगा, पोरगा, पोर, पोरगे ) | ...
(son, boy) | ( पुत्र, बेटा, लड़का, लाल, सुत, बच्चा, नंदन, पूत, चिरंजीव, चिरंजी ) | ( मुलगा, पुत्र, लेक, चिरंजीव, तनय ) | ...

42 Advantages of the concept based multilingual dictionary (1/2) Economy of labor and storage –Semantic features like [±Animate, ±Human, ±Masculine, etc.] assigned to a nominal concept and not to any individual lexical item of any language –Semantic features, such as [+Stative (e.g., know), +Activity (e.g., stroll), +Accomplishment (e.g., say), +Semelfactive (e.g., knock), +Achievement (e.g., win)] are assigned to a verbal concept.

43 Advantages of the concept based multilingual dictionary (2/2) Bilingual pairwise dictionaries can be generated automatically. The model admits of the possibility of extracting a domain specific dictionary for all or any specific language pair. The language group which lacks competence in the pivot language- which in our case is Hindi- can benefit from the already worked out languages. –E.g. Tamil and Malayalam
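The claim that bilingual pairwise dictionaries fall out automatically from a concept-based dictionary can be made concrete with a short sketch. The table layout, the `CONCEPTS` name and the `bilingual_dictionary` function are assumptions for illustration; the words are a small subset of the Hindi and Marathi synsets listed in the talk.

```python
# Concept-indexed multilingual dictionary: one row per concept, one
# synset per language (only a few words per synset are shown here).
CONCEPTS = {
    "sun": {
        "hindi": ["सूर्य", "सूरज"],
        "marathi": ["सूर्य", "भानु"],
    },
    "son, boy": {
        "hindi": ["पुत्र", "बेटा"],
        "marathi": ["मुलगा", "पुत्र"],
    },
}


def bilingual_dictionary(concepts, source, target):
    """Map every source-language word to the target words sharing a concept."""
    dictionary = {}
    for entry in concepts.values():
        for word in entry.get(source, []):
            targets = dictionary.setdefault(word, [])
            for candidate in entry.get(target, []):
                if candidate not in targets:
                    targets.append(candidate)
    return dictionary


hindi_to_marathi = bilingual_dictionary(CONCEPTS, "hindi", "marathi")
print(hindi_to_marathi["बेटा"])
```

Swapping the `source` and `target` arguments yields the reverse dictionary from the same table, and adding a language column makes every new language pair available without further work.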

44 Word alignment in the dictionary model Even if we choose the right sense of a word in the source language (SW1), there is still the hurdle of choosing the appropriate target language word. Lexical choice is a function of complex parameters like situational aptness and native speaker acceptability.

45 Example Concept: ‘the state of having no doubt of something’ –Hindi: {nishshank, anaashankita, aashankahiin,befikr, bekhtak, sangshayhiin} –Marathi: {nihshanka, nirdhaasta, nirbhrot, shankaarahita} Third member in the Hindi synset aashankahiin is appropriately mapped to the fourth member in the Marathi synset shankaarahita and not to the first one.

46 Links set up between words
English synset: male-child/HW1, boy/HW2
Hindi synset: लड़का/HW1, बालक/HW2, बच्चा/HW3, छोकड़ा/HW4, छोरा/HW5, छोकरा/HW6, लौंडा/HW7
Marathi synset: मुलगा/HW1, पोरगा/HW6, पोर/HW2, पोरगे/HW6
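On this slide each English and Marathi word carries the ID of the Hindi pivot word it is aligned to. A minimal sketch of how such links could be stored and followed is given below; the dictionary names and the `aligned_words` function are hypothetical, and the data is copied from the slide.

```python
# Hindi pivot synset: ID -> word.
HINDI = {
    "HW1": "लड़का", "HW2": "बालक", "HW3": "बच्चा", "HW4": "छोकड़ा",
    "HW5": "छोरा", "HW6": "छोकरा", "HW7": "लौंडा",
}
# Other languages: word -> ID of the Hindi word it is aligned to.
MARATHI_LINKS = {"मुलगा": "HW1", "पोरगा": "HW6", "पोर": "HW2", "पोरगे": "HW6"}
ENGLISH_LINKS = {"male-child": "HW1", "boy": "HW2"}


def aligned_words(word, source_links, target_links):
    """Return target-language words linked to the same Hindi pivot word."""
    pivot = source_links.get(word)
    if pivot is None:
        return []
    return [w for w, link in target_links.items() if link == pivot]


print(aligned_words("boy", ENGLISH_LINKS, MARATHI_LINKS))
print(HINDI[MARATHI_LINKS["पोरगा"]])
```

Routing every alignment through the pivot keeps the number of link tables linear in the number of languages rather than quadratic.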

47 Linguistic challenges (1/2) Using a synthetic expression –‘ornaments and other gifts given to the bride by the bridegroom on the day of wedding’ chadhaava (Hindi) – विवाहसमयी वराकडून वधुला दिले जाणारे दागिने ‘at-the-time-of-wedding–bridegroom–bride–given–ornament’ (Marathi) Using transliteration, if the synthetic expression is larger –seharaa (~garland: complicated cultural expression) –Seharaa (transliterated in Marathi) Reciprocally, maahervaashiin ‘a woman who has come to stay at her parents' place after her marriage’: no equivalent in Hindi

48 Linguistic challenges (2/2) Singleton Hindi pivot synset expressed through more than one finer concept in Marathi fikaa in Hindi: ‘food prepared with less sugar, salt or spice’, Marathi equivalent: three distinct words expressing three distinct finer concepts –agodh ‘less sweet’ –aLanii ‘less salty’ –miLamiLat ‘less spicy’. These three words cannot be taken as the members of a single synset in Marathi

49 Computational Aspects

50 Dictionary development framework

51 Dictionary entry template
ID:: 02691516
CAT:: verb
CONCEPT:: be in a state of movement or action
EXAMPLE:: "The room abounded with screaming children"
SYNSET-ENGLISH :: (abound, burst, bristle)
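An entry in this template is easy to read into a record. The sketch below assumes each field sits on its own line in the form `FIELD:: value`, as the field markers suggest; the `parse_entry` function is a hypothetical helper, not part of the tool described in the talk.

```python
ENTRY = """\
ID:: 02691516
CAT:: verb
CONCEPT:: be in a state of movement or action
EXAMPLE:: "The room abounded with screaming children"
SYNSET-ENGLISH :: (abound, burst, bristle)
"""


def parse_entry(text):
    """Turn 'FIELD:: value' lines into a dict; split the synset into words."""
    entry = {}
    for line in text.splitlines():
        if "::" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split("::", 1)
        entry[key.strip()] = value.strip()
    if "SYNSET-ENGLISH" in entry:
        words = entry["SYNSET-ENGLISH"].strip("()")
        entry["SYNSET-ENGLISH"] = [w.strip() for w in words.split(",")]
    return entry


record = parse_entry(ENTRY)
print(record["ID"], record["SYNSET-ENGLISH"])
```

The ID is kept as a string so that leading zeros survive.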

52 Language and Task Configuration window

53 Synset entry and word-alignment interface

54 Conclusion (1/3) Linked wordnets: Immense Lexical Resource Great benefits to machine translation, cross lingual search Very useful for language teaching, pedagogy, comparative linguistics Akin to Eurowordnet, but critical differences due to typical Indian language characteristics Great Unifier of the country

55 Conclusion (2/3) Computational challenges: –Maintenance of multilingual data –their insertion, deletion and updating in a spatially and temporally distributed situation

56 Conclusion (3/3) Advantages of the framework –a linguistically sound basis of the dictionary framework –economy of representation and –avoidance of duplication of effort

