CS626-460: Language Technology for the Web/Natural Language Processing Pushpak Bhattacharyya CSE Dept., IIT Bombay Topic: Hindi Wordnet, Formalization.


1 CS626-460: Language Technology for the Web/Natural Language Processing Pushpak Bhattacharyya CSE Dept., IIT Bombay Topic: Hindi Wordnet, Formalization etc.

2 Lexical Matrix

3 Creation of Synsets
Three principles:
– Minimality
– Coverage
– Replaceability

4 Synsets
{house} alone is ambiguous. {house, home} has the sense of a social unit living together, but is this the minimal unit? {family, house, home} makes the unit completely unambiguous. For coverage: {family, household, house, home}, ordered according to frequency. Replaceability of the most frequent words is a requirement.
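The minimality argument above can be sketched in a few lines of Python. The sense inventory below is a hypothetical toy (the sense labels are invented for illustration and are not taken from any real wordnet); the point is only that adding words to a candidate synset shrinks the set of senses they share until one remains.

```python
# Toy sense inventory (hypothetical labels, not taken from any real wordnet).
SENSES = {
    "house": {"building", "social_unit", "legislature", "audience"},
    "home": {"building", "social_unit", "home_base"},
    "family": {"social_unit", "taxonomic_group"},
    "household": {"social_unit"},
}


def shared_senses(words):
    """Senses that every word in the candidate synset can express."""
    return set.intersection(*(SENSES[w] for w in words))


def is_unambiguous(words):
    """A candidate synset is unambiguous when exactly one sense is shared."""
    return len(shared_senses(words)) == 1


for candidate in (["house"], ["house", "home"], ["family", "house", "home"]):
    print(candidate, sorted(shared_senses(candidate)), is_unambiguous(candidate))
```

Under this toy inventory, {house, home} still shares two senses, and adding "family" leaves only the social-unit sense, which mirrors the reasoning on the slide.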

5 Synset creation
From first principles
– Pick all the senses from good standard dictionaries.
– Obtain synonyms for each sense.
– Needs long hours of hard work.

6 Synset creation (continued)
From the wordnet of another language in the same family
– Pick the synset and obtain the sense from the gloss.
– Get the words of the target language.
– Often the same words can be used, especially for tatsama words.
– Translation, insertion and deletion.
Hindi Synset: {अनुभवी, जानकार, मंजा हुआ} (experienced person)
Marathi Synset: {अनुभवी, तज्ञ, जाणता, ज्ञाता}

7 Gloss and Example
Crucially needed for concept explication, for wordnet building using another wordnet, and for wordnet linking.
{earthquake, quake, temblor, seism} -- (shaking and vibration at the surface of the earth resulting from underground movement along a fault plane or from volcanic activity)
Hindi Synset: {भूकंप, भूचाल, भूडोल}; पृथ्वी के पृष्ठभाग का हिलना; गुजरात में हुए भूकंप में अनेक लोग मारे गये. (shaking of the surface of earth; many were killed in the earthquake in Gujarat)
Marathi Synset: {भूकंप, धरणीकंप}; पृथ्वीचा पृष्ठभाग हलण्याची क्रिया; गुजराथमध्ये झालेल्या भूकंपात अनेक लोक मारले गेले. (shaking of the surface of earth; many were killed in the earthquake in Gujarat)

8 WordNet Sub-Graph
[Figure: sub-graph around the synset {house, home}, whose gloss is "a place that serves as the living quarters of one or more families". Other nodes: {dwelling, abode}, bedroom, kitchen, guestroom, veranda, backyard, study, hermitage, cottage. Edge labels: hypernymy, hyponymy, meronymy, gloss.]

10 Ontology
– Needed for word sense disambiguation.
– Makes explicit the semantic relations.
– Tries to link correctly the exact place of a particular sense in the structure of a language.
– Conceptual categories of nouns, verbs, adjectives and adverbs are placed in a directed acyclic graph structure.

11 Wordnet defines an ontology
earthquake, quake, temblor, seism
  => geological phenomenon
    => natural phenomenon
      => phenomenon
– Property inheritance is possible.
– Important for sense disambiguation.
– The ontology is shallow for non-noun POS.
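A minimal sketch of property inheritance along the hypernym chain above. The chain itself comes from the slide; the property labels are hypothetical, and the sketch assumes a single hypernym per concept for simplicity, whereas a directed acyclic graph can give a concept several parents.

```python
# Hypernym links from the slide: each concept maps to its immediate hypernym.
HYPERNYM = {
    "earthquake": "geological phenomenon",
    "geological phenomenon": "natural phenomenon",
    "natural phenomenon": "phenomenon",
}

# Hypothetical properties attached directly to each concept (illustration only).
PROPERTIES = {
    "phenomenon": {"observable"},
    "natural phenomenon": {"not_man_made"},
    "geological phenomenon": {"involves_earth_crust"},
}


def hypernym_chain(concept):
    """Return the concept followed by all of its ancestors, nearest first."""
    chain = [concept]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def inherited_properties(concept):
    """Collect the properties of the concept and of every ancestor."""
    props = set()
    for node in hypernym_chain(concept):
        props |= PROPERTIES.get(node, set())
    return props


print(hypernym_chain("earthquake"))
print(sorted(inherited_properties("earthquake")))
```

Because "earthquake" has no properties of its own in this toy table, everything it reports is inherited from its three ancestors.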

12 Hindi Wordnet

13 A small part of Hindi Wordnet

19 User comments
1. Amazing. There are no words to praise it. Indian wisdom holds that what is useful is beautiful. Extremely useful for a linguist like me. Thank you. - Acharya Kamta Prasad
2. This is very important work; one might even call it wonderful and incomparable work. - Dr. Jagdish Vyom
3. http://desh-duniya.blogspot.com/2006/07/blog-post_21.html

20 Hindi dictionary website
http://www.cfilt.iitb.ac.in/wordnet/webhwn/hindi_version.html

21 Synset Entry Interface

23 Marathi WN from Hindi WN

24 WordNet Building Approaches
– Building a WordNet from scratch is time consuming and needs extensive manual effort.
– An alternative approach is to build the WordNet using another WordNet as the base.
– The Marathi WordNet is being built from the Hindi WordNet using this approach.

25 Building MWN from HWN
– Consider a synset from HWN corresponding to some concept. For example, the synset in Hindi for the concept of "tree" is: {ped, vriksh, paadap, drum, taru, vitap, ruuksh, ruukh, adhrip, taruvar}
– Construct the Marathi synset with the same id, representing the same concept, as follows: {jhaad, vriksh, taruvar, drum, taru, paadap}

26 Building MWN from HWN (cont.)
– If some sense is only in Hindi, but not in Marathi, the corresponding synset in Marathi can't be created. For example: {daadaa, baabaa, aajaa, daddaa, pitaamaha, prapitaa}
– If some sense is only in Marathi but not in Hindi, then create a synset with a different id. For example: {powaadaa} – song praising the bravery of Maratha warriors

27 Building MWN from HWN (cont.)
– When a sense is present in both, the semantic relations in HWN are borrowed directly into MWN.
– Lexical relationships are added manually.
– If the sense is present only in Marathi, then all the relationships have to be established manually.
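The three slides on building MWN from HWN can be summarised in a short Python sketch. This is an illustration under stated assumptions, not the actual tool used for the Marathi WordNet: the `Synset` class, the `build_target_wordnet` function, the numeric ids and the `hi_plant` / `mr_plant` placeholder entries are all hypothetical, while the tree, grandfather and powaadaa word lists come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    synset_id: int
    words: list
    # Semantic relations: relation name -> list of synset ids.
    relations: dict = field(default_factory=dict)


def build_target_wordnet(source, translations, target_only, first_new_id):
    """Derive a target wordnet (e.g. MWN) from a source wordnet (e.g. HWN).

    source:       dict of synset id -> Synset in the source language
    translations: dict of synset id -> target-language words; an id that is
                  missing or empty means the sense does not exist in the target
    target_only:  word lists for senses that exist only in the target language
    first_new_id: first id for target-only synsets (assumed not to clash)
    """
    target = {}
    # Shared senses keep the source synset id.
    for sid in source:
        words = translations.get(sid)
        if words:
            target[sid] = Synset(sid, list(words))
    # Borrow semantic relations whose endpoints both exist in the target.
    for sid, synset in target.items():
        for relation, ids in source[sid].relations.items():
            kept = [t for t in ids if t in target]
            if kept:
                synset.relations[relation] = kept
    # Target-only senses get fresh ids and no relations (added manually later).
    next_id = first_new_id
    for words in target_only:
        target[next_id] = Synset(next_id, list(words))
        next_id += 1
    return target


hwn = {
    1: Synset(1, ["ped", "vriksh", "paadap", "drum", "taru", "vitap",
                  "ruuksh", "ruukh", "adhrip", "taruvar"], {"hypernym": [3]}),
    2: Synset(2, ["daadaa", "baabaa", "aajaa", "daddaa", "pitaamaha", "prapitaa"]),
    3: Synset(3, ["hi_plant"], {"hyponym": [1]}),  # placeholder entry
}
marathi_words = {
    1: ["jhaad", "vriksh", "taruvar", "drum", "taru", "paadap"],
    3: ["mr_plant"],  # placeholder entry
}
mwn = build_target_wordnet(hwn, marathi_words, [["powaadaa"]], first_new_id=1000)
for sid in sorted(mwn):
    print(sid, mwn[sid].words, mwn[sid].relations)
```

Lexical relations are deliberately left out of the sketch, since the slides say they are added manually in every case.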

28 Challenges
– The quality and effectiveness of the WN depend largely on the quality of the base WN.
– Some words are polysemous in one language, but not in the other.
– The same word can have drastically different meanings in the two languages.
– Words which have subtly different meanings in the two languages can be misunderstood to have the same meaning.

29 Hindi WSD

30 Approach to WSD
[Figure: a context bag is built from the Hindi document and a semantic bag from the Hindi Wordnet; the two bags are compared by intersection similarity.]

31 The WSD Algorithm Parameters
– Wordnet relations: the synonymy, hypernymy, hyponymy and meronymy relations, together with their glosses and example sentences, form the semantic bag.
– Word context size: the current, previous and following sentences around the word form the context bag.

32 The WSD Algorithm (cont.)
– Let 'w' be the word whose disambiguation is to be done.
– Construct the context bag.
– Construct the semantic bag.
– Using the 'intersection similarity', find the overlap.
– Output the sense 's' which has the maximum overlap as the most probable sense.
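The steps above can be sketched as follows. This is a minimal illustration, not the system evaluated in the slides: the sense entries are a hypothetical English toy built around the {house, home} example rather than Hindi Wordnet data, the function names are invented, overlap is taken as plain set intersection size, and no stop-word removal or morphological processing is attempted.

```python
PUNCT = ".,;:!?\"'()।"


def tokenize(text):
    """Lower-case, split on whitespace and strip surrounding punctuation."""
    tokens = (tok.strip(PUNCT) for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def context_bag(sentences, index):
    """Words of the previous, current and following sentence."""
    window = sentences[max(0, index - 1): index + 2]
    return {tok for sentence in window for tok in tokenize(sentence)}


def semantic_bag(sense):
    """Synonyms and related words, plus the words of the gloss and example."""
    bag = set(sense["synonyms"]) | set(sense["related"])
    bag.update(tokenize(sense["gloss"]))
    bag.update(tokenize(sense["example"]))
    return bag


def disambiguate(senses, sentences, index):
    """Return the id of the sense with maximum overlap (first one on ties)."""
    context = context_bag(sentences, index)
    best = max(senses, key=lambda s: len(context & semantic_bag(s)))
    return best["id"]


# Hypothetical toy senses for the word "house".
senses = [
    {"id": "building", "synonyms": ["house", "home"],
     "related": ["dwelling", "bedroom", "kitchen", "cottage"],
     "gloss": "a place that serves as the living quarters of one or more families",
     "example": "they built a house on the hill"},
    {"id": "social_unit", "synonyms": ["family", "household", "house", "home"],
     "related": [],
     "gloss": "a social unit living together",
     "example": "the whole house was asleep"},
]
sentences = [
    "The family moved last year.",
    "The whole house gathered for dinner.",
    "Every member of the household was there.",
]
print(disambiguate(senses, sentences, 1))
```

Function words such as "the" and "of" count towards the overlap in this sketch, so a practical system would probably filter them before intersecting the bags.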

33 Evaluation
– Presently, the system disambiguates nouns only.
– The test corpora were taken from CIIL, Mysore.
– The system was tested on corpora from 8 domains, each containing around 2000 words on average.

36 Result

37 Discussion
– The agriculture domain gave the most correct results, while children's literature gave the fewest.
– 25% of the words are found relevant though they don't exactly match the sense.

