Indo WordNet: A WordNet for Hindi
Debasri Chakrabarti, Dipak Kumar Narayan, Prabhakar Pandey, Madhu Prasad Sharma
Centre for Technology Development for Indian Languages
Computer Science and Engineering Department, IIT Bombay
Introduction
- WordNet: a lexical database
- Searching the dictionary conceptually
- Different organizing principles for different syntactic categories
- Synsets (synonym sets) are the basic building blocks
- A lexical knowledge base is the heart of any intelligent information processing system
WordNet for Hindi
- Hindi WordNet is an online lexical database for the Hindi language
- Its design is inspired by the English WordNet
- Unique features:
  - Graded antonymy and meronymy relationships
  - Efficient underlying database design
  - Cross-part-of-speech linkage
Semantic relations in WordNet
- Synonymy
- Hypernymy / Hyponymy
- Antonymy
- Meronymy / Holonymy
- Gradation
- Entailment
- Troponymy
Semantic Relations: Synonymy
- True synonyms are rare
- Synonymy is relative to a context; घर (house) belongs to a different synset in each of its senses:
  - {घर, कमरा} (room)
  - {घर, आवास} (residence)
  - {घर, जन्मकुंडलीय स्थान} (house in a horoscope)
  - {घर, स्वदेश} (homeland)
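The idea that one word form can belong to several synsets, one per context, can be sketched as a small lookup structure. This is only an illustration: the four synsets are the slide's "house" (घर) example written in Devanagari, while the integer ids, the English glosses, and the function names are assumptions made for this sketch, not part of the Hindi WordNet itself.

```python
from collections import defaultdict

# Synsets from the "house" example. The ids and English glosses are
# illustrative assumptions, not values from the actual database.
SYNSETS = {
    1: {"words": ["घर", "कमरा"], "gloss": "room"},
    2: {"words": ["घर", "आवास"], "gloss": "residence"},
    3: {"words": ["घर", "जन्मकुंडलीय स्थान"], "gloss": "house in a horoscope"},
    4: {"words": ["घर", "स्वदेश"], "gloss": "homeland"},
}


def build_word_index(synsets):
    """Map each word form to the ids of all synsets that contain it."""
    index = defaultdict(list)
    for synset_id, entry in synsets.items():
        for word in entry["words"]:
            index[word].append(synset_id)
    return dict(index)


WORD_INDEX = build_word_index(SYNSETS)


def senses(word):
    """Return the glosses of every synset the word belongs to."""
    return [SYNSETS[synset_id]["gloss"] for synset_id in WORD_INDEX.get(word, [])]
```

Calling `senses("घर")` returns one gloss per context, while a word such as `"आवास"` that occurs in a single synset returns a single gloss.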
Semantic Relations: Hypernymy and Hyponymy
- A relation between word meanings (synsets)
- X is a hyponym of Y if X is a kind of Y
- Hyponymy is transitive and asymmetrical
- Hypernymy is the inverse of hyponymy
- Example: शेर (lion) → पशु (animal) → सजीव (living entity) → अस्तित्व (entity)
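Transitivity and asymmetry of hyponymy can be made concrete with a short sketch that walks the hypernym links of the lion example. It assumes, for simplicity, that each synset has at most one direct hypernym and uses English labels in place of synset ids; the names `HYPERNYM_OF`, `hypernym_chain`, and `is_hyponym_of` are hypothetical.

```python
# Direct hypernym links from the lion example. English labels stand in
# for synset ids; a single parent per synset is a simplifying assumption.
HYPERNYM_OF = {
    "lion": "animal",
    "animal": "living entity",
    "living entity": "entity",
}


def hypernym_chain(synset, hypernym_of=HYPERNYM_OF):
    """Follow direct hypernym links upward and return all ancestors in order."""
    chain = []
    seen = {synset}
    current = synset
    while current in hypernym_of:
        current = hypernym_of[current]
        if current in seen:
            # A cycle would contradict the asymmetry of the relation.
            raise ValueError(f"cycle detected at {current!r}")
        seen.add(current)
        chain.append(current)
    return chain


def is_hyponym_of(x, y, hypernym_of=HYPERNYM_OF):
    """X is a (possibly indirect) hyponym of Y if Y is among X's ancestors."""
    return y in hypernym_chain(x, hypernym_of)
```

Because the chain is followed to the top, `is_hyponym_of("lion", "entity")` holds even though only direct links are stored, and the reverse query does not hold.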
Semantic Relations: Antonymy, Meronymy and Holonymy
- Antonymy
  - Oppositeness in meaning
  - A relation between word forms
- Meronymy and Holonymy
  - Part-whole relation, e.g. a branch is a part of a tree
  - X is a meronym of Y if X is a part of Y
  - Meronymy is transitive and asymmetrical
  - Holonymy is the inverse relation of meronymy
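Troponym and Entailment
- Entailment: खर्राटा लेना (to snore) entails सोना (to sleep)
- Troponym: लँगड़ाना (to limp) and कदमताल करना (to march in step) are troponyms of चलना (to walk); फुसफुसाना (to whisper) is a troponym of बोलना (to speak)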
Antonymy Relation
- Size: छोटा / बड़ा (small / big)
- Quality: अच्छा / बुरा (good / bad)
- State: रात / दिन (night / day)
- Personality: राम / रावण (Rama / Ravana)
- Direction: पूर्व / पश्चिम (east / west)
- Action: लेना / देना (to take / to give)
- Amount: कम / ज्यादा (less / more)
- Place: दूर / पास (far / near)
- Time: सुबह / शाम (morning / evening)
- Gender: बेटा / बेटी (son / daughter)
Meronymy Relation
- Component-object: माथा / शरीर (forehead / body)
- Stuff-object: पत्थर / मूर्ति (stone / statue)
- Member-collection: पेड़ / जंगल (tree / forest)
- Feature-Activity: भाषण / समारोह (speech / ceremony)
- Place-Area: दिल्ली / भारत (Delhi / India)
- Phase-State: जवानी / उम्र (youth / age)
- Resource-process: कलम / लेखन (pen / writing)
- Position-Area: चिकित्सक / चिकित्सा (physician / medicine)
Gradation
- State: बचपन, जवानी, बुढ़ापा (childhood, youth, old age)
- Size: बड़ा, मँझला, छोटा (big, medium, small)
- Light: उजाला, धुँधला, अँधेरा (light, dim, dark)
- Gender: मर्द, नपुंसक, औरत (man, neuter, woman)
- Temperature: गरम, गुनगुना, ठंडा (hot, lukewarm, cold)
- Color: गोरा, साँवला, काला (fair, dusky, black)
- Time: दिन, गोधूलि, रात (day, dusk, night)
- Quality: अच्छा, सामान्य, खराब (good, average, bad)
- Action: सोना, ऊँघना, जागना (to sleep, to doze, to be awake)
- Manner: तेजी से, मध्यम गति से, धीरे-धीरे (fast, at moderate speed, slowly)
Classification of verbs
- Simple verbs (सरल क्रिया): सोना (to sleep), खाना (to eat)
- Conjunct verbs (संयुक्त क्रिया)
- Compound verbs (सामासिक क्रिया): खाना-पीना (eating and drinking)
- Causative verbs (प्रेरणात्मक क्रिया): सुलवाना (to have someone put to sleep)
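WordNet Sub-Graph
[Figure: sub-graph around the synset {घर, गृह} (house), with edges labelled Hypernymy, Hyponymy, Meronymy and Gloss]
- Gloss: मनुष्यों का छाया हुआ वह स्थान जो दीवारों से घेर कर बनाया जाता है (a roofed place for people, built by enclosing it with walls)
- Other nodes shown: {आवास, निवास} (residence), संरचना (structure), अध्ययन कक्ष (study room), शयन कक्ष (bedroom), रसोईघर (kitchen), अतिथि गृह (guest house), बरामदा (veranda), आँगन (courtyard), आश्रम (hermitage), झोपड़ी (hut)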
Design and Implementation
- Basic relations (lexical links) hold between synonym sets
- The lexical database is stored in MySQL
- Sub-tasks identified:
  - Database design
  - Data entry interface
  - Implementation of the Organizer Utility
  - Application programs to access and display the information in the lexical database
Data Entry Interface
- GUI designed in Java/JFC
- Separate screens for data entry of different categories
- Automatic generation of synset IDs
- Screen to view the entered data
Synset Entry Interface
Organizer Utility
- Designed to preprocess the data
- Reflexive pointers are generated, e.g. if A is a hypernym of B, then "B is a hyponym of A" is generated automatically
- Each semantic relation is mapped to a separate (normalized) table
- Font conversion: Roman to Hindi (DV-TTYogesh)
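The generation of reflexive pointers can be sketched as follows. The slides state that the actual Organizer Utility is written in Perl, so this Python version is only an illustration of the idea; the `(source, relation, target)` triple format, the `INVERSE` table, and the function name are assumptions, and treating antonymy as its own inverse is likewise an assumption of this sketch.

```python
# Inverse of each relation name. A triple (A, "hypernym", B) is read as
# "A is a hypernym of B". The triple layout is a hypothetical format.
INVERSE = {
    "hypernym": "hyponym",
    "hyponym": "hypernym",
    "meronym": "holonym",
    "holonym": "meronym",
    "antonym": "antonym",  # assumed symmetric for this sketch
}


def add_reflexive_pointers(pointers):
    """Return the entered pointers plus the automatically generated inverses."""
    result = set(pointers)
    for source, relation, target in pointers:
        inverse = INVERSE.get(relation)
        if inverse is not None:
            result.add((target, inverse, source))
    return result
```

With this approach the data entry operator only records one direction of each link, for example `("animal", "hypernym", "lion")`, and the opposite pointer `("lion", "hyponym", "animal")` is derived during preprocessing.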
Storage Structure
- Relation between synsets: tblNounHypernyms (Synset_Id, HyperSynset_Id)
- Relation between word forms: tblNounAntonyms (Synset_Id, Synset_Word, Anto_Id, Anto_Word, Anto_Type)
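A minimal sketch of how these two tables might be declared and queried is given below. The table and column names come from the slide; everything else is an assumption: the column types, the primary key, the reading of Anto_Id as the antonym's synset id, the sample rows, and the use of Python's built-in sqlite3 module as a stand-in for the MySQL server used by the real system.

```python
import sqlite3

# Table and column names follow the slide. Types and keys are assumptions,
# and SQLite is used here only as a stand-in for MySQL.
SCHEMA = """
CREATE TABLE tblNounHypernyms (
    Synset_Id      INTEGER NOT NULL,
    HyperSynset_Id INTEGER NOT NULL,
    PRIMARY KEY (Synset_Id, HyperSynset_Id)
);
CREATE TABLE tblNounAntonyms (
    Synset_Id   INTEGER NOT NULL,
    Synset_Word TEXT    NOT NULL,
    Anto_Id     INTEGER NOT NULL,
    Anto_Word   TEXT    NOT NULL,
    Anto_Type   TEXT    NOT NULL
);
"""


def create_demo_db():
    """Create an in-memory database with a few made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # Synset-level relation: only ids are stored.
    conn.executemany(
        "INSERT INTO tblNounHypernyms VALUES (?, ?)", [(1, 2), (2, 3)]
    )
    # Word-form-level relation: the words themselves are stored with a type.
    conn.execute(
        "INSERT INTO tblNounAntonyms VALUES (?, ?, ?, ?, ?)",
        (10, "day", 11, "night", "State"),
    )
    conn.commit()
    return conn


def direct_hypernyms(conn, synset_id):
    """Return the ids of the direct hypernym synsets of a synset."""
    rows = conn.execute(
        "SELECT HyperSynset_Id FROM tblNounHypernyms "
        "WHERE Synset_Id = ? ORDER BY HyperSynset_Id",
        (synset_id,),
    )
    return [row[0] for row in rows]
```

The contrast between the two tables mirrors the slide: hypernymy links synsets, so only ids are needed, whereas antonymy links word forms, so the rows also carry the words and the antonymy type.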
System Statistics
- Over 8500 synsets entered in the database
- MySQL used as the back-end database server
- Data entry interface designed in Java/JFC
- Organizer utility written in Perl
- Web-based data retrieval system developed in HTML and PHP
- DV-TTYogesh font used to display Hindi text
Applications of WordNet
- Word sense disambiguation
- Interface to Internet search engines
- Text classification
- Information retrieval systems
- Document similarity
Conclusion
- The structure of the Hindi language has been studied, and new features have been introduced in the Hindi WordNet
- Currently, over 8500 synsets have been inserted into the database
- The MySQL database has been found to be quite efficient
- The web interface for querying the lexical database is under continuous evolution