Published by Brice Welch. Modified over 9 years ago.
1
Tools and Interfaces for Wordnet construction, linking and maintenance
Abhishek G. Nanda
Under the guidance of: Prof. Pushpak Bhattacharyya
2
Wordnet
Language - Means of communication using encoded information
Words - Units used for communicating information
Semantics - Meanings of words and word forms
3
Wordnet
Dictionary - List of alphabetically arranged words with meanings
Thesaurus - List of alphabetically arranged concepts with word forms
What is Wordnet?
4
Wordnet
Lexical database of words
Arranged based on concepts
Grouped based on synonymy
Synonymy - Property of different words sharing the same meaning in a context. E.g. buy and purchase
Polysemy - Property of a word having different meanings in different contexts. E.g. bank as a financial institution and as a river bank
5
Wordnet - Lexical Matrix
Word Meanings | Word Forms
              | F1            | F2          | F3          | …                 | Fn
M1            | (depend) E1,1 | (bank) E1,2 | (rely) E1,3 |                   |
M2            |               | E2,2        |             | (embankment) E2,… |
M3            |               | E3,2        | E3,3        |                   |
…             |               |             |             |                   |
Mm            |               |             |             |                   | Em,n
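The lexical matrix can be sketched in code as a mapping from meanings (rows) to the word forms that express them (columns). This is a minimal illustration, not part of the original tool: the class name `LexicalMatrixSketch` and all of its methods are hypothetical, and the sample entries simply reuse the words shown in the matrix.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: rows are meanings, columns are word forms.
class LexicalMatrixSketch {
    // meaning ID -> word forms that can express it (one row of the matrix)
    private final Map<String, Set<String>> formsByMeaning = new LinkedHashMap<>();

    // Records that a word form expresses a meaning (one filled cell E[i][j]).
    void addEntry(String meaning, String form) {
        formsByMeaning.computeIfAbsent(meaning, k -> new LinkedHashSet<>()).add(form);
    }

    // Synonymy: all forms that share one row.
    Set<String> synonymsIn(String meaning) {
        return formsByMeaning.getOrDefault(meaning, new LinkedHashSet<>());
    }

    // Polysemy: all rows in which one form appears.
    List<String> meaningsOf(String form) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<String>> row : formsByMeaning.entrySet()) {
            if (row.getValue().contains(form)) {
                result.add(row.getKey());
            }
        }
        return result;
    }

    boolean isPolysemous(String form) {
        return meaningsOf(form).size() > 1;
    }

    // Builds the two labelled rows from the slide (M1 and M2).
    static LexicalMatrixSketch example() {
        LexicalMatrixSketch m = new LexicalMatrixSketch();
        m.addEntry("M1", "depend");
        m.addEntry("M1", "bank");
        m.addEntry("M1", "rely");
        m.addEntry("M2", "bank");
        m.addEntry("M2", "embankment");
        return m;
    }
}
```

Reading along a row gives a synset; reading down a column gives the senses of a polysemous form such as bank.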
6
Wordnet - Relations
Semantic Relations
  Hypernymy and Hyponymy
  Meronymy and Holonymy
  Entailment
  Troponymy
  Coordinate terms
Lexical Relations
  Antonymy
  Gradation
7
Wordnet - Relations
Hypernymy and Hyponymy
  is a kind of
  leaf is the hypernym of neem leaf
  neem leaf is the hyponym of leaf
Meronymy and Holonymy
  part-whole
  root is the meronym of tree
  tree is the holonym of root
8
Wordnet - Relations
Entailment
  implication
  snore entails sleep
Troponymy
  manner elaboration
  roar is the troponym of speak
Coordinate terms
  Common hypernym
  wolf and dog are coordinate terms
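The hypernymy and coordinate-term relations can be illustrated with a small is-a map. This is a sketch under stated assumptions: `HypernymSketch` and its methods are hypothetical names, each word has at most one direct hypernym here, and the caller supplies the shared hypernym for wolf and dog since the slides do not name it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of "is a kind of" links between words.
class HypernymSketch {
    // hyponym -> its direct hypernym (simplifying assumption: only one)
    private final Map<String, String> hypernymOf = new HashMap<>();

    void addIsA(String hyponym, String hypernym) {
        hypernymOf.put(hyponym, hypernym);
    }

    // True if 'general' is reachable from 'specific' by following is-a links.
    boolean isKindOf(String specific, String general) {
        String current = hypernymOf.get(specific);
        // Bound the walk by the number of links so cyclic input cannot loop forever.
        for (int steps = 0; current != null && steps < hypernymOf.size(); steps++) {
            if (current.equals(general)) {
                return true;
            }
            current = hypernymOf.get(current);
        }
        return false;
    }

    // Coordinate terms: two different words with the same direct hypernym.
    boolean areCoordinateTerms(String a, String b) {
        String parentOfA = hypernymOf.get(a);
        return !a.equals(b) && parentOfA != null && parentOfA.equals(hypernymOf.get(b));
    }
}
```

For example, after `addIsA("neem leaf", "leaf")`, the call `isKindOf("neem leaf", "leaf")` holds, while the reverse direction does not.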
9
Wordnet - Relations
Antonymy
  opposites
  fat is the antonym of thin
Gradation
  Intermediate concepts in antonymy
  morning -> noon -> evening
10
Wordnet - Wordnets
PWN - Princeton WordNet for the English language
EuroWordNet - Wordnet for European languages
HWN - Hindi Wordnet for the Hindi language
11
Hindi Wordnet
Relations borrowed - synonymy, hypernymy, holonymy, troponymy, entailment, etc.
Defines 8 part-whole relationships
Defines 3 types of antonymy relations
  Gradable antonym (गर्म-ठंडा)
  Complementary antonym (जीवित-मृत)
  Converse antonym (लेना-देना)
12
Hindi Wordnet
Gradation
  Intermediate terms
  Pre-Intermediate terms
  Post-Intermediate terms
  E.g. सूखा - शुष्क - नम - तर - गीला
10 domains of interpretation. E.g. State, Size, Gender, etc.
13
Hindi Wordnet - Verbs
Simple Verb - One root. E.g. खाना
Compound Verb - Made up of another POS. E.g. मीठा लगना
Combination Verb - Made of two related verbs. E.g. पढ़ना-लिखना
Onomatopoeic Verb - E.g. खटखटाना from खटखट
Conjunct Verb - Hidden sense of action. E.g. ले जाना
14
Hindi Wordnet - Verbs
Causative verbs
  First causative verb - E.g. सुलाना (to make somebody sleep)
  Second causative verb - E.g. सुलवाना (to make somebody sleep through the effort of a third person)
15
Hindi Wordnet - Creation
Principles for Wordnet creation
  Minimality - Minimal set. E.g. {घर, कमरा, कक्ष}
  Coverage - Coverage of words. E.g. {घर, कमरा, कक्ष}
  Replaceability - Mutual replaceability in a context. E.g. अमेरिका में दो साल बिताने के बाद श्याम स्वदेश/घर लौटा
16
Sanskrit Wordnet
Concept-based Multilingual dictionary
Need
  Loss of synonymy when moving across languages. E.g. dark and evil are synonymous in English, but their counterparts अंधेरा and दुष्ट are not.
  Number of lexicographers required - O(n²)
17
Sanskrit Wordnet - Concept based Multilingual dictionary
Concepts | L1 (English) | L2 (Hindi) | L3 (Sanskrit)
Concept ID: Concept description | (W1, W2, W3, ..) | (W4, W5, W6, ..) | (W7, W8, W9, ..)
4066: any of various long-tailed primates (excluding the prosimians) | (monkey) | (बंदर, बन्दर, बानर, वानर, कीश, कपि, मर्कट, ..) | (वानरः, कपिः, प्लवङ्गः, प्लवगः, शाखामृगः, वलीमुखः, मर्कटः, ..)
2186: a typical star that is the source of light and heat for the planets in the solar system | (sun) | (सूर्य,सूरज, भानु, दिवाकर, भास्कर, प्रभाकर, दिनकर, रवि, ..) | (सूर्यः, सविता, आदित्यः, मित्रः, अरुणः, भानुः, पूषा, अर्कः, ..)
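The table can be pictured as a map keyed by concept ID, with one word list per language. The following is a minimal sketch, not the project's actual data structure: `ConceptDictionarySketch` and its methods are hypothetical, and any sample words passed in are chosen by the caller.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a concept-based multilingual dictionary.
class ConceptDictionarySketch {
    // concept ID -> (language -> words expressing that concept)
    private final Map<Integer, Map<String, List<String>>> rows = new HashMap<>();

    void put(int conceptId, String language, String... words) {
        rows.computeIfAbsent(conceptId, k -> new HashMap<>())
            .put(language, Arrays.asList(words));
    }

    List<String> words(int conceptId, String language) {
        Map<String, List<String>> row = rows.get(conceptId);
        if (row == null) {
            return Collections.emptyList();
        }
        return row.getOrDefault(language, Collections.emptyList());
    }

    // Translates by pivoting through the shared concept ID
    // instead of using a direct language-to-language link.
    List<String> translate(String word, String from, String to) {
        for (Map.Entry<Integer, Map<String, List<String>>> row : rows.entrySet()) {
            List<String> sourceWords = row.getValue().getOrDefault(from, Collections.emptyList());
            if (sourceWords.contains(word)) {
                return words(row.getKey(), to);
            }
        }
        return Collections.emptyList();
    }
}
```

Because every language links only to the concept ID, adding a new language means filling one new column rather than linking it to each existing language separately.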
18
Sanskrit Wordnet - Challenges
Observed during construction of Marathi Wordnet:
  Single word to synthetic expression. E.g. bankrupt -> दिवाला निकालना
  Culture-specific concepts. E.g. girlfriend. Requires transliteration such as महिलामित्र
  Splitting of concepts. E.g. फ़ीका (tasteless) in Hindi -> अगोड (less sweet), अळणी (less salty), मिळमिळत (less spicy) in Marathi
19
Sanskrit Wordnet - Challenges
Observed during Indo Wordnet workshop at Coimbatore, June 2009:
  Varied usage across regions and people. E.g. In Kashmiri, there are separate words for drinking water and water in the Muslim community but one word in the Hindu community.
  Single-word and multi-word expressions in the same language. E.g. In Nepali, मोह and मोह-माया both mean infatuation.
20
Sanskrit Wordnet - Sanskrit
Indo-Aryan language
Hinduism
Buddhism
Classical Sanskrit - Panini
Vedic Sanskrit - pre-Classical
21
Sanskrit Wordnet - Sanskrit Etymology
Etymology of Verbs
  गण - Ten classes based on how the stem is generated
  इट् - Three groups based on position of tense marker
  उपसर्ग - 22 prepositional particles that modify a root
22
Synset Marking
Grouping of synsets based on frequency of occurrence and usage in language
Universal concepts
  who and what
  honesty
23
SynsetMarker - Interface
24
SynsetMarker - Features
Display of synset fields
Browsing
Search
  Word
  ID
Marking - Universal, Common, Common in Hindi and Uncommon
Save/Exit
Shortcuts
25
SynsetMarker - API
records
  DefineRecord
  SynsetRecord
operations
  SynsetOperator
  RecordReader
  RecordWriter
gui
  Interface
26
SynsetMarker - Process
First round
  divided among 6 people
  31000 synsets marked
  Universal and Common clubbed synsets
  Common in Hindi synsets
  Uncommon synsets
Second round
  voting schema
  Common synsets
27
Core Synset Selection
Bharatiya Vyavahara Kosh
  English and 15 Indian languages
  2000 concepts with domains
  खेल (game), प्राणी (animal), फल (fruit)
Link synsets to words in Kosh
Polysemy
  अनन्नास as pineapple fruit
  अनन्नास as pineapple plant
28
DomainClassifier - Interface
29
DomainClassifier - Features
Display of synset fields
Browsing through records
Marking right synset for a word and a domain
Save/Export
30
DomainClassifier - API
records
  DefineRecord
  SynsetRecord
operations
  SynsetOperator
  RecordReader
  RecordWriter
gui
  Interface
31
DomainClassifier - Process
Groupings
  Single IDs
  Multiple IDs
  No IDs
Rounds of marking
  Common synsets
  Common in Hindi synsets
  Uncommon synsets
32
DomainClassifier - Process
End of process
  Core synsets
  Common synsets
33
Online SynsetMarker - Interface
34
Online SynsetMarker - Interface
35
Online SynsetMarker - API
Written in PHP
  login.php - Interface to log in as a user or as an admin, or to register as a new user
  process.php - Processes login/register data and directs the user accordingly
  logout.php - Logs out a user
  mainprocess.php - Processes data to display an unmarked synset
  main.php - Displays a synset with buttons to mark it as Common or Uncommon
  admin.php - Admin page with statistics on the number of marked synsets per user and the number of users based on synset marks
  adminpassword.php - Password interface to log in as admin
  adminuserprofile.php - Profile data of a particular user
36
Online SynsetMarker - Process
Threshold for dropping synset as Uncommon
  Had to be set to 1
Common synsets
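One possible reading of the threshold is that a synset is dropped once the number of users who marked it Uncommon reaches the threshold, so a threshold of 1 drops a synset on the first Uncommon mark. The sketch below encodes only that reading; the slides do not spell out the rule, and `UncommonThresholdSketch` and `isDropped` are hypothetical names.

```java
import java.util.List;

// Hypothetical sketch of a threshold rule for dropping synsets.
class UncommonThresholdSketch {
    // Each list element is one user's mark: true means "Uncommon".
    // Assumption: the synset is dropped when Uncommon marks reach the threshold.
    static boolean isDropped(List<Boolean> markedUncommon, int threshold) {
        int uncommonVotes = 0;
        for (boolean vote : markedUncommon) {
            if (vote) {
                uncommonVotes++;
            }
        }
        return uncommonVotes >= threshold;
    }
}
```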
37
Sanskrit Wordnet Interface
Interface for creation of Sanskrit Wordnet
Based on idea of Concept-based Multilingual dictionary
38
User Interface - Configure
39
User Interface - Main
40
User Interface - Panels
Help Panel: Buttons for Commenting, Synchronizing and References tool.
Search Panel: Search word or ID or perform advanced search. Font increase/decrease.
Synset Panels: Synset data fields and completion status.
Tool Panel: English synset, Link tool, Etymology tool.
Browse Panel: Browsing through records, saving and exiting.
41
User Interface - Features - Reference tool
42
User Interface - Features - Synchronize tool
43
User Interface - Features - Advanced Search
44
User Interface - Features - English synsets tool
45
User Interface - Features - Link tool
46
User Interface - Features - Etymology tool
47
User Interface - Features - Keyboard Shortcuts
Undo feature - Monitor keyboard actions and undo on Ctrl-Z
Saving feature - Monitor change in field values and save on Ctrl-S
Search - Ctrl-F for quick search access
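The state behind the undo and save shortcuts can be modelled with a history stack and a saved snapshot. This is a sketch of the idea only, without any Swing key handling: `UndoableFieldSketch` and its methods are hypothetical and do not come from the tool's source.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of one editable field with undo and save tracking.
class UndoableFieldSketch {
    private String value = "";
    private String savedValue = "";
    // Previous values, most recent first.
    private final Deque<String> history = new ArrayDeque<>();

    // Called whenever the user changes the field.
    void edit(String newValue) {
        history.push(value);
        value = newValue;
    }

    // What a Ctrl-Z handler would call; returns false when nothing is left to undo.
    boolean undo() {
        if (history.isEmpty()) {
            return false;
        }
        value = history.pop();
        return true;
    }

    // True when the field differs from the last saved value.
    boolean isDirty() {
        return !value.equals(savedValue);
    }

    // What a Ctrl-S handler would call.
    void save() {
        savedValue = value;
    }

    String value() {
        return value;
    }
}
```

A key listener would then only translate Ctrl-Z and Ctrl-S into calls to `undo()` and `save()`.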
48
Interface API Problems and Requirements
Huge volumes of data (e.g. 30,000 synsets)
Links between different data
Efficient and user-friendly GUI
Sufficient querying
Grouping
Review separation
49
Interface API
50
Graphical User Interface
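import javax.swing.JButton;

private JButton saveButton = null;

// Creates the Save button lazily on first access and reuses it afterwards.
public JButton getSaveButton() {
    if (saveButton == null) {
        saveButton = new JButton();
    }
    return saveButton;
}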
51
Graphical User Interface
52
Graphical User Interface - Panels
53
Graphical User Interface
Panels
  Hierarchical structure
Components (within Panels)
  Classes JButton, JTextField, JCheckBox, etc.
Listeners
  ActionListener - actions performed by user
  KeyListener - key strokes (undo, search) and shortcuts
54
Synset
Synset ID: A unique number identifying a synset
Category: POS category of the words
Concept: The part of the gloss that gives a brief summary of what the synset represents
Example: One or more examples of the words in the synset being used in sentences
Synset: The set of synonymous words comprised in the synset
55
Synset - DSF format
ID :: 121
CATEGORY :: NOUN
CONCEPT :: अपने से छोटों के प्रति हृदय में उठनेवाला प्रेम
EXAMPLE :: “चाचा नेहरू को बच्चों से बहुत ही स्नेह था”
SYNSET :: स्नेह,नेह,लगाव,ममता
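A record in this `KEY :: value` layout can be read with a few lines of string handling. The sketch below is an assumption about how such parsing might look, not the tool's actual parser: `DsfParserSketch`, `parse` and `synsetWords` are hypothetical names, and only the `::` separator and the comma-separated SYNSET field are taken from the example above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a parser for "KEY :: value" records.
class DsfParserSketch {
    // Parses one record into an ordered field map; lines without "::" are skipped.
    static Map<String, String> parse(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : text.split("\\R")) {
            int sep = line.indexOf("::");
            if (sep < 0) {
                continue;
            }
            String key = line.substring(0, sep).trim();
            String value = line.substring(sep + 2).trim();
            fields.put(key, value);
        }
        return fields;
    }

    // Splits the SYNSET field into its member words.
    static String[] synsetWords(Map<String, String> fields) {
        String synset = fields.getOrDefault("SYNSET", "");
        return synset.isEmpty() ? new String[0] : synset.split("\\s*,\\s*");
    }
}
```

Applied to the record above, `parse` would yield five fields and `synsetWords` four words.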
56
Data structure - SynsetRecord
Class SynsetRecord
Strings to hold field values
Functions:
  equals(otherObject)
  isBetterThan(otherObject)
  isComplete()
  …
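A class of this shape might look as follows. The slides list only the function names, so the semantics here are assumptions made for illustration: a record is complete when all four text fields are filled, and one record is better than another when it has more filled fields. The class name `SynsetRecordSketch` is hypothetical.

```java
import java.util.Objects;

// Hypothetical sketch of a synset record; field names follow the slide's field list.
class SynsetRecordSketch {
    final int id;
    final String category;
    final String concept;
    final String example;
    final String synset;

    SynsetRecordSketch(int id, String category, String concept, String example, String synset) {
        this.id = id;
        this.category = category;
        this.concept = concept;
        this.example = example;
        this.synset = synset;
    }

    private static boolean filled(String s) {
        return s != null && !s.trim().isEmpty();
    }

    // Number of non-empty text fields.
    private int filledCount() {
        int count = 0;
        for (String s : new String[] {category, concept, example, synset}) {
            if (filled(s)) {
                count++;
            }
        }
        return count;
    }

    // Assumption: complete means every text field is filled.
    boolean isComplete() {
        return filledCount() == 4;
    }

    // Assumption: better means strictly more filled fields.
    boolean isBetterThan(SynsetRecordSketch other) {
        return filledCount() > other.filledCount();
    }

    @Override
    public boolean equals(Object otherObject) {
        if (this == otherObject) {
            return true;
        }
        if (!(otherObject instanceof SynsetRecordSketch)) {
            return false;
        }
        SynsetRecordSketch o = (SynsetRecordSketch) otherObject;
        return id == o.id
            && Objects.equals(category, o.category)
            && Objects.equals(concept, o.concept)
            && Objects.equals(example, o.example)
            && Objects.equals(synset, o.synset);
    }

    @Override
    public int hashCode() {
        return Objects.hash(id, category, concept, example, synset);
    }
}
```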
57
Data structure - DefineRecord
58
“define-end” language
Example (description of a book about cricket):
define book sixer
  length :: 700
  topic :: cricket
  define chapter 1
    length :: 300
    topic :: batting
  end
  define chapter 2
    length :: 400
    topic :: bowling :: scientific
59
Data structure - DefineRecord
Example (etymology format):
define etymformat verb
  इट् :: dropdown :: word :: सेट्, अनिट्, वेट्
  पद :: dropdown :: word :: आत्मनेपद, परस्मैपद, उभयपद
  कर्मवत्त्व :: dropdown :: synset :: सकर्मक, अकर्मक
  कृत् रूप :: textfield :: word
  उपसर्ग :: dropdown :: word :: प्र, परा, अप, सम्, अनु, अव, निस्, निर्, दुस्, दुर्, वि, अधि, अपि, परि, नि, आ, प्रति, उप, सु, उत्, अभि, अति
  साधित धातु :: dropdown :: word :: णिच्, सन्, यङ्, यङ्लुक्, नामधातु
end
60
Data structure - DefineRecord
61
Data structure - DefineRecord
Example (etymology data for synset ID 1476):
define etymology 1476
  कर्मवत्त्व :: अकर्मक
  finished :: true
  define word क्षि
    इट् :: सेट्
    पद :: परस्मैपद
    स्वर :: -
    कृत् रूप :: क्षयः
    उपसर्ग :: अप
    साधित धातु :: -
  end
62
Data structure - DefineRecord
Data structure to hold parametric and nested data
Functions:
  addField(objectToAdd) - Adds a parameter or a nested instance of DefineRecord
  toString() - Exports a record in the define-end language
  getParameterField(parameterName) - Returns a specific parameter field
  …
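A nested record type with these three functions, plus a small parser for the define-end language, could be sketched as below. The method names `addField`, `toString` and `getParameterField` come from the slide; everything else (the class name `DefineRecordSketch`, `addParameter`, `parse`, storing parameters as two-element arrays, and the rule that every `define` is closed by its own `end`) is an assumption made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of a record holding parameters and nested records.
class DefineRecordSketch {
    final String type;  // e.g. "book" or "chapter"
    final String name;  // e.g. "sixer" or "1"
    // Insertion-ordered fields: String[]{name, value} or a nested record.
    private final List<Object> fields = new ArrayList<>();

    DefineRecordSketch(String type, String name) {
        this.type = type;
        this.name = name;
    }

    void addField(Object objectToAdd) {
        if (!(objectToAdd instanceof String[]) && !(objectToAdd instanceof DefineRecordSketch)) {
            throw new IllegalArgumentException("expected a parameter or a nested record");
        }
        fields.add(objectToAdd);
    }

    void addParameter(String parameterName, String value) {
        addField(new String[] {parameterName, value});
    }

    // Returns the value of the first matching parameter, or null if absent.
    String getParameterField(String parameterName) {
        for (Object f : fields) {
            if (f instanceof String[] && ((String[]) f)[0].equals(parameterName)) {
                return ((String[]) f)[1];
            }
        }
        return null;
    }

    // Exports the record (and nested records) in the define-end language.
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder("define " + type + " " + name + "\n");
        for (Object f : fields) {
            if (f instanceof String[]) {
                sb.append(((String[]) f)[0]).append(" :: ").append(((String[]) f)[1]).append("\n");
            } else {
                sb.append(f);  // nested record text already ends with "end\n"
            }
        }
        return sb.append("end\n").toString();
    }

    // Parses lines into one top-level record; assumes balanced define/end pairs.
    static DefineRecordSketch parse(List<String> lines) {
        Deque<DefineRecordSketch> open = new ArrayDeque<>();
        DefineRecordSketch root = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith("define ")) {
                String[] parts = line.split("\\s+", 3);
                DefineRecordSketch rec = new DefineRecordSketch(parts[1], parts.length > 2 ? parts[2] : "");
                if (!open.isEmpty()) {
                    open.peek().addField(rec);
                }
                open.push(rec);
            } else if (line.equals("end")) {
                if (open.isEmpty()) {
                    throw new IllegalArgumentException("'end' without 'define'");
                }
                root = open.pop();
            } else {
                int sep = line.indexOf("::");
                if (sep < 0 || open.isEmpty()) {
                    throw new IllegalArgumentException("unexpected line: " + line);
                }
                open.peek().addParameter(line.substring(0, sep).trim(), line.substring(sep + 2).trim());
            }
        }
        if (!open.isEmpty()) {
            throw new IllegalArgumentException("missing 'end'");
        }
        return root;
    }
}
```

The parser splits each parameter line at the first `::` only, so a value that itself contains `::` is kept intact and survives a parse and export round trip.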
63
Data Operations
64
Data Operations - File I/O
Unicode text data manipulation - UTF-8 format
Classes for file parsing/writing:
  RecordWriter
  RecordReader
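Reading and writing UTF-8 text in Java is safest when the charset is named explicitly rather than left to the platform default. The sketch below shows this with the standard `java.nio.file.Files` API; the class `Utf8FileSketch` and its methods are hypothetical and are not the tool's RecordReader or RecordWriter.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical sketch of explicit UTF-8 file I/O for record text.
class Utf8FileSketch {
    static void writeLines(Path file, List<String> lines) {
        try {
            Files.write(file, lines, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static List<String> readLines(Path file) {
        try {
            return Files.readAllLines(file, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes the lines to a temporary file, reads them back, then deletes the file.
    static List<String> roundTrip(List<String> lines) {
        try {
            Path tmp = Files.createTempFile("synsets", ".txt");
            try {
                writeLines(tmp, lines);
                return readLines(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Devanagari text such as the synset fields shown earlier should come back unchanged from `roundTrip`, because both directions use the same encoding.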
65
Data Operations - File I/O
RecordReader
  SynsetRecord parser
  DefineRecord parser
  String converters
RecordWriter
66
Data Operations - RecordModel Interface
Model that defines how to work with a new data structure
Handles parsing, writing, querying and ID retrieval
Models written as Classes:
  SynsetRecordModel
  EnglishSynsetRecordModel
  AbstractDefineRecordModel
67
Data Operations - RecordModel Interface
int getRecordId(E record): Returns the record ID of a record
boolean isBetterThan(E a, E b): Returns whether record a weighs better than record b
boolean isFinished(E a): Returns whether a record can be set as completed
E mergeRecords(E a, E b): Merges the data of two separate records into one
boolean searchWord(String word, E a): Performs a query (defined in String word) on a record
E parseRecord(RecordReader fileHandle): Parses a record from a file
void writeRecord(RecordWriter fileHandle, E a): Writes a record into a file
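As a generic Java interface, this contract might be sketched as follows. The method names follow the slide; the rest is assumed for illustration. In particular, `BufferedReader` and `Writer` stand in for the tool's own RecordReader and RecordWriter classes, whose APIs are not shown, and `WordListRecord` and `WordListModel` are hypothetical examples of one record type and its model.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Writer;
import java.util.Arrays;

// Hypothetical sketch of the model contract for a record type E.
interface RecordModelSketch<E> {
    int getRecordId(E record);
    boolean isBetterThan(E a, E b);
    boolean isFinished(E a);
    E mergeRecords(E a, E b);
    boolean searchWord(String word, E a);
    E parseRecord(BufferedReader fileHandle) throws IOException;
    void writeRecord(Writer fileHandle, E a) throws IOException;
}

// Hypothetical record: an ID plus comma-separated words.
class WordListRecord {
    final int id;
    final String words;

    WordListRecord(int id, String words) {
        this.id = id;
        this.words = words;
    }
}

class WordListModel implements RecordModelSketch<WordListRecord> {
    // Parses one "ID :: w1,w2" line.
    static WordListRecord fromLine(String line) {
        String[] parts = line.split("\\s*::\\s*", 2);
        return new WordListRecord(Integer.parseInt(parts[0].trim()), parts.length > 1 ? parts[1].trim() : "");
    }

    public int getRecordId(WordListRecord record) {
        return record.id;
    }

    // Assumption: the record with more text counts as better.
    public boolean isBetterThan(WordListRecord a, WordListRecord b) {
        return a.words.length() > b.words.length();
    }

    public boolean isFinished(WordListRecord a) {
        return !a.words.isEmpty();
    }

    // Assumption: merging keeps the better of the two records.
    public WordListRecord mergeRecords(WordListRecord a, WordListRecord b) {
        return isBetterThan(b, a) ? b : a;
    }

    public boolean searchWord(String word, WordListRecord a) {
        return Arrays.asList(a.words.split(",")).contains(word);
    }

    // Returns null at end of input.
    public WordListRecord parseRecord(BufferedReader fileHandle) throws IOException {
        String line = fileHandle.readLine();
        return line == null ? null : fromLine(line);
    }

    public void writeRecord(Writer fileHandle, WordListRecord a) throws IOException {
        fileHandle.write(a.id + " :: " + a.words + "\n");
    }
}
```

Keeping parsing, comparison and search behind one interface lets the same browsing and searching code work for any record type that supplies a model.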
68
Data Operations - RecordOperator Class
Operator to provide functionality to work with records of data
Load, Browse, Update, Search, Synchronize and Write
Two kinds at the GUI level:
  Parent Operator
  Linker Operator
69
Data Operations - RecordOperator Class
Functions for each data type (depending on the corresponding RecordModel):
  Constructors for ParentOperator and LinkerOperator
  getRecord() - Obtains the current record
  setCurrentId() and getCurrentId() - Set and obtain the ID to work with
  getFirstId(), getPreviousId(), getNextId() and getLastId() - Browse through records
  isFinished() and isAllFinished() - Obtain the completion status of records
  searchRecords() and advancedSearch() - Perform search operations on the records
  …
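The browsing functions map naturally onto a sorted map keyed by record ID. The sketch below borrows the slide's method names for browsing, but the class `RecordBrowserSketch`, its `add` method, and the choice of `TreeMap` are assumptions made for illustration, not the tool's implementation.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch of ID-based browsing over records of type E.
class RecordBrowserSketch<E> {
    // Records sorted by ID so that previous/next lookups are direct.
    private final NavigableMap<Integer, E> records = new TreeMap<>();
    private Integer currentId;

    void add(int id, E record) {
        records.put(id, record);
        if (currentId == null) {
            currentId = id;
        }
    }

    E getRecord() {
        return currentId == null ? null : records.get(currentId);
    }

    void setCurrentId(int id) {
        if (!records.containsKey(id)) {
            throw new IllegalArgumentException("unknown record ID: " + id);
        }
        currentId = id;
    }

    Integer getCurrentId() {
        return currentId;
    }

    Integer getFirstId() {
        return records.isEmpty() ? null : records.firstKey();
    }

    Integer getLastId() {
        return records.isEmpty() ? null : records.lastKey();
    }

    // Returns null when the current record is already the last one.
    Integer getNextId() {
        return currentId == null ? null : records.higherKey(currentId);
    }

    // Returns null when the current record is already the first one.
    Integer getPreviousId() {
        return currentId == null ? null : records.lowerKey(currentId);
    }
}
```

IDs need not be contiguous: `higherKey` and `lowerKey` skip over gaps, which suits synset IDs such as 121, 1476 and 4066.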
70
API Overview
GUI defines one ParentOperator (e.g. source synsets)
GUI defines many LinkerOperators (e.g. target synsets, link data, etc.)
Models attached to the operators
Data repositories are defined
GUI browses, retrieves and manipulates data using operators.
71
Version history
72
Future work
Tool to generate etymology format
GUI functionality to display synsets from multiple languages
Advanced commenting based on reviews and completion
75
Thank you