Tools and Interfaces for Wordnet construction, linking and maintenance


1 Tools and Interfaces for Wordnet construction, linking and maintenance
Abhishek G. Nanda Under the guidance of: Prof. Pushpak Bhattacharyya

2 Wordnet
Language - Means of communication using encoded information
Words - Units used for communicating information
Semantics - Meanings of words and word forms

3 Wordnet
What is Wordnet?
Dictionary - List of alphabetically arranged words with meanings
Thesaurus - List of alphabetically arranged concepts with word forms

4 Wordnet
Lexical database of words
Arranged based on concepts
Grouped based on synonymy
Synonymy - Property of different words sharing the same meaning in a context. Eg. buy and purchase
Polysemy - Property of a word having different meanings in different contexts. Eg. bank as financial institution and as river bank

5 Wordnet - Lexical Matrix
Word Meanings | Word Forms
              | F1            | F2          | F3          | …                 | Fn
M1            | E1,1 (depend) | E1,2 (bank) | E1,3 (rely) |                   |
M2            |               | E2,2        |             | E2,… (embankment) |
M3            |               | E3,2        | E3,3        |                   |
…
Mm            |               |             |             |                   | Em,n
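A minimal Java sketch of how the lexical matrix could be held in memory, assuming that column F2 is the form "bank" (so E2,2 is another sense of "bank"). The class and method names (LexicalMatrix, synonyms, senses) are hypothetical and are not taken from the slides. Reading along a row gives synonymy; reading down a column gives polysemy.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of a lexical matrix: rows are meanings, columns are word forms. */
class LexicalMatrix {
    // meaning ID -> word forms that express it (one row of the matrix)
    private final Map<String, Set<String>> formsByMeaning = new LinkedHashMap<>();

    /** Marks the cell (meaning, form) as filled. */
    void add(String meaning, String form) {
        formsByMeaning.computeIfAbsent(meaning, k -> new LinkedHashSet<>()).add(form);
    }

    /** Synonyms: all forms sharing the given meaning (reading along a row). */
    Set<String> synonyms(String meaning) {
        return formsByMeaning.getOrDefault(meaning, Collections.emptySet());
    }

    /** Senses: all meanings the given form can express (reading down a column). */
    Set<String> senses(String form) {
        Set<String> result = new LinkedHashSet<>();
        for (Map.Entry<String, Set<String>> row : formsByMeaning.entrySet()) {
            if (row.getValue().contains(form)) {
                result.add(row.getKey());
            }
        }
        return result;
    }

    /** A form is polysemous when its column has more than one filled cell. */
    boolean isPolysemous(String form) {
        return senses(form).size() > 1;
    }
}

class LexicalMatrixDemo {
    /** Fills the cells named on the slide; the M2/"bank" cell is an assumption. */
    static LexicalMatrix sample() {
        LexicalMatrix m = new LexicalMatrix();
        m.add("M1", "depend");
        m.add("M1", "bank");
        m.add("M1", "rely");
        m.add("M2", "bank");
        m.add("M2", "embankment");
        return m;
    }

    public static void main(String[] args) {
        LexicalMatrix m = sample();
        System.out.println(m.synonyms("M1"));
        System.out.println(m.senses("bank"));
    }
}
```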

6 Wordnet - Relations
Semantic Relations
  Hypernymy and Hyponymy
  Meronymy and Holonymy
  Entailment
  Troponymy
  Coordinate terms
Lexical Relations
  Antonymy
  Gradation

7 Wordnet - Relations
Hypernymy and Hyponymy
  is a kind of
  leaf is the hypernym of neem leaf
  neem leaf is the hyponym of leaf
Meronymy and Holonymy
  part-whole
  root is the meronym of tree
  tree is the holonym of root

8 Wordnet - Relations
Entailment
  implication
  snore entails sleep
Troponymy
  manner elaboration
  roar is the troponym of speak
Coordinate terms
  Common hypernym
  wolf and dog are coordinate terms
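The hypernymy and coordinate-term relations above can be sketched as child-to-parent links. This is only an illustration: HypernymGraph is a hypothetical class, each word is given a single hypernym for simplicity, and "canine" is an assumed common hypernym for wolf and dog that the slides do not name.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: hypernymy stored as hyponym -> hypernym links. */
class HypernymGraph {
    private final Map<String, String> hypernymOf = new HashMap<>();

    /** Records that `hypernym` is the "is a kind of" parent of `hyponym`. */
    void addHypernym(String hyponym, String hypernym) {
        hypernymOf.put(hyponym, hypernym);
    }

    /** Returns the direct hypernym, or null when none is recorded. */
    String hypernym(String word) {
        return hypernymOf.get(word);
    }

    /** Two distinct words are coordinate terms when they share a direct hypernym. */
    boolean areCoordinateTerms(String a, String b) {
        String parentA = hypernymOf.get(a);
        return !a.equals(b) && parentA != null && parentA.equals(hypernymOf.get(b));
    }
}

class HypernymGraphDemo {
    static HypernymGraph sample() {
        HypernymGraph g = new HypernymGraph();
        g.addHypernym("neem leaf", "leaf"); // example from the slides
        g.addHypernym("wolf", "canine");    // "canine" is an assumed hypernym
        g.addHypernym("dog", "canine");
        return g;
    }

    public static void main(String[] args) {
        HypernymGraph g = sample();
        System.out.println(g.hypernym("neem leaf"));
        System.out.println(g.areCoordinateTerms("wolf", "dog"));
    }
}
```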

9 Wordnet - Relations
Antonymy
  opposites
  fat is the antonym of thin
Gradation
  Intermediate concepts in antonymy
  morning -> noon -> evening

10 Wordnet - Wordnets
PWN - Princeton WordNet for English language
EuroWordNet - Wordnet for European languages
HWN - Hindi Wordnet for Hindi language

11 Hindi Wordnet
Relations borrowed - synonymy, hypernymy, holonymy, troponymy, entailment, etc.
Defines 8 part-whole relationships
Defines 3 types of antonymy relations
  Gradable antonym (गर्म-ठंडा)
  Complementary antonym (जीवित-मृत)
  Converse antonym (लेना-देना)

12 Hindi Wordnet
Gradation
  Intermediate terms
  Pre-Intermediate terms
  Post-Intermediate terms
  Eg. सूखा - शुष्क - नम - तर - गीला
10 domains of interpretation. Eg. State, Size, Gender, etc.

13 Hindi Wordnet - Verbs
Simple Verb - One root. Eg. खाना
Compound Verb - Made up of another POS. Eg. मीठा लगना
Combination Verb - Made of two related verbs. Eg. पढ़ना-लिखना
Onomatopoeic Verb - Eg. खटखटाना from खटखट
Conjunct Verb - Hidden sense of action. Eg. ले जाना

14 Hindi Wordnet - Verbs
Causative verbs
  First causative verb - Eg. सुलाना (to make somebody sleep)
  Second causative verb - Eg. सुलवाना (to make somebody sleep through the effort of a third person)

15 Hindi Wordnet - Creation
Principles for Wordnet creation
  Minimality - Minimal set. Eg. {घर, कमरा, कक्ष}
  Coverage - Coverage of words. Eg. {घर, कमरा, कक्ष}
  Replaceability - Mutual replaceability in a context. Eg. अमेरिका में दो साल बिताने के बाद श्याम स्वदेश/घर लौटा

16 Sanskrit Wordnet
Concept-based Multilingual dictionary
Need
  Loss of synonymy when moving across languages. Eg. dark and evil are synonymous in English but their counterparts अंधेरा and दुष्ट are not.
  Number of lexicographers required - O(n^2) for n languages linked pairwise

17 Sanskrit Wordnet - Concept based Multilingual dictionary
Concepts | L1 (English) | L2 (Hindi) | L3 (Sanskrit)
Concept ID: Concept description | (W1, W2, W3, ..) | (W4, W5, W6, ..) | (W7, W8, W9, ..)
4066: any of various long-tailed primates (excluding the prosimians) | (monkey) | (बंदर, बन्दर, बानर, वानर, कीश, कपि, मर्कट, ..) | (वानरः, कपिः, प्लवङ्गः, प्लवगः, शाखामृगः, वलीमुखः, मर्कटः, ..)
2186: a typical star that is the source of light and heat for the planets in the solar system | (sun) | (सूर्य, सूरज, भानु, दिवाकर, भास्कर, प्रभाकर, दिनकर, रवि, ..) | (सूर्यः, सविता, आदित्यः, मित्रः, अरुणः, भानुः, पूषा, अर्कः, ..)
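The table above can be read as a map from concept ID to one synset per language, so each language is linked once to the shared concept rather than to every other language. The following Java sketch illustrates that idea under stated assumptions: ConceptDictionary and its methods are hypothetical names, and only a few of the words from the table are loaded.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a concept-based multilingual dictionary. */
class ConceptDictionary {
    // concept ID -> (language name -> synset words for that concept)
    private final Map<Integer, Map<String, List<String>>> entries = new LinkedHashMap<>();

    /** Attaches a language's synset to a concept ID. */
    void link(int conceptId, String language, String... words) {
        entries.computeIfAbsent(conceptId, k -> new LinkedHashMap<>())
               .put(language, Arrays.asList(words));
    }

    /** Returns the synset for a concept in one language, or an empty list. */
    List<String> words(int conceptId, String language) {
        Map<String, List<String>> byLanguage = entries.get(conceptId);
        if (byLanguage == null) {
            return Collections.emptyList();
        }
        return byLanguage.getOrDefault(language, Collections.emptyList());
    }

    /** Translates by pivoting through every concept whose source synset contains the word. */
    List<String> translate(String word, String from, String to) {
        List<String> result = new ArrayList<>();
        for (Integer id : entries.keySet()) {
            if (words(id, from).contains(word)) {
                result.addAll(words(id, to));
            }
        }
        return result;
    }
}

class ConceptDictionaryDemo {
    static ConceptDictionary sample() {
        ConceptDictionary d = new ConceptDictionary();
        d.link(4066, "English", "monkey");
        d.link(4066, "Hindi", "बंदर", "वानर", "कपि");
        d.link(4066, "Sanskrit", "वानरः", "कपिः");
        d.link(2186, "English", "sun");
        d.link(2186, "Hindi", "सूर्य", "सूरज");
        d.link(2186, "Sanskrit", "सूर्यः", "सविता");
        return d;
    }

    public static void main(String[] args) {
        System.out.println(sample().translate("monkey", "English", "Sanskrit"));
    }
}
```

With this layout, adding a language means adding one synset per concept, which is the motivation the previous slide gives for the concept-based design.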

18 Sanskrit Wordnet - Challenges
Observed during construction of Marathi Wordnet:
  Single word to synthetic expression. Eg. bankrupt -> दिवाला निकालना
  Culture specific concepts. Eg. girlfriend. Requires transliteration such as महिलामित्र
  Splitting of concepts. Eg. फ़ीका (tasteless) in Hindi -> अगोड (less sweet), अळणी (less salty), मिळमिळत (less spicy) in Marathi

19 Sanskrit Wordnet - Challenges
Observed during Indo Wordnet workshop at Coimbatore, June 2009:
  Varied usage across regions and people. Eg. In Kashmiri, the Muslim community has separate words for drinking water and water, but the Hindu community uses one word.
  Single-word and multi-word expressions in the same language. Eg. In Nepali, मोह and मोह-माया both mean infatuation.

20 Sanskrit Wordnet - Sanskrit
Indo-Aryan language
Hinduism
Buddhism
Classical Sanskrit - Panini
Vedic Sanskrit - pre-Classical

21 Sanskrit Wordnet - Sanskrit Etymology
Etymology of Verbs
  गण - Ten classes based on how the stem is generated
  इट् - Three groups based on position of tense marker
  उपसर्ग - 22 prepositional particles that modify a root

22 Synset Marking
Grouping of synsets based on frequency of occurrence and usage in language
Universal concepts
  who and what
  honesty

23 SynsetMarker - Interface

24 SynsetMarker - Features
Display of synset fields
Browsing
Search
  Word
  ID
Marking - Universal, Common, Common in Hindi and Uncommon
Save/Exit
Shortcuts

25 SynsetMarker - API
records
  DefineRecord
  SynsetRecord
operations
  SynsetOperator
  RecordReader
  RecordWriter
gui
  Interface

26 SynsetMarker - Process
First round divided among 6 people
  31000 synsets marked
  Universal and Common clubbed synsets
  Common in Hindi synsets
  Uncommon synsets
Second round voting schema
  Common synsets

27 Core Synset Selection
Bharatiya Vyavahara Kosh
  English and 15 Indian languages
  2000 concepts with domains
  खेल (game), प्राणी (animal), फल (fruit)
Link synsets to words in Kosh
Polysemy
  अनन्नास as pineapple fruit
  अनन्नास as pineapple plant

28 DomainClassifier - Interface

29 DomainClassifier - Features
Display of synset fields
Browsing through records
Marking right synset for a word and a domain
Save/Export

30 DomainClassifier - API
records
  DefineRecord
  SynsetRecord
operations
  SynsetOperator
  RecordReader
  RecordWriter
gui
  Interface

31 DomainClassifier - Process
Groupings
  Single IDs
  Multiple IDs
  No IDs
Rounds of marking
  Common synsets
  Common in Hindi synsets
  Uncommon synsets

32 DomainClassifier - Process
End of process
  Core synsets
  Common synsets

33 Online SynsetMarker - Interface

34 Online SynsetMarker - Interface

35 Online SynsetMarker - API
Written in PHP
login.php - Interface to log in as a user or as an admin, or to register as a new user
process.php - Processes login/register data and directs the user accordingly
logout.php - Logs out a user
mainprocess.php - Processes data to display an unmarked synset
main.php - Displays a synset with buttons to mark it as Common or Uncommon
admin.php - Admin page with statistics on the number of marked synsets per user and the number of users based on synset marks
adminpassword.php - Password interface to log in as admin
adminuserprofile.php - Profile data of a particular user

36 Online SynsetMarker - Process
Threshold for dropping synset as Uncommon
  Had to be set to 1
Common synsets

37 Sanskrit Wordnet Interface
Interface for creation of Sanskrit Wordnet
Based on idea of Concept-based Multilingual dictionary

38 User Interface - Configure

39 User Interface - Main

40 User Interface - Panels
Help Panel: Buttons for Commenting, Synchronizing and References tool.
Search Panel: Search word or ID or perform advanced search. Font increase/decrease.
Synset Panels: Synset data fields and completion status.
Tool Panel: English synset, Link tool, Etymology tool.
Browse Panel: Browsing through records, saving and exiting.

41 User Interface - Features - Reference tool

42 User Interface - Features - Synchronize tool

43 User Interface - Features - Advanced Search

44 User Interface - Features - English synsets tool

45 User Interface - Features - Link tool

46 User Interface - Features - Etymology tool

47 User Interface - Features - Keyboard Shortcuts
Undo feature - Monitor keyboard actions and undo on Ctrl-Z
Saving feature - Monitor change in field values and save on Ctrl-S
Search - Ctrl-F for quick search access
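One way to wire the three shortcuts above in Swing is sketched below. A later slide lists KeyListener for key strokes; this sketch swaps in Swing key bindings (InputMap and ActionMap) instead, which is a different technique from the one the slides name. ShortcutBinder and the action names "undo", "save" and "search" are hypothetical.

```java
import java.awt.event.ActionEvent;
import java.awt.event.InputEvent;
import java.awt.event.KeyEvent;
import javax.swing.AbstractAction;
import javax.swing.ActionMap;
import javax.swing.InputMap;
import javax.swing.JComponent;
import javax.swing.KeyStroke;

/** Hypothetical helper that maps Ctrl-key shortcuts to handlers via Swing key bindings. */
class ShortcutBinder {
    /** Binds Ctrl+keyCode to the handler under the given action name. */
    static void bindCtrl(InputMap inputMap, ActionMap actionMap,
                         int keyCode, String name, Runnable handler) {
        KeyStroke stroke = KeyStroke.getKeyStroke(keyCode, InputEvent.CTRL_DOWN_MASK);
        inputMap.put(stroke, name);              // key stroke -> action name
        actionMap.put(name, new AbstractAction() {
            @Override
            public void actionPerformed(ActionEvent e) {
                handler.run();                   // action name -> behaviour
            }
        });
    }

    /** Installs Ctrl-Z, Ctrl-S and Ctrl-F on a root component such as the main panel. */
    static void bindEditorShortcuts(JComponent root, Runnable undo, Runnable save, Runnable search) {
        InputMap in = root.getInputMap(JComponent.WHEN_ANCESTOR_OF_FOCUSED_COMPONENT);
        ActionMap actions = root.getActionMap();
        bindCtrl(in, actions, KeyEvent.VK_Z, "undo", undo);
        bindCtrl(in, actions, KeyEvent.VK_S, "save", save);
        bindCtrl(in, actions, KeyEvent.VK_F, "search", search);
    }
}
```

Calling bindEditorShortcuts on the top-level panel makes the shortcuts fire whenever any field inside that panel has focus.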

48 Interface API
Problems and Requirements
  Huge volumes of data (eg. 30,000 synsets)
  Links between different data
  Efficient and user-friendly GUI
  Sufficient querying
  Grouping
  Review separation

49 Interface API

50 Graphical User Interface
// Lazily create the Save button on first access.
private JButton saveButton = null;

public JButton getSaveButton() {
    if (saveButton == null) {
        saveButton = new JButton();
    }
    return saveButton;
}

51 Graphical User Interface

52 Graphical User Interface - Panels

53 Graphical User Interface
Panels
  Hierarchical structure
Components (within Panels)
  Classes JButton, JTextField, JCheckBox, etc.
Listeners
  ActionListener - actions performed by user
  KeyListener - key strokes (undo, search) and shortcuts

54 Synset
Synset ID: A unique number identifying a synset
Category: POS category of the words
Concept: The part of the gloss that gives a brief summary of what the synset represents
Example: One or more examples of the words in the synset being used in sentences
Synset: The set of synonymous words comprised in the synset

55 Synset - DSF format
ID :: 121
CATEGORY :: NOUN
CONCEPT :: अपने से छोटों के प्रति हृदय में उठनेवाला प्रेम
EXAMPLE :: “चाचा नेहरू को बच्चों से बहुत ही स्नेह था”
SYNSET :: स्नेह,नेह,लगाव,ममता
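A record in the DSF layout above is a sequence of "KEY :: value" lines, so it can be read with a few lines of Java. This is a minimal sketch, assuming one field per line and a comma-separated SYNSET value; DsfParser is a hypothetical name and is not the RecordReader class the slides mention.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a reader for one "KEY :: value" synset record. */
class DsfParser {
    /** Parses each "KEY :: value" line into an ordered map; other lines are skipped. */
    static Map<String, String> parse(String record) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : record.split("\\R")) {
            int sep = line.indexOf("::");
            if (sep < 0) {
                continue; // not a field line
            }
            fields.put(line.substring(0, sep).trim(), line.substring(sep + 2).trim());
        }
        return fields;
    }

    /** Splits the comma-separated SYNSET field into individual words. */
    static List<String> synsetWords(Map<String, String> fields) {
        List<String> words = new ArrayList<>();
        for (String word : fields.getOrDefault("SYNSET", "").split(",")) {
            if (!word.trim().isEmpty()) {
                words.add(word.trim());
            }
        }
        return words;
    }
}

class DsfParserDemo {
    // The record shown on the slide.
    static final String SAMPLE =
            "ID :: 121\n"
          + "CATEGORY :: NOUN\n"
          + "CONCEPT :: अपने से छोटों के प्रति हृदय में उठनेवाला प्रेम\n"
          + "EXAMPLE :: “चाचा नेहरू को बच्चों से बहुत ही स्नेह था”\n"
          + "SYNSET :: स्नेह,नेह,लगाव,ममता\n";

    public static void main(String[] args) {
        Map<String, String> fields = DsfParser.parse(SAMPLE);
        System.out.println(fields.get("ID") + " " + DsfParser.synsetWords(fields));
    }
}
```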

56 Data structure - SynsetRecord
Class SynsetRecord
Strings to hold field values
Functions:
  equals(otherObject)
  isBetterThan(otherObject)
  isComplete()
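The slide names the functions but not their rules, so the following Java sketch fills them in with assumptions that are labeled as such: a record is treated as complete when all five fields are non-blank, and one record is "better" than another when it has more non-blank fields. Both criteria are illustrative guesses, not the tool's documented behaviour.

```java
import java.util.Objects;

/** Hypothetical sketch of a synset record with string fields. */
class SynsetRecord {
    final String id;
    final String category;
    final String concept;
    final String example;
    final String synset;

    SynsetRecord(String id, String category, String concept, String example, String synset) {
        this.id = id;
        this.category = category;
        this.concept = concept;
        this.example = example;
        this.synset = synset;
    }

    private static boolean filled(String value) {
        return value != null && !value.trim().isEmpty();
    }

    /** Counts the fields that hold a non-blank value. */
    int filledCount() {
        int count = 0;
        for (String value : new String[] {id, category, concept, example, synset}) {
            if (filled(value)) {
                count++;
            }
        }
        return count;
    }

    /** Assumption: complete means every field is filled. */
    boolean isComplete() {
        return filledCount() == 5;
    }

    /** Assumption: better means strictly more filled fields. */
    boolean isBetterThan(SynsetRecord other) {
        return filledCount() > other.filledCount();
    }

    @Override
    public boolean equals(Object otherObject) {
        if (!(otherObject instanceof SynsetRecord)) {
            return false;
        }
        SynsetRecord o = (SynsetRecord) otherObject;
        return Objects.equals(id, o.id) && Objects.equals(category, o.category)
                && Objects.equals(concept, o.concept) && Objects.equals(example, o.example)
                && Objects.equals(synset, o.synset);
    }

    @Override
    public int hashCode() {
        return Objects.hash(id, category, concept, example, synset);
    }
}
```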

57 Data structure - DefineRecord

58 “define-end” language
Example (description of a book about cricket):
define book sixer
  length :: 700
  topic :: cricket
  define chapter 1
    length :: 300
    topic :: batting
  end
  define chapter 2
    length :: 400
    topic :: bowling :: scientific
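The nesting in the example above suggests a small recursive reader: each "define" opens a block, each "end" closes the innermost one, and "name :: value" lines are parameters. The Java sketch below is one possible reading of that grammar, not the tool's own parser. DefineNode and DefineParser are hypothetical names, everything after the first "::" is kept as the value, and a missing final "end" is tolerated.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical node for one define...end block. */
class DefineNode {
    final String header;                                    // text after "define"
    final Map<String, String> params = new LinkedHashMap<>();
    final List<DefineNode> children = new ArrayList<>();

    DefineNode(String header) {
        this.header = header;
    }
}

/** Hypothetical recursive-descent reader for the define-end language. */
class DefineParser {
    /** Parses the first define block found in the text. */
    static DefineNode parse(String text) {
        Iterator<String> lines = Arrays.asList(text.split("\\R")).iterator();
        while (lines.hasNext()) {
            String line = lines.next().trim();
            if (line.startsWith("define ")) {
                return parseBody(line.substring(7).trim(), lines);
            }
        }
        throw new IllegalArgumentException("no define block found");
    }

    /** Reads lines until the matching "end", recursing on nested defines. */
    private static DefineNode parseBody(String header, Iterator<String> lines) {
        DefineNode node = new DefineNode(header);
        while (lines.hasNext()) {
            String line = lines.next().trim();
            if (line.equals("end")) {
                return node;
            }
            if (line.startsWith("define ")) {
                node.children.add(parseBody(line.substring(7).trim(), lines));
            } else if (line.contains("::")) {
                int sep = line.indexOf("::");
                node.params.put(line.substring(0, sep).trim(), line.substring(sep + 2).trim());
            }
        }
        return node; // input ended without a closing "end"
    }
}
```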

59 Data structure - DefineRecord
Example (etymology format):
define etymformat verb
  इट् :: dropdown :: word :: सेट्, अनिट्, वेट्
  पद :: dropdown :: word :: आत्मनेपद, परस्मैपद, उभयपद
  कर्मवत्त्व :: dropdown :: synset :: सकर्मक, अकर्मक
  कृत् रूप :: textfield :: word
  उपसर्ग :: dropdown :: word :: प्र, परा, अप, सम्, अनु, अव, निस्, निर्, दुस्, दुर्, वि, अधि, अपि, परि, नि, आ, प्रति, उप, सु, उत्, अभि, अति
  साधित धातु :: dropdown :: word :: णिच्, सन्, यङ्, यङ्लुक्, नामधातु
end

60 Data structure - DefineRecord

61 Data structure - DefineRecord
Example (etymology data for synset ID 1476):
define etymology 1476
  कर्मवत्त्व :: अकर्मक
  finished :: true
  define word क्षि
    इट् :: सेट्
    पद :: परस्मैपद
    स्वर :: -
    कृत् रूप :: क्षयः
    उपसर्ग :: अप
    साधित धातु :: -
  end

62 Data structure - DefineRecord
Data structure to hold parametric and nested data
Functions:
  addField(objectToAdd) - Function to add a parameter or a nested instance of DefineRecord
  toString() - Function to export a record in the define-end language
  getParameterField(parameterName) - Function to return a specific parameter field
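The three functions listed above can be sketched in Java as follows. Only the method names come from the slide; the ParameterField helper, the constructor arguments and the exact output layout of toString() are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical "name :: value" parameter. */
class ParameterField {
    final String name;
    final String value;

    ParameterField(String name, String value) {
        this.name = name;
        this.value = value;
    }

    @Override
    public String toString() {
        return name + " :: " + value;
    }
}

/** Hypothetical sketch of a record holding parameters and nested records. */
class DefineRecord {
    private final String type;   // e.g. "book"
    private final String label;  // e.g. "sixer"
    private final List<Object> fields = new ArrayList<>();

    DefineRecord(String type, String label) {
        this.type = type;
        this.label = label;
    }

    /** Adds either a ParameterField or a nested DefineRecord, preserving order. */
    void addField(Object objectToAdd) {
        if (!(objectToAdd instanceof ParameterField) && !(objectToAdd instanceof DefineRecord)) {
            throw new IllegalArgumentException("expected a ParameterField or a DefineRecord");
        }
        fields.add(objectToAdd);
    }

    /** Returns the first top-level parameter with the given name, or null. */
    ParameterField getParameterField(String parameterName) {
        for (Object field : fields) {
            if (field instanceof ParameterField
                    && ((ParameterField) field).name.equals(parameterName)) {
                return (ParameterField) field;
            }
        }
        return null;
    }

    /** Exports the record, including nested records, in define-end form. */
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder("define " + type + " " + label + "\n");
        for (Object field : fields) {
            sb.append(field).append("\n");
        }
        return sb.append("end").toString();
    }
}
```

Storing parameters and nested records in one ordered list keeps the exported text in the same order the fields were added.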

63 Data Operations

64 Data Operations - File I/O
Unicode text data manipulation - UTF-8 format
Classes for file parsing/writing:
  RecordWriter
  RecordReader

65 Data Operations - File I/O
RecordReader
  SynsetRecord parser
  DefineRecord parser
  String converters
RecordWriter

66 Data Operations - RecordModel Interface
Model that provides the mechanism for working with a new data structure
Handles parsing, writing, querying and ID retrieval
Models written as Classes:
  SynsetRecordModel
  EnglishSynsetRecordModel
  AbstractDefineRecordModel

67 Data Operations - RecordModel Interface
int getRecordId(E record): Function to return the record ID of a record
boolean isBetterThan(E a, E b): Function to return whether one record weighs better than the other
boolean isFinished(E a): Function to return whether a record can be set as completed
E mergeRecords(E a, E b): Function to merge the data in two separate records into one
boolean searchWord(String word, E a): Function to perform a query (defined in String word) on a record
E parseRecord(RecordReader fileHandle): Function to parse a record from a file
void writeRecord(RecordWriter fileHandle, E a): Function to write a record into a file
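The signatures above translate directly into a generic Java interface. In the sketch below the interface follows the slide, while everything else is an assumption: java.io.BufferedReader and java.io.Writer stand in for the deck's RecordReader and RecordWriter, and WordListRecord, WordListModel and the tab-separated line format are hypothetical, chosen only to show one concrete model.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Interface shape taken from the slide; I/O types are stand-ins. */
interface RecordModel<E> {
    int getRecordId(E record);
    boolean isBetterThan(E a, E b);
    boolean isFinished(E a);
    E mergeRecords(E a, E b);
    boolean searchWord(String word, E a);
    E parseRecord(BufferedReader fileHandle) throws IOException;
    void writeRecord(Writer fileHandle, E a) throws IOException;
}

/** Hypothetical record: an ID plus a list of words. */
class WordListRecord {
    final int id;
    final List<String> words;

    WordListRecord(int id, List<String> words) {
        this.id = id;
        this.words = words;
    }
}

/** Hypothetical model; each rule below is an illustrative assumption. */
class WordListModel implements RecordModel<WordListRecord> {
    public int getRecordId(WordListRecord record) { return record.id; }

    public boolean isBetterThan(WordListRecord a, WordListRecord b) {
        return a.words.size() > b.words.size();   // more words weighs better
    }

    public boolean isFinished(WordListRecord a) { return !a.words.isEmpty(); }

    public WordListRecord mergeRecords(WordListRecord a, WordListRecord b) {
        Set<String> union = new LinkedHashSet<>(a.words);
        union.addAll(b.words);                    // duplicates collapse
        return new WordListRecord(a.id, new ArrayList<>(union));
    }

    public boolean searchWord(String word, WordListRecord a) { return a.words.contains(word); }

    /** Reads one "id<TAB>w1,w2,..." line; returns null at end of input. */
    public WordListRecord parseRecord(BufferedReader fileHandle) throws IOException {
        String line = fileHandle.readLine();
        if (line == null) {
            return null;
        }
        String[] parts = line.split("\t", 2);
        List<String> words = new ArrayList<>();
        if (parts.length > 1 && !parts[1].isEmpty()) {
            words.addAll(Arrays.asList(parts[1].split(",")));
        }
        return new WordListRecord(Integer.parseInt(parts[0].trim()), words);
    }

    public void writeRecord(Writer fileHandle, WordListRecord a) throws IOException {
        fileHandle.write(a.id + "\t" + String.join(",", a.words) + "\n");
    }
}
```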

68 Data Operations - RecordOperator Class
Operator to provide functionality to work with records of data
Load, Browse, Update, Search, Synchronize and Write
Two kinds at the GUI level:
  Parent Operator
  Linker Operator

69 Data Operations - RecordOperator Class
Functions for each data type (depending on the corresponding RecordModel):
  Constructors for ParentOperator and LinkerOperator
  getRecord() - Function to obtain the current record
  setCurrentId() and getCurrentId() - Functions to set and obtain the ID to work with
  getFirstId(), getPreviousId(), getNextId() and getLastId() - Functions to browse through records
  isFinished() and isAllFinished() - Functions to obtain completion status of records
  searchRecords() and advancedSearch() - Functions to perform search operations on the records
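The browsing functions above map naturally onto a sorted map keyed by record ID. The Java sketch below covers only that browsing subset. RecordBrowser is a hypothetical stand-in for the operator class, and the choice to stay on the current ID at either end of the range is an assumption.

```java
import java.util.TreeMap;

/** Hypothetical sketch of ID-based browsing over records kept in ID order. */
class RecordBrowser<E> {
    private final TreeMap<Integer, E> records = new TreeMap<>();
    private int currentId = -1; // no record selected yet

    void put(int id, E record) {
        records.put(id, record);
    }

    void setCurrentId(int id) {
        if (!records.containsKey(id)) {
            throw new IllegalArgumentException("unknown record ID " + id);
        }
        currentId = id;
    }

    int getCurrentId() {
        return currentId;
    }

    /** Returns the current record, or null when nothing is selected. */
    E getRecord() {
        return records.get(currentId);
    }

    int getFirstId() {
        return records.firstKey();
    }

    int getLastId() {
        return records.lastKey();
    }

    /** Next higher ID, or the current ID when already at the last record. */
    int getNextId() {
        Integer next = records.higherKey(currentId);
        return next != null ? next : currentId;
    }

    /** Next lower ID, or the current ID when already at the first record. */
    int getPreviousId() {
        Integer previous = records.lowerKey(currentId);
        return previous != null ? previous : currentId;
    }
}
```

A TreeMap keeps IDs sorted, so sparse ID ranges (for example 121, 1476, 4066) are browsed in order without scanning.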

70 API Overview
GUI defines one ParentOperator (eg. source synsets)
GUI defines many LinkerOperators (eg. target synsets, link data, etc.)
Models attached to the operators
Data repositories are defined
GUI browses, retrieves and manipulates data using operators.

71 Version history

72 Future work
Tool to generate etymology format
GUI functionality to display synsets from multiple languages
Advanced commenting based on reviews and completion


75 Thank you

