Dictionaries
See Patrick Hanks, “Lexicography”, chapter 3 of Mitkov, R. (ed.), The Oxford Handbook of Computational Linguistics, Oxford: OUP, 2004.

2/18 Dictionaries/Lexicons
Lexicography and the computer
Corpus-based lexicography
MRDs
Dictionaries for NLP
Thesauri: structured lexicons

3/18 Computational lexicography
Restructuring and exploiting human dictionaries for use by computer programs
Using computational techniques to compile (new) dictionaries
Focus on English (and other well-established languages)
Significantly different issues for other languages, especially
  – Alphabetization and arrangement
  – Compilation from scratch for previously unstudied languages

4/18 Human dictionaries
Traditional view of what a “dictionary” is
  – List of words, arranged (usually) alphabetically
  – Inclusion in dictionary lends authority, even proscriptively
  – Entry typically gives
      spelling... alternate spellings
      POS, morphology (if irregular)
      core definition (using defining vocab?)
      pronunciation (using own transcription)
      etymology
      examples of usage
        – as justification for inclusion
        – as illustration of use (esp. learner’s dictionaries)
  – Entry typically doesn’t give
      help with spelling
      morphology (if regular), especially derivational
      subcategorization information
      contrastive examples of use
      indications of possible metaphorical extensions to meaning

5/18 Human dictionaries
Historically
  – bilingual dictionaries for translators
  – monolingual dictionary as (pre/proscriptive) definition of language, often polemical
  – OED ( ) first dictionary on purely descriptive principle, relying on citations
Deficiencies and difficulties
  – What to include? (neologisms, slang)
  – Inclusion of names
  – Differentiating senses

6/18 Differentiating word senses
Dictionaries disagree widely
Probably no right answer
General principles (look for excuse to split vs look for reason to lump)
Keep related words of different POS together?
Etymology can be misleading (eg crane, pupil)
Metaphorical extension of original meaning: how far do you go? (eg rose, bar)
Purpose of dictionary may help decide, eg translation

7/18 Citations
Senses and uses identified by collecting examples of use
  – Sent in on “slips” by informants
  – Lexicographer’s job is to collate these
Criteria for a new word (or new meaning)
  – Number of citations
  – Source of citations
  – Veracity of use

8/18 Corpus-based dictionaries
A corpus is a collection of texts, usually collected with a specific purpose in mind
British National Corpus: an attempt to capture a synchronic picture of BrE of the late 1980s (100m words)
COBUILD “Bank of English”: a dynamic “monitor” corpus used to help lexicographers identify/define usage

9/18 Machine-readable dictionaries
“Machine” means “computer”
Dictionary stored in a format which makes it manipulable on a computer
Originally, derived from MR version of print dictionary (from type-setter’s tapes)
Now the other way round: data stored as a database from which hard copy can be printed (inter alia)

10/18 MRDs - advantages
Flexibility of access and presentation
  – Not bound to alphabetical listing
  – Information presented can be filtered
  – Can be searched as a database
  – Different versions (for different users, serving different purposes) can be produced
Increased storage capacity
  – More information can be stored, especially
      Implicit information can be made explicit
      More examples, including “negative data”
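The access advantages listed above can be illustrated with a minimal Python sketch. It assumes a hypothetical in-memory list of entries; the field names (headword, pos, definition) and the toy definitions are invented for illustration and do not follow any real MRD format.

```python
# Hypothetical toy entries; field names and definitions are illustrative only.
ENTRIES = [
    {"headword": "crane", "pos": "noun", "definition": "a large long-necked wading bird"},
    {"headword": "crane", "pos": "noun", "definition": "a machine for lifting heavy loads"},
    {"headword": "crane", "pos": "verb", "definition": "to stretch out one's neck"},
    {"headword": "pupil", "pos": "noun", "definition": "a person who is taught by another"},
]


def search(entries, **criteria):
    """Database-style search: keep entries whose fields equal every criterion."""
    return [e for e in entries if all(e.get(k) == v for k, v in criteria.items())]


def definitions_containing(entries, word):
    """Access by the content of the definition, not by alphabetical headword."""
    return [e["headword"] for e in entries if word in e["definition"].split()]


def view(entry, fields):
    """Filter the information presented down to a chosen subset of fields."""
    return {f: entry[f] for f in fields if f in entry}
```

For instance, search(ENTRIES, headword="crane", pos="noun") returns the two noun senses, while view(...) could produce a cut-down version of an entry for a particular kind of user.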

11/18 Lexicons for NLP
Have to state everything we need to know about the word
  – Phonology: stress pattern, possible weak forms
  – Orthography: spelling alternatives, hyphenation
  – Morphology: inflectional paradigms, even if regular
  – Information about derivations
  – Syntax:
      Explicit information about subcategorization and eg syntactic/semantic features of arguments
      Any special interpretation of tenses
  – Lexical combinatorics: compounds, idioms
  – Semantics: definition, semantic features, semantic relations
  – Pragmatics: register, collocation, connotation
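One way to picture such an entry is as a record with one field per level of description. The following is only a sketch: the class name LexicalEntry, its field names and the values given for eat are assumptions made for illustration, not a standard lexicon format.

```python
from dataclasses import dataclass, field


@dataclass
class LexicalEntry:
    """Hypothetical record covering some of the levels listed above."""
    lemma: str
    pos: str
    stress_pattern: str = ""                                # phonology
    spelling_variants: list = field(default_factory=list)   # orthography
    inflections: dict = field(default_factory=dict)         # morphology, even if regular
    subcat_frames: list = field(default_factory=list)       # syntax
    semantic_features: set = field(default_factory=set)     # semantics
    register: str = "neutral"                               # pragmatics


# Illustrative values only.
eat = LexicalEntry(
    lemma="eat",
    pos="verb",
    inflections={"3sg": "eats", "past": "ate", "past_part": "eaten", "pres_part": "eating"},
    subcat_frames=[("subj",), ("subj", "obj")],
    semantic_features={"obj:edible"},
)
```

Unlike a human dictionary, the entry spells out even regular forms such as eats and eating, because a program cannot be assumed to know them.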

12/18 Lexicons for NLP - example
Information about derivations
Agentive derivation (-er) is very productive
  – Usually means the actor doing the action of a verb, e.g. swimmer, dancer, killer
  – Not available for some verbs, e.g. *knower, *cycler, *sayer (though cf soothsayer), *hoper
  – May have a specialised meaning instead of or as well as the derived meaning, e.g. revolver, computer, washer, hitter
  – In some cases can mean the object undergoing the action (via ergative use of verb), e.g. taster
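A productive rule with listed exceptions might be sketched as follows. The blocked verbs and specialised forms are taken from the examples above; the spelling rule (a crude consonant-doubling test) and the function names are assumptions, and the rule will misfire on many verbs outside this toy sample.

```python
VOWELS = set("aeiou")

# Verbs whose -er form is marked as unavailable above (*knower, *cycler, *sayer, *hoper).
BLOCKED = {"know", "cycle", "say", "hope"}

# Forms listed above as having a specialised meaning; they need entries of their own.
LEXICALISED = {"revolver", "computer", "washer", "hitter"}


def _doubles_final_consonant(verb):
    """Crude consonant-vowel-consonant test, e.g. swim -> swimm-, hit -> hitt-."""
    if len(verb) < 3:
        return False
    c1, v, c2 = verb[-3], verb[-2], verb[-1]
    return c1 not in VOWELS and v in VOWELS and c2 not in VOWELS and c2 not in "wxy"


def agentive(verb):
    """Return the predicted -er noun, or None if the derivation is blocked."""
    if verb in BLOCKED:
        return None
    if verb.endswith("e"):
        return verb + "r"
    if _doubles_final_consonant(verb):
        return verb + verb[-1] + "er"
    return verb + "er"


def needs_own_entry(noun):
    """True if the form's meaning cannot be fully predicted from the verb."""
    return noun in LEXICALISED
```

The point of the sketch is the division of labour: the general rule is stated once, and the lexicon only records where it fails or where the result has drifted in meaning.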

13/18 Subcategorization
Words are assigned to categories (ie parts of speech, POS), eg noun, verb
  – on basis of form, meaning, use
Syntactic behaviour is predictable from (or determined by) category
Within a category there are subcategories with specific patterns of behaviour, both syntactic and semantic, e.g.
  – transitive/intransitive verb → direct object? passivize?

14/18 Subcategorization
Subcat frames indicate complement patterns and preferences, e.g.
  – subj, obj, double obj, prep-obj, infinitival complement, that complement etc
  – semantic features of complements, eg obj of eat normally edible
Subcat information can help to disambiguate
  – cf He told [the man] [where the body was buried].
  – He found [the place [where the body was buried]].
Much of this info can be captured in general rules
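The tell/find contrast can be sketched as a lookup in a table of frames. The frames below are a deliberately small, assumed inventory for these two verbs, and the function names are hypothetical; a real lexicon would list many more patterns.

```python
# Illustrative frames only (assumed for this sketch).
SUBCAT = {
    "tell": [("subj", "obj"), ("subj", "obj", "wh-clause"), ("subj", "obj", "that-clause")],
    "find": [("subj", "obj"), ("subj", "that-clause")],
}


def licenses(verb, *complements):
    """True if the verb has a frame with exactly these complements after the subject."""
    return (("subj",) + complements) in SUBCAT.get(verb, [])


def attach_wh_clause(verb):
    """Decide where a wh-clause following the object noun phrase should attach."""
    if licenses(verb, "obj", "wh-clause"):
        return "complement of verb"   # He told the man where the body was buried.
    return "modifier of object"       # He found the place where the body was buried.
```

With these frames, a parser would treat the wh-clause after tell as a second complement, and the one after find as a relative clause inside the object.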

15/18
Have to state everything we need to know about the word, though not necessarily explicitly
  – There can be rules to capture inheritance of properties, e.g. accomplishment + prog tense implies incompletion
      cf She was baking a cake when she dropped dead → no cake
      She was stroking the cat when she dropped dead
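Inheritance of this kind can be sketched as a lookup that falls back from the word to its class. The class labels assigned to bake and stroke, the property name progressive_entails_result and the function lookup are all assumptions made for illustration.

```python
# Properties stated once per aspectual class (assumed labels).
CLASS_RULES = {
    "accomplishment": {"progressive_entails_result": False},  # was baking a cake: no cake
    "activity": {"progressive_entails_result": True},         # was stroking the cat: cat got stroked
}

# Individual entries only name their class; nothing else is stated explicitly.
LEXICON = {
    "bake": {"aspect": "accomplishment"},
    "stroke": {"aspect": "activity"},
}


def lookup(verb, prop):
    """Return a property from the entry itself, else inherit it from the verb's class."""
    entry = LEXICON[verb]
    if prop in entry:
        return entry[prop]
    return CLASS_RULES[entry["aspect"]][prop]
```

An entry can still override the inherited value by stating the property itself, which is how exceptions would be handled in this scheme.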

16/18 Exploiting human dictionaries in NLP
In all NLP applications, lexicon is major bottleneck
Availability of MRD versions of human dictionaries provided possible solution
  – Obviously, MRD gives list of words, and some information
  – Extract further information about verb frames by analysing the examples
  – Identify semantic features from definitions, eg a plant which..., a person who...
  – Identify hidden arguments, eg to lock = to close something using a key
      cf He locked the door. The key was heavy.
      He emptied his pockets. *The key was heavy.
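Pulling a semantic feature out of a definition of the form "a plant which..." or "a person who..." can be approximated with a pattern match. This is a rough sketch: the regular expression covers only that one definition shape, and the function name genus_term is hypothetical.

```python
import re

# Matches definitions shaped like "a/an/the NOUN which|who|that ...".
GENUS = re.compile(r"^(?:an?|the)\s+([a-z]+)\s+(?:which|who|that)\b", re.IGNORECASE)


def genus_term(definition):
    """Return the head noun of a matching definition, or None if the shape differs."""
    match = GENUS.match(definition.strip())
    return match.group(1).lower() if match else None
```

Definitions that do not fit the pattern, such as verb definitions beginning with "to", simply return None and would need other extraction rules.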

17/18 Exploiting human dictionaries in NLP
Generic information about a word and its usage can be derived from definitions in which it occurs:
  Wine: alcoholic drink made from fermented juices, especially of grapes
  Vintage: a season’s yield of wine from a vineyard
  Red wine: wine having a red colour derived from the skins of the grapes used...
  Vineyard: an orchard where grapes are grown for the purpose of wine making
  Pinot noir: a dry red Californian table wine
  Sake: Japanese rice wine
  Claret: a dry red Bordeaux or Bordeaux-like wine
  Sherry: a sweet white wine from the Jerez region of Spain
  Riesling: a dessert wine made from white grapes grown historically in Germany...
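Finding the definitions in which a word occurs amounts to a reverse lookup over the dictionary. The sketch below uses the definitions quoted above as data; the function name headwords_mentioning and the simple lower-case tokenisation are assumptions.

```python
import re

# Definitions as quoted above (truncated ones are kept truncated).
DEFINITIONS = {
    "wine": "alcoholic drink made from fermented juices, especially of grapes",
    "vintage": "a season's yield of wine from a vineyard",
    "red wine": "wine having a red colour derived from the skins of the grapes used...",
    "vineyard": "an orchard where grapes are grown for the purpose of wine making",
    "pinot noir": "a dry red Californian table wine",
    "sake": "Japanese rice wine",
    "claret": "a dry red Bordeaux or Bordeaux-like wine",
    "sherry": "a sweet white wine from the Jerez region of Spain",
    "riesling": "a dessert wine made from white grapes grown historically in Germany...",
}


def headwords_mentioning(word):
    """List the headwords whose definition contains the given word as a token."""
    hits = []
    for headword, definition in DEFINITIONS.items():
        tokens = re.findall(r"[a-z]+", definition.lower())
        if word in tokens:
            hits.append(headword)
    return hits
```

Calling headwords_mentioning("wine") gathers every entry defined in terms of wine, which together say more about wine (colours, regions, ingredients) than its own definition does.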

18/18 Corpus-based lexicography revisited
Similarly, analysis of real examples can reveal patterns of usage
  – Identify primary meaning: not always what you’d expect (example of reckon)
  – Identify possible complementation patterns, and their relative frequency
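Counting what follows a verb in corpus lines gives a first, rough view of its complementation patterns. In this sketch the five sentences are invented for illustration, so the resulting counts say nothing about the real behaviour of reckon; the function name is also hypothetical.

```python
import re
from collections import Counter

# Invented example lines standing in for real concordance output.
LINES = [
    "I reckon that it will rain",
    "They reckon on a good harvest",
    "She reckoned that he was wrong",
    "We reckon on your support",
    "I reckon that she knows",
]

FORMS = {"reckon", "reckons", "reckoned", "reckoning"}


def following_word_counts(lines, keyword_forms):
    """Count the word immediately after each occurrence of the keyword."""
    counts = Counter()
    for line in lines:
        tokens = re.findall(r"[a-z']+", line.lower())
        for i, token in enumerate(tokens[:-1]):
            if token in keyword_forms:
                counts[tokens[i + 1]] += 1
    return counts
```

On a real corpus the same counts, grouped into patterns such as that-clause versus prepositional object, would give the relative frequencies mentioned above.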

19/18 Structured dictionaries
Special type of dictionary in which words are grouped together according to their meaning: thesaurus
Classic example: Roget’s Thesaurus (1852)
Structured vocabulary much used in field of terminology
Also now a valuable resource for NLP: Miller’s (Princeton) WordNet (1985)