RECOGNISING NOMINALISATIONS


RECOGNISING NOMINALISATIONS
Supervisors: Dr. Alex Lascarides, Dr. Mirella Lapata
(Andrew) Yuk On KONG
University of Edinburgh

DEFINITION
“Nominalisation refers to the process of forming a noun from some other word-class (e.g. red+ness) or (in classical transformational grammar especially) the derivation of a noun phrase from an underlying clause (e.g. Her answering of the letter... from She answered the letter). The term is also used in the classification of relative clauses (e.g. What concerns me is her attitude)...” (Crystal 1997)

Only nominalisations in the first sense of the definition, and only those derived from verbs, are considered here, e.g. "statement" from "state". Problem: given a word, is it a noun, and is it derived from a verb or not? Nominalisations derived from verbs are very productive in English and are usually created by suffixation (i.e., noun-forming suffixes are attached to verb bases).

EXCLUSIONS
Nominals, e.g. the poor, the wounded
Nominalisations NOT derived from verbs, e.g. redness
-ing forms, e.g. the making of the movie
Antidisestablish-ment-arian-ism

REGULAR?
nominalise -> nominalisation
interpret -> interpretation
interrupt -> interruption
associate -> association
delete -> deletion
break -> breakage
leak -> leakage

confine -> confinement
refine -> refinement (but define -> definition)
submit -> submission
admit -> admission (but also admittance)
remit -> remission; remittance; remit

VERB=NOUN
debate -> debate (not debation); debater
pay -> pay
love -> love
boss -> boss
stand -> stand
purchase -> purchase
lie -> lie (“tell a lie”) (cf. lie down)

VERB=NOUN (except stress)
transfer -> transfer
transport -> transport
import -> import
rebel -> rebel; (rebellion)

1 VERB, >1 NOUNS
collect -> collection; collector
interpret -> interpretation; interpreter
cover -> cover; coverage
conduct -> conduction; conductor
depend -> dependant/dependent; dependence; dependency

SEMANTICS
conduct -> conduction (conduct electricity/heat)
conduct -> conduct (behave/organise)

WHEN TO USE WHICH SUFFIX
-tion/-sion
-er/-or
debate -> debater
talk -> talker
collect -> collector
conduct -> conductor

IRREGULAR NOMINALISATION
choose -> choice
succeed -> success; succession; successor
decide -> decision
sell -> sale

PSEUDO-NOMINALISATION
mote?? -> motion (mote is a noun: a very small piece of dust)
depart -> departure; department???
apart -> apartment????

WHY BOTHER?
The identification of nominalisations and their associated verbs (e.g. "statement" and "state") is important for a number of NLP tasks:
machine translation
information retrieval
automatic learning of machine-readable dictionaries
grammar induction

HOW?
Nominalisation is a productive morphological phenomenon.
List all acceptable nominalised forms?
What about new words?

Techniques NOT focusing on nominalisations:
build rules
machine-learning approaches to induce morphological structures using large corpora
knowledge-free induction of inflectional morphologies (Schone and Jurafsky 2001)

SCHONE AND JURAFSKY (2001)
Schone and Jurafsky (2001) worked on acquiring cognates and morphological variants, using:
Induced semantics: Latent Semantic Analysis (LSA)
Induced orthographic information
Induced syntactic information
Transitive information
Affix frequencies

GOAL OF THIS STUDY The principal goal of this project is to develop a system which can recognise nominalisations, together with the verbs from which they are derived.

EXPERIMENT 1 (baseline)
Identify nouns using the tags in the corpus.
Identify potential nominalisations from the list of nouns using a list of nominalisation suffixes.
For each potential nominalisation, find the corresponding potential verb: the verb (from among the words tagged as verbs) that shares with it the greatest number of letters in sequence.
Accept a nominalisation-verb pair if more than 50% of the letters match; discard all others.
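The baseline steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the suffix list is hypothetical (the slides do not enumerate the suffixes), "letters in sequence" is read as the longest common substring, and the percentage is measured against the length of the noun.

```python
# Hypothetical suffix list; the slides do not enumerate the suffixes used.
NOMINAL_SUFFIXES = ("tion", "sion", "ment", "ance", "ence", "age", "al", "er", "or")


def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest run of letters shared in sequence by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def baseline_pairs(nouns, verbs, threshold=0.5):
    """Pair each suffix-bearing noun with its best-matching verb."""
    pairs = {}
    for noun in nouns:
        if not noun.endswith(NOMINAL_SUFFIXES):
            continue  # not a potential nominalisation
        best_verb, best_len = None, 0
        for verb in verbs:
            shared = longest_common_substring(noun, verb)
            if shared > best_len:
                best_verb, best_len = verb, shared
        # Assumption: "% letters matched" is relative to the noun's length.
        if best_verb is not None and best_len / len(noun) > threshold:
            pairs[noun] = best_verb
    return pairs
```

With nouns "statement", "collection", "apartment" and verbs "state", "collect", "depart", this sketch keeps statement/state and collection/collect and rejects "apartment" (only 4 of 9 letters match "depart"). It would still accept department/depart (6 of 10 letters), which is the kind of pseudo-nominalisation a purely orthographic baseline cannot rule out.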

EXPERIMENT 2
Use a decision tree to build a model. Possible features include:
- letter similarity between verbs and nouns
- suffix frequency
- verb frequency
- verb semantics
- subject of noun
- subject of verb
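As a rough illustration of how the first three features might be computed before being passed to a decision-tree learner, here is a minimal sketch. The suffix inventory, the function names and the use of a shared-prefix ratio for letter similarity are all assumptions made for the example; the semantic and subject features would need a parser or LSA and are left out.

```python
from collections import Counter

# Hypothetical suffix inventory; the slides do not list the suffixes used.
SUFFIXES = ("tion", "sion", "ment", "ance", "age", "er", "or")


def common_prefix_ratio(noun: str, verb: str) -> float:
    """Share of the noun's letters covered by the prefix it shares with the verb."""
    shared = 0
    for a, b in zip(noun, verb):
        if a != b:
            break
        shared += 1
    return shared / len(noun)


def suffix_of(noun: str) -> str:
    """Return the first listed suffix the noun ends with, or '' if none."""
    for suffix in SUFFIXES:
        if noun.endswith(suffix):
            return suffix
    return ""


def feature_vectors(candidates, noun_counts, verb_counts):
    """Build one feature dict per (noun, verb) candidate pair."""
    # Suffix frequency: total corpus count of nouns carrying each suffix.
    suffix_freq = Counter()
    for noun, count in noun_counts.items():
        suffix_freq[suffix_of(noun)] += count
    rows = []
    for noun, verb in candidates:
        rows.append({
            "letter_similarity": common_prefix_ratio(noun, verb),
            "suffix_frequency": suffix_freq[suffix_of(noun)],
            "verb_frequency": verb_counts.get(verb, 0),
        })
    return rows
```

Each resulting row could then serve as one training instance, labelled as a true or false nominalisation-verb pair.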

EVALUATION
Experiments will be based on the BNC.
The nominalisations obtained will be evaluated against the CELEX morphological lexicon and manually annotated data.
Measures: precision, recall and F-score
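For reference, the three measures can be computed over sets of (noun, verb) pairs as in the sketch below. The function name and the representation of the gold standard as a set of pairs are assumptions for illustration, and the F-score shown is the balanced F1.

```python
def precision_recall_f(predicted, gold):
    """Score predicted (noun, verb) pairs against a gold-standard set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Balanced F-score: harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score
```

For example, if the system proposes three pairs of which two appear in a gold standard of four pairs, precision is 2/3, recall is 1/2 and the F-score is 4/7.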

BRITISH NATIONAL CORPUS
Over 100 million words
Corpus of modern English
Both spoken (10%) and written (90%)
Each word is automatically tagged by the CLAWS stochastic POS tagger
65 different tags
Encoded using SGML to represent POS tags and a variety of other structural properties of texts (e.g. headings, paragraphs, lists)

<item> <s n=086> <w NN1-VVG>Shopping <w PRP>including <w NN1>collection <w PRF>of <w NN2>prescriptions </item> <s n=087> <w VVG>Daysitting <w CJC>and <w VVG>nightsitting
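A minimal sketch of how noun and verb candidates might be pulled out of markup like the sample above. The regular expression only targets the `<w TAG>word` pattern shown here, and the handling of ambiguity tags such as NN1-VVG (keeping the first tag only) is a simplifying assumption.

```python
import re

# Matches the <w TAG>word markup shown in the sample above.
WORD_TOKEN = re.compile(r"<w ([A-Z0-9-]+)>([^<]+)")


def tagged_words(sgml: str):
    """Yield (tag, word) pairs from BNC-style SGML text."""
    for tag, word in WORD_TOKEN.findall(sgml):
        yield tag, word.strip()


def nouns_and_verbs(sgml: str):
    """Split words into noun and verb lists by tag prefix."""
    nouns, verbs = [], []
    for tag, word in tagged_words(sgml):
        # Assumption: for ambiguity tags like NN1-VVG, keep the first tag only.
        primary = tag.split("-")[0]
        if primary.startswith("NN"):
            nouns.append(word.lower())
        elif primary.startswith("VV"):
            verbs.append(word.lower())
    return nouns, verbs
```

Applied to the sample, this treats "shopping", "collection" and "prescriptions" as nouns and "daysitting" and "nightsitting" as verbs.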

CELEX
English, Dutch and German
Annotated by humans using lemmata from two dictionaries of English
52,446 lemmata and 160,594 wordforms
Orthographic, phonological, morphological, syntactic and frequency information
Morphological structure, e.g. ((celebrate),(ion))

MILESTONES
6/2002 Experiment 1 (baseline)
7/2002 Experiment 2
8/2002 Write-up
9/2002 Finalise report