On translation units and automatic processing Patricia Fernández Carrelo University of Deusto CliP 2006, London, 29 June–1 July.



Natural Language Processing: main lexical problems. Disambiguation; multiword expressions; all levels of language. Points of view: monolingual, multilingual, interlingual.

Interlingual task: translation (I). Problem: text segmentation. Machine translation: need for objective criteria for segmentation.

Interlingual task: translation (II). Multiword segments: multiword expressions. Their number is of the same order of magnitude as the number of single words (Jackendoff 1997); 41% of the entries in WordNet 1.7 are multiword (Fellbaum 1999).

Linguistic levels. Lexicology (and terminology): degree of lexicalization. Morphology and syntax: components (order, cooccurrence, inflection...). Semantics: decomposability, other relationships. Pragmatics: context, equivalent words. Text analysis.

Points of view for analysis. Traditional linguistics. Since computational linguistics: "a pain in the neck" (Sag et al. 2002). Translation and machine translation: need for better approaches.

Names and definitions for MWE (I). "Idiosyncratic interpretations that cross word boundaries (or spaces)" (Sag et al. 2002). "A sequence of words that acts as a single unit at some level of linguistic analysis" (Calzolari et al. 2002). "Any phrase that is not entirely predictable on the basis of standard grammar rules and lexical entries" (LinGO Lab, Stanford University).

Names and definitions for MWE (II). English: multiword expressions (MWE) or units (MWU) (Cowie, 1985); multi-word lexemes (MWL) (Gates, 1988); multiword lexical unit (Zgusta, 1967); complex lexemes and lexical units (Lipka, 1983). Basque: lexia konplexuak (Abaitua, 2002); hitz anitzeko unitate lexikalak (HAUL) (IXA Group). Spanish: expresiones o unidades multipalabra, multiverbales (Alvar Ezquerra, 2000), poliléxicas (Benson, 1985); expresiones pluriverbales (Casares, 1992 [1950]); unidades pluriverbales lexicalizadas y habitualizadas (Haensch et al., 1982); unidad léxica pluriverbal (Hernández, 1989); unidades fraseológicas (UFS) o fraseologismo (Zuluaga, 1980); lexías complejas (Abaitua, 1997).

Classification criteria and linguistic description. Cooccurrence and/or need of some components; syntactic and semantic transparency; formal and semantic compositionality; frozen or fixed status; selectional restrictions; violation of some general syntactic patterns or rules; degree of lexicalization; degree of conventionality; idiomaticity.

Taxonomy (I) (Sag et al., 2002). Lexicalized phrases: fixed expressions; semi-fixed expressions (non-decomposable idioms, compound nominals, proper names, multiword terminology); syntactically flexible expressions (verb-particle constructions, decomposable idioms, light verbs). Institutionalized phrases (collocations).

Taxonomy (II). Fixed expressions (Spanish – English – Basque). Adverbial phrases: al pie de la letra – to the letter – hitzez hitz; de improviso – suddenly – ziplo. Prepositional phrases: a causa de – because of – (r)en ondorioz*; en torno a – around – inguruan. Multiword conjunctions: mientras tanto – meanwhile – bitartean; con tal de que – so long as – ba...*. Latin expressions: ad hoc, sine dubio, sine die...
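
One simple way to treat fixed expressions such as these is to list each one as a single lexicon entry that happens to contain spaces, and to segment the text by preferring the longest match. The following is only a minimal sketch under that assumption: the lexicon is a hypothetical sample built from the Spanish examples on this slide, and the function name `segment` is illustrative, not part of any system described in the talk.

```python
# Hypothetical mini-lexicon built from the fixed expressions on the slide.
FIXED_EXPRESSIONS = {
    ("al", "pie", "de", "la", "letra"),
    ("de", "improviso"),
    ("a", "causa", "de"),
    ("en", "torno", "a"),
    ("mientras", "tanto"),
    ("con", "tal", "de", "que"),
    ("ad", "hoc"),
}
MAX_LEN = max(len(entry) for entry in FIXED_EXPRESSIONS)


def segment(tokens):
    """Greedy longest-match segmentation: a fixed expression becomes one unit."""
    units = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, falling back to a single token.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if n == 1 or candidate in FIXED_EXPRESSIONS:
                units.append(" ".join(tokens[i:i + n]))
                i += n
                break
    return units


print(segment("Lo siguió al pie de la letra a causa de la norma".split()))
```

This only works because fixed expressions allow no inflection or internal variation; the more flexible classes in the taxonomy need more than string matching.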

Taxonomy (III). Semi-fixed expressions. Non-decomposable idioms: kick the bucket / estirar la pata. Compound nominals: viaje de novios – honeymoon – eztei-bidaia. Proper names: the (Oakland) Raiders (these raise problems of their own). Multiword terminology: mayoría absoluta – absolute majority – erabateko gehiengo.

Taxonomy (IV). Syntactically flexible expressions. Verb-particle constructions: non-compositional: write up, look up / acordarse de, constar de / Basque postpositions; compositional: break up. Decomposable idioms: spill the beans – revelar un secreto. Light verbs: make, do, have, give / hacer, tener, ser, dar / egin, izan, eman.
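
Syntactically flexible expressions cannot be matched as contiguous strings, because other words may come between their parts (compare "look up the word" and "look the word up"). The sketch below illustrates one possible way to detect English verb-particle constructions with a bounded gap. The entry list, the `max_gap` value and the function name are assumptions made for illustration, and inflected forms are deliberately not handled (a real system would lemmatise first).

```python
# Hypothetical verb-particle entries taken from the slide's English examples.
VERB_PARTICLES = {("look", "up"), ("write", "up"), ("break", "up")}


def find_verb_particles(tokens, max_gap=3):
    """Return (verb_index, particle_index) pairs, allowing a short gap.

    max_gap is the largest number of tokens allowed between verb and
    particle; the default of 3 is an arbitrary illustrative choice.
    """
    matches = []
    lowered = [t.lower() for t in tokens]
    for i, verb in enumerate(lowered):
        # Scan the next few tokens for a particle licensed by this verb.
        for j in range(i + 1, min(i + 2 + max_gap, len(lowered))):
            if (verb, lowered[j]) in VERB_PARTICLES:
                matches.append((i, j))
                break
    return matches


print(find_verb_particles("Please look the word up".split()))
print(find_verb_particles("Please look up the word".split()))
```

A purely positional window like this over-generates (it cannot tell a particle from a preposition), which is one reason the slide's linguistic criteria matter for processing.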

Taxonomy (V). Institutionalized phrases (collocations): pay attention – poner/prestar atención – arreta eman; heavy smoker – fumador empedernido – erretzaile amorratua; red wine – vino tinto – ardo beltza. (Examples from Testuteka testuteka/index.html)

Multiword expression as translation unit. Translation units: difficulty in definition and classification. Vázquez-Ayora (1977): simple; diluted (multiple-to-one equivalents, Nida); fractionary. "In fact there are good reasons for keeping the TU (in the sense of translation atom) in MT as small, and hence as manageable, as possible" (Bennett, 1994).

Methods for processing. Symbolic: words-with-spaces; hierarchical lexicon with default constraint inheritance; circumscribed constructions; lexical selection; information about frequency. Example: Villavicencio et al. Statistical: F. Smadja, Xtract.
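
To give a flavour of the statistical family of methods, the sketch below ranks adjacent word pairs by pointwise mutual information (PMI), so that pairs occurring together more often than chance would predict rise to the top. This is a generic association measure swapped in for illustration; it is not Smadja's Xtract procedure, and the function name and the `min_count` threshold are assumptions.

```python
import math
from collections import Counter


def rank_bigrams_by_pmi(tokens, min_count=2):
    """Rank adjacent word pairs by PMI, highest first.

    Pairs seen fewer than min_count times are dropped, because PMI is
    unreliable for rare events. Probabilities are plain relative
    frequencies with no smoothing.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scored = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_unigrams
        p_w2 = unigrams[w2] / n_unigrams
        scored.append(((w1, w2), math.log2(p_pair / (p_w1 * p_w2))))
    return sorted(scored, key=lambda item: item[1], reverse=True)


corpus = "he drank red wine and she drank red wine and he paid attention".split()
for pair, score in rank_bigrams_by_pmi(corpus):
    print(pair, round(score, 2))
```

On a toy corpus like this the scores mean little; the measure only becomes informative over large corpora, and candidates would still need checking against the linguistic criteria listed earlier.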

Conclusions. MWEs as translation units. Approach from translation and, especially, from machine translation. Linguistic definition and precision for better processing.

That's all, folks! ¡Eso es todo, amigos! Agur Ben-Hur!