A probabilistic approach to language structure
Annarita Felici and Paul Pal
Royal Holloway, University of London
Helsinki, 2-4 June 2008



QITL3, 2-4 June 2008

Outline
• Field of investigation
• Research goals
• Data
• Probabilistic analysis
• Information Theory
• Entropy results

Field of investigation
• Repetitive language structure in multilingual legal text
• EU normative statements in translation
• Languages of investigation
  - English, French, German and Italian

Field of investigation: legal norms
• Deontic norms (from the Greek deon = duty)
  - obligations, prohibitions, permissions and authorizations
• Constitutive performatives
  - The uttering of a performative is, or is part of, the doing of a certain kind of action or speech act (Austin 1962)
  - Uttering a sentence = doing things

Other norm types
• Logical necessity
  - necessary requirements or competences
• Non-binding norms
  - guidelines, correct procedure

Research goals
1. To evaluate the degree of prescriptive standardization in French, German and Italian with reference to English
2. To predict translation equivalents in French, German and Italian

• English legal drafting is highly standardized
• The EU and the main English drafting guides suggest modal verbs for prescriptive norms (Coode 1843, Driedger 1976, Dickerson 1975, Thornton 1996)
• Text types under investigation are repetitive and reusable
• Text types under investigation can be more or less binding under the conditions that:

Data
Multilingual parallel corpus
• Origin: EU
• Corpus size: words
• Text type: normative
• Type of docs: Secondary Legislation (Regulations, Decisions, Directives, Recommendations)
• Years:
• Languages: English, French, German, Italian

Probabilistic Analysis
Information Theory
• To measure the number of linguistic alternatives when translating a repetitive normative statement from English into French, German and Italian
  = Quantifying information by reducing uncertainty
  - more alternatives = more uncertainty (high entropy)
  - fewer alternatives = more standardization, certainty (low entropy)

Probabilistic Variables
• Categories of expressions
• Linguistic forms
• English modals
  - Entry point for parallel retrieval
  - shall, must, may, can, should

Categories of expression
• Constitutive norms and performatives
• Logical necessity
• Permissions and authorizations
• Capability
• Non-binding norms

Linguistic forms
• Indicative (pres.)
• Modal verbs (mv)
• Verbal periphrasis (vp)
• Lexicalized modal expressions (le)
• Ellipses (0-correspondence)

Linguistic forms
[Figure: Linguistic equivalents used in constitutive and performative norms]

Linguistic forms
[Figure: Linguistic equivalents used to convey permissions and authorizations]

• Given the English system of modality, what is the relative probability of choosing an equivalent modal verb in the translation of may or must, and a different linguistic form as the equivalent of shall?
• Is the probability of a choice in one system affected by a choice in another?

Information Theory
• The information value or content h(p) depends on the probability of occurrence (p) of an event (Shannon 1949):
  $h(p) = -\log(p) = \log(1/p)$
Entropy
• Degree of uncertainty (= shortage of information due to the large number of alternatives)
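
As a minimal sketch of the information value defined on this slide, the Python function below evaluates h(p) = log(1/p). The base-2 logarithm (bits) is an assumption: the slides do not state the base, although the entropy of 2.32 reported later for the fictitious language equals log2(5). The function name `information_content` is illustrative only.

```python
import math


def information_content(p: float) -> float:
    """Return h(p) = log2(1/p): the information value, in bits, of an event with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in the interval (0, 1]")
    return math.log2(1.0 / p)


# A certain event (p = 1) carries no information; rarer events carry more.
for p in (1.0, 0.5, 0.2):
    print(f"p = {p:.2f} -> h(p) = {information_content(p):.3f} bits")
```

The rarer a translation choice is, the more information its occurrence conveys, which is why a system with many rare alternatives is harder to predict.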

Probabilistic analysis
• The frequency of occurrence ($n_i$) of each linguistic form is associated with a category
• A probability variable ($p_i$) is derived from the estimated proportion of a particular linguistic form

Probabilistic analysis
• In English:
  $p_1 = p_{mv \to shall} = n_{shall}/n$; $p_2 = p_{mv \to must} = n_{must}/n$; $p_3 = p_{mv \to should} = n_{should}/n$; $p_4 = p_{mv \to can} = n_{can}/n$; $p_5 = p_{mv \to may} = n_{may}/n$
• In French, German and Italian:
  $p_1 = p_{indicative} + p_{mv} + p_{vp} + p_{me} + p_{ellipses}$; $p_2 = p_{indicative} + p_{mv} + p_{vp} + p_{me} + p_{ellipses}$; and so on.
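
The relative-frequency estimate on this slide (each probability is a form's count divided by the total count n) can be sketched as follows. The counts are hypothetical placeholders, not corpus figures, and the helper name `relative_frequencies` is an assumption made for illustration.

```python
def relative_frequencies(counts):
    """Estimate p_form = n_form / n for every form in a {form: count} mapping."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one occurrence is required")
    return {form: n / total for form, n in counts.items()}


# Hypothetical counts of the five English modals (NOT taken from the corpus).
modal_counts = {"shall": 60, "must": 10, "should": 5, "can": 5, "may": 20}
modal_probs = relative_frequencies(modal_counts)

for modal, p in modal_probs.items():
    print(f"p(mv -> {modal}) = {p:.2f}")
```

The same helper applies to a target language by passing the counts of the five linguistic forms (indicative, mv, vp, me, ellipses) found as equivalents of one English modal.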

[Table: Linguistic forms and frequencies of occurrence in the EU Regulation for the selected categories of 1) constitutive norms and 2) permissions and authorizations]

Probabilistic approach
• The sum of these probabilities produces different information values
• The expected information content of a system is the sum of the information contents weighted by the probabilities for each possible outcome
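
Written out with the symbols used on the earlier slides (h(p) for the information content and p_i for the probability of the i-th linguistic form), the expected information content is the Shannon entropy. The upper index k, standing for the number of alternative forms, is notation introduced here rather than taken from the slides; terms with p_i = 0 are taken to contribute nothing.

```latex
H \;=\; \sum_{i=1}^{k} p_i \, h(p_i)
  \;=\; -\sum_{i=1}^{k} p_i \log p_i ,
\qquad \text{with } \sum_{i=1}^{k} p_i = 1 .
```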

Entropy: extrema
• Variations in the language-specific $p_i$ values of linguistic forms produce distribution profiles reflecting the characteristics of the corresponding language.
• Mathematically it can be shown that:
  - If all the $p_i$ values are equal (equi-probable situation), the profile is a uniform distribution and results in maximum entropy.
  - If one $p_i$ equals 1 and the remaining $p_i$ values are zero, the entropy is minimum (e.g. English).
  - All other distributions lie between these two limits (e.g. French, German and Italian).
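
The two limits can be checked numerically. This is a small sketch assuming base-2 logarithms and five alternative forms; the "skewed" profile is a made-up in-between distribution, not data from the corpus.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)


k = 5                                   # number of alternative linguistic forms
uniform = [1.0 / k] * k                 # equi-probable forms: maximum entropy
single = [1.0, 0.0, 0.0, 0.0, 0.0]      # one form always chosen: minimum entropy
skewed = [0.7, 0.1, 0.1, 0.05, 0.05]    # hypothetical in-between profile

h_max, h_min, h_mid = entropy(uniform), entropy(single), entropy(skewed)
print(f"uniform: {h_max:.3f}  single: {h_min:.3f}  skewed: {h_mid:.3f}")
```

For k equally likely forms the maximum is log2(k), so the upper limit grows with the number of alternatives, while the lower limit is always zero.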

A concrete example
• Regulation document in English, French, German and Italian + a fictitious language.
• One category of expression: e.g. the constitutive norms.
• 5 linguistic forms for this category.
• Total number of modal verbs and alternatives: 2075.

Constitutive norm
[Table: Frequency of occurrence of the expression modes (rows: mv, ind, vp, me, el) in 4 real languages and one fictitious language (columns: English, French, German, Italian, Fictitious)]

[Figure: Histogram of the 5 modes of expression]

Comparison based on entropy
Computed entropy of constitutive norms
EN: $H = 0 + H_{mv} =$
FR: $H = H_{ind} + H_{mv} + H_{vp} + H_{me} + H_{el} = 0.857$
GE: $H = H_{ind} + H_{mv} + H_{vp} + H_{me} + H_{el} = 1.08$
IT: $H = H_{ind} + H_{mv} + H_{vp} + H_{me} + H_{el} = 0.88$
FI: $H = H_{ind} + H_{mv} + H_{vp} + H_{me} + H_{el} = 2.32$
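
The per-form decomposition on this slide can be illustrated with a short sketch. The frequency table itself did not survive in this transcript, so the counts below are assumptions: the fictitious language is taken to spread the 2075 occurrences evenly over the five forms (415 each), which reproduces the reported 2.32, and English is taken to use modal verbs exclusively, in line with the form "0 + H_mv". The French, German and Italian values cannot be recomputed without the original counts.

```python
import math

TOTAL = 2075  # total number of modal verbs and alternatives (from the example slide)
FORMS = ("ind", "mv", "vp", "me", "el")


def entropy_terms(counts):
    """Per-form contributions H_form = p * log2(1/p), with p = n_form / n."""
    n = sum(counts.values())
    return {form: (c / n) * math.log2(n / c) if c else 0.0
            for form, c in counts.items()}


# ASSUMED counts, chosen only to be consistent with the reported values.
fictitious_assumed = {form: TOTAL // 5 for form in FORMS}          # 415 each
english_assumed = {form: (TOTAL if form == "mv" else 0) for form in FORMS}

fi_terms = entropy_terms(fictitious_assumed)
en_terms = entropy_terms(english_assumed)

print("FI:", " + ".join(f"{v:.3f}" for v in fi_terms.values()),
      "=", round(sum(fi_terms.values()), 2))
print("EN:", sum(en_terms.values()))
```

Under these assumptions the fictitious language marks the upper bound for five forms and English the lower bound, with the three real target languages falling in between.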

[Figure: Computed entropy of constitutive norms (English, French, German, Italian and Fictitious)]

Entropy results
1. In the EU Regulation, according to the 5 categories of expression (1. Constitutive and performative norms, 2. Logical necessity, 3. Permissions and authorizations, 4. Capability, 5. Non-binding norms)
2. In the EU Secondary Legislation overall, according to the 4 types of documents (Regulations, Decisions, Directives, Recommendations)

[Figure: Entropy in the EU Regulation]

Entropy results: EU Regulation
• Logical necessity, permissions and authorizations, and capability (lower entropy)
  - quite standardized in the 4 languages = almost equivalent translations
• Constitutive performative norms (higher entropy)
  - translation is more difficult to predict
  - Definitions, constitutive statements, obligations
  - FR: lower entropy than IT
  - DE: higher entropy (VP sein/haben…zu)

Entropy results: EU Regulation
• Non-binding norms
  - a fair amount of variation among the 4 languages
  - FR/IT: higher entropy
  - DE: lower entropy (should is most likely translated with sollen: Soll-Vorschriften)

[Figure: Overall entropy across the 4 EU document types]

Entropy results: EU Secondary Legislation
• Regulations and Decisions (lower entropy)
  - Direct applicability of the norms = more precision and standardization
  - FR looks more standardized than IT and DE
• Directives (higher entropy than Regulations and Decisions)
  - Binding only as to the result to be achieved
• Recommendations (higher entropy)
  - Non-binding: more freedom
  - DE: sollen

Conclusions
• Given certain conditions, it is possible to predict with some certainty the occurrence of a particular factor
• If applied to repetitive texts, entropy analysis can enhance research in language testing and evaluation, and in the development of automated translation tools

References
• Austin, J. L. (1962) How to do things with words. Oxford: Oxford University Press.
• Coode, G. (1843) Legislative Expressions. Appendix to the Report of the Poor Law Commissioners on Local Taxation. Published separately 1845, 2nd Ed.
• Driedger, E. A. (1976) The Composition of legislation. Legislative forms and precedents (2nd Ed.). Ottawa: The Department of Justice.
• Shannon, C. and W. Weaver (1949) The mathematical theory of communication. Urbana: University of Illinois Press, USA.
• Thornton, G. C. (1996) Legislative Drafting (4th Ed.). London: Butterworths.