STATISTICAL METHODS IN COMPUTATIONAL LINGUISTICS Massimo Poesio, Università di Venezia



Course objectives
An introduction to the use of corpora and to statistical methods

Course plan
Foundations of statistics, use of corpora
Basic tasks & techniques: word prediction, n-grams, smoothing, spelling, Bayesian inference
POS tagging: tagsets, Brill tagger, HMM tagging
System evaluation
The lexicon
Probabilistic grammars, statistical parsing

Today
Statistics and Linguistics (Abney, 1996)
Foundations of probability
Corpora

Practical details
Schedule: 10:30-13, 14:30-17
Labs: 17 to 18 (not today)
Office hours: 9:30-10:30
Web page (temporary): csstaff.essex.ac.uk/staff/poesio/Courses/Venezia/Stat_NLP/

Empiricism vs. Rationalism
Chomskyan linguistics:
  – Assumption: linguistic knowledge is mostly innate
  – Emphasis on explanation
  – Primary goal: simplicity of the theory
Empirical methods:
  – Assumption: linguistic knowledge derives primarily from generalizations over experience
  – Emphasis on data
  – Primary goal: fact discovery
Computational Linguistics between 1960 and 1980 was mostly Chomskyan

Problems statistical methods are meant to address
Ambiguity resolution: the previous options were
  – Narrowing the domain to avoid ambiguity
  – Hand-coded rules
  – Hand-tuned preference weights
Adaptation to new domains
Measuring improvement

Case study: POS tagging
Word:           Time   flies   like     an    arrow
Possible tags:  N/V    N/V     V/N/CJ   Det   N
[Table with columns "Number of tags" and "Number of word types"; values not preserved in this transcript]
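To make the ambiguity concrete, the following minimal sketch enumerates every tag sequence allowed by the candidate tags listed on the slide for "Time flies like an arrow". The dictionary layout and the helper name `tag_sequences` are assumptions made for illustration; only the words and their candidate tags come from the slide.

```python
from itertools import product

# Candidate tags per word, copied from the slide
# (N = noun, V = verb, CJ = conjunction, Det = determiner).
CANDIDATE_TAGS = {
    "Time": ["N", "V"],
    "flies": ["N", "V"],
    "like": ["V", "N", "CJ"],
    "an": ["Det"],
    "arrow": ["N"],
}


def tag_sequences(words, lexicon):
    """Return every tag sequence the lexicon allows for the given words."""
    return list(product(*(lexicon[w] for w in words)))


sentence = ["Time", "flies", "like", "an", "arrow"]
sequences = tag_sequences(sentence, CANDIDATE_TAGS)
print(len(sequences))  # 2 * 2 * 3 * 1 * 1 = 12
```

Even this five-word sentence admits 12 candidate taggings, which is why a tagger needs some principled way of ranking them.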

The rise of statistical methods
Automatic Speech Recognition (ASR) was the first area in which statistical techniques truly proved their worth
ASR techniques were then applied to POS tagging, and later to all areas of CL
A synthesis of statistical methods and linguistic insights is now underway

Modern empiricism in Computational Linguistics
Large data collections
Rigorous collection techniques (interannotator agreement)
Rigorous evaluation techniques
Discovery of generalizations via learning techniques

Statistics & the study of language?
Theoretical advances
  – Language acquisition: the role of experience
  – Linguistic theory: graded grammaticality
  – Language change: shifts in grammaticality
Empirical
  – Quantify linguistic phenomena
  – Analyze data
  – Test hypotheses
Psychological
  – Express preferences

Some interesting statistics about language
Lexical biases
  – Category: bank = Noun 85%, Verb 15%
  – Sense: bank (river) 22%, bank (money) 78%
Syntax
  – Subcategorization of realised: NP 20%, S 65%, Other 15%
Semantics / discourse
  – he in subject position 65% of the time

Corpora
The use of statistical techniques has been made possible by the availability of CORPORA, large collections of text typically ANNOTATED with linguistic information:
  – The Brown corpus (1M words) and the British National Corpus (100 million words), annotated with POS tags (English)
  – Penn Treebank (4M words), syntactically annotated (English)
  – SEMCOR (250K words), annotated with word-sense information
  – The MapTask, annotated with dialogue information
  – Italian: CORIS (100M+ words, Bologna), Si-TAL (220K words, written, annotated with syntactic and word-sense information), IPAR (Italian MapTask)

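Basic uses of corpora: Collocations
COMPOUNDS: computer program, disk drive, calcio di rigore (Italian: "penalty kick")
PHRASAL VERBS: wake up, come on
PHRASAL EXPRESSIONS: bacon and eggs, the bee's knees, siamo alla frutta (Italian idiom, literally "we are at the fruit course", i.e. at the very end)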

Bigrams: New York
Frequency        Word 1   Word 2
[not preserved]  of       the
58841            in       the
26430            to       the
…                …        …
15494            to       be
…                …        …
12622            from     the
11428            New      York
…                …        …
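A frequency table like this one can be produced by counting adjacent word pairs. The sketch below is a minimal illustration, assuming whitespace tokenization and a toy sentence invented for the example; the helper name `bigram_counts` is hypothetical and the counts have nothing to do with the corpus behind the slide.

```python
from collections import Counter


def bigram_counts(tokens):
    """Count adjacent word pairs (bigrams) in a list of tokens."""
    return Counter(zip(tokens, tokens[1:]))


# Toy text, invented for illustration only.
text = "the mayor of New York flew to New York from the north of the state"
tokens = text.split()
counts = bigram_counts(tokens)

# Print the most frequent bigrams as: frequency, word 1, word 2.
for (w1, w2), freq in counts.most_common(3):
    print(freq, w1, w2)
```

On a real corpus the top of such a list is dominated by function-word pairs such as "of the", as the table shows, so raw frequency is usually combined with a filter or an association measure before a pair is treated as a collocation.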

Statistical Language Processing
Statistical inference:
  – Collect statistics about occurrences of X
  – Predict new occurrences
Example: language modeling
  – Problem: predict the word that follows, given the previous ones
  – Find the $W_n$ that maximizes $P(W_n \mid W_1 \ldots W_{n-1})$
Applications:
  – Speech recognition
  – Spell-checking
  – POS tagging
  – …
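The prediction problem above can be sketched in a few lines under a simplifying assumption that is not stated on this slide: the history is truncated to the single previous word (a bigram approximation), and probabilities are estimated by relative frequency with no smoothing. The function names and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram_model(tokens):
    """Map each word to a Counter of the words observed right after it."""
    followers = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        followers[prev][nxt] += 1
    return followers


def predict_next(followers, prev):
    """Return (word, probability) maximizing the relative-frequency
    estimate of P(word | prev), or None if prev was never followed by anything."""
    counts = followers.get(prev)
    if not counts:
        return None  # unseen history: no estimate without smoothing
    word, freq = counts.most_common(1)[0]
    return word, freq / sum(counts.values())


# Toy corpus, invented for illustration only.
corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram_model(corpus)
print(predict_next(model, "the"))
```

In the toy corpus "the" is followed by "cat" twice and "mat" once, so the model predicts "cat" with estimated probability 2/3. The `None` branch shows where an unsmoothed model gives up on histories it has never seen.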

Bibliography
Steven Abney, Statistical Methods and Linguistics, in Judith Klavans and Philip Resnik (eds.), The Balancing Act, The MIT Press, Cambridge, Mass., 1996.
Textbooks:
  – Daniel Jurafsky and James Martin, Speech and Language Processing, Prentice-Hall. More general, and easier to follow.
  – Christopher Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press. More complete, and written from a more linguistic perspective, but technically more advanced.