STATISTICAL METHODS IN COMPUTATIONAL LINGUISTICS. Massimo Poesio, Università di Venezia.

1 STATISTICAL METHODS IN COMPUTATIONAL LINGUISTICS
Massimo Poesio, Università di Venezia

2 Course objectives
An introduction to the use of corpora and to statistical methods

3 Course plan
Fundamentals of statistics, use of corpora
Basic tasks & techniques: word prediction, n-grams, smoothing, spelling, Bayesian inference
POS tagging: tagsets, Brill tagger, HMM tagging
System evaluation
The lexicon
Probabilistic grammars, statistical parsing

4 Today
Statistics and linguistics (Abney, 1996)
Fundamentals of probability
Corpora

5 Practical details
Schedule: 10:30-13, 14:30-17
Labs: 17 to 18 (not today)
Office hours: 9:30-10:30, 18-19
Email: poesio@essex.ac.uk
Web page (temporary): csstaff.essex.ac.uk/staff/poesio/Courses/Venezia/Stat_NLP/

6 Empiricism vs. Rationalism
Chomskyan linguistics:
– Assumption: linguistic knowledge mostly innate
– Emphasis on explanation
– Primary goal: simplicity of the theory
Empirical methods:
– Assumption: linguistic knowledge primarily derives from generalizations over experience
– Emphasis on data
– Primary goal: fact discovery
Computational Linguistics between 1960 and 1980 was mostly Chomskyan

7 Problems statistical methods are meant to address
Ambiguity resolution: previous approaches were
– Narrowing domains to avoid ambiguity
– Hand-coded rules
– Hand-tuned preference weights
Adaptation to new domains
Measuring improvement

8 Case study: POS tagging

Time   flies   like     an    arrow
N/V    N/V     V/N/CJ   Det   N

Number of tags         1       2      3     4    5    6   7
Number of word types   35340   3760   264   61   12   2   1
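
To make the scale of the ambiguity problem concrete, the short sketch below computes what share of word types has more than one possible tag. It takes the counts in the table above as 35340 types with one tag, 3760 with two, and so on up to a single type with seven tags. The function name `ambiguity_share` is made up for this illustration.

```python
# Counts read from the slide: number of possible tags -> number of word types.
types_by_tag_count = {1: 35340, 2: 3760, 3: 264, 4: 61, 5: 12, 6: 2, 7: 1}


def ambiguity_share(counts):
    """Return (total types, ambiguous types, ambiguous fraction)."""
    total = sum(counts.values())
    # A word type is ambiguous if it can carry more than one tag.
    ambiguous = sum(n for tags, n in counts.items() if tags > 1)
    return total, ambiguous, ambiguous / total


total, ambiguous, share = ambiguity_share(types_by_tag_count)
print(f"{ambiguous} of {total} word types are ambiguous ({share:.1%})")
# prints: 4100 of 39440 word types are ambiguous (10.4%)
```

This counts word types, not running words (tokens). Ambiguous types such as "like" tend to be frequent, so the share of ambiguous tokens in running text may be considerably higher.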

9 The rise of statistical methods
The first area in which statistical techniques truly proved their worth was Automatic Speech Recognition (ASR)
ASR techniques were then used for POS tagging, and later in all areas of CL
A synthesis of statistical methods and linguistic insights is now underway

10 Modern empiricism in Computational Linguistics
Large data collections
Rigorous collection techniques (interannotator agreement)
Rigorous evaluation techniques
Discovery of generalizations via learning techniques

11 Statistics & the study of language?
Theoretical advances
– Language acquisition: the role of experience
– Linguistic theory: graded grammaticality
– Language change: shifts in grammaticality
Empirical
– Quantify linguistic phenomena
– Analyze data
– Test hypotheses
Psychological
– Express preferences

12 Some interesting statistics about language
Lexical biases
– Category: "bank" = Noun 85%, Verb 15%
– Sense: bank (river) 22%, bank (money) 78%
Syntax
– Subcategorization of "realised": NP 20%, S 65%, Other 15%
Semantics / discourse
– "he" is in subject position 65% of the time

13 Corpora
The use of statistical techniques has been made possible by the availability of CORPORA, large collections of text typically ANNOTATED with linguistic information:
– The Brown corpus (1M words) and the British National Corpus (100 million words), annotated with POS tags (English)
– Penn Treebank (4M words), syntactically annotated (English)
– SEMCOR (250K words), annotated with word-sense information
– The MapTask, annotated with dialogue information
– Italian: CORIS (100M+ words, Bologna), Si-TAL (220K words, written, annotated with syntactic and word-sense information), IPAR (Italian MapTask)

14 Basic uses of corpora: Collocations
COMPOUNDS: computer program, disk drive, calcio di rigore (Italian: "penalty kick")
PHRASAL VERBS: wake up, come on
PHRASAL EXPRESSIONS: bacon and eggs, the bee's knees, siamo alla frutta (Italian idiom, roughly "we have hit rock bottom")

15 Bigrams: New York

Frequency   Word 1   Word 2
80871       of       the
58841       in       the
26430       to       the
…           …        …
15494       to       be
…           …        …
12622       from     the
11428       New      York
…           …        …
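
A frequency table like the one above can be produced by counting adjacent word pairs. The following is a minimal sketch. The toy sentence, the whitespace tokenization, and the helper name `bigram_counts` are assumptions made for illustration and do not come from the corpus behind the slide.

```python
from collections import Counter


def bigram_counts(tokens):
    """Count every pair of adjacent tokens."""
    return Counter(zip(tokens, tokens[1:]))


# Toy text (made up for this example), split on whitespace.
text = "the mayor of New York flew to New York to be in the city of New York"
tokens = text.split()

counts = bigram_counts(tokens)

# Print the bigrams as "frequency word1 word2", most frequent first.
for (w1, w2), freq in counts.most_common():
    print(freq, w1, w2)
```

On a real corpus, as the table suggests, raw frequency mostly surfaces function-word pairs such as "of the". Candidates like "New York" only appear further down the list, which is why frequency alone is a crude collocation measure.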

16 Statistical Language Processing
Statistical inference:
– Collect statistics about occurrences of X
– Predict new occurrences
Example: language modeling
– Problem: predict the word that follows, given the previous ones
– Find $W_n$ that maximizes $P(W_n \mid W_1 \ldots W_{n-1})$
Applications:
– Speech recognition
– Spell-checking
– POS tagging
– …
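
The two inference steps (collect statistics, then predict) can be sketched with a very small language model. As a simplifying assumption, the sketch conditions only on the single preceding word (a bigram model) instead of the full history $W_1 \ldots W_{n-1}$, and it uses plain relative frequencies with no smoothing. The toy corpus and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram_model(tokens):
    """Collect statistics: for each word, count the words seen right after it."""
    followers = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        followers[prev][nxt] += 1
    return followers


def next_word_probability(model, prev, word):
    """Relative-frequency estimate of P(word | prev); 0.0 if prev is unseen."""
    seen = model.get(prev, Counter())
    total = sum(seen.values())
    return seen[word] / total if total else 0.0


def predict_next(model, prev):
    """Predict: return the word maximizing P(word | prev), or None if prev is unseen."""
    seen = model.get(prev)
    if not seen:
        return None
    return seen.most_common(1)[0][0]


# Toy corpus (made up for this example).
tokens = "the cat sat on the mat and the cat slept".split()
model = train_bigram_model(tokens)

print(predict_next(model, "the"))                  # prints: cat
print(next_word_probability(model, "the", "cat"))  # 2 of the 3 continuations of "the"
print(predict_next(model, "dog"))                  # prints: None ("dog" was never seen)
```

The `None` result for an unseen word shows why unsmoothed counts are not enough in practice: any word or pair missing from the training data gets no prediction and zero probability.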

17 Bibliography
Steven Abney, Statistical Methods and Linguistics, in Judith Klavans and Philip Resnik (eds.), The Balancing Act, The MIT Press, Cambridge, Mass., 1996.
Textbooks:
– Daniel Jurafsky and James Martin, Speech and Language Processing, Prentice-Hall. More general and easier to follow.
– Christopher Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press. More complete and written from a more linguistic perspective, but technically more advanced.

