Corpus Linguistics 2007, University of Birmingham
Corpus-based evaluation of prosodic phrase break prediction
Claire Brierley and Eric Atwell
School of Computing, University of Leeds

Prosody and prosodic phrase breaks
PROSODY: emotion, stress, rhythm, pitch accents, intonation, phrasing
"In the popular mythology the computer is a mathematics machine: it is designed to do numerical calculations. Yet it is really a language machine: its fundamental power lies in its ability to manipulate linguistic tokens - symbols to which meaning has been assigned." Terry Winograd, 1984

Punctuation is a way of annotating phrase breaks in text...
PROSODY: emotion, stress, rhythm, pitch accents, intonation, phrasing
"In the popular mythology the computer is a mathematics machine: it is designed to do numerical calculations. Yet it is really a language machine: its fundamental power lies in its ability to manipulate linguistic tokens - symbols to which meaning has been assigned." Terry Winograd, 1984

...and is therefore one text-based feature used in automatic phrase break prediction
PROSODY: emotion, stress, rhythm, pitch accents, intonation, phrasing
In the popular mythology the computer is a mathematics machine | it is designed to do numerical calculations | Yet it is really a language machine | its fundamental power lies in its ability to manipulate linguistic tokens | symbols to which meaning has been assigned | Terry Winograd, 1984

Positional syntactic features: n-grams
Once upon a time | there will be a little girl called Uncumber. | Uncumber will have a younger brother called Sulpice | and they will live with their parents | in a house in the middle of the woods. |
- upon a time = trigram where we expect a boundary next
- the middle of = trigram which might include a boundary
- live with = bigram which might include a boundary
- girl called = bigram where we might have a boundary next, and which might also include a boundary...
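As a worked illustration of how such positional n-gram features could be extracted from break-annotated text, here is a minimal Python sketch; the '|' break marker follows the example above, but the function name and feature layout are my own assumptions, not the paper's.

    # Minimal sketch: collect the bigram context either side of each word
    # juncture in a break-annotated sentence, with a break/no-break label.
    def junctures(tokens):
        """Yield (left bigram, right bigram, break_here) for each juncture."""
        words = [t for t in tokens if t != "|"]
        breaks, i = set(), -1
        for t in tokens:
            if t == "|":
                breaks.add(i)                          # break after word index i
            else:
                i += 1
        for j in range(len(words) - 1):
            left = tuple(words[max(0, j - 1):j + 1])   # bigram ending at the juncture
            right = tuple(words[j + 1:j + 3])          # bigram starting after it
            yield left, right, (j in breaks)

    sent = "Once upon a time | there will be a little girl called Uncumber . |".split()
    for left, right, brk in junctures(sent):
        print(left, right, "BREAK" if brk else "no-break")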

Some top-class phrase break models
There are two generic approaches:
- Deterministic or rule-based: chink-chunk or CFP (Liberman & Church, 1992)
  They will live | with their parents | in a house | in the middle | of the woods |
- Probabilistic or statistical: e.g. as used in Festival (CSTR) (Taylor & Black, 1998)
  79% breaks-correct on MARSEC (Roach et al., 1993)
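For the deterministic approach, a rough Python sketch of the chink-chunk idea is given below: insert a break wherever a function-word "chink" follows a content-word "chunk". The simplified tag labels are illustrative assumptions, not Liberman & Church's actual rule set or tag inventory.

    # Rough sketch of a chink-chunk style predictor (simplified, illustrative tags;
    # not Liberman & Church's exact rules): break whenever a function-word "chink"
    # follows a content-word "chunk".
    CHINK_TAGS = {"PRON", "MD", "IN", "DET", "CC", "TO", "BE"}

    def chink_chunk_breaks(tagged):
        """tagged: list of (word, tag); return the text with '|' at predicted breaks."""
        out, prev_was_chunk = [], False
        for word, tag in tagged:
            is_chink = tag in CHINK_TAGS
            if is_chink and prev_was_chunk:
                out.append("|")
            out.append(word)
            prev_was_chunk = not is_chink
        out.append("|")                                # sentence-final boundary
        return " ".join(out)

    print(chink_chunk_breaks([("They", "PRON"), ("will", "MD"), ("live", "VB"),
                              ("with", "IN"), ("their", "DET"), ("parents", "NNS"),
                              ("in", "IN"), ("a", "DET"), ("house", "NN"),
                              ("in", "IN"), ("the", "DET"), ("middle", "NN"),
                              ("of", "IN"), ("the", "DET"), ("woods", "NNS")]))
    # -> They will live | with their parents | in a house | in the middle | of the woods |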

Shallow or chunk parsing
Source: [S [PP [IN In] [NP [AT the] [JJ popular] [NN mythology]]] [NP [AT the] [NN computer]] [VP [BEZ is] [NP [AT a] [NN mathematics] [NN machine.]]]]
In the popular mythology | the computer is a mathematics machine.
Chunk parse rule, using NLTK version 0.6:
  parse.ChunkRule('<IN|DT|DTI|AT|AP|CD|OD|PPO|PN|POSS|JJ|JJT|JJS|NP|NN|NNS>+',
    <i.e. sequences of prepositions, determiners, numbers, certain pronouns, adjectives and nouns - and these can be in any order>)
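For readers on later NLTK releases, a roughly equivalent rule can be written with nltk.RegexpParser rather than the 0.6 parse.ChunkRule API shown above; the CHUNK label and toy sentence below are illustrative assumptions, not the exact setup used in this work.

    # Sketch of a comparable chunk rule using the later nltk.RegexpParser API
    # (the CHUNK label and toy example are illustrative, not the paper's setup).
    import nltk

    grammar = r"""
      CHUNK: {<IN|DT|DTI|AT|AP|CD|OD|PPO|PN|POSS|JJ|JJT|JJS|NP|NN|NNS>+}
    """
    chunker = nltk.RegexpParser(grammar)

    tagged = [("In", "IN"), ("the", "AT"), ("popular", "JJ"), ("mythology", "NN"),
              ("the", "AT"), ("computer", "NN"), ("is", "BEZ"),
              ("a", "AT"), ("mathematics", "NN"), ("machine", "NN"), (".", ".")]
    print(chunker.parse(tagged))       # prints the resulting chunk tree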

The classification task: rules or features? break or non-break?
Task: to classify junctures between words
Train the model on a gold standard speech corpus: training data = PoS tags + boundary tags
Test the model on an unseen test set, using quantitative metrics: % boundaries correct? % insertion and deletion errors?
Model type: deterministic or probabilistic?
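To make the probabilistic framing concrete, here is a minimal sketch of juncture classification using an off-the-shelf NLTK Naive Bayes classifier; the toy training data and feature names are illustrative assumptions, not the features or learner actually evaluated in this work (which uses WEKA, as described later).

    # Minimal sketch of juncture classification (toy data; illustrative features).
    import nltk

    def juncture_features(left_tag, right_tag):
        return {"left_tag": left_tag, "right_tag": right_tag,
                "pair": left_tag + "+" + right_tag}

    # training instances: (features at a word juncture, break / no-break label)
    train = [
        (juncture_features("NN", "IN"), "break"),
        (juncture_features("NNS", "CC"), "break"),
        (juncture_features("AT", "NN"), "no-break"),
        (juncture_features("JJ", "NN"), "no-break"),
    ]
    model = nltk.NaiveBayesClassifier.train(train)
    print(model.classify(juncture_features("NN", "CC")))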

Variant phrasing strategies and templates
Gold standard corpus version has lots of major boundaries:
Given the state of lawlessness | that exists in Lebanon || the uninformed outsider might reasonably expect security | at Beirut airport || to be amongst the tightest in the world || but the opposite is true ||
Rule-based variant:
Given the state | of lawlessness | that exists | in Lebanon the uninformed outsider | might reasonably expect security | at Beirut airport | to be | amongst the tightest in the world | but the opposite is true |
Score on this sentence: Recall = 83.33%; Precision = 55.55%
Aix-MARSEC Corpus: annotated transcript of 1980s BBC news commentary
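The recall and precision figures above can be reproduced by comparing the set of predicted break positions with the gold-standard set (treating both | and || as breaks), as in the sketch below; the function name is my own.

    # Sketch: score predicted break positions against the gold standard.
    def break_positions(text):
        """Return the set of word indices after which a break is marked."""
        positions, i = set(), -1
        for tok in text.split():
            if tok in ("|", "||"):
                positions.add(i)
            else:
                i += 1
        return positions

    gold = break_positions("Given the state of lawlessness | that exists in Lebanon || "
                           "the uninformed outsider might reasonably expect security | "
                           "at Beirut airport || to be amongst the tightest in the world || "
                           "but the opposite is true ||")
    pred = break_positions("Given the state | of lawlessness | that exists | in Lebanon "
                           "the uninformed outsider | might reasonably expect security | "
                           "at Beirut airport | to be | amongst the tightest in the world | "
                           "but the opposite is true |")
    hits = len(gold & pred)
    print("Recall = %.2f%%" % (100.0 * hits / len(gold)))      # 83.33% (5 of 6)
    print("Precision = %.2f%%" % (100.0 * hits / len(pred)))   # 55.56% (5 of 9)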

Variant phrasing strategies and templates
Gold standard corpus version has lots of major boundaries:
Given the state of lawlessness | that exists in Lebanon || the uninformed outsider might reasonably expect security | at Beirut airport || to be amongst the tightest in the world || but the opposite is true ||
Intuitive prosodic phrasing:
Given the state of lawlessness that exists in Lebanon | the uninformed outsider | might reasonably expect | security | at Beirut airport | to be amongst the tightest in the world | but the opposite is true |
Score on this sentence: Recall = 83.33%; Precision = 71.43%
"..the very notion of evaluating a phrase-break model against a gold standard is problematic as long as the gold standard only represents one out of the space of all acceptable phrasings.." (Atterer and Klein, 2002)

Current work: developing a prosody lexicon
Incoming corpus text is already PoS-tagged, in the format of a list of tuples: [..., (gone, VBN), ...]
Intersection with a Python dictionary adds further tags, e.g. CFP class and stress pattern: [..., (gone, VBN, C, 1), ...]
These tags are text-based features.
Sources used:
1. Computer-usable dictionary CUVPlus (Pedler, 2002), which incorporates C5 PoS tags
2. Lexical stress patterns derived from the CELEX2 database (Baayen et al., 1995) and the Carnegie Mellon Pronouncing Dictionary (CMU, 1998)

Lexicon fields - and lookup
Python dictionary syntax stores the above information as (key, value) pairs:
{ (cascades, NN2) : [0, k&'skeIdz, Kj%, NN2:1, 2, 01, C],
  (cascades, VVZ) : [0, k&'skeIdz, Ia%, VVZ:-1, 2, 01, C] }
Incoming corpus text - also in the form of (token, tag) tuples - can be matched against dictionary keys.
Intersection thus enables corpus text to accumulate additional values, which have the potential to become features for machine learning tasks.
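A minimal sketch of this lookup in Python is shown below; the lexicon entries are copied from the slide above, while the variable names and the unmatched second token are illustrative assumptions.

    # Sketch: enrich (token, tag) corpus tuples with prosody-lexicon values.
    lexicon = {
        ("cascades", "NN2"): [0, "k&'skeIdz", "Kj%", "NN2:1", 2, "01", "C"],
        ("cascades", "VVZ"): [0, "k&'skeIdz", "Ia%", "VVZ:-1", 2, "01", "C"],
    }

    corpus = [("cascades", "NN2"), ("tumble", "VVB")]   # incoming (token, tag) tuples

    enriched = []
    for token, tag in corpus:
        values = lexicon.get((token, tag))              # None if the key is absent
        enriched.append((token, tag) + tuple(values) if values else (token, tag))

    print(enriched)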

What I'd like to achieve
1. Develop phrase break predictors representative of the two generic approaches - rule-based and probabilistic - and compare their performance.
2. Use the WEKA toolkit plus training data from the Aix-MARSEC corpus (Auran et al., 2004), which has linguistically sophisticated prosodic annotations, to explore a new mix of features for machine learning of phrase break prediction. This is where the prosody lexicon comes in.
3. Develop a purpose-built corpus of different text genres and different annotation schemes to moderate the process of evaluating these phrase break models against one prosodic template.
4. If I can develop a good model, a possible contribution to the Aix-MARSEC project may be to enrich this gold standard by generating alternative prosodic markup to the corpus linguist's analysis. Outputs from the model would potentially represent legitimate phrasing strategies, variants on those already uncovered, and provide new prosodic templates for the evaluation of phrase break models.

Example problem - still working on it!
Input text: list of (token, tag) tuples
[..., ('that', 'CS'), ('individual', 'JJ'), ('willingness', 'NN'), ('to', 'TO'), ('pay', 'VB'), ('should', 'MD'), ('be', 'BE'), ('the', 'ATI'), ('main', 'JJB'), ('test', 'NN'), ('of', 'IN'), ('how', 'WRB'), ('resources', 'NNS'), ('are', 'BER'), ('used', 'VBN'), ('.', '.'), ...]
SEC: annotated transcript of a Reith Lecture
Input text is temporarily tagged with C5 for lexicon lookup.
Mapping C5 to LOB is usually a case of one-to-many. However, C5 has separate tags for 'that' and 'of' - a case of many-to-one:
- CJS (subordinating conjunction) or CJT (that) -> CS
- PRP (preposition) or PRF (of) -> IN
Need to resolve this to accomplish Python dictionary lookup (preferred option) or use a different lookup mechanism (hopefully not!).
The problem is compounded by the introduction of different PoS tag sets as a consequence of the planned composite test corpus.
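One possible way round the many-to-one cases, sketched below, is to map each problematic LOB tag to its candidate C5 tags and try each candidate key in turn; the mapping covers only the two cases mentioned above, and the toy lexicon entries are made up for illustration.

    # Sketch: try candidate C5 tags for an LOB-tagged token at lookup time.
    LOB_TO_C5 = {
        "CS": ["CJS", "CJT"],   # subordinating conjunctions; 'that' is CJT in C5
        "IN": ["PRP", "PRF"],   # prepositions; 'of' is PRF in C5
    }

    def lookup(lexicon, token, lob_tag):
        """Try each candidate C5 tag for this LOB tag until a lexicon key matches."""
        for c5_tag in LOB_TO_C5.get(lob_tag, [lob_tag]):
            entry = lexicon.get((token, c5_tag))
            if entry is not None:
                return entry
        return None

    toy_lexicon = {("that", "CJT"): ["toy entry 1"], ("of", "PRF"): ["toy entry 2"]}
    print(lookup(toy_lexicon, "that", "CS"))
    print(lookup(toy_lexicon, "of", "IN"))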