
1 Corpus Linguistics 2007, University of Birmingham Corpus-based evaluation of prosodic phrase break prediction Claire Brierley and Eric Atwell School of Computing, University of Leeds

2 Prosody and prosodic phrase breaks
PROSODY: emotion, stress, rhythm, pitch accents, intonation, phrasing
In the popular mythology the computer is a mathematics machine: it is designed to do numerical calculations. Yet it is really a language machine: its fundamental power lies in its ability to manipulate linguistic tokens - symbols to which meaning has been assigned.
Terry Winograd, 1984

3 Punctuation is a way of annotating phrase breaks in text..
PROSODY: emotion, stress, rhythm, pitch accents, intonation, phrasing
In the popular mythology the computer is a mathematics machine: it is designed to do numerical calculations. Yet it is really a language machine: its fundamental power lies in its ability to manipulate linguistic tokens - symbols to which meaning has been assigned.
Terry Winograd, 1984

4 ..and is therefore one text-based feature used in automatic phrase break prediction
PROSODY: emotion, stress, rhythm, pitch accents, intonation, phrasing
In the popular mythology the computer is a mathematics machine| it is designed to do numerical calculations| Yet it is really a language machine| its fundamental power lies in its ability to manipulate linguistic tokens| symbols to which meaning has been assigned|
Terry Winograd, 1984
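The idea on slides 3 and 4 can be sketched in a few lines of Python: treat every punctuation mark as a predicted phrase break. This is only a minimal illustration, not the authors' implementation; the function names `punctuation_breaks` and `render` and the set of marks in `PUNCT_CHARS` are assumptions made for the example.

```python
# Marks treated as break signals (an assumed set, chosen to cover the slide text).
PUNCT_CHARS = ".,:;!?-"

def punctuation_breaks(text):
    """Return (word, break_follows) pairs, predicting a break at each punctuation mark."""
    junctures = []
    for token in text.split():
        word = token.rstrip(PUNCT_CHARS)
        if not word:
            # Token is punctuation only (e.g. a free-standing dash):
            # mark a break after the previous word.
            if junctures:
                prev_word, _ = junctures[-1]
                junctures[-1] = (prev_word, True)
            continue
        # A break follows the word if trailing punctuation was stripped from it.
        junctures.append((word, word != token))
    return junctures

def render(junctures):
    """Print the words with '|' attached wherever a break is predicted."""
    return " ".join(word + ("|" if brk else "") for word, brk in junctures)

text = ("Yet it is really a language machine: its fundamental power lies in its "
        "ability to manipulate linguistic tokens - symbols to which meaning has "
        "been assigned.")
print(render(punctuation_breaks(text)))
```

Punctuation alone under-predicts breaks in long unpunctuated stretches, which is why it is described as one feature among several.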

5 Positional syntactic features: n-grams
Once upon a time | there will be a little girl called Uncumber. | Uncumber will have a younger brother called Sulpice | and they will live with their parents | in a house in the middle of the woods. |
upon a time = trigram where we expect a boundary next
the middle of = trigram which might include a boundary
live with = bigram which might include a boundary
girl called = bigram where we might have a boundary next and which might also include a boundary…
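As a rough sketch of how such positional n-gram features might be collected, the code below reads `|`-marked text and pairs each juncture with the n-gram of words that ends just before it. The helper names (`parse_marked`, `ngram_before`, `juncture_features`) are hypothetical, and only the "n-gram before the juncture" case is covered; n-grams that span a boundary would need a second feature.

```python
def parse_marked(text):
    """Split '|'-marked text into lowercased words and the set of word
    indices that are followed by a boundary."""
    words, breaks = [], set()
    for tok in text.split():
        if tok == "|":
            if words:
                breaks.add(len(words) - 1)
        else:
            words.append(tok.strip(".,").lower())
    return words, breaks

def ngram_before(words, i, n):
    """The n-gram ending at word i, or None if there are fewer than n words so far."""
    start = i - n + 1
    return tuple(words[start:i + 1]) if start >= 0 else None

def juncture_features(words, breaks, n=3):
    """One (n-gram, boundary_follows) row per word position."""
    return [(ngram_before(words, i, n), i in breaks) for i in range(len(words))]

words, breaks = parse_marked(
    "Once upon a time | there will be a little girl called Uncumber. |")
for ngram, is_break in juncture_features(words, breaks):
    if is_break:
        print(ngram, "-> boundary next")
```

In practice the n-grams would usually be built over PoS tags rather than words, since word n-grams are too sparse to generalise from a small corpus.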

6 Some top class phrase break models
There are 2 generic approaches:
Deterministic or rule-based: chink chunk or CFP (Liberman & Church, 1992)
They will live | with their parents | in a house | in the middle | of the woods |
Probabilistic or statistical: e.g. as used in Festival (CSTR) (Taylor & Black, 1998)
79% breaks-correct on MARSEC (Roach, P. et al, 1993)
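The chink chunk rule can be illustrated with a small sketch: a break is placed wherever a content word (chunk) is followed by a function word (chink), and at the end of the utterance. The word list `FUNCTION_WORDS` below is a toy assumption that covers only this example sentence; a real system would normally decide chink versus chunk from PoS tags.

```python
# Toy function-word (chink) list, assumed for illustration only.
FUNCTION_WORDS = {"they", "will", "with", "their", "in", "a", "the", "of"}

def chink_chunk(words, function_words=FUNCTION_WORDS):
    """Insert '|' after a content word that precedes a function word,
    and after the final word of the utterance."""
    out = []
    last = len(words) - 1
    for i, word in enumerate(words):
        out.append(word)
        is_chunk = word.lower() not in function_words
        next_is_chink = i < last and words[i + 1].lower() in function_words
        if i == last or (is_chunk and next_is_chink):
            out.append("|")
    return " ".join(out)

sentence = "They will live with their parents in a house in the middle of the woods"
print(chink_chunk(sentence.split()))
```

On this sentence the sketch reproduces the phrasing shown on the slide, which also shows the rule's tendency to produce many short phrases.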

7 Shallow or chunk parsing
Source: http://ironcreek.net/phpsyntaxtree/
[S [PP [IN In] [NP [AT the] [JJ popular] [NN mythology]]] [NP [AT the] [NN computer]] [VP [BEZ is] [NP [AT a] [NN mathematics] [NN machine.]]]]
In the popular mythology | the computer is a mathematics machine.
Chunk parse rule - using NLTK version 0.6: parse.ChunkRule(' +', )

8 The classification task
Task: to classify junctures between words (break or non-break?)
Train the model on gold standard speech corpus: training data = PoS tags + boundary tags
Test the model: unseen test set, quantitative metrics (% boundaries correct? % insertion & deletion errors?)
Model type: deterministic or probabilistic? (rules or features?)
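To make the classification task concrete, here is a deliberately simple probabilistic sketch: each juncture is described by the PoS tags on either side, and the model predicts the label seen most often for that tag pair in training. This is not the Festival model or any method from the slides; the data in `GOLD`, the simplified tags, and the function names are all assumptions for illustration.

```python
from collections import Counter, defaultdict

# Toy gold-standard data: (word, PoS tag, break_after). The tags are simplified
# placeholders chosen for illustration, not a real corpus tag set.
GOLD = [
    ("They", "PRON", False), ("will", "MD", False), ("live", "VB", True),
    ("with", "IN", False), ("their", "DET", False), ("parents", "NN", True),
    ("in", "IN", False), ("a", "DET", False), ("house", "NN", True),
]

def junctures(tagged):
    """One instance per gap between adjacent words:
    ((left_tag, right_tag), break_or_not)."""
    return [((tagged[i][1], tagged[i + 1][1]), tagged[i][2])
            for i in range(len(tagged) - 1)]

def train(instances):
    """Count how often each tag pair is labelled break / non-break."""
    counts = defaultdict(Counter)
    for features, label in instances:
        counts[features][label] += 1
    return counts

def predict(model, features, default=False):
    """Most frequent label for this tag pair; fall back to non-break if unseen."""
    if features not in model:
        return default
    return model[features].most_common(1)[0][0]

model = train(junctures(GOLD))
print(predict(model, ("NN", "IN")))   # noun followed by preposition
print(predict(model, ("DET", "NN")))  # determiner followed by noun
```

A real experiment would train and test on disjoint portions of the corpus and report the quantitative metrics listed on the slide.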

9 Variant phrasing strategies and templates
Gold standard corpus version has lots of major boundaries
Given the state of lawlessness | that exists in Lebanon || the uninformed outsider might reasonably expect security | at Beirut airport || to be amongst the tightest in the world || but the opposite is true ||
Rule-based variant
Given the state | of lawlessness | that exists | in Lebanon the uninformed outsider | might reasonably expect security | at Beirut airport | to be | amongst the tightest in the world | but the opposite is true |
Score on this sentence: Recall = 83.33%; Precision = 55.55%
Aix-MARSEC Corpus: annotated transcript of 1980s BBC news commentary

10 Variant phrasing strategies and templates
Gold standard corpus version has lots of major boundaries
Given the state of lawlessness | that exists in Lebanon || the uninformed outsider might reasonably expect security | at Beirut airport || to be amongst the tightest in the world || but the opposite is true ||
Intuitive prosodic phrasing
Given the state of lawlessness that exists in Lebanon | the uninformed outsider | might reasonably expect | security | at Beirut airport | to be amongst the tightest in the world | but the opposite is true |
Score on this sentence: Recall = 83.33%; Precision = 71.43%
"..the very notion of evaluating a phrase-break model against a gold standard is problematic as long as the gold standard only represents one out of the space of all acceptable phrasings.." (Atterer and Klein, 2002)
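The recall and precision figures on slides 9 and 10 can be checked with a short sketch. It assumes that minor (`|`) and major (`||`) boundaries both count simply as a break, and that a predicted break is correct when the gold standard has a break after the same word. The function names are illustrative.

```python
def break_positions(marked):
    """Indices of words that are followed by a boundary ('|' or '||')."""
    positions, n_words = set(), 0
    for tok in marked.split():
        if tok in ("|", "||"):
            positions.add(n_words - 1)
        else:
            n_words += 1
    return positions

def score(gold, predicted):
    """Return (recall, precision) of predicted breaks against gold breaks."""
    g, p = break_positions(gold), break_positions(predicted)
    hits = len(g & p)
    return hits / len(g), hits / len(p)

GOLD = ("Given the state of lawlessness | that exists in Lebanon || the uninformed "
        "outsider might reasonably expect security | at Beirut airport || to be "
        "amongst the tightest in the world || but the opposite is true ||")
RULE_BASED = ("Given the state | of lawlessness | that exists | in Lebanon the "
              "uninformed outsider | might reasonably expect security | at Beirut "
              "airport | to be | amongst the tightest in the world | but the "
              "opposite is true |")
INTUITIVE = ("Given the state of lawlessness that exists in Lebanon | the uninformed "
             "outsider | might reasonably expect | security | at Beirut airport | to "
             "be amongst the tightest in the world | but the opposite is true |")

for name, variant in (("rule-based", RULE_BASED), ("intuitive", INTUITIVE)):
    recall, precision = score(GOLD, variant)
    print(f"{name}: recall = {recall:.2%}, precision = {precision:.2%}")
```

Both variants recover 5 of the 6 gold breaks; the rule-based variant predicts 9 breaks (5/9) and the intuitive one 7 (5/7). Note that 5/9 is 55.56% when rounded; the slide's 55.55% is the truncated value.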

11 Current work: developing a prosody lexicon
Incoming corpus text is already PoS-tagged; format: list of tuples [..(gone, VBN),..]
Intersection with Python dictionary
Get some more tags, e.g. CFP, stress pattern: [..(gone, VBN, C, 1),..]
These tags are text-based features
Sources used:
1. Computer-usable dictionary CUVPlus (Pedler, 2002) - incorporates C5 PoS tags
2. Lexical stress patterns derived from CELEX2 database (Baayen et al, 1995) and Carnegie-Mellon Pronouncing dictionary (CMU, 1998)

12 Lexicon fields - and lookup
Python dictionary syntax stores the above information as (key, value) pairs
{ (cascades, NN2) : [0, k&'skeIdz, Kj%, NN2:1, 2, 01, C]
  (cascades, VVZ) : [0, k&'skeIdz, Ia%, VVZ:-1, 2, 01, C] }
Incoming corpus text - also in the form of (token, tag) tuples - can be matched against dictionary keys
Thus intersection enables corpus text to accumulate additional values which have the potential to become features for machine learning tasks
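The lookup described here might be sketched as follows. The two entries are copied from the slide and stored as strings; the function name `enrich` is hypothetical, and the code assumes that the last two fields of each entry are the stress pattern and the content/function tag, as the (gone, VBN, C, 1) example on slide 11 suggests.

```python
# Lexicon keyed on (token, C5 tag); entries taken from the slide, stored as strings.
LEXICON = {
    ("cascades", "NN2"): ["0", "k&'skeIdz", "Kj%", "NN2:1", "2", "01", "C"],
    ("cascades", "VVZ"): ["0", "k&'skeIdz", "Ia%", "VVZ:-1", "2", "01", "C"],
}

def enrich(tagged, lexicon=LEXICON):
    """Extend each (token, tag) tuple with (content/function tag, stress pattern).

    Assumes the final two fields of a lexicon entry are the stress pattern and
    the content/function tag. Tokens missing from the lexicon get None values.
    """
    enriched = []
    for token, tag in tagged:
        entry = lexicon.get((token.lower(), tag))
        if entry is None:
            enriched.append((token, tag, None, None))
        else:
            enriched.append((token, tag, entry[-1], entry[-2]))
    return enriched

print(enrich([("cascades", "VVZ"), ("Cascades", "NN2"), ("unlisted", "NN1")]))
```

Keying on the (token, tag) pair rather than the token alone keeps homographs such as the noun and verb readings of "cascades" apart.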

13 What I'd like to achieve
1. Develop phrase break predictors representative of two generic approaches, rule-based and probabilistic, and compare their performance.
2. Use the WEKA toolkit plus training data from the Aix-MARSEC corpus (Auran et al, 2004), which has linguistically sophisticated prosodic annotations, to explore a new mix of features for machine learning of phrase break prediction. This is where the prosody lexicon comes in.
3. Develop a purpose-built corpus of different text genres and different annotation schemes to moderate the process of evaluating these phrase break models against one prosodic template.
4. If I can develop a good model, then a possible contribution to the Aix-MARSEC project may be to enrich this gold standard by generating alternative prosodic markup to the corpus linguists' analysis. Outputs from the model would potentially represent legitimate, variant phrasing strategies to those already uncovered and provide new prosodic templates for the evaluation of phrase break models.

14 Example problem - still working on it!
Input text: list of (token, tag) tuples
[.., ('that', 'CS'), ('individual', 'JJ'), ('willingness', 'NN'), ('to', 'TO'), ('pay', 'VB'), ('should', 'MD'), ('be', 'BE'), ('the', 'ATI'), ('main', 'JJB'), ('test', 'NN'), ('of', 'IN'), ('how', 'WRB'), ('resources', 'NNS'), ('are', 'BER'), ('used', 'VBN'), ('.', '.'), ..]
SEC: annotated transcript of Reith Lecture
Input text is temporarily tagged with C5 for lexicon lookup
Mapping C5 → LOB is usually a case of one-to-many
However, C5 has separate tags for 'that' and 'of' - a case of many-to-one:
CJS (subordinating conjunction) or CJT ('that') → CS
PRP (preposition) or PRF ('of') → IN
Need to resolve this to accomplish Python dictionary lookup (preferred option) or use a different lookup mechanism (hopefully not!)
Problem compounded with the introduction of different PoS tag sets as a consequence of the planned composite test corpus
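One possible way to keep the dictionary lookup despite the many-to-one cases is to try each candidate C5 tag for a LOB tag against the lexicon keys. This is only a sketch of that idea, not the authors' solution. The two mappings in `LOB_TO_C5` come from the slide; the lexicon contents and the function name `lookup` are placeholders invented for the example.

```python
# Candidate C5 tags for the two ambiguous LOB tags named on the slide.
LOB_TO_C5 = {
    "CS": ["CJS", "CJT"],
    "IN": ["PRP", "PRF"],
}

# Placeholder lexicon keyed on (token, C5 tag); the values are dummy strings,
# not real lexicon entries.
TOY_LEXICON = {
    ("that", "CJT"): "entry for 'that'",
    ("of", "PRF"): "entry for 'of'",
    ("because", "CJS"): "entry for 'because'",
    ("with", "PRP"): "entry for 'with'",
}

def lookup(token, lob_tag, lexicon, mapping=LOB_TO_C5):
    """Try each candidate C5 tag in turn; return (c5_tag, entry) or None."""
    for c5_tag in mapping.get(lob_tag, []):
        key = (token.lower(), c5_tag)
        if key in lexicon:
            return c5_tag, lexicon[key]
    return None

for token, lob_tag in [("that", "CS"), ("of", "IN"), ("because", "CS")]:
    print(token, lob_tag, "->", lookup(token, lob_tag, TOY_LEXICON))
```

The token itself disambiguates here because the lexicon lists 'that' only under CJT and 'of' only under PRF. Tokens with entries under more than one candidate tag would still need a tie-breaking rule, and each additional tag set in a composite corpus would need its own mapping table.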

