WP4: Normalization of Transcriptions. From Transcriptions to Subtitles. Erik Tjong Kim Sang, University of Antwerp
Subtasks of this work package: data collection and automatic alignment; input/output specification; automatic subtitling, statistical approach; automatic subtitling, linguistic approach. The third and the fourth task will be combined: the project will produce a hybrid system.
Data Collection and Automatic Alignment. Goal: collect training data for the other parts of this work package, and develop, test and apply a sentence alignment method to these data. Output: alignment software and a parallel corpus.
Example. Transcript: "Over deze man gaat het, Noël Slangen, communicatieadviseur van premier Verhofstadt, en eigenaar van een communicatiebedrijf." ("This is about this man, Noël Slangen, communication adviser to Prime Minister Verhofstadt and owner of a communications company.") Subtitle: "Noel Slangen is communicatie-adviseur van Verhofstadt. Hij heeft een communicatiebedrijf." ("Noel Slangen is Verhofstadt's communication adviser. He has a communications company.")
Why is sentence alignment not trivial? A single sentence in one file may need to be combined with more than one sentence in the corresponding file. Sentences in the transcribed text may not be present in the subtitle text for space reasons. Parts of the text in either of the files may be missing (interviews, foreign language, on-screen lists).
Alignment. The standard method for aligning translated sentences (Gale and Church) does not work here because there are many gaps in the text. Instead, we used a method that estimates the probability that sentences belong together based on the words they contain. The system benefits from making several passes over the data.
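The word-based scoring can be sketched as follows. This is a minimal illustration only (Dice word overlap, a single greedy pass, a made-up threshold), not the project's actual multi-pass algorithm; function names and the threshold value are assumptions.

```python
# Minimal sketch of word-overlap sentence alignment (illustrative only,
# not the ATraNoS implementation). Each transcript sentence is linked to
# the subtitle sentence with the highest word-overlap score, provided the
# score clears a threshold; sentences below it stay unaligned (gaps).

def overlap_score(s1, s2):
    """Dice coefficient over lowercased word sets."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    if not w1 or not w2:
        return 0.0
    return 2 * len(w1 & w2) / (len(w1) + len(w2))

def align(transcript, subtitles, threshold=0.3):
    """Return a list of (transcript_index, subtitle_index) links."""
    links = []
    for i, t in enumerate(transcript):
        best, j = max((overlap_score(t, s), j)
                      for j, s in enumerate(subtitles))
        if best >= threshold:  # below threshold: no matching subtitle
            links.append((i, j))
    return links

transcript = ["Wel, onze collega's hebben het onderzoek afgesloten.",
              "Dit stuk heeft geen ondertitel."]
subtitles = ["Onze collega's hebben het onderzoek afgesloten."]
print(align(transcript, subtitles))  # → [(0, 0)]
```

The second transcript sentence shares no words with any subtitle, so it is left unlinked, which models the gaps that break the length-based Gale and Church approach.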
Collecting data Teletext subtitles have been stored daily since December 2001 for the main Flemish news broadcast (VRT 19:00) and the Flemish soap Thuis. Some transcripts of these programmes have been supplied to us by the VRT. A year of Dutch news (NOS 20:00), both subtitles and autocues, has been donated by the University of Twente in The Netherlands.
Processing the data. All files, except the VRT news transcriptions (HTML), have been converted to XML. Punctuation signs have been separated from words, and sentence boundaries have been marked. Sentences in the available transcript files have been aligned with the corresponding subtitles. All alignment structures have been checked manually.
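The tokenization step can be illustrated like this. The functions, the `<s>`/`</s>` boundary markers and the naive splitting rules are assumptions about what such preprocessing could look like, not the project's actual code.

```python
import re

# Illustrative preprocessing sketch (assumed, not project code):
# separate punctuation from words and mark sentence boundaries,
# as is done before the alignment step.

def tokenize(text):
    """Put spaces around punctuation so it becomes separate tokens."""
    return re.sub(r"([.,:;!?])", r" \1 ", text).split()

def mark_sentences(text):
    """Very rough splitter: break after ., ! or ? followed by space."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return ["<s> " + " ".join(tokenize(s)) + " </s>" for s in sentences]

print(mark_sentences("Hij heeft een bedrijf. Dat is bekend."))
```

A real pipeline needs abbreviation handling and XML escaping; this only shows the punctuation-separation idea.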
Alignment software performance. Precision and recall have been computed over pairs of sentences; F(ß=1) is the harmonic mean of these.

Corpus       Precision   Recall   F(ß=1)
VRT          93.5%       85.3%    89.2
NOS 1999     87.6%       87.3%    87.4
NOS 2002     90.1%       86.8%    88.4
Thuis 2000   75.7%       45.1%    56.5
Thuis 2002   91.4%       95.6%    93.5
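The F(ß=1) column is the standard harmonic mean of precision and recall; a small helper reproduces the table rows:

```python
# F(beta) measure as used on this slide: with beta = 1 it is the
# harmonic mean of precision and recall.

def f_measure(precision, recall, beta=1.0):
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Reproducing the VRT row: P = 93.5, R = 85.3
print(round(f_measure(93.5, 85.3), 1))  # → 89.2
```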
Corpus size. Sizes have been measured in numbers of words. A word has been defined as a string containing at least one of the characters in [A-Za-z0-9].

Corpus       Parallel   Compression rate   Subtitles
VRT          189,983    76.7%              993,102
NOS 1999     195,491    73.7%              -
NOS 2002     17,685     71.6%              -
Thuis 2000   15,019     88.1%              -
Thuis 2002   5,368      80.4%              200,358
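The word definition on this slide translates directly into a counting helper (an illustrative function, not project code):

```python
import re

# A "word" per the slide's definition: a whitespace-separated string
# containing at least one character from [A-Za-z0-9].

def count_words(text):
    return sum(1 for tok in text.split() if re.search(r"[A-Za-z0-9]", tok))

# Standalone punctuation tokens are not counted as words.
print(count_words("Hij heeft een communicatiebedrijf . -- !"))  # → 4
```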
Automatic Subtitling from Transcriptions: Statistical Approach. A baseline experiment has been performed: a memory-based learner was trained to predict subtitle words given the words in the transcripts. The learner obtained an accuracy of 71.7% on two files of Thuis, an improvement over the strategy of keeping all words (67.3%). Problem: the learner requires word-aligned texts.
Output example. Transcript: "Wel, onze collega’s in Sankt-Vith die hebben het onderzoek daar afgesloten." ("Well, our colleagues in Sankt-Vith, they have closed the investigation there.") Reference subtitle: "Onze collega’s in Sankt-Vith hebben het onderzoek daar afgesloten." System output: "onze collega’s in Sankt-Vith die hebben het onderzoek daar afgesloten."
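The memory-based baseline can be sketched as a per-word keep/drop classifier. This is an assumed setup for illustration (a trivial exact-match memory over word-plus-context features, defaulting to "keep"), not the actual experiment, which used a proper memory-based learner on word-aligned data.

```python
# Illustrative sketch of the statistical baseline (assumed setup, not
# the actual experiment): a memory of (context, label) pairs decides per
# word whether to keep or drop it, using the word plus its left and
# right neighbours as features.

def featurize(words, i):
    left = words[i - 1] if i > 0 else "_"
    right = words[i + 1] if i < len(words) - 1 else "_"
    return (left, words[i], right)

def train(labelled_sentences):
    """Store every (features, label) pair in memory."""
    memory = {}
    for words, labels in labelled_sentences:
        for i, label in enumerate(labels):
            memory[featurize(words, i)] = label
    return memory

def predict(memory, words):
    """Drop a word only when its context was seen with label 'drop';
    unseen contexts default to 'keep' (the keep-all baseline)."""
    return [w for i, w in enumerate(words)
            if memory.get(featurize(words, i), "keep") == "keep"]

train_data = [(["Wel", ",", "onze", "collega's"],
               ["drop", "drop", "keep", "keep"])]
memory = train(train_data)
print(predict(memory, ["Wel", ",", "onze", "collega's"]))
```

This mirrors the output example above: the filler "Wel," is dropped while the content words are kept.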
What will the subtitle generator use? Automatically generated word-class information and phrase boundaries, and significance scores for words, word classes and phrases, computed from the ATraNoS corpus and other Dutch text. Antwerp has software available for the first, but most of it is for English. A module for Dutch named-entity recognition has been developed in this project year; other modules will follow.
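One hypothetical form such a significance score could take is a smoothed log-odds ratio between a domain corpus and a reference corpus; everything here (the formula, names and counts) is an assumption for illustration, not the planned system.

```python
import math
from collections import Counter

# Hypothetical significance score (assumed, not the ATraNoS design):
# words relatively more frequent in the subtitle corpus than in general
# Dutch text score above zero, via an add-one-smoothed log-odds ratio.

def significance(word, domain_counts, reference_counts):
    d = domain_counts[word] + 1
    r = reference_counts[word] + 1
    d_total = sum(domain_counts.values()) + len(domain_counts) + 1
    r_total = sum(reference_counts.values()) + len(reference_counts) + 1
    return math.log((d / d_total) / (r / r_total))

domain = Counter({"onderzoek": 8, "de": 50})       # toy counts
reference = Counter({"onderzoek": 1, "de": 60})    # toy counts
print(significance("onderzoek", domain, reference) > 0)  # → True
```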
Future work. Expanding the size of the corpora, provided that more transcriptions become available. Adapting the Antwerp shallow parsing software to Dutch. Adding shallow parsing information to the corpora. Developing a significance scoring system for words, word classes and phrases. Improving the summarization system.
Publications 2002 Memory-Based Shallow Parsing. In Journal of Machine Learning Research, volume 2 (March), 2002. Introduction to the CoNLL-2002 shared task: Language-Independent Named Entity Recognition. In Proceedings of CoNLL-2002, Taipei, Taiwan, 2002. Memory-Based Named Entity Recognition. In Proceedings of CoNLL-2002, Taipei, Taiwan, 2002.