Arabic STD 2006 Results
Jonathan Fiscus, Jérôme Ajot, George Doddington
December 14-15, 2006
2006 Spoken Term Detection Workshop



Outline
Motivation
Language background
 – Challenges of processing
 – Written Arabic: Orthography and Syntax, Diacritization
Evaluation results
 – Corpus statistics
 – Participants
 – Results
Future directions

Motivation
Why Include Arabic in STD?
 – The Arab countries represent 4.8% of the world population
 – Arabic is a good complement to English and Mandarin
 – Suitable corpus resources exist
 – Community expertise is growing
Aside from the STD technology questions, how to handle diacritics is a major issue
 – Diacritics provide better specificity for dialectal words, but at a cost. Will it be worth it?
 – If diacritics are used, can they be reliably transcribed?

The Many Challenges of Arabic Language Research
Arabic is not a single language but a family of languages
 – Each dialect is “like” a different language
 – STD ’06 focused on the two variants used in previous DARPA EARS evaluations:
   Modern Standard Arabic (MSA) in Broadcast News, e.g., Al Jazeera, Al Arabiya
   Colloquial Arabic in Conversational Telephone Speech: the Levantine dialect
While MSA is commonly written, dialectal Arabic is not
 – In fact, dialectal variations are not mutually intelligible
The Arabic writing system is more complicated than English
 – 28 consonant letters (each having several script glyphs)
 – 8 diacritics: 3 short vowels, 3 long vowels, an omitted vowel, a doubled consonant
 – Diacritics can be omitted or optionally used to disambiguate text; fluent readers predict vowels from context

Orthography and Syntax (K-6): Predicting Vowels
Three Arabic syntactic classes
 – Verb (Fi’il): actions connected to time
 – Noun (Ism): content words
 – Particle (Harf): everything else
The vowels within a word are affected by agents
 – Agents can be:
   Word position within the sentence
   Preceding morphemes that determine case, mood, accusative, state (sign of Damma, Fatha, Kasra, Sukoon), etc.
   Preceding particles
 – Agents are orthographically realized through vowels
This is why diacritics can be left out: they can be predicted from context
Example: الشَّرْعِيِّة الدَّوْلِيِّة (diacritized) is abstracted to الشرعية الدولية (“the international legitimacy”)

Diacritic Usage
The 8 diacritics
 – Long vowels: Fathatan /ā/, Dammatan /ū/, Kasratan /ī/
 – Short vowels: Fatha /a/, Damma /u/, Kasra /i/
 – Other diacritics: Shadda (consonant doubling), Sukoon (lack of a vowel)
Diacritized texts vs. Filtered(Diacritized texts) (see the sketch below)
 – Diacritics are sometimes used to disambiguate words
 – Long vowels were in the non-diacritized training data
   This caused a mismatch between terms and training resources
 – Expedient solution: throw out the 77 terms with long vowels
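For concreteness, here is a minimal sketch of the kind of "Filtered(Diacritized texts)" processing referred to above, assuming the eight diacritics listed on this slide correspond to the Arabic combining marks U+064B through U+0652; the function name and example are illustrative only, not the tooling actually used in the evaluation.

```python
# Illustrative sketch: strip Arabic diacritic marks to obtain a "filtered"
# (non-diacritized) version of a diacritized transcript.
# Assumption: the eight diacritics above map to Unicode combining marks
# U+064B..U+0652 (fathatan, dammatan, kasratan, fatha, damma, kasra,
# shadda, sukoon).
DIACRITICS = {
    "\u064B",  # fathatan
    "\u064C",  # dammatan
    "\u064D",  # kasratan
    "\u064E",  # fatha
    "\u064F",  # damma
    "\u0650",  # kasra
    "\u0651",  # shadda
    "\u0652",  # sukoon
}

def strip_diacritics(text: str) -> str:
    """Remove the marks above, keeping only the base letters."""
    return "".join(ch for ch in text if ch not in DIACRITICS)

if __name__ == "__main__":
    # The example pair from the "Orthography and Syntax" slide.
    print(strip_diacritics("الشَّرْعِيِّة الدَّوْلِيِّة"))  # -> الشرعية الدولية
```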

Evaluation Corpora
STD Arabic data sources
 – From the Rich Transcription 2004 Test Set
   BNews: Al Jazeera, Dubai TV (~1 hour)
   CTS: Levantine Fisher data collection (~1 hour)
 – This was too small, but it was all we had available
Data was originally transcribed by LDC
 – Appen corrected and added diacritics to the transcripts

Appen’s Diacritizations
Appen corrected and added diacritics to the DevSet and EvalSet transcripts.
 – 2-pass process: transcription and QA
 – 20% was dually transcribed
Findings:
 – The corrected, undiacritized transcripts differed from LDC’s by 5.0% for BNews and 4.7% for CTS
 – 13.7% of the words have 2 or more diacritized variants (same underlying consonants, different vowels)
   These may be real differences
 – For the 20% dually transcribed data (sketched below):
   Inter-transcriber error rate for the diacritized transcripts was 17.0% for BNews and 19.7% for CTS
   Inter-transcriber error rate for the NON-diacritized transcripts was 4.5% for BNews and 8.8% for CTS
   12.5% and 10.5% are the lower bounds on disagreement in diacritization
Conclusion:
 – The current level of diacritic ambiguity is not conducive to evaluations
   This was not an unexpected result
   Careful annotation guidelines would be needed to improve consistency
 – Mid-eval correction: allow both diacritized and non-diacritized term systems
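The inter-transcriber error rates above compare the two independent transcripts of the dually transcribed 20%. As a rough illustration only (not the NIST scoring pipeline, which also handles segmentation and text normalization), such a rate can be computed as a word-level Levenshtein error rate; the token sequences in the usage example are hypothetical placeholders.

```python
def wer(ref_words, hyp_words):
    """Word error rate between two token sequences via Levenshtein alignment."""
    n, m = len(ref_words), len(hyp_words)
    # dp[i][j] = minimum edits to turn ref_words[:i] into hyp_words[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (ref_words[i - 1] != hyp_words[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[n][m] / max(n, 1)

if __name__ == "__main__":
    # Hypothetical stand-ins for two transcribers' versions of one region;
    # scoring once with diacritics kept and once with them stripped separates
    # diacritization disagreement from disagreement about the words themselves.
    a = "w1 w2 w3 w4".split()
    b = "w1 w2x w3 w4".split()
    print(f"inter-transcriber error rate: {wer(a, b):.3f}")  # 0.250
```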

Arabic Term Profile
Followed the same selection protocol as English
 – Except no trigrams were selected
   The annotator became too frustrated because trigrams in Arabic are mostly sentences.
 – Nahia Zorub selected the DevSet terms, Essa Zorub selected the EvalSet terms
Selection used diacritized transcripts
 – Non-diacritized terms were derived by removing the diacritics from the diacritized terms. Whoops!
 – 77 non-diacritized terms with long vowels were thrown out
   Lower density rates

STD 2006 Arabic Results

Site   Diacritized   Non-Diacritized
BBN                  1p *
BUT    1p, 3c        2c
DOD    1p, 1c *

* BBN and DOD only processed the CTS data
Diacritized results are not comparable to the Non-Diacritized results

Diacritized Arabic CTS

Site   Indexing Rate (Hp/Hs)   ATWV   MTWV
BUT
DOD
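For readers decoding the column headers: ATWV is the Actual Term-Weighted Value at the system's YES/NO decisions, and MTWV is the Maximum Term-Weighted Value over all decision thresholds. Below is a minimal sketch of the underlying TWV computation; the constants (C/V = 0.1, term prior = 1e-4, giving beta ≈ 999.9) follow my reading of the published STD 2006 evaluation plan and should be treated as assumptions rather than values stated in this deck.

```python
# Sketch of the term-weighted value behind the ATWV/MTWV columns, assuming
# the STD 2006 definition: TWV = 1 - mean_over_terms(P_miss + beta * P_fa),
# with beta = (C/V) * (1/prior - 1). Constants below are assumptions.

def twv(p_miss, p_fa, cost_over_value=0.1, prior=1e-4):
    """p_miss, p_fa: per-term miss / false-alarm probabilities (equal length)."""
    beta = cost_over_value * (1.0 / prior - 1.0)  # ~999.9 with the defaults
    terms = list(zip(p_miss, p_fa))
    return 1.0 - sum(pm + beta * pf for pm, pf in terms) / len(terms)

# Hypothetical per-term error rates for three terms:
print(round(twv([0.2, 0.5, 0.1], [1e-5, 0.0, 2e-5]), 3))  # -> 0.723
```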

Non-Diacritized Arabic CTS

Site   Indexing Rate (Hp/Hs)   ATWV   MTWV
BBN

Diacritized Arabic BNews

Site   Indexing Rate (Hp/Hs)   ATWV   MTWV
BUT

Conclusions
Considerable problems were uncovered working with the transcripts
 – The Diacritized vs. Non-Diacritized distinction isn't clear
 – Diacritic annotation has low inter-annotator agreement
   This probably swamps the specificity benefit
Biggest roadblock to moving forward:
 – Deciding how to handle diacritics