
Arabic STD 2006 Results
Jonathan Fiscus, Jérôme Ajot, George Doddington
December 14-15, 2006, 2006 Spoken Term Detection Workshop


1 Arabic STD 2006 Results
Jonathan Fiscus, Jérôme Ajot, George Doddington
December 14-15, 2006, 2006 Spoken Term Detection Workshop
http://www.nist.gov/speech/tests/std

2 Outline
Motivation
Language background
–Challenges of processing
–Written Arabic: Orthography and Syntax
–Diacritization
Evaluation results
–Corpus statistics
–Participants
–Results
Future directions

3 Motivation
Why include Arabic in STD?
–The Arab countries represent 4.8% of the world population [1]
–Arabic is a good complement to English and Mandarin
–Suitable corpora resources exist
–Community expertise is growing
Aside from the STD technology questions, how to handle diacritics is a major issue
–Diacritics provide better specificity for dialectal words, but at a cost. Will it be worth it?
–If diacritics are used, can they be reliably transcribed?
[1] http://en.wikipedia.org/wiki/List_of_Arab_countries_by_poulation

4 The Many Challenges of Arabic Language Research
Arabic is not a single language but a family of languages
–Each dialect is "like" a different language
–STD '06 focused on the two variants used in previous DARPA EARS evaluations:
  Modern Standard Arabic (MSA) in Broadcast News, e.g., Al Jazeera, Al Arabiya
  Colloquial Arabic in Conversational Telephone Speech: the Levantine dialect
While MSA is commonly written, dialectal Arabic is not
–In fact, dialectal variations are not mutually intelligible
The Arabic writing system is more complicated than English's
–28 consonant letters (each having several script glyphs)
–8 diacritics: 3 short vowels, 3 long vowels, an omitted-vowel mark, and a consonant-doubling mark
–Diacritics can be omitted, or optionally used to disambiguate text → fluent readers predict vowels from context

5 Orthography and Syntax (K-6): Predicting Vowels
Three Arabic syntactic classes:
–Verb (Fi'il): actions connected to time
–Noun (Ism): content words
–Particle (Harf): everything else
The vowels within a word are affected by agents
–Agents can be:
  Word position within the sentence
  Preceding morphemes that determine case, mood, accusative, state (sign of Damma, Fatha, Kasra, Sukoon), etc.
  Preceding particles
–Agents are orthographically realized through vowels
This is why diacritics can be left out: they can be predicted.
Example: الشَّرْعِيِّة الدَّوْلِيِّة (diacritized) vs. الشرعية الدولية (non-diacritized)
Abstracted from http://www.ummah.com/forum/showthread.php?t=100867

6 Diacritic Usage
The 8 diacritics:
–Long vowels: Fathatan /ā/, Dammatan /ū/, Kasratan /ī/
–Short vowels: Fatha /a/, Damma /u/, Kasra /i/
–Other diacritics: Shadda (consonant doubling), Sukoon (lack of a vowel)
Diacritized texts → Filtered(Diacritized texts)
–Diacritics are sometimes used to disambiguate words
–Long vowels were in the non-diacritized training data
  Caused a mismatch between terms and training resources
  Expedient solution: throw out the 77 terms with long vowels
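The filtering step alluded to above (and used later to derive non-diacritized terms) amounts to deleting the eight diacritic code points from the text. A minimal sketch, assuming the diacritics are the standard Unicode harakat range; the function name is mine, not the evaluation tooling's:

```python
# The 8 diacritics named on this slide occupy the contiguous Unicode
# range U+064B..U+0652: Fathatan, Dammatan, Kasratan, Fatha, Damma,
# Kasra, Shadda, Sukoon.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}

def strip_diacritics(text: str) -> str:
    """Return the text with Arabic diacritic marks removed."""
    return "".join(ch for ch in text if ch not in DIACRITICS)

# Applied to the diacritized phrase from slide 5, this yields the
# non-diacritized form shown there:
#   strip_diacritics("الشَّرْعِيِّة الدَّوْلِيِّة") == "الشرعية الدولية"
```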

7 Evaluation Corpora
STD Arabic data sources:
–From the Rich Transcription 2004 test set
  BNews: Al Jazeera, Dubai TV (~1 hour)
  CTS: Levantine Fisher data collection (~1 hour)
–This was too small, but all we had available
Data was originally transcribed by LDC
–Appen corrected and added diacritics to the transcripts

8 Appen's Diacritizations
Appen corrected and added diacritics to the DevSet and EvalSet transcripts.
–2-pass process: transcription and QA
–20% was dually transcribed
Findings:
–The corrected, undiacritized transcripts differed from LDC's by 5.0% for BNews and 4.7% for CTS
–13.7% of the words have 2 or more diacritized variants (same underlying consonants, different vowels)
  These may be real differences
–For the 20% dually transcribed data:
  Inter-transcriber error rate for the diacritized transcripts was 17.0% for BNews and 19.7% for CTS
  Inter-transcriber error rate for the NON-diacritized transcripts was 4.5% for BNews and 8.8% for CTS
  12.5% and 10.5% are therefore the lower bounds on disagreement in diacritization
Conclusion:
–The current level of diacritic ambiguity is not conducive to evaluations
  This was not an unexpected result
  Careful annotation guidelines would be needed to improve consistency
–Mid-eval correction: allow both diacritized and non-diacritized term systems
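The lower-bound figures quoted on this slide come from subtracting the disagreement already present without diacritics from the disagreement on diacritized transcripts; what remains is attributable to the diacritization itself. A tiny illustration (the function name is mine):

```python
def diacritization_disagreement_floor(diacritized_err_pct: float,
                                      plain_err_pct: float) -> float:
    """Lower bound on inter-transcriber disagreement attributable to
    diacritics alone: the error rate on diacritized transcripts minus
    the error rate already present on non-diacritized transcripts."""
    return diacritized_err_pct - plain_err_pct

# BNews figures from this slide: 17.0% diacritized vs. 4.5% plain,
# giving a 12.5% floor on diacritization disagreement.
```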

9 Arabic Term Profile
Followed the same selection protocol as English
–Except no trigrams were selected: the annotator became too frustrated, because trigrams in Arabic are mostly sentences
–Nahia Zorub selected the DevSet terms; Essa Zorub selected the EvalSet terms
Selection used diacritized transcripts
–Non-diacritized terms were derived by removing the diacritics from the diacritized terms. Whoops!
–77 non-diacritized terms with long vowels were thrown out
–Lower density rates

10 STD 2006 Arabic Results

Site   Diacritized   Non-Diacritized
BBN    –             1p *
BUT    1p, 3c        2c
DOD    1p, 1c *      –

* BBN and DOD only processed the CTS data
Diacritized results are not comparable to the non-diacritized results.

11 Diacritized Arabic CTS

Site   Indexing Rate (Hp/Hs)   ATWV    MTWV
BUT    68.8                    0.003   0.034
DOD    8.6                     -6.57   0.000
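The ATWV (Actual Term-Weighted Value) and MTWV (Maximum TWV) columns in these tables follow the STD 2006 metric, TWV = 1 − (P_miss + β·P_FA), averaged over terms. A sketch of the scoring rule; β = 999.9 reflects my understanding of the 2006 cost model (cost/value ratio 0.1, prior term probability 10^-4) and is an assumption, not something stated on the slide:

```python
def twv(p_miss: float, p_fa: float, beta: float = 999.9) -> float:
    """Term-Weighted Value for one term: 1 - (P_miss + beta * P_FA).
    beta = 999.9 is assumed from the STD 2006 cost model."""
    return 1.0 - (p_miss + beta * p_fa)

def atwv(term_stats) -> float:
    """Average TWV over a list of (P_miss, P_FA) pairs, one per term."""
    return sum(twv(pm, pf) for pm, pf in term_stats) / len(term_stats)

# Missing every occurrence of a term costs at most 1.0 per term, but the
# large beta makes false alarms far more expensive -- which is how a
# strongly negative ATWV such as DOD's -6.57 can arise.
```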

12 Non-Diacritized Arabic CTS

Site   Indexing Rate (Hp/Hs)   ATWV     MTWV
BBN    10.5                    0.3467   0.343

13 Diacritized Arabic BNews

Site   Indexing Rate (Hp/Hs)   ATWV     MTWV
BUT    68.85                   -0.092   0.066

14 Conclusions
Considerable problems were uncovered working with the transcripts
–The diacritized vs. non-diacritized distinction isn't clear
–Diacritic annotation has low inter-annotator agreement
  This probably swamps the specificity benefit
Biggest roadblock to moving forward:
–Deciding how to handle diacritics

