Arabic Language Challenges Walid Magdy 29 Sep 2010.

This presentation is not:
- About my PhD work
- About Arabic language technologies
- A description of the state of the art
- Highly technical
- A duplicate of other presentations (I hope)
- Boring (I promise)

This presentation is about:
- The Arabic language
- Arabic orthographic nature
- Arabic morphological nature
- Arabic phonetic nature
- The challenges that stem from this nature

This sentence is written in Arabic Language

Arabic Language
- Arabic is the largest living member of the Semitic language family
- It is classified as a macrolanguage with 27 sub-languages
- It is spoken by over 280 million people in 28 countries (Middle East)
- It is the language of the Quran (over 1.6 billion Muslims)

Arabic Language (Internet)
[Figure panels: Internet users by language (2010); Growth in Internet ( )]

Arabic Language (Types)
Modern Standard Arabic (current written Arabic):
- Unified across all Arab countries (news, political speeches)
- Easily understood by all Arabs
- Not spoken by people!
Spoken Arabic (dialectal Arabic):
- Differs across Arab countries (regions)
- Only semi-understandable between speakers of different dialects
- Not for formal use
Classical Arabic (language of the Quran):
- Contains ancient Arabic words
- Mostly understandable to Arabic speakers
- Previously written in a different version of the Arabic script

Arabic Language Nature
- Orthographic nature: the way Arabic letters are written (affects OCR)
- Morphological nature: the way Arabic words and sentences are constructed (affects NLP, IR, MT)
- Phonetic nature: the way Arabic letters and words are pronounced (affects ASR, T2S, S2S)

Orthographical Nature
- Written from right to left (letters only)
- 15 of the 28 letters contain dots
- Characters are connected or semi-connected
- Character shape depends on position
- Printed text may include ligatures and kashida
- Optional diacritics may be present

15 of the 28 letters contain dots

Character shape depends on position: beginning, middle, end, or isolated
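The positional shapes can be inspected through Unicode itself. The sketch below is an illustration, not part of the original slides: it assumes the letter BEH as the example and uses the Arabic Presentation Forms-B code points, which encode each positional shape separately. NFKC normalization folds all four shapes back to the single base letter, which is what text-processing tools usually want.

```python
import unicodedata

# Presentation-form code points for the letter BEH (base letter U+0628).
# The choice of BEH is an assumption made for illustration.
BEH_FORMS = {
    "isolated": "\uFE8F",
    "end": "\uFE90",
    "begin": "\uFE91",
    "middle": "\uFE92",
}


def base_letter(ch: str) -> str:
    """Fold a positional presentation form back to its base letter."""
    return unicodedata.normalize("NFKC", ch)


for position, form in BEH_FORMS.items():
    print(position, "|", unicodedata.name(form), "->",
          unicodedata.name(base_letter(form)))
```

In ordinary text only the base letter is stored and the rendering engine picks the shape, so the presentation forms mostly appear in legacy data or OCR output.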

Printed text may include kashida and ligatures
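A minimal sketch of how kashida and ligatures might be undone before further processing, assuming the text is Unicode: kashida is encoded as the tatweel character (U+0640), and the lam-alef ligature has its own presentation-form code point (U+FEFB is used here as the example). The function names are hypothetical.

```python
import unicodedata

TATWEEL = "\u0640"  # the kashida (elongation) character


def strip_kashida(text: str) -> str:
    """Remove kashida, which stretches a word without changing its letters."""
    return text.replace(TATWEEL, "")


def expand_ligatures(text: str) -> str:
    """Split presentation-form ligatures into their component letters (NFKC)."""
    return unicodedata.normalize("NFKC", text)


stretched = "\u0643\u0640\u0640\u062A\u0627\u0628"  # the word for "book" with two kashidas
print(strip_kashida(stretched))

lam_alef = "\uFEFB"  # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM
print([unicodedata.name(c) for c in expand_ligatures(lam_alef)])
```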

Optional diacritics may be present

It was very ambiguous

What about Arabic OCR?
- Word error rates (WER) are considerably high
- Good Arabic OCR: 30-40% WER on average
- Trained on a similar font: <10% WER
- Ambiguous fonts: >70% WER
- Omni-font: 40% WER
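For readers unfamiliar with the metric, WER is commonly computed as the word-level edit distance between the OCR output and the reference text, divided by the number of reference words. The sketch below assumes that common definition and whitespace tokenization; the slides do not say how their figures were measured.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed part of ref
    # and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word and one dropped word out of four reference words.
print(word_error_rate("a b c d", "a x c"))
```

Because Arabic attaches function words to the main word, a single misrecognized character invalidates the whole token, which is one reason word-level error rates look high.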

Morphological Nature
- The language is built from 10k roots
- Short vowels (diacritics) are not written
- Words contain prefixes, infixes, and suffixes (pronouns and others)
- Function words (the, and, his, her, their, it, him, them, will, ...) are attached to the main word
- Word spelling can change according to grammatical position
- No rule for plural words
- 60 billion possible surface forms

Short vowels are not written
In the Arabic text we do not write its short vowels and the pronouns are attached to the words
In th Arbc txt w do nt writ its short vwls and th pronuns ar attachd to th words
- كتب (kataba): write
- كتب (kotub): books
- كتب (kattaba): let someone write
- كتب (kuttiba): forced to write
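The ambiguity can be reproduced in a few lines. This is a sketch under stated assumptions: the fully diacritized spellings below are reconstructed from the romanizations on the slide and are not taken from the original, and diacritics are taken to be the marks in the range U+064B to U+0652. Stripping them collapses all four words to the same three letters.

```python
import re

# Arabic diacritic marks from fathatan (U+064B) through sukun (U+0652).
DIACRITICS = re.compile("[\u064B-\u0652]")

# Diacritized spellings are assumptions reconstructed from the romanizations.
FORMS = {
    "kataba (write)": "\u0643\u064E\u062A\u064E\u0628\u064E",
    "kotub (books)": "\u0643\u064F\u062A\u064F\u0628",
    "kattaba (let someone write)": "\u0643\u064E\u062A\u0651\u064E\u0628\u064E",
    "kuttiba (forced to write)": "\u0643\u064F\u062A\u0651\u0650\u0628\u064E",
}


def strip_diacritics(text: str) -> str:
    """Drop short-vowel and related marks, as in ordinary written Arabic."""
    return DIACRITICS.sub("", text)


for label, word in FORMS.items():
    print(label, "->", strip_diacritics(word))
```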

Words contain prefix, infix, and suffix
وسـيــكـتبونـهـا
wasaya + ktub + unahaa
and will + write + they it = and they will write it
Suffix (attached pronouns):
- They are Peter's children
- The children behaved well
- Her children are cute
- My children are funny
- We have to save our children
- He loves his children
- His children love him
Infix:
- كتب (kataba): write
- كاتب (kateb): writer
- كتاب (ketab): book
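The decomposition of this word can be mimicked with a toy affix stripper. Everything here is a hypothetical illustration: the affix lists contain only the pieces needed for this one example, the function name is made up, and a greedy approach like this mis-segments many real words, which is exactly why proper morphological analysis is hard.

```python
# Toy affix lists, just enough for the slide's example (assumptions, not a lexicon).
PREFIXES = ["\u0648", "\u0633", "\u064A"]    # wa (and), sa (will), ya (imperfect marker)
SUFFIXES = ["\u0647\u0627", "\u0648\u0646"]  # haa (it), una (they), tried from the word end


def segment(word: str, min_stem: int = 3):
    """Greedily peel known prefixes, then suffixes, keeping a stem of min_stem letters."""
    prefixes, suffixes = [], []
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= min_stem:
            prefixes.append(p)
            word = word[len(p):]
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= min_stem:
            suffixes.insert(0, s)  # keep suffixes in reading order
            word = word[:-len(s)]
    return prefixes, word, suffixes


# The example word, written without kashida.
WORD = "\u0648\u0633\u064A\u0643\u062A\u0628\u0648\u0646\u0647\u0627"
print(segment(WORD))
```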

No rule for plural (singular: plural)
- رجل (man): رجال (men)
- كاتب (writer): كتاب (writers)
- مكتب (office): مكاتب (offices)
- مكتبة (library): مكتبات (libraries)
- هاتف (telephone): هواتف (telephones)
- مصلي (prayer, a person praying): مصلين (prayers)
- إمام (leader): أئمة (leaders)

What about Arabic IR?
- Some characters are normalized
- Diacritics (short vowels) are removed
- Approaches for search:
  - Search with words
  - Apply light stemming for words
  - Apply morphological stemming for words
  - Simple character n-gram representation
- Character n-grams achieve the best results
- Example: "example" becomes exa xam amp mpl ple
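The character n-gram representation is simple to sketch. The function below reproduces the trigram example from the slide; the overlap count on two Arabic word forms is an added illustration (the words for "book" and "their book" are my choice, not the author's) of why n-grams tolerate attached affixes without any stemming.

```python
def char_ngrams(word: str, n: int = 3) -> list:
    """Return the overlapping character n-grams of a word."""
    if len(word) < n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


print(char_ngrams("example"))

# Illustrative pair (an assumption): "book" and "their book" share trigrams
# even though the second carries an attached pronoun.
book = "\u0643\u062A\u0627\u0628"
their_book = "\u0643\u062A\u0627\u0628\u0647\u0645"
shared = set(char_ngrams(book)) & set(char_ngrams(their_book))
print(len(shared), "shared trigrams")
```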

Phonetic Nature
Some Arabic phonemes do not exist in other languages (‘ein, ghain, ha, kha, Dad, Sad, Ta, Hamza)
Examples:
- Mohamed (ha)
- Attia (‘ein, Ta)
- Khalid (kha)
- Ghada (ghain)
- Asmaa (Hamza)
- Baraa (Hamza)
- Diaa (Dad, Hamza)

What about Arabic ASR?
- Needs special training and decoding
- Requires a huge amount of training
- State of the art is not bad
- MASTOR by IBM

Conclusion
- The Arabic language is full of challenges
- Research is in its early stages
- A huge amount of work is still needed
- Some initiatives are trying to help
- ALTEC: Arabic Language TEchnology Center

شكراً Thank you