
1
Automatic Full Phonetic Transcription of Arabic Script with and without Language Factorization
Mohsen Rashwan, Mohamed Al-Badrashiny, and Mohamed Attia
Based on research conducted by RDI’s NLP group (2003-2009): http://www.RDI-eg.com/RDI/Technologies/Arabic_NLP.htm
Presented by Mohamed Attia
Talk hosted by the Group of Computational Linguistics, Dept. of Computer Science, University of Toronto, Toronto, Canada, Oct. 7th, 2009
www.RDI-eg.com

2
The Problem of Ambiguity with NLP
Numerous non-trivial NLP tasks that are handled via rule-based (i.e. language-factorizing) methods typically end up with multiple possible solutions/analyses; e.g. morphological analysis, PoS tagging, syntax analysis, lexical semantic analysis, etc.
This residual ambiguity arises from our incomplete knowledge of the underlying dynamics of the linguistic phenomenon, and possibly also from the lack of higher language-processing layers that constrain such a phenomenon; e.g. the absence of a semantic analysis layer constraining morphological and syntax analysis.
Statistical methods are well known to be among the most effective, feasible, and widely adopted approaches to resolving that ambiguity automatically.
2/39 Automatic Full Phonetic Transcription of Arabic Script, with and without Language Factorization (Oct. 2009) CL group - Dept. of CS – U of T – Toronto - Canada

3
[Figure: Statistical disambiguation of factorized sequences of language entities]

4
Intermediate Ambiguous NLP Tasks
Sometimes, such ambiguous NLP tasks are not sought for the sake of their own outputs, but as an intermediate step towards inferring another final output.
An example is the problem of automatically obtaining the phonetic transcription of a given crude Arabic text w_1 … w_n, which can be directly inferred as a one-to-one mapping of the diacritics on the characters of the input words. But these diacritics are typically absent in MSA script!
The NLP solution to this TTS problem is to infer the diacritics d_1 … d_n indirectly, by factorizing the crude input words via morphological analysis, PoS tagging, and Arabic phonetic grammar. Slides 13 to 26 provide a review of these language factorization models.
However, these language factorization processes are themselves highly ambiguous!

5
[Figure: Arabic morphological analysis as an intermediate ambiguous language factorization towards the target output, the diacritics of the input words]

6
Why Not Go without Language Factorization Altogether?
Some researchers, however, argue that if statistical disambiguation is eventually deployed to get the most likely sequence of outputs, why not go fully statistical; i.e. stay un-factorized from the very beginning and give up the burden of rule-based methods?
For our example, this means that the statistical disambiguation (as well as the statistical language models) is built from manually diacritized text corpora where both the spelling characters and their full diacritics are supplied for each word.

7
Cannot Cover, but How Accurate and How Fast?
The obvious answer in many such cases (including our example) is that factorization is needed to overcome the problem of poor coverage when the input language entities are produced via a highly generative linguistic process; e.g. Arabic morphology.
However, that sound question may be modified so that it asks about the performance (accuracy and speed) of statistically disambiguating un-factorized language entities (at least the frequent ones that may be covered without factorization) as compared to statistically disambiguating factorized ones.
The rest of this presentation discusses four issues in this regard:
1- The statistical disambiguation methodology deployed in both cases.
2- The related Arabic NLP factorization models and the architecture of the factorizing system.
3- The architecture of the hybrid (factorizing/un-factorizing) Arabic phonetic transcription system.
4- Results analysis: factorizing system vs. hybrid system, and hybrid system vs. other groups’ systems.

8
1- Statistical Disambiguation Methodology
Noisy Channel Model for Statistical Disambiguation
[Figure: noisy channel model; not preserved in this transcript]
With the maximum a posteriori probability (MAP) criterion: [equation not preserved in this transcript]
For our example, O is the sequence of crude Arabic input text words.
- In the case of the factorizing system, I is any valid sequence of factorizations, e.g. Arabic morphological analyses (quadruples), and the ^ denotes the most likely one.
- In the case of the un-factorizing system, I is any valid sequence of diacritics, and the ^ denotes the most likely one.
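The MAP formula itself appeared as an image on the slide. Under the standard noisy-channel formulation, and using only the symbols the slide defines (O for the observed input sequence, I for a candidate output sequence), it would presumably read as follows. This is a textbook reconstruction, not a copy of the original slide.

```latex
\hat{I} \;=\; \arg\max_{I} P(I \mid O)
        \;=\; \arg\max_{I} \frac{P(O \mid I)\, P(I)}{P(O)}
        \;=\; \arg\max_{I} P(O \mid I)\, P(I)
```

The denominator P(O) drops out because it does not depend on I.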

9
1- Statistical Disambiguation Methodology
Likelihood Probability
In other pattern recognition problems, e.g. OCR and ASR, the term P(O|I), referred to as the likelihood probability, is modeled via probability distributions; e.g. HMMs.
Our language factorization models enable us to do better by viewing the availability of possible structures for a given input string, in terms of probabilities, as a binary decision of whether or not the observed string complies with the formal rules of the factorization models.
This simplifies the MAP formula into: [equation not preserved in this transcript]
where R(O) is the part of the space of the factorization model corresponding to the observed input string; i.e. [equation not preserved in this transcript]
- In the case of the factorizing system, I is now restricted to only those factorized sequences that can generate (via synthesis) the input sequence, and the ^ denotes the most likely one.
- In the case of the un-factorizing system, I is a possible sequence of diacritics matching the input sequence, and the ^ denotes the most likely one.
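The simplified formula was also an image. If the likelihood is treated as the binary decision described on this slide, the simplification would plausibly take the form below. This is a hedged reconstruction from the slide's wording, using its own symbols O, I, and R(O).

```latex
P(O \mid I) =
\begin{cases}
1 & \text{if } I \in R(O) \\
0 & \text{otherwise}
\end{cases}
\qquad\Longrightarrow\qquad
\hat{I} \;=\; \arg\max_{I \in R(O)} P(I)
```

In words: the rule-based models decide which candidates are admissible at all, and the statistical language model P(I) only has to rank the admissible ones.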

10
1- Statistical Disambiguation Methodology
Statistical Language Models, and Search Space
The term P(I) is conventionally called the (Statistical) Language Model (SLM).
Let us replace the conventional symbol I by Q, which is more convenient for our specific set of problems.
With the aid of the first graph in this presentation, the problem is now reduced to searching for the most likely sequence of q_{i,f(i)}, 1 ≤ i ≤ L, i.e. the one with the highest marginal probability, through the following lattice: [lattice figure not preserved in this transcript]
This creates a Cartesian search space: [equation not preserved in this transcript]
The A* search algorithm is guaranteed to exit with the most likely path via two tree-search strategies.

11
1- Statistical Disambiguation Methodology
Lattice Search, and n-Gram Probabilities
The two tree-search strategies of A* are:
1- Heuristic probability estimation of the rest of the path to be expanded next. This is called the h* function.
combined with
2- Best-first tree expansion of the path with the highest sum of the start-to-expansion probability (the g function) plus the h* function.
It is then required to estimate the marginal probability of any whole or partial path in the lattice. Via the chain rule and the attenuating correlation assumption, this probability is approximated by the formula: [equation not preserved in this transcript]
where h+1 is the maximum affordable length of n-grams in the SLM.
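The search described above can be sketched in a few lines. The following is a minimal illustration, not RDI's implementation: it assumes a bigram model (h = 1), works with log-probabilities so that path scores add up, and uses a deliberately simple h* (the best possible transition into each remaining position), which never underestimates the remaining score and therefore keeps the first completed path optimal. The names `a_star_lattice`, `toy_logp`, and the toy probabilities are all hypothetical.

```python
import heapq
import math
from itertools import count


def a_star_lattice(lattice, logp, start="<s>"):
    """Return (best_path, best_log_probability) through `lattice`.

    lattice: list of non-empty candidate lists, one per input word.
    logp(prev, cur): log P(cur | prev) of an assumed bigram model (<= 0).
    """
    n = len(lattice)
    # Optimistic (upper-bound) score of the best transition into position i.
    best_step = []
    for i, cands in enumerate(lattice):
        prevs = [start] if i == 0 else lattice[i - 1]
        best_step.append(max(logp(p, c) for p in prevs for c in cands))
    # h_star[i]: upper bound on the score of positions i .. n-1.
    h_star = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        h_star[i] = h_star[i + 1] + best_step[i]

    tie = count()  # tie-breaker so the heap never has to compare paths
    # heapq is a min-heap, so the priority f = g + h* is stored negated.
    frontier = [(-h_star[0], next(tie), 0.0, (start,))]
    while frontier:
        _, _, g, path = heapq.heappop(frontier)
        i = len(path) - 1  # number of lattice positions already decided
        if i == n:
            return list(path[1:]), g
        for cand in lattice[i]:
            g_new = g + logp(path[-1], cand)
            f_new = g_new + h_star[i + 1]
            heapq.heappush(frontier, (-f_new, next(tie), g_new, path + (cand,)))
    return [], -math.inf


# Toy bigram probabilities (invented for illustration only).
TOY = {
    ("<s>", "a1"): 0.6, ("<s>", "a2"): 0.4,
    ("a1", "b1"): 0.1, ("a1", "b2"): 0.2,
    ("a2", "b1"): 0.9, ("a2", "b2"): 0.05,
}


def toy_logp(prev, cur):
    return math.log(TOY[(prev, cur)])


toy_lattice = [["a1", "a2"], ["b1", "b2"]]
best_path, best_score = a_star_lattice(toy_lattice, toy_logp)
print(best_path)  # ['a2', 'b1']
```

Note that the locally best first choice (a1, probability 0.6) is not on the best full path, which is why a greedy left-to-right choice would not be enough here.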

12
1- Statistical Disambiguation Methodology
Computing Probabilities of n-Grams with Zipfian Sparseness
These conditional probabilities are primarily calculated via the famous Bayesian formula. [equation not preserved in this transcript]
Due to Zipfian sparseness, the Good-Turing discount and Katz’s back-off techniques are also deployed to obtain smooth distributions as well as reliable estimates of rare and unseen events, respectively.
While the DB of elementary n-gram probabilities P(q_1 … q_n), 1 ≤ n ≤ h, is built during the training phase, the task of statistical disambiguation at runtime reduces to: [equation not preserved in this transcript]
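To make the Good-Turing step concrete, here is a small sketch of the textbook count adjustment c* = (c + 1) N_{c+1} / N_c, where N_c is the number of distinct n-grams seen exactly c times. It is an illustration under stated assumptions, not the system's code: the function names are hypothetical, the fallback to the raw count when N_{c+1} = 0 is a simplification, and Katz's back-off is not implemented here.

```python
from collections import Counter


def good_turing_counts(ngram_counts):
    """Map each observed count c to its Good-Turing adjusted count c*.

    c* = (c + 1) * N_{c+1} / N_c. When N_{c+1} is zero the raw count is
    kept, which is a simplifying assumption of this sketch.
    """
    freq_of_freq = Counter(ngram_counts.values())
    adjusted = {}
    for c, n_c in freq_of_freq.items():
        n_next = freq_of_freq.get(c + 1, 0)
        adjusted[c] = (c + 1) * n_next / n_c if n_next else float(c)
    return adjusted


def unseen_mass(ngram_counts):
    """Probability mass reserved for unseen n-grams: N_1 / N."""
    total = sum(ngram_counts.values())
    singletons = sum(1 for c in ngram_counts.values() if c == 1)
    return singletons / total if total else 0.0
```

For example, with four n-grams seen once and one seen twice, each singleton's count is discounted from 1 to 0.5, and the freed mass is what a back-off scheme can redistribute to unseen events.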

13
2- Arabic NLP Factorization Models
Arabic Phonetic Transcription: Problem Definition
Although Arabic is an intensively diacritized language, Modern Standard Arabic (MSA) is typically written by contemporary native speakers without diacritics!
So, it is the task of the NLP system to accurately infer all the missing diacritics of all the words in the input Arabic text, and also to amend those diacritics in order to account for the mutual phonetic effects among adjacent words when they are pronounced continuously.

14
2- Arabic NLP Factorization Models
Challenges of Arabic Phonetic Transcription
- Modern Standard Arabic (MSA) is typically written without diacritics.
- MSA script is typically full of common spelling mistakes.
- The extremely derivative and inflective nature of Arabic necessitates treating it as a morpheme-based rather than a vocabulary-based language. The size of the generable Arabic vocabulary is within the order of billions!
- One (or more) diacritic in about 65% of the words in Arabic text depends on the syntactic case ending of the word.
- Lexical and syntax grammars alone produce a high average number of possible solutions at each word of the text (high ambiguity).
- 7.5% of open-domain Arabic text is transliterated words, which lack any constraining Arabic model. Moreover, many of these words are confusingly analyzable as normal Arabic words!

15
2- Arabic NLP Factorization Models
The Ladder of NLP Layers; Undiscovered Levels
Theoretically speaking, NLP problems should be tackled combinatorially at all the NLP layers, which is still far beyond the reach of the current state of the art. Moreover, NLP researchers have not yet developed firm knowledge at all the NLP layers.

16
2- Arabic NLP Factorization Models
Language Factorizations Deployed for Solving the Problem
- Arabic morphological analysis (with statistical disambiguation) is deployed to retrieve the syntax-independent lexical phonetic information of each input Arabic word from its constituent morphemes.
- Arabic PoS tagging (along with morphological analysis) is deployed to statistically infer the most likely syntax-dependent (case-ending) phonetic information of the input Arabic words.
- For transliterated (foreign) words, the intra-word Arabic phonetic grammar is deployed to constrain the statistical search for the most likely diacritization that matches the spelling of each input transliterated word.
- The inter-word Arabic phonetic grammar is deployed (synthetically) to phonetically concatenate fully diacritized adjacent words of all kinds.

17
The Architecture of the Factorizing Arabic Phonetic Transcription System

18
2- Arabic NLP Factorization Models
Arabic Morphological Structure: Morphemes
Arabic is a highly derivative and inflective language whose words can be decomposed into a relatively compact set of morphemes.
Our Arabic morphological model acknowledges the following morphemes:
- P: 260 prefixes.
- R_d: 4,600 derivative roots.
- F_rd: 1,000 regular derivative patterns.
- F_id: 300 irregularly derived words.
- R_f: 260 roots of fixed words.
- F_f: 300 fixed words.
- R_a: 240 roots of Arabized words.
- F_a: 290 Arabized words.
- S: 550 suffixes.
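As a rough sanity check on how generative this compact inventory is, one can multiply out the regular-derivative part of the figures above. The sketch below assumes, purely for illustration, that every prefix, root, pattern, and suffix could combine freely; real Arabic morphology forbids most such combinations, so the result is only a loose upper bound, not a vocabulary size reported by the authors. The function name is hypothetical.

```python
# Morpheme inventory sizes quoted on the slide.
PREFIXES = 260
DERIVATIVE_ROOTS = 4_600
REGULAR_PATTERNS = 1_000
SUFFIXES = 550


def raw_combination_bound(prefixes, roots, patterns, suffixes):
    """Loose upper bound: every prefix x root x pattern x suffix quadruple."""
    return prefixes * roots * patterns * suffixes


bound = raw_combination_bound(PREFIXES, DERIVATIVE_ROOTS, REGULAR_PATTERNS, SUFFIXES)
print(f"{bound:,}")  # 657,800,000,000
```

Even if only a small fraction of these quadruples are valid words, the count stays far beyond what a final-form word list could enumerate, which is the coverage argument for factorization.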

19
2- Arabic NLP Factorization Models
Arabic Morphological Structure: Lexicon
A comprehensive Arabic lexicon has been built as the repository of the linguistic (orthographic, phonological, morphological, syntactic) description of each Arabic morpheme. All the possible interactions of each morpheme with other morphemes are also registered, as extensively as possible, in a compact structured format.
This lexicon is the core of all our language factorizations.

20
2- Arabic NLP Factorization Models
Canonical Structure of Arabic Morphology

21
The Multiplicity of Possible Arabic Lexical Analyses

22
2- Arabic NLP Factorization Models
The Arabic Lexical Disambiguation Lattice
After this process we obtain the diacritization of each Arabic word except for its case-ending diacritics.

23
2- Arabic NLP Factorization Models
The Arabic Case Endings Disambiguation Lattice
After this process we obtain the case-ending diacritics of each Arabic word.

24
2- Arabic NLP Factorization Models
Inferring the Diacritization of Transliterated Words
Foreign names and terminology frequently appear as transliterated Arabic strings in real-life Arabic text, at a rate of 7.5% (approx. 1/14). These words are not constrained by the Arabic morphological or syntactic models.
A look-up-table-based approach is not a viable solution due to:
- Its lack of completeness and its poor coverage.
- Its intolerance of spelling variance.
- Its inability to attach Arabic infixes.
- Its lack of any guarantee of compliance with Arabic phonology.
- Above all, the time-varying nature of this problem.
Our approach was then to go statistical at the phoneme level. However, this would generate too wide a search space, and too high a perplexity, to get good results.
To limit the search space, we constrain the search with another NLP model at the phonology layer: the Intra-Word Arabic Phonetic Grammar.

25
2- Arabic NLP Factorization Models
Disambiguation Lattice of Transliterated Words
After this process we obtain the diacritization of each transliterated word.

26
2- Arabic NLP Factorization Models
Intra-Word Arabic Phonetic Grammar

27
3- The Hybrid Factorizing/Un-factorizing Transcriptor
Adding the Un-factorizing Phonetic Transcriptor
- The un-factorizing diacritizer simply tests the spelling of each input word against a dictionary of final-form words; i.e. a vocabulary list.
- For each sequence of consecutive input words that are all covered by that dictionary (henceforth called a “segment”), the possible diacritizations of each word are retrieved directly, without any language factorization.
- The resulting diacritization lattice of each segment is then statistically disambiguated.
- Uncovered segments (along with the disambiguated diacritizations of the covered segments) are then sent to the factorizing transcriptor, which infers the most likely diacritization of the uncovered segments and phonetically concatenates the words in all segments.
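The segmentation step described above can be sketched as follows. This is a minimal illustration under assumptions: the dictionary is modeled as a plain mapping from an undiacritized spelling to its candidate diacritizations, the function names are hypothetical, and the romanized entries in the usage note are toy placeholders rather than data from the system.

```python
def split_into_segments(words, dictionary):
    """Group consecutive words into (covered, [words]) segments.

    A word is "covered" when its undiacritized spelling is a key of
    `dictionary`; adjacent words with the same status share a segment.
    """
    segments = []
    for word in words:
        covered = word in dictionary
        if segments and segments[-1][0] == covered:
            segments[-1][1].append(word)
        else:
            segments.append((covered, [word]))
    return segments


def candidate_lattice(segment_words, dictionary):
    """Diacritization lattice of a covered segment: one candidate list per word."""
    return [list(dictionary[w]) for w in segment_words]
```

With a toy dictionary `{"ktb": ["kataba", "kutub"], "fy": ["fiy"]}` and the input `["ktb", "fy", "xyz", "ktb"]`, the words fall into a covered segment, an uncovered one (`"xyz"`), and a second covered one. Only the uncovered segment would need the factorizing path for diacritization.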

28
3- The Hybrid Factorizing/Un-factorizing Transcriptor
The Architecture of the Hybrid Transcriptor

29
4- Results Analysis
Experimental Evaluation of both Architectures
Two sets of experiments and result analyses have been performed to evaluate our Arabic phonetic transcription work:
- Experiments comparing the performance of the purely factorizing architecture with that of the hybrid factorizing/un-factorizing one.
- Experiments comparing the performance of the better of our two architectures with the best reported systems produced by rival R&D groups.
While the first set of experiments shows the hybrid architecture to outperform the purely factorizing one, the second set shows our hybrid system to be superior to those of the rival groups.

30
4- Results Analysis
Comparing with Best Rivals; Experimental Setup
The two best rival systems reported in the published literature on the problem of full automatic Arabic phonetic transcription are:
- N. Habash & O. Rambow’s group at Columbia Univ. (2007), whose architecture is a language-factorizing one, with the Support Vector Machine Tool (SVMTool) as its statistical modeling/disambiguation tool. They also build an open-vocabulary SLM with Kneser-Ney smoothing using the SRILM toolkit.
- I. Zitouni, J. S. Sorensen, and R. Sarikaya’s group at IBM’s WRC (2006), whose architecture is also a language-factorizing one, with Maximum Entropy as its statistical modeling/disambiguation framework.
Both groups evaluated their performance by training and testing their systems on LDC’s Arabic Treebank of diacritized news stories (LDC2004T11; text, part 3, v1.0), published in 2004.
This Arabic text corpus, which includes a total of 600 documents ≈ 340K words from AnNahar (Lebanese) newspaper text, is split into training data ≈ 288K words and test data ≈ 52K words.

31
4- Results Analysis
Comparing with Best Rivals; Experimental Results
In order to obtain a fair comparison with the work of Habash & Rambow’s group and with that of Zitouni et al.’s group:
- We used the same aforementioned training and test corpus from LDC’s Treebank.
- We adopted their metrics for counting errors while evaluating our hybrid system against theirs.
As each of the other two groups deploys more sophisticated statistical tools than ours, one can attribute the superior performance of our system to hybridizing the un-factorizing transcriptor with the factorizing one in our system architecture.

32
4- Results Analysis
Comparing the Factorizing to the Hybrid Architecture; Experimental Setup
It is insightful to know not only how much better the hybrid transcriptor is than the purely factorizing one, but also how the error margin evolves in both cases as the size of the annotated training text corpora increases.
To this end, domain-balanced annotated Arabic training text corpora with a total size of 3,250K words have been developed (over years), with a manually supervised full Arabic morphological analysis and diacritization applied to every word.
Another domain-balanced (tough) test set of 11K words has also been prepared in both annotated and un-annotated formats.
At approximately log-scale steps of the size of the training corpora, the statistical models (with the same equivalent h) have been built, and the following metrics have been measured for each of the two architectures:
- Error margin.
- Average execution time per query.
- Average size of the SLMs.

33
4- Results Analysis
Comparing the Factorizing to the Hybrid Architecture; Experimental Results
- Both systems asymptote to the same irreducible error margin.
Justification: Despite being put in two different formats, the SLMs of both systems are built from the same data and hence have the same information content.
- The hybrid system has a faster learning curve than the purely factorizing one.
Justification: The un-factorizing component suggests fewer candidate diacritizations (by looking them up in the dictionary) than the factorizing component (which generates all the possibilities), which in turn leads to less ambiguity. Due to the Zipfian distribution of natural language, a small dictionary (built from small training data) can quickly capture the frequent words.
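The Zipfian argument can be illustrated numerically. The sketch below assumes an idealized Zipf law, in which the frequency of the word type of rank r is proportional to 1 / r**exponent, and computes the share of running words covered by a dictionary holding only the most frequent types. The function name, the exponent of 1.0, and the vocabulary sizes are illustrative assumptions, not measurements from the experiments.

```python
def zipf_coverage(dictionary_size, vocabulary_size, exponent=1.0):
    """Share of running words covered by the `dictionary_size` most frequent
    word types, assuming type frequency proportional to 1 / rank**exponent."""
    if vocabulary_size < 1 or not 0 <= dictionary_size <= vocabulary_size:
        raise ValueError("need 0 <= dictionary_size <= vocabulary_size, vocabulary_size >= 1")
    weights = [1.0 / rank ** exponent for rank in range(1, vocabulary_size + 1)]
    return sum(weights[:dictionary_size]) / sum(weights)
```

Under this idealization, `zipf_coverage(100, 100_000)` comes out at roughly 0.43: the top 0.1% of word types already account for over 40% of the running words, which is consistent with a small dictionary quickly capturing the frequent words.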

34
4- Results Analysis
Comparing the Factorizing to the Hybrid Architecture; Experimental Results (cont’d)
- The hybrid system has been found to be about twice as fast as the purely factorizing one in terms of average execution time per transcription query.
Justification: The hybrid system saves the time needed for the extra language factorizations, and its slimmer lattice needs less A* search time.
- The storage needed for the SLMs of the un-factorizing system has been found to be 8 times smaller (on average) than that of their counterparts in the purely factorizing one.
N.B. The storage needed for the SLMs of the hybrid system is the sum of that needed for the factorizing and the un-factorizing components.
Justification: Extra space is needed to store many more lower-order n-grams in the factorizing system than in the un-factorizing one.

35
Relevant Publications by: I- Competing Groups
- (Columbia Univ. group) N. Habash, O. Rambow, Arabic Diacritization through Full Morphological Tagging, Proceedings of the 8th Meeting of the North American Chapter of the Association for Computational Linguistics (ACL); Human Language Technologies Conference (HLT-NAACL), 2007.
- (IBM group) I. Zitouni, J. S. Sorensen, R. Sarikaya, Maximum Entropy Based Restoration of Arabic Diacritics, Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (ACL); Workshop on Computational Approaches to Semitic Languages; Sydney - Australia, July 2006; http://www.ACLweb.org/anthology/P/P06/P06-1073

36
Relevant Publications by: II- Our Group (RDI’s)
1- Rashwan, M., Al-Badrashiny, M., Attia, M., Abdou, S., Rafea, A., A Stochastic Arabic Diacritizer Based on a Hybrid of Factorized and Un-factorized Textual Features, IEEE Transactions on Audio, Speech, and Language Processing (TASLP), http://www.SignalProcessingSociety.org/Publications/Periodicals/TASLP (accepted but not published yet).
2- Rashwan, M., Al-Badrashiny, M., Attia, M., Abdou, S., Rafea, A., A Stochastic Arabic Hybrid Diacritizer, 2009 IEEE International Conference on Natural Language Processing and Knowledge Engineering (IEEE NLP-KE'09), http://caai.cn:8080/nlpke09/, Dalian - China, Sept. 2009.
3- Al-Badrashiny, M., Automatic Diacritization for Arabic Texts, M.Sc. thesis, Dept. of Computer Engineering, Faculty of Engineering, Cairo University, June 2009; http://www.rdi-eg.com/rdi/Downloads/ArabicNLP/Mohamed-Badashiny_MSc-Thesis_June2009.pdf
Cont. on the next page

37
Relevant Publications by: II- Our Group (RDI’s) “Cont’d”
4- Attia, M., Rashwan, M., Al-Badrashiny, M., Fassieh; a Semi-Automatic Visual Interactive Tool for the Morphological, PoS-Tags, Phonetic, and Semantic Annotation of the Arabic Text, IEEE Transactions on Audio, Speech, and Language Processing (TASLP), http://www.SignalProcessingSociety.org/Publications/Periodicals/TASLP, Special Issue on Processing Morphologically Rich Languages, Vol. 17, Issue 5, pp. 916-925, July 2009; http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?isnumber=5067414&arnumber=5075778&count=21&index=6
5- Rashwan, M., Al-Badrashiny, M., Attia, M., Abdou, S., A Hybrid System for Automatic Arabic Diacritization, The Proceedings of the 2nd International Conference on Arabic Language Resources and Tools, Cairo - Egypt, http://www.MEDAR.info/Conference_All/2009/index.php, Apr. 2009.
6- Attia, M., Theory and Implementation of a Large-Scale Arabic Phonetic Transcriptor, and Applications, PhD thesis, Dept. of Electronics and Electrical Communications, Faculty of Engineering, Cairo University, http://www.rdi-eg.com/rdi/technologies/papers.htm, Sept. 2005.
7- Attia, M., A Large-Scale Computational Processor of the Arabic Morphology, and Applications, M.Sc. thesis, Dept. of Computer Engineering, Faculty of Engineering, Cairo University, http://www.rdi-eg.com/rdi/technologies/papers.htm, Jan. 2000.

38
Conclusions
I- A given statistical disambiguation technique operating on either factorized or un-factorized sequences of linguistic entities asymptotes to the same disambiguation accuracy as the size of the annotated training corpora grows without bound.
II- Disambiguating un-factorized sequences is easier to develop, computationally faster, and seems to have a faster “accuracy vs. training corpora size” learning curve.
III- With highly generative linguistic phenomena (e.g. Arabic morphology), language factorization is necessary to handle the problem of coverage.
IV- On the other hand, language factorization costs much R&D effort, and is also more computationally expensive.
V- In such cases, the optimal systems can be built as a hybrid of the two approaches, so that the factorizing mode is resorted to only if some un-factorized entities in the input sequence are OOV.

39
Thank you for your attention.
To probe further, please visit: http://www.RDI-eg.com/RDI/Technologies/Arabic_NLP.htm
You may also contact:
- Prof. Mohsen Rashwan: Mohsen_Rashwan@RDI-eg.com
- Dr. Mohamed Attia: m_Atteya@RDI-eg.com
