DIAC+: A Professional Diacritics Recovering System
Dan Tufiş, Alexandru Ceauşu
Research Institute for Artificial Intelligence, Romanian Academy
13, Calea "13 Septembrie", 050711, Bucharest
firstname.lastname@example.org@racai.ro

Outlook
I. Motivations
II. Related work and different approaches
III. Diacritics in Romanian
IV. DIAC+ Architecture
V. Evaluation
VI. Implementation

Motivations
Almost all European languages use diacritics.
In most languages that use diacritical characters, they are not merely decorative: they may carry grammatical and/or semantic meaning.
Missing or wrongly used diacritics are extremely annoying, especially in texts meant for publication.
Why do texts still lack diacritics nowadays?
– reuse of older texts
– ergonomic factors (non-localized keyboards, multiple keystrokes for a diacritical character)
– inappropriate authoring tools or character-set converters
– typos

Different approaches & related work
Word-based (dictionary supported) approaches:
– El-Bèze et al. (1994), Yarowsky (1994), Spriet & El-Bèze (1997), Simard (1998), Tufiş & Chiţu (1999) etc.
Character-based approaches:
– Mihalcea (2002), Bobiceva (2008), Zweigenbaum and Grabar (2002), Wagacha et al. (2006), De Pauw et al. (2007) etc.

Diacritics in Romanian (I)
Romanian has 5 diacritical characters: ă, â, î, ş and ţ (plus their uppercase variants).
Two categories of words may contain diacritics:
– U-words (Unambiguous words): the class of legal words of Romanian which, once their diacritics are stripped off, are no longer words of the language: padure (pădure - forest), tufis (tufiş - bush), cantar (cântar - balance), carare (cărare - pathway), casmir (caşmir - cashmere), macar (măcar - at least), fara (fără - without), cati (câţi - how many).
Their recovery is trivial when a back-up lexicon is available.

Diacritics in Romanian (II)
– A-words (Ambiguous words): the class of legal words of Romanian which, once their diacritics are stripped off, are still words of the language; these words are never identified by a traditional spell-checker. For instance, the string fata could mean any of the following:
   fata – the girl
   fată – a girl; or (about animals) gives birth
   fâţa – the quick-swimming little fish/the coquette
   fâţă – a quick-swimming little fish/a coquette
   faţa – the face
   faţă – a face
   făta – (about animals) to give birth; gave birth
   fătă – (about animals) just gave birth
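
The U-word/A-word distinction above can be sketched in a few lines of Python. This is only an illustration of the definitions on these slides, not the DIAC+ code: the function names and the toy lexicon are assumptions, and the mapping also accepts the comma-below letters (ș, ț) that modern Romanian text may use alongside the cedilla forms shown here.

```python
# Map Romanian diacritical characters to their base letters.
# Cedilla forms (as on the slides) plus comma-below forms (an added assumption).
STRIP = str.maketrans({
    "ă": "a", "â": "a", "î": "i", "ş": "s", "ţ": "t",
    "Ă": "A", "Â": "A", "Î": "I", "Ş": "S", "Ţ": "T",
    "\u0219": "s", "\u021b": "t", "\u0218": "S", "\u021a": "T",
})


def strip_diacritics(word):
    """Return the word with all Romanian diacritics removed."""
    return word.translate(STRIP)


def classify(lexicon):
    """Split the diacritical words of a lexicon into (U-words, A-words).

    A word is an A-word if its stripped form is itself in the lexicon,
    and a U-word otherwise. Words without diacritics are skipped.
    """
    u_words, a_words = set(), set()
    for word in lexicon:
        stripped = strip_diacritics(word)
        if stripped == word:
            continue  # no diacritics, nothing to recover
        if stripped in lexicon:
            a_words.add(word)
        else:
            u_words.add(word)
    return u_words, a_words


# Toy lexicon (hypothetical): "fata" is a legal word, "padure" is not.
toy_lexicon = {"pădure", "fata", "fată", "faţa", "casa"}
u_words, a_words = classify(toy_lexicon)
```

With this toy lexicon, pădure ends up as a U-word, while fată and faţa are A-words because the stripped string fata is itself a word.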

Diacritics in Romanian (III)
Most A-words can be disambiguated based on grammatical information; those that cannot are called S-words (Semantically ambiguous words).
The proper treatment of S-words (which share the same morpho-syntactic properties) requires semantic disambiguation.
– For the previous example, knowing the morpho-syntactic properties (Ncfsry: common noun, feminine, definite form, direct case) still leaves three diacritics restoration possibilities with very different meanings:
   fata (Ncfsry) – the girl
   faţa (Ncfsry) – the face
   fâţa (Ncfsry) – the quick-swimming little fish/the coquette
A text may:
– be completely diacritics-free (Tufiş and Chiţu 1999), or
– partially contain diacritics (and not always in a correct way); this is a harder case

Diacritics in Romanian (IV)
Corpus | Journalism (Agenda) | Juridical (Acquis)
1. No. of words | 6,680,448 | 3,511,093
1*. No. of chars | 37,008,236 | 21,404,666
2. No. of words with diacs (of 1) | 2,004,763 (30.01%) | 1,026,385 (29.23%)
2*. No. of diacs | 2,351,220 | 1,192,875
3. U-words (of 2) | 238,132 (11.88%) | 175,822 (17.13%)
4. A-words (of 2) | 1,766,631 (88.12%) | 850,563 (82.87%)
5. S-words (Ctag set, of 4) | 58,420 (3.31%) | 38,323 (4.51%)
6. S-words (MSDtag set, of 4) | 24,916 (1.41%) | 16,463 (1.94%)
In an ideal setting, with a full-coverage dictionary and a text with no typographical errors other than the missing diacritics, about 25% (#A-words/#Words) of the total number of words would remain ambiguous.
Our supposedly error-free texts still contained typing errors: 72,722 (1.09%) in the journalism texts and 29,387 (0.84%) in the juridical texts.

DIAC+ Architecture
Input: text
Processing steps:
(i) Tokenization
(ii) Hypotheses generation
(iii) Tiered tagging
(iv) Candidate selection
(v) Unknown words processing
Output: text & spelling alternatives
Resources: tokenizer resources; dictionaries D0, D1, D2; language model; character model

Dictionaries D0, D1, D2 and Hypotheses Generation
LEX dictionary – normative word-form lexicon with lemma information; 1 million entries.
– D0 is the subset of LEX containing all the words with at least one diacritical character.
– D1 is the diacritics stripped-off version of LEX.
– D2 contains words in the current text which are neither in D0 nor in D1 and which are suspected of being typing errors; they are derived from the words in D0 ∪ D1, differing by plus or minus one character or by a switch of two consecutive characters (additionally, the switched characters should be neighbors on the keyboard).
In the hypotheses generation step, a word is first searched in D0 ∪ D1. If the word cannot be found in D0 ∪ D1, it is searched in the D2 dictionary.
A word which is not found in any of the system's lexicons is considered unknown and irrecoverable by the word-based approach, and its processing is left to a character-based recovery module.
A word W occurring in the current text may be associated with several entries in the LEX word-form lexicon; the tagging step is used to filter this set and eventually select the single contextually correct one.
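
The lookup order described on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the DIAC+ implementation: the names (`hypotheses`, `near_misses`, `NEIGHBORS`), the partial keyboard map and the toy lexicon are all hypothetical, tokens are stripped before lookup as a simplification, and the D2 tier is computed on the fly instead of being stored as a dictionary.

```python
import string

STRIP = str.maketrans("ăâîşţ", "aaist")  # lowercase only, for brevity


def strip_diacritics(word):
    return word.translate(STRIP)


# Hypothetical, partial QWERTY adjacency map; a real one would cover all keys.
NEIGHBORS = {"a": "qwsz", "s": "awedxz", "e": "wsdr", "r": "edft"}


def near_misses(word):
    """Strings one inserted char, one deleted char, or one neighbor swap away."""
    out = set()
    for i in range(len(word) + 1):          # one extra character
        for c in string.ascii_lowercase:
            out.add(word[:i] + c + word[i:])
    for i in range(len(word)):              # one missing character
        out.add(word[:i] + word[i + 1:])
    for i in range(len(word) - 1):          # swap of two consecutive characters
        a, b = word[i], word[i + 1]
        if b in NEIGHBORS.get(a, "") or a in NEIGHBORS.get(b, ""):
            out.add(word[:i] + b + a + word[i + 2:])
    out.discard(word)
    return out


def hypotheses(token, lex):
    """Return the LEX word-forms that the token may stand for.

    First tier: the stripped token is a known stripped form (D0/D1 lookup).
    Second tier: a near miss of the token is a known stripped form (D2-like).
    An empty set means the word is unknown to the word-based approach.
    """
    index = {}
    for form in lex:
        index.setdefault(strip_diacritics(form), set()).add(form)
    key = strip_diacritics(token)
    if key in index:
        return set(index[key])
    found = set()
    for variant in near_misses(key):
        found |= index.get(variant, set())
    return found


toy_lex = {"pădure", "fata", "fată", "faţă", "casa"}  # hypothetical lexicon
candidates = hypotheses("fata", toy_lex)
```

An empty result corresponds to the unknown-word case that the slides hand over to the character-based module.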

Tiered Tagging & Candidate selection
The tagger uses a special HMM language model in which the transition probabilities were computed from the regular training corpora (i.e. with diacritics) and the emission probabilities were computed from the diacritics stripped-off training corpora.
TT = a two-step tagging process:
– Tagging with a reduced tagset LM (92 tags)
– Recovering the left-out information from the lexical tagset (615 tags)
Candidate selection:
– The U-words are replaced with their diacritical counterparts.
– The A-words which are not S-words are replaced by the surface form identified by the MSD assigned by the tagger to the respective A-word.
– For the S-words, either the user is presented with a list of contextually meaningful choices, or the replacement is done automatically based on lexical probabilities or some probabilistic preferences.
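
A rough sketch of the candidate selection rule for A-words and S-words, assuming the tagger has already assigned an MSD to the token. The function name, the lexicon layout, the frequency counts and every tag other than Ncfsry are illustrative placeholders, not taken from DIAC+; U-word replacement is not shown.

```python
STRIP = str.maketrans("ăâîşţ", "aaist")  # lowercase only, for brevity


def select(token, msd, lexicon, freq):
    """Pick a surface form for a diacritics-free token given its MSD.

    lexicon: iterable of (surface_form, msd) pairs.
    freq: surface_form -> count, used only to rank S-word alternatives.
    Returns (chosen_form, alternatives); alternatives is non-empty only
    when several forms share the same MSD (the S-word case).
    """
    matches = sorted({form for form, tag in lexicon
                      if form.translate(STRIP) == token and tag == msd})
    if not matches:
        return token, []          # nothing to restore under this MSD
    if len(matches) == 1:
        return matches[0], []     # A-word resolved by the MSD alone
    # S-word: rank by lexical frequency, keep the list for the user.
    best = max(matches, key=lambda form: freq.get(form, 0))
    return best, matches


# Toy data (hypothetical tags except Ncfsry, made-up frequencies).
toy_lexicon = [
    ("fata", "Ncfsry"), ("faţa", "Ncfsry"), ("fâţa", "Ncfsry"),
    ("fată", "Ncfsrn"), ("peste", "Sp"), ("peşte", "Nc"),
]
toy_freq = {"fata": 50, "faţa": 80, "fâţa": 1}

chosen, alternatives = select("fata", "Ncfsry", toy_lexicon, toy_freq)
```

Returning the alternatives alongside the automatic choice mirrors the two modes described above: the list can be shown to the user, or the frequency-ranked choice can be applied directly.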

Character Model and Unknown Words Processing (I)
Unknown word processing is used as a backup for the candidate selection stage when no equivalent word-form is found in the lexicon. This case is quite rare: very few words are not covered by our almost 1,000,000-entry lexicon.
The unknown word processing can be designed to work in parallel with the candidate selection phase.
For processing unknown words, we used a character-based N-gram model similar to the one used in (Mihalcea, 2002).
We used SRILM, the SRI Language Modeling Toolkit (Stolcke, 2002), to train several character models. The training corpus contained 5,124,277 characters (including spaces) in 48,308 sentences, and the test corpus had 613,234 characters in 6,411 sentences.

Character Model and Unknown Words Processing (II)
Model order | Perplexity | Accuracy (no spaces) | Model size
2-gram | 12.42 | 93.67% | 20.8 KB
3-gram | 9.72 | 95.52% | 223 KB
4-gram | 7.11 | 97.72% | 1.29 MB
5-gram | 5.77 | 98.59% | 4.82 MB
6-gram | 5.29 | 98.79% | 13.1 MB
7-gram | 5.17 | 98.84% | 27.7 MB
8-gram | 5.18 | 98.85% | 48.4 MB
We used Viterbi decoding with a 5-gram character model to find the most probable string for the unknown word.
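
The idea of Viterbi decoding over a character model can be illustrated with a deliberately small stand-in. The sketch below is an assumption-laden simplification: it uses a bigram model with add-one smoothing trained in pure Python, where the slides use a 5-gram model trained with SRILM, and the names `train` and `restore` and the tiny training text are hypothetical.

```python
import math
from collections import Counter

# Each diacritics-free letter and the characters it may stand for.
VARIANTS = {"a": "aăâ", "i": "iî", "s": "sş", "t": "tţ"}


def train(text):
    """Train a character bigram model; return a log-probability function."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for word in text.split():
        seq = "^" + word + "$"          # word boundary markers
        vocab.update(seq)
        for prev, cur in zip(seq, seq[1:]):
            bigrams[prev, cur] += 1
            contexts[prev] += 1
    size = len(vocab)

    def logp(prev, cur):
        # Add-one smoothing (a simplification, not SRILM's discounting).
        return math.log((bigrams[prev, cur] + 1) / (contexts[prev] + size))

    return logp


def restore(word, logp):
    """Viterbi search for the most probable diacritical spelling of word."""
    best = {"^": (0.0, "")}             # last char -> (score, string so far)
    for ch in word:
        nxt = {}
        for cand in VARIANTS.get(ch, ch):
            score, prev = max((s + logp(p, cand), p)
                              for p, (s, _) in best.items())
            nxt[cand] = (score, best[prev][1] + cand)
        best = nxt
    _, last = max((s + logp(p, "$"), p) for p, (s, _) in best.items())
    return best[last][1]


logp = train("pădure pădurea pădurar")  # tiny hypothetical training text
restored = restore("padure", logp)
```

Because the state at each position is only the last character, the search stays linear in the word length; a higher-order model such as the 5-gram used on the slide would keep longer histories as states.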

Evaluation (I)
Word-based vs. character-based evaluations
The evaluation scenario:
– R = reference corpus, tokenized, tagged and lemmatized; hand validated (approx. 118,000 words and about 502,000 characters)
– TT = the diacritics stripped-off version of R
– RT = the tag and lemma stripped-off version of TT
– Baseline system: from the Agenda Corpus (10 million words) we derived a dictionary in which the head entries are the non-diacritical forms of words and the body of each entry is the list of diacritical counterparts, each with its frequency in the corpus; the baseline system replaces a head word from this lexicon with its most frequent diacritical counterpart
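
The baseline described above amounts to a frequency table keyed by the stripped form. A minimal sketch, with hypothetical function names and a toy token list standing in for the Agenda Corpus:

```python
from collections import Counter, defaultdict

STRIP = str.maketrans("ăâîşţ", "aaist")  # lowercase only, for brevity


def build_baseline(corpus_tokens):
    """Map each non-diacritical form to a Counter of its diacritical forms."""
    table = defaultdict(Counter)
    for token in corpus_tokens:
        table[token.translate(STRIP)][token] += 1
    return table


def baseline_restore(token, table):
    """Replace a token with its most frequent counterpart, if it has one."""
    if token in table:
        return table[token].most_common(1)[0][0]
    return token  # not a head entry: leave the token unchanged


# Toy corpus (hypothetical); the real table was derived from corpus counts.
table = build_baseline("faţa fata faţa pădure".split())
restored = [baseline_restore(t, table) for t in ["fata", "padure", "xyz"]]
```

Such a baseline ignores context entirely, so every occurrence of an ambiguous string receives the same restoration.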

Word-based Evaluation
Tokens: 117,909
Words with diacritics: 34,745 (29.47%)
S-words: 361
Unknown words: 2,130 (1.8%)
 | DIAC+ tagged text (TT) | DIAC+ raw text (RT) | Baseline system
Correct words | 116,810 (99.06%) | 115,262 (97.75%) | 113,491 (96.25%)
Incorrect words | 1,092 (0.94%) | 2,609 (2.25%) | 4,418 (3.75%)

Character-based Evaluation
Characters (no spaces): 501,735
Diacritical characters: 41,144 (8.2%)
 | DIAC+ tagged text (TT) | DIAC+ raw text (RT) | Baseline system
Correct characters (no spaces) | 500,400 (99.73%) | 498,764 (99.4%) | 497,096 (99.07%)
Incorrect characters (no spaces) | 1,335 (0.27%) | 2,971 (0.6%) | 4,639 (0.93%)
Evaluations in terms of characters always look much better (approximately 4 times better) than evaluations in terms of words!
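
The gap between the two measures follows from how they count: a wrongly restored word usually has only one or two wrong characters, and the evaluation corpus averages roughly 4.3 characters per token (501,735 / 117,909). The sketch below shows both measures side by side; the function name and the toy data are hypothetical, and it assumes that restoration preserves token length.

```python
def accuracies(reference, restored):
    """Return (word accuracy, character accuracy) for aligned token lists.

    Assumes each restored token has the same length as its reference
    token, which holds when only diacritics are changed.
    """
    if len(reference) != len(restored):
        raise ValueError("token lists must be aligned")
    words_ok = sum(r == h for r, h in zip(reference, restored))
    chars_total = sum(len(r) for r in reference)
    chars_ok = sum(a == b
                   for r, h in zip(reference, restored)
                   for a, b in zip(r, h))
    return words_ok / len(reference), chars_ok / chars_total


# Toy example (hypothetical): one wrong word, with one wrong character.
ref = ["fata", "mănâncă", "peşte"]
out = ["fata", "mănâncă", "peste"]
word_acc, char_acc = accuracies(ref, out)
```

In the toy example a single wrong character costs one third of the word score but only a small fraction of the character score.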

Implementation
Two versions:
– Standalone (everything packed in one executable; rather slow for large MS Office documents)
– Web service (distributed among various programs and machines; much faster)
In both versions DIAC+ may work under user supervision (like classical spell-checkers) or independently, in which case it generates a logfile documenting each correction (initial word-form, possible replacements and the one actually chosen). Optionally, the logfile can include, for each replacement, the sentence in which it was made.
The system can correct a few typographical errors such as transposed characters, wrongly typed characters, or omitted characters.
The MS spell-checker underlines all the unknown words, thus allowing the user to further inspect spelling errors which are out of reach for DIAC+.
Figure 2. Diacritics recovery in Microsoft Word 2003