1 Investigation and Modeling of the Structure of Texting Languages
Monojit Choudhury, Rahul Saraf*, Vijit Jain, Sudeshna Sarkar, Anupam Basu Department of Computer Science & Engineering, IIT Kharagpur *Department of Computer Engineering, NIT Jaipur

2 This is an example for Texting language
A new genre of English, and of other languages, used in chats, SMS, emails, blogs, etc. Ungrammatical, with unconventional spellings.
dis is n eg 4 txtin lang
This is an example for Texting language
Monojit Choudhury, CSE, IIT Kharagpur

3 Texting Language
The shorter → the faster. Constraint: understandability.
dis is n eg 4 txtin lang (24 characters)
This is an example for Texting language (39 characters)

4 Objectives
Modeling the structure of Texting language.
Decoder from Texting language to standard English.
Domain: SMS texts.
Applications: search engines, noisy text correction, correction of ASR-transcribed data.

5 The Noisy Channel Model
Standard language: $S = s_1 s_2 \ldots s_n$
Texting language: $T = t_1 t_2 \ldots t_m$
Decoding of $T$: $\arg\max_{S} \Pr(T \mid S)\,\Pr(S) = \arg\max_{S} \left[\prod_{i=1}^{n} \Pr(t_i \mid s_i)\right] \Pr(S)$
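A minimal sketch of the noisy channel decoding rule, under two simplifying assumptions that are not stated on the slide: Pr(S) is replaced by a unigram prior, and tokens align one to one with standard words. The probability tables and the names `CHANNEL`, `PRIOR`, `decode_token` and `decode` are invented for illustration.

```python
# Toy numbers invented for illustration only; they are not from the SMS corpus.
CHANNEL = {  # Pr(t | s): probability that standard word s is typed as token t
    "this": {"this": 0.6, "dis": 0.4},
    "disk": {"disk": 0.9, "dis": 0.1},
    "is": {"is": 1.0},
}
PRIOR = {"this": 0.05, "disk": 0.001, "is": 0.08}  # unigram stand-in for Pr(S)


def decode_token(t):
    """Return argmax_s Pr(t|s) * Pr(s), or t itself if no candidate emits t."""
    best, best_score = t, 0.0
    for s, emissions in CHANNEL.items():
        score = emissions.get(t, 0.0) * PRIOR[s]
        if score > best_score:
            best, best_score = s, score
    return best


def decode(text):
    """Decode each token independently (unigram prior, 1:1 alignment assumed)."""
    return " ".join(decode_token(t) for t in text.split())


print(decode("dis is"))
```

With these toy tables, "dis" resolves to "this" rather than "disk" because the prior outweighs the small channel difference.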

6 Refined Objective
Given a Texting language word t, find the set of possible Standard language words {s1, s2, s3, …} such that Pr(si|t) > p.
Example: t = “tns”; s1 = “teens”, s2 = “tins”, s3 = “tons”, s4 = “tens”, s5 = “tense”, s6 = “turns”.

8 Texting Language Data
1000 SMS texts collected from the web [http://www.treasuremytext.com].
Manually translated to standard English.
Automatic word alignment through heuristics.
Word-variation pairs extracted from the corpus and manually corrected.
Available at:
Number of tokens: ~20000. Number of types: ~2000 (standard English). Number of frequent types: 234 (freq > 10). Compression rate: 0.83.
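The slide does not say how the compression rate is computed. One plausible reading, assumed here, is the ratio of texting length to standard length in characters. The function name `compression_rate` is hypothetical, and the pair below is the example sentence from the opening slides, so the result for one sentence is not comparable to the corpus-wide 0.83.

```python
def compression_rate(texting, standard):
    """Ratio of texting-form length to standard-form length, in characters.

    Assumed definition; the slides give the figure but not the formula.
    """
    return len(texting) / len(standard)


pair = ("dis is n eg 4 txtin lang", "This is an example for Texting language")
print(round(compression_rate(*pair), 3))
```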

9 Tomorrow never dies!!!
Spellings of “tomorrow” in the corpus (frequency): 2moro (9), tomoz (25), tomoro (12), tomrw (5), tomra (2), tomorrow (24), tomora (4), tomm (1), tomo (3), tomorow (3), 2mro (2), morrow (1), tomor (2), tmorro (1), moro (1).

10 Patterns or Compression Operators
Phonetic substitution (phoneme): psycho → syco, then → den
Phonetic substitution (syllable): today → 2day, see → c
Deletion of vowels: message → mssg, about → abt
Deletion of repeated characters: tomorrow → tomorow

11 Patterns or Compression Operators
Truncation (deletion of tails): introduction → intro, evaluation → eval
Common abbreviations: Kharagpur → kgp, text back → tb
Informal pronunciation: going to → gonna, better → betta
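Three of the operators listed on these two slides are purely orthographic and can be sketched as string functions. This is an illustrative simplification: in real texting the operators apply optionally, character by character, whereas the functions below apply them everywhere. The function names are hypothetical, and keeping the first character in `delete_vowels` is an assumption chosen so that "about" gives "abt". The phonetic operators are left out because they need pronunciation data.

```python
import re

VOWELS = set("aeiou")


def delete_vowels(word):
    """Drop every vowel after the first character."""
    return word[:1] + "".join(c for c in word[1:] if c not in VOWELS)


def delete_repeats(word):
    """Collapse each run of an identical character to a single character."""
    return re.sub(r"(.)\1+", r"\1", word)


def truncate(word, keep):
    """Keep only the first `keep` characters (deletion of the tail)."""
    return word[:keep]


for source, target in [
    ("message", delete_vowels("message")),
    ("about", delete_vowels("about")),
    ("tomorrow", delete_repeats("tomorrow")),
    ("introduction", truncate("introduction", 5)),
]:
    print(source, "->", target)
```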

12 Successive Application of Operators
because → cause (informal usage)
cause → cauz (phonetic substitution)
cauz → cuz (vowel deletion)

13 Approach
Supervised machine learning using Hidden Markov Models.
Training instances: only positive examples, of the form (t, s, freq):
(“tns”, “teens”, 52), (“tns”, “tins”, 34), (“tns”, “tens”, 27), (“tns”, “tense”, 2)
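To make the (t, s, freq) triples concrete, the sketch below turns them into relative frequencies and keeps the standard words whose share exceeds a threshold p, as in the refined objective. This is a plain counting estimate of Pr(s|t), not the HMM-based estimate the talk develops; the function name `candidates` and the threshold value 0.1 are assumptions.

```python
from collections import defaultdict

# Training triples as shown on the slide: (texting word, standard word, frequency).
TRAINING = [
    ("tns", "teens", 52),
    ("tns", "tins", 34),
    ("tns", "tens", 27),
    ("tns", "tense", 2),
]


def candidates(t, instances, p):
    """Return (s, share) pairs for token t whose relative frequency exceeds p."""
    totals = defaultdict(int)
    for tok, _, freq in instances:
        totals[tok] += freq
    scored = [(s, freq / totals[tok]) for tok, s, freq in instances if tok == t]
    return sorted((x for x in scored if x[1] > p), key=lambda x: -x[1])


for s, share in candidates("tns", TRAINING, p=0.1):
    print(s, round(share, 3))
```

With p = 0.1, "tense" (2 of 115 occurrences) falls below the threshold and is dropped.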

14 HMM Construction: Graphemic path
[Figure: graphemic path of the HMM for “today”. States: S0, G1 ‘T’, G2 ‘O’, G3 ‘D’, G4 ‘A’, G5 ‘Y’, S6. Each G-state carries the labels ε, its own letter, and @.]

15 HMM Construction: Phonemic path
[Figure: phonemic path of the HMM for “today”. States: S0, P1 /T/, P2 /AH/, P3 /D/, P4 /AY/, S6, and S1 “2”. Other labels on the slide: D, Y, E, I, 2.]

16 HMM Construction: Cross-linking
[Figure: cross-linking of the graphemic and phonemic paths. States recoverable from the slide: ‘D’, G4 ‘A’, G5 ‘Y’, S0, P1 /T/, P2 /AH/, P3 /D/, P4 /AY/, S6, S1 “2”.]

17 HMM Construction: State Minimization
[Figure: HMM after state minimization. States: S0, G1 ‘T’, G2 ‘O’, G3 ‘D’, G4 ‘A’, G5 ‘Y’, P2 /AH/, P4 /AY/, S1 “2”, S6.]

18 HMM Construction: Modification
[Figure: modified HMM. States: S0, G1 ‘T’, G2 ‘O’, G3 ‘D’, G4 ‘A’, G5 ‘Y’, EXT ‘E/S’, P2 /AH/, P4 /AY/, S1 “2”, S6.]

19 Learning
Supervised estimation of the HMM parameters for the 234 known words.
Generalization of the parameters over HMMs → learning operator probabilities.
Construction of HMMs for unknown words.

20 Supervised Estimation
Step 1: HMM for “Today”.
[Figure: states S0, G1 ‘T’, G2 ‘O’, G3 ‘D’, G4 ‘A’, G5 ‘Y’, EXT ‘E/S’, P2 /AH/, P4 /AY/, S1 “2”, S6.]

21 Supervised Estimation
Step 2: Initialization.
[Figure: the HMM for “today” with initial probabilities on its edges; the values shown are 0.7, 0.3 and 1.]

22 Supervised Estimation
Step 3: Training using Viterbi.
[Figure: the HMM for “today” with the initial probabilities (0.7, 0.3, 1) and the training example “2day” (10).]

23 Supervised Estimation
Step 3: Training using Viterbi.
[Figure: the HMM for “today” with the initial probabilities (0.7, 0.3, 1) and the training example “tday” (5).]

24 Supervised Estimation
Step 4: Update the parameters.
[Figure: the HMM for “today” with re-estimated probabilities; the values shown are 1, 0.33 and 0.66.]
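The update step can be sketched as frequency-weighted transition counting over best paths. The two paths below are hand-written assumptions about how “2day” (10) and “tday” (5) pass through the HMM; an actual run would obtain them by Viterbi decoding, and the edge S1 to G3 is a guess at the slide's diagram. The function name `reestimate` is hypothetical.

```python
from collections import Counter, defaultdict

# Assumed best paths with their training frequencies.
PATHS = [
    (["S0", "S1", "G3", "G4", "G5", "S6"], 10),       # "2day": "2" covers T and O
    (["S0", "G1", "G2", "G3", "G4", "G5", "S6"], 5),  # "tday": G2 emits nothing
]


def reestimate(paths):
    """Relative frequency of each transition, weighted by example frequency."""
    counts = defaultdict(Counter)
    for states, freq in paths:
        for src, dst in zip(states, states[1:]):
            counts[src][dst] += freq
    return {
        src: {dst: n / sum(c.values()) for dst, n in c.items()}
        for src, c in counts.items()
    }


probs = reestimate(PATHS)
print(round(probs["S0"]["S1"], 2), round(probs["S0"]["G1"], 2))
```

Under these assumptions the two transitions out of S0 get 10/15 and 5/15, which is consistent with the 0.66 and 0.33 on the slide (the slide appears to truncate rather than round), and every other transition gets 1.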

25 Generalization
Weighted estimation of 20 parameters from the 234 word HMMs.
Probability of character deletion (null emission) from the first, last and intermediate G-states.
Probability of transition from a G-state to a P-state/S-state and vice versa.
Probability of transition to the extended state.

26 Construction of HMM for unseen words
12000 frequently used English words.
Their pronunciations (CMU pronunciation dictionary).
Construct the structure of the word HMMs.
Assign the probability values based on the estimated parameters.

27 Experiments
~1200 distinct tokens obtained from the SMS corpus which are unseen (translations are known from the aligned data).
Given t: for each word s in the standard lexicon, estimate Pr(s|t) ~ Pr(t|s); rank the words according to Pr(s|t); generate the suggestion list.
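The ranking procedure can be sketched with a much smaller model than the one in the talk: a graphemic chain only, in which each state either emits its own letter or emits nothing with a single deletion probability d. The value d = 0.3, the tiny lexicon, and the names `pr_t_given_s` and `suggest` are assumptions for illustration; without the phonemic path this sketch cannot explain tokens such as “2day”.

```python
import math


def pr_t_given_s(t, s, d=0.3):
    """Forward probability that the letter chain for s emits the string t.

    Each state emits its own letter with probability 1 - d, or nothing with
    probability d (deletion-only simplification of the graphemic path).
    """
    # f[j] = probability that the states processed so far have emitted t[:j]
    f = [1.0] + [0.0] * len(t)
    for ch in s:
        g = [x * d for x in f]              # this state emits nothing
        for j in range(1, len(t) + 1):
            if t[j - 1] == ch:              # this state emits its letter
                g[j] += f[j - 1] * (1 - d)
        f = g
    return f[len(t)]


def suggest(t, lexicon, k=5):
    """Rank lexicon words by Pr(t|s); report negative log-probabilities."""
    scored = [(s, pr_t_given_s(t, s)) for s in lexicon]
    ranked = sorted((x for x in scored if x[1] > 0), key=lambda x: -x[1])
    return [(s, round(-math.log(p), 2)) for s, p in ranked[:k]]


print(suggest("tdy", ["today", "tidy", "toy", "teddy"]))
```

Words that cannot produce the token at all, such as "toy" for "tdy", are dropped from the list; lower scores mean more likely sources.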

28 Results: Suggestion lists
2day (today): today (3.02), stay (11.46), away (13.13), play (13.14), clay (13.14)
fne (phone): fine (3.52), phone (5.13), funny (6.26), fined (6.51), fines (6.72)
cin (seeing): coin (3.52), chin (3.79), clean (5.95), coins (6.61), china (6.75)

29 Results: Graphs
[Figure: accuracy (%) against rank, for all tokens and for only distorted tokens.]

30 Comparison with Aspell
[Figure: accuracy (%) against rank. Legend text: Model testing, Our model on Unseen, Aspell.]

31 Ongoing Work
Detailed evaluation.
Incorporation of language models.
Extension to other languages, namely Hindi and Bangla.
Algorithms for fast argmax searching.

32 Future Work
Improvement of the structure of the HMM.
Introduction of self loops and backward edges.
Learning the structure from data.
Case-based or analogical learning: late → l8, test → tst ⇒ gr8st → greatest

33 Thank you for listening

