
1 Investigation and Modeling of the Structure of Texting Languages
Monojit Choudhury, Rahul Saraf*, Vijit Jain, Sudeshna Sarkar, Anupam Basu
Department of Computer Science & Engineering, IIT Kharagpur
*Department of Computer Engineering, NIT Jaipur

2 Investigation and Modeling of the Structure of Texting Languages. AND 2007, Hyderabad. Monojit Choudhury, CSE, IIT Kharagpur
Texting Language
A new genre of English and other languages, used in chats, SMS, emails, blogs, etc.
Ungrammatical, with unconventional spellings
–"dis is n eg 4 txtin lang" → "This is an example for Texting language"

3 Texting Language (continued)
The shorter the faster
Constraint: understandability

4 Objectives
Modeling the structure of Texting language
Decoder from Texting language to standard English
Domain: SMS texts
Applications
–Search engines
–Noisy text correction
–Correction of ASR-transcribed data

5 The Noisy Channel Model
Standard Language → NOISY CHANNEL → Texting Language
Standard language: $S = s_1 s_2 \ldots s_n$; Texting language: $T = t_1 t_2 \ldots t_m$
$$\hat{S}(T) = \arg\max_{S} \Pr(T \mid S)\,\Pr(S) = \arg\max_{S} \Big[\prod_{i=1}^{n} \Pr(t_i \mid s_i)\Big]\,\Pr(S)$$
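
A minimal sketch of word-by-word noisy-channel decoding, assuming a unigram prior so that Pr(S) factorizes over words (the slide does not say which prior is used). The probability tables and the names `CHANNEL`, `PRIOR`, `decode_word` and `decode` are hypothetical toy values for illustration, not figures from the corpus.

```python
import math

# Hypothetical toy channel model Pr(t | s), keyed by (texting word, standard word).
CHANNEL = {
    ("2day", "today"): 0.30,
    ("2day", "stay"): 0.01,
    ("c", "see"): 0.40,
    ("c", "sea"): 0.10,
    ("u", "you"): 0.50,
}

# Hypothetical unigram prior Pr(s); a real decoder would use a language model.
PRIOR = {"today": 0.02, "stay": 0.01, "see": 0.03, "sea": 0.005, "you": 0.05}


def decode_word(t):
    """Return the s maximizing Pr(t | s) * Pr(s); unknown tokens pass through."""
    best, best_score = t, -math.inf
    for (texting, standard), p in CHANNEL.items():
        if texting != t:
            continue
        # Work in log space to avoid underflow on longer products.
        score = math.log(p) + math.log(PRIOR[standard])
        if score > best_score:
            best, best_score = standard, score
    return best


def decode(tokens):
    """Decode each token independently (valid only under the unigram assumption)."""
    return [decode_word(t) for t in tokens]


print(decode(["c", "u", "2day"]))
```

With a richer prior Pr(S), such as an n-gram model, the per-word argmax no longer decomposes and a sequence search would be needed instead.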

6 Refined Objective
Given a Texting language word $t$, find the set of possible Standard language words $\{s_1, s_2, s_3, \ldots\}$ such that $\Pr(s_i \mid t) > p$
Example: t = tns
s_1 = teens, s_2 = tins, s_3 = tons, s_4 = tens, s_5 = tense, s_6 = turns

7 Texting Language Data
1000 SMS texts collected from the web [http://www.treasuremytext.com]
Manually translated to standard English
Automatic word alignment through heuristics
Word–variation pairs extracted from the corpus and manually corrected
Available at:

8 Texting Language Data (continued)
No. of tokens: ~
No. of types: ~2000 (Std English)
No. of frequent types: 234 (freq > 10)
Compression rate: 0.83

9 Tomorrow never dies!!!
Spellings of "tomorrow" found in the corpus (count in parentheses):
2moro (9), tomoz (25), tomoro (12), tomrw (5), tom (2), tomra (2), tomorrow (24), tomora (4), tomm (1), tomo (3), tomorow (3), 2mro (2), morrow (1), tomor (2), tmorro (1), moro (1)

10 Patterns or Compression Operators
Phonetic substitution (phoneme)
–psycho → syco, then → den
Phonetic substitution (syllable)
–today → 2day, see → c
Deletion of vowels
–message → mssg, about → abt
Deletion of repeated characters
–tomorrow → tomorow

11 Patterns or Compression Operators
Truncation (deletion of tails)
–introduction → intro, evaluation → eval
Common abbreviations
–Kharagpur → kgp, text back → tb
Informal pronunciation
–going to → gonna, better → betta
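
Some of the compression operators listed on the two slides above can be sketched as plain string functions. This is only an illustration: the rule that a word-initial vowel is kept, the tiny substitution table, and all function names are assumptions chosen to reproduce the slide examples, not the authors' definitions.

```python
import re

VOWELS = set("aeiou")

# Hypothetical syllable-to-symbol table covering only the slide examples.
SYLLABLE_SUBS = {"to": "2", "see": "c"}


def delete_vowels(word):
    """Drop every vowel except a word-initial one (an assumed rule)."""
    if not word:
        return word
    return word[0] + "".join(ch for ch in word[1:] if ch not in VOWELS)


def delete_repeats(word):
    """Collapse each run of identical characters to a single character."""
    return re.sub(r"(.)\1+", r"\1", word)


def truncate(word, keep):
    """Keep only the first `keep` characters (deletion of the tail)."""
    return word[:keep]


def substitute_syllable(word):
    """Replace a leading syllable with its symbol, if the table has one."""
    for syllable, symbol in SYLLABLE_SUBS.items():
        if word.startswith(syllable):
            return symbol + word[len(syllable):]
    return word


print(delete_vowels("message"), delete_vowels("about"))
print(delete_repeats("tomorrow"))
print(truncate("introduction", 5), truncate("evaluation", 4))
print(substitute_syllable("today"), substitute_syllable("see"))
```

Real texting data applies these operators selectively, so a deterministic function like `delete_vowels` only generates one of many possible variants.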

12 Successive Application of Operators
because → cause (informal usage)
cause → cauz (phonetic substitution)
cauz → cuz (vowel deletion)

13 Approach
Supervised machine learning using Hidden Markov Models
Training instances: only positive examples (t, s, freq)
–(tns, teens, 52)
–(tns, tins, 34)
–(tns, tens, 27)
–(tns, tense, 2)
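
To make the (t, s, freq) triples concrete, the sketch below turns them into a relative-frequency estimate of Pr(s | t) and applies the threshold p from the refined objective. This is a simple counting baseline used for illustration, not the HMM approach of the talk; the threshold value 0.1 and the function names are hypothetical.

```python
from collections import defaultdict

# Training triples (t, s, freq) as listed on the slide.
TRIPLES = [
    ("tns", "teens", 52),
    ("tns", "tins", 34),
    ("tns", "tens", 27),
    ("tns", "tense", 2),
]


def posterior_by_frequency(triples):
    """Estimate Pr(s | t) as freq(t, s) / sum of freq(t, s') over all s'."""
    totals = defaultdict(int)
    for t, _, freq in triples:
        totals[t] += freq
    posterior = defaultdict(dict)
    for t, s, freq in triples:
        posterior[t][s] = freq / totals[t]
    return posterior


def candidates(posterior, t, p):
    """Standard words s with Pr(s | t) > p, most probable first."""
    scored = posterior.get(t, {})
    kept = [s for s, prob in scored.items() if prob > p]
    return sorted(kept, key=lambda s: -scored[s])


POSTERIOR = posterior_by_frequency(TRIPLES)
print(candidates(POSTERIOR, "tns", 0.1))
```

A counting estimate like this only covers (t, s) pairs seen in training, which is the gap the word HMMs are meant to fill for unseen variants.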

14 HMM Construction: Graphemic path
[Figure: graphemic path for "today": start state S0, G-states G1 (T), G2 (O), G3 (D), G4 (A), G5 (Y), end state S6]

15 HMM Construction: Phonemic path
[Figure: phonemic path for "today": S0, P1 /T/ (emits T), P2 /AH/ (emits A, O, U), P3 /D/ (emits D), P4 /AY/ (emits Y, E, I), S6; state S12 (emits 2)]

16 HMM Construction: Cross-linking
[Figure: G-states G1 to G5 and P-states P1 to P4 cross-linked between S0 and S6, together with state S12]

17 HMM Construction: State Minimization
[Figure: after minimization the states G1 to G5, P2 /AH/, P4 /AY/ and S12 remain between S0 and S6]

18 HMM Construction: Modification
[Figure: the minimized HMM with an added state EXT, labelled E/S]
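
The state layout of the graphemic and phonemic paths can be sketched as follows. This builds only the state sequences, with no transition probabilities, cross-links, or minimization. The `State` class, the function names, and the emission table are assumptions; the table holds just the four phonemes shown for "today", using the labels as they appear on the slide.

```python
from dataclasses import dataclass


@dataclass
class State:
    name: str
    emissions: tuple = ()  # characters this state may emit


# Phoneme-to-character emissions read off the "today" slide (lowercased).
# Any other phoneme would need its own entry; this table is illustrative.
PHONEME_EMISSIONS = {
    "T": ("t",),
    "AH": ("a", "o", "u"),
    "D": ("d",),
    "AY": ("y", "e", "i"),
}


def graphemic_path(word):
    """S0, one G-state per letter, then the end state S(len(word) + 1)."""
    start, end = State("S0"), State(f"S{len(word) + 1}")
    middle = [State(f"G{i}", (ch,)) for i, ch in enumerate(word, 1)]
    return [start, *middle, end]


def phonemic_path(word, phonemes):
    """S0, one P-state per phoneme, then the same end state as the G-path."""
    start, end = State("S0"), State(f"S{len(word) + 1}")
    middle = [State(f"P{i}", PHONEME_EMISSIONS[p]) for i, p in enumerate(phonemes, 1)]
    return [start, *middle, end]


G_PATH = graphemic_path("today")
P_PATH = phonemic_path("today", ["T", "AH", "D", "AY"])
print([s.name for s in G_PATH])
print([(s.name, s.emissions) for s in P_PATH])
```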

19 Learning
Supervised estimation of the HMM parameters for the 234 known words
Generalization of the parameters over HMMs: learning operator probabilities
Construction of HMMs for unknown words

20 Supervised Estimation
Step 1: HMM for "today"
[Figure: the modified HMM for "today" with states S0, G1 to G5, P2 /AH/, P4 /AY/, S12, EXT (labelled E/S) and S6; the same figure is repeated on slides 21 to 24]

21 Supervised Estimation
Step 2: Initialization

22 Supervised Estimation
Step 3: Training using Viterbi, with the observed variant day (10)

23 Supervised Estimation
Step 3: Training using Viterbi, with the observed variant tday (5)

24 Supervised Estimation
Step 4: Update the parameters
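
The four steps can be illustrated on a heavily simplified model: a graphemic-only path in which each G-state either emits its letter or emits nothing (a deletion), with one deletion probability per position class (first, intermediate, last). The phonemic states, S12 and EXT are left out, and the initial value 0.5 and all names are assumptions, so this is a sketch of Viterbi-style (hard-assignment) training rather than the authors' procedure.

```python
import math


def position_class(i, n):
    """Class of the i-th G-state (0-based) in an n-letter word."""
    if i == 0:
        return "first"
    if i == n - 1:
        return "last"
    return "mid"


def viterbi_deletions(s, t, del_prob):
    """Most likely deleted positions of s that yield t, or None if unreachable."""
    n, m = len(s), len(t)
    best = [[-math.inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(1, n + 1):
        d = del_prob[position_class(i - 1, n)]
        for j in range(m + 1):
            # Option 1: state i emits nothing (character deleted).
            cand = best[i - 1][j] + math.log(d)
            if cand > best[i][j]:
                best[i][j], back[i][j] = cand, "del"
            # Option 2: state i emits its own letter, matching t[j - 1].
            if j > 0 and s[i - 1] == t[j - 1]:
                cand = best[i - 1][j - 1] + math.log(1 - d)
                if cand > best[i][j]:
                    best[i][j], back[i][j] = cand, "emit"
    if best[n][m] == -math.inf:
        return None
    deleted, i, j = [], n, m
    while i > 0:  # trace the best path back to the start state
        if back[i][j] == "del":
            deleted.append(i - 1)
        else:
            j -= 1
        i -= 1
    return sorted(deleted)


def reestimate(word, variants, del_prob):
    """Step 4: frequency-weighted deletion rates counted along Viterbi paths."""
    deleted = {"first": 0, "mid": 0, "last": 0}
    total = {"first": 0, "mid": 0, "last": 0}
    for t, freq in variants:
        path = viterbi_deletions(word, t, del_prob)
        if path is None:
            continue  # variant not explainable by deletions alone
        for i in range(len(word)):
            cls = position_class(i, len(word))
            total[cls] += freq
            if i in path:
                deleted[cls] += freq
    return {c: deleted[c] / total[c] if total[c] else del_prob[c] for c in total}


INITIAL = {"first": 0.5, "mid": 0.5, "last": 0.5}  # Step 2 (assumed values)
UPDATED = reestimate("today", [("day", 10), ("tday", 5)], INITIAL)
print(UPDATED)
```

An estimate of exactly 0 or 1 (as happens for the last position here) would break the logarithms on a second pass, so an iterated version would need smoothing.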

25 Generalization
Weighted estimation of 20 parameters from the HMMs of the 234 words
–Probability of character deletion (null emission) from the first, last and intermediate G-states
–Probability of transition from G-state to P-state/S-state and vice versa
–Probability of transition to the extended state
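
One plausible reading of "weighted estimation" is a weighted mean of each parameter across the per-word HMMs. The slide does not state the weighting scheme, so the weights (imagined here as word frequencies), the parameter names, and the numbers below are all hypothetical.

```python
def weighted_average(estimates):
    """Combine per-word estimates into one value per parameter.

    `estimates` is a list of (weight, {parameter name: value}) pairs;
    a parameter missing from a word simply does not contribute.
    """
    sums, weights = {}, {}
    for weight, params in estimates:
        for name, value in params.items():
            sums[name] = sums.get(name, 0.0) + weight * value
            weights[name] = weights.get(name, 0.0) + weight
    return {name: sums[name] / weights[name] for name in sums}


# Hypothetical per-word estimates with hypothetical weights.
PER_WORD = [
    (3, {"del_first": 0.5, "del_mid": 0.25}),
    (1, {"del_first": 0.25}),
]
GLOBAL = weighted_average(PER_WORD)
print(GLOBAL)
```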

26 Construction of HMM for unseen words
Frequently used English words
Their pronunciations (CMU pronunciation dictionary)
Construct the structure of the word HMMs
Assign the probability values based on the estimated parameters

27 Experiments
~1200 distinct tokens obtained from the SMS corpus which are unseen (translations are known from the aligned data)
Given t, for each word s in the standard lexicon, estimate Pr(s|t) ~ Pr(t|s)
Rank the words according to Pr(s|t)
Generate the suggestion list
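
The ranking loop above can be sketched with a stand-in for Pr(t | s). Here the stand-in is a deletion-only model in which t must be a subsequence of s and each letter is dropped with a fixed probability d; this replaces the word HMMs purely for illustration. The value d = 0.3, the tiny lexicon, and the function names are assumptions.

```python
import math


def is_subsequence(t, s):
    """True if t can be obtained from s by deleting characters."""
    remaining = iter(s)
    return all(ch in remaining for ch in t)


def channel_cost(t, s, d=0.3):
    """Stand-in for -log Pr(t | s) under a fixed per-letter deletion rate d."""
    if not is_subsequence(t, s):
        return math.inf
    deleted = len(s) - len(t)
    return -(deleted * math.log(d) + len(t) * math.log(1 - d))


def suggest(t, lexicon, k=5):
    """Rank lexicon words by cost (lower is better) and keep the top k."""
    scored = [(channel_cost(t, s), s) for s in lexicon]
    reachable = [(cost, s) for cost, s in scored if cost < math.inf]
    reachable.sort()  # ties are broken alphabetically by the word
    return [s for _, s in reachable[:k]]


# Toy lexicon: the "tns" candidates from the refined-objective slide plus one distractor.
LEXICON = ["teens", "tins", "tons", "tens", "tense", "turns", "today"]
print(suggest("tns", LEXICON, k=3))
```

Scanning the whole lexicon for every token is linear in the lexicon size, which is why faster argmax search is a natural follow-up.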

28 Results: Suggestion lists
2day (today): today (3.02), stay (11.46), away (13.13), play (13.14), clay (13.14)
fne (phone): fine (3.52), phone (5.13), funny (6.26), fined (6.51), fines (6.72)
cin (seeing): coin (3.52), chin (3.79), clean (5.95), coins (6.61), china (6.75)

29 Results: Graphs
[Figure: accuracy (%) versus rank, for all tokens and for only distorted tokens]

30 Comparison with Aspell
[Figure: accuracy (%) versus rank; legend text: "Aspell", "Our model on Unseen", "Model testing"]
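
Accuracy at a given rank, as plotted in the two graphs, is commonly computed as the percentage of tokens whose correct word appears within the top k suggestions. The sketch below assumes that definition and uses only the three example lists from the suggestion-list slide, so its numbers illustrate the metric and are not the reported results.

```python
def accuracy_at_rank(suggestions, gold, k):
    """Percentage of tokens whose gold word is among the top-k suggestions."""
    hits = sum(1 for t, s in gold.items() if s in suggestions.get(t, [])[:k])
    return 100.0 * hits / len(gold)


# The three example suggestion lists shown on the results slide.
SUGGESTIONS = {
    "2day": ["today", "stay", "away", "play", "clay"],
    "fne": ["fine", "phone", "funny", "fined", "fines"],
    "cin": ["coin", "chin", "clean", "coins", "china"],
}
GOLD = {"2day": "today", "fne": "phone", "cin": "seeing"}

for k in (1, 2, 5):
    print(k, round(accuracy_at_rank(SUGGESTIONS, GOLD, k), 2))
```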

31 Ongoing Work
Detailed evaluation
Incorporation of language models
Extension to other languages, namely Hindi and Bangla
Algorithms for fast argmax searching

32 Future Work
Improvement of the structure of HMM
–Introduction of self loops, backward edges
–Learning the structure from data
Case-based or analogical learning
–late → l8, test → tst; gr8st → greatest

33 Thank you for listening

