
Slide 1: A Unified Tagging Approach to Text Normalization
Conghui Zhu 1, Jie Tang 2, Hang Li 3, Hwee Tou Ng 4, and Tiejun Zhao 1
1 Harbin Institute of Technology, 2 Tsinghua University, 3 Microsoft Research Asia, 4 National University of Singapore

Slide 2: Outline
Motivation
Related Work
Problem Description
A Unified Tagging Approach
Experimental Results
Summary

Slide 3: Motivation
More and more 'informally inputted' text data is becoming available to NLP, e.g., emails, newsgroups, forums, and blogs.
Such informal text is usually very noisy: 98.4% of 5,000 randomly selected emails contain noise.
Previously, text normalization was conducted in a more or less ad hoc manner, e.g., with heuristic rules or separate classification models.

Slide 4: Examples

Noisy input text (containing extra line breaks, an extra space, extra punctuation, a missing space, a missing period, and case errors):
1. i'm thinking about buying a pocket
2. pc device for my wife this christmas,.
3. the worry that i have is that she won't
4. be able to sync it to her outlook express
5. contacts…

Normalized text ('//' denotes a detected paragraph boundary):
I'm thinking about buying a Pocket PC device for my wife this Christmas.// The worry that I have is that she won't be able to sync it to her Outlook Express contacts.//

On the noisy text, NER cannot find any named entities and term extraction contains many errors; on the normalized text, NER correctly recognizes the product "Pocket PC" and the date "Christmas".

Slide 5: Outline
Motivation
Related Work
Problem Description
A Unified Tagging Approach
Experimental Results
Summary

Slide 6: Related Work – Cleaning Informal Text
Preprocessing noisy texts: Clark (2003); Wong, Liu, and Bennamoun (2006)
NER from informal texts: Minkov, Wang, and Cohen (2005)
Signature extraction from informal text: Carvalho and Cohen (2004)
Email data cleaning: Tang, Li, Cao, and Tang (2005)

Slide 7: Related Work – Language Processing
Sentence boundary detection: e.g., Palmer and Hearst (1997); Mikheev (2000)
Case restoration: Lita and Ittycheriah (2003); Mikheev (2002)
Spelling error correction: Golding and Roth (1996); Brill and Moore (2000); Church and Gale (1991); Mays et al. (1991)
Word normalization: Sproat et al. (1999)

Slide 8: Outline
Motivation
Related Work
Problem Description
A Unified Tagging Approach
Experimental Results
Summary

Slide 9: Problem Description

Text normalization is defined at three levels:

Level      Task                                  % of Noises
Paragraph  Extra line break deletion             49.53
           Paragraph boundary detection          -
Sentence   Extra space deletion                  15.58
           Extra punctuation mark deletion       0.71
           Missing space insertion               1.55
           Missing punctuation mark insertion    3.85
           Misused punctuation mark correction   0.64
           Sentence boundary detection           -
Word       Case restoration                      15.04
           Unnecessary token deletion            9.69
           Misspelled word correction            3.41

Unnecessary token deletion refers to deletion of tokens like '--' and '=='.
Strong dependencies exist between the subtasks, so an ideal normalization method should process all the tasks together.

Slide 10: Outline
Motivation
Related Work
Problem Description
A Unified Tagging Approach
Experimental Results
Summary

Slide 11: Processing Flow

Training:
1. Preprocessing: segment the text into paragraphs and determine the tokens (standard words, non-standard words, punctuation marks, spaces, line breaks).
2. Label the data to obtain labeled training data.
3. Learn a CRF model from the labeled data and the feature definitions, yielding a unified tagging model.

Testing:
1. Preprocessing, as above.
2. Assign tags with the learned model to obtain the tagging results.
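The train/test flow above can be sketched as follows. The function names are hypothetical and the "model" is a stub that tags every token PRV (preserve) so the flow is runnable end to end; the actual system trains a CRF at step 2.

```python
def preprocess(text):
    """Step 1: determine tokens, keeping spaces and line breaks explicit."""
    tokens, word = [], ""
    for ch in text:
        if ch in (" ", "\n"):
            if word:
                tokens.append(word)
                word = ""
            tokens.append(ch)          # spaces and line breaks are tokens too
        else:
            word += ch
    if word:
        tokens.append(word)
    return tokens

def train(labeled_data):
    """Step 2: learn a tagging model from (tokens, tags) pairs (stubbed)."""
    return lambda tokens: ["PRV"] * len(tokens)   # real system: CRF training

def tag(model, text):
    """Step 3: assign one tag per token with the learned model."""
    tokens = preprocess(text)
    return list(zip(tokens, model(tokens)))

print(tag(train([]), "get a\npc"))
```

Note that, unlike a word-only tagger, the token stream keeps whitespace and line breaks so that the model can decide to delete or preserve them.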

Slide 12: Token Definitions
Standard word: words in natural language.
Non-standard word: several general 'special words', e.g., email address, IP address, URL, date, number, money, percentage, and unnecessary tokens (e.g., '===' and '###').
Punctuation mark: period, question mark, and exclamation mark.
Space: each space is identified as a space token.
Line break: every line break is a token.
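A rough classifier for the five token types on this slide can be written as below; the regular expressions are illustrative guesses covering some of the listed 'special words', not the paper's actual patterns.

```python
import re

def token_type(tok):
    """Map a raw token to one of the five token types."""
    if tok == " ":
        return "space"
    if tok == "\n":
        return "line break"
    if tok in {".", "?", "!"}:
        return "punctuation mark"
    # non-standard words: symbol runs ('===', '###'), emails, URLs,
    # and numeric tokens such as dates, money, percentages
    if re.fullmatch(r"[^\w\s]{2,}|\S+@\S+|https?://\S+|\d[\d.,/-]*", tok):
        return "non-standard word"
    return "standard word"

print(token_type("hello"))        # standard word
print(token_type("==="))          # non-standard word
print(token_type("02-16-2006"))   # non-standard word
```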

Slide 13: Possible Tag Assignments
(Green nodes are tags; purple nodes are tokens.)
Standard word: AMC, FUC, ALC, AUC
Non-standard word: DEL, PRV
Punctuation mark: DEL, PRV, PSB
Space: DEL, PRV
Line break: DEL, RPA, PRV
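The tag inventory on this slide amounts to a lookup table: each token type constrains which tags the decoder may consider for that token, which shrinks the search space.

```python
# Allowed tags per token type, exactly as listed on the slide.
POSSIBLE_TAGS = {
    "standard word":     ["AMC", "FUC", "ALC", "AUC"],
    "non-standard word": ["DEL", "PRV"],
    "punctuation mark":  ["DEL", "PRV", "PSB"],
    "space":             ["DEL", "PRV"],
    "line break":        ["DEL", "RPA", "PRV"],
}

# A decoder would only score these candidates for each token:
print(POSSIBLE_TAGS["line break"])
```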

Slide 14: Tagging
Example token sequence: get □ a □ toshiba's \n pc (□ = space, \n = line break).
Candidate tags: each standard word (get, a, toshiba's, pc) can be tagged AMC, FUC, ALC, or AUC; each space can be tagged DEL or PRV; the line break can be tagged DEL, RPA, or PRV.
Decoding selects Y* = argmax_Y P(Y|X), where X is the token sequence and Y the tag sequence.
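Decoding Y* = argmax_Y P(Y|X) under a first-order model is done with the Viterbi algorithm; a minimal sketch follows. The emission and transition scorers are toy stand-ins for the CRF's learned (log-)potentials, used here only to make the decoder runnable.

```python
def viterbi(tokens, tags, emit, trans):
    """Return the highest-scoring tag sequence for the given tokens."""
    # delta[y] = best score of any tag sequence for tokens[:i+1] ending in y
    delta = {y: emit(tokens[0], y) for y in tags}
    backptr = []
    for tok in tokens[1:]:
        new_delta, ptr = {}, {}
        for y in tags:
            best_prev = max(tags, key=lambda p: delta[p] + trans(p, y))
            new_delta[y] = delta[best_prev] + trans(best_prev, y) + emit(tok, y)
            ptr[y] = best_prev
        backptr.append(ptr)
        delta = new_delta
    # follow back-pointers from the best final tag
    y = max(tags, key=lambda t: delta[t])
    path = [y]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Toy scores: prefer deleting line breaks, preserving everything else.
emit = lambda tok, y: 1.0 if (y == "DEL") == (tok == "\n") else 0.0
trans = lambda p, y: 0.0

print(viterbi(["get", "\n", "pc"], ["PRV", "DEL"], emit, trans))  # ['PRV', 'DEL', 'PRV']
```

In the real system the candidate tag set would additionally be restricted per token type, as shown on the previous slide.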

Slide 15: Features

Transition features:
y_{i-1}=y', y_i=y
y_{i-1}=y', y_i=y, w_i=w
y_{i-1}=y', y_i=y, t_i=t

State features (w = word, t = word type):
w_{i+k}=w, y_i=y                        for k = -4, ..., 4
w_{i-1}=w', w_i=w, y_i=y
w_{i+1}=w', w_i=w, y_i=y
t_{i+k}=t, y_i=y                        for k = -4, ..., 4
t_{i-2}=t'', t_{i-1}=t', y_i=y
t_{i-1}=t', t_i=t, y_i=y
t_i=t, t_{i+1}=t', y_i=y
t_{i+1}=t', t_{i+2}=t'', y_i=y
t_{i-2}=t'', t_{i-1}=t', t_i=t, y_i=y
t_{i-1}=t'', t_i=t, t_{i+1}=t', y_i=y
t_i=t, t_{i+1}=t', t_{i+2}=t'', y_i=y

In total, more than 4M features were used in our experiments.
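A sketch of how these templates expand into concrete binary features at a position i is shown below. It is simplified to the transition templates and the windowed unigram state templates; the feature-string format is hypothetical, but the instantiation pattern (one string per matched template) is how CRF feature templates are typically realized.

```python
def features(words, types, i, y, y_prev):
    """Instantiate feature strings for tag y at position i."""
    feats = [f"y-1={y_prev},y={y}",                 # transition templates
             f"y-1={y_prev},y={y},w={words[i]}",
             f"y-1={y_prev},y={y},t={types[i]}"]
    for off in range(-4, 5):                        # windows i-4 .. i+4
        j = i + off
        if 0 <= j < len(words):
            feats.append(f"w[{off}]={words[j]},y={y}")
            feats.append(f"t[{off}]={types[j]},y={y}")
    return feats

# For a 3-token sentence, position 1 sees 3 window offsets (x2 templates)
# plus 3 transition features:
print(len(features(["get", "a", "pc"], ["ALC", "ALC", "ALC"], 1, "FUC", "ALC")))
```

Multiplying such instantiations over the vocabulary, type set, and tag set is what yields the 4M+ features mentioned above.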

Slide 16: Outline
Motivation
Related Work
Problem Description
A Unified Tagging Approach
Experimental Results
Summary

Slide 17: Datasets in Experiments

(ELB = extra line break, ES = extra space, EP = extra punc., MS = missing space, MP = missing punc., CE = casing error, SE = spelling error, MP* = misused punc., UT = unnecessary token)

Data Set   #Emails  #Noises  ELB     ES     EP   MS   MP     CE     SE     MP*  UT     #Para.Bnd.  #Sent.Bnd.
DC         100      702      476     31     8    3    24     53     14     2    91     457         291
Ontology   100      2,731    2,132   24     3    10   68     205    79     15   195    677         1,132
NLP        60       861      623     12     1    3    23     135    13     2    49     244         296
ML         40       980      868     17     0    2    13     12     7      0    61     240         589
Jena       700      5,833    3,066   117    42   38   234    888    288    59   1,101  2,999       1,836
Weka       200      1,721    886     44     0    30   37     295    77     13   339    699         602
Protégé    700      3,306    1,770   127    48   151  136    552    116    9    397    1,645       1,035
OWL        300      1,232    680     43     24   47   41     152    44     3    198    578         424
Mobility   400      2,296    1,292   64     22   35   87     495    92     8    201    891         892
WinServer  400      3,487    2,029   59     26   57   142    822    121    21   210    1,232       1,151
Windows    1,000    9,293    3,416   3,056  60   116  348    1,309  291    67   630    3,581       2,742
PSS        1,000    8,965    3,348   2,880  59   153  296    1,331  276    66   556    3,411       2,590
Total      5,000    41,407   20,586  6,474  293  645  1,449  6,249  1,418  265  4,028  16,654      13,580

Slide 18: Baseline Methods
Two baselines: the cascaded method and the independent method. Both perform the same subtasks: extra line break detection, extra space detection, extra punctuation mark detection, sentence boundary detection, unnecessary token deletion, and case restoration. The cascaded method runs the subtasks in sequence, each taking the previous subtask's output; the independent method runs each subtask separately on the input. The detection subtasks are implemented with SVM classifiers, unnecessary token deletion with heuristic rules, and case restoration with either TrueCasing (Lita et al., ACL 2003) or a CRF.
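The structural difference between the two baselines can be shown with hypothetical toy normalization steps: "cascaded" pipes each subtask's output into the next, while "independent" applies every subtask to the original input in isolation.

```python
def cascaded(text, steps):
    """Each step sees the previous step's output."""
    for step in steps:
        text = step(text)
    return text

def independent(text, steps):
    """Each step sees only the raw input; decisions ignore other subtasks."""
    return [step(text) for step in steps]

# Toy stand-ins for two subtasks: space cleanup, then case restoration.
steps = [str.strip, str.capitalize]
print(cascaded("  hello world", steps))     # case step benefits from cleanup
print(independent("  hello world", steps))  # case step still sees leading spaces
```

Even in this toy, the independent capitalization step fails because it never sees the space-cleaned text, which mirrors why the baselines suffer from ignoring inter-subtask dependencies.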

Slide 19: Normalization Results (5-fold cross-validation)

Detection Task            Method       Prec.  Rec.   F1     Acc.
Extra Line Break          Independent  95.16  91.52  93.30  93.81
                          Cascaded     95.16  91.52  93.30  93.81
                          Unified      93.87  93.63  93.75  94.53
Extra Space               Independent  91.85  94.64  93.22  99.87
                          Cascaded     94.54  94.56  94.55  99.89
                          Unified      95.17  93.98  94.57  99.90
Extra Punctuation Mark    Independent  88.63  82.69  85.56  99.66
                          Cascaded     87.17  85.37  86.26  99.66
                          Unified      90.94  84.84  87.78  99.71
Sentence Boundary         Independent  98.46  99.62  99.04  98.36
                          Cascaded     98.55  99.20  98.87  98.08
                          Unified      98.76  99.61  99.18  98.61
Unnecessary Token         Independent  72.51  100.0  84.06  84.27
                          Cascaded     72.51  100.0  84.06  84.27
                          Unified      98.06  95.47  96.75  96.18
Case Restoration          Independent  27.32  87.44  41.63  96.22
 (TrueCasing)             Cascaded     28.04  88.21  42.55  96.35
Case Restoration (CRF)    Independent  84.96  62.79  72.21  99.01
                          Cascaded     85.85  63.99  73.33  99.07
                          Unified      86.65  67.09  75.63  99.21

Slide 20: Normalization Results (cont.)

Text Normalization               Prec.  Rec.   F1     Acc.
Independent (TrueCasing)         69.54  91.33  78.96  97.90
Independent (CRF)                85.05  92.52  88.63  98.91
Cascaded (TrueCasing)            70.29  92.07  79.72  97.88
Cascaded (CRF)                   85.06  92.70  88.72  98.92
Unified w/o Transition Features  86.03  93.45  89.59  99.01
Unified                          86.46  93.92  90.04  99.05

1) The baseline methods suffer from ignoring the dependencies between the subtasks.
2) Our method benefits from modeling these dependencies.
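As a sanity check on how the columns relate: F1 is the harmonic mean of precision and recall, so each F1 value can be recomputed from the Prec. and Rec. columns, e.g. for the Unified row.

```python
def f1(prec, rec):
    """Harmonic mean of precision and recall."""
    return 2 * prec * rec / (prec + rec)

print(round(f1(86.46, 93.92), 2))  # → 90.04, matching the Unified row
```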

Slide 21: Comparison Example

Original informal text:
1. i'm thinking about buying a pocket
2. pc device for my wife this christmas,.
3. the worry that i have is that she won't
4. be able to sync it to her outlook express
5. contacts…

By our method:
I'm thinking about buying a Pocket PC device for my wife this Christmas.// The worry that I have is that she won't be able to sync it to her Outlook Express contacts.//

By the cascaded method:
I'm thinking about buying a pocket PC device for my wife this Christmas, The worry that I have is that she won't be able to sync it to her outlook express contacts.//

By the independent method:
I'm thinking about buying a pocket PC device for my wife this Christmas, the worry that I have is that she won't be able to sync it to her outlook express contacts.//

Only our method restores all the cases ("Pocket", "Outlook Express") and detects both sentence boundaries.

Slide 22: Error Analysis
Extra line break detection: 31.14% of errors are due to incorrect elimination and 64.07% due to overlooked extra line breaks.
Space detection: e.g., "02-16- 2006" and "desk top".
Case restoration: e.g., special words like ".NET" and "Ph.D.", and proper nouns like "John" and "HP Compaq".

Slide 23: Computational Cost

Method                    Training   Tagging
Independent (TrueCasing)  2 minutes  a few seconds
Cascaded (TrueCasing)     3 minutes  a few seconds
Unified                   5 hours    25 seconds

*Tested on a computer with two 2.8 GHz Pentium 4 CPUs and 3 GB of memory.

Slide 24: How Text Normalization Helps NER
278 named entities (person names, location names, organization names, and product names) were annotated in 200 emails. NER performance, measured with GATE, improves by +16.60% after text normalization.

Slide 25: Outline
Motivation
Related Work
Problem Description
A Unified Tagging Approach
Experimental Results
Summary

Slide 26: Summary
Investigated the problem of text normalization.
Formalized the problem as noise elimination and boundary detection subtasks.
Proposed a unified tagging approach that performs the subtasks together.
Empirically verified the effectiveness of the proposed approach.

Slide 27: Thanks! Q&A
