
Wilson Wong, Wei Liu and Mohammed Bennamoun School of Computer Science and Software Engineering University of Western Australia Enhanced Integrated Scoring.


1 Wilson Wong, Wei Liu and Mohammed Bennamoun School of Computer Science and Software Engineering University of Western Australia Enhanced Integrated Scoring for Cleaning Dirty Texts (ISSAC v2)

2 INTRODUCTIONS
Authors: Wilson Wong, Wei Liu and Mohammed Bennamoun (University of Western Australia)
Presented by: Benjamin Johnston (University of Technology, Sydney)

3 INDEX
1. Background
2. Problems & Challenges
3. Solution
4. Evaluations
5. Future Works

4 BACKGROUND: 3 types of errors
The three types are spelling errors (e.g. "Splling erors"), ad-hoc abbreviations (e.g. "Abbre") and improper casing (e.g. "IMPROPER cAsing").
Research on them has traditionally been carried out separately, yet the three errors are interrelated. For example, "itme" could be:
time, with i and t swapped;
item, with m and e swapped;
ITME, the Institute of Electronics Materials Technology in Warsaw, Poland;
its me, with a missing s.

5 BACKGROUND: Spelling error and abbreviation
Spelling error detection and correction:
Minimum edit distance (Damerau-Levenshtein, Wagner-Fischer, etc.)
Similarity key (SOUNDEX, Metaphone, Double Metaphone, Daitch-Mokotoff, etc.)
Abbreviation expansion: most research has been carried out in the area of named-entity recognition. Existing techniques rely on:
Letter casing, e.g. NASA
Use of periods, e.g. U.S.A.
Use of parentheses, e.g. North Atlantic Treaty Organisation (NATO)
Number of letters in words

6 BACKGROUND: Letter casing
Case restoration: improper casing in words is detected and restored. Common approaches include:
Using N-grams to predict the most likely case (LC, MC, UC) of a token based on its local context. Example from the slide's diagram: after the token "new", a subsequent token such as "information" is likely to be LC and is categorized as LC, whereas "york" is less likely to be LC.
Relying on an unambiguous introduction of ambiguous tokens. The ambiguity of "Riders" is reduced once we encounter "John Riders" in the same text.

7 INDEX

8 PROBLEMS & CHALLENGES: Existing techniques, their accuracies and test data
Test data are either artificial or only mildly dirty text.
Techniques are isolated.

9 PROBLEMS & CHALLENGES: Example of dirty texts
Ad-hoc abbreviations, common in the Internet era, pose extra challenges (e.g. np, ty, u).

10 PROBLEMS & CHALLENGES: Examples of existing applications
[Original] Mi Teacer konstanly REMINDS mee that edicotion is an inporrant aspek of lifu. She sad, "Few yrs in scholl will ensur a beter liFO for u". (16 errors)
[Aspell 0.50.3] Mi Teaser constantly REMINDS mer that eduction is an inerrant asper of LIFO. She sad, "Few yrs in school will ensue a beater LIFO for u". (2/16 corrected)
[http://www.spellcheck.net] MI Teacher kinsman REMINDS meek that education is an important speak of life. She sad, "Few yes in Scholl will ensure a better LIFO for U". (5/16 corrected)
[MS Office Word 2003] Mi Teacher constantly REMINDS me that education is an important aspect of life. She sad, "Few yrs in Scholl will ensure a better LIFO for u". (8/16 corrected)

11 PROBLEMS & CHALLENGES: Challenges to be addressed
Techniques for abbreviation expansion and similar tasks that are based on patterns and static dictionaries face problems with expansion.
Integrated approaches for automatically correcting all three types of errors are rare.
The accuracy of corrections by the existing isolated techniques can be further improved.
The accuracy of existing techniques (individual or integrated) on extremely challenging dirty texts (e.g. chat records) has yet to be demonstrated.

12 INDEX

13 SOLUTION: Overview
Our solution must take the following into consideration:
Integrated approach (for all 3 types of errors)
High accuracy
Automatic (i.e. no user involvement)
Evaluations using real-world dirty texts
ISSAC v2 draws on:
Suggestions and ranks by Aspell
Expansions for abbreviations by Stands4.com
Google's page count and spell check
Domain corpora (i.e. a collection of dirty texts)

14 SOLUTION: Aspell
Each error term is fed into Aspell, which generates a list of suggestions for it.

15 SOLUTION: Stands4.com
Stands4.com is consulted for possible expansions of each erroneous term. A local copy is maintained for future use.

16 SOLUTION: Google
ISSAC makes use of:
Google's ability to search for phrases
The page count that Google returns
Google's suggestions for spelling errors in queries

17 SOLUTION: Notations
For an error term $e$, the set of suggestions $S$ contains:
$m$ expansions, all with rank 1
$n$ suggestions by Aspell, according to their original rank
Google's suggestion
the error term itself
$s_{j,i}$ = the $j$-th suggestion with rank $i$ in the set $S$

18 SOLUTION: Notations
Example: for the error "itme", S contains time, item, Institute of Electronics Materials Technology, …
We use the neighbouring words to disambiguate and identify the most suitable suggestion from S for automatic correction. The left and right words are considered as context.
For "shipping itme frame": left word, l = shipping; right word, r = frame.

19 SOLUTION: Different weights in ISSAC v2
Original rank by Aspell, $i^{-1} \in (0,1]$
Normalized edit distance, $NED(e, s_{j,i}) \in (0,1]$
Reuse factor, $RF(e, s_{j,i}) \in \{0,1\}$
Abbreviation factor, $AF(e, s_{j,i}) \in \{0,1\}$
Domain significance, $DS(l, s_{j,i}, r) \in (0,1)$
General significance, $GS(l, s_{j,i}, r) \in (0,1)$

20 SOLUTION: Correction using ISSAC
The list of suggestions S is re-ranked using a new score, NS (the formula on the slide is not preserved in this transcript).
Individual weights contribute to the overall ranking of each suggestion.
The suggestion with the highest NS is taken as the most suitable replacement given the surrounding context.
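The re-ranking step can be sketched as follows. The transcript does not preserve the scoring formula, so this sketch assumes that NS is simply the unweighted sum of the six weights from the previous slide. The class and function names and all numeric values are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass
class Weights:
    """The six ISSAC weights for one suggestion (field names are ours)."""
    inverse_rank: float  # i^-1 from Aspell's original rank, in (0, 1]
    ned: float           # normalized edit distance, in (0, 1]
    rf: int              # reuse factor, 0 or 1
    af: int              # abbreviation factor, 0 or 1
    ds: float            # domain significance, in (0, 1)
    gs: float            # general significance, in (0, 1)


def new_score(w: Weights) -> float:
    # ASSUMPTION: NS is the plain sum of the six weights; the slide's
    # actual formula is not available in the transcript.
    return w.inverse_rank + w.ned + w.rf + w.af + w.ds + w.gs


def best_suggestion(candidates: Mapping[str, Weights]) -> str:
    """Return the suggestion with the highest NS."""
    return max(candidates, key=lambda s: new_score(candidates[s]))


# Made-up weights for the error "itme" in "shipping itme frame".
candidates = {
    "item": Weights(1.0, 0.5, 0, 0, 0.20, 0.10),
    "time": Weights(0.5, 0.5, 0, 0, 0.60, 0.70),
    "Institute of Electronics Materials Technology":
        Weights(1.0, 0.1, 0, 1, 0.01, 0.01),
}
print(best_suggestion(candidates))
```

With these made-up numbers the context weights (DS and GS) outweigh Aspell's original ranking, so "time" is selected even though "item" has the better Aspell rank.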

21 SOLUTION: Edit distance
Heuristic: the correct replacement should not deviate too far from the error.
Example: suggestions for "itme" include item, time, it me, timer and Tim.
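A minimal sketch of this heuristic follows. The edit distance is the optimal string alignment variant of Damerau-Levenshtein (insertions, deletions, substitutions and adjacent transpositions). The slides only state that NED lies in (0,1]; the normalization 1 / (1 + distance) used here is an assumption, as are the function names.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: each insertion, deletion,
    substitution or adjacent transposition costs 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def normalized_edit_distance(error: str, suggestion: str) -> float:
    # ASSUMPTION: map the distance into (0, 1] as 1 / (1 + distance), so
    # closer suggestions score higher. Case is ignored before comparing.
    return 1.0 / (1 + osa_distance(error.lower(), suggestion.lower()))


for s in ["item", "time", "it me", "timer", "Tim"]:
    print(s, osa_distance("itme", s.lower()),
          round(normalized_edit_distance("itme", s), 3))
```

Under this normalization, "item", "time" and "it me" are one operation away from "itme" and tie on NED, while "timer" and "Tim" are two operations away and score lower, which is why the other weights are needed to break the tie.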

22 SOLUTION: Reuse and abbreviation factors
Abbreviation factor: the abbreviation dictionary is consulted. If a suggestion is a potential expansion of the error term (treated as an abbreviation), AF yields 1, and 0 otherwise.
Reuse factor: returns 1 if the suggestion appears in the spelling dictionary, and 0 otherwise.
There are two types of entries in the spelling dictionary:
Suggestions by Google for spelling errors, automatically updated every time Google suggests a replacement for an error.
Suggestions for errors provided by users (optional).
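Both factors amount to dictionary lookups, which can be sketched as below. The dictionary layout (error term mapped to a set of known replacements), the function names and the sample entries are assumptions made for illustration; the slides do not describe how the dictionaries are stored.

```python
from typing import Dict, Set

# Hypothetical local copy of abbreviation expansions (illustrative entries).
abbreviation_dict: Dict[str, Set[str]] = {
    "itme": {"Institute of Electronics Materials Technology"},
    "ty": {"thank you"},
}

# Hypothetical spelling dictionary: error term -> replacements seen so far.
spelling_dict: Dict[str, Set[str]] = {}


def abbreviation_factor(error: str, suggestion: str) -> int:
    """AF: 1 if the suggestion is a known expansion of the error term."""
    return 1 if suggestion in abbreviation_dict.get(error.lower(), set()) else 0


def reuse_factor(error: str, suggestion: str) -> int:
    """RF: 1 if the suggestion was previously recorded for this error."""
    return 1 if suggestion in spelling_dict.get(error.lower(), set()) else 0


def remember_correction(error: str, replacement: str) -> None:
    """Record a replacement (e.g. one suggested by Google or by a user)."""
    spelling_dict.setdefault(error.lower(), set()).add(replacement)


remember_correction("konstanly", "constantly")
print(reuse_factor("konstanly", "constantly"))
print(abbreviation_factor("itme", "time"))
```

Keeping the spelling dictionary separate from the abbreviation dictionary mirrors the slide: one grows automatically as corrections are observed, the other is a cached copy of an external resource.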

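23 SOLUTION: Domain significance
[Formula and figure not preserved in the transcript; the definition involves quantities A, B, C and D, where B, D > 0.]
Three cases are annotated:
$s_{j,i}$ is not common, both individually and in context.
$s_{j,i}$ occurs very frequently, both individually and in context, but nearly all documents contain the term (i.e. it is too common).
$s_{j,i}$ occurs very frequently and appears exclusively in only a few documents.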

24 SOLUTION: General significance
[Formula and figure not preserved in the transcript; the definition involves quantities A, B, C and D, where B, D > 0 and B < D.]
Three cases are annotated:
$s_{j,i}$ appears very rarely in context.
$s_{j,i}$ appears often in context and often individually (i.e. the term is very common).
$s_{j,i}$ appears often in context, and its individual appearance approaches its appearance in context (i.e. the term is exclusive to the context).

25 INDEX

26 EVALUATIONS: Accuracy of ISSAC
The evaluation data (700 chat sessions, 3313 errors) are actual chat records between agents and customers, provided by 247Customer.com.

27 EVALUATIONS Accuracy of ISSAC

28 EVALUATIONS: When ISSAC doesn't work
Cause 1 (0.8%): the correct replacement is absent from the list of suggestions produced by Aspell. The accuracy of correction by ISSAC is bounded by the coverage of S. For example, the correct replacement for "dotn" is not present in Aspell's list of suggestions.

29 EVALUATIONS: When ISSAC doesn't work
Cause 2 (0.7%): two flaws related to l and r:
Neighbouring words are not correctly spelt. Example: "morel iberal return".
The left and right words are inadequate. Example: "both ocats <".
Cause 3 (0.5%): two anomalies where ISSAC does not apply:
Suggestions that are equally likely to be the correct replacement. Example: Cheng or Cheung in the context of "Janice Cheng <".
Contrasting disagreement among weights.

30 INDEX

31 FUTURE WORKS
[Original] Mi Teacer konstanly REMINDS mee that edicotion is an inporrant aspek of lifu. She sad, "Few yrs in scholl will ensur a beter liFO for u". (16 errors)
[ISSAC v2] My teacher constantly reminds me that education is an important aspect of life. She said, "Few years in school will ensure a better Life for you". (15/16 corrected)
Look for solutions to overcome the 3 causes to improve the accuracy.
Carry out evaluations on larger data sets.
Evaluate ISSAC in terms of time complexity.

32 THANK YOU


34 BACKGROUND: Spelling error
Widely adopted classes of techniques for detecting and correcting spelling errors:
Minimum edit distance
Similarity key (phonetic algorithms)
Minimum edit distance: the minimal number of insertions, deletions, substitutions and transpositions needed to transform one string into the other (Damerau-Levenshtein, Wagner-Fischer, etc.).
Example: transforming "wear" into "beard" requires a minimum of 2 operations: substitute w with b (wear -> bear), then insert d (bear -> beard).

35 BACKGROUND: Spelling error
Similarity key: map every string into a key such that similarly spelled strings will have identical keys. The key, computed for each spelling error, acts as a pointer to all similarly pronounced words (i.e. sounds-like words) in the dictionary. Examples: SOUNDEX, Metaphone, Double Metaphone, etc.
Example: wear -> w006 -> w6; ware -> w060 -> w6.
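The wear/ware example can be reproduced with a simplified Soundex-style key, sketched below. This is not the full Soundex standard (there is no padding to four characters and no special handling of h and w between consonants); it only keeps the first letter, encodes the remaining letters with the usual Soundex digit groups, and drops the zeros that stand for vowels. The function name is hypothetical.

```python
# Soundex digit groups; vowels and h, w, y map to "0" (dropped later).
_GROUPS = (
    ("aeiouyhw", "0"),
    ("bfpv", "1"),
    ("cgjkqsxz", "2"),
    ("dt", "3"),
    ("l", "4"),
    ("mn", "5"),
    ("r", "6"),
)
CODES = {ch: digit for letters, digit in _GROUPS for ch in letters}


def similarity_key(word: str) -> str:
    """Simplified Soundex-style key: first letter plus consonant codes."""
    letters = [ch for ch in word.lower() if ch in CODES]
    if not letters:
        return ""
    digits = [CODES[ch] for ch in letters[1:]]
    # Collapse runs of the same digit, then drop the vowel placeholder "0".
    collapsed = []
    for digit in digits:
        if not collapsed or collapsed[-1] != digit:
            collapsed.append(digit)
    return letters[0] + "".join(d for d in collapsed if d != "0")


print(similarity_key("wear"), similarity_key("ware"))
```

Both words map to the key "w6", so a dictionary indexed by this key would return "ware" as a sounds-like candidate for "wear" and vice versa.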

