LING/C SC/PSYC 438/538 Lecture 24 Sandiway Fong 1.


1 LING/C SC/PSYC 438/538 Lecture 24 Sandiway Fong 1

2 Administrivia
Homeworks 7 and 8 graded

3 Last Time
1. Homework 8 review
2. CKY algorithm for parsing context-free grammars
– Chomsky Normal Form (CNF)
– 2D table

      w1          w2          w3
  [0,1] x1    [0,2] y     [0,3] s s
              [1,2] x2    [1,3] z
                          [2,3] x3

s --> y, x3.
s --> x1, z.
s --> y, z.
y --> x1, x2.
z --> x2, x3.
x1 --> [w1].
x2 --> [w2].
x3 --> [w3].
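The table on this slide can be filled mechanically. Below is a minimal sketch in Python, assuming the grammar exactly as listed above; the function name `cky` and the tuple/dictionary encoding of the rules are illustrative choices, not taken from the lecture.

```python
from collections import defaultdict

# Grammar from the slide, already in Chomsky Normal Form.
BINARY = [                      # (lhs, left, right) encodes lhs --> left, right.
    ("s", "y", "x3"),
    ("s", "x1", "z"),
    ("s", "y", "z"),
    ("y", "x1", "x2"),
    ("z", "x2", "x3"),
]
LEXICAL = {"w1": ["x1"], "w2": ["x2"], "w3": ["x3"]}   # x1 --> [w1]. etc.


def cky(words):
    """Fill the CKY table; table[(i, j)] lists nonterminals spanning words i..j.

    A nonterminal is listed once per derivation, so duplicates are possible.
    """
    n = len(words)
    table = defaultdict(list)
    for i, word in enumerate(words):
        table[(i, i + 1)] = list(LEXICAL.get(word, []))
    for width in range(2, n + 1):            # span length
        for i in range(0, n - width + 1):    # span start
            j = i + width                    # span end
            for k in range(i + 1, j):        # split point
                for lhs, left, right in BINARY:
                    if left in table[(i, k)] and right in table[(k, j)]:
                        table[(i, j)].append(lhs)
    return table
```

Calling `cky(["w1", "w2", "w3"])` puts `s` into cell `(0, 3)` twice, once for each split point, which matches the "s s" entry in the table.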

4 Administrivia
TCEs are fully online (paper TCEs are history)
https://tce.oirps.arizona.edu/TCEOnline

5 Spelling Errors Sections 3.10 and 5.9

6 Spelling Errors
Textbook cites (Kukich, 1992):
– Non-word detection (easiest)
  graffe (giraffe)
– Isolated-word (context-free) error correction
  graffe (giraffe, …)
  graffed (gaffed, …)
  by definition cannot correct when the error word is a valid word
– Context-dependent error detection and correction (hardest)
  your an idiot → you’re an idiot
  Their is → there is (Microsoft Word corrects this by default)

7 Spelling Errors
OCR
– visual similarity
  h → b, e → c, jump → jurnps
Typing
– keyboard distance
  small → smsll, spell → spel
Graffiti (many HCI studies)
– stroke similarity
  Common error characters are: V, T, 4, L, E, Q, K, N, Y, 9, P, G, X
  Two stroke characters: B, D, P (error: two characters)
Cognitive Errors
– bad spellers
  separate → seperate

8 correct
– textbook section 5.9
  Kernighan et al. (correct)
– take typo t (not a word)
  mutate t minimally by deleting, inserting, substituting or transposing (swapping) a letter
  look up “mutated t” in a dictionary
  candidates are “mutated t” that are real words
– example (5.2)
  t = acress
  C = {actress, cress, caress, access, across, acres, acres}
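The candidate-generation step can be sketched in a few lines of Python. This is only an illustration of "mutate t minimally, then look up": the function names and the small `DICTIONARY` are made up for the example and are not the data used by correct.

```python
import string


def single_edits(t, alphabet=string.ascii_lowercase):
    """All strings one deletion, insertion, substitution or transposition from t."""
    splits = [(t[:i], t[i:]) for i in range(len(t) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    substitutes = [a + ch + b[1:] for a, b in splits if b for ch in alphabet]
    inserts = [a + ch + b for a, b in splits for ch in alphabet]
    return set(deletes + transposes + substitutes + inserts) - {t}


def candidates(t, dictionary):
    """Mutated forms of t that are real words."""
    return single_edits(t) & set(dictionary)


# Toy dictionary, invented for this example.
DICTIONARY = {"actress", "cress", "caress", "access", "across", "acres",
              "address", "stress"}

print(sorted(candidates("acress", DICTIONARY)))
```

With this toy dictionary the six distinct candidates from example (5.2) are found; "address" and "stress" are rejected because they are two edits away. A set cannot show that acres is reachable in two different ways, which is why it appears twice in C on the slide.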

9 correct
formula
– $\hat{c} = \arg\max_{c \in C} P(t \mid c)\, P(c)$ (Bayesian Inference)
– C = {actress, cress, caress, access, across, acres, acres}
Prior: P(c)
– estimated using frequency information over a large corpus (N words)
– $P(c) = \mathrm{freq}(c) / N$
– $P(c) = (\mathrm{freq}(c) + 0.5) / (N + 0.5V)$
  avoid zero counts (non-occurrences) (add fractional part 0.5)
  add one (0.5) smoothing
  V is vocabulary size of corpus
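A minimal sketch of the smoothed prior and the argmax, in Python. The counts in `FREQ`, `N_WORDS` and `VOCAB_SIZE` are invented purely for illustration (they are not corpus data), and the `likelihood` argument is a placeholder for P(t|c), which the next slide covers.

```python
def smoothed_prior(word, freq, n_words, vocab_size, k=0.5):
    """P(c) = (freq(c) + k) / (N + k*V), with k = 0.5 as on the slide."""
    return (freq.get(word, 0) + k) / (n_words + k * vocab_size)


def correct(candidates, likelihood, freq, n_words, vocab_size):
    """Return the candidate c that maximises P(t|c) * P(c)."""
    return max(
        candidates,
        key=lambda c: likelihood(c) * smoothed_prior(c, freq, n_words, vocab_size),
    )


# Hypothetical counts, invented for this example only.
FREQ = {"actress": 3, "acres": 6, "across": 20}
N_WORDS = 100
VOCAB_SIZE = 10

# "cress" never occurs in the toy corpus, yet its prior is not zero.
print(smoothed_prior("cress", FREQ, N_WORDS, VOCAB_SIZE))
```

Without the added 0.5, any candidate missing from the corpus would get P(c) = 0 and could never win the argmax, however likely the typo.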

10 correct
Likelihood: P(t|c)
– probability of typo t given candidate word c
– using some corpus of errors
– compute following 4 confusion matrices (each a 26 x 26 matrix, a–z)
– del[x,y] = freq(correct xy mistyped as x)
– ins[x,y] = freq(correct x mistyped as xy)
– sub[x,y] = freq(correct x mistyped as y)
– trans[x,y] = freq(correct xy mistyped as yx)
– P(t|c) = del[x,y]/f(xy) if c related to t by deletion of y
– P(t|c) = ins[x,y]/f(x) if c related to t by insertion of y
  etc…
Very hard to collect this data

11 correct
example
– t = acress
– $\hat{c}$ = acres (44%)
despite all the math, wrong result for
– was called a stellar and versatile acress
what does Microsoft Word use?
– was called a stellar and versatile acress

12 Microsoft Word corrected here

13 Google

26 Part 2
Another algorithm using dynamic programming:
– Minimum Edit Distance
Textbook: section 3.11
File: eds.xls

27 Minimum Edit Distance
general string comparison
  edit operations are insertion, deletion and substitution
  not just limited to distance defined by a single operation away
  we can ask how different is string a from b by the minimum edit distance

28 Minimum Edit Distance
applications
– could be used for multi-typo correction
– used in Machine Translation Evaluation (MTEval)
– example
  Source: 生産工程改善について
  Translations:
  (Standard) For improvement of the production process
  (MT-A) About a production process betterment
  (MT-B) About the production process improvement method
– compute edit distance between MT-A and Standard and MT-B and Standard in terms of word insertion/substitution etc.
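For the MTEval example, the same distance can be computed over words instead of letters. The sketch below is one way to do it in Python; the function name `edit_distance`, the whitespace tokenisation and the `sub_cost` parameter are assumptions made for illustration, not the scoring method of any particular MT evaluation tool.

```python
def edit_distance(source, target, sub_cost=1):
    """Minimum edit distance between two sequences (strings or lists of words).

    Insertion and deletion cost 1; substitution costs sub_cost.
    Only the previous row of the table is kept.
    """
    previous = list(range(len(target) + 1))      # empty source -> target prefixes
    for i, s_item in enumerate(source, start=1):
        current = [i]                            # source prefix -> empty target
        for j, t_item in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,                 # delete s_item
                current[j - 1] + 1,              # insert t_item
                previous[j - 1] + (0 if s_item == t_item else sub_cost),
            ))
        previous = current
    return previous[-1]


standard = "For improvement of the production process".split()
mt_a = "About a production process betterment".split()
mt_b = "About the production process improvement method".split()

for name, mt in (("MT-A", mt_a), ("MT-B", mt_b)):
    print(name, edit_distance(mt, standard), edit_distance(mt, standard, sub_cost=2))
```

Under unit costs both translations are 5 word edits from the Standard; with substitution costing 2, MT-A comes out at 7 and MT-B at 6, so the choice of cost model changes the ranking.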

29 Minimum Edit Distance
cost models
– Levenshtein
  insertion, deletion and substitution all have unit cost
– Levenshtein (alternate)
  insertion, deletion have unit cost
  substitution is twice as expensive
  substitution = one insert followed by one delete
– Typewriter
  insertion, deletion and substitution all have unit cost
  modified by key proximity

30 Minimum Edit Distance
Dynamic Programming
– divide-and-conquer
  to solve a problem we divide it into sub-problems
– sub-problems may be repeated
  don’t want to re-solve a sub-problem the 2nd time around
– idea: put solutions to sub-problems in a table
  and just look up the solution 2nd time around, thereby saving time
  memoization
we’ll use a spreadsheet…
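The "solve once, then look it up" idea can be sketched in Python with a cache on a recursive function. This is an illustration only: the names are made up, and the default substitution cost of 2 is an assumption (the alternate Levenshtein model from the previous slide).

```python
from functools import lru_cache


def min_edit_distance(s, t, sub_cost=2):
    """Return (distance, number of sub-problems actually solved)."""
    solved = 0

    @lru_cache(maxsize=None)          # the lookup table: each (i, j) is solved once
    def cost(i, j):
        """Cost of transforming the first i letters of s into the first j of t."""
        nonlocal solved
        solved += 1                   # only runs on a cache miss
        if i == 0:
            return j                  # insert j letters
        if j == 0:
            return i                  # delete i letters
        same = s[i - 1] == t[j - 1]
        return min(
            cost(i - 1, j) + 1,                           # delete s[i-1]
            cost(i, j - 1) + 1,                           # insert t[j-1]
            cost(i - 1, j - 1) + (0 if same else sub_cost),
        )

    return cost(len(s), len(t)), solved


print(min_edit_distance("leader", "adapter"))
```

For leader and adapter only 7 x 8 = 56 sub-problems are ever solved, one per table cell; without the cache the same cells would be recomputed many times over.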

31 Minimum Edit Distance
Consider a simple case: xy ⇄ yx
Minimum # of operations: insert and delete
cost = 2
Minimum # of operations: swap
cost = ?
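One way to explore the "swap cost = ?" question is to add an adjacent-transposition step to the table computation and make its cost a parameter. The Python sketch below does this (this variant is usually called optimal string alignment, or restricted Damerau-Levenshtein, distance). The lecture does not fix a swap cost, so `swap_cost=1` is purely an assumption, as is the substitution cost of 2.

```python
def distance_with_swaps(s, t, sub_cost=2, swap_cost=1):
    """Edit distance allowing insert, delete, substitute and adjacent swap."""
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                                   # delete i letters
    for j in range(cols):
        d[0][j] = j                                   # insert j letters
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + 1,                      # delete
                d[i][j - 1] + 1,                      # insert
                d[i - 1][j - 1] + (0 if s[i - 1] == t[j - 1] else sub_cost),
            )
            # last two letters of each prefix are the same pair, reversed
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + swap_cost)
    return d[rows - 1][cols - 1]


print(distance_with_swaps("xy", "yx"))               # swap is cheapest
print(distance_with_swaps("xy", "yx", swap_cost=3))  # insert + delete is cheapest
```

With a swap cost below 2 the swap wins for xy ⇄ yx; at 2 or more it never beats one insert plus one delete.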

32 Minimum Edit Distance Generally

33 Minimum Edit Distance Programming Practice: could be easily implemented in Perl

34 Minimum Edit Distance Generally

35 Minimum Edit Distance Computation
Or in Microsoft Excel, file: eds.xls (on course webpage)
$ in a cell reference means don’t change when copied from cell to cell
e.g. in C$1 the row number 1 stays the same
in $A3 the column letter A stays the same

36 Minimum Edit Distance
Task: transform string $s_1 \ldots s_i$ into string $t_1 \ldots t_j$
– each $s_n$ and $t_n$ is a letter
– string s is of length i, t is of length j
Example:
– s = leader, t = adapter
– i = 6, j = 7
– Let’s say you’re allowed just three operations: (1) delete a letter, (2) insert a letter, or (3) substitute a letter for another letter
– What is one possible way to generate t from s?

37 Minimum Edit Distance
Example:
– s = leader, t = adapter
– What is one possible way to generate t from s?
– Simplest method: delete every letter of s, then insert every letter of t
  leader◄ leade◄ lead◄ lea◄ le◄ l◄ ◄ a◄ ad◄ ada◄ adap◄ adapt◄ adapte◄ adapter◄
  cost: 13 operations
– A cheaper way: line up the shared letters
  leader
    ↕↕
    adapter
  (delete l, e; keep a, d; insert a, p, t; keep e, r)
– cost is 2 deletes and 3 inserts, total 5 operations
– Question: is this the minimum possible?

38 Minimum Edit Distance
(empty table: row i stands for the first i letters of s = leader, column j for the first j letters of t = adapter; row 0 and column 0 stand for the empty string)

         0  1  2  3  4  5  6  7
            a  d  a  p  t  e  r
    0
    1 l
    2 e
    3 a
    4 d
    5 e
    6 r

39 (same table)
cell (2,3): cost of transforming le into ada

40 Minimum Edit Distance
cell (2,3): cost of transforming le into ada
cell (6,7): cost of transforming leader into adapter

41 Minimum Edit Distance
cell (3,0): cost of transforming lea into (empty)

42 Minimum Edit Distance
cell (0,4): cost of transforming (empty) into adap

43 Minimum Edit Distance
cell (5,6) = k: cost of transforming leade into adapte

44 Minimum Edit Distance
cell (5,6) = k ➡

45 Minimum Edit Distance
cell (5,6) = k, cell (6,7) = k
both strings end in r, so going from leade ➡ adapte to leader ➡ adapter adds no cost

46 Minimum Edit Distance
cell (2,3) = k: cost of transforming le into ada

47 Minimum Edit Distance
cell (2,3) = k: cost of transforming le into ada
➡ cell (2,4): cost of transforming le into adap

48 Minimum Edit Distance
cell (2,3) = k, cell (2,4) = k+1
le ➡ adap: transform le into ada, then insert the p

49 Minimum Edit Distance
cell (1,4) = k: cost of transforming l into adap ➡

50 Minimum Edit Distance
cell (1,4) = k, cell (2,4) = k+1
le ➡ adap: delete the e, then transform l into adap

51 Minimum Edit Distance
cell (1,3) = k: cost of transforming l into ada ➡

52 Minimum Edit Distance
cell (1,3) = k, cell (2,4) = k+2
le ➡ adap: transform l into ada and swap e for p, assuming the cost of swapping e for p is 2

53 Minimum Edit Distance

             3 (a)    4 (p)
    1 l      k1,3     k1,4
    2 e      k2,3     ?

➡ ➡ ➡
cell (2,4): minimum of the three costs to get here in one step
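Written as a recurrence, the "minimum of the three costs" rule looks as follows. This is a sketch: $d(i,j)$ is notation introduced here for the value in cell $(i,j)$, insert and delete are taken to cost 1, and substitution is taken to cost 2 as assumed for swapping e for p.

```latex
d(i,j) = \min
\begin{cases}
d(i-1,\,j) + 1 & \text{delete } s_i \\
d(i,\,j-1) + 1 & \text{insert } t_j \\
d(i-1,\,j-1) + \begin{cases} 0 & \text{if } s_i = t_j \\ 2 & \text{if } s_i \neq t_j \end{cases} & \text{keep or substitute}
\end{cases}
\qquad\text{e.g.}\quad
d(2,4) = \min\bigl(k_{1,4} + 1,\; k_{2,3} + 1,\; k_{1,3} + 2\bigr)
```

The boundary values are $d(i,0) = i$ (delete everything) and $d(0,j) = j$ (insert everything).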

54 Minimum Edit Distance
cell (3,0): cost of transforming lea into (empty)

55 Minimum Edit Distance
cell (0,0) = 0

56 cell (1,0) = 1

57 cell (2,0) = 2
➡ cost of le → (empty) = cost of l → (empty), plus the cost of deleting the e

58 Minimum Edit Distance
column 0 filled in: cells (0,0) to (6,0) = 0, 1, 2, 3, 4, 5, 6

59 cell (0,0) = 0

60 cell (0,1) = 1

61 row 0 filled in: cells (0,0) to (0,7) = 0, 1, 2, 3, 4, 5, 6, 7

62 row 0 and column 0 filled in

         0  1  2  3  4  5  6  7
            a  d  a  p  t  e  r
    0    0  1  2  3  4  5  6  7
    1 l  1
    2 e  2
    3 a  3
    4 d  4
    5 e  5
    6 r  6

63 (table with row 0 and column 0 filled in, as on slide 62)

64 ➡ ➡ ➡
three ways into cell (1,1): from cell (0,1), from cell (1,0) and from cell (0,0)

65 cell (1,1) = 2
minimum of 1+1 (from above), 1+1 (from the left) and 0+2 (diagonal, swapping l for a)

66 cells already filled in: (4,5) = 5, (4,6) = 6, (5,5) = 6

67 ➡ next cell: (5,6)

68 cell (5,6) = 5
e = e, so the diagonal value 5 carries over at no cost

69 cells already filled in: (2,4) = 6, (2,5) = 7, (3,4) = 5

70 ➡ next cell: (3,5)

71 cell (3,5) = 6
minimum of 7+1 (from above), 5+1 (from the left) and 6+2 (diagonal, swapping a for t)

72 cells already filled in: (5,5) = 6, (5,6) = 5, (6,5) = 7

73 ➡ next cell: (6,6)

74 cell (6,6) = 6
minimum of 5+1 (from above), 7+1 (from the left) and 6+2 (diagonal, swapping r for e)
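The whole table can be filled the same way, cell by cell. A minimal Python sketch follows, assuming unit insert/delete costs and a substitution cost of 2; the function names and the text layout produced by `format_table` are illustrative and are not taken from eds.xls.

```python
def edit_table(s, t, sub_cost=2):
    """Return the full table d, where d[i][j] is the cost of s[:i] -> t[:j]."""
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = d[i - 1][0] + 1            # column 0: delete one more letter
    for j in range(1, cols):
        d[0][j] = d[0][j - 1] + 1            # row 0: insert one more letter
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = 0 if s[i - 1] == t[j - 1] else sub_cost
            d[i][j] = min(
                d[i - 1][j] + 1,             # from above: delete s[i-1]
                d[i][j - 1] + 1,             # from the left: insert t[j-1]
                d[i - 1][j - 1] + diagonal,  # diagonal: keep or substitute
            )
    return d


def format_table(s, t, d):
    """Render the table as text, with s down the side and t across the top."""
    lines = ["      " + "  ".join(t)]
    for i, row in enumerate(d):
        label = " " if i == 0 else s[i - 1]
        lines.append(label + " " + " ".join(f"{value:2d}" for value in row))
    return "\n".join(lines)


table = edit_table("leader", "adapter")
print(format_table("leader", "adapter", table))
print("minimum edit distance:", table[6][7])
```

The bottom-right cell (6,7) comes out as 5, so the 2 deletes and 3 inserts found by hand are indeed the minimum under this cost model. With `sub_cost=1` (plain Levenshtein) the same cell is 4.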
