LING/C SC/PSYC 438/538 Lecture 24 Sandiway Fong 1.

Slides:



Advertisements
Similar presentations
Word processing A word processor can be used to write, edit, format and print text. Before word processors, printed documents were typed directly on to.
Advertisements

Spelling Correction for Search Engine Queries Bruno Martins, Mario J. Silva In Proceedings of EsTAL-04, España for Natural Language Processing Presenter:
LING 438/538 Computational Linguistics Sandiway Fong Lecture 17: 10/25.
LING/C SC/PSYC 438/538 Lecture 12 Sandiway Fong. Administrivia We'll postpone Homework 4 review until next week …
LING/C SC 581: Advanced Computational Linguistics Lecture Notes Jan 15 th.
Simplifying CFGs There are several ways in which context-free grammars can be simplified. One natural way is to eliminate useless symbols those that cannot.
Programming in Visual Basic
Inexact Matching of Strings General Problem –Input Strings S and T –Questions How distant is S from T? How similar is S to T? Solution Technique –Dynamic.
A BAYESIAN APPROACH TO SPELLING CORRECTION. ‘Noisy channels’ In a number of tasks involving natural language, the problem can be viewed as recovering.
LING 438/538 Computational Linguistics Sandiway Fong Lecture 16: 10/19.
6/9/2015CPSC503 Winter CPSC 503 Computational Linguistics Lecture 4 Giuseppe Carenini.
6/9/2015CPSC503 Winter CPSC 503 Computational Linguistics Lecture 4 Giuseppe Carenini.
Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell.
Distance Functions for Sequence Data and Time Series
Gobalisation Week 8 Text processes part 2 Spelling dictionaries Noisy channel model Candidate strings Prior probability and likelihood Lab session: practising.
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 11: Core String Edits.
Computational Language Andrew Hippisley. Computational Language Computational language and AI Language engineering: applied computational language Case.
Spelling Checkers Daniel Jurafsky and James H. Martin, Prentice Hall, 2000.
Alignment III PAM Matrices. 2 PAM250 scoring matrix.
Metodi statistici nella linguistica computazionale The Bayesian approach to spelling correction.
CS276 – Information Retrieval and Web Search Checking in. By the end of this week you need to have: Watched the online videos corresponding to the first.
Introduction to Information Retrieval Introduction to Information Retrieval CS276: Information Retrieval and Web Search Christopher Manning, Pandu Nayak.
Mathcad Variable Names A string of characters (including numbers and some “special” characters (e.g. #, %, _, and a few more) Cannot start with a number.
BİL711 Natural Language Processing1 Statistical Language Processing In the solution of some problems in the natural language processing, statistical techniques.
Piyush Kumar (Lecture 2: PageRank) Welcome to COT5405.
Lecture 21: Languages and Grammars. Natural Language vs. Formal Language.
November 2005CSA3180: Statistics III1 CSA3202: Natural Language Processing Statistics 3 – Spelling Models Typing Errors Error Models Spellchecking Noisy.
Lesson No:9 MS-Word Tools, Mail Merge and working with Tables CHBT-01 Basic Micro process & Computer Operation.
LING 438/538 Computational Linguistics
Data Structures and Algorithm Analysis Hashing Lecturer: Jing Liu Homepage:
Chapter 5. Probabilistic Models of Pronunciation and Spelling 2007 년 05 월 04 일 부산대학교 인공지능연구실 김민호 Text : Speech and Language Processing Page. 141 ~ 189.
LING/C SC/PSYC 438/538 Lecture 19 Sandiway Fong. Administrivia Next Monday – guest lecture from Dr. Jerry Ball of the Air Force Research Labs to be continued.
1 Working with MS SQL Server Textbook Chapter 14.
1 CSA4050: Advanced Topics in NLP Spelling Models.
Resources: Problems in Evaluating Grammatical Error Detection Systems, Chodorow et al. Helping Our Own: The HOO 2011 Pilot Shared Task, Dale and Kilgarriff.
CS276: Information Retrieval and Web Search
David Luebke 1 10/25/2015 CS 332: Algorithms Skip Lists Hash Tables.
Interoperable Visualization Framework towards enhancing mapping and integration of official statistics Haitham Zeidan Palestinian Central.
Minimum Edit Distance Definition of Minimum Edit Distance.
LING 388: Language and Computers Sandiway Fong Lecture 27: 12/6.
12/7/2015CPSC503 Winter CPSC 503 Computational Linguistics Lecture 4 Giuseppe Carenini.
LING/C SC/PSYC 438/538 Lecture 8 Sandiway Fong. Adminstrivia Homework 4 not yet graded …
LING/C SC/PSYC 438/538 Lecture 14 Sandiway Fong. Administrivia Homework 6 graded.
Cross Language Clone Analysis Team 2 February 3, 2011.
Hashing 1 Hashing. Hashing 2 Hashing … * Again, a (dynamic) set of elements in which we do ‘search’, ‘insert’, and ‘delete’ n Linear ones: lists, stacks,
Error Detection and Correction – Hamming Code
CS307P-SYSTEM PRACTICUM CPYNOT. B13107 – Amit Kumar B13141 – Vinod Kumar B13218 – Paawan Mukker.
Overview Excel is a spreadsheet, a grid made from columns and rows. It is a software program that can make number manipulation easy and somewhat painless.
A * Search A* (pronounced "A star") is a best first, graph search algorithm that finds the least-cost path from a given initial node to one goal node out.
Weighted Minimum Edit Distance
1 Minimum Error Rate Training in Statistical Machine Translation Franz Josef Och Information Sciences Institute University of Southern California ACL 2003.
Search Engines WS 2009 / 2010 Prof. Dr. Hannah Bast Chair of Algorithms and Data Structures Department of Computer Science University of Freiburg Lecture.
Autumn Web Information retrieval (Web IR) Handout #3:Dictionaries and tolerant retrieval Mohammad Sadegh Taherzadeh ECE Department, Yazd University.
Dynamic Programming & Memoization. When to use? Problem has a recursive formulation Solutions are “ordered” –Earlier vs. later recursions.
Proofing Documents Lesson 9 #1.09.
January 2012Spelling Models1 Human Language Technology Spelling Models.
 The term “spreadsheet” covers a wide variety of elements useful for quantitative analysis of all kinds. Essentially, a spreadsheet is a simple tool.
Minimum Edit Distance Definition of Minimum Edit Distance.
TU/e Algorithms (2IL15) – Lecture 4 1 DYNAMIC PROGRAMMING II
Spell checking. Spelling Correction and Edit Distance Non-word error detection: – detecting “graffe” “ سوژن ”, “ مصواک ”, “ مداا ” Non-word error correction:
Dynamic Programming for the Edit Distance Problem.
Spelling correction. Spell correction Two principal uses Correcting document(s) being indexed Retrieve matching documents when query contains a spelling.
Definition of Minimum Edit Distance
Definition of Minimum Edit Distance
Dynamic Programming Computation of Edit Distance
CSA3180: Natural Language Processing
CPSC 503 Computational Linguistics
Chapter 2 Tables and Forms: Design, Properties, Views, and Wizards
Bioinformatics Algorithms and Data Structures
CS621/CS449 Artificial Intelligence Lecture Notes
Presentation transcript:

LING/C SC/PSYC 438/538 Lecture 24 Sandiway Fong 1

Adminstrivia Homeworks 7 and 8 graded

Last Time 1.Homework 8 review 2.CKY algorithm for parsing context-free grammars – Chomsky Normal Form (CNF) – 2D table w1w1 w2w2 w3w3 [0,1 ] x1 [0,2] y [0,3] s s [1,2] x2 [1,3] z [2,3] x3 s --> y, x3. s --> x1, z. s --> y, z. y --> x1, x2. z --> x2, x3. x1 --> [w1]. x2 --> [w2]. x3 --> [w3].

Adminstrivia TCEs are fully online (paper TCEs are history)

Spelling Errors Sections 3.10 and 5.9

Spelling Errors Textbook cites (Kukich, 1992): – Non-word detection (easiest) graffe (giraffe) – Isolated-word (context-free) error correction graffe (giraffe,…) graffed (gaffed,…) by definition cannot correct when error word is a valid word – Context-dependent error detection and correction (hardest) your an idiot  you’re an idiot Their is  there is (Microsoft Word corrects this by default)

Spelling Errors OCR – visual similarity h  b, e  c, jump  jurnps Typing – keyboard distance small  smsll, spell  spel; Graffiti (many HCI studies) – stroke similarity Common error characters are: V, T, 4, L, E, Q, K, N, Y, 9, P, G, X Two stroke characters: B, D, P(error: two characters) Cognitive Errors – bad spellers separate  seperate

correct – textbook section 5.9 Kernighan et al. ( correct ) – take typo t (not a word) mutate t minimally by deleting, inserting, substituting or transposing (swapping) a letter look up “mutated t“ in a dictionary candidates are “mutated t“ that are real words – example (5.2) t = acress C = {actress,cress, caress, access, across, acres, acres}

correct formula – = max c  C P(t|c) P(c)(Bayesian Inference) – C = {actress, cress, caress, access, across, acres, acres} Prior: P(c) – estimated using frequency information over a large corpus (N words) – P(c) = freq(c)/N – P(c) = freq(c)+0.5/(N+0.5V) avoid zero counts (non-occurrences) (add fractional part 0.5) add one (0.5) smoothing V is vocabulary size of corpus

correct Likelihood: P(t|c) – using some corpus of errors – compute following 4 confusion matrices – del[x,y] = freq(correct xy mistyped as x) – ins[x,y] = freq(correct x mistyped as xy) – sub[x,y] = freq(correct x mistyped as y) – trans[x,y] = freq(correct xy mistyped as yx) – P(t|c) = del[x,y]/f(xy) if c related to t by deletion of y – P(t|c) = ins[x,y]/f(x) if c related to t by insertion of yetc… probability of typo t given candidate word c 26 x 26 matrix a–z Very hard to collect this data Very hard to collect this data

correct example – t = acress – = acres (44%) despite all the math wrong result for –was called a stellar and versatile acress what does Microsoft Word use? –was called a stellar and versatile acress

Microsoft Word corrected here

Google

Part 2 Another algorithm using dynamic programming: – Minimum Edit Distance Textbook: section 3.11 File: eds.xls

27 Minimum Edit Distance general string comparison edit operations are insertion, deletion and substitution not just limited to distance defined by a single operation away we can ask how different is string a from b by the minimum edit distance

28 Minimum Edit Distance applications – could be used for multi-typo correction – used in Machine Translation Evaluation (MTEval) – example Source: 生産工程改善について Translations: (Standard) For improvement of the production process (MT-A) About a production process betterment (MT-B) About the production process improvement method – compute edit distance between MT-A and Standard and MT-B and Standard in terms of word insertion/substitution etc.

29 Minimum Edit Distance cost models – Levenshtein insertion, deletion and substitution all have unit cost – Levenshtein (alternate) insertion, deletion have unit cost substitution is twice as expensive substitution = one insert followed by one delete – Typewriter insertion, deletion and substitution all have unit cost modified by key proximity

Minimum Edit Distance Dynamic Programming – divide-and-conquer to solve a problem we divide it into sub-problems – sub-problems may be repeated don’t want to re-solve a sub-problem the 2nd time around – idea: put solutions to sub-problems in a table and just look up the solution 2nd time around, thereby saving time memoization we’ll use a spreadsheet…

Minimum Edit Distance Consider a simple case: xy ⇄ yx Minimum # of operations: insert and delete cost = 2 Minimum # of operations: swap cost = ?

Minimum Edit Distance Generally

Minimum Edit Distance Programming Practice: could be easily implemented in Perl

Minimum Edit Distance Generally

Minimum Edit Distance Computation Or in Microsoft Excel, file: eds.xls (on course webpage) $ in a cell reference means don’t change when copied from cell to cell e.g. in C$1 1 stays the same in $A3 A stays the same

Minimum Edit Distance Task: transform string s 1..s i into string t 1..t j – each s n and t n are letters – string s is of length i, t is of length j Example: – s = leader, t = adapter – i = 6, j = 7 – Let’s say you’re allowed just three operations: (1) delete a letter, (2) insert a letter, or (3) substitute a letter for another letter – What is one possible way to generate t from s?

Minimum Edit Distance Example: – s = leader, t = adapter – What is one possible way to generate t from s? – leader – ↕↕ –adapter – cost is 2 deletes and 3 inserts, total 5 operations – Question: is this the minimum possible? leader◄ leade◄ lead◄ lea◄ le◄ l◄l◄◄ a◄ ad◄ ada◄adap◄ adapt◄ adapte◄ adapter◄ Simplest method cost: 13 operations

Minimum Edit Distance adapter 0 1 l 2 e 3 a 4 d 5 e 6 r

adapter 0 1 l 2 e 3 a 4 d 5 e 6 r cell (2,3) cost of transforming le into ada cell (2,3) cost of transforming le into ada

Minimum Edit Distance adapter 0 1 l 2 e 3 a 4 d 5 e 6 r cell (2,3) cost of transforming le into ada cell (2,3) cost of transforming le into ada cell (6,7) cost of transforming leader into adapter cell (6,7) cost of transforming leader into adapter

Minimum Edit Distance adapter 0 1 l 2 e 3 a 4 d 5 e 6 r cell (3,0) cost of transforming lea into (empty) cell (3,0) cost of transforming lea into (empty)

Minimum Edit Distance adapter 0 1 l 2 e 3 a 4 d 5 e 6 r cell (0,4) cost of transforming (empty) into adap cell (0,4) cost of transforming (empty) into adap

Minimum Edit Distance adapter 0 1 l 2 e 3 a 4 d 5 e k 6 r cell (5,6) cost of transforming leade into adapte cell (5,6) cost of transforming leade into adapte

Minimum Edit Distance adapter 0 1 l 2 e 3 a 4 d 5 e k 6 r cell (5,6) cost of transforming leade into adapte cell (5,6) cost of transforming leade into adapte ➡

Minimum Edit Distance adapter 0 1 l 2 e 3 a 4 d 5 e k 6 r k cell (5,6) cost of transforming leade into adapte cell (5,6) cost of transforming leade into adapte

Minimum Edit Distance adapter 0 1 l 2 e k 3 a 4 d 5 e 6 r cell (2,3) cost of transforming le into ada cell (2,3) cost of transforming le into ada

Minimum Edit Distance adapter 0 1 l 2 e k 3 a 4 d 5 e 6 r cell (2,3) cost of transforming le into ada cell (2,3) cost of transforming le into ada cell (2,4) cost of transforming le into adap cell (2,4) cost of transforming le into adap ➡

Minimum Edit Distance adapter 0 1 l 2 e kk+1 3 a 4 d 5 e 6 r cell (2,3) cost of transforming le into ada cell (2,3) cost of transforming le into ada cell (2,4) cost of transforming le into adap cell (2,4) cost of transforming le into adap ➡ le adap

Minimum Edit Distance adapter 0 1 l k 2 e 3 a 4 d 5 e 6 r cell (1,4) cost of transforming l into adap cell (1,4) cost of transforming l into adap ➡

Minimum Edit Distance adapter 0 1 l k 2 e k+1 3 a 4 d 5 e 6 r cell (1,4) cost of transforming l into adap cell (1,4) cost of transforming l into adap ➡ le adap

Minimum Edit Distance adapter 0 1 l k 2 e 3 a 4 d 5 e 6 r cell (1,3) cost of transforming l into ada cell (1,3) cost of transforming l into ada ➡

Minimum Edit Distance adapter 0 1 l k 2 e k+2 3 a 4 d 5 e 6 r cell (1,3) cost of transforming l into ada cell (1,3) cost of transforming l into ada ➡ assuming the cost of swapping e for p is 2 le adap

Minimum Edit Distance adapter 0 1 l k 1,3 k 1,4 2 e k 2,3 ? 3 a 4 d 5 e 6 r ➡ ➡ ➡ cell (2,4) minimum of the three costs to get here in one step cell (2,4) minimum of the three costs to get here in one step

Minimum Edit Distance adapter 0 1 l 2 e 3 a 4 d 5 e 6 r cell (3,0) cost of transforming lea into (empty) cell (3,0) cost of transforming lea into (empty)

Minimum Edit Distance adapter 00 1 l 2 e 3 a 4 d 5 e 6 r

adapter 00 1 l 1 2 e 3 a 4 d 5 e 6 r

adapter 00 1 l 1 2 e 2 3 a 4 d 5 e 6 r ➡ cost of le  = cost of l , plus the cost of deleting the e

Minimum Edit Distance adapter 00 1 l 1 2 e 2 3 a 3 4 d 4 5 e 5 6 r 6

adapter 00 1 l 2 e 3 a 4 d 5 e 6 r

adapter l 2 e 3 a 4 d 5 e 6 r

adapter l 2 e 3 a 4 d 5 e 6 r

adapter l 1 2 e 2 3 a 3 4 d 4 5 e 5 6 r 6

adapter l 1 2 e 2 3 a 3 4 d 4 5 e 5 6 r 6

adapter l 1 2 e 2 3 a 3 4 d 4 5 e 5 6 r 6 ➡ ➡ ➡

adapter l 12 2 e 2 3 a 3 4 d 4 5 e 5 6 r 6 ➡ ➡ ➡

adapter l 1 2 e 2 3 a 3 4 d e 56 6 r 6

adapter l 1 2 e 2 3 a 3 4 d e 56 6 r 6 ➡

adapter l 1 2 e 2 3 a 3 4 d e r 6 ➡

adapter l 1 2 e a 35 4 d 4 5 e 5 6 r 6

adapter l 1 2 e a 35 4 d 4 5 e 5 6 r 6 ➡

adapter l 1 2 e a d 4 5 e 5 6 r 6 ➡

adapter l 1 2 e 2 3 a 3 4 d 4 5 e r 67

adapter l 1 2 e 2 3 a 3 4 d 4 5 e r 67 ➡

adapter l 1 2 e 2 3 a 3 4 d 4 5 e r 67 6 ➡