Published by Jaiden Dewhurst. Modified about 1 year ago.

1
LING/C SC 581: Advanced Computational Linguistics
Lecture Notes
Jan 15th

2
Course Webpage
For lecture slides and Panopto recordings:
– http://dingo.sbs.arizona.edu/~sandiway/ling581-15/
Meeting information

3
Course Objectives
Follow-on course to LING/C SC/PSYC 438/538 Computational Linguistics:
– continue with selected material from the 538 textbook (J&M): 25 chapters, a lot of material not covered in 438/538
And gain more extensive experience:
– with new material not in the textbook
– dealing with natural language software packages
– installation, input data formatting
– operation
– project exercises
– useful “real-world” computational experience
– abilities gained will be of value to employers

4
Computational Facilities
Use your own laptop/desktop
– can also make use of the computers in this lab (Shantz 338), but you don’t have installation rights on these computers
– plus the alarm goes off after hours and campus police will arrive…
Platforms
Windows may be possible, but you really should run some variant of Unix… (your task #1 for this week)
– Linux (separate bootable partition or via virtualization software): de facto standard for advanced/research software. VirtualBox: https://www.virtualbox.org/ (free!)
– Cygwin on Windows: http://www.cygwin.com/ A Linux-like environment for Windows, making it possible to port software running on POSIX systems (such as Linux, BSD, and Unix systems) to Windows.
– OSX: not quite Linux; some porting issues, especially with C programs. Can use VirtualBox (Linux under OSX).

5
Grading
Completion of all homework tasks will result in a satisfactory grade (A)
Tasks should be completed before the next class.
– email me your work (sandiway@email.arizona.edu)
– also be prepared to come up and present your work (if called upon)

6
Today's Topics
– Homework Task 1: Install Tregex
– Minimum Edit Distance

7
Homework Task 1: Install Tregex
Computer language: Java
http://nlp.stanford.edu/software/tregex.shtml
538: Perl regex on strings
581: regex for trees …

8
Homework Task 1: Install Tregex
We’ll use the program tregex from Stanford University to explore the Penn Treebank
– current version:

9
Penn Treebank
Availability
– Source: Linguistic Data Consortium (LDC)
– U. of Arizona is a (fee-paying) member of this consortium
– Resources are made available to the community through the main library
– URL: http://sabio.library.arizona.edu/search/X

10
Penn Treebank (V3)
Call Record
Have it on a USB drive here that I will pass around: TREEBANK_3.zip (65.2MB)

11
Penn Treebank (V3) Raw data:

12
tregex
Tregex is a Tgrep2-style utility for matching patterns in trees.
Written in Java
run-tregex-gui.command shell script
-mx flag: the 300m default memory size may need to be increased depending on the platform

13
tregex
Select the PTB directory
– TREEBANK_3/parsed/mrg/wsj/
Browse
Deselect any unwanted files

14
Part 2
Minimum Edit Distance
Textbook: section 3.11

15
Minimum Edit Distance
– general string comparison
– edit operations are insertion, deletion and substitution
– not limited to strings that are a single operation apart
– we can ask how different string a is from string b by the minimum edit distance

16
Minimum Edit Distance
Applications
– could be used for multi-typo correction
– used in Machine Translation Evaluation (MTEval)
– example
Source: 生産工程改善について
Translations:
(Standard) For improvement of the production process
(MT-A) About a production process betterment
(MT-B) About the production process improvement method
– compute the edit distance between MT-A and Standard, and between MT-B and Standard, in terms of word insertion/substitution etc.
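The word-level comparison in the MT example can be sketched in Python. This is only an illustration, not the MTEval procedure itself: it assumes whitespace tokenization, case-sensitive word matching, and unit cost for word insertion, deletion and substitution. The function name `word_edit_distance` is hypothetical.

```python
def word_edit_distance(hypothesis: str, reference: str) -> int:
    """Unit-cost edit distance counted over whitespace-separated words."""
    hyp = hypothesis.split()
    ref = reference.split()
    # previous[j] = cost of turning the hypothesis words read so far into ref[:j]
    previous = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        current = [i]  # deleting i hypothesis words yields the empty reference
        for j, ref_word in enumerate(ref, start=1):
            substitute = previous[j - 1] + (0 if hyp_word == ref_word else 1)
            delete = previous[j] + 1
            insert = current[j - 1] + 1
            current.append(min(substitute, delete, insert))
        previous = current
    return previous[-1]


standard = "For improvement of the production process"
mt_a = "About a production process betterment"
mt_b = "About the production process improvement method"

for name, output in (("MT-A", mt_a), ("MT-B", mt_b)):
    print(name, word_edit_distance(output, standard))
```

A lower number means the MT output is closer to the standard translation under this particular cost assumption.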

17
Minimum Edit Distance
Cost models
– Levenshtein: insertion, deletion and substitution all have unit cost
– Levenshtein (alternate): insertion and deletion have unit cost; substitution is twice as expensive (substitution = one insert followed by one delete)
– Typewriter: insertion, deletion and substitution all have unit cost, modified by key proximity
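The first two cost models differ only in the substitution cost, so one function with cost parameters can cover both. The sketch below is a minimal Python illustration; the function name `edit_distance` and its parameter names are assumptions. The typewriter model is left out because it needs key-proximity data that the slides do not give.

```python
def edit_distance(s: str, t: str, ins_cost: int = 1, del_cost: int = 1,
                  sub_cost: int = 1) -> int:
    """Minimum cost of transforming s into t under the given operation costs."""
    rows, cols = len(s) + 1, len(t) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # s[:i] -> empty string: i deletions
        table[i][0] = table[i - 1][0] + del_cost
    for j in range(1, cols):          # empty string -> t[:j]: j insertions
        table[0][j] = table[0][j - 1] + ins_cost
    for i in range(1, rows):
        for j in range(1, cols):
            same = s[i - 1] == t[j - 1]
            table[i][j] = min(
                table[i - 1][j] + del_cost,
                table[i][j - 1] + ins_cost,
                table[i - 1][j - 1] + (0 if same else sub_cost),
            )
    return table[-1][-1]


print(edit_distance("leader", "adapter"))              # Levenshtein, unit costs
print(edit_distance("leader", "adapter", sub_cost=2))  # Levenshtein (alternate)
```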

18
Minimum Edit Distance
Dynamic Programming
– divide-and-conquer: to solve a problem, we divide it into sub-problems
– sub-problems may be repeated: we don’t want to re-solve a sub-problem the second time around
– idea: put solutions to sub-problems in a table and just look up the solution the second time around, thereby saving time (memoization)
– we’ll use a spreadsheet…
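The memoization idea can be sketched in Python with a cached recursive function standing in for the spreadsheet table. This is an illustration under assumptions: unit insert/delete cost and a substitution cost of 2 (the alternate Levenshtein model); the names `edit_distance_memo` and `cost` are hypothetical.

```python
from functools import lru_cache


def edit_distance_memo(s: str, t: str, sub_cost: int = 2) -> int:
    """Top-down edit distance; each sub-problem (i, j) is solved only once."""

    @lru_cache(maxsize=None)
    def cost(i: int, j: int) -> int:
        # cost(i, j) = cost of transforming s[:i] into t[:j]
        if i == 0:
            return j  # insert the first j letters of t
        if j == 0:
            return i  # delete the first i letters of s
        diagonal = cost(i - 1, j - 1) + (0 if s[i - 1] == t[j - 1] else sub_cost)
        return min(cost(i - 1, j) + 1, cost(i, j - 1) + 1, diagonal)

    result = cost(len(s), len(t))
    # cache misses = number of distinct sub-problems that had to be solved
    print(f"{s} -> {t}: cost {result}, "
          f"sub-problems solved: {cost.cache_info().misses}")
    return result


edit_distance_memo("leader", "adapter")
```

Without the cache, the same (i, j) pairs would be recomputed many times; with it, repeated requests become table lookups.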

19
Minimum Edit Distance
Consider a simple case: xy ⇄ yx
Minimum # of operations: insert and delete, cost = 2
Minimum # of operations: swap, cost = ?

20
Minimum Edit Distance Generally

21
Minimum Edit Distance Programming Practice: could be easily implemented in Perl

22
Minimum Edit Distance Generally

23
Minimum Edit Distance Computation
Or in Microsoft Excel, file: eds.xls (on course webpage)
$ in a cell reference means don’t change when copied from cell to cell
e.g. in C$1 the 1 stays the same; in $A3 the A stays the same

24
Minimum Edit Distance
Task: transform string s_1..s_i into string t_1..t_j
– each s_n and t_n is a letter
– string s is of length i, t is of length j
Example:
– s = leader, t = adapter
– i = 6, j = 7
– Let’s say you’re allowed just three operations: (1) delete a letter, (2) insert a letter, or (3) substitute a letter for another letter
– What is one possible way to generate t from s?

25
Minimum Edit Distance
Example:
– s = leader, t = adapter
– What is one possible way to generate t from s?
– leader ↕↕ adapter (align the shared letters)
– cost is 2 deletes and 3 inserts, total 5 operations
– Question: is this the minimum possible?
Simplest method: delete every letter of s, then insert every letter of t
leader◄ leade◄ lead◄ lea◄ le◄ l◄ ◄ a◄ ad◄ ada◄ adap◄ adapt◄ adapte◄ adapter◄
cost: 13 operations
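The 5-operation count can be checked by applying an explicit edit script. The Python sketch below shows one script with 2 deletes and 3 inserts; the slide does not spell out which positions it edits, so the script and the helper name `apply_edits` are assumptions made for illustration.

```python
def apply_edits(source: str, edits: list) -> str:
    """Apply ("delete", index) and ("insert", index, letter) steps in order."""
    letters = list(source)
    for edit in edits:
        if edit[0] == "delete":
            del letters[edit[1]]
        elif edit[0] == "insert":
            letters.insert(edit[1], edit[2])
        else:
            raise ValueError(f"unknown operation: {edit[0]}")
    return "".join(letters)


# One possible script: 2 deletes followed by 3 inserts.
script = [
    ("delete", 0),       # leader -> eader
    ("delete", 0),       # eader  -> ader
    ("insert", 2, "a"),  # ader   -> adaer
    ("insert", 3, "p"),  # adaer  -> adaper
    ("insert", 4, "t"),  # adaper -> adapter
]
result = apply_edits("leader", script)
print(result, len(script))
```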

26
Minimum Edit Distance
Grid: columns 0-7 = (empty) a d a p t e r; rows 0-6 = (empty) l e a d e r

27
Grid: columns 0-7 = (empty) a d a p t e r; rows 0-6 = (empty) l e a d e r
cell (2,3): cost of transforming le into ada

28
Minimum Edit Distance
Grid: columns 0-7 = (empty) a d a p t e r; rows 0-6 = (empty) l e a d e r
cell (2,3): cost of transforming le into ada
cell (6,7): cost of transforming leader into adapter

29
Minimum Edit Distance
Grid: columns 0-7 = (empty) a d a p t e r; rows 0-6 = (empty) l e a d e r
cell (3,0): cost of transforming lea into (empty)

30
Minimum Edit Distance
Grid: columns 0-7 = (empty) a d a p t e r; rows 0-6 = (empty) l e a d e r
cell (0,4): cost of transforming (empty) into adap

31
Minimum Edit Distance
Grid: columns 0-7 = (empty) a d a p t e r; rows 0-6 = (empty) l e a d e r
cell (5,6) = k: cost of transforming leade into adapte

32
Minimum Edit Distance
Grid: columns 0-7 = (empty) a d a p t e r; rows 0-6 = (empty) l e a d e r
cell (5,6) = k: cost of transforming leade into adapte
➡ step diagonally to cell (6,7)

33
Minimum Edit Distance
Grid: columns 0-7 = (empty) a d a p t e r; rows 0-6 = (empty) l e a d e r
cell (5,6) = k: cost of transforming leade into adapte
cell (6,7) = k: the final r matches, so the cost is unchanged

34
Minimum Edit Distance
Grid: columns 0-7 = (empty) a d a p t e r; rows 0-6 = (empty) l e a d e r
cell (2,3) = k: cost of transforming le into ada

35
Minimum Edit Distance
Grid: columns 0-7 = (empty) a d a p t e r; rows 0-6 = (empty) l e a d e r
cell (2,3) = k: cost of transforming le into ada
cell (2,4): cost of transforming le into adap
➡ step right from cell (2,3) to cell (2,4)

36
Minimum Edit Distance
Grid: columns 0-7 = (empty) a d a p t e r; rows 0-6 = (empty) l e a d e r
cell (2,3) = k: cost of transforming le into ada
cell (2,4) = k+1: cost of transforming le into adap
➡ le → adap

37
Minimum Edit Distance
Grid: columns 0-7 = (empty) a d a p t e r; rows 0-6 = (empty) l e a d e r
cell (1,4) = k: cost of transforming l into adap
➡ step down from cell (1,4) to cell (2,4)

38
Minimum Edit Distance
Grid: columns 0-7 = (empty) a d a p t e r; rows 0-6 = (empty) l e a d e r
cell (1,4) = k: cost of transforming l into adap
cell (2,4) = k+1
➡ le → adap

39
Minimum Edit Distance
Grid: columns 0-7 = (empty) a d a p t e r; rows 0-6 = (empty) l e a d e r
cell (1,3) = k: cost of transforming l into ada
➡ step diagonally from cell (1,3) to cell (2,4)

40
Minimum Edit Distance
Grid: columns 0-7 = (empty) a d a p t e r; rows 0-6 = (empty) l e a d e r
cell (1,3) = k: cost of transforming l into ada
cell (2,4) = k+2, assuming the cost of swapping e for p is 2
➡ le → adap

41
Minimum Edit Distance
Grid: columns 0-7 = (empty) a d a p t e r; rows 0-6 = (empty) l e a d e r
cell (1,3) = k_1,3; cell (1,4) = k_1,4; cell (2,3) = k_2,3; cell (2,4) = ?
➡ ➡ ➡ three ways into cell (2,4)
cell (2,4): minimum of the three costs to get here in one step
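The "minimum of the three costs" rule can be written as a recurrence. The notation D(i,j) for the value in cell (i,j) is an assumption (the slides use k), and the costs are the ones used in the preceding steps: 1 for an insert or delete, 2 for swapping one letter for another, 0 when the letters match.

```latex
D(i,j) = \min
\begin{cases}
D(i-1,\,j) + 1 & \text{(delete } s_i\text{)} \\
D(i,\,j-1) + 1 & \text{(insert } t_j\text{)} \\
D(i-1,\,j-1) + c(s_i, t_j) & \text{(swap or match)}
\end{cases}
\qquad
c(s_i, t_j) =
\begin{cases}
0 & \text{if } s_i = t_j \\
2 & \text{if } s_i \neq t_j
\end{cases}
```

For cell (2,4), where s_2 = e and t_4 = p differ, this gives D(2,4) = min(k_1,4 + 1, k_2,3 + 1, k_1,3 + 2).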

42
Minimum Edit Distance
Grid: columns 0-7 = (empty) a d a p t e r; rows 0-6 = (empty) l e a d e r
cell (3,0): cost of transforming lea into (empty)

43
Minimum Edit Distance
Grid: columns 0-7 = (empty) a d a p t e r; rows 0-6 = (empty) l e a d e r
Filled: cell (0,0) = 0

44
Grid: columns 0-7 = (empty) a d a p t e r; rows 0-6 = (empty) l e a d e r
Filled: cell (0,0) = 0; cell (1,0) = 1

45
Grid: columns 0-7 = (empty) a d a p t e r; rows 0-6 = (empty) l e a d e r
Filled: cell (0,0) = 0; cell (1,0) = 1; cell (2,0) = 2
➡ cost of le = cost of l, plus the cost of deleting the e

46
Minimum Edit Distance
Grid: columns 0-7 = (empty) a d a p t e r; rows 0-6 = (empty) l e a d e r
Filled: column 0 = 0 1 2 3 4 5 6

47
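Grid: columns 0-7 = (empty) a d a p t e r; rows 0-6 = (empty) l e a d e r
Filled: cell (0,0) = 0

48
Grid: columns 0-7 = (empty) a d a p t e r; rows 0-6 = (empty) l e a d e r
Filled: cell (0,0) = 0; cell (0,1) = 1

49
Grid: columns 0-7 = (empty) a d a p t e r; rows 0-6 = (empty) l e a d e r
Filled: row 0 = 0 1 2 3 4 5 6 7

50
Grid: columns 0-7 = (empty) a d a p t e r; rows 0-6 = (empty) l e a d e r
Filled: row 0 = 0 1 2 3 4 5 6 7; column 0 = 0 1 2 3 4 5 6

51
Grid: columns 0-7 = (empty) a d a p t e r; rows 0-6 = (empty) l e a d e r
Filled: row 0 = 0 1 2 3 4 5 6 7; column 0 = 0 1 2 3 4 5 6

52
Grid: columns 0-7 = (empty) a d a p t e r; rows 0-6 = (empty) l e a d e r
Filled: row 0 = 0 1 2 3 4 5 6 7; column 0 = 0 1 2 3 4 5 6
➡ ➡ ➡ three ways into cell (1,1)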

53
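Grid: columns 0-7 = (empty) a d a p t e r; rows 0-6 = (empty) l e a d e r
Filled: row 0 = 0 1 2 3 4 5 6 7; column 0 = 0 1 2 3 4 5 6; cell (1,1) = 2
➡ ➡ ➡ three ways into cell (1,1)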

54
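Grid: columns 0-7 = (empty) a d a p t e r; rows 0-6 = (empty) l e a d e r
Filled: row 0 = 0 1 2 3 4 5 6 7; column 0 = 0 1 2 3 4 5 6; cell (4,5) = 5; cell (4,6) = 6; cell (5,5) = 6

55
Grid: columns 0-7 = (empty) a d a p t e r; rows 0-6 = (empty) l e a d e r
Filled: row 0 = 0 1 2 3 4 5 6 7; column 0 = 0 1 2 3 4 5 6; cell (4,5) = 5; cell (4,6) = 6; cell (5,5) = 6
➡ next cell to fill: (5,6)

56
Grid: columns 0-7 = (empty) a d a p t e r; rows 0-6 = (empty) l e a d e r
Filled: row 0 = 0 1 2 3 4 5 6 7; column 0 = 0 1 2 3 4 5 6; cell (4,5) = 5; cell (4,6) = 6; cell (5,5) = 6; cell (5,6) = 5
➡ cell (5,6) filled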

57
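Grid: columns 0-7 = (empty) a d a p t e r; rows 0-6 = (empty) l e a d e r
Filled: row 0 = 0 1 2 3 4 5 6 7; column 0 = 0 1 2 3 4 5 6; cell (2,4) = 6; cell (2,5) = 7; cell (3,4) = 5

58
Grid: columns 0-7 = (empty) a d a p t e r; rows 0-6 = (empty) l e a d e r
Filled: row 0 = 0 1 2 3 4 5 6 7; column 0 = 0 1 2 3 4 5 6; cell (2,4) = 6; cell (2,5) = 7; cell (3,4) = 5
➡ next cell to fill: (3,5)

59
Grid: columns 0-7 = (empty) a d a p t e r; rows 0-6 = (empty) l e a d e r
Filled: row 0 = 0 1 2 3 4 5 6 7; column 0 = 0 1 2 3 4 5 6; cell (2,4) = 6; cell (2,5) = 7; cell (3,4) = 5; cell (3,5) = 6
➡ cell (3,5) filled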

60
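Grid: columns 0-7 = (empty) a d a p t e r; rows 0-6 = (empty) l e a d e r
Filled: row 0 = 0 1 2 3 4 5 6 7; column 0 = 0 1 2 3 4 5 6; cell (5,5) = 6; cell (5,6) = 5; cell (6,5) = 7

61
Grid: columns 0-7 = (empty) a d a p t e r; rows 0-6 = (empty) l e a d e r
Filled: row 0 = 0 1 2 3 4 5 6 7; column 0 = 0 1 2 3 4 5 6; cell (5,5) = 6; cell (5,6) = 5; cell (6,5) = 7
➡ next cell to fill: (6,6)

62
Grid: columns 0-7 = (empty) a d a p t e r; rows 0-6 = (empty) l e a d e r
Filled: row 0 = 0 1 2 3 4 5 6 7; column 0 = 0 1 2 3 4 5 6; cell (5,5) = 6; cell (5,6) = 5; cell (6,5) = 7; cell (6,6) = 6
➡ cell (6,6) filled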
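The whole grid for leader and adapter can be filled in programmatically. The Python sketch below is not the course's eds.xls spreadsheet; it assumes unit cost for insert and delete and cost 2 for swapping a letter, and the names `edit_table` and `show` are hypothetical. Under those assumptions the bottom-right cell (6,7) comes out as 5, the same as the 2 deletes plus 3 inserts found by hand.

```python
def edit_table(s: str, t: str, sub_cost: int = 2) -> list:
    """Return the full cost grid; table[i][j] = cost of s[:i] -> t[:j]."""
    table = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        table[i][0] = i  # column 0: delete i letters
    for j in range(len(t) + 1):
        table[0][j] = j  # row 0: insert j letters
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            diagonal = table[i - 1][j - 1] + (0 if s[i - 1] == t[j - 1] else sub_cost)
            # minimum of the three ways to reach cell (i, j) in one step
            table[i][j] = min(table[i - 1][j] + 1, table[i][j - 1] + 1, diagonal)
    return table


def show(table: list, s: str, t: str) -> None:
    """Print the grid with t's letters across the top and s's letters down the side."""
    print("    " + " ".join(f"{c:>2}" for c in " " + t))
    for label, row in zip(" " + s, table):
        print(f"{label:>2}  " + " ".join(f"{v:>2}" for v in row))


table = edit_table("leader", "adapter")
show(table, "leader", "adapter")
```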
