LING/C SC/PSYC 438/538 Lecture 10 Sandiway Fong.


Last time

Today's Topics
Homework 8: a Perl regex homework
Examples of advanced uses of Perl regexes:
  Recursive regexes
  Prime number testing

Homework 8: Part 1
wsj.txt is a tokenized text file containing the Wall Street Journal (WSJ) corpus.
It contains nearly 50,000 sentences (one per line).
(The syntactically annotated version is core data used to train and test many statistical parsers, e.g. the Stanford and Berkeley parsers.)
Note: tokenized here means punctuation is spaced; also 's and n't are separated by spaces.

Homework 8: Part 1
English past participle forms are used in passives and perfectives, e.g.
  the apple was/is eaten
  the apple(s) will be eaten
  the apples were eaten
  the apples were being eaten
  the apples had been eaten
  the women have/had eaten the apples
  Mary has/had eaten the apples
There are also negated versions of the passives and perfectives, e.g.
  the apple was n't eaten
  the apple was not eaten
  the apple was n't yet eaten
  the apple was not yet eaten
  Mary has n't eaten the apples
  Mary has not eaten the apples
  Mary has not yet eaten the apples

Homework 8: Part 1
Based on the data shown in the previous slide, and assuming the past participle ending -en, write a Perl regex program that searches the WSJ corpus, then computes and prints the frequency of regular passives, perfectives, and their negated counterparts.
Hint: for readability you may want to incorporate regex variables (see qr/../ from the previous lecture)
Hint: be careful of regex precedence: (a|b)\s+c is not the same as a|b\s+c
Note: your program will underreport the true frequencies for several reasons.
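The two hints can be illustrated with a small sketch. It is not a solution to the homework: the auxiliary list in $be, the three sample sentences, and the helper count_lines are assumptions made up for illustration. It only shows how qr// variables keep a pattern readable and how a missing group changes what the alternation means.

```perl
use strict;
use warnings;

# Hypothetical building blocks (deliberately incomplete auxiliary list).
my $be      = qr/(?:is|was|were|be|been|being)/;
my $pp_en   = qr/\w+en/;                  # a word ending in -en
my $grouped = qr/\b$be\s+$pp_en\b/;       # (aux) then whitespace then participle

# Without grouping, | has the lowest precedence:
# this means  "was"  OR  "were\s+\w+en".
my $ungrouped = qr/was|were\s+\w+en/;

# Count how many lines match a given regex.
sub count_lines {
    my ($re, @lines) = @_;
    return scalar grep { /$re/ } @lines;
}

# Made-up sample sentences, not taken from wsj.txt.
my @sample = (
    'the apple was eaten',
    'the apples were eaten',
    'the apple was red',
);

print "grouped:   ", count_lines($grouped,   @sample), "\n";   # 2
print "ungrouped: ", count_lines($ungrouped, @sample), "\n";   # 3
```

The ungrouped pattern also counts "the apple was red", because the bare alternative "was" matches on its own.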

Homework 8: Part 2
One reason for underreporting: not all past participles conveniently end in -en
  e.g. the cookies were burnt, the rope had been cut
  e.g. the demonstrators were arrested
Question: what are some other possible reasons for underreporting? Give some examples from the WSJ corpus.

Homework 8: Part 2
File: irregular_verbs.txt
grammar-rules/verbs/list-of-irregular-verbs/
Note: \t (tab) separates the three columns
borne simplified

Homework 8: Part 2
Write a Perl program to extract the irregular past participles from file irregular_verbs.txt and print them out one per line.
Make sure to split alternate forms into two lines: e.g. burnt/burned
Ignore: (been able) and … (ellipsis)
Extra Credit: give a Perl one-liner that does the job…
Hint: you may want to make use of join("\n", split( "/", … ))
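A minimal sketch of the join/split hint follows. It assumes that each line has the form base, past simple, past participle (the slide only says that tabs separate three columns, so the column order is an assumption). The sub name participles_from_line and the two sample lines are hypothetical and not taken from the real file.

```perl
use strict;
use warnings;

# Assumption: each line is "base\tpast simple\tpast participle".
sub participles_from_line {
    my ($line) = @_;
    chomp $line;
    my @cols = split /\t/, $line;
    return () unless @cols == 3;
    my $pp = $cols[2];
    $pp =~ s/^\s+|\s+$//g;                    # trim surrounding whitespace
    # Split alternates such as burnt/burned; keeping only purely
    # alphabetic forms drops "(been able)" and the ellipsis entries.
    return grep { /^[A-Za-z]+$/ } split m{\s*/\s*}, $pp;
}

# Made-up sample lines for demonstration.
for my $line ("burn\tburnt/burned\tburnt/burned\n", "cut\tcut\tcut\n") {
    print join("\n", participles_from_line($line)), "\n";
}
```

In a full program the same sub would be called on each line read from irregular_verbs.txt.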

Homework 8: Part 3
Incorporate the past participles found in Part 2 into your program from Part 1.
Hints:
  one method: do it in stages; e.g. save those irregular past participles into a file, then read them into your program for Part 3
  another method: combine your programs so you parse irregular_verbs.txt directly
  another method: copy and paste them into your program for Part 3 directly
Add to your program a printout of the frequency of the irregular verb counterparts, i.e.
  regular passives: #
  irregular passives: #
  regular perfectives: #
  irregular perfectives: #
  negated regular passives: #
  etc.
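Whichever method is used to obtain the list, one possible way to fold it into a regex is to join the words into a single alternation. The sketch below assumes a hypothetical helper participle_regex and a three-word sample list; the real list would come from Part 2.

```perl
use strict;
use warnings;

# Build one regex that matches any word in the list as a whole word.
sub participle_regex {
    my @words = @_;
    my $alt = join '|', map { quotemeta($_) } @words;
    return qr/\b(?:$alt)\b/;
}

# Hypothetical sample list standing in for the Part 2 output.
my $irregular_pp = participle_regex(qw(burnt burned cut));

# Made-up sentences, one passive with an irregular participle and one without.
for my $s ('the rope had been cut', 'the apples were eaten') {
    my $hit = $s =~ /\b(?:was|were|been)\s+$irregular_pp/ ? 'yes' : 'no';
    print "$s: $hit\n";
}
```

The quotemeta call is a precaution in case an entry contains a regex metacharacter.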

General instructions for submission (repeated)
One pdf file containing everything
Code and output must both be submitted
Summarize/explain what you did
If you like, you may add your programs separately as attachments (so I can download and run them if necessary).
Submission due date: next Wednesday midnight (before Thursday class)

Regex Recursion
Word palindrome = a word that reads the same backwards or forwards, e.g. kayak and racecar.
Normally regexes cannot express palindromes, but Perl regexes can because we can use backreferences recursively.
Note: recursion here refers to the ability to repeatedly embed regexes inside themselves

Regex Recursion Program: (?group-ref)
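The program shown on the original slide is not preserved in this transcript. The following is a sketch of one way to write a palindrome test with a recursive group reference, not necessarily the lecturer's version. (?1) re-enters capture group 1 and needs Perl 5.10 or later; the sub name is_palindrome is made up for illustration.

```perl
use strict;
use warnings;

# Group 1 is either: some character, a (recursive) palindrome, and the
# same character again, or the base case of at most one character.
my $palindrome = qr/^((.)(?1)\2|.?)$/;

sub is_palindrome {
    my ($word) = @_;
    return $word =~ $palindrome ? 1 : 0;
}

for my $w (qw(kayak racecar banana)) {
    print "$w: ", (is_palindrome($w) ? "palindrome" : "not a palindrome"), "\n";
}
```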

Regex Lookahead and Lookback
Zero-width regexes:
  ^ (start of string)
  $ (end of string)
  \b (word boundary): matches the imaginary position between \w\W or \W\w, or just before the beginning of the string if ^\w, or just after the end of the string if \w$
Current position of match (so far) doesn't change!
  (?=regex) (lookahead from current position)
  (?<=regex) (lookback from current position)
  (?!regex) (negative lookahead)
  (?<!regex) (negative lookback)
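The four lookaround forms can be tried on a single string. This is an illustrative sketch: the sample string and the sub name lookaround_matches are assumptions, chosen so that each form picks out a different number.

```perl
use strict;
use warnings;

# Returns the number captured by each of the four lookaround forms.
sub lookaround_matches {
    my ($s) = @_;
    my ($ahead)      = $s =~ /(\d+)(?= units)/;      # digits followed by " units"
    my ($behind)     = $s =~ /(?<=\$)(\d+)/;         # digits preceded by "$"
    my ($not_ahead)  = $s =~ /\b(\d+)\b(?! units)/;  # whole number NOT followed by " units"
    my ($not_behind) = $s =~ /(?<!\$)\b(\d+)/;       # number NOT preceded by "$"
    return ($ahead, $behind, $not_ahead, $not_behind);
}

my @m = lookaround_matches('cost: $45 and 30 units');
print "lookahead: $m[0], lookback: $m[1], ",
      "negative lookahead: $m[2], negative lookback: $m[3]\n";
# prints: lookahead: 30, lookback: 45, negative lookahead: 45, negative lookback: 30
```

In every case the lookaround text itself (" units" or "$") is checked but not consumed, so it is not part of the captured match.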

Regex Lookahead and Lookback
Example: looks for a word beginning with _ such that there is a duplicate ahead without the _
Restriction: lookback cannot be variable length in Perl
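The regex from the original slide is not preserved here. One possible pattern matching that description is sketched below; the sub name underscore_duplicate and the sample sentences are made up for illustration.

```perl
use strict;
use warnings;

# Find a word written as _word such that the bare word occurs later on.
# The lookahead checks for the duplicate without consuming any text.
sub underscore_duplicate {
    my ($s) = @_;
    return $s =~ /\b_(\w+)\b(?=.*\b\1\b)/ ? $1 : undef;
}

for my $s ('_cat sat with the cat', '_cat sat with the dog') {
    my $w = underscore_duplicate($s);
    print "$s: ", (defined $w ? "found $w" : "no duplicate"), "\n";
}
```

Because _ counts as a \w character, \b\1\b cannot match inside _cat itself, so only a separate occurrence of the word satisfies the lookahead.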

Debugging Perl regex
(?{ Perl code })
  can be inserted anywhere in a regex
  can assist with debugging
Example:
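The slide's own example is not preserved in this transcript. As a sketch of the idea, the code block below records what group 1 holds each time the matcher gets past it; the names @trace and trace_match are assumptions for illustration.

```perl
use strict;
use warnings;

our @trace;   # package variable, so the embedded code block can see it on older Perls

sub trace_match {
    my ($s) = @_;
    @trace = ();
    # After group 1 closes, record its current contents, then continue matching.
    my $ok = $s =~ /^(a+)(?{ push @trace, $1 })ab/;
    return ($ok ? 1 : 0, @trace);
}

my ($ok, @seen) = trace_match('aaab');
print "matched: $ok\n";
print "group 1 candidate: $_\n" for @seen;
```

The last recorded candidate is the one used in the successful match (here aa). Earlier entries, if any, are attempts that the matcher abandoned when it backtracked; how many appear can depend on the regex optimizer.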

Prime Number Testing using Perl Regular Expressions
Another example: the set of prime numbers is not a regular language
Lprime = {2, 3, 5, 7, 11, 13, 17, 19, 23, ...}
Turns out, we can use a Perl regex to determine membership in this set, and to factorize numbers:
/^(11+?)\1+$/

L = {1^n | n is prime} is not a regular language
  can be proved using the Pumping Lemma for regular languages (later)
Keys to making this work:
  \1 backreference
  unary notation for representing numbers, e.g. 11111 “five ones” = 5, 111111 “six ones” = 6
  unary notation allows us to factorize numbers by repetitive pattern matching
    (11)(11)(11) “six ones” = 6
    (111)(111) “six ones” = 6
  numbers that can be factorized in this way aren’t prime
    no way to get nontrivial subcopies of 11111 “five ones” = 5
Then /^(11+?)\1+$/ will match anything that’s greater than 1 that’s not prime
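A minimal sketch of the unary trick: build a string of n ones and call n prime when the regex fails to match. The sub name is_prime and the explicit n < 2 check are additions for illustration; the check is needed because the regex also fails to match on 0 and 1, which are not prime.

```perl
use strict;
use warnings;

sub is_prime {
    my ($n) = @_;
    return 0 if $n < 2;            # the regex alone cannot rule out 0 and 1
    my $unary = '1' x $n;          # e.g. 5 becomes "11111"
    # A match means the string splits into two or more equal blocks
    # of length >= 2, i.e. n has a nontrivial factor.
    return $unary !~ /^(11+?)\1+$/ ? 1 : 0;
}

print join(' ', grep { is_prime($_) } 1 .. 30), "\n";
# prints: 2 3 5 7 11 13 17 19 23 29
```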

Let’s analyze this Perl regex: /^(11+?)\1+$/
  ^ and $ anchor both ends of the string, forcing (11+?)\1+ to cover the string exactly
  (11+?) is the non-greedy version of (11+)
  \1+ provides one or more copies of what we matched in (11+?)
Question: is the non-greedy operator necessary?

Compare /^(11+?)\1+$/ with /^(11+)\1+$/
i.e. non-greedy vs. greedy matching
  finds smallest factor vs. largest
  90021 factored using 3, not a prime (0 secs) vs. 90021 factored using 30007, not a prime (0 secs)
  affects computational efficiency for non-primes
Puzzling behavior: same output non-greedy vs. greedy
  … factored using …, not a prime (48 secs vs. 13 secs)
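The smallest-versus-largest behavior can be checked on small numbers. In this sketch the helper factor_with is a made-up name; it returns the length of group 1, which is the factor the regex found. The test values are kept small and are not the ones from the slide.

```perl
use strict;
use warnings;

my $non_greedy = qr/^(11+?)\1+$/;   # tries short blocks first
my $greedy     = qr/^(11+)\1+$/;    # tries long blocks first

# Return the factor found by the given regex, or undef if there is none.
sub factor_with {
    my ($n, $regex) = @_;
    return ('1' x $n) =~ $regex ? length($1) : undef;
}

for my $n (15, 35, 17) {
    my $small = factor_with($n, $non_greedy);
    my $large = factor_with($n, $greedy);
    print defined $small
        ? "$n: non-greedy finds $small, greedy finds $large\n"
        : "$n: no factor found\n";
}
# prints:
# 15: non-greedy finds 3, greedy finds 5
# 35: non-greedy finds 5, greedy finds 7
# 17: no factor found
```

Both versions agree on whether a number is prime; they differ only in which factor is reported and in how much backtracking is needed to find it.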

Prime Numbers
100003 200003 300007 400009 500009 600011 700001 800011 900001
Testing with prime numbers only: can take a lot of time to compute …

/^(11+?)\1+$/ vs. /^(11+)\1+$/
i.e. non-greedy vs. greedy matching
  finds smallest factor vs. largest
  90021 factored using 3, not a prime (0 secs) vs. 90021 factored using 30007, not a prime (0 secs)
Puzzling behavior: same output non-greedy vs. greedy
  … factored using …, not a prime (48 secs vs. 13 secs)

Nearest primes to preset limit:
  3 * … = …
  3 * 32771 = 98313

When preset limit is exceeded: Perl’s regex matching fails quietly

Can also get non-greedy to skip several factors
Example: pick non-prime … = 3 x 5 x … (prime factorization)
Non-greedy: missed factors 3 and 5 …
Because 3 * … = 5 * … = … (32766 limit)
15 * … = …: greedy version

Results are still right so far though, wrt. prime vs. non-prime.
But we predict it will report an incorrect result for 1,070,009,521: it should claim (incorrectly) that this is prime since … = …
