LING/C SC/PSYC 438/538 Lecture 9 9/22 Sandiway Fong.

1 LING/C SC/PSYC 438/538 Lecture 9 9/22 Sandiway Fong

2 Administrivia Review – Paragraph/sentence homework – Set of states construction exercises

3 Prime Number Testing using Perl Regular Expressions [Thanks to Mark Tokutomi…]
I mentioned that
– Regular expressions are equivalent to regular grammars and finite state automata
– But Perl has added so many bells and whistles to its flavor of regular expressions that this equivalence no longer holds
In fact, you can code up prime number testing using Perl regular expressions… lots of references on the web.
– If we represent a number in unary notation, e.g. 5 = "11111"
– /^(11+?)\1+$/ will match any number greater than 1 that's not prime
– Key to making this work: the \1 backreference
– L = {1^n | n is prime} is not a regular language
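The unary trick can be tried out with a few lines of Perl. This is a minimal sketch: the helper name is_prime is hypothetical, and the explicit check for numbers below 2 is an added assumption, since the slide's pattern only talks about numbers greater than 1.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $n is prime, using the unary regex trick.
sub is_prime {
    my ($n) = @_;
    return 0 if $n < 2;            # assumption: 0 and 1 are handled separately
    my $unary = "1" x $n;          # e.g. 5 becomes "11111"
    # The pattern matches composites: a block of two or more 1s (the group)
    # repeated one or more further times (\1+) must cover the whole string.
    return $unary !~ /^(11+?)\1+$/ ? 1 : 0;
}

print join(" ", grep { is_prime($_) } 0 .. 30), "\n";
```

The backreference \1 forces every repetition to have the same length as the first block, which is exactly a divisibility test; a finite state automaton cannot do this.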

4 Homework Review
From lecture 6: Task 1 438/538 (15pts)
– write a Perl program that counts the number of paragraphs and sentences for Article247_499.txt
– "Raw text"
– Blank lines separate paragraphs
– Sentences are not in 1-to-1 correspondence with lines
We can do fairly well. But a perfect solution would require "solving" language.

5 Background
Article
– Article247_499.txt comes from the American National Corpus (ANC) (Release 2)
ANC sentence boundary labeling
– All sentence markup was automatically produced by the sentence splitter included in the Gate system, and there are occasional errors, usually due to the presence of unrecognized abbreviations in mid-sentence.
– The sentence splitter also puts punctuation appearing after the terminating period (e.g., closing quotation mark, closing parenthesis) outside the sentence boundary.
Open ANC (OANC)
– All annotations were originally produced automatically using GATE's ANNIE system.
– Some of the texts in the OANC include manually validated sentence boundaries (the list of texts validated for sentence boundaries is here).

6 Background Stand-off annotation:

7 Background
p1 has 4 sentences according to the official automatic sentence boundary detector

8 Homework Review
Let's develop this program step-by-step… we want to initialize and maintain two counters
– $p number of paragraphs
– $s number of sentences
Paragraph Counting
– Blank lines separate paragraphs
– Regular expression: /^\s*\n$/
– i.e. containing only a newline (possibly preceded by superfluous whitespace)
Start with the file reading code template:
    open($txt, $ARGV[0]) or die "can't open $ARGV[0]!\n";
    my $p = 0;
    my $s = 0;
    while ($line = <$txt>) {
    }

9 Homework Review
Code:
    open($txt, $ARGV[0]) or die "can't open $ARGV[0]!\n";
    my $p = 0;
    my $s = 0;
    while ($line = <$txt>) {
        $p++ if $line is a blank line and last line wasn't a blank line
    }
    print "Number of paragraphs: $p\n";
Paragraph counting
– blank line: /^\s*\n$/
– consecutive blank lines count as one
– Programming technique: use a flag to signal whether the last line was blank
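The flag technique can be sketched as runnable Perl. This is one possible rendering, not necessarily the lecture's exact code: the function name count_paragraphs and the flag name $lastblank are chosen for illustration, the lines are passed in as a list rather than read from a file, and the final check for a last paragraph with no trailing blank line is an added assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: count paragraphs in a list of lines,
# where runs of blank lines separate paragraphs.
sub count_paragraphs {
    my @lines = @_;
    my $p = 0;
    my $lastblank = 1;                # flag: behave as if a blank line came first
    for my $line (@lines) {
        if ($line =~ /^\s*\n$/) {     # blank line
            $p++ unless $lastblank;   # a paragraph has just ended
            $lastblank = 1;           # consecutive blank lines count as one
        } else {
            $lastblank = 0;
        }
    }
    $p++ unless $lastblank;           # assumption: text may end without a blank line
    return $p;
}

my @demo = ("Title\n", "\n", "\n", "first line\n", "second line\n", "  \n", "last\n");
print "Number of paragraphs: ", count_paragraphs(@demo), "\n";
```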

10 Homework Review
Sentence counting: Basics
– In English, periods, question marks and exclamation marks (not present in supplied article) are used to mark the end of a sentence
    [.!?]
– followed optionally by closing double or single quotes and/or right parentheses, square brackets, curly braces etc. (not all present in supplied article)
    []"')}]*
– followed by white space or end of the line
    ($|\s)
First cut
– Increment sentence counter if we find a sentence ending period…
    $s++ if $line =~ /[.!?][]"')}]*($|\s)/;
Examples:
– U.N. Security
– U.N. inspectors
– R.J. Reynolds
– James Q. Wilson,
– Georgia Gov. Zell
– strategy.
– teens. This
– Times. The
– created."
– paraphernalia." And
– already?"
– thrilled.")
– Really Juvenile Reynolds
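To see where the first cut goes wrong, the pattern can be run over the example strings. A small sketch; the helper name has_sentence_end is hypothetical.

```perl
use strict;
use warnings;

# First-cut pattern: end punctuation, optional closers, then whitespace or end of line.
my $first_cut = qr/[.!?][]"')}]*($|\s)/;

# Hypothetical helper: true if the first-cut pattern matches anywhere in $text.
sub has_sentence_end {
    my ($text) = @_;
    return $text =~ $first_cut ? 1 : 0;
}

for my $ex ('U.N. Security', 'James Q. Wilson,', 'Georgia Gov. Zell',
            'strategy.', 'created."', 'thrilled.")', 'Really Juvenile Reynolds') {
    printf "%-26s %s\n", $ex, has_sentence_end($ex) ? "match" : "no match";
}
```

Every example except the headline matches, so abbreviations, initials and titles are over-counted, while the headline is missed because it has no period.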

11 Homework Review
Debugging help
– Useful to print out what preceded the matching regular expression
    if ($line =~ /[.!?][]"')}]*($|\s)/) {
        $s++;
        print " \n";
    }
– Paragraph 2, 6 sentences
– Paragraph 3, 2 sentences
Special variables
– Preceding: $`
– Matched: $&
– After: $'
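The three match variables can be inspected as follows. A minimal sketch: the helper name match_parts and the angle-bracket output format are illustrative choices, not taken from the slides.

```perl
use strict;
use warnings;

# Hypothetical helper: return (preceding, matched, after) for the first
# sentence-ending match in $line, or an empty list if there is none.
sub match_parts {
    my ($line) = @_;
    if ($line =~ /[.!?][]"')}]*($|\s)/) {
        return ($`, $&, $');
    }
    return;
}

my ($before, $match, $after) =
    match_parts("market cigarettes to teens. This is also the top");
print "preceding <$before>\n";
print "matched   <$match>\n";
print "after     <$after>\n";
```

Printing the delimiters makes stray whitespace visible, which helps when a count looks too high.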

12 Homework Review
Examples:
– U.N. Security
– U.N. inspectors
– R.J. Reynolds
– James Q. Wilson,
– Georgia Gov. Zell
– strategy.
– teens. This
– Times. The
– created."
– paraphernalia." And
– already?"
– thrilled.")
– Really Juvenile Reynolds
Heuristics
– Inexact methods (not foolproof)
Spot
– Initials (Q.)
– Abbreviations (U.N.) and titles (Mr., Ms., Ms)
– Headlines: no period in the title
Be careful… Consider: Call IBM. I like the letter J.
Suppose the headline counts as one sentence:
    $p++;
    $s++ if ($s == 0);
    print "Paragraph $p, $s sentence(s)\n";
    $s = 0;

13 Homework Review
Examples:
– U.N. Security
– U.N. inspectors
– R.J. Reynolds
– James Q. Wilson,
– Georgia Gov. Zell
Heuristic
– A period doesn't end a sentence if it is immediately preceded by a capital letter
– works for all the outstanding cases in the sample article except Gov.
– but it wrongly rejects real sentence ends such as: Call IBM. I like the letter J.
Implementation:
– /[^A-Z][.!?][]"')}]*($|\s)/
– or
– /(?<![A-Z])[.!?][]"')}]*($|\s)/
– (?<!re) is a negative lookbehind assertion (perlretut); it does not consume the preceding character, for better compatibility with $`
Block Gov. explicitly:
    if ($line =~ /(?<![A-Z])[.!?][]"')}]*($|\s)/) {
        $s++ if $` !~ /\bGov$/;
    }
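The difference between the two implementations shows up in $`. The sketch below uses hypothetical helper names (preceding_text, is_sentence_end) and checks only the first candidate match in each string.

```perl
use strict;
use warnings;

my $consuming  = qr/[^A-Z][.!?][]"')}]*($|\s)/;     # eats the preceding character
my $lookbehind = qr/(?<![A-Z])[.!?][]"')}]*($|\s)/; # only inspects it

# Hypothetical helper: text before the first match of $re, or undef.
sub preceding_text {
    my ($text, $re) = @_;
    return $text =~ $re ? $` : undef;
}

# Hypothetical helper: true if the first candidate counts as a sentence end.
sub is_sentence_end {
    my ($text) = @_;
    if ($text =~ $lookbehind) {
        return $` !~ /\bGov$/ ? 1 : 0;   # block "Gov." explicitly
    }
    return 0;
}

print "consuming:  <", preceding_text('Georgia Gov. Zell', $consuming), ">\n";
print "lookbehind: <", preceding_text('Georgia Gov. Zell', $lookbehind), ">\n";
for my $ex ('U.N. Security', 'James Q. Wilson,', 'Georgia Gov. Zell',
            'teens. This', 'Call IBM. I like the letter J.') {
    printf "%-32s %s\n", $ex, is_sentence_end($ex) ? "sentence end" : "not counted";
}
```

With the consuming pattern the "v" of "Gov" becomes part of the match, so $` ends in "Go" and the /\bGov$/ test would never fire. The last example is not counted either, which is the "be careful" case: capital-letter abbreviations at a genuine sentence end are missed.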

14 Homework Review
Examples
– "There is nothing in the [Unabomber] manifesto that looks at
– all like the work of a madman. The language is clear, precise and calm. The
Problem
– multiple sentence ending periods on one line
One solution:
– use global matching //g with a while loop
– see perlretut
E.g.
    while ($line =~ /[.!?][]"')}]*($|\s)/g) {
        $s++;
    }
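Global matching in scalar context resumes from the end of the previous match, so the loop body runs once per sentence end. A short sketch with a hypothetical helper name, count_sentence_ends:

```perl
use strict;
use warnings;

# Hypothetical helper: count all sentence-ending matches on one line.
sub count_sentence_ends {
    my ($line) = @_;
    my $s = 0;
    # With //g in scalar context each iteration continues after the last match.
    while ($line =~ /[.!?][]"')}]*($|\s)/g) {
        $s++;
    }
    return $s;
}

my $line = "all like the work of a madman. The language is clear, precise and calm. The";
print count_sentence_ends($line), " sentence ends found\n";
```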

15 Homework Review
Answer:
– Paragraph 1, 1 sentence(s)
– Paragraph 2, 3 sentence(s)
– Paragraph 3, 2 sentence(s)
– Paragraph 4, 2 sentence(s)
– Paragraph 5, 1 sentence(s)
– Paragraph 6, 2 sentence(s)
– Paragraph 7, 2 sentence(s)
– Paragraph 8, 3 sentence(s)
– Paragraph 9, 1 sentence(s)
– Paragraph 10, 2 sentence(s)
– Paragraph 11, 5 sentence(s)
– Paragraph 12, 1 sentence(s)

16 Homework Review
Task 2 438/538 (15pts)
– Modify your Perl program to produce xml paragraph and sentence boundary markup for Article247_499.txt
– i.e. produces reformatted raw text as sentence 1 sentence 2 …
Revisions
– No need to maintain counter for paragraphs or sentences
– Re-use variable $s to accumulate bits of current sentence
– Print whenever we have a blank line that wasn't preceded by another blank line
    while ($line = <$txt>) {
        if ($line =~ /^\s*\n$/) {
            print " \n" if (!$lastblank);
            $lastblank = 1;
        } else {
            …
– Use a function trim to take out leading and trailing spaces before printing string inside …
    sub trim {
        my $str = shift;
        $str =~ s/^\s+//;
        $str =~ s/\s+$//;
        return $str;
    }

17 Homework Review
Revisions
– Print whenever we see a first non-blank line
    while ($line = <$txt>) {
        if ($line =~ /^\s*\n$/) {
            print " \n" if (!$lastblank);
            $lastblank = 1;
        } else {
            …
            print " \n" if $lastblank == 1;
– Split the line according to the sentence-ending regular expression, but add extra parentheses to retain the split pattern in the array
    $line =~ s/\n/ /;
    @a = split /((?<![A-Z])[.!?][]"')}]*($|\s))/, $line;
– Notes: convert the newline into a space first; ($|\s) is also stored! (discard it)
– Example: persistently attempted to market cigarettes to teens. This is also the top
  gets split into a[0] a[1] a[2] a[3]
– Example: sites.
  gets split into a[0] a[1] a[2]
– Discard using shift @a;
– $s.= shift @a; to append these items onto the end of $s
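Printing the pieces makes it clear what split keeps when the pattern contains capturing parentheses. A minimal sketch; the helper name split_line is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: split one line at sentence ends, keeping the captured
# punctuation (outer group) and the captured ($|\s) (inner group) as fields.
sub split_line {
    my ($line) = @_;
    $line =~ s/\n/ /;    # convert the newline into a space first
    return split /((?<![A-Z])[.!?][]"')}]*($|\s))/, $line;
}

my @a = split_line("persistently attempted to market cigarettes to teens. This is also the top\n");
for my $i (0 .. $#a) {
    print "a[$i] = <$a[$i]>\n";
}
```

Here a[1] holds the period plus the following space (the outer group) and a[2] holds the space alone (the inner group), which is why a[2] can be discarded.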

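18 Homework Review
Revisions
– Example: all like the work of a madman. The language is clear, precise and calm. The
  gets split into a[0] a[1] a[2] a[3] a[4] a[5] a[6]
Code (shift and discard):
    @a = split /((?<![A-Z])[.!?][]"')}]*($|\s))/, $line;
    if ($#a == 0) {
        $s .= $line;                  # no sentence ending on this line
    } else {
        while ($#a >= 0) {
            if ($#a == 0) {
                $s .= shift @a;       # leftover text starts the next sentence
            } elsif ($a[0] =~ / Gov$/) {
                $s .= shift @a;       # "Gov." does not end the sentence:
                $s .= shift @a;       # keep its period and carry on
                shift @a;             # superfluous ($|\s)
            } elsif ($#a >= 2) {
                $s .= shift @a;       # sentence text
                $s .= shift @a;       # terminating punctuation
                shift @a;             # superfluous ($|\s)
                print " ".trim($s)." \n";
                $s = "";
            } else {
                die "unexpected split result\n";
            }
        }
    }
    $lastblank = 0;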
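Putting the pieces together, the whole markup pass might look like the sketch below. Several things here are assumptions rather than the lecture's code: the tag names <p> and <s>, the function name markup, taking the lines as a list and returning a string instead of reading a file and printing, and emitting unterminated text (such as a headline) as a sentence when its paragraph closes.

```perl
use strict;
use warnings;

# Remove leading and trailing whitespace.
sub trim {
    my $str = shift;
    $str =~ s/^\s+//;
    $str =~ s/\s+$//;
    return $str;
}

# Hypothetical driver: turn raw lines into paragraph/sentence markup.
# The <p> and <s> tag names are assumptions.
sub markup {
    my @lines = @_;
    my $out = "";
    my $s = "";              # accumulates bits of the current sentence
    my $lastblank = 1;       # behave as if a blank line came first
    for my $line (@lines) {
        if ($line =~ /^\s*\n$/) {                    # blank line
            if (!$lastblank) {
                # assumption: leftover text (e.g. a headline) is one sentence
                $out .= "<s>" . trim($s) . "</s>\n" if trim($s) ne "";
                $s = "";
                $out .= "</p>\n";
            }
            $lastblank = 1;
            next;
        }
        $out .= "<p>\n" if $lastblank;               # first non-blank line
        $lastblank = 0;
        $line =~ s/\n/ /;                            # newline becomes a space
        my @a = split /((?<![A-Z])[.!?][]"')}]*($|\s))/, $line;
        while (@a) {
            if (@a < 3) {                            # no sentence end left
                $s .= join("", @a);
                last;
            }
            my $gov = ($a[0] =~ /\bGov$/);
            $s .= shift @a;                          # sentence text
            $s .= shift @a;                          # terminating punctuation
            shift @a;                                # superfluous ($|\s)
            next if $gov;                            # "Gov." is not a sentence end
            $out .= "<s>" . trim($s) . "</s>\n";
            $s = "";
        }
    }
    if (!$lastblank) {                               # close the final paragraph
        $out .= "<s>" . trim($s) . "</s>\n" if trim($s) ne "";
        $out .= "</p>\n";
    }
    return $out;
}

print markup("Really Juvenile Reynolds\n", "\n",
             "Cigarettes were marketed to\n", "teens. This is the top story.\n", "\n",
             "Georgia Gov. Zell spoke.\n");
```

Building the output as a string keeps the sketch easy to check; a file-based version would read lines with while ($line = <$txt>) and print instead.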

19 Homework Review
Output
–
– Really Juvenile Reynolds
–
– USA Today and the Washington Post lead with revelations from newly disclosed R.J. Reynolds internal documents that seem to show that the company has persistently attempted to market cigarettes to teens.
– This is also the top national story at the Los Angeles Times.
– The New York Times leads with the U.N. Security Council's vote telling Iraq to honor previous promises to allow U.N. inspectors complete access to suspected weapons sites.
–
– Social scientist James Q. Wilson, in a Times op-ed piece, makes the following argument for the sanity of Ted Kacszynski: "There is nothing in the [Unabomber] manifesto that looks at all like the work of a madman.
– The language is clear, precise and calm.
– The argument is subtle and carefully developed, lacking anything even faintly resembling the wild claims or irrational speculation that a lunatic might produce."
– Wilson also observes that besides the Unabomber's manifesto, his "skill in manufacturing bombs and the clever ways in which he concealed his identity suggest to me that he was clearly sane."
– Of course, the legal usefulness of these observations is somewhat dubious, because this attempt to show that Theodore Kacszynski is fit to stand trial depends on the assumption that he is the Unabomber, which in turn requires a trial first, which first requires showing that he is fit to stand trial, and so on.
–
– With NBC's deal to retain "ER" at a cost of $13 million an episode getting front-page coverage at the NYT, LAT, and the WSJ, maybe the next big domestic policy issue will be controlling television health care costs.
– …
Your output should resemble this…

20 Homework Review That’s all folks! (If you were a non-programmer before this class, and you made it this far – Congratulations! You are entitled to call yourself a programmer now…)

