LING/C SC/PSYC 438/538 Lecture 6 9/13 Sandiway Fong.


1 LING/C SC/PSYC 438/538 Lecture 6 9/13 Sandiway Fong

2 Administrivia Homework – out today – Due next Monday (September 20th) by midnight

3 Shortest vs. Greedy Matching Default behavior – in Perl RE match: longest possible matching string – aka “greedy matching”. This behavior can be changed, see following slide. RE search is supposed to be fast – but search time is not necessarily proportional to the length of the input being searched – in fact, Perl RE matching can take exponential time (in the length of the input) – matching is non-deterministic: it may need to backtrack (revisit earlier choices) if it matches incorrectly part of the way through. [Figure: two plots of time against length, one linear and one exponential]

4 Shortest vs. Greedy Matching from http://www.perl.com/doc/manual/html/pod/perlre.html Example: $_ = "The food is under the bar in the barn."; if ( /foo(.*)bar/ ) { print "got <$1>\n"; } Notes: – $_ is the default variable for matching – $1 refers to the parenthesized part of the match (.*) Output: – got <d is under the bar in the >

5 Shortest vs. Greedy Matching from http://www.perl.com/doc/manual/html/pod/perlre.html Example: $_ = "The food is under the bar in the barn."; if ( /foo(.*?)bar/ ) { print "got <$1>\n"; } Notes: – ? immediately following a repetition operator like * makes the operator work in non-greedy mode Output: – got <d is under the >
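
The same ? suffix applies to other repetition operators such as +. The following is a minimal sketch on a different, made-up string; the helper name first_capture and the sample text are illustrative assumptions, not taken from the slides.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first capture group of $re on $text,
# or undef if there is no match.
sub first_capture {
    my ($re, $text) = @_;
    return $text =~ $re ? $1 : undef;
}

# Illustrative input (an assumption, not from the lecture).
my $text = '<b>bold</b> and <i>italic</i>';

# Greedy .+ runs to the end of the string, then backtracks to the LAST '>'.
my $greedy = first_capture(qr/<(.+)>/, $text);

# Non-greedy .+? stops at the FIRST '>' that lets the match succeed.
my $lazy = first_capture(qr/<(.+?)>/, $text);

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

Here the greedy pattern captures everything between the first < and the last >, while the non-greedy one captures only b.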

6 Split @array = split /re/, string – splits string into a list of substrings split by re. Each substring is stored as an element of @array. Examples (from perlrequick tutorial):
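
The example images for this slide did not survive in the transcript. As a stand-in, here is a small sketch of split in the style of the perlrequick tutorial; the input strings and variable names are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Split on runs of whitespace: one element per word.
my @words = split /\s+/, "Calvin and Hobbes";

# Split on a comma followed by optional whitespace.
my @consts = split /,\s*/, "1.618,2.718,   3.142";

print join("|", @words), "\n";    # Calvin|and|Hobbes
print join("|", @consts), "\n";   # 1.618|2.718|3.142
```

Note that the text matched by the RE itself (the separator) is not included in the resulting substrings.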

7 Split m!re! (using ! – or some other character - as a RE delimiter) Is equivalent to /re/ More examples:
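
The examples for this slide are also missing from the transcript. A minimal sketch of why an alternative delimiter is handy: when the pattern itself contains /, writing m!...! avoids backslash-escaping. The path string below is an illustrative assumption.

```perl
use strict;
use warnings;

# Illustrative input (an assumption, not from the lecture).
my $path = "/usr/local/bin/perl";

# With ! as the delimiter, the / inside the pattern needs no escaping.
# The path starts with the separator, so the first field is empty.
my @parts = split m!/!, $path;

# The same pattern written both ways gives the same result.
my $bang_hit  = ($path =~ m!/bin/!)  ? 1 : 0;
my $slash_hit = ($path =~ /\/bin\//) ? 1 : 0;

print join(",", map { "[$_]" } @parts), "\n";
print "same result: ", ($bang_hit == $slash_hit ? "yes" : "no"), "\n";
```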

8 Words and Lines Range Abbreviations: – period (.) stands for any character (except newline) – \d (digit) = [0-9] – \s (whitespace character) = space (SP), tab (HT), carriage return (CR), newline (LF) or form feed (FF) – \w (word character) = [0-9a-zA-Z_] – uppercase versions, e.g. \D and \W, denote negation... Line-oriented metacharacters: – caret (^) at the beginning of a regexp string matches the “beginning of a line” – dollar sign ($) at the end of a regexp string matches the “end of the line” Word-oriented metacharacters: – a word is any sequence of digits [0-9], underscores (_) and letters [a-zA-Z] – \b matches a word boundary: a zero-width position between a word character and a non-word character, e.g. at the beginning of a line or next to a whitespace character, etc.
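
A short sketch of these abbreviations and anchors in use. The sample line and variable names are assumptions made up for illustration; in list context, a match with the /g modifier returns all matches.

```perl
use strict;
use warnings;

# Illustrative input (an assumption, not from the lecture).
my $line = "Lecture 6 on 9/13: cat catalog bobcat";

my @numbers = $line =~ /\d+/g;       # runs of digits: 6, 9, 13
my @words   = $line =~ /\w+/g;       # runs of word characters
my @whole   = $line =~ /\bcat\b/g;   # "cat" as a whole word only
my @prefix  = $line =~ /\bcat/g;     # "cat" at the start of a word

my $starts = ($line =~ /^Lecture/) ? 1 : 0;   # anchored at beginning of line
my $ends   = ($line =~ /bobcat$/)  ? 1 : 0;   # anchored at end of line

print "numbers: @numbers\n";
print "word count: ", scalar(@words), "\n";
print "whole-word cat: ", scalar(@whole), ", word-initial cat: ", scalar(@prefix), "\n";
print "starts: $starts, ends: $ends\n";
```

The pattern \bcat\b matches only the standalone word, \bcat also matches the start of catalog, and neither matches inside bobcat because there is no word boundary before its c.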

9 Homework Sample : – – Really Juvenile Reynolds – – USA – Today and the Washington Post lead with revelations from newly disclosed – R.J. Reynolds internal documents that seem to show that the company has – persistently attempted to market cigarettes to teens. This is also the top – national story at the Los Angeles Times. The New York Times – leads with the U.N. Security Council's vote telling Iraq to honor previous – promises to allow U.N. inspectors complete access to suspected weapons – sites. – – The new tobacco documents (many of them marked "Secret"), released as part – of a lawsuit settlement, show a company strategy of attracting teenagers – through advertising and various youth-oriented promotions such as, according to – USAT, "NASCAR sponsorship," "inner city activities," and "T-shirts and – other paraphernalia." And says USAT, the documents show that RJR's – introduction of "Joe Camel" fits in to this strategy. Theme: dealing with raw text File: data/written_1/journal/slate/3/Article247_499.txt (ANC – American National Corpus: 100 million words) Genre: journal, (Slate Magazine article from 1998)

10 Homework One of the first steps in processing raw text is to clean and mark it up (xml) Task 1 438/538 (15pts) – write a Perl program that counts the number of paragraphs and sentences for Article247_499.txt (download from class webpage) See next slide for output format – Discuss what the technical problems are with sentence boundary markup and describe your solution. e.g. what regular expressions you are going to use – Submit your program and its output on Article247_499.txt

11 Homework Help Useful code fragment – use previously described template: open($txtfile,$ARGV[0]) or die "$ARGV[0] not found!\n"; while ($line = <$txtfile>) { do RE stuff with $line } – Example: perl processfile.pl Article247_499.txt

12 Homework Help <$txtfile> reads in a line of text including the newline (\n) character – so lines are one character longer than you might think. The real world is messy – Article247_499.txt is not quite uniform: sentences are split across lines, and it may contain extra whitespace and invisible characters you can’t see with a regular text editor. – The file Article247_499.txt you are given is actually not quite raw text – I’ve pre-converted it to ASCII (UTF-8) for you to make life a bit easier – Original was in UTF-16 (big-endian) with nasty non-printable BOM (U+FEFF) and null characters
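
A minimal sketch of the "one character longer" point and of stripping invisible characters. The helper name clean_line and the sample strings are assumptions for illustration; which invisible characters actually occur in the homework file is something to check for yourself.

```perl
use strict;
use warnings;

# Hypothetical helper: remove the line terminator and any null characters.
sub clean_line {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;   # drop a trailing newline, with optional carriage return
    $line =~ tr/\0//d;      # delete null characters
    return $line;
}

# A line as read from a file still carries its newline.
my $raw = "sites.\n";
my $raw_length = length($raw);      # 7, not 6

# chomp removes the trailing newline in place.
my $chomped = $raw;
chomp $chomped;

print "raw length: $raw_length, after chomp: ", length($chomped), "\n";
print "cleaned: [", clean_line("a\0b\r\n"), "]\n";
```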

13 Homework Help You will need to determine how you’re going to pattern match paragraph separators and end of sentences. Input Delimiter http://www.bayview.com/blog/2002/07/29/input-delimiter/
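
One possible approach to paragraph separators is Perl's input record separator variable $/. This sketch assumes that is what the "Input Delimiter" link refers to. Setting $/ to the empty string selects paragraph mode, where each read returns one blank-line-delimited block. The helper name read_paragraphs and the in-memory sample text are illustrative assumptions; sentence detection is deliberately left out.

```perl
use strict;
use warnings;

# Hypothetical helper: split text into paragraphs using paragraph mode.
sub read_paragraphs {
    my ($text) = @_;
    # Open the string as an in-memory file so the example is self-contained.
    open(my $fh, '<', \$text) or die "cannot open in-memory file: $!";
    local $/ = "";        # paragraph mode: one or more blank lines end a record
    my @paras = <$fh>;    # each element is one paragraph
    close $fh;
    chomp @paras;         # in paragraph mode, chomp strips all trailing newlines
    return @paras;
}

# Illustrative input (an assumption, loosely modeled on the sample article).
my $text = "Really Juvenile Reynolds\n\n"
         . "USA Today and the Washington Post lead\nwith revelations.\n\n\n"
         . "The new tobacco documents\nshow a strategy.\n";

my @paras = read_paragraphs($text);
printf "%d paragraphs\n", scalar @paras;
```

One caveat under this approach: a line containing only spaces or tabs is not treated as blank in paragraph mode, so messy input may need cleaning first.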

14 Homework Sample : – – Really Juvenile Reynolds – – USA – Today and the Washington Post lead with revelations from newly disclosed – R.J. Reynolds internal documents that seem to show that the company has – persistently attempted to market cigarettes to teens. This is also the top – national story at the Los Angeles Times. The New York Times – leads with the U.N. Security Council's vote telling Iraq to honor previous – promises to allow U.N. inspectors complete access to suspected weapons – sites. – – The new tobacco documents (many of them marked "Secret"), released as part – of a lawsuit settlement, show a company strategy of attracting teenagers – through advertising and various youth-oriented promotions such as, according to – USAT, "NASCAR sponsorship," "inner city activities," and "T-shirts and – other paraphernalia." And says USAT, the documents show that RJR's – introduction of "Joe Camel" fits in to this strategy. Note: Assume blank lines separate paragraphs. Output Format: Paragraph 1: No. of sentences: 1 Paragraph 2: No. of sentences: 3 Paragraph 3: No. of sentences: 3 etc.

15 Homework Task 2 438/538 (15pts) – Modify your Perl program to produce xml paragraph and sentence boundary markup for Article247_499.txt – i.e. produces reformatted raw text as sentence 1 sentence 2 … – Each.. should occupy exactly one line of your output. – Leading and trailing spaces of a sentence should be deleted, e.g. The new tobacco … vs. The new tobacco … – Submit your program and its output on Article247_499.txt (Cut and paste everything from both tasks into one file for submission)
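
A minimal sketch of deleting leading and trailing whitespace, and of folding a sentence that spans several input lines onto one line. The helper names trim and one_line are assumptions for illustration; the markup itself is not shown here.

```perl
use strict;
use warnings;

# Hypothetical helper: delete leading and trailing whitespace.
sub trim {
    my ($s) = @_;
    $s =~ s/^\s+//;   # leading whitespace
    $s =~ s/\s+$//;   # trailing whitespace, including newlines
    return $s;
}

# Hypothetical helper: collapse internal whitespace runs (including
# newlines) to single spaces so the text fits on one line, then trim.
sub one_line {
    my ($s) = @_;
    $s =~ s/\s+/ /g;
    return trim($s);
}

print "[", trim("  The new tobacco \n"), "]\n";
print "[", one_line(" The new tobacco documents\nshow   a strategy. "), "]\n";
```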

