LING/C SC/PSYC 438/538 Lecture 6 9/13 Sandiway Fong.

Administrivia
Homework
– out today
– due next Monday (September 20th) by midnight

Shortest vs. Greedy Matching
Default behavior
– in a Perl RE match, a repetition operator takes the longest possible matching string
– aka “greedy matching”
This behavior can be changed, see following slide
RE search is supposed to be fast
– but search time is not necessarily proportional to the length of the input being searched
– in fact, Perl RE matching can take exponential time (in the length of the input)
– matching is non-deterministic: the matcher may need to backtrack (revisit earlier choices) if it matches incorrectly part of the way through
[Figure: two plots of time against length, one linear and one exponential]

Shortest vs. Greedy Matching
Example:
    $_ = "The food is under the bar in the barn.";
    if ( /foo(.*)bar/ ) {
        print "got <$1>\n";
    }
Notes:
– $_ is the default variable for matching
– $1 refers to the parenthesized part of the match (.*)
Output:
– got <d is under the bar in the >

Shortest vs. Greedy Matching
Example:
    $_ = "The food is under the bar in the barn.";
    if ( /foo(.*?)bar/ ) {
        print "got <$1>\n";
    }
Notes:
– ? immediately following a repetition operator like * makes the operator work in non-greedy mode
Output:
– got <d is under the >
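
To compare the two modes side by side, here is a minimal sketch that runs both patterns on the same sentence. The variable names ($text, $greedy, $lazy) are chosen for illustration and do not come from the slides.

```perl
use strict;
use warnings;

my $text = "The food is under the bar in the barn.";

# Greedy: .* grabs as much as it can, then backs off to the LAST "bar".
my ($greedy) = $text =~ /foo(.*)bar/;

# Non-greedy: .*? grabs as little as it can, stopping at the FIRST "bar".
my ($lazy) = $text =~ /foo(.*?)bar/;

print "greedy:     <$greedy>\n";   # greedy:     <d is under the bar in the >
print "non-greedy: <$lazy>\n";     # non-greedy: <d is under the >
```

The angle brackets make the trailing space in each captured string visible.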

Split
split /re/, string
– splits string into a list of substrings, using re as the separator
– each substring is stored as one element of the resulting list
Examples (from perlrequick tutorial):
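
The tutorial's examples did not survive in this transcript, so the following is only a small sketch in the same spirit. The input strings and variable names are assumptions made for illustration.

```perl
use strict;
use warnings;

# Split on runs of whitespace, so repeated spaces do not yield empty fields.
my $line  = "Calvin  and Hobbes";
my @words = split /\s+/, $line;
print scalar(@words), " words: @words\n";   # 3 words: Calvin and Hobbes

# Split on a comma followed by optional whitespace.
my $csv    = "1.618,2.718,   3.142";
my @consts = split /,\s*/, $csv;
print "$_\n" for @consts;
```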

Split
m!re! (using ! or some other character as the RE delimiter) is equivalent to /re/
More examples:
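
An alternative delimiter is most useful when the pattern itself contains a slash. A minimal sketch, assuming a made-up path string:

```perl
use strict;
use warnings;

my $path = "usr/local/bin/perl";   # hypothetical input

# With the default delimiters the slash inside the pattern must be escaped.
my @parts_slash = split /\//, $path;

# With m!...! no escaping is needed; both calls split the same way.
my @parts_bang = split m!/!, $path;

print join("|", @parts_bang), "\n";   # usr|local|bin|perl
```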

Words and Lines
Range abbreviations:
– period (.) stands for any character (except newline)
– \d (digit) = [0-9]
– \s (whitespace character) = space (SP), tab (HT), carriage return (CR), newline (LF) or form feed (FF)
– \w (word character) = [0-9a-zA-Z_]
– uppercase versions, e.g. \D and \W, denote negation
Line-oriented metacharacters:
– caret (^) at the beginning of a regexp string matches the “beginning of a line”
– dollar sign ($) at the end of a regexp string matches the “end of the line”
Word-oriented metacharacters:
– a word is any sequence of digits [0-9], underscores (_) and letters [a-zA-Z]
– \b matches a word boundary: the position between a word character and a non-word character, e.g. at the beginning of a line or next to a whitespace character
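
The abbreviations and anchors above can be tried out with a short sketch like the one below. The test string and variable names are assumptions for illustration only.

```perl
use strict;
use warnings;

my $s = "Lecture 6 is on 9/13.";

my @digits = $s =~ /\d+/g;        # runs of digits: 6, 9, 13
my @words  = $s =~ /\b\w+\b/g;    # whole words: Lecture, 6, is, on, 9, 13

my $starts = ($s =~ /^Lecture/) ? 1 : 0;   # ^ anchors at the beginning of the line
my $ends   = ($s =~ /\.$/)      ? 1 : 0;   # $ anchors at the end of the line

# \b keeps "bar" from matching inside the longer word "barn".
my $in_barn = ("the barn"   =~ /\bbar\b/) ? 1 : 0;   # 0
my $in_bar  = ("the bar in" =~ /\bbar\b/) ? 1 : 0;   # 1

print "digits: @digits\n";
print "words:  @words\n";
print "starts=$starts ends=$ends in_barn=$in_barn in_bar=$in_bar\n";
```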

Homework
Sample:

    Really Juvenile Reynolds

    USA
    Today and the Washington Post lead with revelations from newly disclosed
    R.J. Reynolds internal documents that seem to show that the company has
    persistently attempted to market cigarettes to teens. This is also the top
    national story at the Los Angeles Times. The New York Times
    leads with the U.N. Security Council's vote telling Iraq to honor previous
    promises to allow U.N. inspectors complete access to suspected weapons
    sites.

    The new tobacco documents (many of them marked "Secret"), released as part
    of a lawsuit settlement, show a company strategy of attracting teenagers
    through advertising and various youth-oriented promotions such as, according to
    USAT, "NASCAR sponsorship," "inner city activities," and "T-shirts and
    other paraphernalia." And says USAT, the documents show that RJR's
    introduction of "Joe Camel" fits in to this strategy.

Theme: dealing with raw text
File: data/written_1/journal/slate/3/Article247_499.txt (ANC – American National Corpus: 100 million words)
Genre: journal (Slate Magazine article from 1998)

Homework
One of the first steps in processing raw text is to clean and mark it up (xml)
Task 1 438/538 (15pts)
– Write a Perl program that counts the number of paragraphs and sentences for Article247_499.txt (download from class webpage). See next slide for output format
– Discuss what the technical problems are with sentence boundary markup and describe your solution, e.g. what regular expressions you are going to use
– Submit your program and its output on Article247_499.txt

Homework Help
Useful code fragment
– use previously described template:
    open($txtfile,$ARGV[0]) or die "$ARGV[0] not found!\n";
    while ($line = <$txtfile>) {
        # do RE stuff with $line
    }
– Example: perl processfile.pl Article247_499.txt

Homework Help
<$txtfile> reads in a line of text including the newline (\n) character
– so lines are one character longer than you might think
The real world is messy
– Article247_499.txt is not quite uniform: sentences are split across lines, and it may contain extra whitespace and invisible characters you can’t see with a regular text editor
– The file Article247_499.txt you are given is actually not quite raw text
– I’ve pre-converted it to ASCII (UTF-8) for you to make life a bit easier
– Original was in UTF-16 (big-endian) with a nasty non-printable BOM (U+FEFF) and null characters
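
The point about the extra newline character can be checked with a small sketch. It reads from an in-memory string rather than the homework file so that it runs on its own. The sample data and variable names are assumptions for illustration.

```perl
use strict;
use warnings;

# Stand-in for the homework file: two lines, the second with trailing spaces.
my $data = "first line\nsecond line  \n";
open(my $fh, '<', \$data) or die "cannot open in-memory file: $!";

my @lengths;
while (my $line = <$fh>) {
    my $raw = length $line;      # this count includes the trailing "\n"
    chomp $line;                 # remove the newline
    $line =~ s/\s+$//;           # remove any other trailing whitespace
    push @lengths, [$raw, length $line];
    print "raw=$raw cleaned=", length($line), " <$line>\n";
}
close $fh;
```

chomp removes only the line ending, so the extra substitution is what takes care of stray trailing spaces.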

Homework Help
You will need to determine how you’re going to pattern match paragraph separators and ends of sentences.
[Figure labels: Input Delimiter]

Homework
Sample: the same excerpt from Article247_499.txt as on the earlier Homework slide
Note: Assume blank lines separate paragraphs
Output Format
    Paragraph 1: No. of sentences: 1
    Paragraph 2: No. of sentences: 3
    Paragraph 3: No. of sentences: 3
    etc.
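
As a rough illustration of the output format only, the sketch below counts sentences per paragraph in a short made-up text. It is deliberately naive and not a homework solution: it treats every ., ! or ? followed by whitespace as a sentence end, so abbreviations such as "R.J." or "U.N." in the real article would be miscounted. The sample text and all variable names are assumptions.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the article; blank lines separate paragraphs.
my $text = "A Title\n\nFirst sentence here. Second one\nspans two lines. Third!\n\nOnly one here.\n";

# Split into paragraphs on blank lines (which may contain stray whitespace).
my @paragraphs = grep { /\S/ } split /\n\s*\n/, $text;

my @counts;
for my $i (0 .. $#paragraphs) {
    (my $p = $paragraphs[$i]) =~ s/\s+/ /g;   # rejoin sentences split across lines
    # Naive rule: . ! or ? followed by whitespace or the end of the paragraph.
    my $n = () = $p =~ /[.!?](?=\s|$)/g;
    $n = 1 if $n == 0;                        # e.g. a title with no final punctuation
    push @counts, $n;
    printf "Paragraph %d: No. of sentences: %d\n", $i + 1, $n;
}
```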

Homework
Task 2 438/538 (15pts)
– Modify your Perl program to produce xml paragraph and sentence boundary markup for Article247_499.txt
– i.e. it produces reformatted raw text as sentence 1 sentence 2 …
– Each.. should occupy exactly one line of your output.
– Leading and trailing spaces of a sentence should be deleted, e.g. The new tobacco … vs. The new tobacco …
– Submit your program and its output on Article247_499.txt
(Cut and paste everything from both tasks into one file for submission)
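
The trimming requirement can be sketched as follows. The tag names <p> and <s> are hypothetical, since the transcript does not show the markup the course expects, and the helper mark_sentence and the sample sentences are likewise assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical tag names: the actual markup is not shown in the transcript.
sub mark_sentence {
    my ($s) = @_;
    $s =~ s/^\s+//;      # delete leading whitespace
    $s =~ s/\s+$//;      # delete trailing whitespace (including a newline)
    return "<s>$s</s>";
}

my @sentences = ("  The new tobacco documents show a strategy. ", "Another one.\n");

print "<p>\n";
print mark_sentence($_), "\n" for @sentences;   # one marked-up sentence per line
print "</p>\n";
```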