LING/C SC/PSYC 438/538 Lecture 12 Sandiway Fong.


1 LING/C SC/PSYC 438/538 Lecture 12 Sandiway Fong

2 Administrivia
Homework 9
Perl regex
Python re
  import re
  slightly complicated string handling: use raw g/3/library/re.html
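The note about "slightly complicated string handling" presumably refers to Python raw strings (an assumption, since the slide text is cut off). A minimal sketch of why the r prefix matters for regex patterns; the sample text is made up:

```python
import re

text = "foo food foo"

# Without the r prefix, "\b" is the backspace character (ASCII 8),
# so this pattern looks for backspace + "foo" + backspace.
plain_hits = re.findall("\bfoo\b", text)

# With a raw string the backslash reaches the regex engine,
# where \b means "word boundary".
raw_hits = re.findall(r"\bfoo\b", text)

print(plain_hits)  # []
print(raw_hits)    # ['foo', 'foo']
```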

3 File I/O Summary
Common:
  open
  close
  filehandle (concept comes from the underlying OS)
Perl:
  streams: STDIN STDOUT STDERR
  <filehandle> (context: reads a line or the whole file)
  print filehandle String
Python:
  streams: sys.stdin sys.stdout sys.stderr
  .read() .readline() .readlines() (methods)
  .write(String) (no newline)
  print(*objects, sep=' ', end='\n', file=sys.stdout, flush=False) (function)
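The Python half of the summary can be tried out with a short sketch like the one below. The scratch file name demo.txt is hypothetical, and the file is created in a temporary directory so nothing else is touched:

```python
import os
import sys
import tempfile

# Write two lines to a scratch file (hypothetical name: demo.txt).
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w") as fh:
    fh.write("first line\n")        # .write() adds no newline of its own
    print("second line", file=fh)   # print() appends end='\n' by default

with open(path) as fh:
    whole = fh.read()               # entire file as one string

with open(path) as fh:
    first = fh.readline()           # one line, newline included
    rest = fh.readlines()           # remaining lines as a list

print(repr(whole), file=sys.stdout)
print("done", file=sys.stderr)      # the standard streams are file objects too
```

The with statement closes the file when the block ends, which takes the place of an explicit close.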

4 Regular Expressions to the rescue

5 Regular Expressions from Hell
validation: RFC 5322: (?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~- ]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01- 9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1- 9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9- ]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01- \x09\x0b\x0c\x0e-\x7f])+)\])

6 Homework 9
File: hw9.txt
56 lines
Contents: each line has 3 fields
  name of state or US territory (in alphabetical order)
  population
  area (sq. miles)
fields are separated by a tab (\t)
Source: Wikipedia

7 Homework 9 Question 1
Using Perl
  supply the file hw9.txt on the command line
  DO NOT MODIFY hw9.txt
  read the file
  use regex to extract the information
  create hash table(s) indexed by name containing population and land area
  Print a table of states/territories inversely ranked by land area
  Print a table of states/territories ranked by population (i.e. 1st is highest population)
  compute the density (population per sq. mile)
  Print a table of states/territories ranked by density (i.e. 1st is highest density)

8 Homework 9 Question 1
Hints:
  note that some state/territory names consist of more than one word
  note that numeric values may have commas
  read about split
  read about tr: $num =~ tr/,//d deletes the pesky commas in $num
  revisit sort parameters:
  if you need to trim whitespace from the ends: $line =~ s/^\s+|\s+$//g;
  for nicely-formatted lists, read about printf FORMAT
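The hints about sort parameters and printf have close Python counterparts. The sketch below ranks a small dictionary by value and prints fixed-width columns. The names and numbers are made up for illustration and do not come from hw9.txt:

```python
# Made-up values for illustration only (not from hw9.txt).
area = {"Alpha": 1200, "Beta Island": 35, "Gamma": 98000}

# Rank names by value, largest first: sorted() takes a key function
# and a reverse flag, much as Perl's sort takes a comparison block.
ranked = sorted(area, key=area.get, reverse=True)

# Fixed-width columns: left-justify the name, right-justify the number
# with thousands separators (similar in spirit to a printf format).
rows = [f"{rank:>2} {name:<12} {area[name]:>8,}"
        for rank, name in enumerate(ranked, start=1)]
print("\n".join(rows))
```

Dropping reverse=True gives the opposite ordering, smallest value first.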

9 Homework 9: Question 2
538 only (optional for 438):
  Do the same exercise as Question 1 in Python3 using a dictionary or dictionaries
  In your opinion, which code is simpler?
These may prove useful:
  str.strip()
  str.replace()
  str.split()
  sys.argv
  int()
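As a rough illustration of how the listed string methods fit together, the sketch below parses a single tab-separated record. The helper name parse_line and the sample record are hypothetical; in the actual exercise the file name would come from sys.argv[1]:

```python
def parse_line(line):
    """Split one tab-separated record into (name, population, area)."""
    name, population, area = line.strip().split("\t")
    # Remove thousands separators before converting to int.
    return name, int(population.replace(",", "")), int(area.replace(",", ""))

# Made-up record for illustration; the real data lives in hw9.txt.
record = "Example Territory\t1,234,567\t8,910\n"
name, population, area = parse_line(record)
print(name, population, area, population / area)
```

Splitting on "\t" rather than on whitespace keeps multi-word names intact.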

10 Homework 9
Usual submission rule: ONE PDF file
Submit code/run/comments
subject heading: 438/538 Homework 4 Your Name
Due date by midnight of next Monday (review in class on Tuesday)

11 regex Read textbook chapter 2: section 1 on Regular Expressions

12 Perl regex Read up on the syntax of Perl regular expressions
Online tutorials

13 Perl regex
Perl regex matching:
  $s =~ /foo/ (/…/ contains a regex)
  can use in a conditional: e.g. if ($s =~ /foo/) …
  evaluates to true/false depending on what’s in $s
  can also use as a statement: e.g. $s =~ /foo/;
  global variable $& contains the match
Perl regex match and substitute:
  $s =~ s/foo/bar/
  s/…match…/…substitute…/ contains two expressions
  will modify $s by looking for a single occurrence of match and replacing that with substitute
  s/…match…/…substitute…/g global substitution
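For comparison, here is a sketch of the same operations with Python's re module. The sample string is made up. One difference worth noting: Python strings are immutable, so re.sub returns a new string instead of modifying s in place.

```python
import re

s = "foo fighters eat food"

# Matching: re.search returns a match object (truthy) or None (falsy).
m = re.search(r"foo", s)
matched = m.group(0) if m else None   # the matched text, like Perl's $&

# Substitute only the first occurrence (like s/foo/bar/).
once = re.sub(r"foo", "bar", s, count=1)

# Substitute every occurrence (like s/foo/bar/g); this is re.sub's default.
everywhere = re.sub(r"foo", "bar", s)

print(matched, once, everywhere, sep="\n")
```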

14 Perl regex
Most useful with the code template for reading in a file line-by-line:
open($fh, $ARGV[0]) or die "$ARGV[0] not found!\n";
while ($line = <$fh>) {
    # do RE stuff with $line
}
close($fh);
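A Python counterpart of this template might look like the sketch below. The function name matching_lines is hypothetical, and "RE stuff" is reduced here to keeping the lines that match a pattern:

```python
import re

def matching_lines(path, pattern):
    """Return the lines of the file at `path` that contain `pattern`."""
    regex = re.compile(pattern)
    hits = []
    with open(path) as fh:            # closed automatically at block exit
        for line in fh:               # iterates one line at a time
            if regex.search(line):    # do RE stuff with line
                hits.append(line.rstrip("\n"))
    return hits
```

From a script, the file name would typically come from the command line, e.g. matching_lines(sys.argv[1], r"ing\b") after import sys.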

15 Chapter 2: JM spaces matter! character class: Perl lingo

16 Chapter 2: JM
range: in ASCII table
backslash + lowercase letter for a class (e.g. \d, \w, \s)
uppercase variant for everything but that class (e.g. \D, \W, \S)

17 Chapter 2: JM

18 Chapter 2: JM Can use (…) if > 1 char Sheeptalk

19 Perl regex
\S+ing\b
  \s is a whitespace, so \S is a non-whitespace
  + is repetition (1 or more)
  \b is a word boundary (words are made up of \w characters)
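The pattern \S+ing\b behaves the same way in Python's re module, so it can be tried out quickly there. The sentence below is made up:

```python
import re

text = "I was singing and dancing, then sang a song."

# \S+ needs at least one non-whitespace character before "ing",
# and \b requires the "g" to end a word.
ing_words = re.findall(r"\S+ing\b", text)
print(ing_words)  # ['singing', 'dancing']
```

The comma after "dancing" is not a \w character, so the word boundary falls between "g" and ",".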

20 Perl regex global variables \b or \b{wb}
other boundary metacharacters: ^ (beginning of line), $ (end of line)

21 Perl regex: Unicode and \b
\b{wb} Note: global match in while-loop Note: .*? is the non-greedy version of .*
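The difference between .* and .*? is easy to see on a small made-up string. The sketch uses Python's re module, where re.finditer plays roughly the role of a global match in a while-loop:

```python
import re

s = "<b>bold</b> and <i>italic</i>"

greedy = re.findall(r"<.*>", s)    # .* grabs as much as it can
lazy = re.findall(r"<.*?>", s)     # .*? grabs as little as it can

# Visit each match in turn, together with its start position.
spans = [(m.group(0), m.start()) for m in re.finditer(r"<.*?>", s)]

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
print(spans)
```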

22 Perl regex: Unicode and \w
\w is [0-9A-Za-z_]
Definition is expanded for Unicode:
use utf8;                      # pragma
use open qw(:std :utf8);       # pragma
my $str = "school école École šola trường स्कूल škole โรงเรียน";
@words = ($str =~ /(\w+)/g);   # list context
foreach $word (@words) { print "$word\n" }
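Python's re module also expands \w for Unicode text by default, and the re.ASCII flag restores the narrow [a-zA-Z0-9_] definition. A small sketch with two of the words from the slide, written with escapes to avoid encoding problems. Python's notion of a word character is not guaranteed to coincide with Perl's for every script, so treat this as an illustration only:

```python
import re

text = "\u00e9cole \u0161ola"   # "école šola"

unicode_words = re.findall(r"\w+", text)            # default for str patterns
ascii_words = re.findall(r"\w+", text, re.ASCII)    # \w limited to [a-zA-Z0-9_]

print(unicode_words)  # ['école', 'šola']
print(ascii_words)    # ['cole', 'ola']
```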

23 Chapter 2: JM
Why?
  * means zero or more repetitions of the previous char/expr
  . means any single character
  ? means previous char/expr is optional

24 Chapter 2: JM
Precedence of operators
Example: Column 1 Column 2 Column 3 …
  /Column [0-9]+ */
  /(Column [0-9]+ *)*/
  /house(cat(s|)|)/ (| = disjunction; ? = optional)
Perl: in a regular expression, the text matched by the pattern within a pair of parentheses is stored in global variables $1 (and $2 and so on).
(?: … ) group but exclude from storage
Precedence Hierarchy: space
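The three patterns can be explored with Python's re module, where m.group(1) corresponds to Perl's $1. The variable names are hypothetical:

```python
import re

s = "Column 1 Column 2 Column 3"

# Without parentheses, * applies only to the preceding space.
one = re.match(r"Column [0-9]+ *", s).group(0)

# With parentheses, * repeats the whole group; the stored group
# holds the text from the last repetition.
many = re.match(r"(Column [0-9]+ *)*", s)
whole, last = many.group(0), many.group(1)

# (?: ... ) groups without storing anything.
no_store = re.match(r"(?:Column [0-9]+ *)*", s).groups()

# Nested optional parts: an empty alternative makes a group optional.
cats = {w: re.fullmatch(r"house(cat(s|)|)", w).groups()
        for w in ["house", "housecat", "housecats"]}

print(repr(one), repr(last), no_store, cats)
```

A group that takes no part in the match, such as the inner (s|) for "house", is reported as None.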

25 Online regex tester

26 Perl regex
returns 1 (true) or "" (empty if false)
A shortcut: list context for matching returns a list
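Python draws the same distinction in a slightly different way: re.search returns a match object or None, and the captured pieces come back as a tuple or list. A sketch with a made-up string:

```python
import re

s = "pages 12-34"

found = re.search(r"(\d+)-(\d+)", s)   # match object, truthy
missing = re.search(r"xyz", s)         # None, falsy

# Roughly like list-context matching: unpack the stored groups.
start, end = found.groups()

# All matches of a pattern at once, as a list of strings.
all_numbers = re.findall(r"\d+", s)

print(bool(found), missing, start, end, all_numbers)
```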

