REGULAR EXPRESSIONS 3 DAY 8 - 9/12/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University.


1 REGULAR EXPRESSIONS 3 DAY 8 - 9/12/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University

2 Course organization
12-Sept-2014 NLP, Prof. Howard, Tulane University
• http://www.tulane.edu/~howard/LING3820/
• The syllabus is under construction.
• http://www.tulane.edu/~howard/CompCultEN/

3 Review
• How did the homework go?

4 Open Spyder
>>> import re

5 4.3. Variable-length matching
§4. Regular expressions 2

6 4.3.1. Match an unknown number of characters with + and *
• Imagine that you are given string S5 and told to match the preposition at the end of each word:
>>> S5 = 'break breakup breakout breakdown breakthrough '
>>> re.findall('break([a-z]+) ', S5)
['up', 'out', 'down', 'through']
>>> re.findall('break([a-z]*) ', S5)
['', 'up', 'out', 'down', 'through']
• The plus requires at least one letter after break, so the bare word break is skipped. The star also accepts zero letters, so break matches and contributes the empty string.
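The patterns above rely on the trailing space in S5 to mark the end of each word. As a small sketch of an alternative, the word-boundary anchor \b does the same job without that space. The string S5b and the use of \b are assumptions added for illustration; they are not part of the original slide.

```python
import re

# Hypothetical variant of S5 with no trailing space.
S5b = 'break breakup breakout'

# + needs at least one letter after 'break', so bare 'break' is skipped.
plus_matches = re.findall(r'\bbreak([a-z]+)\b', S5b)

# * also accepts zero letters, so bare 'break' yields an empty string.
star_matches = re.findall(r'\bbreak([a-z]*)\b', S5b)

print(plus_matches)  # ['up', 'out']
print(star_matches)  # ['', 'up', 'out']
```

The raw-string prefix r'...' keeps Python from reading \b as a backspace character.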

7 4.3.2. Match optional characters with ?
• A similar sort of problem is to match the singular and plural forms of fish in S6:
>>> S6 = 'fish fishes fishy fisher '
>>> re.findall('fish(?:es)? ', S6)
['fish ', 'fishes ']
>>> re.findall('fish |fishes ', S6)
['fish ', 'fishes ']
• But the second formulation doesn't encode the asymmetry between the two words. Fishes is a variant of fish. That is to say, | over-fits the data.
• Optionality is equivalent to matching zero or one instances of the string in question, so it can be mimicked with curly brackets:
>>> re.findall('fish(?:es){0,1} ', S6)
['fish ', 'fishes ']
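A short sketch of why the optional group is written (?:es) and not (es): when a pattern contains a capturing group, re.findall returns only the group's contents. The variable names below are chosen for illustration.

```python
import re

S6 = 'fish fishes fishy fisher '

# Non-capturing group: findall returns the whole match.
non_capturing = re.findall(r'fish(?:es)? ', S6)

# Capturing group: findall returns only what the group matched,
# which is the empty string when the optional group is skipped.
capturing = re.findall(r'fish(es)? ', S6)

# {0,1} is the curly-bracket spelling of ?.
braces = re.findall(r'fish(?:es){0,1} ', S6)

print(non_capturing)            # ['fish ', 'fishes ']
print(capturing)                # ['', 'es']
print(braces == non_capturing)  # True
```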

8 Match optional characters with ?, cont.
• Optionality provides a means to do a rough morphological matching:
>>> S6 = 'fish fishes fishy fisher fishers '
>>> re.findall('fish(?:er)?(?:s)? ', S6)
['fish ', 'fisher ', 'fishers ']
• Restating the optional substrings as a choice among entire words obfuscates the linguistic generalization that er and s are suffixes to nouns:
>>> re.findall('fish |fisher |fishers ', S6)
['fish ', 'fisher ', 'fishers ']
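One way to make the stem-plus-suffixes generalization explicit is to build the pattern from its parts. This is only a sketch: the helper optional_suffix_pattern is a hypothetical name introduced here, not something from the course materials.

```python
import re

def optional_suffix_pattern(stem, suffixes):
    """Build 'stem(?:suf1)?(?:suf2)? ' with each suffix optional, in order."""
    optional = ''.join('(?:%s)?' % re.escape(s) for s in suffixes)
    # The trailing space marks the end of the word, as in the slides.
    return re.escape(stem) + optional + ' '

S6 = 'fish fishes fishy fisher fishers '
pattern = optional_suffix_pattern('fish', ['er', 's'])
matches = re.findall(pattern, S6)

print(pattern)  # fish(?:er)?(?:s)? 
print(matches)  # ['fish ', 'fisher ', 'fishers ']
```

Note that this rough pattern would also accept the non-word fishs, since each suffix is optional independently of the other.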

9 4.3.3. +, * and ? match lazily with ?
• You wouldn't want to trust the plus, star and question meta-characters with your wallet, since they take as much as they can get. Consider the problem of matching one of a sequence of quotes:
>>> S7 = "'Cool' never goes out of style, but 'gnarly' does."
>>> re.findall("'.*'", S7)
["'Cool' never goes out of style, but 'gnarly'"]
• This is because * (and + and ?) are greedy: they match the largest number of characters that they can. So even though you may see 'Cool' and 'gnarly' as the only accurate matches, what * actually sees is everything from the first quotation mark to the last one, 'Cool' never goes out of style, but 'gnarly'.
• If such greedy matching is not desired, it can be turned off by suffixing the three meta-characters with a question mark:
>>> re.findall("'.*?'", S7)
["'Cool'", "'gnarly'"]
• Turning off the greediness of ? means typing two question marks, ??, which I have not found a good example of yet.
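The sketch below extends the quote example in two directions that the slide does not cover, so treat it as one possible illustration. It shows lazy +, a negated character class as an alternative to laziness, and a contrived case where ?? behaves differently from ?. The string 'aaa' and the two-group patterns are invented for the purpose.

```python
import re

S7 = "'Cool' never goes out of style, but 'gnarly' does."

# Lazy + stops at the first closing quote, just like lazy *.
lazy_plus = re.findall(r"'.+?'", S7)

# Alternative without laziness: forbid quotes inside the match.
negated = re.findall(r"'[^']*'", S7)

# Greedy ? takes one 'a' when it can; lazy ?? takes none and
# leaves everything to the following group.
greedy_q = re.match(r'(a?)(a*)', 'aaa').groups()
lazy_q = re.match(r'(a??)(a*)', 'aaa').groups()

print(lazy_plus)  # ["'Cool'", "'gnarly'"]
print(negated)    # ["'Cool'", "'gnarly'"]
print(greedy_q)   # ('a', 'aa')
print(lazy_q)     # ('', 'aaa')
```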

10 4.3.4. Summary table
meta-character   matches             name          notes
a|b              a or b              disjunction
(ab)             a and b             grouping      only outputs what is in (); (?:ab) for rest of pattern
[ab]             a or b              range         [a-z] lowercase, [A-Z] uppercase, [0-9] digits
[^a]             all but a           negation
a{m,n}           from m to n of a    repetition    a{n} a number n of a; no space after the comma
^a               a at start of S
a$               a at end of S
a+               one or more of a                  a+? lazy +
a*               zero or more of a   Kleene star   a*? lazy *
a?               with or without a   optionality   a?? lazy ?
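As a quick self-check of the table, the sketch below runs one small re.findall call per row. The test strings are invented for illustration; each tuple holds a pattern, a string, and the result Python's re module is expected to give.

```python
import re

# (pattern, string, expected result of re.findall)
rows = [
    (r'a|b',     'cab',       ['a', 'b']),          # disjunction
    (r'(ab)c',   'abc',       ['ab']),              # only the group is output
    (r'(?:ab)c', 'abc',       ['abc']),             # non-capturing group
    (r'[ab]',    'cab',       ['a', 'b']),          # range
    (r'[^a]',    'cab',       ['c', 'b']),          # negation
    (r'a{2,3}',  'a aa aaaa', ['aa', 'aaa']),       # repetition
    (r'a{2, 3}', 'a aa aaaa', []),                  # space makes the braces literal
    (r'^a',      'aba',       ['a']),               # start of string
    (r'a$',      'aba',       ['a']),               # end of string
    (r'a+',      'baab',      ['aa']),              # one or more, greedy
    (r'a+?',     'baab',      ['a', 'a']),          # one or more, lazy
    (r'ba*',     'b ba baa',  ['b', 'ba', 'baa']),  # zero or more
    (r'ba?',     'b ba baa',  ['b', 'ba', 'ba']),   # optional, greedy
    (r'ba??',    'b ba baa',  ['b', 'b', 'b']),     # optional, lazy
]

results = [re.findall(pattern, text) for pattern, text, _ in rows]
all_ok = all(got == expected for got, (_, _, expected) in zip(results, rows))
print(all_ok)
```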

11 Practice with answers on a different page
• 4.3.5. Further practice of variable-length matching

12 Next time
• Q2 to be emailed to you and due in class on Monday, on material since last quiz, regular expressions, up to and including 4.3.4. Summary table.
• Finish regular expressions, maybe start lists.

