REGULAR EXPRESSIONS 4 DAY 9 - 9/15/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University.


1 REGULAR EXPRESSIONS 4 DAY 9 - 9/15/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University

2 Course organization

- http://www.tulane.edu/~howard/LING3820/
- The syllabus is under construction.
- http://www.tulane.edu/~howard/CompCultEN/

15-Sept-2014, NLP, Prof. Howard, Tulane University

3 Review

The quiz was the review.

4 4.3.4. Summary table

meta-character  matches              name          notes
a|b             a or b               disjunction
(ab)            a and b              grouping      re.findall() only outputs what is in (); (?:ab) groups without capturing, so the rest of the pattern is output too
[ab]            a or b               range         [a-z] lowercase, [A-Z] uppercase, [0-9] digits
[^a]            all but a            negation
a{m,n}          from m to n of a     repetition    a{n} exactly n of a
^a              a at start of S
a$              a at end of S
a+              one or more of a                   a+? lazy +
a*              zero or more of a    Kleene star   a*? lazy *
a?              with or without a    optionality   a?? lazy ?
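Several rows of the summary table can be tried out directly. What follows is a minimal sketch, assuming Python's standard re module; the short test strings and variable names are made up for illustration and do not come from the course materials.

```python
import re

# Disjunction: a|b matches either alternative.
disjunction = re.findall(r'a|b', 'cab')

# Grouping: findall() returns only what the parentheses capture ...
captured = re.findall(r'(t)h', 'the then')
# ... unless the group is non-capturing, (?:...), in which case
# the whole match is returned.
uncaptured = re.findall(r'(?:t)h', 'the then')

# Repetition: a{2,3} matches two or three a's in a row.
repetition = re.findall(r'a{2,3}', 'a aa aaa aaaa')

# A space inside the braces turns the whole thing into literal text.
literal_braces = re.findall(r'a{2, 3}', 'aa a{2, 3}')

# Greedy + takes as much as it can; lazy +? takes as little as it can.
greedy = re.findall(r'<.+>', '<a><b>')
lazy = re.findall(r'<.+?>', '<a><b>')

print(disjunction)     # ['a', 'b']
print(captured)        # ['t', 't']
print(uncaptured)      # ['th', 'th']
print(repetition)      # ['aa', 'aaa', 'aaa']
print(literal_braces)  # ['a{2, 3}']
print(greedy)          # ['<a><b>']
print(lazy)            # ['<a>', '<b>']
```

The literal_braces line is the reason to write a{m,n} without a space after the comma.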

5 §4. Regular expressions 4

There is a bit more to say.

6 Open Spyder

7 Sample string

>>> import re
>>> S = '''This above all: to thine own self be true,
... And it must follow, as the night the day,
... Thou canst not then be false to any man.'''

8 4.4. Character classes

class  abbreviates       name                 notes
\w     [a-zA-Z0-9_]      alphanumeric         it's really alphanumeric and underscore, but we are lazy
\W     [^a-zA-Z0-9_]     not alphanumeric
\d     [0-9]             digit
\D     [^0-9]            not a digit
\s     [ \t\v\n\r\f]     whitespace
\S     [^ \t\v\n\r\f]    not whitespace
\t                       horizontal tab
\v                       vertical tab
\n                       newline
\r                       carriage return
\f                       form-feed
\b                       word boundary
\B                       not a word boundary
\A     ^                 start of string
\Z     $                 end of string
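A few of these classes can be checked against the sample string. This is only a sketch: the string 'LING 3820' and the variable names are chosen here for illustration. It also shows one point the table glosses over: \A and ^ coincide by default, but with the re.MULTILINE flag ^ matches at the start of every line while \A still matches only at the start of the string.

```python
import re

S = '''This above all: to thine own self be true,
And it must follow, as the night the day,
Thou canst not then be false to any man.'''

# \d picks out digits, \s picks out whitespace.
digits = re.findall(r'\d', 'LING 3820')
spaces = re.findall(r'\s', 'a b\tc\n')

# \b marks a word edge: words that begin with lowercase 'th'.
th_words = re.findall(r'\bth\w*', S)

# By default ^ and \A both match only at the start of the string.
first_default = re.findall(r'^\w+', S)

# With re.MULTILINE, ^ matches at the start of each line, \A does not.
line_starts = re.findall(r'^\w+', S, re.MULTILINE)
string_start = re.findall(r'\A\w+', S, re.MULTILINE)

print(digits)         # ['3', '8', '2', '0']
print(spaces)         # [' ', '\t', '\n']
print(th_words)       # ['thine', 'the', 'the', 'then']
print(first_default)  # ['This']
print(line_starts)    # ['This', 'And', 'Thou']
print(string_start)   # ['This']
```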

9 4.4.2. Raw string notation with r''

Python interprets a regular expression just like any other string. This can lead to unexpected results with class meta-characters, because the backslash that they incorporate is also used by Python for its own escape sequences.

For instance, we just met the class meta-character \b, which marks the edge of a word. It will be extremely useful for us, but it happens to overlap with Python's own escape sequence for the backspace character, \b.

10 Raw text

The way to resolve this ambiguity is to prefix an r to a regular expression. The r marks the regular expression as a raw string, so Python does not process its backslashes as escape sequences. A search for two-letter words is written with the raw string notation below:

>>> re.findall(r'\b\w\w\b', S)
['to', 'be', 'it', 'as', 'be', 'to']
>>> re.findall(r'\b\w{2}\b', S)
['to', 'be', 'it', 'as', 'be', 'to']
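To see what the r prefix actually changes, the sketch below compares the same two-letter-word pattern written with and without it. The variable names are illustrative. In the version without r, the other backslashes are doubled by hand so that only \b is left for Python to reinterpret.

```python
import re

S = '''This above all: to thine own self be true,
And it must follow, as the night the day,
Thou canst not then be false to any man.'''

# Without r, Python turns \b into a single backspace character.
backspace = '\b'
# With r, the backslash and the b are kept as two characters.
raw = r'\b'

# Here \b has become a backspace, so the pattern looks for
# backspace + two word characters + backspace and finds nothing.
without_raw = re.findall('\b\\w\\w\b', S)

# Doubling every backslash works, but is harder to read ...
doubled = re.findall('\\b\\w\\w\\b', S)
# ... than the equivalent raw string.
with_raw = re.findall(r'\b\w\w\b', S)

print(len(backspace), len(raw))  # 1 2
print(without_raw)               # []
print(doubled == with_raw)       # True
```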

11 More raw text

As a further illustration, what do you think are the non-alphanumeric characters in the Shakespeare text?

>>> re.findall(r'\W', S)
[' ', ' ', ':', ' ', ' ', ' ', ' ', ' ', ' ', ',', '\n', ' ', ' ', ' ', ',', ' ', ' ', ' ', ' ', ' ', ',', '\n', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '.']
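A long list of single characters is hard to read by eye. One way to summarize it, offered here as a sketch and not as part of the course text, is to tally the matches with collections.Counter from the standard library.

```python
import re
from collections import Counter

S = '''This above all: to thine own self be true,
And it must follow, as the night the day,
Thou canst not then be false to any man.'''

# Every character that is not a letter, digit or underscore.
non_alnum = re.findall(r'\W', S)

# Tally how often each one occurs.
counts = Counter(non_alnum)

print(len(non_alnum))  # 31
print(counts[' '])     # 24
print(counts[','])     # 3
print(counts['\n'])    # 2
```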

12 Practice

- 4.3.5. Further practice of variable-length matching
- 4.6. Further practice
- Practice with answers on a different page

13 §5. Lists 1

There is a bit more to say.

14 Introduction

In working with re.findall(), you have seen many instances of a collection of strings held within square brackets, such as the one below:

>>> S = '''This above all: to thine own self be true,
... And it must follow, as the night the day,
... Thou canst not then be false to any man.'''
>>> re.findall(r'\b[a-zA-Z]{4}\b', S)
['This', 'self', 'true', 'must', 'Thou', 'then']

15 Definition of list

A list in Python is a sequence of objects delimited by square brackets, []. The objects are separated by commas. Consider this sentence from Shakespeare's A Midsummer Night's Dream represented as a list:

>>> L = ['Love', 'looks', 'not', 'with', 'the', 'eyes', ',', 'but', 'with', 'the', 'mind', '.']
>>> type(L)
>>> type(L[0])

L is a list of strings. You may think that a string is also a list of characters, and you would be correct for ordinary English, but in Pythonic English, the word 'list' refers exclusively to a sequence of objects delimited by square brackets.
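The exact text that type() prints differs between Python versions, so the sketch below checks the same facts with isinstance() instead. The variable name as_list is introduced here for illustration.

```python
L = ['Love', 'looks', 'not', 'with', 'the', 'eyes', ',',
     'but', 'with', 'the', 'mind', '.']

# L is a list, and each of its items is a string.
print(isinstance(L, list))     # True
print(isinstance(L[0], str))   # True

# A string is a sequence of characters, but it is not a list.
print(isinstance('Love', list))  # False

# list() converts a string into a genuine list of characters.
as_list = list('Love')
print(as_list)                 # ['L', 'o', 'v', 'e']
```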

16 An example with numerical objects

>>> i = 2
>>> type(i)
>>> I = [0,1,i,3]
>>> type(I)
>>> type(I[0])
>>> n = 2.3
>>> type(n)
>>> N = [2.0,2.1,2.2,n]
>>> type(N)
>>> type(N[0])
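The same session can be run as a script. As a sketch, the version below collects type names with type(x).__name__, which gives the bare name of the type whatever the Python version; the variable names int_types and float_types are assumptions made for this example.

```python
i = 2
I = [0, 1, i, 3]
n = 2.3
N = [2.0, 2.1, 2.2, n]

# The container is a list; what it contains keeps its own type.
int_types = [type(i).__name__, type(I).__name__, type(I[0]).__name__]
float_types = [type(n).__name__, type(N).__name__, type(N[0]).__name__]

print(int_types)    # ['int', 'list', 'int']
print(float_types)  # ['float', 'list', 'float']
```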

17 Most of the string methods work just as well on lists

>>> len(L)
>>> sorted(L)
>>> set(L)
>>> sorted(set(L))
>>> len(sorted(set(L)))
>>> L+'!'   # raises TypeError: a list can only be concatenated with another list, as in L+['!']
>>> len(L+'!')   # raises TypeError for the same reason
>>> L*2
>>> len(L*2)
>>> L.count('the')
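The operations above can be run as a script. This is a sketch with illustrative variable names; concatenation is written with a one-item list, ['!'], because adding a bare string to a list raises a TypeError.

```python
L = ['Love', 'looks', 'not', 'with', 'the', 'eyes', ',',
     'but', 'with', 'the', 'mind', '.']

length = len(L)
# set() drops the repeated 'with' and 'the'; sorted() puts
# punctuation and capitals before lowercase letters.
vocabulary = sorted(set(L))

# Concatenation needs a list on both sides of +.
extended = L + ['!']
doubled = L * 2

print(length)           # 12
print(vocabulary)       # [',', '.', 'Love', 'but', 'eyes', 'looks', 'mind', 'not', 'the', 'with']
print(len(vocabulary))  # 10
print(len(extended))    # 13
print(len(doubled))     # 24
print(L.count('the'))   # 2
```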

18 String methods work on lists, cont.

>>> L.count('Love')
>>> L.count('love')
>>> L.index('with')
>>> L.rindex('with')   # raises AttributeError: lists have index() but no rindex()
>>> L[2:]
>>> L[:2]
>>> L[-2:]
>>> L[:-2]
>>> L[2:-2]
>>> L[-2:2]
>>> L[:]
>>> L[:-1]+['!']
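The counting, indexing and slicing examples can be checked in a script as well. Since lists have no rindex() method, the sketch below finds the last 'with' by searching a reversed copy of the list; this workaround and the variable names are suggestions made here, not part of the slides.

```python
L = ['Love', 'looks', 'not', 'with', 'the', 'eyes', ',',
     'but', 'with', 'the', 'mind', '.']

# count() is case-sensitive.
print(L.count('Love'), L.count('love'))  # 1 0

# index() gives the position of the first match.
first_with = L.index('with')
# Last match: search the reversed list and convert the position back.
last_with = len(L) - 1 - L[::-1].index('with')
print(first_with, last_with)             # 3 8

# Slices work as they do on strings.
head = L[:2]
tail = L[-2:]
empty = L[-2:2]   # the start lies after the end, so nothing is selected
swapped = L[:-1] + ['!']

print(head)         # ['Love', 'looks']
print(tail)         # ['mind', '.']
print(empty)        # []
print(swapped[-2:]) # ['mind', '!']
```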

19 Q1

- MIN 5.0
- AVG 9.5
- MAX 10.0

20 Next time

More on lists

