REGULAR EXPRESSIONS 2 DAY 7 - 9/10/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University.


1 REGULAR EXPRESSIONS 2 DAY 7 - 9/10/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University

2 Course organization
http://www.tulane.edu/~howard/LING3820/
The syllabus is under construction.
http://www.tulane.edu/~howard/CompCultEN/
10-Sept-2014, NLP, Prof. Howard, Tulane University

3 Regular expressions: Review

4 Regular expressions
re.findall(' be ', S)
Regex metacharacters: |   ()   [] and [^]   {}   .
` to | be | it | as `
` (to|be|it|as) `
` ([a-z][a-z]) `
` ([a-z]{2}) `
` (..) `
` (.{2}) `
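A short sketch of how the six review patterns behave. The sample string S below is an assumption made for illustration, since the slides do not show the value of S used in class. It was chosen so that no two two-character words are adjacent, because each pattern consumes the space on both sides of a match.

```python
import re

# Hypothetical sample string; the S used in class is not shown on the slides.
S = 'She has to leave when it rains, as trains at noon reach 19 stops.'

# Plain disjunction: the whole match is returned, spaces included.
spaced = re.findall(' to | be | it | as ', S)
print(spaced)        # [' to ', ' it ', ' as ']

# Grouping with (): only the text captured inside the group is returned.
listed = re.findall(' (to|be|it|as) ', S)
print(listed)        # ['to', 'it', 'as']

# A character range written twice, or once with {2}: same result.
letters_a = re.findall(' ([a-z][a-z]) ', S)
letters_b = re.findall(' ([a-z]{2}) ', S)
print(letters_a)     # ['to', 'it', 'as', 'at']

# The dot matches any character, so digits get through as well.
any_a = re.findall(' (..) ', S)
any_b = re.findall(' (.{2}) ', S)
print(any_a)         # ['to', 'it', 'as', 'at', '19']
```

The two spellings in each pair are interchangeable here; {2} is simply shorthand for writing the preceding item twice.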

5 Open Spyder
import re

6 4.2.7. Will the best regex please stand up?
§4. Regular expressions 2

7 4.2.7.1. Under-fitting vs. over-fitting
This challenge of finding the regular expression that is just right may remind you of the story of Goldilocks and the three bears, in which Goldilocks tried to find the bowl of porridge that was neither too hot nor too cold.
Statisticians have their own version of Goldilocks, which evaluates how well a statistical analysis fits the data that it is applied to.
An analysis that over-fits the data is too specific, in that it excludes data points from a larger data set that should be included.
Conversely, an analysis that under-fits the data is too general, in that it includes data points from a larger data set that should be excluded.
In our example, the first two regular expressions over-fit the data set ('at' should be included), while the last two under-fit it ('19' should be excluded).
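The contrast can be sketched in code under two assumptions that are not on the slides: a sample string S, and a target set consisting of the two-letter lowercase words in it. The helper fit_report is a hypothetical name introduced only for this illustration.

```python
import re

# Hypothetical sample string and target set (assumptions for illustration).
S = 'She has to leave when it rains, as trains at noon reach 19 stops.'
target = {'to', 'it', 'as', 'at'}

def fit_report(pattern, text, wanted):
    """Return (wanted items the pattern missed, unwanted items it let in)."""
    found = set(re.findall(pattern, text))
    return sorted(wanted - found), sorted(found - wanted)

# Too specific: 'at' is wanted but excluded, so the pattern over-fits.
over = fit_report(' (to|be|it|as) ', S, target)
print(over)          # (['at'], [])

# Just right: nothing missed, nothing extra.
just_right = fit_report(' ([a-z]{2}) ', S, target)
print(just_right)    # ([], [])

# Too general: '19' is unwanted but included, so the pattern under-fits.
under = fit_report(' (.{2}) ', S, target)
print(under)         # ([], ['19'])
```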

8 4.2.7.2. False positives and false negatives
Statistical test theory provides an alternative way of conceptualizing the problem, which I unfortunately can’t figure out how to tie in to Goldilocks.
Though it is usually illustrated in terms of medical tests, I believe that explaining it in terms of legal ‘tests’ is easier to understand.

9 A trial
Imagine that a person is charged with a crime and goes through a trial.
If she is guilty and the verdict is guilty, the trial has produced a true positive data point: a guilty person is found guilty.
Conversely, if she is innocent and the verdict is not guilty, the trial has produced a true negative data point: a not-guilty person is found not guilty.
We expect that an accurate test only produces true positives and true negatives, but there are two more logical possibilities that leave room for a test to be less than fully accurate.
One is for an innocent person to be found guilty. This is called a false positive data point, because the accused should have failed the test but instead passed it.
Alternatively, if a guilty person is found not guilty, the legal test has produced a false negative data point, because the accused should have passed the test but instead failed it.

10 Four outcomes of a trial
            true                         false
positive    guilty found guilty          innocent found guilty
negative    innocent found not guilty    guilty found not guilty

11 4.2.7.3. Summary of the two sorts of regex evaluation
true positive: evaluation of ‘to’ by [a-z]{2} results in good fit
false positive: evaluation of ‘19’ by .{2} results in under-fit
true negative: evaluation of ‘the’ by [a-z]{2} results in bad fit
false negative: evaluation of ‘at’ by (?:to|be|it|as) results in over-fit
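The four outcomes can be checked mechanically. The sketch below follows the definitions from the trial example: "positive" means the regex matched the token, and "true" means that result agrees with what was wanted. The function name outcome and the use of re.fullmatch on isolated tokens are assumptions made for this illustration; the slides test patterns inside a longer string instead.

```python
import re

def outcome(pattern, token, should_match):
    """Name the test outcome for one token, given whether it ought to match."""
    did_match = re.fullmatch(pattern, token) is not None
    truth = 'true' if did_match == should_match else 'false'
    sign = 'positive' if did_match else 'negative'
    return truth + ' ' + sign

# The target is two-letter lowercase words, so 'to' and 'at' should match,
# while 'the' and '19' should not.
print(outcome('[a-z]{2}', 'to', True))           # true positive
print(outcome('[a-z]{2}', 'the', False))         # true negative
print(outcome('.{2}', '19', False))              # false positive
print(outcome('(?:to|be|it|as)', 'at', True))    # false negative
```

A false positive comes from a pattern that is too general, and a false negative from one that is too specific.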

12 4.2.8. More on ranges and negation
>>> S2 = 'otolaryngologist'
English only has five letters for vowels, so it would be easy enough to list them in a disjunction:
>>> re.findall('a|e|i|o|u', S2)
['o', 'o', 'a', 'o', 'o', 'i']
>>> re.findall('[aeiou]', S2)
['o', 'o', 'a', 'o', 'o', 'i']
>>> re.findall('[^aeiou]', S2)
['t', 'l', 'r', 'y', 'n', 'g', 'l', 'g', 's', 't']
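A further sketch of ranges and negation on the same word. The range [a-m] and the string 'a^e' are choices made for this illustration, not examples from the slides.

```python
import re

S2 = 'otolaryngologist'

# A set and its negation split the string between them.
vowels = re.findall('[aeiou]', S2)
others = re.findall('[^aeiou]', S2)
print(len(vowels) + len(others) == len(S2))   # True

# A hyphen inside brackets gives a range: the letters a through m.
first_half = re.findall('[a-m]', S2)
print(first_half)    # ['l', 'a', 'g', 'l', 'g', 'i']

# The negated range matches everything else.
second_half = re.findall('[^a-m]', S2)
print(second_half)   # ['o', 't', 'o', 'r', 'y', 'n', 'o', 'o', 's', 't']

# The caret negates only when it comes first inside the brackets;
# anywhere else in the brackets it is an ordinary character.
literal_caret = re.findall('[a^e]', 'a^e')
print(literal_caret) # ['a', '^', 'e']
```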

13 4.2.9. A range of repetition with {}
character{minimum,maximum} (no space after the comma)
>>> S3 = 'bookkeeper'
>>> S4 = 'goddessship'
>>> re.findall('[aeiou]{2}', S3)
['oo', 'ee']
>>> re.findall('[^aeiou]{3}', S4)
['sss']
>>> re.findall('[^aeiou]{2,3}', S4)
['dd', 'sss']
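A few more variations on {} with the same word, offered as a sketch. The open-ended form {2,} and the comparison with a spaced-out {2, 3} are additions for illustration and do not appear on the slides.

```python
import re

S4 = 'goddessship'

# Exactly two: the run 'sssh' is split into 'ss' and 'sh'.
exactly_two = re.findall('[^aeiou]{2}', S4)
print(exactly_two)   # ['dd', 'ss', 'sh']

# Two or more: leaving out the maximum means no upper limit.
two_or_more = re.findall('[^aeiou]{2,}', S4)
print(two_or_more)   # ['dd', 'sssh']

# Between two and three: the longest allowed run is taken first.
two_to_three = re.findall('[^aeiou]{2,3}', S4)
print(two_to_three)  # ['dd', 'sss']

# With a space after the comma, Python no longer reads the braces as a
# repetition count and looks for the literal text '{2, 3}' instead.
with_space = re.findall('[^aeiou]{2, 3}', S4)
print(with_space)    # []
```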

14 4.2.10. Match the beginning or end of a string with ^ and $
>>> re.findall('^.|.$', S)
['T', '.']
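A small sketch of the anchors in use. The string S below is a stand-in chosen for illustration (the slides do not show the S used in class); it only needs to begin with 'T' and end with a period to reproduce the output above.

```python
import re

# Hypothetical stand-in for the class string S.
S = 'This is it.'

# First character or last character of the string.
ends = re.findall('^.|.$', S)
print(ends)             # ['T', '.']

# An uppercase letter at the very beginning.
starts_upper = re.findall('^[A-Z]', S)
print(starts_upper)     # ['T']

# A lowercase letter at the very end: none, since S ends in a period.
ends_lower = re.findall('[a-z]$', S)
print(ends_lower)       # []

# The caret has two jobs: outside brackets it anchors to the beginning,
# inside brackets (in first position) it negates the set.
starts_nonvowel = re.findall('^[^aeiou]', S)
print(starts_nonvowel)  # ['T']
```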

15 Next time
http://www.tulane.edu/~howard/CompCultEN/regex.html#further-practice-of-fixed-length-matching
4.3. Variable-length matching

