regular expressions: string matching


1 regular expressions: string matching
Perl regular expressions: string matching

2 For this lecture, we focus on string matching using an if statement
The format:
if ($str =~ /pattern to match/)    # true when match
if ($str !~ /pattern to match/)    # true when no match
is the same as:
if ($str =~ m/pattern to match/)   # true when match
if ($str !~ m/pattern to match/)   # true when no match

3 simple matching
Match a string or string variable:
if ($str =~ /dog/)   # true if $str contains dog
If the $str and the =~ or !~ are left off, the match is applied to $_.
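A minimal sketch of the two operators and the $_ default described above. The sample strings are made up for illustration and are not from the lecture.

```perl
use strict;
use warnings;

# Hypothetical sample data.
my $str = "hotdog stand";

my $has_dog = ($str =~ /dog/) ? 1 : 0;   # =~ is true when the pattern is found
my $no_cat  = ($str !~ /cat/) ? 1 : 0;   # !~ is true when the pattern is absent

# With no variable and no =~, the match is applied to $_.
my @hits;
for ("dog", "cat", "bulldog") {
    push @hits, $_ if /dog/;
}

print "has_dog=$has_dog no_cat=$no_cat hits=@hits\n";
```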

4 case insensitive matching
/i ignores case:
if ($str =~ /dog/i)   # true if $str contains dog; the match is case insensitive
if ($str =~ /DOG/i)   # same

5 alternation matching
| allows matching with an or:
if ($str =~ /Fred|Wilma|Pebbles/)              # true if $str contains Fred, Wilma, or Pebbles
if ($str =~ /Fred|Wilma|Pebbles Flintstone/)   # matches Fred, Wilma, or Pebbles Flintstone
Grouping:
if ($str =~ /(Fred|Wilma|Pebbles) Flintstone/) # matches Fred Flintstone, Wilma Flintstone, or Pebbles Flintstone
if ($str =~ /(Blue|Song)bird/)                 # matches Bluebird or Songbird
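The difference between the ungrouped and grouped patterns is easy to miss, so here is a small sketch that runs both against the same names. The list of names is an assumption chosen for illustration.

```perl
use strict;
use warnings;

# Without parentheses, " Flintstone" belongs only to the last alternative.
my $ungrouped = qr/Fred|Wilma|Pebbles Flintstone/;
# With parentheses, " Flintstone" must follow whichever name matched.
my $grouped   = qr/(Fred|Wilma|Pebbles) Flintstone/;

# Hypothetical test names.
my @names = ("Fred", "Fred Flintstone", "Pebbles Flintstone", "Wilma Rubble");

my (@ungrouped_hits, @grouped_hits);
for my $name (@names) {
    push @ungrouped_hits, $name if $name =~ $ungrouped;
    push @grouped_hits,   $name if $name =~ $grouped;
}

print "ungrouped: ", join(", ", @ungrouped_hits), "\n";
print "grouped:   ", join(", ", @grouped_hits), "\n";
```

The ungrouped pattern accepts all four names, because a bare Fred or Wilma is enough; the grouped pattern accepts only the two that end in Flintstone.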

6 alternation matching (2)
if ($str =~ /th(is|at)/)        # true if $str contains this or that
if ($str =~ /(p|g|m|s|b)et/)    # true if $str contains pet, get, met, set, or bet

7 Single character matching
Use []:
if ($str =~ /[abc]/)      # true if $str contains a and/or b and/or c
if ($str =~ /[pgmsb]et/)  # true if $str contains pet, get, met, set, or bet
if ($str =~ /[Fred]/)     # true if $str contains F and/or r and/or e and/or d
Characters not listed: a leading ^ inside [] negates the set
if ($str =~ /[^abc]/)     # true if $str contains at least one character other than a, b, or c
if ($str =~ /[a-z]/)      # true if $str contains any lowercase letter a through z

8 Single character or'd matching (2)
if ($str =~ /[0-9]/)          # true if $str contains any digit 0 through 9
if ($str =~ /[0-9\-]/)        # matches 0 through 9 or the minus sign
if ($str =~ /[a-z0-9\^]/)     # matches any single lowercase letter, digit, or ^
if ($str =~ /[a-zA-Z0-9_]/)   # matches any single letter, digit, or underscore
if ($str =~ /[^aeiouAEIOU]/)  # true if $str contains any character that is not a vowel
if ($str !~ /[aeiouAEIOU]/)   # true only if there are no vowels in $str
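The last two lines above look alike but ask different questions. A short sketch, with words picked purely for illustration, shows the negated class and the negated match side by side.

```perl
use strict;
use warnings;

my %result;
# Hypothetical sample words.
for my $word ("rhythm", "audio", "aeiou") {
    # Negated class: is there at least one character that is not a vowel?
    my $has_non_vowel = ($word =~ /[^aeiouAEIOU]/) ? 1 : 0;
    # Negated match: is there no vowel anywhere in the string?
    my $no_vowels     = ($word !~ /[aeiouAEIOU]/) ? 1 : 0;
    $result{$word} = "$has_non_vowel$no_vowels";
    print "$word: has a non-vowel=$has_non_vowel, has no vowels=$no_vowels\n";
}
```

Here "audio" contains a non-vowel (d) yet still has vowels, so only the first test is true for it.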

9 matching quantifiers
Multiple uses: {min,max}
if ($str =~ /a{3}/)        # true if $str contains aaa
Common mistake:
if ($str =~ /Fred{3}/)     # matches Freddd, not FredFredFred
if ($str =~ /(Fred){3}/)   # matches FredFredFred
if ($str =~ /a{3,}/)       # matches aaa, aaaa, aaaaa, aaaaaa, etc.
if ($str =~ /a{3,5}b/)     # matches aaab, aaaab, aaaaab
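The common mistake above can be checked directly. This is a small sketch, with hypothetical variable names, showing that a quantifier applies only to the item right before it unless parentheses group more.

```perl
use strict;
use warnings;

my $freddd = "Freddd";
my $triple = "FredFredFred";

# {3} applies only to the "d" here.
my $d_repeated_1 = ($freddd =~ /Fred{3}/) ? 1 : 0;
my $d_repeated_2 = ($triple =~ /Fred{3}/) ? 1 : 0;

# Parentheses make {3} apply to the whole word.
my $word_repeated_1 = ($freddd =~ /(Fred){3}/) ? 1 : 0;
my $word_repeated_2 = ($triple =~ /(Fred){3}/) ? 1 : 0;

print "Fred{3}:   Freddd=$d_repeated_1 FredFredFred=$d_repeated_2\n";
print "(Fred){3}: Freddd=$word_repeated_1 FredFredFred=$word_repeated_2\n";
```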

10 matching quantifiers (2)
if ($str =~ /a{0,5}/)   # matches a, aa, aaa, aaaa, aaaaa, and also when there are no a's
if ($str =~ /a*/)       # *  matches 0 or more times (max match)
if ($str =~ /a*?/)      # *? matches 0 or more times (min match)
Difference between min and max matching:
$_ = "aaaa";            # matches all three above
The difference: * matches "aaaa", while *? matches the empty string, because zero a's is the fewest it is allowed to take.
A max match takes as many characters as it can, while a min match takes as few characters as it can.
This becomes important in the next lecture.

11 matching quantifiers (3)
+     1 or more times (max match)
+?    1 or more times (min match)
if ($str =~ /a+/)   # true if there are 1 or more "a"s
?     match 0 or 1 time (max match)
??    match 0 or 1 time (min match)
if ($str =~ /a?/)   # true if there is 1 "a" or no "a"s
Also, {3,5}? is a min match that tries to match only 3 where possible, and {3,5} is a max match that tries to match 5 where possible.
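To see what max and min matching actually consume, the sketch below captures the matched text with parentheses and prints it. The input strings are assumptions chosen so the difference is visible.

```perl
use strict;
use warnings;

# In list context a successful match returns its captures.
my ($star_max)  = "aaaa"  =~ /(a*)/;        # takes all four a's
my ($star_min)  = "aaaa"  =~ /(a*?)/;       # takes nothing: zero is allowed
my ($plus_min)  = "aaaa"  =~ /(a+?)/;       # takes one a: one is the minimum
my ($range_max) = "aaaaa" =~ /(a{3,5})/;    # takes five
my ($range_min) = "aaaaa" =~ /(a{3,5}?)/;   # takes three

print "a*      -> '$star_max'\n";
print "a*?     -> '$star_min'\n";
print "a+?     -> '$plus_min'\n";
print "a{3,5}  -> '$range_max'\n";
print "a{3,5}? -> '$range_min'\n";
```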

12 matching quantifiers (4)
if ($str =~ /fo+ba?r/)   # matches f, 1 or more o's, b, 0 or 1 a, then an r
Match: fobar, foobar, foobr
Non-match: fbar (missing o), foobaar (too many a's)
if ($str =~ /fo*ba?r/)   # matches f, 0 or more o's, b, 0 or 1 a, then an r
Match: fobar, fbr, fooobr, etc…
Inside [], matching quantifiers are "normal" characters.
if ($str =~ /[.?!+]*/)   # matches zero or more ., ?, !, or +

13 Exercise 7
What will the following match?
1. /a+[bc]/
2. /(a|be)t/i
3. /Hi{1,3} There\!?/
4. /(Foo)?Bar/i
5. /[1-9][1-9][a-z]*/
6. /[a-zA-Z]+, [A-Z]{2} [0-9]{5}/
Write a regular expression for these:
Match a social security number (with or without dashes)
A street address: number Name with either St, Ln, Rd or nothing. Also case insensitive

14 metasymbols
. matches one character (except newline)
if ($str =~ /./)     # always true, except when $str is "" or holds only newlines
if ($str =~ /d.g/)   # true for d, any character, and g: dog, dbg, dag, dcg, d g, etc.
if ($str =~ /d.*g/)  # true for d, 0 or more characters, and g: dg, dog, dasdfg, d g, etc.
if ($str =~ /d.+g/)  # true for d, 1 or more characters, and g: NOT dg, but the rest: dog, dasdfg, d g, etc.

15 metasymbols (2)
if ($str =~ /d.?g/)      # true for d, any single character, and g, AND for dg
if ($str =~ /d.{0,1}g/)  # same as above
if ($str =~ /d.{2}g/)    # true for d, 2 characters, and g: doog, dafg, dghg, etc…
if ($str =~ /d.{2,5}g/)  # true for d, 2 to 5 characters, and g: dooog, doog, dXXXXXg, dXobgg, etc…

16 metasymbols (3)
Anchoring
^   beginning of the string (it only means "not" inside [])
$   end of the string
if ($str =~ /^dog$/)   # true only for "dog", not "ddogg"
if ($str =~ /^dog/)    # true only when the string starts with "dog": "dog", "doga", etc.

17 metasymbols (4)
if ($str =~ /dog$/)     # true when the string ends with "dog": "dog", "asdfadfdog", "ddddooodog"
if ($str =~ /^.$/)      # true when the string is one character long and that character is not the newline
if ($str =~ /^[abc]+/)  # true when the string starts with
                        #   "a", "aa", "aaa", etc. with any characters following,
                        #   "b", "bb", "bbb", etc. with any characters following,
                        #   "c", "cc", "ccc", etc. with any characters following,
                        #   as well as any combination of a's, b's, and c's: "abcabc", etc.
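The anchoring rules from the last two slides can be sketched by filtering one list three ways. The candidate strings are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical candidate strings.
my @candidates = ("dog", "ddogg", "doga", "hotdog");

my @exact  = grep { /^dog$/ } @candidates;   # anchored at both ends
my @starts = grep { /^dog/  } @candidates;   # anchored at the start only
my @ends   = grep { /dog$/  } @candidates;   # anchored at the end only

print "exact:  @exact\n";
print "starts: @starts\n";
print "ends:   @ends\n";
```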

18 metasymbols (5)
\d   match a digit                 [0-9]
\D   match a nondigit              [^0-9]
\s   match whitespace              [ \t\n\r\f]
\S   match a nonwhitespace         [^ \t\n\r\f]
\w   match a word character        [a-zA-Z0-9_]
\W   match a nonword character     [^a-zA-Z0-9_]

19 Examples
if ($str =~ /\d/)         # true when $str contains a digit
if ($str =~ /\d+/)        # true when $str contains 1 or more digits
if ($str =~ /\w\d/)       # true when $str contains a word character followed by a digit
if ($str =~ /\w+\d/)      # true when $str contains 1 or more word characters followed by a digit
                          #   true for "abc1" "a1" "11" "_9" "Z8" and "a1a1"
if ($str =~ /^\s\w\d/)    # true when it starts with a whitespace, then a word character, and then a digit
                          #   " 11" "\ta1" "\n11" etc.
if ($str =~ /^\s*\w\d/)   # true when it starts with 0 or more whitespaces, then a word character, and then a digit
                          #   " 11" "11" " \t a1" etc.
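As a quick check of the /\w+\d/ example, the sketch below runs the pattern over the strings listed on the slide plus two extra strings that are assumptions added here to show failures.

```perl
use strict;
use warnings;

# The first six come from the slide; "1" and "a-1" are added for contrast.
my @inputs = ("abc1", "a1", "11", "_9", "Z8", "a1a1", "1", "a-1");

my @matched   = grep { /\w+\d/ } @inputs;
my @unmatched = grep { !/\w+\d/ } @inputs;

print "matched:   @matched\n";
print "unmatched: @unmatched\n";
```

"1" fails because the pattern needs at least two characters, and "a-1" fails because the hyphen is not a word character, so no word character sits directly before the digit.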

20 boundary assertions
\b matches at any word boundary, as defined by \w and \W
\B matches at any non-word boundary, as defined by \W and \w
/\bis\b/   # matches "what it is" and "that is it"; won't match "tist"
           # similar to /\Wis\W/, except that \W must match an actual character,
           # so /\Wis\W/ misses "what it is", where "is" ends the string
/\Bis\B/   # matches "thistle" and "artist"; won't match "that is it"
           # can also be written as /\wis\w/

21 boundary assertions (2)
/\bis\B/   # matches "istanbul" and "so—isn't that"
           # similar to /\Wis\w/, but that won't match "istanbul", because "is" is at
           # the front of the string and there is no character for \W to match
Since \w is [a-zA-Z0-9_], all punctuation counts as a word boundary.
So /\bisn\B/ won't match "isn't", because ' is not a word character.
/\Bis\b/   # matches "this" and "this is for you"
           # similar to /\wis\W/
For the second example, the match is on the "is" inside "this", not on the separate word "is".
As in the example above, \W won't match at the end of a string.
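The following sketch contrasts \b and \B with \W on a few of the strings used above. The helper name hit is hypothetical; it just turns a match into 1 or 0.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if $string matches $regex, otherwise 0.
sub hit {
    my ($string, $regex) = @_;
    return $string =~ $regex ? 1 : 0;
}

my @results = (
    hit("what it is", qr/\bis\b/),   # \b is satisfied at the end of the string
    hit("what it is", qr/\Wis\W/),   # \W needs a real character after "is"
    hit("thistle",    qr/\bis\b/),   # "is" is buried inside a word
    hit("thistle",    qr/\Bis\B/),   # so \B on both sides succeeds
    hit("isn't",      qr/\bis\B/),   # "n" follows "is", so no boundary there
    hit("isn't",      qr/\bisn\B/),  # the apostrophe creates a boundary after "n"
);

print "@results\n";
```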

22 Exercise 8
What will the following match?
1. /a+\w*?/
2. /\w\s*\w+/
3. /\bHi\bThere/
4. /\b\w+\b.+There[!]?$/i
5. /^\d+[a-z]*/
6. /\w+,\s\w{2}\s{2}\d{5}/
Write a regular expression for these:
Rewrite #6 so the city can be two or more words.
Must start with a letter, then have any number of letters and/or numbers or none at all, but end with a number

23 Parentheses as memory
Special variables \1 .. \9
\1 holds the first match inside a ()
if ($str =~ /(\d)asdf\1/)        # true when it has a digit, then asdf, then the same digit
                                 #   examples: 1asdf1, 3asdf3
if ($str =~ /(\w+)(\d+)as\2\1/)  # true for a word, then digits, as, the same digits, then the same word
                                 #   examples: "hi12as12hi" "1_31as311_"

24 Parentheses as memory (2)
if ($str =~ /(\d)+asdf\1/)
Note: (\d)+ is different from (\d+)
(\d+) matches as many digits as it can, and all of them go into \1
(\d)+ matches one digit at a time, and only the last digit matched goes into \1
Example: (\d)+ on 123 gives \1 = 3, but the match is on 123
So 123asdf3 would match the if at the top.
In the next lecture, it does some strange things on substitutions.

25 Parentheses as memory (3)
Parentheses around parentheses:
if ($str =~ /((\w+) (\w+))/)   # \1, \2, \3 are bound to values
$str = "Hi There";
\2 = "Hi", \3 = "There", \1 = "Hi There"
Perl numbers groups by their opening parenthesis, left to right, so it works from the outermost parentheses to the inner:
the first ( is 1, ((\w+) is 2, the second (\w+) is 3
(((\w+) )(\w+)) has \1, \2, \3, \4
\1 = "Hi There", \2 = "Hi ", \3 = "Hi", \4 = "There"
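The numbering of nested groups, and the (\d)+ versus (\d+) note from the previous slide, can be checked with a short sketch. The string "123asdf123" is an assumption added for the second comparison; the other inputs come from the slides.

```perl
use strict;
use warnings;

my $str = "Hi There";

# In list context a match returns ($1, $2, $3, ...).
my @groups = $str =~ /((\w+) (\w+))/;
my @nested = $str =~ /(((\w+) )(\w+))/;

# (\d)+ keeps only the last digit it matched; (\d+) keeps the whole run.
my ($last_digit) = "123asdf3"   =~ /(\d)+asdf\1/;
my ($all_digits) = "123asdf123" =~ /(\d+)asdf\1/;

print join("|", @groups), "\n";
print join("|", @nested), "\n";
print "last digit: $last_digit, all digits: $all_digits\n";
```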

26 Variable Interpolation
Using variables inside the match:
$find = "abc";
if ($str =~ /$find/)       # matches when $str contains the value of $find
$str = "ddogg";
$dog = "dog";
if ($str =~ /\w$dog\w/)    # true if $str contains the string in $dog with a word character in front and behind

27 Variable interpolation (2)
Variable interpolation sometimes does more than you think.
Example:
$match = "hi*";
$x = "h";
if ($x =~ /$match/) {
    print "$match matches $x\n";
} else {
    print "Failed as it should\n";
}
Output:
hi* matches h
Why? The literal text hi* does not appear in h.

28 Variable interpolation (3)
Answer: the variable is interpolated as a regex, so hi* means: find an h and zero or more lowercase i's.
The following example leads to an error:
$match = "*hi*";
if ($x =~ /$match/) {
Perl errors out saying: Quantifier follows nothing in regex; …
To stop meta and regex interpolation of variables, use \Q … \E.
Between \Q and \E the characters are matched literally, instead of being interpreted as regex.

29 Variable interpolation (4)
$match = "hi*";
$x = "h";
if ($x =~ /\Q$match\E/) {
    print "$match matches $x\n";
} else {
    print "Failed as it should\n";
}
Output:
Failed as it should
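A compact sketch of the same point, comparing the interpolated pattern with and without \Q … \E. The string "say hi* now" is an assumption added to show a case where the literal text is present; quotemeta is the built-in function that applies the same escaping as \Q.

```perl
use strict;
use warnings;

my $match = "hi*";
my $x     = "h";

# Interpolated as a regex: h followed by zero or more i's.
my $as_regex   = ($x =~ /$match/) ? 1 : 0;
# Interpolated as literal text: the three characters h, i, *.
my $as_literal = ($x =~ /\Q$match\E/) ? 1 : 0;
# Hypothetical string that really contains the literal text.
my $literal_found = ("say hi* now" =~ /\Q$match\E/) ? 1 : 0;

# quotemeta shows what \Q does to the string before it is compiled.
my $escaped = quotemeta($match);

print "as regex: $as_regex, as literal: $as_literal, literal found: $literal_found\n";
print "escaped pattern: $escaped\n";
```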

30 Special Read-Only variables
We've seen \1 .. \9. They only have a value inside the match.
But $1 .. $9 hold their values (the same as \1 .. \9) after the match.
if ($str =~ /(\d+)asd\1/) {
    print "matched $1 \n";
}
If $str = "123asd123", then the output would be:
matched 123

31 Special Read-Only variables (2)
$`   (backquote) The characters to the left of the match
$'   (quote) The characters to the right of the match
$&   The characters that matched
Helpful for debugging your regex.
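A minimal sketch of using these three variables for debugging. The input string is made up; the values are copied into ordinary variables right after the match, since a later successful match would overwrite them.

```perl
use strict;
use warnings;

# Hypothetical input.
my $str = "the quick fox";

my ($before, $matched, $after) = ("", "", "");
if ($str =~ /quick/) {
    # Copy immediately: the next successful match resets all three.
    ($before, $matched, $after) = ($`, $&, $');
}

print "before='$before' matched='$matched' after='$after'\n";
```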

32 capturing matches
$str = "a xxx c xxxxxc xxx d";
($a, $b) = ($str =~ m/(.+)x(.+)c/);
$a = "a xxx c xxx";   # also $1 = "a xxx c xxx"
$b = "x";             # also $2 = "x"
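The result above follows from max matching: the first (.+) grabs as much as it can and gives characters back only until the rest of the pattern fits. A runnable sketch of the same match, using the hypothetical names $left and $right instead of $a and $b (which Perl also uses inside sort blocks):

```perl
use strict;
use warnings;

my $str = "a xxx c xxxxxc xxx d";

# List context: the match returns its two captures.
my ($left, $right) = ($str =~ m/(.+)x(.+)c/);

print "left='$left' right='$right'\n";
print "same as \$1='$1' and \$2='$2'\n";
```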

33 match as a true value
/ / returns a true/false value, which can be used to build a switch structure:
SWITCH: {
    $str =~ /abc/ && do { $a = 1; last SWITCH; };
    $str =~ /def/ and do { $d = 1; last SWITCH; };
    $c = 1;
}
Strange looking code. Also, this is one of the very few places a ; is needed after a }.
NOTE: either && or and could be used.

34 Commenting your matches
/x ignores most white space and allows comments
/\w+:    # Match a word and a colon
 (       # Begin group
   \s+   # match one or more spaces
   \w+   # match another word
 )       # end group
 \s*     # match zero or more spaces
 \d+     # match 1 or more digits
/x;
same as
/\w+:(\s+\w+)\s*\d+/;
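To confirm that the commented and compact forms behave the same, the sketch below compiles both and applies them to one line. The input line is an assumption, and qr{ }x is used here only so the pattern can be stored in a variable.

```perl
use strict;
use warnings;

my $commented = qr{
    \w+:     # match a word and a colon
    (        # begin group
      \s+    # match one or more spaces
      \w+    # match another word
    )        # end group
    \s*      # match zero or more spaces
    \d+      # match one or more digits
}x;

my $compact = qr/\w+:(\s+\w+)\s*\d+/;

# Hypothetical input line.
my $line = "total: apples 42";

my ($from_commented) = $line =~ $commented;
my ($from_compact)   = $line =~ $compact;

print "commented captured '$from_commented'\n";
print "compact   captured '$from_compact'\n";
```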

35 Commenting your matches (2)
Be careful not to use / in the comments, otherwise perl thinks it is the end of the match.
You have to think about where the whitespace is in the match.
If you need to match a #, use \#

36 more flags for pattern matching
Matching with a newline in the string
//s lets the . match the newline (\n)
$str = "asdf\n asdf\n";
/(f.)/;     # no match
/(f.)/s;    # $1 = "f\n"
//m lets ^ and $ match next to an embedded \n
$str = "af\nasdf\n";
/(af$)/;    # won't match
/(af$)/m;   # $1 = "af"
/^(as)/;    # won't match
/^(as)/m;   # matches, $1 = "as"
/(f.)$/ms;  # matches only the last "f\n": the . consumes the \n, so $ can only
            # succeed where the end of the string follows
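The /s and /m results listed above can be reproduced with a short sketch. It uses the two strings from the slide; the anchored pattern is written as /(af)$/ rather than /(af$)/, an equivalent form chosen here for readability.

```perl
use strict;
use warnings;

my $str = "af\nasdf\n";

my $end_plain   = ($str =~ /(af)$/)  ? 1 : 0;   # "af" is not at the end of the string
my $end_multi   = ($str =~ /(af)$/m) ? 1 : 0;   # /m lets $ match before the embedded \n
my $start_plain = ($str =~ /^(as)/)  ? 1 : 0;   # "as" is not at the start of the string
my $start_multi = ($str =~ /^(as)/m) ? 1 : 0;   # /m lets ^ match after the embedded \n

my $other = "asdf\n asdf\n";
my ($dot_plain) = $other =~ /(f.)/;    # no match: . will not take the \n
my ($dot_s)     = $other =~ /(f.)/s;   # /s lets . take the \n

print "end: plain=$end_plain multi=$end_multi\n";
print "start: plain=$start_plain multi=$start_multi\n";
print "dot without /s matched: ", (defined $dot_plain ? "yes" : "no"), "\n";
print "dot with /s captured f plus newline: ", ($dot_s eq "f\n" ? "yes" : "no"), "\n";
```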

37 Pattern Delimiters
if ($str =~ /\/usr\/local/)     # true if $str contains /usr/local
To avoid backslashing /, we can change the delimiter.
Choose another delimiter, which must be a nonalphanumeric character, such as %%, ##, {}, [], <>, etc.
You must use the m in front of the match so perl knows what you want.
if ($str =~ m%/usr/local%)
if ($str =~ m[/usr/local])      # true if $str contains /usr/local, but confusing since it can be
                                # mistaken for [] single character matching
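A small sketch showing that the delimiter choice does not change what is matched. The path is a hypothetical example value.

```perl
use strict;
use warnings;

# Hypothetical path.
my $path = "/usr/local/bin";

my @results = (
    ($path =~ /\/usr\/local/) ? 1 : 0,   # default delimiter, slashes escaped
    ($path =~ m%/usr/local%)  ? 1 : 0,   # % as delimiter
    ($path =~ m{/usr/local})  ? 1 : 0,   # paired braces as delimiters
    ($path =~ m[/usr/local])  ? 1 : 0,   # paired brackets: legal, but easy to misread
);

print "@results\n";
```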

38 Exercise 9
What will the following match?
/\-?(\|)?m\(\d+\)\1/i
"-|m(12)|"
"|M(12)|"
"-|M(12)"
"m(12)|"
"M(12)"
For /\-?(\|)?m\(\d+\)\1?/i
"|m(12)|"
"m(12)"
"-|m(12)|"
"-|m(12)"

39 Q & A

