Perl & Regular Expressions (RegEx)

Regular Expressions
A regular expression (“regex” for short) is a way to describe a text pattern to search for within a string. Regexes can be simple strings, which generally match letter-for-letter, or more complex expressions written using Perl’s regex grammar. Examples and explanations will follow.

Regular Expressions The Match Operator
m/PATTERN/ is Perl’s built-in match operator. It operates on a scalar (by default, on $_). It searches the scalar for the text described by PATTERN and returns 1 on success or the empty string “” on failure.
To test whether PATTERN occurs in a particular string, bind the string to the operator with =~. The !~ binding operator is equivalent to !( =~ ).
Important: == and != are numeric comparisons, not binding operators. Using them in place of =~ and !~ gives wrong results.

#!/usr/bin/perl
$name = 'Amir Sahar';
if ($name =~ m/Dov/) {
    print "I thought that I was Amir?!\n";
} elsif ($name =~ m/Amir/) {
    print "whew... I still remember my name\n";
} else {
    print "what the hell is my name?\n";
}

The m// operator interpolates its contents like a double-quoted string. This lets you use variables in your search patterns.
m// can use any non-alphanumeric, non-whitespace delimiter instead of ‘/’. This comes in handy when the pattern itself contains ‘/’, such as a path name.
Instead of: m/\/usr\/local\/bin/
Prefer: m!/usr/local/bin! or m(/usr/local/bin)
When the delimiter is anything other than ‘/’, the m must be written; with ‘/’ it may be omitted: /PATTERN/
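The two points above can be sketched as follows. The sample path and the variable names are illustrative assumptions, not taken from the slides.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data, chosen only for illustration.
my $path = '/usr/local/bin/perl';
my $dir  = 'local';

# The pattern interpolates $dir, just as a double-quoted string would.
my $has_dir = ($path =~ m/$dir/) ? 1 : 0;

# With '!' as the delimiter, the slashes in the pattern need no escaping.
my $in_bin = ($path =~ m!^/usr/local/bin!) ? 1 : 0;

# The same test with '/' delimiters: every slash must be escaped.
my $in_bin_escaped = ($path =~ /^\/usr\/local\/bin/) ? 1 : 0;

print "has_dir=$has_dir in_bin=$in_bin in_bin_escaped=$in_bin_escaped\n";
```

All three tests succeed on this sample string; the second and third patterns are equivalent and differ only in readability.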

Regular Expressions Built-in Match Variables
There are a few useful built-in variables that help you process the results of a pattern match:
$& - contains the matched string
$` - contains everything before the matched string
$' - contains everything after the matched string
Notice that if the match fails, these variables keep their previous values, so they cannot be used to test whether a match succeeded.
#!/usr/bin/perl
$problem = 'I am running out of funny examples';
$problem =~ m/out/;
print "before:$`\nafter:$'\nmatched:$&\n";
Will produce (note the space that ends the first line and the one that starts the second):
before:I am running 
after: of funny examples
matched:out
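The warning about failed matches can be seen in a short sketch. The second pattern ('serious') is a made-up example chosen so that the match fails; the variable names are likewise illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = 'I am running out of funny examples';

# A successful match sets $&.
my $first_ok = ($text =~ m/out/) ? 1 : 0;
my $first    = $&;

# This match fails, so $& still holds the text from the earlier success.
my $second_ok = ($text =~ m/serious/) ? 1 : 0;
my $stale     = $&;

print "first_ok=$first_ok matched=$first\n";
print "second_ok=$second_ok but \$& is still '$stale'\n";
```

Because $& looks the same after the failed match, the return value of the match operator is the only reliable success test.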

The Substitute Operator
In addition to the m/PATTERN/ operator, a s/PATTERN/SUBSTITUTION/ operator exists. This operator searches for PATTERN in the string and replaces the first match with SUBSTITUTION. The SUBSTITUTION is treated as a double-quoted string, including interpolation of variables. s/// is bound to a string the same way as m//.
#!/usr/bin/perl
$wish = 'I wish that this course would be over';
$wish =~ s/be over/continue forever/;
print "$wish\n";
The script above will produce:
I wish that this course would continue forever
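A small sketch of interpolation in the SUBSTITUTION part. It also reads the return value of s///, which is the number of substitutions made; the variable names $ending and $count are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $wish   = 'I wish that this course would be over';
my $ending = 'continue forever';   # illustrative replacement text

# $ending is interpolated into the replacement like in a double-quoted string.
# In scalar context s/// returns the number of substitutions it made.
my $count = ($wish =~ s/be over/$ending/);

print "$count substitution: $wish\n";
```

Testing the return value is a convenient way to find out whether anything was replaced.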

Metasymbols
So far we’ve seen patterns which match a fixed text sequence. More generic text patterns can be described using various predefined metasymbols and metacharacters. The most useful metasymbols are described in the following slides.

Metasymbols
Each of the metasymbols in this table matches a single character in the text being searched:
Metasymbol   Matches
\w           Any alphanumeric character, as well as underscore
\W           Any character that \w doesn’t match
\d           Any digit character
\D           Any non-digit character
\s           Any whitespace character
\S           Any non-whitespace character
.            Anything but a newline (\n). This is a don’t-care match

Metasymbols
#!/usr/bin/perl
$number = 1234;
print "huh?\n" if $number =~ /\D/;                   # won't print anything
print "well, I knew that\n" if $number =~ /\w/;      # will print: well, I knew that
print "matching a don't care\n" if $number =~ /./;   # will print: matching a don't care
print "of course there is no whitespace in numbers\n" unless $number =~ /\s/;   # will print: of course there is no whitespace in numbers

Metasymbols
The following metasymbols help describe where in the string to search for PATTERN:
Symbol   Meaning
^        Beginning of string (or of each line, when the /m modifier is used)
$        End of string (or of each line, when the /m modifier is used)
\b       Word boundary (between \W and \w or vice versa)
\B       Any position that is not a word boundary
\A       Beginning of string only
\Z       End of string only (also matches just before a final newline)

Metasymbols
$string = "beetlejuice beetlemania beetles";
$string =~ /^beetlemania/;    # won't match anything
$string =~ /^beetlejuice$/;   # won't match anything
$string =~ /\bbeetle\b/;      # won't match anything
$string =~ /\bbeetle/;        # will match the start of the first word
$string =~ /\bbeetles$/;      # will match the last word

Metacharacters
In general, “metacharacters” are characters that have a special meaning when they appear in regular expressions. If you want to search for one of the metacharacters themselves, you must prefix it with a backslash: ‘\’.
The characters are: \ | ( ) [ ] { } ^ $ * + ? .
Therefore, if you are searching for a ‘?’, your match pattern should look like this:
m/is this a question\?/
We’ll see more about what these characters mean in the following slides.
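The effect of the backslash can be checked with a short sketch. The last test goes slightly beyond the slide: it uses Perl's \Q...\E escape, which backslashes metacharacters in an interpolated variable for you. The sample strings and variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $sentence = 'is this a question?';

# Escaped: the pattern requires a literal '?' after 'question'.
my $escaped = ($sentence =~ m/question\?/) ? 1 : 0;

# Unescaped: '?' makes the preceding 'n' optional, so 'questio' matches too.
my $loose  = ('questio' =~ m/question?/)  ? 1 : 0;
my $strict = ('questio' =~ m/question\?/) ? 1 : 0;

# \Q...\E quotes any metacharacters inside the interpolated variable.
my $literal = 'question?';
my $quoted  = ($sentence =~ m/\Q$literal\E/) ? 1 : 0;

print "escaped=$escaped loose=$loose strict=$strict quoted=$quoted\n";
```

Only the unescaped pattern accepts the truncated word, which is usually not what was intended.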

Quantifiers
To search for a repeated pattern, follow the PATTERN with a quantifier that says how many times it should repeat. Perl has three basic quantifiers:
? - Succeeds if the preceding pattern appears 0 or 1 times
* - Succeeds if the preceding pattern appears 0 or more times in succession
+ - Succeeds if the preceding pattern appears 1 or more times in succession
Note that quantifiers have no meaning on their own; they always modify the immediately preceding regex symbol or parenthesized group.

Quantifiers
$fruit = "fifteen (15) bananas";
$fruit =~ /e+/; print "$&\n";          # will print 'ee'
$fruit =~ /an*/; print "$&\n";         # will print 'an'
$fruit =~ /(an)+/; print "$&\n";       # will print 'anan'
$fruit =~ /e*/; print "$&\n";          # succeeds by matching the empty string at the start, so it prints an empty line
print "ok" if $fruit =~ m{(abc)?};     # will print 'ok'
print "ok" if $fruit =~ m{(abc)+};     # will fail and print nothing
print "ok" if $fruit =~ /tef*en/;      # will print 'ok'
$fruit =~ /\w+/; print "$&\n";         # will print 'fifteen'
$fruit =~ /\b\d+\b/; print "$&\n";     # will print '15'
$fruit =~ /\(.*\)/; print "$&\n";      # will print '(15)'
print "not found" if $fruit !~ /^banana/;   # will print 'not found'

Quantifiers
More precise specification of repeated matches is possible with these additional quantifiers:
Maximal match   Minimal match   Meaning
*               *?              Match 0 or more times
+               +?              Match 1 or more times
?               ??              Match 0 or 1 times
{COUNT}                         Match exactly COUNT times
{MIN,}          {MIN,}?         Match at least MIN times
{MIN,MAX}       {MIN,MAX}?      Match at least MIN but not more than MAX times
Generally speaking, quantifiers will try to match as many times as possible if the “maximal match” version is used or as few times as possible if the “minimal match” version is used.

Quantifiers
$phrase = "Hold your horses";
$phrase =~ /.+o/; print "$&\n";                          # will print 'Hold your ho'
$phrase =~ /.+?o/; print "$&\n";                         # will print 'Ho'
print "match\n" if $phrase =~ /.+H/;                     # will print nothing
print "match\n" if $phrase =~ /.*H/;                     # will print 'match'
print "match\n" if $phrase =~ /^H.{14}s$/;               # will print 'match'
print "hold is a word\n" if $phrase =~ /\BHold/;         # will print nothing
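The counted quantifiers from the table can be sketched the same way. The sample strings ('ab12345' and a run of six 'a' characters) and the variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# {COUNT}: exactly two lowercase letters followed by exactly five digits.
my $id          = 'ab12345';
my $well_formed = ($id =~ /^[a-z]{2}\d{5}$/) ? 1 : 0;

my $run = 'aaaaaa';

# {MIN,MAX} is maximal: it takes as many repetitions as allowed.
my ($greedy) = ($run =~ /(a{2,4})/);

# {MIN,MAX}? is minimal: it stops as soon as MIN repetitions are found.
my ($lazy) = ($run =~ /(a{2,4}?)/);

# {MIN,} has no upper bound, so it consumes the whole run.
my ($open) = ($run =~ /(a{3,})/);

print "well_formed=$well_formed greedy=$greedy lazy=$lazy open=$open\n";
```

The captures come back through list-context assignment, so each variable holds exactly the text that the quantified group consumed.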

Alternatives
If you wish to match one of several possible subexpressions, separate them with the ‘|’ token and enclose them in round parentheses.
#!/usr/bin/perl
$course = 'Perl course';
print "I am in the right course" if $course =~ /(Perl|Tcl|C) course/;
The script above is expected to produce:
I am in the right course

Character Sets
To match any one of a set of possible characters, surround them with square brackets:
[0-9] is the same as \d
To match anything except the characters in the square brackets, put a caret sign (^) after the opening bracket:
[^0-9] is the same as \D
Keep in mind the difference between square brackets, which group a set of characters, and round parentheses, which group alternative expressions:
[fee|fie|foe] is the same as [feio|]
m%(0[1-9]|[12]\d|3[01])/(0[1-9]|1[0-2])/\d{4}%   # matches a date dd/mm/yyyy
Why not m%([0-2]\d)/(0\d|1[0-2])/\d{4}% ? Because [0-2]\d accepts day 00 and rejects days 30 and 31, and 0\d accepts month 00.
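The two date patterns can be compared on a few sample strings. In this sketch both patterns are anchored with ^ and $, which is an addition to the slide so that the whole string must be a date; the names $strict and $loose and the sample dates are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The slide's pattern, anchored (the anchors are an assumption of this sketch).
my $strict = qr{^(0[1-9]|[12]\d|3[01])/(0[1-9]|1[0-2])/\d{4}$};

# The simpler candidate from the "why not" question, anchored the same way.
my $loose = qr{^([0-2]\d)/(0\d|1[0-2])/\d{4}$};

for my $candidate ('31/12/1999', '00/00/2000', '07/04/2024') {
    my $s = ($candidate =~ $strict) ? 'yes' : 'no';
    my $l = ($candidate =~ $loose)  ? 'yes' : 'no';
    print "$candidate strict=$s loose=$l\n";
}
```

Neither pattern knows how many days each month has, so a string such as 31/02/2024 still passes the stricter one.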

Captured Matches
To remember parts of a match for later reference, Perl has the built-in variables $1, $2, $3, … Each variable holds the text matched by one pair of round parentheses. The parentheses are numbered by the position of their opening parenthesis, from the leftmost one towards the rightmost one.
Notice that these variables are overwritten by the next successful match, so it is good practice to save them in your own variables right away. Code that runs later in the same block may perform its own match and overwrite them without your noticing.
If the match operator is evaluated in list context, the elements of the resulting list are what $1, $2, … would have held.

Captured Matches
$song = "oh lord wont you buy me a Mercedes Benz";
@matches = ($song =~ /(.*o)([a-z\s]*)(.*)/);
$1 and $matches[0] will equal 'oh lord wont yo'
$2 and $matches[1] will equal 'u buy me a '
$3 and $matches[2] will equal 'Mercedes Benz'
$1, $2 etc. can be used in the substitution part of s///:
$time = "12:34";
$time =~ s/(..):(..)/$2:$1/;
print $time;
Is expected to produce:
34:12
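Saving captures straight into named variables follows from the list-context rule. A minimal sketch, with illustrative variable names and the slide's "12:34" sample:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $time = '12:34';

# List-context assignment copies the captures into our own variables,
# so a later match cannot overwrite them.
my ($hours, $minutes) = ($time =~ /^(\d\d):(\d\d)$/);

# A failed match returns the empty list, so nothing is assigned.
my @none = ('noon' =~ /^(\d\d):(\d\d)$/);

print "hours=$hours minutes=$minutes\n";
print "captures from failed match: ", scalar(@none), "\n";
```

The empty list on failure makes it easy to test success and collect the captures in one statement.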

Modifiers
The match rules for a pattern can be modified by certain flags that can be used after the closing delimiter of the match operator:
Modifier   Meaning
/i         Ignore alphabetic case distinctions (case insensitive).
/s         Let . match newlines in the string as well.
/m         Let ^ and $ match next to embedded newlines in the string.
/x         Ignore (most) whitespace and permit comments in pattern.
/o         Compile pattern once only.

Modifiers
print "match" if "Perl" =~ /perl/i;              # will print 'match' because case is ignored
print "match" if "line 1\nLine 2" =~ /^l.*2/s;   # will print 'match' because '.' also matches '\n'
print "match" if "line 1\nLine 2" =~ /^l.*2/m;   # will print nothing
print "match" if "line 1\nLine 2" =~ /^L.*2/m;   # will print 'match'
The following 3 regexes all match the same thing:
m/\w+:(\s+\w+)\s*\d+/;        # A word, colon, spaces, word, spaces, digits.
m/\w+: (\s+ \w+) \s* \d+/x;   # A word, colon, spaces, word, spaces, digits.
m{
    \w+:    # Match a word and a colon.
    (       # (begin group)
    \s+     # Match one or more spaces.
    \w+     # Match another word.
    )       # (end group)
    \s*     # Match zero or more spaces.
    \d+     # Match some digits
}x;

Modifiers
An additional modifier is /g, the global match. It behaves slightly differently for m// and s///.
For s///, every occurrence of PATTERN in the string is replaced, not only the first.
For m// in scalar context, each evaluation resumes matching from where the previous match left off, so the operator can drive a loop.

Regular Expressions
The following script:
#!/usr/bin/perl
$fruit = "banana";
$counter = 0;
while ($fruit =~ m/a/g) {
    print ++$counter, "\n";
}
print "$fruit \n";
Is expected to produce:
1
2
3
banana
The following script:
#!/usr/bin/perl
$fruit = "banana";
$counter = 0;
while ($fruit =~ s/a/q/g) {
    print ++$counter, "\n";
}
print "$fruit \n";
Is expected to produce:
1
bqnqnq
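Two further behaviours of /g help explain the outputs above, sketched here with illustrative variable names. In list context m//g returns every match at once, and s///g returns the number of replacements it made, which is why the second loop body runs only once.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# List context: m//g returns all matches in one go.
my $fruit = 'banana';
my @all   = ($fruit =~ m/a/g);

# Scalar context: each iteration resumes after the previous match;
# pos() reports the offset where the next search will start.
my $scan = 'banana';
my @positions;
while ($scan =~ m/a/g) {
    push @positions, pos($scan);
}

# s///g replaces every occurrence in a single call and returns the count.
my $copy     = 'banana';
my $replaced = ($copy =~ s/a/q/g);

print "matches: ", scalar(@all), "\n";
print "positions: @positions\n";
print "replaced $replaced times: $copy\n";
```

After the first s///g call no 'a' is left, so a second call finds nothing and returns a false value.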
