Perl & Regular Expressions (RegEx)

Regular Expressions

A regular expression ("regex" for short) describes a text pattern to search for within a string. A regex can be a simple string, which generally matches letter for letter, or a more complex expression written in Perl's regex grammar. Examples and explanations follow.

Regular Expressions: The Match Operator

m/PATTERN/ is Perl's built-in match operator. It operates on a scalar (by default on $_): it searches the scalar for the text described by PATTERN and, in scalar context, returns 1 for success or "" for failure.

To test whether PATTERN occurs in a particular string, bind the string to the operator with =~. The !~ binding operator is equivalent to !( =~ ).

Important: == and != are numeric comparison operators, so using them in place of =~ and !~ gives wrong results.
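A minimal sketch of the two binding operators and of the default match against $_. The variable names ($found, $missing, $default) are illustrative and not part of the original slides.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $name = 'Amir Sahar';

# =~ binds $name to the match operator: true if 'Amir' occurs in $name.
my $found = ($name =~ m/Amir/) ? 1 : 0;

# !~ is the negated form: true if 'Dov' does NOT occur in $name.
my $missing = ($name !~ m/Dov/) ? 1 : 0;

# Without a binding operator, m// searches $_ instead.
my $default = 0;
for ('Amir Sahar') {              # for aliases $_ to each list element
    $default = m/Sahar/ ? 1 : 0;
}

print "found=$found missing=$missing default=$default\n";
# prints: found=1 missing=1 default=1
```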

Regular Expressions: The Match Operator

    #!/usr/bin/perl
    $name = 'Amir Sahar';
    if ($name =~ m/Dov/) {
        print "I thought that I was Amir?!\n";
    } elsif ($name =~ m/Amir/) {
        print "whew... I still remember my name\n";
    } else {
        print "what the hell is my name?\n";
    }

Regular Expressions: The Match Operator

The m// operator interpolates its contents like a double-quoted string, which lets you use variables in your search patterns.

m// accepts any non-alphanumeric, non-whitespace delimiter in place of '/'. This is handy when the pattern itself contains '/', as in a path name.

Instead of:
    m/\/usr\/local\/bin/
prefer:
    m!/usr/local/bin!
or:
    m(/usr/local/bin)

When the delimiter is not '/', the leading m is required; with '/' it may be omitted: /PATTERN/
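The following sketch exercises both points: alternative delimiters and variable interpolation inside a pattern. The path string and the $dir variable are hypothetical values chosen for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $path = '/usr/local/bin/perl';   # hypothetical path to search in
my $dir  = 'local';                 # hypothetical variable to interpolate

# Same pattern, three spellings: escaped slashes, '!' delimiters, brace delimiters.
my $by_slash = ($path =~ m/\/usr\/local\/bin/) ? 1 : 0;
my $by_bang  = ($path =~ m!/usr/local/bin!)    ? 1 : 0;

# $dir is interpolated into the pattern before matching.
my $by_var   = ($path =~ m{/usr/$dir/bin})     ? 1 : 0;

print "by_slash=$by_slash by_bang=$by_bang by_var=$by_var\n";
# prints: by_slash=1 by_bang=1 by_var=1
```

Note that an interpolated variable is treated as part of the pattern, so any metacharacters it contains keep their special meaning.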

Regular Expressions: Built-in Match Variables

A few built-in variables help you process the results of a pattern match:

    $&   contains the matched string
    $`   contains everything before the matched string
    $'   contains everything after the matched string

If the match fails, these variables keep their previous values, so they cannot be used to test whether a match succeeded.

    #!/usr/bin/perl
    $problem = 'I am running out of funny examples';
    $problem =~ m/out/;
    print "before:$`\nafter:$'\nmatched:$&\n";

Will produce:

    before:I am running 
    after: of funny examples
    matched:out

The Substitute Operator

In addition to the m/PATTERN/ operator, there is a s/PATTERN/SUBSTITUTION/ operator. It searches for PATTERN in the string and replaces it with SUBSTITUTION. The SUBSTITUTION is treated as a double-quoted string, including interpolation of variables. s/// is used the same way as m//.

    #!/usr/bin/perl
    $wish = 'I wish that this course would be over';
    $wish =~ s/be over/continue forever/;
    print "$wish\n";

The script above will produce:

    I wish that this course would continue forever

Metasymbols

So far we have seen patterns that match a fixed text sequence. More generic text patterns can be described using various predefined metasymbols and metacharacters. The most useful metasymbols are described in the following slides.

Metasymbols

Each of the metasymbols in this table matches a single character in the text being searched:

    Metasymbol   Matches
    \w           Any alphanumeric character, as well as underscore
    \W           Any character that \w doesn't match
    \d           Any digit character
    \D           Any non-digit character
    \s           Any whitespace character
    \S           Any non-whitespace character
    .            Anything but a newline (\n). This is a don't-care match

Metasymbols

    #!/usr/bin/perl
    $number = 1234;
    print "huh?\n" if $number =~ /\D/;
    # won't print anything
    print "well, I knew that\n" if $number =~ /\w/;
    # will print: well, I knew that
    print "matching a don't care\n" if $number =~ /./;
    # will print: matching a don't care
    print "of course there is no whitespace in numbers\n" unless $number =~ /\s/;
    # will print: of course there is no whitespace in numbers

Metasymbols

The following metasymbols describe where in the string to search for PATTERN:

    Symbol   Meaning
    ^        Beginning of string (or of a line, when the /m modifier is used)
    $        End of string (or of a line, when the /m modifier is used)
    \b       Word boundary (between \W and \w or vice versa)
    \B       Anything but a word boundary
    \A       Beginning of string only
    \Z       End of string only (or just before a final newline)

Metasymbols

    $string = "beetlejuice beetlemania beetles";
    $string =~ /^beetlemania/;    # won't match anything
    $string =~ /^beetlejuice$/;   # won't match anything
    $string =~ /\bbeetle\b/;      # won't match anything
    $string =~ /\bbeetle/;        # will match 'beetle' at the start of the first word
    $string =~ /\bbeetles$/;      # will match the last word

Metacharacters

In general, "metacharacters" are characters that have a special meaning when they appear in regular expressions. To search for one of the metacharacters themselves, prefix it with a backslash ('\'). The characters are:

    \ | ( ) [ ] { } ^ $ * + ? .

Therefore, if you are searching for a '?', your match pattern should look like this:

    m/is this a question\?/

We'll see more about what these characters mean in the following slides.
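To illustrate why escaping matters, here is a small sketch that compares escaped and unescaped metacharacters. The test strings ('3x14', 'is this a questio') are made up for the demonstration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Unescaped '.' is a metacharacter: it matches any character except newline.
my $dot_loose  = ('3x14' =~ m/3.14/)  ? 1 : 0;
# Escaped '\.' matches only a literal dot, so '3x14' no longer matches.
my $dot_strict = ('3x14' =~ m/3\.14/) ? 1 : 0;

# Unescaped '?' is a quantifier: it makes the preceding 'n' optional.
my $q_loose  = ('is this a questio'   =~ m/is this a question?/)  ? 1 : 0;
# Escaped '\?' requires a literal question mark in the text.
my $q_strict = ('is this a question?' =~ m/is this a question\?/) ? 1 : 0;

print "dot_loose=$dot_loose dot_strict=$dot_strict ",
      "q_loose=$q_loose q_strict=$q_strict\n";
# prints: dot_loose=1 dot_strict=0 q_loose=1 q_strict=1
```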

Quantifiers

To search for a repeated pattern, place a quantifier after the PATTERN that should be repeated. Perl has three basic quantifiers:

    ?   Succeeds if the preceding pattern appears 0 or 1 times
    *   Succeeds if the preceding pattern appears 0 or more times in succession
    +   Succeeds if the preceding pattern appears 1 or more times in succession

Note that quantifiers have no meaning on their own; they always modify the immediately preceding regex symbol.

Quantifiers

    $fruit = "fifteen (15) bananas";
    $fruit =~ /e+/;      print "$&\n";   # will print 'ee'
    $fruit =~ /an*/;     print "$&\n";   # will print 'an'
    $fruit =~ /(an)+/;   print "$&\n";   # will print 'anan'
    $fruit =~ /e*/;      print "$&\n";   # succeeds with an empty match at the start, so only a newline is printed
    print "ok" if $fruit =~ m{(abc)?};   # will print 'ok'
    print "ok" if $fruit =~ m{(abc)+};   # will fail and print nothing
    print "ok" if $fruit =~ /tef*en/;    # will print 'ok'
    $fruit =~ /\w+/;       print "$&\n"; # will print 'fifteen'
    $fruit =~ /\b\d+\b/;   print "$&\n"; # will print '15'
    $fruit =~ /\(.*\)/;    print "$&\n"; # will print '(15)'
    print "not found" if $fruit !~ /^banana/;   # will print 'not found'

Quantifiers

More precise specification of repeated matches is possible with these additional quantifiers:

    Meaning                                           Maximal match   Minimal match
    Match 0 or more times                             *               *?
    Match 1 or more times                             +               +?
    Match 0 or 1 times                                ?               ??
    Match exactly COUNT times                         {COUNT}
    Match at least MIN times                          {MIN,}          {MIN,}?
    Match at least MIN but not more than MAX times    {MIN,MAX}       {MIN,MAX}?

Generally speaking, a quantifier tries to match as many times as possible in its "maximal match" form and as few times as possible in its "minimal match" form.
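The counted quantifiers can be tried out with a short sketch like the one below. The input string '123456' and the variable names are illustrative assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $digits = '123456';

$digits =~ /\d{3}/;     my $exact    = $&;   # exactly 3 digits
$digits =~ /\d{2,4}/;   my $greedy   = $&;   # 2 to 4 digits, as many as possible
$digits =~ /\d{2,4}?/;  my $minimal  = $&;   # 2 to 4 digits, as few as possible
$digits =~ /\d{5,}/;    my $at_least = $&;   # at least 5 digits, as many as possible

print "$exact $greedy $minimal $at_least\n";
# prints: 123 1234 12 123456
```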

Quantifiers

    $phrase = "Hold your horses";
    $phrase =~ /.+o/;    print "$&\n";   # will print 'Hold your ho'
    $phrase =~ /.+?o/;   print "$&\n";   # will print 'Ho'
    print "match\n" if $phrase =~ /.+H/;          # will print nothing
    print "match\n" if $phrase =~ /.*H/;          # will print 'match'
    print "match\n" if $phrase =~ /^H.{14}s$/;    # will print 'match'
    print "hold is a word\n" if $phrase =~ /\BHold/;   # will print nothing

Alternatives

To match one of several possible subexpressions, separate them with the '|' token and enclose them in round parentheses.

    #!/usr/bin/perl
    $course = 'Perl course';
    print "I am in the right course" if $course =~ /(Perl|Tcl|C) course/;

The script above is expected to produce:

    I am in the right course

Character Sets

To match any one of a set of possible characters, surround them with square brackets:

    [0-9] is the same as \d

To match anything except the characters in the square brackets, put a caret sign (^) right after the opening bracket:

    [^0-9] is the same as \D

Keep in mind the difference between square brackets, which define a set of characters, and round parentheses, which group alternative expressions:

    [fee|fie|foe] is the same as [feio|]

    m%(0[1-9]|[12]\d|3[01])/(0[1-9]|1[0-2])/\d{4}%   # matches a date dd/mm/yyyy

Why not m%([0-2]\d)/(0\d|1[0-2])/\d{4}% ? Because that simpler pattern accepts a day or month of 00 and rejects days 30 and 31.
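The difference between the two date patterns can be checked with a sketch along these lines. The ^ and $ anchors, the brace delimiters, the sample dates and the %result hash are additions for the demonstration and do not appear in the slides.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The slide's date pattern, anchored so the whole string must be a date.
my $strict = qr{^(0[1-9]|[12]\d|3[01])/(0[1-9]|1[0-2])/\d{4}$};
# The simpler alternative the slide asks about.
my $loose  = qr{^([0-2]\d)/(0\d|1[0-2])/\d{4}$};

my %result;   # date => [ matched strict?, matched loose? ]
for my $date ('31/12/1999', '00/00/2000', '15/06/2010') {
    $result{$date} = [ ($date =~ $strict) ? 1 : 0,
                       ($date =~ $loose)  ? 1 : 0 ];
    print "$date strict=$result{$date}[0] loose=$result{$date}[1]\n";
}
# prints:
# 31/12/1999 strict=1 loose=0
# 00/00/2000 strict=0 loose=1
# 15/06/2010 strict=1 loose=1
```

Neither pattern checks month lengths, so a string such as 31/02/2000 would still be accepted by the stricter one.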

Captured Matches

To remember parts of a match for later reference, Perl has the built-in variables $1, $2, $3, ... Each variable holds the text matched by one pair of round parentheses. The pairs are numbered by the position of their opening parenthesis, from the leftmost to the rightmost.

These variables are overwritten every time a match succeeds, so it is good practice to save them in your own variables. A subroutine call may overwrite them without your knowledge.

If the match operator is evaluated in list context, the elements of the resulting list are what $1, $2, ... would have returned.
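A minimal sketch of saving captures into your own variables, both via $1 and $2 and via list context. The time string and all variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $time = '12:34';

# Copy $1 and $2 into named variables right after a successful match.
my ($hours, $minutes);
if ($time =~ /(\d\d):(\d\d)/) {
    ($hours, $minutes) = ($1, $2);
}

# In list context the match returns the captured groups directly.
my ($hh, $mm) = $time =~ /(\d\d):(\d\d)/;

# A later successful match overwrites $1; the saved copies are unaffected.
'abc' =~ /(b)/;
my $clobbered = $1;

print "$hours:$minutes $hh:$mm $clobbered\n";
# prints: 12:34 12:34 b
```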

Captured Matches

    $song = "oh lord wont you buy me a Mercedes Benz";
    @matches = ($song =~ /(.*o)([a-z\s]*)(.*)/);

    $1 and $matches[0] will equal 'oh lord wont yo'
    $2 and $matches[1] will equal 'u buy me a '
    $3 and $matches[2] will equal 'Mercedes Benz'

$1, $2 etc. can be used in the substitution pattern:

    $time = "12:34";
    $time =~ s/(..):(..)/$2:$1/;
    print $time;

Is expected to produce:

    34:12

Modifiers

The match rules for a pattern can be modified by flags placed after the closing delimiter of the match operator:

    Modifier   Meaning
    /i         Ignore alphabetic case distinctions (case insensitive).
    /s         Let . also match newlines in the string.
    /m         Let ^ and $ match next to embedded newlines in the string.
    /x         Ignore (most) whitespace and permit comments in pattern.
    /o         Compile pattern once only.

Modifiers

    print "match" if "Perl" =~ /perl/i;
    # will print 'match' because case is ignored
    print "match" if "line 1\nLine 2" =~ /^l.*2/s;
    # will print 'match' because '.' includes '\n'
    print "match" if "line 1\nLine 2" =~ /^l.*2/m;
    # will print nothing
    print "match" if "line 1\nLine 2" =~ /^L.*2/m;
    # will print 'match'

The following 3 regexes all match the same thing:

    m/\w+:(\s+\w+)\s*\d+/;          # A word, colon, spaces, word, spaces, digits.

    m/\w+: (\s+ \w+) \s* \d+/x;     # A word, colon, spaces, word, spaces, digits.

    m{
        \w+:    # Match a word and a colon.
        (       # (begin group)
        \s+     # Match one or more spaces.
        \w+     # Match another word.
        )       # (end group)
        \s*     # Match zero or more spaces.
        \d+     # Match some digits
    }x;

Modifiers

An additional modifier is /g, the global match. It behaves slightly differently for m// and s///.

For s///, the PATTERN is replaced throughout EXPR as many times as it is found.

For m//, the PATTERN is matched repeatedly, each attempt starting from where the last match left off.

Regular Expressions

The following script:

    #!/usr/bin/perl
    $fruit = "banana";
    $counter = 0;
    while ($fruit =~ m/a/g) {
        print ++$counter, "\n";
    }
    print "$fruit \n";

Is expected to produce:

    1
    2
    3
    banana

The following script:

    #!/usr/bin/perl
    $fruit = "banana";
    $counter = 0;
    while ($fruit =~ s/a/q/g) {
        print ++$counter, "\n";
    }
    print "$fruit \n";

Is expected to produce:

    1
    bqnqnq