Regular Expressions and perl


Regular Expressions and perl perl and Bioinformatics Dan Rochowiak drochowi@cs.uah.edu perlRegEx.ppt

Regular Expressions

What is a regular expression? A regular expression is simply a string that describes a pattern. Patterns are in common use these days; examples are the patterns typed into a search engine to find web pages and the patterns used to list files in a directory, e.g., ls *.txt or dir *.*. In Perl, the patterns described by regular expressions are used to search strings, extract desired parts of strings, and do search-and-replace operations.

Regular Expressions

In regular expressions, each of the following character set symbols matches a single character, so to handle an arbitrary run of blank space you must use \s+. These symbols may also be used inside a character set, so when looking for a hex digit you could use [a-fA-F\d].

Symbol  Equiv           Description
\w      [a-zA-Z0-9_]    A "word" character (alphanumeric plus "_")
\W      [^a-zA-Z0-9_]   A non-word character
\s      [ \t\n\f\r]     A whitespace character
\S      [^\s]           A non-whitespace character
\d      [0-9]           A digit character
\D      [^0-9]          A non-digit character
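The shorthand classes above can be tried out with a short script. This is a minimal sketch: the sample strings and variable names are illustrative assumptions, and split is a standard Perl built-in that the slides do not cover.

```perl
use strict;
use warnings;

# Hypothetical sample input with mixed tabs and runs of spaces.
my $text = "alpha \t beta   gamma";

# \s+ matches one or more whitespace characters, so each run of
# blanks counts as a single separator.
my @words = split /\s+/, $text;
print join(",", @words), "\n";    # prints "alpha,beta,gamma"

# [a-fA-F\d] matches exactly one hex digit.
my $f_is_hex = ("f" =~ /[a-fA-F\d]/) ? 1 : 0;    # 1: 'f' is a hex digit
my $g_is_hex = ("g" =~ /[a-fA-F\d]/) ? 1 : 0;    # 0: 'g' is not
print "f: $f_is_hex, g: $g_is_hex\n";
```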

Regular Expressions

Perl has the standard regex quantifiers (or closures), where r is any regular expression.

r*      Zero or more occurrences of r (greedy match).
r+      One or more occurrences of r (greedy match).
r?      Zero or one occurrence of r (greedy match).
r*?     Zero or more occurrences of r (minimal match).
r+?     One or more occurrences of r (minimal match).
r??     Zero or one occurrence of r (minimal match).

Let q be a regex with a quantifier. If there are many ways for q to match some text, a greedy quantifier matches (or "eats up") as much text as possible; a minimal quantifier matches as little as possible. If a regex contains more than one quantifier, the quantifiers are "fed" left to right.
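The difference between greedy and minimal matching is easiest to see on a concrete string. The sketch below uses made-up sample text, and it relies on capturing parentheses (covered in the Substrings slides) to show which text each quantifier consumed.

```perl
use strict;
use warnings;

# Hypothetical sample string containing two quoted words.
my $text = 'say "one" and "two"';

# Greedy: .* takes as much as it can, so the match runs from the
# first quote to the last quote.
my ($greedy) = $text =~ /(".*")/;

# Minimal: .*? takes as little as it can, so the match stops at the
# first closing quote.
my ($minimal) = $text =~ /(".*?")/;

print "greedy:  $greedy\n";     # "one" and "two"
print "minimal: $minimal\n";    # "one"

# Quantifiers are fed left to right: the first a+ takes all it can
# while still leaving one 'a' so the second a+ can match.
my ($first, $second) = "aaaa" =~ /(a+)(a+)/;
print "first: $first, second: $second\n";    # first: aaa, second: a
```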

Regular Expressions

The two main regex operations are searching (finding) and substituting. In searching, we test whether a string contains a match for a regular expression. In substituting, we replace part of the original string with a new string; the new string is often based on the original. Both operations use the regular expression operator =~, which consists of two characters. This operator is not related to either the equals sign = or the tilde ~.

Searching

To determine whether the string $line contains a year from the 1980s or 1990s, such as 1998 or 1983, use the search operator =~ /.../. The slashes '/' delimit the beginning and the end of the regular expression.

    if ($line =~ /19[89]\d/) {
        # found a year in $line
    }

In general, to determine whether the string $var contains the regular expression re, use any of the following forms. If the regular expression itself contains a slash '/', use the mXreX form, where each X is the same single character not appearing in re. In mX...X, the m stands for "match".

    if ($var =~ /re/) { ... }
    if ($var =~ m:re:) { ... }      # can replace ':' with any other character
    while ($var =~ m/re/) { ... }   # can replace '/' with any other character
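As a runnable illustration of the search forms above, the following sketch filters a few made-up lines for a year and then matches a pattern containing slashes. The input strings and variable names are assumptions chosen for the example.

```perl
use strict;
use warnings;

# Hypothetical input lines.
my @lines = ("Released in 1998", "Founded in 1875", "Updated 1983-04-01");

# Keep only the lines that contain a year between 1980 and 1999.
my @with_year;
for my $line (@lines) {
    if ($line =~ /19[89]\d/) {
        push @with_year, $line;
    }
}
print scalar(@with_year), " lines contain a matching year\n";    # 2

# The pattern itself contains '/', so use m|...| instead of /.../.
my $path = "/usr/bin/perl";
my $in_usr_bin = ($path =~ m|/usr/bin/|) ? 1 : 0;
print "in /usr/bin: $in_usr_bin\n";    # 1
```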

Substrings

To access the substring in $var matched by part of the regular expression re, put that part of re in parentheses. The matched text is accessible via the variables $1, $2, ..., $k, where $k holds the text matched by the k-th parenthesized part of the regular expression. For example, to break up an e-mail address user@machine in $line we could do

    if ($line =~ /(\S+)@(\S+)/) {   # \S = any non-space character
        my ($user, $machine) = ($1, $2);
        ...
    }

The submatch variables $1, $2, ..., $k are updated after each successful regex operation, which wipes out the previous values.
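Because a later successful match overwrites $1 and $2, it is usually safest to copy them right away. The sketch below shows this with a made-up address and a second, unrelated match; all sample data is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical input line; single quotes keep '@' literal.
my $line = 'Contact: alice@example.org today';

my ($user, $machine);
# The '@' is written as \@ here to make explicit that no array is
# being interpolated into the pattern.
if ($line =~ /(\S+)\@(\S+)/) {
    # Copy the submatches immediately, before any other regex runs.
    ($user, $machine) = ($1, $2);
}

# A second successful match replaces the contents of $1 and $2.
my $key;
if ("size=42" =~ /(\w+)=(\d+)/) {
    $key = $1;    # now "size", no longer "alice"
}

print "user: $user, machine: $machine, key: $key\n";
```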

Substrings

Use \k, not $k, in the regular expression itself to refer to a previously matched substring. For example, to search for identical beginning and ending HTML tags <xyz> ... </xyz> on a single line $line, use

    if ($line =~ m|<(.*)>(.*)</\1>|) {   # search for: <xyz>stuff</xyz>
        my ($stuff) = $2;
        ...
    }
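A small test of the backreference pattern above, run against one line with matching tags and one with mismatched tags. The sample lines are assumptions made for illustration.

```perl
use strict;
use warnings;

# Matching tags: \1 must repeat whatever the first group captured.
my ($tag, $stuff);
if ('<b>bold text</b>' =~ m|<(.*)>(.*)</\1>|) {
    ($tag, $stuff) = ($1, $2);
}
print "tag: $tag, stuff: $stuff\n";    # tag: b, stuff: bold text

# Mismatched tags: the closing tag is not a repeat of the opening
# tag, so the match fails.
my $mismatch = ('<b>bold text</i>' =~ m|<(.*)>(.*)</\1>|) ? 1 : 0;
print "mismatched tags match: $mismatch\n";    # 0
```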

Substitution

To replace text in $var that matches the regular expression old with the text new, use one of the following forms.

    $var =~ s/old/new/;                # replace old with new
    if ($var =~ s:old:new:) { ... }    # can replace ':' with any other character

To use part of the actual text matched by the old regex, the replacement text can use the $k variables. Taking our previous example involving years, to replace the year 19xy with xy, use

    $line =~ s/19(\d\d)/$1/;
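The substitution forms above can be checked with a few made-up strings. This sketch also uses the fact that s/// returns a true value only when it actually replaced something, which is what makes the if form useful; the sample data is hypothetical.

```perl
use strict;
use warnings;

# Replace a four-digit 19xy year with its last two digits.
my $line = "Copyright 1998";
my $changed = ($line =~ s/19(\d\d)/$1/) ? 1 : 0;
print "$line (changed: $changed)\n";    # Copyright 98 (changed: 1)

# No year in this string, so nothing is replaced and s/// is false.
my $plain = "no year here";
my $changed_plain = ($plain =~ s/19(\d\d)/$1/) ? 1 : 0;
print "$plain (changed: $changed_plain)\n";    # no year here (changed: 0)

# Alternate delimiter ':' avoids escaping the slashes in the pattern.
my $path = "/usr/local/bin";
$path =~ s:/usr/local:/opt:;
print "$path\n";    # /opt/bin
```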

Examples: Simple Case

The simplest regexp is simply a word, or string of characters. A regexp consisting of a word matches any string that contains that word:

    "Hello World" =~ /World/;  # matches

"Hello World" is a double-quoted string. World is the regular expression, and the // enclosing /World/ tells perl to search a string for a match. The operator =~ associates the string with the regexp match and produces a true value if the regexp matched, or false if it did not. In this case, World matches the second word in "Hello World", so the expression is true.

Examples: More Simple Cases

There are useful variations. The sense of the match can be reversed by using the !~ operator:

    if ("Hello World" !~ /World/) {
        print "It doesn't match\n";
    }
    else {
        print "It matches\n";
    }

The literal string in the regexp can be replaced by a variable:

    $greeting = "World";
    if ("Hello World" =~ /$greeting/) {
        ...
    }

Examples: Special Cases

If you're matching against the special default variable $_, the $_ =~ part can be omitted:

    $_ = "Hello World";
    if (/World/) {
        print "It matches\n";
    }
    else {
        print "It doesn't match\n";
    }

And finally, the // default delimiters for a match can be changed to arbitrary delimiters by putting an 'm' out front:

    "Hello World" =~ m!World!;   # matches, delimited by '!'
    "Hello World" =~ m{World};   # matches, note the matching '{}'

Examples: Exact Match

Consider how different regexps would match "Hello World":

    "Hello World" =~ /world/;   # doesn't match
    "Hello World" =~ /o W/;     # matches
    "Hello World" =~ /oW/;      # doesn't match
    "Hello World" =~ /World /;  # doesn't match

The first regexp, world, doesn't match because regexps are case-sensitive. The second regexp matches because the substring 'o W' occurs in the string. The space character ' ' is treated like any other character and is needed to match in this case. The lack of a space character is the reason the third regexp doesn't match. The fourth regexp doesn't match because there is a space at the end of the regexp, but not at the end of the string. Regexps must match a part of the string exactly in order for the statement to be true.

Examples: Metacharacters

With respect to character matching, there is another point to consider. Not all characters can be used 'as is' in a match. Some characters, called metacharacters, are reserved for use in regexp notation. The metacharacters are

    {}[]()^$.|*+?\

A metacharacter can be matched by putting a backslash before it:

    "2+2=4" =~ /2+2/;                          # doesn't match, + is a metacharacter
    "2+2=4" =~ /2\+2/;                         # matches, \+ is treated like an ordinary +
    "The interval is [0,1)." =~ /[0,1)./;      # is a syntax error!
    "The interval is [0,1)." =~ /\[0,1\)\./;   # matches
    "/usr/bin/perl" =~ /\/usr\/bin\/perl/;     # matches

In the last regexp, the forward slash '/' is also backslashed, because it is used to delimit the regexp.
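To see what an unescaped metacharacter actually does, compare the escaped and unescaped forms against a few strings. The test strings below are made up for the example; the m{...} delimiters come from the earlier slide on special cases.

```perl
use strict;
use warnings;

# Unescaped, + is a quantifier: /2+2/ means "one or more 2s, then a 2".
my $quantifier_hit  = ("22=4"  =~ /2+2/)  ? 1 : 0;    # 1: "22" fits
my $quantifier_miss = ("2+2=4" =~ /2+2/)  ? 1 : 0;    # 0: no "22" here
my $escaped_hit     = ("2+2=4" =~ /2\+2/) ? 1 : 0;    # 1: literal "2+2"

# Unescaped, . matches any character; escaped, only a literal dot.
my $dot_any     = ("3x14" =~ /3.14/)  ? 1 : 0;        # 1
my $dot_literal = ("3x14" =~ /3\.14/) ? 1 : 0;        # 0

# Changing the delimiter removes the need to backslash each '/'.
my $path_hit = ("/usr/bin/perl" =~ m{/usr/bin/perl}) ? 1 : 0;    # 1

print "$quantifier_hit $quantifier_miss $escaped_hit ",
      "$dot_any $dot_literal $path_hit\n";    # 1 0 1 1 0 1
```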

Matching and Variables

Similar escape sequences are used in double-quoted strings, and the regexps in Perl are mostly treated as double-quoted strings. This means that variables can be used in regexps. The values of the variables in the regexp will be substituted in before the regexp is evaluated for matching purposes. So:

    $foo = 'house';
    'housecat' =~ /$foo/;       # matches
    'cathouse' =~ /cat$foo/;    # matches
    'housecat' =~ /${foo}cat/;  # matches
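Since the variable is substituted before matching, the same regexp can test different patterns as the variable changes. A minimal sketch, with hypothetical words chosen for the example:

```perl
use strict;
use warnings;

# Try each candidate prefix; the pattern is rebuilt from $word on
# every pass through the loop.
my @hits;
for my $word ('house', 'boat') {
    # The braces in ${word} mark where the variable name ends, so
    # Perl does not look for a variable called $wordboat.
    if ('houseboat' =~ /${word}boat/) {
        push @hits, $word;
    }
}
print "prefixes found: @hits\n";    # prefixes found: house
```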

Anchors

So far, if the regexp matched anywhere in the string it was considered a match. Sometimes we'd like to specify where in the string the regexp should try to match. To do this, use the anchor metacharacters ^ and $. The anchor ^ means match at the beginning of the string, and the anchor $ means match at the end of the string, or before a newline at the end of the string. So:

    "housekeeper" =~ /keeper/;      # matches
    "housekeeper" =~ /^keeper/;     # doesn't match
    "housekeeper" =~ /keeper$/;     # matches
    "housekeeper\n" =~ /keeper$/;   # matches

The second regexp doesn't match because ^ constrains keeper to match only at the beginning of the string, but "housekeeper" has keeper starting in the middle. The third regexp does match, since the $ constrains keeper to match only at the end of the string. The fourth matches because $ also matches just before a newline at the end of the string.
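Using both anchors together pins the regexp to the whole string. The sketch below applies /^keeper$/ to a few made-up strings; grep is a standard Perl built-in that tests each element through the default variable $_.

```perl
use strict;
use warnings;

# Hypothetical candidates; the third one ends with a newline.
my @candidates = ("keeper", "housekeeper", "keeper\n", "keepers");

# With ^ and $ together, the string must be exactly "keeper",
# optionally followed by a single trailing newline.
my @exact = grep { /^keeper$/ } @candidates;

print scalar(@exact), " of ", scalar(@candidates), " match\n";    # 2 of 4 match
```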

Character Classes

A character class allows a set of possible characters, rather than just a single character, to match at a particular point in a regexp. Character classes are denoted by brackets [...], with the set of characters to be possibly matched inside. So:

    /cat/;                 # matches 'cat'
    /[bcr]at/;             # matches 'bat', 'cat', or 'rat'
    /item[0123456789]/;    # matches 'item0' or ... or 'item9'
    "abc" =~ /[cab]/;      # matches 'a'

In the last statement, even though 'c' is the first character in the class, 'a' matches because the first character position in the string is the earliest point at which the regexp can match.

Character Classes

    /[yY][eE][sS]/;   # match 'yes' in a case-insensitive way:
                      # 'yes', 'Yes', 'YES', etc.

This regexp displays a common task: performing a case-insensitive match. Perl provides a way of avoiding all those brackets by simply appending an 'i' to the end of the match. Then /[yY][eE][sS]/; can be rewritten as /yes/i;. The 'i' stands for case-insensitive and is an example of a modifier of the matching operation. There are other modifiers as well.
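A quick check that the bracketed form and the /i modifier accept the same inputs. The list of answers is made up for the example.

```perl
use strict;
use warnings;

# Hypothetical user answers.
my @answers = ("yes", "Yes", "YES", "no");

# Spell out both cases of every letter with character classes ...
my @by_class = grep { /[yY][eE][sS]/ } @answers;

# ... or let the i modifier ignore case for the whole regexp.
my @by_modifier = grep { /yes/i } @answers;

print "class form:    @by_class\n";       # yes Yes YES
print "modifier form: @by_modifier\n";    # yes Yes YES
```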

Character Class - Ranges

The special character '-' acts as a range operator within character classes, so that a contiguous set of characters can be written as a range. With ranges, the unwieldy [0123456789] and [abc...xyz] become the svelte [0-9] and [a-z]. So:

    /item[0-9]/;      # matches 'item0' or ... or 'item9'
    /[0-9bx-z]aa/;    # matches '0aa', ..., '9aa',
                      # 'baa', 'xaa', 'yaa', or 'zaa'
    /[0-9a-fA-F]/;    # matches a hexadecimal digit
    /[0-9a-zA-Z_]/;   # matches a "word" character,
                      # like those in a perl variable name

If '-' is the first or last character in a character class, it is treated as an ordinary character; [-ab], [ab-] and [a\-b] are all equivalent.
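The two roles of '-' can be compared directly: as a range operator between two characters, and as an ordinary character at the edge of the class. The test strings below are illustrative assumptions.

```perl
use strict;
use warnings;

# Range: [0-9] accepts any single digit after "item".
my @items = grep { /item[0-9]/ } ("item7", "itemX", "item42");
print "items: @items\n";    # items: item7 item42

# Leading '-' is an ordinary character, so the class matches a dash.
my $dash_literal = ("a-b" =~ /[-xy]/) ? 1 : 0;    # 1

# Between 'a' and 'c' the '-' builds a range; a dash is not in it.
my $dash_in_range = ("-" =~ /[a-c]/) ? 1 : 0;     # 0

print "literal: $dash_literal, range: $dash_in_range\n";
```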

Character Classes - Negation

The special character ^ in the first position of a character class denotes a negated character class, which matches any character but those in the brackets. Both [...] and [^...] must match a character, or the match fails. So:

    /[^a]at/;   # doesn't match 'aat' or 'at', but matches
                # all other 'bat', 'cat', '0at', '%at', etc.
    /[^0-9]/;   # matches a non-numeric character
    /[a^]at/;   # matches 'aat' or '^at'; here '^' is ordinary
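Running both classes over the same word list shows how the position of ^ changes its meaning. The word list is a hypothetical example.

```perl
use strict;
use warnings;

# Hypothetical test words.
my @words = ('bat', 'aat', 'at', '%at', '^at');

# ^ first: negated class, any one character except 'a' before "at".
# 'at' fails because the class still has to match some character.
my @negated = grep { /[^a]at/ } @words;

# ^ not first: ordinary character, so the class means 'a' or '^'.
my @literal = grep { /[a^]at/ } @words;

print "negated: @negated\n";    # negated: bat %at ^at
print "literal: @literal\n";    # literal: aat ^at
```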