Regular Expressions Copyright 2005-2007 Doug Maxwell (http://www.unixlore.net)

Regular Expressions
Regular expressions (regexes) allow you to describe and parse patterns in text.
They are extremely useful as implemented in programming languages and other tools, including editors.
Examples of such tools are grep, find, sed, awk, Perl, Python, Vim, and Emacs.

What They Can Do
They can help you search for complex text patterns in one or many files.
This Emacs Lisp regex finds duplicate words:
    \\b\\([^ \n\t]+\\)[ \n\t]+\\1\\b
They can alter text on the fly.
In Vim, this uppercases the first letter of a sentence, if it was lowercase:
    s/\([.!?]\)\(\s\+\)\([a-z]\)/\1\2\u\3/g

Terminology I
Metacharacters vs. literals: metacharacters have special meaning, while literals just represent themselves.
Examples of metacharacters include ^, $, ., *, +, ?
Quantifiers (how much of something): *, +, ?, {1,3}
Character class (matching any one of several): [a-zA-Z] or [^!.?]
Alternation: (a|b) (read the vertical bar as “or”)

Terminology II
Anchors match a specific, fixed place in the text: ^ (beginning of line), $ (end of line).
Escape: a backslash '\' can be used to remove the special meaning from a metacharacter, or add meaning to a literal.
Examples: \s, \\, \*

Basics
The simplest regexes just specify literal text:
    egrep 'hack' resume
This just finds and prints all the lines in the file resume that contain the text hack.
Note that this will also print lines containing hacker, hacking and shack.
The metacharacter . matches any single character, except a newline:
    egrep 'h.ck' resume
will match both hick and shack, among others.
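The same two searches can be sketched with Python's re module, one of the tools named earlier. The list of lines is made up for illustration and stands in for the file resume; the helper name grep is hypothetical.

```python
import re

# Hypothetical lines standing in for the contents of the file "resume".
lines = ["I hack on kernels", "a shack by the sea", "hick town", "nothing here"]

def grep(pattern, lines):
    """Return the lines that contain a match anywhere, as egrep would print them."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

literal_hits = grep(r"hack", lines)  # literal text; also found inside "shack"
dot_hits = grep(r"h.ck", lines)      # "." stands for any one character

print(literal_hits)
print(dot_hits)
```

Note that search() looks for the pattern anywhere in the line, which is how grep behaves by default.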

Quantifiers and Grouping I
We can specify the number of times a character or group of characters must match by using the quantifiers:
    *       zero or more
    +       one or more
    ?       zero or one
    {m,n}   m to n inclusive
    {m,}    at least m
    {m}     exactly m
For example:
    egrep 'hack*' resume
would print lines containing hac, hack, and hackk (any number of k's, including none).
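A small Python sketch of the quantifiers, assuming Python's re module behaves like egrep for these simple patterns. fullmatch() is used so that each test word must fit the pattern from start to end; the test words are invented.

```python
import re

# fullmatch() succeeds only if the entire string fits the pattern.
star = re.compile(r"hack*")          # "hac" followed by zero or more k
counted = re.compile(r"hack{2,3}")   # "hac" followed by two or three k

star_results = {w: bool(star.fullmatch(w)) for w in ["ha", "hac", "hack", "hackkk"]}
counted_results = {w: bool(counted.fullmatch(w)) for w in ["hack", "hackk", "hackkkk"]}

print(star_results)
print(counted_results)
```

The quantifier applies only to the k, the single character just before it.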

Quantifiers and Grouping II
We can constrain the quantifier to a group of characters by using parentheses:
    egrep 'ha(ck)?' resume
prints lines containing ha followed by zero or one occurrences of ck.
So ha and hack are the only two valid matches here.
    egrep 'h(ack)+' resume
would match an h, followed by one or more occurrences of ack (hack, hackack, etc).
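Grouping can be checked the same way in Python. This is only a sketch with invented test words; fullmatch() again requires the whole word to fit.

```python
import re

optional = re.compile(r"ha(ck)?")   # "ha", then the group "ck" zero or one time
repeated = re.compile(r"h(ack)+")   # "h", then the group "ack" one or more times

optional_results = {w: bool(optional.fullmatch(w)) for w in ["ha", "hack", "hac"]}
repeated_results = {w: bool(repeated.fullmatch(w)) for w in ["h", "hack", "hackack"]}

print(optional_results)
print(repeated_results)
```

With search() instead of fullmatch(), "hac" would still count as a hit, because it contains ha.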

Anchors
Anchors match a specific point in the text, but don't consume a character.
Two of the most commonly used anchors are ^ and $, for start and end of line, respectively.
There are also \< and \>, for start and end of word.
    egrep '^hack' resume
now matches hack, but only at the start of a line.
    egrep 'hack$' resume
matches hack, but only at the end of a line.
    egrep '^hack$' resume
matches lines with only the word hack in them.
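The three anchored searches, sketched in Python over a made-up list of lines. Python's re module does not support the word anchors \< and \>, so only ^ and $ are shown.

```python
import re

# Hypothetical lines standing in for the file "resume".
lines = ["hack the planet", "life hack", "hack", "shack"]

starts = [l for l in lines if re.search(r"^hack", l)]    # hack at start of line
ends = [l for l in lines if re.search(r"hack$", l)]      # hack at end of line
alone = [l for l in lines if re.search(r"^hack$", l)]    # nothing but hack

print(starts)
print(ends)
print(alone)
```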

Character Classes I
You can specify that one of several characters be matched by placing them in brackets.
[?!.] matches any one of ?, !, or .
Note the metacharacters ? and . have lost their special meaning inside the brackets.
[^?!.] matches anything but ?, !, or .
In this case, the ^ has a different meaning, logical not.
Inside a character class, ^ is just a literal anywhere but at the front.

Character Classes II
[a-z] The dash specifies a range of characters, so this matches a through z, lowercase.
[-!?.] Put the dash in front if you don't want it to mean “range”, and be just another literal.
You can quantify character classes, just like groupings.
[a-z]* matches zero or more lowercase letters.
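A short Python sketch of the character classes described here, run against invented sample strings. findall() returns every non-overlapping match in order.

```python
import re

sentence = "Really? Yes! Fine."

enders = re.findall(r"[?!.]", sentence)        # any one of ?, ! or .
between = re.findall(r"[^?!.]+", sentence)     # runs of anything except those
lower_runs = re.findall(r"[a-z]+", "Just Another Perl Hacker")  # lowercase ranges
dashes = re.findall(r"[-!?.]", "well-known?")  # leading dash is a literal

print(enders)
print(between)
print(lower_runs)
print(dashes)
```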

More Special Characters
\w matches a word character (alphanumeric and underscore)
\s matches a whitespace character
\d matches a digit character
\b matches a word boundary
These are all complemented by \W, \S, \D, and \B.
Some tools (like grep, sed, awk, and Emacs) use \< and \> anchors for the start and end of a word, respectively.
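These shorthand classes are available in Python's re module, so a quick sketch is possible there. The sample text is invented.

```python
import re

text = "call 555-1234 now"

digits = re.findall(r"\d+", text)     # runs of digit characters
words = re.findall(r"\w+", text)      # runs of word characters
pieces = re.split(r"\s+", text)       # split on runs of whitespace
bounded = re.findall(r"\bnow\b", "now and nowhere")  # whole word only

print(digits)
print(words)
print(pieces)
print(bounded)
```

The dash is not a word character, which is why \w+ splits 555-1234 into two matches.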

Greediness
One thing to be aware of: the quantifiers *, +, ? and {} will eat as much text as possible during a match.
They are called “greedy” for this reason.
Given the string “Just Another Perl Hacker”, the pattern /^J.*e/ matches Just Another Perl Hacke
We can make these quantifiers non-greedy in some implementations by adding a ?
So the pattern /^J.*?e/ now matches Just Anothe
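Python's re module is one implementation that supports the non-greedy ? suffix, so the difference can be sketched there. The patterns below use .* (any character, zero or more times) between the J and the e.

```python
import re

text = "Just Another Perl Hacker"

# Greedy: .* grabs as much as it can, then backs off to the LAST "e".
greedy = re.search(r"^J.*e", text).group()

# Non-greedy: .*? grabs as little as it can, stopping at the FIRST "e".
lazy = re.search(r"^J.*?e", text).group()

print(greedy)
print(lazy)
```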

Remembering Matches I
Enclosing portions of the pattern in parentheses will force the regex engine to “remember” the text actually matched, and store it for later use.
“Later use” can mean later in the same pattern, or after the match is complete.
You can have more than one parenthesized expression; they are stored in order.
Later in the same pattern, use \1, \2, etc.
After the regex has completed, use $1, $2, etc. (Perl)

Remembering Matches II
    s/(\d{3})-(\d{3})-(\d{4})/\2-\1-\3/
will swap phone number area code and exchange (first and second expressions).
    egrep -i '\<([a-z]+) +\1\>' resume
This will find all the doubled words in your resume.
But it does this one line at a time, and so can't find doubled words that cross line boundaries.
More sophisticated regex engines, like Perl's, can match across lines.
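Both ideas can be sketched in Python, whose engine can also match across lines when the whole text is held in one string. The phone number and sample text are invented; \b replaces the \< and \> word anchors, which Python does not support, and \s+ replaces the space so that a newline between the two words also counts.

```python
import re

# Swap the first and second remembered groups in the replacement.
swapped = re.sub(r"(\d{3})-(\d{3})-(\d{4})", r"\2-\1-\3", "212-555-1234")

# \1 inside the pattern must match the same text the first group matched.
text = "Paris in the\nthe spring is is nice"
doubled = re.findall(r"\b([a-z]+)\s+\1\b", text, flags=re.IGNORECASE)

print(swapped)
print(doubled)
```

The first doubled word found here spans a line break, which the line-at-a-time egrep version cannot see.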

Examples
Those are the basics of regular expressions.
Let's see some real-world examples.

Perl Example
Look through a directory of report files and extract the report name. Assume that the filenames are of the form "author-title-date.pdf".
    find . -name "*.pdf" | perl -pe 's/.\/\w+-(\w+)-.*/$1/' | sort | uniq
find passes its results to perl with a leading ./
The -p argument loops over each line of input, runs the code on it, and prints the result ($_).
We can use $1 instead of \1 here because the replacement side of the s/// is evaluated only after the pattern has matched.
sort sorts the names alphabetically.
uniq removes duplicate lines.
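A rough Python equivalent of the same extraction, as a sketch only. The paths are hypothetical stand-ins for what find would print, and the set plus sorted() plays the role of sort | uniq.

```python
import re

# Hypothetical output of: find . -name "*.pdf"
paths = [
    "./maxwell-regex-2007.pdf",
    "./smith-budget-2006.pdf",
    "./jones-regex-2005.pdf",
]

# author-title-date.pdf: keep only the title, captured in group 1.
pattern = re.compile(r"^\./\w+-(\w+)-.*$")

# The set removes duplicates; sorted() orders the titles alphabetically.
titles = sorted({pattern.sub(r"\1", p) for p in paths})

print(titles)
```

This relies on \w not matching the dash, so an author or title that itself contains a dash would break the pattern.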

Yet Another Perl Example
In-place edit of shell scripts, changing all occurrences of to making backup files as you go
    perl -p -i.bak -e cm.org!g' *.sh
We can use almost anything as a substitution delimiter, in place of /
Here, we use ! as the delimiter.
Note that we escape the dots so they don't match any character, just a dot.

Sed
    sed 's/^[ \t]*//'
Delete whitespace from the front of each line.
Use it like this:
    cat file | sed 's/^[ \t]*//' > altered_file
OR
    sed 's/^[ \t]*//' < file > altered_file
Sed is a filter, and so by default will accept input on standard input, and output on standard output.
It won't alter the input file in-place by default.
This will not work, because the shell truncates file before sed can read it:
    cat file | sed 's/^[ \t]*//' > file
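The same substitution can be sketched in Python on a string held in memory. The function name strip_leading_whitespace and the sample text are invented for illustration; re.MULTILINE makes ^ match at the start of every line, which is what sed does line by line.

```python
import re

def strip_leading_whitespace(text):
    """Delete spaces and tabs from the front of each line of text."""
    return re.sub(r"^[ \t]*", "", text, flags=re.MULTILINE)

sample = "  first\n\tsecond\nthird"
cleaned = strip_leading_whitespace(sample)

print(cleaned)
```

Because the whole input is read before anything is written, this approach does not run into the truncation problem of redirecting output onto the input file.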

Awk
Awk is also a stdin-to-stdout filter, like sed.
Awk deals well with columnar data.
    awk '/foo/{print $1,$3}' file
Prints the first and third fields of all lines in file that match the regex /foo/
    awk '$2~/foo/{print $1,$3}' file
Prints the first and third fields of all lines in file whose second field matches /foo/
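The two awk selections can be approximated in Python as follows. This is a sketch: the helper select and the sample rows are hypothetical, fields are split on whitespace as awk does by default, and every line is assumed to have at least three fields.

```python
import re

# Hypothetical whitespace-separated rows standing in for "file".
rows = ["alice foo 10", "bob bar 20", "foo baz 30"]

def select(lines, pattern, field=None):
    """Return (first, third) fields of lines where the pattern matches.

    With field=None the whole line is tested; otherwise only that
    1-based field is tested, like awk's $N ~ /pattern/.
    """
    regex = re.compile(pattern)
    picked = []
    for line in lines:
        fields = line.split()
        target = line if field is None else fields[field - 1]
        if regex.search(target):
            picked.append((fields[0], fields[2]))
    return picked

whole_line = select(rows, r"foo")             # like /foo/{print $1,$3}
second_field = select(rows, r"foo", field=2)  # like $2~/foo/{print $1,$3}

print(whole_line)
print(second_field)
```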

Grep
Grep finds patterns in the lines of files passed as arguments.
egrep is just grep -E, and handles “extended” regexes.
    egrep 'CRIT.+FW:' /var/log/messages
Prints all lines in /var/log/messages that are critical firewall entries.
    egrep -v -i 'crit.+fw:' /var/log/messages
Prints all lines in /var/log/messages that do not contain critical firewall entries. Case is ignored here with -i.
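A Python sketch of the two log searches. The log lines are invented and only loosely resemble syslog output; negating the test plays the role of -v, and re.IGNORECASE plays the role of -i.

```python
import re

# Hypothetical log lines standing in for /var/log/messages.
log = [
    "Jan 1 CRIT kernel FW: drop",
    "Jan 1 INFO sshd: login",
    "Jan 2 crit host fw: reject",
]

# Case-sensitive: only the upper-case entry is found.
critical = [l for l in log if re.search(r"CRIT.+FW:", l)]

# Inverted and case-insensitive: everything that is NOT a critical firewall entry.
not_critical = [l for l in log if not re.search(r"crit.+fw:", l, re.IGNORECASE)]

print(critical)
print(not_critical)
```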

Copyright & License Copyright (c) Doug Maxwell ( Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is at
