Regular Expressions A simple and powerful way to match characters Laurent Falquet, EPFL March, 2005 Swiss Institute of Bioinformatics Swiss EMBnet node.


1 Regular Expressions A simple and powerful way to match characters Laurent Falquet, EPFL March, 2005 Swiss Institute of Bioinformatics Swiss EMBnet node

2 Regular Expressions
What is a regular expression?
Literal (or normal) characters
  Alphanumeric: abc…ABC…0123...
  Punctuation: -_,.;:=()/+ *%&{}[]?!$'^|\<>"@#
Metacharacters
  Ex: ls *.java
Flavors… awk, egrep, Emacs, grep, Perl, POSIX, Tcl, PROSITE!

3 Example: PROSITE
Patterns are regular expressions
Pattern: <A-x-[ST](2)-x(0,1)-{V}
Perl regexp: ^A.[ST]{2}.?[^V]
Text: The sequence must start with an alanine, followed by any amino acid, followed by a serine or a threonine two times, followed by any amino acid or nothing, followed by any amino acid except a valine.
Only the syntax differs…
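The translation on this slide is mechanical enough to sketch in Perl. The helper name prosite_to_regex is hypothetical, and it only covers the elements used in this example (anchors, x, bracketed sets, braces for exclusion, and repeat counts), not the full PROSITE syntax.

```perl
use strict;
use warnings;

# Hypothetical helper: convert a simple PROSITE pattern into a Perl regexp string.
sub prosite_to_regex {
    my ($pattern) = @_;
    my $re = $pattern;
    $re =~ s/-//g;                        # '-' only separates elements
    $re =~ s/^</^/;                       # '<' anchors at the start of the sequence
    $re =~ s/>$/\$/;                      # '>' anchors at the end of the sequence
    $re =~ s/\{([A-Z]+)\}/[^$1]/g;        # {V}   -> [^V]
    $re =~ s/\((\d+(?:,\d+)?)\)/{$1}/g;   # (2)   -> {2}, (0,1) -> {0,1}
    $re =~ s/x/./g;                       # x     -> . (any amino acid)
    return $re;
}

my $regex = prosite_to_regex('<A-x-[ST](2)-x(0,1)-{V}');
print "$regex\n";                         # prints ^A.[ST]{2}.{0,1}[^V]
print "ASSTGA matches\n" if 'ASSTGA' =~ /$regex/;
```

The result uses .{0,1} where the slide writes .? but the two forms are equivalent.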

4 Regular Expressions (1)
In Perl: /…/
Start and end of line
  ^ start, $ end
Match any of several
  […] or (…|…)
Match 0, 1 or more
  . exactly 1 of any character
  ? 0 or 1
  + 1 or more
  * 0 or more
  {m,n} range
  [^…] negation inside a character class (!~ negates a whole match)
Examples
Match every instance of a SwissProt AC
  m/[OPQ][0-9][A-Z0-9]{3}[0-9]/;
  m/[OPQ]\d[A-Z0-9]{3}\d/;
Match every instance of a SwissProt ID
  m/[A-Z0-9]{2,5}_[A-Z0-9]{3,5}/;
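As a minimal sketch of "match every instance", the two patterns can be applied with /g in list context to collect all hits. The sample text is invented for illustration, and the \b word boundaries are an addition not present on the slide; they keep the AC pattern from matching inside longer words.

```perl
use strict;
use warnings;

# Invented sample text containing two accession numbers and two entry names.
my $text = 'Accessions P12345 and Q9Y6K9; entry names CASP9_HUMAN and NEMO_HUMAN.';

# In list context, /g returns every captured match.
my @acs = $text =~ /\b([OPQ][0-9][A-Z0-9]{3}[0-9])\b/g;
my @ids = $text =~ /\b([A-Z0-9]{2,5}_[A-Z0-9]{3,5})\b/g;

print "AC: @acs\n";
print "ID: @ids\n";
```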

5 Regular Expressions (2)
Escape character or back reference
  \char or \num
Shorthand
  \d digit [0-9]
  \s whitespace [space\f\n\r\t]
  \w word character [a-zA-Z0-9_]
  \D \S \W complement of \d \s \w
Byte notation
  \num character in octal
  \xnum character in hexadecimal
  \cchar control character
Match operator m/…/
  $var =~ m/colou?r/;
  $var !~ m/colou?r/;
Substitution operator s/…/…/
  $var =~ s/colou?r/couleur/;
Translate operator tr/…/…/
  $revcomp =~ tr/ACGT/tgca/;
Modifiers /…/#
  /i case insensitive
  /g global match
  Many others: /s, /m, /o, /x...
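A short sketch of the three operators on invented input. Note that tr only complements the bases; to obtain a reverse complement the string also has to be reversed, which is done here with Perl's built-in reverse (an addition to what the slide shows).

```perl
use strict;
use warnings;

my $var = 'The color of the colour chart';

# Match operator: count how many times the pattern occurs.
my $count = () = $var =~ /colou?r/g;

# Substitution operator: work on a copy so $var stays intact.
(my $french = $var) =~ s/colou?r/couleur/g;

# Translate operator: reverse the string, then complement each base.
my $dna = 'ATGCCG';
(my $revcomp = scalar reverse $dna) =~ tr/ACGT/tgca/;

print "$count\n$french\n$revcomp\n";
```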

6 Regular Expressions (3)
Grouping
  External reference
    $var =~ s/sp\:(\w\d{5})/swissprot AC=$1/;
  Internal reference
    $var =~ s/tr\:(\w\d{5})\|\1/trembl AC=$1/;
  Numbering: $1 to $9, $10 and more if needed...
Exercises
  Create a regexp to recognize any pseudo IP address: 012.345.678.912
  Create a regexp to recognize any email address: Jean.Dupond@isb-sib.ch
  Create a regexp to change any HTML tag to another ->
  On sib-dea: use visual_regexp-1.2.tcl to check your regular expressions (requires X-windows)
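The difference between the two kinds of reference can be seen on a few invented lines: $1 is used in the replacement, while \1 inside the pattern requires the same text to appear twice.

```perl
use strict;
use warnings;

# Invented input lines; the last one has two different accessions.
my @lines = (
    'sp:P12345 kinase',
    'tr:Q12345|Q12345 putative kinase',
    'tr:Q12345|Q54321 mismatch',
);

my @out;
for my $line (@lines) {
    (my $var = $line) =~ s/sp:(\w\d{5})/swissprot AC=$1/;   # external reference $1
    $var =~ s/tr:(\w\d{5})\|\1/trembl AC=$1/;                # internal reference \1
    push @out, $var;
}

print "$_\n" for @out;
```

The third line is left unchanged because \1 only matches a repeat of what the first group captured.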

7 Regular Expressions (4)

8 Solution RegExp
  /(\d{1,3}\.){3}\d{1,3}/
  /\w+\.\w+\@\w+\-?\w+\.[a-z]{2,4}/
  /\ /\ / generalized:  address = \w+
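A quick way to check the pseudo IP solution is to run it against a few candidates. This sketch groups the repeated "digits plus dot" unit with parentheses, since square brackets define a set of single characters rather than a group. The ^ and $ anchors are an added assumption, so that the whole string must be an address.

```perl
use strict;
use warnings;

# Three repetitions of "1-3 digits and a dot", then 1-3 digits.
my $ip_re = qr/^(?:\d{1,3}\.){3}\d{1,3}$/;

my %is_ip;
for my $candidate ('012.345.678.912', '10.0.0.1', '1.2.3', '1234.5.6.7') {
    $is_ip{$candidate} = $candidate =~ $ip_re ? 1 : 0;
    print "$candidate: ", ($is_ip{$candidate} ? 'ok' : 'rejected'), "\n";
}
```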

9 Perl In-liners

10 In-liners: some options
  -a autosplit (only with -n or -p)
  -c check syntax
  -d debugger
  -e pass script lines
  -h help
  -i direct editing of a file
  -n loop without print
  -p loop with print
  -v version
  …
Example: perl -e 'print qq(hello world\n);'

11 In-liners: -n and -p
perl -pe 's/\r/\n/g' file
is equivalent to:
  open READ, "file";
  while (<READ>) {
      s/\r/\n/g;
      print;
  }
  close(READ);
perl -i -pe 's/\r/\n/g' file
Warning: the -i option modifies the file directly
perl -ne is the same without the "print"

12 In-liners: -a (only with -n or -p)
perl -ane 'print @F, "\n";' file
is equivalent to:
  open READ, "file";
  while (<READ>) {
      @F = split(' ');
      print @F, "\n";
  }
  close(READ);
Example:
  hits -b 'sw' -o pff2 prf:CARD | perl -ane 'print join("\t", reverse(@F)),"\n";'

13 In-liners: -a (only with -n or -p)
hits -b 'sw' -o pff2 prf:CARD
  sw:ICEA_XENLA 1 90 prf:CARD 5 -1 18.553
  sw:RIK2_MOUSE 435 513 prf:CARD 5 -11 15.058
  sw:CARC_HUMAN 1 88 prf:CARD 6 -1 15.395
  sw:NAL1_HUMAN 1380 1463 prf:CARD 7 -1 15.058
  sw:ASC_HUMAN 113 195 prf:CARD 7 -2 15.374
  sw:CAR8_HUMAN 347 430 prf:CARD 8 -1 18.343
  sw:CARF_HUMAN 134 218 prf:CARD 9 -1 12.932
hits -b 'sw' -o pff2 prf:CARD | perl -ane 'print join("\t", reverse(@F)),"\n";'
  18.553 -1 5 prf:CARD 90 1 sw:ICEA_XENLA
  15.058 -11 5 prf:CARD 513 435 sw:RIK2_MOUSE
  15.395 -1 6 prf:CARD 88 1 sw:CARC_HUMAN
  15.058 -1 7 prf:CARD 1463 1380 sw:NAL1_HUMAN
  15.374 -2 7 prf:CARD 195 113 sw:ASC_HUMAN
  18.343 -1 8 prf:CARD 430 347 sw:CAR8_HUMAN
  12.932 -1 9 prf:CARD 218 134 sw:CARF_HUMAN
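Without access to the hits program, the column reversal can still be tried as a standalone script. This is a sketch that takes two of the rows above as in-memory strings and does by hand what -a does implicitly: split each line on whitespace, here into a local @F.

```perl
use strict;
use warnings;

# Two rows copied from the output above, standing in for the piped input.
my @lines = (
    'sw:ICEA_XENLA 1 90 prf:CARD 5 -1 18.553',
    'sw:RIK2_MOUSE 435 513 prf:CARD 5 -11 15.058',
);

my @reversed;
for my $line (@lines) {
    my @F = split ' ', $line;               # what -a does for each input line
    push @reversed, join("\t", reverse @F); # same expression as in the one-liner
}

print "$_\n" for @reversed;
```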

14 In-liners: examples
perl -e 'print int(rand(100)),"\n" for 1..100' | perl -e '$x{$_}=1 while <>;print sort {$a <=> $b} keys %x'
is equivalent to:
  for ($i=0; $i<100; $i++) {
      $nb = int(rand(100));
      $hash{$nb} = 1;
  }
  print "$_\n" for sort {$a <=> $b} keys %hash;

15 In-liners: extract FASTA from SP
cat /db/proteome/ECOLI.dat | perl -ne 'if (/^ID +(\w+)/) {print ">$1\n";} if (/^ /) {s/ //g; print}'
is equivalent to:
  open (READ, "/db/proteome/ECOLI.dat");   # open file
  while ($line = <READ>) {                 # read line by line until the end
      if ($line =~ /^ID +(\w+)/) {
          print ">$1\n";                   # print fasta header
      }
      if ($line =~ /^ /) {
          $line =~ s/ //g;                 # remove spaces
          print $line;                     # print sequence line
      }
  }
  close(READ);

16 In-liners: your turn…
Create an In-liner that extracts non-redundant FASTA format sequences from a redundant database in SwissProt format
cat /db/proteome/ECOLI.dat | perl -ne 'if (/^ID +(\w+)/) {print ">$1\n";} if (/^ /) {s/ //g; print}' | perl -e 'while(<>) { if (/>/) { $i=$_; $x{$i}=""} $x{$i}.=$_} print values %x'
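The same idea can be written as a short script for readers who want to step through it. This is a sketch on invented SwissProt-style lines (the entry names and sequences are made up), and unlike the one-liner it keeps the entries in first-seen order, because values of a hash come back in no particular order.

```perl
use strict;
use warnings;

# Invented records: ID line, sequence line starting with spaces, terminator.
my @records = (
    "ID   TEST1_ECOLI   STANDARD;   PRT;   10 AA.\n",
    "     MKTAYIAKQR\n",
    "//\n",
    "ID   TEST2_ECOLI   STANDARD;   PRT;   8 AA.\n",
    "     GGSLLVAA\n",
    "//\n",
    "ID   TEST1_ECOLI   STANDARD;   PRT;   10 AA.\n",   # redundant entry
    "     MKTAYIAKQR\n",
    "//\n",
);

my (%seq, @order, $id);
for my $line (@records) {
    if ($line =~ /^ID +(\w+)/) {
        $id = $1;
        push @order, $id unless exists $seq{$id};   # remember first occurrence
        $seq{$id} = '';                              # a repeated ID overwrites
    }
    elsif (defined $id && $line =~ /^ /) {
        (my $residues = $line) =~ s/\s+//g;          # strip spaces and newline
        $seq{$id} .= $residues;
    }
}

my $fasta = join '', map { ">$_\n$seq{$_}\n" } @order;
print $fasta;
```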

