Presentation is loading. Please wait.

Presentation is loading. Please wait.

Regular Expressions A simple and powerful way to match characters Laurent Falquet, EPFL March, 2005 Swiss Institute of Bioinformatics Swiss EMBnet node.

Similar presentations


Presentation on theme: "Regular Expressions A simple and powerful way to match characters Laurent Falquet, EPFL March, 2005 Swiss Institute of Bioinformatics Swiss EMBnet node."— Presentation transcript:

1 Regular Expressions A simple and powerful way to match characters Laurent Falquet, EPFL March, 2005 Swiss Institute of Bioinformatics Swiss EMBnet node

2 Regular Expressions What is a regular expression? Literal (or normal characters) Alphanumeric abc…ABC… Punctuation -_,.;:=()/+ Metacharacters Ex: ls *.java Flavors… awk, egrep, Emacs, grep, Perl, POSIX, Tcl, PROSITE !

3 Example: PROSITE Patterns are regular expressions Pattern: { "@context": "http://schema.org", "@type": "ImageObject", "contentUrl": "http://images.slideplayer.com/3440842/12/slides/slide_2.jpg", "name": "Example: PROSITE Patterns are regular expressions Pattern:

4 Regular Expressions (1) In Perl: /…/  Start and End of line ^ start, $ end  Match any of several […] or (…|…)  Match 0, 1 or more. 1 of any ? 0 or or more * 0 or more {m,n} range ! negation Examples Match every instance of a SwissProt AC m/[OPQ][0-9][A-Z0-9]{3}[0-9]/; m/ [OPQ]\d[A-Z0-9]{3}\d/; Match every instance of a SwissProt ID m/[A-Z0-9]{2,5}_[A-Z0-9]{3,5}/;

5 Regular Expressions (2) Escape character or back reference  \char or \num Shorthand  \d digit [0-9]  \s whitespace [space\f\n\r\t]  \w character [a-zA-Z0-9_]  \D\S\W complement of \d\s\w Byte notation  \num character in octal  \xnum character in hexadecimal  \cchar control character Match operator  m/…/ $var =~ m/colou?r/; $var !~ m/colou?r/; Substitution operator  s/…/…/ $var =~ s/colou?r/couleur/; Translate operator  tr/…/…/ $revcomp =~ tr/ACGT/tgca/; Modifiers /…/# /i case insensitive /g global match Many other /s,/m,/o,/x...

6 Regular Expressions (3) Grouping  External reference $var =~ s/sp\:(\w\d{5})/swissprot AC=$1/;  Internal reference $var =~ s/tr\:(\w\d{5})\|\1/trembl AC=$1/;  Numbering $1 to $9 $10 to more if needed... Exercises  Create a regexp to recognize any pseudo IP address:  Create a regexp to recognize any address:  Create a regexp to change any HTML tag to another ->  On sib-dea: use visual_regexp-1.2.tcl to check your regular expressions (requires X-windows)

7 Regular Expressions (4)

8 Solution RegExp /[\d{1,3}\.]{3}\d{1,3}/ /\ /\ / generalized:  address = \w+

9 Perl In-liners

10 In-liners: some options -a autosplit (only with -n or -p) -c check syntax -d debugger -e pass script lines -h help -i direct editing of a file -n loop without print -p loop with print -v version … Example: perl -e 'print qq(hello world\n);'

11 In-liners: -n and -p  perl -pe ‘s/\r/\n/g’ is equivalent to: open READ, “file”; while ( ) { s/\r/\n/g; print; } close(READ); perl -i -pe ‘s/\r/\n/g’ Warning: the -i option modifies the file directly perl -ne is the same without the “print”

12 In-liners: -a (only with -n or -p) perl -ane “\n”;’ is equivalent to: open READ, “file”; while ( ) = split(‘ ‘); “\n”; } close(READ); Example: hits -b 'sw' -o pff2 prf:CARD | perl -ane 'print join("\t",

13 In-liners: -a (only with -n or -p) sw:ICEA_XENLA 1 90 prf:CARD sw:RIK2_MOUSE prf:CARD sw:CARC_HUMAN 1 88 prf:CARD sw:NAL1_HUMAN prf:CARD sw:ASC_HUMAN prf:CARD sw:CAR8_HUMAN prf:CARD sw:CARF_HUMAN prf:CARD hits -b 'sw' -o pff2 prf:CARD | perl -ane 'print join("\t", hits -b 'sw' -o pff2 prf:CARD prf:CARD 90 1 sw:ICEA_XENLA prf:CARD sw:RIK2_MOUSE prf:CARD 88 1 sw:CARC_HUMAN prf:CARD sw:NAL1_HUMAN prf:CARD sw:ASC_HUMAN prf:CARD sw:CAR8_HUMAN prf:CARD sw:CARF_HUMAN

14 In-liners: examples perl -e ‘print int(rand(100)),"\n" for ' | perl -e '$x{$_}=1 while <>;print sort {$a $b} keys %x' for($i=0;$i<100;$i++) { $nb = int(rand(100)); $hash{$nb} = 1; } print sort {$a $b} keys %hash;

15 In-liners: extract FASTA from SP cat /db/proteome/ECOLI.dat | perl -ne ‘if (/^ID +(\w+)/) {print">$1\n";} if(/^ /) {s/ //g; print}’ open (READ, “/db/proteome/ECOLI.dat”); # open file while ($line= ) { # read line by line until the end if($line=~ /^ID +(\w+)/) { print “>$1\n”; } # print fasta header if($line=~ /^ /) { $line =~ s/ //g; # remove spaces print $line; # print sequence line } close(READ);

16 In-liners: your turn… Create an In-liner that extracts non-redundant FASTA format sequences from a redundant database in SwissProt format cat /db/proteome/ECOLI.dat | perl -ne ' if (/^ID +(\w+)/) {print ">$1\n”;} if(/^ /) {s/ //g; print}' | perl -e 'while(<>) { if (/>/) { $i=$_; $x{$i}=""} $x{$i}.=$_} print values %x’


Download ppt "Regular Expressions A simple and powerful way to match characters Laurent Falquet, EPFL March, 2005 Swiss Institute of Bioinformatics Swiss EMBnet node."

Similar presentations


Ads by Google