Published by Kennedi Somerset. Modified over 9 years ago.
1
Regular Expressions
A simple and powerful way to match characters
Laurent Falquet, EPFL
March, 2005
Swiss Institute of Bioinformatics
Swiss EMBnet node
2
Regular Expressions
What is a regular expression?
  Literal (or normal) characters
    Alphanumeric: abc…ABC…0123...
    Punctuation: -_,.;:=()/+ *%&{}[]?!$'^|\<>"@#
  Metacharacters
    Ex: ls *.java
  Flavors…
    awk, egrep, Emacs, grep, Perl, POSIX, Tcl, PROSITE!
3
Example: PROSITE
Patterns are regular expressions
  Pattern:     <A-x-[ST](2)-x(0,1)-{V}
  Perl regexp: ^A.[ST]{2}.?[^V]
  Text: The sequence must start with an alanine, followed by any amino acid, followed by a serine or a threonine, two times, followed by any amino acid or nothing, followed by any amino acid except a valine.
Only the syntax differs…
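The correspondence between the two syntaxes can be sketched as a small translation routine. This is a minimal sketch, not a full PROSITE parser: the helper name prosite_to_regex is hypothetical, and it only handles the elements used in the pattern above (the < and > anchors, x, [..], {..}, and the (n) and (n,m) repeats). The test sequences are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: translate a simple PROSITE pattern into a Perl regexp string.
sub prosite_to_regex {
    my ($pattern) = @_;
    my $re = $pattern;
    $re =~ s/-//g;                        # the '-' separators carry no meaning
    $re =~ s/\{([A-Z]+)\}/[^$1]/g;        # {V}   -> [^V]  (done before repeats become {..})
    $re =~ s/\((\d+(?:,\d+)?)\)/{$1}/g;   # (2)   -> {2},  (0,1) -> {0,1}
    $re =~ s/x/./g;                       # x     -> any amino acid
    $re =~ s/^</^/;                       # <     -> start of sequence
    $re =~ s/>$/\$/;                      # >     -> end of sequence
    return $re;
}

my $re = prosite_to_regex('<A-x-[ST](2)-x(0,1)-{V}');
print "$re\n";                            # ^A.[ST]{2}.{0,1}[^V]

# Made-up test sequences
for my $seq (qw(ASSTG ASSTV)) {
    printf "%s %s\n", $seq, ($seq =~ /$re/ ? 'matches' : 'does not match');
}
```

The generated {0,1} is equivalent to the ? used in the hand-written Perl regexp.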
4
Regular Expressions (1)
In Perl: /…/
Start and end of line
  ^  start
  $  end
Match any of several
  […] or (…|…)
Match 0, 1 or more
  .      1 of any
  ?      0 or 1
  +      1 or more
  *      0 or more
  {m,n}  range
  !      negation (as in the !~ operator)
Examples
  Match every instance of a SwissProt AC
    m/[OPQ][0-9][A-Z0-9]{3}[0-9]/;
    m/[OPQ]\d[A-Z0-9]{3}\d/;
  Match every instance of a SwissProt ID
    m/[A-Z0-9]{2,5}_[A-Z0-9]{3,5}/;
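The two SwissProt patterns above can be tried on a line of text with a global match in list context. This is a minimal sketch; the input line and the accession numbers in it are made up for illustration.

```perl
use strict;
use warnings;

# Patterns from the slide, precompiled with qr//
my $ac_re = qr/[OPQ]\d[A-Z0-9]{3}\d/;
my $id_re = qr/[A-Z0-9]{2,5}_[A-Z0-9]{3,5}/;

# Made-up input line
my $line = 'ID ICEA_XENLA  AC P12345; Q00001;';

# With /g in list context, the match returns every captured instance
my @acs = $line =~ /($ac_re)/g;
my @ids = $line =~ /($id_re)/g;

print "AC: @acs\n";   # AC: P12345 Q00001
print "ID: @ids\n";   # ID: ICEA_XENLA
```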
5
Regular Expressions (2)
Escape character or back reference
  \char or \num
Shorthand
  \d      digit [0-9]
  \s      whitespace [space\f\n\r\t]
  \w      word character [a-zA-Z0-9_]
  \D\S\W  complement of \d\s\w
Byte notation
  \num    character in octal
  \xnum   character in hexadecimal
  \cchar  control character
Match operator m/…/
  $var =~ m/colou?r/;
  $var !~ m/colou?r/;
Substitution operator s/…/…/
  $var =~ s/colou?r/couleur/;
Translate operator tr/…/…/
  $revcomp =~ tr/ACGT/tgca/;
Modifiers /…/#
  /i  case insensitive
  /g  global match
  Many others: /s, /m, /o, /x...
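The three operators can be seen side by side in a short script. This is a minimal sketch with hypothetical helper names and made-up input strings. Note that tr/ACGT/tgca/ only complements each base, so the sketch also reverses the string to obtain a reverse complement.

```perl
use strict;
use warnings;

# Match operator with /g: count every occurrence of color/colour
sub count_matches {
    my ($text) = @_;
    my $n = () = $text =~ /colou?r/g;
    return $n;
}

# Substitution operator with /g: replace every occurrence
sub to_french {
    my ($text) = @_;
    $text =~ s/colou?r/couleur/g;
    return $text;
}

# Translate operator: reverse the string, then complement base by base
sub revcomp {
    my ($dna) = @_;
    my $rc = reverse $dna;      # scalar assignment, so the string itself is reversed
    $rc =~ tr/ACGT/tgca/;
    return $rc;
}

print count_matches('color or colour'), "\n";   # 2
print to_french('color or colour'), "\n";       # couleur or couleur
print revcomp('ATGCCG'), "\n";                  # cggcat
print "case-insensitive match\n" if 'COLOR' =~ /colou?r/i;
```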
6
Regular Expressions (3)
Grouping
  External reference
    $var =~ s/sp\:(\w\d{5})/swissprot AC=$1/;
  Internal reference
    $var =~ s/tr\:(\w\d{5})\|\1/trembl AC=$1/;
  Numbering
    $1 to $9
    $10 and beyond if needed...
Exercises
  Create a regexp to recognize any pseudo IP address: 012.345.678.912
  Create a regexp to recognize any email address: Jean.Dupond@isb-sib.ch
  Create a regexp to change any HTML tag to another ->
On sib-dea: use visual_regexp-1.2.tcl to check your regular expressions (requires X-windows)
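The difference between the two kinds of reference can be checked on a small string: $1 is used outside the pattern (in the replacement), while \1 is used inside the pattern to require the same text again. This is a minimal sketch; the helper name and the input string are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper applying both substitutions from the slide
sub expand_refs {
    my ($var) = @_;
    $var =~ s/sp\:(\w\d{5})/swissprot AC=$1/;       # $1: external reference
    $var =~ s/tr\:(\w\d{5})\|\1/trembl AC=$1/;      # \1: same AC must appear twice
    return $var;
}

print expand_refs('sp:P12345 tr:Q12345|Q12345'), "\n";
# swissprot AC=P12345 trembl AC=Q12345

# When the two accessions differ, \1 fails and the tr: part is left untouched
print expand_refs('tr:Q12345|Q54321'), "\n";
# tr:Q12345|Q54321
```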
7
Regular Expressions (4)
8
Solution RegExp
  /(\d{1,3}\.){3}\d{1,3}/
  /\w+\.\w+\@\w+\-?\w+\.[a-z]{2,4}/
  /\ /\ /
  generalized: address = \w+
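The first two solutions can be checked against the exercise strings. This is a minimal sketch: the IP pattern uses a parenthesised group for the repeated "digits plus dot" unit (square brackets would define a character class instead), and the ^ and $ anchors are an added assumption so that incomplete addresses are rejected.

```perl
use strict;
use warnings;

# Pseudo IP address: three "1 to 3 digits + dot" groups, then 1 to 3 digits
my $ip_re    = qr/^(?:\d{1,3}\.){3}\d{1,3}$/;
# Email address of the form first.last@domain.tld, as in the exercise
my $email_re = qr/\w+\.\w+\@\w+\-?\w+\.[a-z]{2,4}/;

for my $candidate ('012.345.678.912', '012.345.678') {
    printf "%-16s %s\n", $candidate, ($candidate =~ $ip_re ? 'ok' : 'rejected');
}
print "email ok\n" if 'Jean.Dupond@isb-sib.ch' =~ $email_re;
```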
9
Perl In-liners
10
In-liners: some options
  -a  autosplit (only with -n or -p)
  -c  check syntax
  -d  debugger
  -e  pass script lines
  -h  help
  -i  direct editing of a file
  -n  loop without print
  -p  loop with print
  -v  version
  …
Example:
  perl -e 'print qq(hello world\n);'
11
In-liners: -n and -p
  perl -pe 's/\r/\n/g' file
is equivalent to:
  open READ, "file";
  while (<READ>) {
    s/\r/\n/g;
    print;
  }
  close(READ);

  perl -i -pe 's/\r/\n/g' file
Warning: the -i option modifies the file directly
perl -ne is the same without the "print"
12
In-liners: -a (only with -n or -p)
  perl -ane 'print @F, "\n";' file
is equivalent to:
  open READ, "file";
  while (<READ>) {
    @F = split(' ');
    print @F, "\n";
  }
  close(READ);
Example:
  hits -b 'sw' -o pff2 prf:CARD | perl -ane 'print join("\t", reverse(@F)),"\n";'
13
In-liners: -a (only with -n or -p)

hits -b 'sw' -o pff2 prf:CARD
  sw:ICEA_XENLA     1    90  prf:CARD  5   -1  18.553
  sw:RIK2_MOUSE   435   513  prf:CARD  5  -11  15.058
  sw:CARC_HUMAN     1    88  prf:CARD  6   -1  15.395
  sw:NAL1_HUMAN  1380  1463  prf:CARD  7   -1  15.058
  sw:ASC_HUMAN    113   195  prf:CARD  7   -2  15.374
  sw:CAR8_HUMAN   347   430  prf:CARD  8   -1  18.343
  sw:CARF_HUMAN   134   218  prf:CARD  9   -1  12.932

hits -b 'sw' -o pff2 prf:CARD | perl -ane 'print join("\t", reverse(@F)),"\n";'
  18.553   -1  5  prf:CARD    90     1  sw:ICEA_XENLA
  15.058  -11  5  prf:CARD   513   435  sw:RIK2_MOUSE
  15.395   -1  6  prf:CARD    88     1  sw:CARC_HUMAN
  15.058   -1  7  prf:CARD  1463  1380  sw:NAL1_HUMAN
  15.374   -2  7  prf:CARD   195   113  sw:ASC_HUMAN
  18.343   -1  8  prf:CARD   430   347  sw:CAR8_HUMAN
  12.932   -1  9  prf:CARD   218   134  sw:CARF_HUMAN
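What -a does for each line can be reproduced in a plain script: split(' ') breaks the line on whitespace into @F, and the fields are then printed in reverse order. This is a minimal sketch; the helper name reverse_fields is hypothetical, and the input is the first line of the hits output above.

```perl
use strict;
use warnings;

# Hypothetical helper: the body of the -ane in-liner applied to a single line
sub reverse_fields {
    my ($line) = @_;
    my @F = split ' ', $line;            # same split that -a performs
    return join("\t", reverse @F);
}

print reverse_fields("sw:ICEA_XENLA 1 90 prf:CARD 5 -1 18.553\n"), "\n";
```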
14
In-liners: examples
  perl -e 'print int(rand(100)),"\n" for 1..100' | perl -e '$x{$_}=1 while <>;print sort {$a <=> $b} keys %x'

  for($i=0;$i<100;$i++) {
    $nb = int(rand(100));
    $hash{$nb} = 1;
  }
  print sort {$a <=> $b} keys %hash;
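Both versions rely on the same idea: hash keys are unique, so storing each number as a key removes duplicates, and sorting the keys with <=> orders them numerically. A minimal sketch with a fixed input instead of random numbers, so the result is reproducible; the helper name unique_sorted is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: remove duplicates through hash keys, then sort numerically
sub unique_sorted {
    my %seen;
    $seen{$_} = 1 for @_;
    return sort { $a <=> $b } keys %seen;
}

my @numbers = unique_sorted(42, 7, 42, 100, 7, 3);
print "@numbers\n";   # 3 7 42 100
```

With a plain sort (no comparison block) the keys would be ordered as strings, which would place 100 before 3.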
15
In-liners: extract FASTA from SP
  cat /db/proteome/ECOLI.dat | perl -ne 'if (/^ID +(\w+)/) {print ">$1\n";} if (/^ /) {s/ //g; print}'

  open (READ, "/db/proteome/ECOLI.dat");   # open file
  while ($line = <READ>) {                 # read line by line until the end
    if ($line =~ /^ID +(\w+)/) {
      print ">$1\n";                       # print fasta header
    }
    if ($line =~ /^ /) {
      $line =~ s/ //g;                     # remove spaces
      print $line;                         # print sequence line
    }
  }
  close(READ);
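The same logic can be tried without access to /db/proteome/ECOLI.dat by running it on an in-memory string. This is a minimal sketch: the helper name sp_to_fasta is hypothetical, and the entry below is made up and heavily shortened, keeping only the ID line, the SQ line, one sequence line and the terminator.

```perl
use strict;
use warnings;

# Hypothetical helper: convert SwissProt-format text to FASTA
sub sp_to_fasta {
    my ($text) = @_;
    my $fasta = '';
    for my $line (split /^/, $text) {        # split into lines, keeping the newlines
        if ($line =~ /^ID +(\w+)/) {
            $fasta .= ">$1\n";               # fasta header from the ID line
        }
        if ($line =~ /^ /) {                 # sequence lines start with spaces
            (my $seq = $line) =~ s/ //g;     # remove spaces
            $fasta .= $seq;
        }
    }
    return $fasta;
}

# Made-up, shortened entry
my $entry = <<'END';
ID   TEST1_ECOLI   STANDARD;   PRT;   20 AA.
SQ   SEQUENCE   20 AA;
     MKTAYIAKQR QISFVKSHFS
//
END

print sp_to_fasta($entry);
```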
16
In-liners: your turn…
Create an In-liner that extracts non-redundant FASTA format sequences from a redundant database in SwissProt format
  cat /db/proteome/ECOLI.dat |
  perl -ne 'if (/^ID +(\w+)/) {print ">$1\n";} if (/^ /) {s/ //g; print}' |
  perl -e 'while(<>) { if (/>/) { $i=$_; $x{$i}=""} $x{$i}.=$_} print values %x'
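The second in-liner removes redundancy by using the FASTA header as a hash key, so a repeated entry overwrites the earlier copy. The sketch below shows the same idea as a function, with one variation that is not in the in-liner: it remembers the order in which headers first appear, because values %x returns records in no particular order. The helper name and the input records are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: keep one record per FASTA header
sub nonredundant_fasta {
    my ($fasta) = @_;
    my (%record, @order, $header);
    for my $line (split /^/, $fasta) {
        if ($line =~ /^>/) {
            $header = $line;
            push @order, $header unless exists $record{$header};
            $record{$header} = '';                   # restart the record, as the in-liner does
        }
        $record{$header} .= $line if defined $header;
    }
    return join '', @record{@order};                 # hash slice, first-seen order
}

# Made-up redundant input: entry A appears twice
print nonredundant_fasta(">A\nMK\n>B\nTT\n>A\nMK\n");
```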