
1 Unix Programming Environment, Part 3-4: Regular Expression and Pattern Matching
Prepared by Xu Zhenya (xzy@buaa.edu.cn)
Draft: Xu Zhenya (2002/10/01)
Rev1.0: Xu Zhenya (2002/10/09)

2 Unix Programming Environment, Dept. of CSE, BUAA
Agenda
- 1. An Introduction to Regular Expression in UNIX
  - BRE & ERE, GNU-RE
- 2. grep, egrep, fgrep
- 3. sed
- Chapter 4 (4.1 & 4.2)

3 An Introduction to Regular Expression (1)
- UNIX commands combined with REs let us perform three tasks:
  - Pattern matching
    - Search for a particular pattern; the RE specifies the pattern.
    - UNIX commands that search: ed, ex, sed, vi, grep, awk
  - Modification
    - Search for a particular pattern and change it; the RE specifies the pattern and sometimes how to change it.
    - UNIX commands that search and modify: ed, ex, sed, vi
  - Programming
    - awk provides a programming language that can use REs.
    - Perl, Python, Tcl, etc.
    - lex (flex in GNU tools)
    - POSIX defines a standard regular expression library, included in libc: regcomp, regexec, ...
- Three different regular expression definitions:
  - The shells: for filename/pathname expansion
  - Simple/Basic (BRE): grep, sed, vi
  - Extended (ERE): egrep, awk, perl, etc.

4 Meta-characters
- Meta-characters can be divided into three categories:
  - Matching characters: the primary building blocks of REs
  - Grouping and repetition characters
  - Tagging and back-referencing
- Some RE meta-characters are also shell meta-characters. For example:
  $ egrep r* /etc/passwd    # the shell performs filename expansion on r* before egrep runs
- So RE meta-characters should be quoted on the command line.
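The quoting problem above can be sketched as follows. This is only an illustration, assuming a POSIX shell; the scratch directory and the file names `rfile` and `data.txt` are made up for the demonstration.

```shell
# Work in a throwaway directory so the glob has something to expand to.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf 'root:x:0:0\ndaemon:x:1:1\n' > data.txt
: > rfile                        # an empty file whose name starts with "r"

# Unquoted: the shell rewrites r* to the matching file name before any command runs.
unquoted=$(echo r*)

# Quoted: the pattern r* reaches grep intact (zero or more "r", so every line matches).
quoted_count=$(grep -c -E 'r*' data.txt)

echo "unquoted r* became: $unquoted"
echo "lines matched by quoted 'r*': $quoted_count"

cd / && rm -rf "$tmp"
```

Single quotes are the safest choice for REs, because the shell leaves everything inside them untouched.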

5 Matching Characters (1)

Character   What it matches
c           any single character that is not an RE special character matches itself
\           removes the special meaning from an RE special character
.           matches any single character
^           matches the start of the line
$           matches the end of the line
[chars]     matches any ONE character within the square brackets
[^chars]    matches any ONE character NOT within the square brackets

6 Matching Characters (2)
- Example REs

RE            What it matches
hello         the string hello
...\....      any three characters, followed by a literal ., followed by three more characters
^...\....     the same as the previous one, but it must appear at the start of the line
^hello$       any line that contains hello ONLY
[a-z][^a-z]   any two characters where the first is in the range a-z and the second is not
\[\]\\        the characters '[', ']', '\' in that order and contiguous
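A few of the example REs above can be tried with grep. This is a small sketch, not part of the original slides: the sample strings and the helper name `sample` are assumptions, and `LC_ALL=C` is set so that `[a-z]` means exactly the ASCII lowercase letters.

```shell
LC_ALL=C; export LC_ALL    # keep range expressions such as [a-z] predictable

# Hypothetical sample input: one candidate string per line.
sample() {
  printf '%s\n' 'hello' 'hello world' 'abc.def' 'xabc.def' 'aB' 'ab'
}

contains_hello=$(sample | grep -c 'hello')          # lines containing hello
only_hello=$(sample | grep -c '^hello$')            # lines that are exactly hello
dotted=$(sample | grep '^...\....')                 # 3 characters, a literal dot, 3 more, at line start
lower_then_other=$(sample | grep -c '[a-z][^a-z]')  # a lowercase letter followed by anything else

echo "contains hello: $contains_hello, only hello: $only_hello"
echo "dotted: $dotted"
echo "lowercase then other: $lower_then_other"
```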

7 Grouping and Repetition Characters (1)

Character(s)   Purpose
*              match zero or more occurrences of the previous RE
+              match one or more occurrences of the previous RE
?              match zero or one occurrence of the previous RE
\{n\}          match exactly n occurrences of the previous RE
\{n,\}         match at least n occurrences of the previous RE
\{n,m\}        match between n and m occurrences of the previous RE
|              match one of two different REs (alternation)
( )            group a collection of REs

Note: +, ?, | and ( ) are ERE operators. The interval is shown in its BRE form; in ERE it is written without backslashes, e.g. {n,m}.

8 Grouping and Repetition Characters (2)
- Examples
  - OO+
    Match two or more Os.
  - /bin/(tcsh|bash)$
    Match /bin/tcsh or /bin/bash occurring at the end of the line.
  - ^[^:]*:[^:]*:[0-9]:
    - the start of a line (^), followed by
    - 0 or more characters which aren't :s ([^:]*), followed by
    - a : (:), followed by
    - 0 or more characters which aren't :s ([^:]*), followed by
    - a : (:), followed by
    - a single digit ([0-9]), followed by
    - a : (:)
  - (\+|-)?[0-9]+
  - [+-]?[0-9]+
    Both match an optionally signed integer (a plus, a minus, or nothing, followed by one or more digits).
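The ERE examples above can be checked with `grep -E`. The passwd-style records below are invented for illustration (they are not read from the real /etc/passwd), and the helper name `records` is an assumption.

```shell
LC_ALL=C; export LC_ALL

# Hypothetical passwd-style records.
records() {
  printf '%s\n' \
    'root:x:0:0:root:/root:/bin/bash' \
    'alice:x:1001:1001::/home/alice:/bin/tcsh' \
    'daemon:x:1:1::/:/usr/sbin/nologin'
}

# Two or more capital Os.
many_os=$(printf '%s\n' 'O' 'OO' 'OOO' | grep -E -c 'OO+')

# Login shell is /bin/tcsh or /bin/bash.
shell_users=$(records | grep -E -c '/bin/(tcsh|bash)$')

# Third field is a single digit (here: the UID).
one_digit_uid=$(records | grep -E -c '^[^:]*:[^:]*:[0-9]:')

# Optionally signed integers, anchored so the whole line must be a number.
signed=$(printf '%s\n' '+42' '-7' '13' 'x' | grep -E -c '^[+-]?[0-9]+$')

echo "$many_os $shell_users $one_digit_uid $signed"
```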

9 Tagging and back-referencing (1)
- \( \) is used to tag (remember) the RE you wish to back-reference.
- The matched contents are placed into numbered registers 1, 2, ...
- Access the contents of a register using \N, where N is the number of the register.
- Examples
  - \(hello\) \1
    Matches hello hello
  - \([0-9]*\),\([0-9]*\),\([0-9]*\) \3,\2,\1
    Matches patterns like
      12,55,34 34,55,12
      1023,5321,934 934,5321,1023
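Both back-reference examples can be tried with plain grep, which uses BRE. This is a minimal sketch; the test strings and variable names are assumptions chosen for illustration.

```shell
# \1 must repeat exactly what the first \( \) group matched.
doubled=$(printf '%s\n' 'hello hello' 'hello world' | grep -c '\(hello\) \1')

# Three comma-separated numbers followed by the same three in reverse order.
mirror='\([0-9]*\),\([0-9]*\),\([0-9]*\) \3,\2,\1'
mirrored=$(printf '%s\n' '12,55,34 34,55,12' '12,55,34 12,55,34' | grep -c "$mirror")

echo "doubled: $doubled, mirrored: $mirrored"
```

Only the first line of each input is counted: the second line repeats the text without satisfying the back-references.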

10 Tagging and back-referencing (2)

$ sed -e 's/\([a-zA-Z]\+\) \([a-zA-Z]\+\)/\2, \1/'

Part of the command            What it does
first  \([a-zA-Z]\+\)          put "Programming" into memory 1
second \([a-zA-Z]\+\)          put "UNIX" into memory 2
\2 in the replacement          extract memory 2
\1 in the replacement          extract memory 1

Input:  Programming UNIX
Output: UNIX, Programming
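The word swap above relies on `\+`, which is a GNU extension in basic REs. A portable variant, sketched here under the assumption that only POSIX sed is available, spells "one or more" as `\{1,\}`. The function name `swap_words` is made up for this example.

```shell
swap_words() {
  # \{1,\} is the POSIX BRE spelling of "one or more";
  # \1 and \2 in the replacement recall the two tagged words.
  printf '%s\n' "$1" | sed -e 's/\([a-zA-Z]\{1,\}\) \([a-zA-Z]\{1,\}\)/\2, \1/'
}

swap_words 'Programming UNIX'    # prints: UNIX, Programming
```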

11 BRE
BRE precedence (from high to low)
1. escaped characters: \
2. bracket expression: [ ]
3. subexpressions/back-references: \( \) \n
4. single-character-BRE repetition: * \{m,n\}
5. concatenation
6. anchoring: ^ $

12 ERE
ERE precedence (from high to low)
1. escaped characters: \
2. bracket expression: [ ]
3. grouping: ( )
4. single-character-ERE duplication: * + ? {m,n}
5. concatenation
6. anchoring: ^ $
7. alternation: |
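The effect of the precedence order can be seen by comparing grouped and ungrouped EREs. The following is a sketch with invented test strings; the variable names are assumptions.

```shell
# Alternation has the lowest precedence, so ^ab|cd$ reads as (^ab)|(cd$).
loose=$(printf '%s\n' 'abxx' 'xxcd' 'xabcdx' | grep -E -c '^ab|cd$')

# Grouping changes the meaning: the whole line must be exactly ab or exactly cd.
tight=$(printf '%s\n' 'abxx' 'xxcd' 'ab' | grep -E -c '^(ab|cd)$')

# Duplication binds tighter than concatenation: ab* is a(b*), not (ab)*.
star=$(printf '%s\n' 'abbb' 'abab' | grep -E -c '^ab*$')
group_star=$(printf '%s\n' 'abbb' 'abab' | grep -E -c '^(ab)*$')

echo "$loose $tight $star $group_star"
```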

13 Conclusion (1)
- 1. A few metacharacters are common, in both representation and meaning, to all three definitions:
  - The backslash (\)
  - The square brackets ([])
  - Two metacharacters within the brackets are also common to all three forms of RE:
    - If the caret (^) is the first character within the brackets, the complement of the given set of characters is meant.
    - If the hyphen (-) occurs within the brackets, it indicates a range of characters.
- 2. In addition to what BRE offers, extended REs add:
  - the parentheses (()) for grouping
  - the vertical bar (|), also called pipe, for alternation
  - the plus sign (+), meaning repeat the preceding item at least once
  - the question mark (?), meaning match the preceding item zero or one times

14 Conclusion (2)
- In basic regular expressions the metacharacters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \). Of these, \?, \+ and \| are GNU extensions.
- Large repetition counts in the {m,n} construct may cause grep to use lots of memory. In addition, certain other obscure regular expressions require exponential time and space, and may cause grep to run out of memory.
- Back-references are very slow, and may require exponential time.
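The difference in spelling between BRE and ERE can be illustrated with the interval operator. This is a sketch with made-up input; the helper name `input` is an assumption.

```shell
# Hypothetical input, including one line that contains literal braces.
input() { printf '%s\n' 'ab' 'aab' 'aaab' 'a{2}b'; }

bre_interval=$(input | grep -c '^a\{2\}b$')      # BRE: the braces are backslashed
ere_interval=$(input | grep -E -c '^a{2}b$')     # ERE: the braces are bare
bre_literal=$(input | grep -c '^a{2}b$')         # BRE: bare braces are ordinary characters

echo "$bre_interval $ere_interval $bre_literal"
```

The first two commands select the line aab; the third selects only the line that literally reads a{2}b.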

15 Overview of Text Manipulation Utils
- ed was a very early UNIX line editor. It included a number of commands to manipulate the file being edited.
- Other UNIX commands such as sed, ex and vi were built on ed and use the same commands.
- Ex/vi command syntax: [ from_address [, to_address ] ] command [ parameters ]
- Read the textbook: Appendix A
- Supplementary reading:
  - An Introduction to Display Editing with Vi, William Joy, Mark Horton
  - VI (and Clones) Editor Reference Manual, Miles O'Neal

16 2. grep, egrep, fgrep
- grep: BRE
- egrep: ERE
- fgrep: f = fixed string
- Option: -r searches directories recursively
- Read the textbook: p. 72
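The three flavours can be compared on the same input. The sketch below uses `grep -E` and `grep -F`, which POSIX specifies as the equivalents of egrep and fgrep; the sample lines and the helper name `lines` are assumptions.

```shell
lines() { printf '%s\n' 'a.c' 'abc' 'a+c' 'aac'; }

basic=$(lines | grep -c 'a.c')          # BRE: . matches any single character
extended=$(lines | grep -E -c 'a+c')    # ERE: + means one or more "a"
fixed_dot=$(lines | grep -F -c 'a.c')   # fixed string: only the literal text a.c
fixed_plus=$(lines | grep -F -c 'a+c')  # fixed string: only the literal text a+c

echo "$basic $extended $fixed_dot $fixed_plus"
```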

17 3. sed
- sed(1) is a (s)tream (ed)itor, which manipulates data according to given rules.
- The sed command line syntax is:
  $ sed [OPTIONS] -e 'INSTRUCTION' [-e 'INSTRUCTION' ...] FILE
  $ sed [OPTIONS] -f SCRIPT.sed FILE
  $ cat FILE | sed -f SCRIPT.sed
- The most commonly used options:
  - -f SCRIPT.sed : read commands from the file SCRIPT.sed
  - -e 'SED-EXPRESSION' : the expression follows immediately. This option can be given multiple times.
  - -n : do not print lines unless the p command is used.
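The three options can be compared side by side. This is a sketch only: the scratch directory, the file names `text.txt` and `script.sed`, and their contents are all made up for the demonstration.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'one BJ' 'two' 'three BJ' > "$tmp/text.txt"
printf '%s\n' 's/BJ/Beijing/' '/two/d' > "$tmp/script.sed"

# Two -e instructions, applied in order to every input line.
with_e=$(sed -e 's/BJ/Beijing/' -e '/two/d' "$tmp/text.txt")

# The same instructions read from a script file.
with_f=$(sed -f "$tmp/script.sed" "$tmp/text.txt")

# -n suppresses the automatic print, so only lines selected by p appear.
with_n=$(sed -n -e '/BJ/p' "$tmp/text.txt")

printf '%s\n' "$with_e" '--' "$with_n"
rm -rf "$tmp"
```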

18 sed (2)
- An INSTRUCTION can take the form:
  /RE/COMMAND
  - RE: the regular expression to search for. It can be left out, in which case the COMMAND applies to every line.
  - COMMAND: what to do with each matching line
    - p = print
    - d = delete
- The global prefix g (do the COMMAND for all matching lines, as in g/RE/p) belongs to ed and ex. sed already applies each instruction to every input line, so the prefix is omitted there.

19 sed (3)
- Commands and options
  - 1. The delete line command
    $ sed -e '/this/d' text1.txt
  - 2. The print lines option
    $ sed -n -e '/this/p' text1.txt
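The two commands above can be tried on a small input. The sample text and the helper name `text` are assumptions; the last line sketches what happens when -n is forgotten.

```shell
text() { printf '%s\n' 'keep me' 'drop this line' 'keep me too'; }

without_this=$(text | sed -e '/this/d')      # every line except those matching /this/
only_this=$(text | sed -n -e '/this/p')      # only the lines matching /this/

# Without -n, p prints the matching line in addition to the automatic print.
printed_twice=$(text | sed -e '/this/p' | grep -c 'this')

printf '%s\n' "$without_this" '--' "$only_this" '--' "$printed_twice"
```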

20 sed (4)
- 3. The substitution of text in the line
  - The most commonly used sed command is (s)ubstitute, and its syntax is a bit different:
    [address]s/RE/replacement/[flag]
    Here s is the substitute command.
  - Examples:
    s/this-word//g
    s/UPE/UNIX Programming Environment/g
  - The (g)lobal flag makes the regular expression search continue to the end of the line, so every match on the line is replaced, not just the first.
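The effect of the g flag is easiest to see on a line with two matches. This is a small sketch; the sample line and variable names are assumptions.

```shell
line='UPE notes: UPE part 3'

# Without g, only the first match on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed -e 's/UPE/UNIX Programming Environment/')

# With g, every match on the line is replaced.
every=$(printf '%s\n' "$line" | sed -e 's/UPE/UNIX Programming Environment/g')

# An empty replacement deletes the matched text.
removed=$(printf '%s\n' 'a this-word b' | sed -e 's/this-word//g')

printf '%s\n' "$first_only" "$every" "$removed"
```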

21 sed (5)
- Address in the substitute command
  - The [address] says where the command does its work. It can be:
    1. a numeric address
    2. a special marker, like $, which denotes the end of the file
    3. a regular expression that limits the command to certain lines

# 1. We refer to explicit lines by number:
  1s/BJ/Beijing/g          Do the substitution only at line 1.
  10,20s/BJ/Beijing/g      RANGE: from line 10 to line 20.

# 2. We can mix numbers with the other address markers:
  50,$s/BJ/Beijing/g       From line 50 to the end of the file.
  1,/^$/s/BJ/Beijing/g     From line 1 to the next empty line.

# 3. Or use only regular expressions to delimit the line area:
  $ sed -e '/BEGIN/,/END/s/variable1/variable2/g' some.code

  BEGIN
  line1;
  line2
  variable1 = variable1 + variableX;
  END
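The three kinds of address can be compared on one small input. The sketch below assumes a made-up five-line program fed through a helper named `code`; neither comes from the original slides.

```shell
code() {
  printf '%s\n' \
    'variable1 = 0;' \
    'BEGIN' \
    'variable1 = variable1 + variableX;' \
    'END' \
    'variable1 = 9;'
}

# Numeric address: only line 1 is edited.
first_line=$(code | sed -e '1s/variable1/variable2/')

# Number mixed with the end-of-file marker: line 3 through the last line.
to_end=$(code | sed -e '3,$s/variable1/variable2/g' | grep -c 'variable2')

# Regular expressions only: edit between the BEGIN and END lines.
in_block=$(code | sed -e '/BEGIN/,/END/s/variable1/variable2/g')

printf '%s\n' "$in_block"
```

In the last command the lines before BEGIN and after END still mention variable1, because they fall outside the addressed range.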

22 sed (6)
- Examples:
  - $ sed -e 's/sweeping \(.*\) of \(.*\) steel/sweeping \2 \1 of steel/g'
    Input:  sweeping blade of flashing steel
    Output: sweeping flashing blade of steel
  - $ sed -e 's/sweeping \(.*\) of \(.*\) steel/evil &/g'
    Input:  sweeping blade of flashing steel
    Output: evil sweeping blade of flashing steel
  - & : stands for the entire text matched by the RE
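Both substitutions can be run against the sample line. This is a sketch: the variable names are assumptions, and the first replacement spells out the trailing word steel so that it survives in the output.

```shell
line='sweeping blade of flashing steel'
pattern='sweeping \(.*\) of \(.*\) steel'

# \1 holds "blade" and \2 holds "flashing"; the replacement swaps them.
reordered=$(printf '%s\n' "$line" | sed -e "s/$pattern/sweeping \2 \1 of steel/")

# & stands for everything the pattern matched.
prefixed=$(printf '%s\n' "$line" | sed -e "s/$pattern/evil &/")

printf '%s\n' "$reordered" "$prefixed"
```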

23 Supplementary Readings
- Regular Expression in Unix
  - Consists of two parts. The first is a simple, incomplete introduction to regular expressions in UNIX. The second is the RE specification excerpted from SUSv2, including the formal grammar for BRE and ERE.
- External Filters, Programs and Commands in Unix, Mendel Cooper

